#Image Generation

110 posts

[Paper Review] High-Fidelity Two-Step Image Generation via Teacher-Aligned End-to-End Distillation

This work aims to address the inference latency of high-quality image generation models and the information loss that occurs in multi-step generation pipelines.

#Review #Image Generation #Knowledge Distillation #Diffusion Models #Model Compression #Latent Diffusion #Efficiency

June 11, 2026

[Paper Review] Training-Free Multi-Concept LoRA Composition with Prompt-Aware Weighting

This work addresses the semantic interference that arises when multiple LoRAs are combined to generate composite concepts, along with the resulting degradation in image quality and fidelity.

#Review #LoRA #Diffusion Models #Multi-Concept Composition #Prompt-Aware Weighting #Training-Free #Image Generation

June 3, 2026

[Paper Review] Geometry-Aware Image Flow Matching

Advanced generative models such as Continuous Normalizing Flows (CNF), Diffusion Models (DM), and Flow Matching (FM) rest on a Euclidean-geometry assumption that treats image data as vectors in a high-dimensional Euclidean space.

#Review #Flow Matching #Spherical Geometry #Image Generation #Riemannian Manifold #Optimal Transport #Hyperspherical Projection #Generative Models

May 25, 2026

[Paper Review] GenEvolve: Self-Evolving Image Generation Agents via Tool-Orchestrated Visual Experience Distillation

This paper emphasizes that open-ended image generation goes beyond a simple text-prompt task: it is a complex agentic process that must effectively combine the model's internal knowledge with external resources.

#Review #Image Generation #Agentic Workflow #Self-Evolving #Visual Experience Distillation #Tool-Orchestrated #On-Policy Distillation #Multimodal Agent

May 21, 2026

[Paper Review] UniGenDet: A Unified Generative-Discriminative Framework for Co-Evolutionary Image Generation and Generated Image Detection

This paper identifies as its core problem that, although image generation and generated-image detection are closely intertwined in the modern AI ecosystem, existing work optimizes them independently.

#Review #Multimodal Large Language Models #AI-Generated Image Detection #Image Generation #Co-evolutionary Learning #Unified Architecture #Feature Alignment

April 23, 2026

[Paper Review] Think in Strokes, Not Pixels: Process-Driven Image Generation via Interleaved Reasoning

Building on BAGEL-7B, a unified multimodal model, this paper constructs a process-driven architecture that autoregressively generates text tokens and visual tokens. The proposed model runs a four-stage loop (Plan → Sketch → Inspect → Refine) in which it evaluates and revises the intermediate visual state produced at each stage.

#Review #Multimodal Foundation Models #Process-Driven Generation #Interleaved Reasoning #Chain-of-Thought #Visual Grounding #Image Generation

April 8, 2026

[Paper Review] ImagenWorld: Stress-Testing Image Generation Models with Explainable Human Evaluation on Open-ended Real-World Tasks

Recent advances in diffusion, autoregressive, and hybrid architectures have pushed image generation and editing forward considerably, but existing benchmarks are limited to specific tasks or biased toward narrow domains and therefore lack practical coverage.

#Review #Image Generation #Image Editing #Benchmark #Human Evaluation #Explainable AI #Multimodal Learning

March 30, 2026

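[Paper Review] Gen-Searcher: Reinforcing Agentic Search for Image Generation

State-of-the-art text-to-image models deliver remarkable visual quality but have a fundamental limitation: they rely on fixed knowledge acquired during training. In particular, when a prompt requires real-time information or is knowledge-intensive, the model generates images without the correct visual references, leading to factual errors or visual distortions.

#Review #Agentic AI #Image Generation #Multi-hop Search #Reinforcement Learning #Grounded Generation #Multimodal Agent

March 30, 2026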

[Paper Review] DreamLite: A Lightweight On-Device Unified Model for Image Generation and Editing

Diffusion models have recently made dramatic progress in T2I generation and text-guided editing, but most require billions of parameters, which limits deployment in on-device settings.

#Review #Diffusion Models #On-device AI #Image Generation #Image Editing #Unified Architecture #Task-progressive Pretraining

March 30, 2026

[Paper Review] Representation Alignment for Just Image Transformers is not Easier than You Think

Representation Alignment (REPA) was proposed as an effective way to accelerate training of latent-space Diffusion Transformers, but applying it to pixel-space diffusion models such as Just Image Transformers (JiT) actually degrades performance.

#Review #Representation Alignment #Pixel-space Diffusion #Just Image Transformers #Feature Hacking #Masked Transformer Adapter #Diffusion Models #Image Generation

March 26, 2026

[Paper Review] MultiBind: A Benchmark for Attribute Misbinding in Multi-Subject Generation

Recent multi-reference image generation systems are raising expectations for fine-grained control over multiple entities within a single image.

#Review #Multi-subject Generation #Attribute Misbinding #Image Generation #Benchmark #Evaluation Protocol #Deep Learning #Computer Vision

March 24, 2026

[Paper Review] WiT: Waypoint Diffusion Transformers via Trajectory Conflict Navigation

Recent Flow Matching models operate directly in pixel space to avoid the reconstruction bottleneck of latent autoencoders. However, the pixel manifold lacks semantic continuity, so the optimal transport paths become severely entangled.

#Review #Image Generation #Flow Matching #Trajectory Conflict #Diffusion Transformers #Waypoint Diffusion Transformers #Just-Pixel AdaLN

March 17, 2026

[Paper Review] Trust Your Critic: Robust Reward Modeling and Reinforcement Learning for Faithful Image Editing and Generation

Advances in diffusion models and autoregressive models have brought substantial progress on T2I generation and image editing tasks, but RL-based approaches for improving these models face the problem of unreliable reward models.

#Review #Reinforcement Learning #Reward Modeling #Image Editing #Image Generation #MLLM #Data Curation #Fidelity #Instruction Following

March 12, 2026

[Paper Review] UniCom: Unified Multimodal Modeling via Compressed Continuous Semantic Representations

This work seeks to address the limitations of existing unified multimodal models. Specifically, it aims to overcome the loss of fine-grained semantic information caused by discrete visual tokenizers, and the training instability and slow convergence that arise when continuous high-dimensional visual representations are modeled directly.

#Review #Unified Multimodal Model #Image Generation #Image Understanding #Semantic Compression #Continuous Representation #Diffusion Model #Transformer #Image Editing

March 11, 2026

[Paper Review] InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing

The paper seeks to resolve the inherent trade-off that unified multimodal models (UMMs) face between strong semantic understanding and powerful generation. It proposes InternVL-U, a lightweight 4B-parameter UMM, with the goal of democratizing understanding, reasoning, generation, and editing within a single unified framework.

#Review #Unified Multimodal Models #Multimodal Large Language Model #Image Generation #Image Editing #Chain-of-Thought #Data Synthesis #Low-parameter Models

March 10, 2026

[Paper Review] Making Reconstruction FID Predictive of Diffusion Generation FID

The paper aims to resolve the "reconstruction-generation dilemma", i.e., the weak correlation between the reconstruction FID (rFID) of a variational autoencoder (VAE) and the generation FID (gFID) of a latent diffusion model (LDM).

#Review #Latent Diffusion Models #VAE #FID #Generative Models #Evaluation Metrics #Image Generation #Reconstruction-Generation Dilemma #Interpolation

March 8, 2026

[Paper Review] Dynamic Chunking Diffusion Transformer

This paper aims to maximize computational efficiency while preserving image generation quality by replacing fixed patchification in the Diffusion Transformer (DiT) with a learned dynamic chunking mechanism.

#Review #Diffusion Transformer #Dynamic Chunking #Adaptive Patching #Image Generation #Computational Efficiency #Token Reduction #Spatial Segmentation #Load Balancing

March 8, 2026

[Paper Review] Enhancing Spatial Understanding in Image Generation via Reward Modeling

This work aims to address the limitations that current text-to-image (T2I) models face on text prompts containing complex spatial relationships, and to improve the spatial accuracy of generated images.

#Review #Image Generation #Reward Modeling #Spatial Understanding #Reinforcement Learning #Visual Language Models #Text-to-Image #Preference Learning

March 1, 2026

[Paper Review] SeaCache: Spectral-Evolution-Aware Cache for Accelerating Diffusion Models

To speed up the slow inference of diffusion models, the paper addresses the problem that existing caching methods rely only on raw feature differences, which mixes content with noise and therefore overlooks spectral evolution.

#Review #Diffusion Models #Model Acceleration #Feature Caching #Spectral Analysis #Generative AI #Image Generation #Video Generation #Latency Reduction

February 25, 2026

[Paper Review] Image Generation with a Sphere Encoder

The paper aims to overcome the slow and costly image generation of existing diffusion and autoregressive models by developing an efficient generative framework that can produce sharp images with just a single forward pass.

#Review #Image Generation #Sphere Encoder #Autoencoder #Latent Space #Few-Step Generation #Conditional Generation #Diffusion Models #Perceptual Loss

February 25, 2026

[Paper Review] The Diffusion Duality, Chapter II: Ψ-Samplers and Efficient Curriculum

This paper aims to solve the problem that the sampling quality of Uniform-State Discrete Diffusion Models (USDMs) plateaus as the number of steps increases.

#Review #Discrete Diffusion #Ψ-Samplers #Predictor-Corrector #Language Modeling #Image Generation #Curriculum Learning #Efficient Training

February 24, 2026

[Paper Review] Unified Latents (UL): How to train your latents

The paper seeks to resolve the fundamental trade-off between information content and reconstruction quality in learning latent representations for diffusion models.

#Review #Diffusion Models #Latent Representation Learning #VAE #Image Generation #Video Generation #Bitrate Control #Training Efficiency #Diffusion Prior #Diffusion Decoder

February 19, 2026

[Paper Review] Visual Persuasion: What Influences Decisions of Vision-Language Models?

This work aims to systematically understand how visual factors influence the decisions of Vision-Language Models (VLMs).

#Review #Vision-Language Models #Visual Persuasion #Prompt Optimization #Image Generation #AI Agent Behavior #Interpretability #Behavioral Evaluation

February 17, 2026

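[Paper Review] UniT: Unified Multimodal Chain-of-Thought Test-time Scaling

This paper points out that existing unified multimodal models operate in a single pass and produce outputs without iterative refinement. It aims to overcome this limitation on multimodal tasks that require multi-step reasoning and self-correction, such as complex spatial composition, multi-object interaction, and evolving instructions.

#Review #Multimodal AI #Chain-of-Thought #Test-time Scaling #Unified Models #Iterative Reasoning #Image Generation #Visual Reasoning #Self-Correction

February 17, 2026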

[Paper Review] UniWeTok: An Unified Binary Tokenizer with Codebook Size 2^{128} for Unified Multimodal Large Language Model

This paper addresses the problem of providing a visual representation that simultaneously supports the high-fidelity reconstruction, complex semantic extraction, and generation suitability that unified multimodal large language models (MLLMs) require.

#Review #Multimodal LLM #Visual Tokenizer #Binary Codebook #Image Generation #Semantic Extraction #Pre-Post Distillation #Hybrid Architecture

February 16, 2026

[Paper Review] BitDance: Scaling Autoregressive Generative Models with Binary Tokens

This paper proposes BitDance, a scalable AR framework for high-quality image generation that addresses the limited token expressiveness and inefficient sampling of existing autoregressive (AR) models.

#Review #Autoregressive Models #Binary Tokens #Diffusion Head #Image Generation #Tokenizer #Parallel Prediction #High-Resolution

February 16, 2026

[Paper Review] DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing

This paper aims to overcome the high training cost and deployment limitations of multimodal image generation and editing models, which currently require large parameter counts (~10B or more). With a lightweight 5B-parameter model (DeepGen 1.0), it seeks to achieve comprehensive generation and editing capabilities that match or surpass those of much larger models.

#Review #Multimodal Model #Image Generation #Image Editing #Diffusion Models #VLM-DiT Architecture #Stacked Channel Bridging #Reinforcement Learning #Lightweight Models

February 12, 2026

[Paper Review] Condition Errors Refinement in Autoregressive Image Generation with Diffusion Loss

This work aims to resolve the "condition inconsistency" problem that arises when autoregressive image generation models are combined with a diffusion loss, and to overcome the resulting accumulation of extraneous information that degrades patch generation quality.

#Review #Autoregressive Models #Diffusion Models #Image Generation #Condition Refinement #Optimal Transport #Wasserstein Gradient Flow #Score Matching #Patch Denoising

February 10, 2026

[Paper Review] PlanViz: Evaluating Planning-Oriented Image Generation and Editing for Computer-Use Tasks

This paper aims to evaluate how well unified multimodal models (UMMs) support planning-oriented computer-use tasks that are closely tied to everyday life.

#Review #Multimodal Models #Image Generation #Image Editing #Benchmark #Computer-Use Tasks #Planning #Evaluation Metrics

February 8, 2026

[Paper Review] Diversity-Preserved Distribution Matching Distillation for Fast Visual Synthesis

This paper aims to solve the mode collapse that occurs in Distribution Matching Distillation (DMD), which is used to generate high-quality images quickly with few-step inference.

#Review #Diffusion Models #Model Distillation #Mode Collapse #Image Generation #Diversity Preservation #Flow Matching #Few-Step Synthesis

February 3, 2026

[Paper Review] Balancing Understanding and Generation in Discrete Diffusion Models

Within discrete diffusion models (DDMs), this paper aims to resolve the imbalance between the semantic understanding ability of Masked Diffusion Language Models (MDLM) and the high-quality few-step generation ability of Uniform-noise Diffusion Language Models (UDLM).

#Review #Discrete Diffusion Models #Language Modeling #Image Generation #Masked Diffusion #Uniform Noise #XDLM #Stationary Noise Kernel #Pareto Frontier

February 3, 2026

[Paper Review] UniReason 1.0: A Unified Reasoning Framework for World Knowledge Aligned Image Generation and Editing

This paper seeks to address the limitations of existing unified multimodal models on image synthesis tasks that require complex reasoning and world knowledge.

#Review #Multimodal Reasoning #Image Generation #Image Editing #World Knowledge #Self-Reflection #Unified Framework #Text-to-Image

February 2, 2026

[Paper Review] PixelGen: Pixel Diffusion Beats Latent Diffusion with Perceptual Loss

This paper addresses the problem that existing pixel diffusion models struggle with learning perceptually unimportant signals in high-dimensional pixel space and therefore lag behind latent diffusion models.

#Review #Pixel Diffusion #Perceptual Loss #Latent Diffusion #Image Generation #LPIPS #DINOv2 #x-prediction #End-to-End Generation

February 2, 2026

[Paper Review] PaperBanana: Automating Academic Illustration for AI Scientists

The paper aims to remove and automate the labor-intensive bottleneck of producing publication-ready academic illustrations (methodology diagrams and statistical plots) for AI scientists. In doing so, it seeks to accelerate research workflows and democratize access to high-quality visual communication tools.

#Review #Automated Illustration Generation #Agentic Framework #Vision-Language Model #Image Generation #Methodology Diagrams #Statistical Plots #Academic Publishing #Iterative Refinement

February 1, 2026

[Paper Review] DINO-SAE: DINO Spherical Autoencoder for High-Fidelity Image Reconstruction and Generation

This work aims to resolve the low reconstruction fidelity of generative autoencoders built on pretrained Vision Foundation Models (VFMs) while preserving efficient image generation.

#Review #Autoencoder #DINO #Vision Foundation Models #Image Generation #Image Reconstruction #Spherical Manifold #Diffusion Models #Flow Matching

February 1, 2026

[Paper Review] iFSQ: Improving FSQ for Image Generation with 1 Line of Code

The paper aims to bridge the gap between autoregressive (AR) and diffusion models in image generation and to build a unified tokenizer for both.

#Review #Finite Scalar Quantization (FSQ) #Image Generation #Autoregressive Models #Diffusion Models #Quantization #Tokenization #Representation Alignment (REPA) #Latent Space

January 26, 2026

[Paper Review] AR-Omni: A Unified Autoregressive Model for Any-to-Any Generation

This paper seeks to overcome the reliance of existing multimodal large language models (MLLMs) on external expert components (e.g., diffusion decoders) for multimodal generation.

#Review #Autoregressive Models #Multimodal AI #Any-to-Any Generation #Unified Model #Speech Generation #Image Generation #Transformer Decoder #Real-time Streaming

January 26, 2026

[Paper Review] OpenVision 3: A Family of Unified Visual Encoder for Both Understanding and Generation

This paper proposes OpenVision 3, an advanced vision encoder that learns a single unified visual representation usable for both image understanding and generation.

#Review #Unified Visual Encoder #Image Understanding #Image Generation #VAE #Vision Transformer #Multimodal Learning #Reconstruction #Contrastive Learning

January 22, 2026

[Paper Review] UniX: Unifying Autoregression and Diffusion for Chest X-Ray Understanding and Generation

The paper addresses the performance degradation that occurs when the fundamentally conflicting goals of medical image understanding (semantic abstraction) and generation (pixel-level reconstruction) are unified in a single model using conventional parameter sharing.

#Review #Chest X-Ray #Medical Foundation Model #Autoregressive Model #Diffusion Model #Multimodal Learning #Image Understanding #Image Generation #Cross-Modal Attention

January 20, 2026

[Paper Review] MHLA: Restoring Expressivity of Linear Attention via Token-Level Multi-Head

The paper addresses the scalability problem caused by the quadratic time complexity of self-attention, the core module of the Transformer.

#Review #Linear Attention #Multi-Head Attention #Transformer #Global Context Collapse #Representational Diversity #Image Generation #NLP #Video Generation

January 12, 2026

[Paper Review] Boosting Latent Diffusion Models via Disentangled Representation Alignment

The paper points out a limitation of Variational Autoencoders (VAEs), a core component of Latent Diffusion Models (LDMs): they either focus only on pixel-level reconstruction, as is conventional, or use the same high-level semantic alignment targets as the LDM.

#Review #Latent Diffusion Models #Variational Autoencoders #Disentangled Representations #Vision Foundation Models #Representation Alignment #Image Generation #Semantic Disentanglement

January 12, 2026

[Paper Review] E-GRPO: High Entropy Steps Drive Effective Reinforcement Learning for Flow Models

The goal is to solve the sparse and ambiguous reward signals that arise when existing GRPO (Group Relative Policy Optimization)-based flow models optimize the policy across multiple denoising timesteps.

#Review #Reinforcement Learning #Flow Models #Entropy-aware Sampling #Group Relative Policy Optimization #SDE #Human Preference Alignment #Image Generation

January 7, 2026

[Paper Review] NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation

NextFlow aims to unify multimodal understanding and generation using a single decoder-only autoregressive transformer.

#Review #Multimodal AI #Decoder-only Transformer #Next-scale Prediction #Image Generation #Image Editing #Reinforcement Learning #Unified Modeling #TokenFlow

January 5, 2026

[Paper Review] Guiding a Diffusion Transformer with the Internal Dynamics of Itself

The goal is to solve the problem that Diffusion Transformer models fail to generate high-quality images in low-probability regions of the data.

#Review #Diffusion Models #Transformer #Generative AI #Image Generation #Guidance Strategy #Internal Guidance #Auxiliary Loss #Classifier-Free Guidance

December 31, 2025

[Paper Review] DreamOmni3: Scribble-based Editing and Generation

This paper addresses a limitation of text prompts in unified generation and editing models: they fail to accurately capture the user's intended edit location and fine visual details.

#Review #Image Editing #Image Generation #Scribble-based Control #Multimodal AI #Diffusion Models #Data Synthesis #Human-Computer Interaction #Instruction-based Editing

December 30, 2025

[Paper Review] StageVAR: Stage-Aware Acceleration for Visual Autoregressive Models

Visual Autoregressive (VAR) models enable high-quality image generation but suffer from substantial computational complexity and long runtimes, especially at the large-scale steps.

#Review #Visual Autoregressive Models #Image Generation #Model Acceleration #Low-Rank Approximation #Semantic Irrelevance #Stage-Aware Optimization #Text-to-Image Synthesis

December 21, 2025

[Paper Review] REGLUE Your Latents with Global and Local Semantics for Entangled Diffusion

This paper addresses persistent problems of Latent Diffusion Models (LDMs), a state-of-the-art class of image generation models: slow learning of semantic information and limited sample quality.

#Review #Latent Diffusion Models #Vision Foundation Models #Semantic Compression #Global-Local Semantics #Image Generation #Representation Entanglement #Transformer Architecture

December 18, 2025

[Paper Review] Sparse-LaViDa: Sparse Multimodal Discrete Diffusion Language Models

This paper addresses the main inefficiencies of Masked Diffusion Models (MDMs), namely the slow inference caused by the lack of KV caching support and the processing of unnecessary mask tokens. In particular, the goal is to propose a new modeling framework that greatly improves efficiency across multimodal tasks without degrading performance.

#Review #Discrete Diffusion Models #Multimodal Models #Sparse Parameterization #KV Caching #Token Truncation #Image Generation #Image Editing #Visual Reasoning

December 16, 2025

[Paper Review] A4-Agent: An Agentic Framework for Zero-Shot Affordance Reasoning

This paper addresses the limitation that existing end-to-end affordance prediction models, which tightly couple high-level reasoning with low-level grounding, struggle to generalize to novel objects or complex instructions.

#Review #Affordance Prediction #Zero-Shot Learning #Agentic AI #Foundation Models #Multimodal Reasoning #Visual Grounding #Image Generation #Robotics

December 16, 2025

[Paper Review] Image Diffusion Preview with Consistency Solver

This paper addresses the degraded user experience caused by the slow inference of image diffusion models.

#Review #Diffusion Models #Efficient Sampling #Reinforcement Learning #ODE Solvers #Image Generation #Consistency #Diffusion Preview

December 15, 2025

[Paper Review] Exploring MLLM-Diffusion Information Transfer with MetaCanvas

MLLMs excel at understanding complex visual information, but their reasoning and planning abilities are underused in image and video generation, which makes precise, structured control difficult. The paper seeks to close this gap.

#Review #Multimodal Large Language Models (MLLMs) #Diffusion Models #Image Generation #Video Generation #Image Editing #Video Editing #Latent Space Planning #Canvas Tokens #Information Transfer

December 14, 2025

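[Paper Review] VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction

The paper tackles the core challenge of unifying representations for multimodal understanding, generation, and reconstruction within a single tokenizer. The goal is to propose a unified tokenizer that overcomes the complexity of existing dual-encoder approaches and the weaker semantic understanding of discrete tokenizers, and that can produce continuous semantic features and discrete fine-grained tokens at the same time.

#Review #Multimodal Learning #Vector Quantization #Autoencoder #Unified Tokenizer #Image Generation #Image Reconstruction #Vision Transformers #Semantic Features

December 11, 2025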

[Paper Review] Rethinking Training Dynamics in Scale-wise Autoregressive Generation

This work aims to address the degraded generation quality of scale-wise autoregressive (AR) generative models, which is caused by (1) the train-inference mismatch (exposure bias) and (2) imbalanced learning difficulty across scales.

#Review #Autoregressive Generation #Visual Synthesis #Exposure Bias #Student Forcing #Self-Autoregressive Refinement #Scale-wise Prediction #Image Generation

December 8, 2025

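[Paper Review] LongCat-Image Technical Report

The paper tackles key challenges of existing major models in computer vision, including multilingual text rendering, photorealism, deployment efficiency, and developer accessibility. LongCat-Image aims to be a lightweight open-source foundation model that moves away from reliance on brute-force scaling and strikes an optimal balance between state-of-the-art performance and efficiency.

#Review #Image Generation #Text-to-Image #Image Editing #Diffusion Model #Multilingual Text Rendering #Photorealism #Efficiency #Open-Source

December 8, 2025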

[Paper Review] Semantics Lead the Way: Harmonizing Semantic and Texture Modeling with Asynchronous Latent Diffusion

This paper aims to resolve an inherent problem of Latent Diffusion Models (LDMs), the imbalance between modeling high-level semantics and low-level texture, and thereby improve slow convergence and suboptimal generation quality.

#Review #Latent Diffusion Models #Asynchronous Denoising #Semantic Modeling #Texture Modeling #Image Generation #Vision Transformer #VAE #Fast Convergence

December 4, 2025

[Paper Review] Flowing Backwards: Improving Normalizing Flows via Reverse Representation Alignment

This paper addresses the problem that the generation quality of Normalizing Flows (NFs) is limited by a lack of learned semantic representations.

#Review #Normalizing Flows #Representation Alignment #Generative Models #TARFlow #Image Generation #Classification #Training Acceleration #Reverse Pass

December 3, 2025

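[Paper Review] Glance: Accelerating Diffusion Models with 1 Sample

This paper addresses the high computational cost and large number of inference steps of image diffusion models. In particular, it aims to provide a lightweight solution that achieves efficient acceleration and strong generalization from a single sample, without the cost of retraining the model or a loss in generalization.

#Review #Diffusion Models #Acceleration #Distillation #LoRA #Few-shot Learning #Phase-aware #Image Generation #Computational Efficiency

December 2, 2025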

[Paper Review] The Consistency Critic: Correcting Inconsistencies in Generated Images via Reference-Guided Attentive Alignment

This paper aims to solve the problem that existing reference-based image generation models fail to stay consistent in fine details and produce inaccurate or blurry text and logo regions.

#Review #Image Generation #Image Editing #Diffusion Models #Consistency Correction #Attention Mechanism #Reference-Guided #Agent Framework #Data Curation

December 1, 2025

[Paper Review] Flash-DMD: Towards High-Fidelity Few-Step Image Generation with Efficient Distillation and Joint Reinforcement Learning

This paper aims to overcome the limitations of diffusion models, which are computationally expensive because of their iterative sampling process and high training cost.

#Review #Diffusion Models #Image Generation #Distillation #Reinforcement Learning #Few-Step Sampling #Timestep-Aware #Pixel-GAN #Model Efficiency

December 1, 2025

[Paper Review] The Collapse of Patches

This work analyzes the interdependence among patches in an image, proposes the new concept of "patch collapse", and uses it to identify the optimal patch realization order that reduces image uncertainty most efficiently.

#Review #Patch Collapse #Image Generation #Image Classification #Masked Image Modeling #Vision Transformers #PageRank #Uncertainty Reduction #Computational Efficiency

November 30, 2025

[Paper Review] From Pixels to Feelings: Aligning MLLMs with Human Cognitive Perception of Images

This paper addresses a gap in MLLMs (Multimodal Large Language Models): they can recognize what objects are in an image, but they lack the ability to understand how humans subjectively perceive and feel about it.

#Review #Multimodal LLM #Human Cognition #Image Perception #Benchmarking #Supervised Fine-tuning #Image Generation #Aesthetics #Memorability

November 30, 2025

[Paper Review] Architecture Decoupling Is Not All You Need For Unified Multimodal Model

This paper aims to improve the performance of unified multimodal models (UMMs) by mitigating the inherent conflict between visual generation and understanding tasks without relying excessively on architecture decoupling. It addresses the problem that excessive decoupling undermines the interactive reasoning and knowledge transfer abilities of unified models.

#Review #Unified Multimodal Models #Architecture Decoupling #Cross-Modal Attention #Attention Interaction Alignment (AIA) Loss #Task Conflicts #Image Generation #Image Understanding

November 30, 2025

[Paper Review] Adversarial Flow Models

This paper addresses the training instability of GANs (Generative Adversarial Networks) as well as the low-resolution discretization error and iterative inference cost of Flow Matching models.

#Review #Generative Models #Adversarial Flow Models #GANs #Flow Matching #Optimal Transport #Single-step Generation #Image Generation #Transformer Architecture

November 30, 2025

[Paper Review] Canvas-to-Image: Compositional Image Generation with Multimodal Controls

This work aims to solve the limited compositional ability and low fidelity that state-of-the-art diffusion models exhibit when handling multiple types of control signals at once, such as text prompts, object references, spatial arrangements, pose constraints, and layout annotations.

#Review #Image Generation #Diffusion Models #Compositional Control #Multimodal Control #Unified Canvas #Multi-Task Learning #Personalization

November 27, 2025

[Paper Review] iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation

iMontage presents a unified framework for highly dynamic many-to-many image generation by repurposing a pretrained video model.

#Review #Image Generation #Video Models #Diffusion Models #Many-to-many #Unified Framework #Temporal Consistency #Image Editing #Positional Embedding

November 25, 2025

[Paper Review] VQ-VA World: Towards High-Quality Visual Question-Visual Answering

This paper aims to bring Visual Question-Visual Answering (VQ-VA), the ability to answer a visual question with an image, to open-source models as well.

#Review #Visual Question Answering (VQA) #Image Generation #Data-centric AI #Agentic Pipeline #Multimodal Models #Web-scale Data #Benchmark #LightFusion

November 25, 2025

[Paper Review] DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation

The paper addresses the slow training and inference and the low image quality that result when existing pixel diffusion models use a single Diffusion Transformer (DiT) to model high-frequency signals and low-frequency semantics simultaneously.

#Review #Pixel Diffusion #Image Generation #Frequency Decoupling #Diffusion Transformer (DiT) #Flow Matching #AdaLN #Text-to-Image Synthesis

November 24, 2025

[Paper Review] Diversity Has Always Been There in Your Visual Autoregressive Models

The paper aims to solve the diversity collapse of Visual Autoregressive (VAR) models and to unlock the model's inherent generative diversity without additional training, while effectively preserving image quality and text-image alignment.

#Review #Visual Autoregressive Models #Diversity Collapse #Generative Diversity #Soft-Suppression Regularization #Soft-Amplification Regularization #Training-Free #Image Generation #Singular Value Decomposition

November 23, 2025

[Paper Review] Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation

This paper tackles core AI/ML challenges in high-quality, consistent, and controllable image and video generation. In particular, it aims to achieve top-tier quality and operational efficiency by developing Kandinsky 5.0, a family of state-of-the-art foundation models for image and 10-second video synthesis.

#Review #Image Generation #Video Generation #Diffusion Models #Flow Matching #Diffusion Transformer #NABLA #RLHF #Supervised Fine-tuning

November 19, 2025

[Paper Review] One Small Step in Latent, One Giant Leap for Pixels: Fast Latent Upscale Adapter for Your Diffusion Models

This paper aims to solve the slowdown, increased cost, and artifacts that occur when existing diffusion models sample high-resolution images directly, and to overcome the extra latency and artifacts of post-hoc pixel-space super-resolution (SR).

#Review #Latent Diffusion Models #Super-Resolution #Upscaling Adapter #Image Generation #Latent Space #Multi-scale Learning #Cross-VAE

November 13, 2025

[Paper Review] Toward the Frontiers of Reliable Diffusion Sampling via Adversarial Sinkhorn Attention Guidance

This paper addresses quality and controllability problems that arise in the sampling process of diffusion models.

#Review #Diffusion Models #Guidance Sampling #Optimal Transport #Sinkhorn Algorithm #Self-Attention #Adversarial Perturbation #Image Generation #ControlNet

November 12, 2025

[Paper Review] When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought

This paper proposes MIRA (Multimodal Imagination for Reasoning Assessment), a new benchmark for evaluating models in scenarios where generating intermediate visual images is essential for successful reasoning.

#Review #Multimodal AI #Visual Reasoning #Chain-of-Thought (CoT) #Benchmark #Image Generation #MLLMs #Visual-CoT

November 9, 2025

[Paper Review] Phased DMD: Few-step Distribution Matching Distillation via Score Matching within Subintervals

This paper addresses the limitations that arise when distilling score-based generative models into efficient few-step generators via Distribution Matching Distillation (DMD).

#Review #Distribution Matching Distillation #Few-step Diffusion #Score Matching #Mixture-of-Experts #Generative Models #Image Generation #Video Generation #Model Distillation

November 9, 2025

[Paper Review] RealUnify: Do Unified Models Truly Benefit from Unification? A Comprehensive Benchmark

This paper points out that existing benchmarks evaluate the understanding and generation abilities of unified multimodal models separately, and aims to answer the fundamental question of whether architectural unification actually produces synergy between these capabilities.

#Review #Unified Models #Multimodal AI #Benchmark #Capability Synergy #Visual Understanding #Image Generation #Dual-Evaluation Protocol

September 30, 2025

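[Paper Review] OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation and Editing

This work addresses the problem that the performance of unified multimodal models for image generation and editing is constrained by the limitations of existing datasets, in particular the lack of the systematic structure and challenging scenarios needed for real-world use.

#Review #Image Generation #Image Editing #Multimodal AI #Dataset #Instruction Following #Taxonomy #GPT-4o

September 30, 2025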

[Paper Review] HiGS: History-Guided Sampling for Plug-and-Play Enhancement of Diffusion Models

The paper aims to solve the unrealistic outputs and lack of detail that diffusion models show at low NFEs (Neural Function Evaluations) or low guidance scales, and to improve the quality and efficiency of diffusion sampling.

#Review #Diffusion Models #Sampling #Generative AI #Image Generation #Plug-and-Play #Training-Free #Guidance #Momentum-Based Methods

September 29, 2025

[Paper Review] SD3.5-Flash: Distribution-Guided Distillation of Generative Flows

This paper addresses the accessibility problem caused by the high compute demands of state-of-the-art generative models, particularly rectified flow models.

#Review #Generative AI #Image Generation #Diffusion Models #Rectified Flow #Model Distillation #Few-Step Generation #Computational Efficiency #Prompt Alignment

September 26, 2025

[Paper Review] Lavida-O: Elastic Large Masked Diffusion Models for Unified Multimodal Understanding and Generation

This paper aims to overcome the limitations of existing multimodal Masked Diffusion Models (MDMs) by proposing Lavida-O, a unified MDM that handles a broad range of multimodal tasks, including image understanding, object grounding, image editing, and high-resolution (1024px) text-to-image generation, within a single framework.

#Review #Multimodal AI #Masked Diffusion Models #Image Understanding #Image Generation #Image Editing #Object Grounding #ElasticMoT #Self-reflection

September 25, 2025

[Paper Review] CAR-Flow: Condition-Aware Reparameterization Aligns Source and Target for Better Flow Matching

The main goal is to ease the burden on the velocity network in conditional generative models, which must handle two tasks at once: mass transport of the data distribution and conditional injection of the conditioning information. This is intended to accelerate training and improve generation quality.

#Review #Flow Matching #Conditional Generative Models #Reparameterization #Mode Collapse #Image Generation #Latent Space Alignment #Diffusion Models

September 24, 2025

[Paper Review] DiffusionNFT: Online Diffusion Reinforcement with Forward Process

This paper aims to solve the problems specific to applying online reinforcement learning (RL) to diffusion models, namely intractable likelihoods and the constraints of the reverse sampling process.

#Review #Diffusion Models #Reinforcement Learning #Online RL #Flow Matching #Forward Process #CFG-free #Image Generation #Negative-Aware FineTuning

September 23, 2025

[Paper Review] Latent Zoning Network: A Unified Principle for Generative Modeling, Representation Learning, and Classification

This paper aims to solve three core ML tasks (generative modeling, representation learning, and classification) with a single unifying principle.

#Review #Generative Modeling #Representation Learning #Classification #Unified Framework #Latent Space #Flow Matching #Deep Learning #Image Generation

September 22, 2025

[Paper Review] Understand Before You Generate: Self-Guided Training for Autoregressive Image Generation

This paper addresses the problem that autoregressive (AR) models, which have been successful in natural language processing, struggle to learn high-level visual semantics when applied to image generation.

#Review #Autoregressive Models #Image Generation #Self-Supervised Learning #Visual Understanding #Masked Image Modeling #Contrastive Learning #Next-Token Prediction #LlamaGen

September 19, 2025

[Paper Review] MultiEdit: Advancing Instruction-based Image Editing on Diverse and Challenging Tasks

This work addresses the limitations of existing instruction-based image editing (IBIE) methods, in particular the performance drop on complex editing tasks caused by limited dataset diversity and quality.

#Review #Instruction-based Image Editing #Dataset #Multi-modal LLM #Image Generation #Style Transfer #Multi-task Learning #Fine-tuning

September 19, 2025

[Paper Review] Reconstruction Alignment Improves Unified Multimodal Models

The paper addresses the problem that unified multimodal models (UMMs) trained on image-text pairs miss fine visual details because captions are sparse, and that the alignment between understanding and generation is incomplete.

#Review #Unified Multimodal Models #Image Generation #Image Editing #Post-training #Self-supervised Learning #Reconstruction Alignment #Visual Embeddings

September 10, 2025

[Paper Review] Transition Models: Rethinking the Generative Learning Objective

This paper seeks to resolve the fundamental dilemma between the high quality of iterative diffusion models and the performance saturation of efficient few-step models.

#Review #Generative Models #Diffusion Models #Training Objective #Continuous-Time Dynamics #State Transition #Few-Step Generation #Scalable Training #Image Generation

September 5, 2025

[Paper Review] Mixture of Global and Local Experts with Diffusion Transformer for Controllable Face Generation

The paper addresses the problem that existing generative models struggle to strike the delicate balance between semantic control and photorealism, and that the Diffusion Transformer (DiT) in particular has been underexplored in complex multimodal conditional settings.

#Review #Diffusion Transformer #Mixture of Experts #Controllable Generation #Face Generation #Multimodal Synthesis #Semantic Control #Image Generation

September 4, 2025

[Paper Review] OneReward: Unified Mask-Guided Image Generation via Multi-Task Human Preference Learning

This paper addresses the limited generality of existing models across the sub-tasks of mask-guided image editing (image fill, extend, object removal, text rendering) and the inefficiency of task-specific supervised fine-tuning (SFT).

#Review #Image Generation #Mask-Guided Editing #Reinforcement Learning #Human Preference Learning #Vision-Language Models #Multi-Task Learning #Flow Matching

August 29, 2025

[Paper Review] CineScale: Free Lunch in High-Resolution Cinematic Visual Generation

This work addresses the limitation that existing diffusion models, trained on low-resolution data, suffer from repetitive patterns, blurriness, and quality degradation when generating high-resolution visual content.

#Review #Diffusion Models #High-Resolution Generation #Image Generation #Video Generation #UNet Architecture #DiT Architecture #Scale Fusion #LoRA Fine-tuning

August 27, 2025

[Paper Review] Next Visual Granularity Generation

This paper addresses the problem that existing image generation models treat images as flat or unstructured data, which limits fine-grained control and leaves them prone to error accumulation.

#Review #Image Generation #Granularity Control #Structured Representation #Hierarchical Generation #Coarse-to-fine #Visual Tokenization #Latent Space

August 19, 2025

[Paper Review] Echo-4o: Harnessing the Power of GPT-4o Synthetic Images for Improved Image Generation

This paper aims to close the performance gap faced by open-source image generation models by leveraging synthetic image data generated with GPT-4o.

#Review #Synthetic Data #Image Generation #GPT-4o #Multimodal Models #Instruction Following #Surreal Image Generation #Dataset #Benchmarking

August 14, 2025

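[Paper Review] Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents

This work aims to integrate high-quality visual synthesis into LLMs while preserving their strong reasoning abilities. In particular, it addresses the high training cost of existing approaches and the difficulties caused by the backbone LLM's insufficient image representation learning, in order to achieve high-fidelity, controllable image generation efficiently.

#Review #Multimodal LLM #Diffusion Model #CLIP Latent #Image Generation #Multimodal Understanding #ControlNet #Training Efficiency

August 12, 2025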

[Paper Review] Skywork UniPic: Unified Autoregressive Modeling for Visual Understanding and Generation

This paper introduces Skywork UniPic, a 1.5-billion-parameter autoregressive model that unifies image understanding, text-to-image generation, and image editing within a single architecture.

#Review #Autoregressive Models #Multimodal AI #Image Generation #Image Editing #Visual Understanding #Unified Architecture #Parameter Efficiency

August 6, 2025

[Paper Review] Qwen-Image Technical Report

This paper aims to address the limitations of existing text-to-image (T2I) models in complex text rendering and precise image editing.

#Review #Image Generation #Text-to-Image #Image Editing #Text Rendering #Multimodal Diffusion Transformer #Curriculum Learning #Reinforcement Learning #Foundation Model

August 5, 2025

[Paper Review] Emu3.5: Native Multimodal Models are World Learners

This paper introduces Emu3.5, a large-scale multimodal world model that predicts the next state across vision and language. Through its native multimodal capabilities, it aims to improve long-horizon vision-language generation, X2I (any-to-image) generation, complex text-rich image generation, and generalizable world modeling.

#Review #Multimodal Model #World Model #Vision-Language #Next-Token Prediction #Reinforcement Learning #Discrete Diffusion Adaptation #Image Generation #Any-to-Image

October 31, 2025

[Paper Review] Routing Matters in MoE: Scaling Diffusion Transformers with Explicit Routing Guidance

This paper aims to address the limited performance gains observed when Mixture-of-Experts (MoE) is applied to Diffusion Transformers (DiTs).

#Review #Mixture-of-Experts (MoE) #Diffusion Transformers (DiTs) #Routing Guidance #Semantic Specialization #Contrastive Learning #Image Generation #Flow Matching

October 29, 2025

[Paper Review] Distilled Decoding 2: One-step Sampling of Image Auto-regressive Models with Conditional Score Distillation

This work aims to address the slow sampling speed of image autoregressive (AR) models, and in particular to overcome the performance degradation in one-step sampling and the reliance of Distilled Decoding 1 (DD1) on a predefined mapping.

#Review #Auto-regressive Models #Image Generation #One-step Sampling #Model Distillation #Conditional Score Distillation #Flow Matching #Generative Models

October 28, 2025

[Paper Review] Visual Diffusion Models are Geometric Solvers

This paper aims to demonstrate that visual diffusion models can serve as effective solvers for geometric problems.

#Review #Diffusion Models #Geometric Problem Solving #Inscribed Square Problem #Steiner Tree Problem #Maximum Area Polygonization #Image Generation #Pixel Space

October 27, 2025

[Paper Review] AlphaFlow: Understanding and Improving MeanFlow Models

This paper analyzes in depth why MeanFlow models work, and aims to resolve the optimization conflict and slow convergence caused by the negative correlation between two components of the MeanFlow training objective: trajectory flow matching and trajectory consistency.

#Review #Generative Models #Flow Matching #Consistency Models #MeanFlow #Curriculum Learning #Few-Step Generation #Image Generation

October 24, 2025

[Paper Review] ARGenSeg: Image Segmentation with Autoregressive Image Generation Model

This paper points out that existing MLLM-based segmentation methods are limited in capturing fine pixel-level visual details, and proposes ARGenSeg, a new paradigm based on autoregressive generation.

#Review #Image Segmentation #Autoregressive Generation #Multimodal Large Language Models (MLLMs) #Visual Understanding #VQ-VAE #Multi-scale Prediction #Referring Expression Segmentation #Image Generation

October 24, 2025

[Paper Review] Visual Autoregressive Models Beat Diffusion Models on Inference Time Scaling

This work addresses the problem that inference-time scaling (search) strategies, which have been successful for large language models (LLMs), yield only limited benefits for diffusion models, which operate in a continuous latent space.

#Review #Visual Autoregressive Models #Diffusion Models #Inference Time Scaling #Beam Search #Image Generation #Text-to-Image Synthesis #Discrete Latent Space

October 21, 2025

[Paper Review] Latent Diffusion Model without Variational Autoencoder

This work addresses the problem that existing latent diffusion models (LDMs) suffer from training inefficiency, slow inference, and poor transferability due to the limitations of the variational autoencoder (VAE).

#Review #Latent Diffusion Model #Variational Autoencoder #Self-supervised Learning #DINO Features #Generative Models #Image Generation #Training Efficiency #Unified Representation

October 20, 2025

[Paper Review] BLIP3o-NEXT: Next Frontier of Native Image Generation

This paper proposes BLIP3o-NEXT, an open-source foundation model, to advance the next generation of image generation. Its main goal is to unify text-to-image generation and image editing within a single architecture and to demonstrate strong image generation and editing capabilities.

#Review #Image Generation #Image Editing #Autoregressive Model #Diffusion Model #Reinforcement Learning #Multimodal AI #Foundation Model #Open-source

October 20, 2025

[Paper Review] UniFusion: Vision-Language Model as Unified Encoder in Image Generation

This work aims to overcome the limitation that existing image generation models use separate encoders for images and text, and to improve cross-modal reasoning and knowledge transfer.

#Review #Vision-Language Model #Unified Encoder #Image Generation #Diffusion Models #Multimodal Learning #Text-to-Image #Image Editing #Zero-shot Learning

October 15, 2025

[Paper Review] Advancing End-to-End Pixel Space Generative Modeling via Self-supervised Pre-training

This work aims to address the problem that pixel-space generative models are harder to train and perform worse than latent-space models, and thereby to close the gap in performance and efficiency.

#Review #Pixel-space Generative Models #Diffusion Models #Consistency Models #Self-supervised Pre-training #End-to-end Training #Image Generation #FID #Representation Learning

October 15, 2025

[Paper Review] Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation

This work aims to overcome the limitations of prior approaches that treat camera-centric scene understanding and generation as separate problems, and to unify them in a single multimodal model.

#Review #Unified Multimodal Model #Camera-Centric #Image Understanding #Image Generation #Spatial Reasoning #Camera Parameters #Instruction Tuning #Multimodal Spatial Intelligence

October 13, 2025

[Paper Review] Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer

This work addresses the problem that, in existing autoregressive visual models, quantization error from discrete latent-space tokenizers degrades semantic expressiveness and vision-language understanding.

#Review #Unified Vision-Language Model #Continuous Tokenizer #Autoregressive Generation #Image Understanding #Image Generation #Multimodal AI #In-context Editing

October 9, 2025

[Paper Review] Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding

This paper proposes Lumina-DiMOO, an omni multimodal generation and understanding model that can process data of multiple modalities (text, images).

#Review #Multi-modal LLM #Discrete Diffusion #Image Generation #Image Understanding #Omni-modal #Interactive Retouching #Generative AI #Reinforcement Learning

October 9, 2025

[Paper Review] Heptapod: Language Modeling on Visual Signals

This paper criticizes the reliance of visual generative models on externally injected semantic information and classifier-free guidance (CFG), and aims to return to the basic principles of language modeling: a reconstruction-focused tokenizer and the Transformer's intrinsic semantic learning.

#Review #Autoregressive Models #Image Generation #Language Modeling #Causal Transformer #2D Distribution Prediction #Visual Tokenization #Self-Supervised Learning #Generative Models

October 9, 2025

[Paper Review] Equilibrium Matching: Generative Modeling with Implicit Energy-Based Models

The goal is to overcome the limitations of the non-equilibrium, time-conditional dynamics of existing diffusion- and flow-based generative models by proposing Equilibrium Matching (EqM), a new generative modeling framework that learns a single time-invariant equilibrium gradient.

#Review #Generative Models #Equilibrium Dynamics #Energy-Based Models (EBMs) #Flow Matching #Diffusion Models #Optimization-Based Sampling #Image Generation

October 8, 2025

[Paper Review] Factuality Matters: When Image Generation and Editing Meet Structured Visuals

This work addresses the limitations of state-of-the-art visual generation models in generating and editing structured visuals such as charts, diagrams, and mathematical figures. Such visuals demand factual fidelity achieved through composition planning, text rendering, and multimodal reasoning, and the authors note that this area lacks systematic study.

#Review #Structured Visuals #Image Generation #Image Editing #Multimodal Reasoning #Factual Fidelity #Chain-of-Thought #Evaluation Benchmark #Diffusion Models

October 7, 2025