#Controllable Generation

17 posts

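[Paper Review] Interpreting and Steering a Text-to-Speech Language Model with Sparse Autoencoders

This study addresses the problem that the internal workings of TTS language models remain a 'black box', which makes it difficult to precisely control specific speech attributes. Existing speech models must either retrain the entire model or rely on prompt engineering to change a particular style or speaker, which limits both the precision and the efficiency of control.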

#Review #Sparse Autoencoders #Text-to-Speech #Mechanistic Interpretability #Latent Space #Controllable Generation

June 9, 2026

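[Paper Review] GenClaw: Code-Driven Agentic Image Generation

This paper aims to overcome the limited controllability and reasoning ability of existing end-to-end image generation models. These models go through repeated 'black-box' trial and error via prompt rewriting and often fail to precisely control complex spatial relationships or text layouts.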

#Review #Agentic Image Generation #Code-Driven #SVG #Multimodal Reasoning #Layered Representation #Controllable Generation

May 28, 2026

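[Paper Review] CogOmniControl: Reasoning-Driven Controllable Video Generation via Creative Intent Cognition

This study targets the problem that existing video generation models fail to accurately interpret the user's creative intent and offer only limited controllability. Existing models rely on a simple text-to-video mapping and therefore struggle to realize complex physical constraints or specific camera movements.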

#Review #Video Generation #Controllable Generation #Reasoning-Driven #Cognitive Intent #Multimodal Understanding #Latent Diffusion Models

May 19, 2026

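[Paper Review] Diffusion Templates: A Unified Plugin Framework for Controllable Diffusion

This paper aims to resolve the system-level bottleneck caused by the fragmentation of existing controllable diffusion models. Current control methods are tied to specific backbones and use different training pipelines and runtime hooks, which makes it very difficult to reuse infrastructure or combine multiple control techniques.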

#Review #Diffusion Models #Controllable Generation #Plugin Framework #KV-Cache #Template Model #Modular Design

April 29, 2026

[Paper Review] Lighting-grounded Video Generation with Renderer-based Agent Reasoning

This paper proposes LiVER, a framework that controls lighting through a 3D scene proxy. A Renderer-based Agent first analyzes the text instruction to build a 3D structure, which is then converted into 2D render passes (diffuse, rough/glossy GGX) to extract physical cues.

#Review #Video Generation #Controllable Generation #Lighting-grounded #3D Scene Proxy #Diffusion Models #Physical Realism #Renderer-based Agent

April 9, 2026

[Paper Review] Gated Condition Injection without Multimodal Attention: Towards Controllable Linear-Attention Transformers

This paper proposes GateControl, a unified gated condition injection module for Linear Attention based models. A learnable gate selectively preserves only the important condition information for each token, achieving strong control performance without the conventional Multimodal Attention.

#Review #Diffusion Transformer #Linear Attention #Controllable Generation #Gated Condition Injection #On-device AI

April 2, 2026

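[Paper Review] DreamID-Omni: Unified Framework for Controllable Human-Centric Audio-Video Generation

The goal is to overcome the limitations of existing models that handle human-centric tasks such as reference-based audio-video generation (R2AV), video editing (RV2AV), and audio-driven video animation (RA2V) separately.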

#Review #Audio-Video Generation #Human-Centric AI #Diffusion Transformer #Multi-Task Learning #Identity Disentanglement #Controllable Generation #Speaker Confusion

February 25, 2026

[Paper Review] Reasoning Palette: Modulating Reasoning via Latent Contextualization for Controllable Exploration for (V)LMs

This paper aims to address the exploration inefficiency that arises during inference and reinforcement learning (RL) training of large (vision-)language models (LLMs/VLMs).

#Review #Latent Variable Models #Variational Autoencoder (VAE) #Reinforcement Learning (RL) #Exploration #Large Language Models (LLMs) #Vision-Language Models (VLMs) #Controllable Generation #Reasoning Strategies

December 22, 2025

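[Paper Review] Hunyuan3D-Omni: A Unified Framework for Controllable Generation of 3D Assets

The paper addresses the problem that existing 3D generation models rely mainly on image or text conditioning and lack fine-grained cross-modal control, which limits their practical applicability. The goal is to improve the controllability and geometric accuracy of 3D asset generation through a unified framework that integrates diverse forms of control signals.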

#Review #3D Generation #Controllable Generation #Multi-modal Conditioning #Diffusion Models #Point Clouds #Voxels #Bounding Boxes #Skeletons #Hunyuan3D

September 26, 2025

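[Paper Review] PhysCtrl: Generative Physics for Controllable and Physics-Grounded Video Generation

The goal is to overcome the lack of physical realism and the limited 3D control of existing video generation models. The paper proposes PhysCtrl, a framework that enables physics-grounded image-to-video generation by explicitly controlling physical parameters and external forces.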

#Review #Video Generation #Physics-Grounded #Controllable Generation #Diffusion Models #Point Cloud Trajectories #Material Simulation #Generative Physics

September 25, 2025

[Paper Review] X-Part: high fidelity and structure coherent shape decomposition

The goal is to address the low controllability and semantically ambiguous decomposition of existing part-based 3D shape generation methods.

#Review #3D Shape Decomposition #Diffusion Models #Part-level Generation #Controllable Generation #Bounding Box Prompts #Semantic Features #Interactive Editing #Generative AI

September 15, 2025

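[Paper Review] Mixture of Global and Local Experts with Diffusion Transformer for Controllable Face Generation

The paper addresses two problems: existing generative models struggle to strike the delicate balance between semantic control and photorealism, and the Diffusion Transformer (DiT) in particular remains underexplored in complex multimodal conditional settings.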

#Review #Diffusion Transformer #Mixture of Experts #Controllable Generation #Face Generation #Multimodal Synthesis #Semantic Control #Image Generation

September 4, 2025

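[Paper Review] Reinforcement Learning with Rubric Anchors

This paper addresses the limitation that the existing reinforcement learning paradigm with verifiable rewards (RLVR) is confined to specific domains where automatic verification is possible (e.g., math, coding).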

#Review #Reinforcement Learning #Large Language Models #Rubric-based Reward #RLVR Extension #Human-centric AI #Controllable Generation #Reward Hacking Mitigation

August 19, 2025

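[Paper Review] WithAnyone: Towards Controllable and ID Consistent Image Generation

This paper aims to keep the identity (ID) of a reference person consistent in text-to-image generation models while reducing 'copy-paste' artifacts, where the reference image appears to be simply copied, and increasing the diversity and controllability of the generated image's expression, pose, lighting, and similar attributes.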

#Review #Identity-Consistent Generation #Text-to-Image Diffusion #Copy-Paste Artifacts #Contrastive Learning #Multi-Identity Dataset #Controllable Generation #ID-Preservation

October 17, 2025

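[Paper Review] TC-LoRA: Temporally Modulated Conditional LoRA for Adaptive Diffusion Control

The paper addresses the problem that existing controllable diffusion models use fixed architectures and static conditioning strategies, which are inefficient for the dynamic denoising process.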

#Review #Diffusion Models #Conditional Generation #LoRA #Hypernetwork #Dynamic Weight Adaptation #Generative AI #Controllable Generation

October 13, 2025

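[Paper Review] Video-As-Prompt: Unified Semantic Control for Video Generation

This paper tackles the important challenge of unified and generalizable semantic control in video generation. It aims to overcome two problems with existing methods: they either impose inappropriate pixel-level priors and thus produce artifacts, or depend on condition-specific fine-tuning or task-specific architectures and therefore generalize poorly.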

#Review #Video Generation #Semantic Control #Diffusion Transformers #In-Context Learning #Mixture-of-Transformers #Video-As-Prompt #Controllable Generation #Large-scale Dataset

October 27, 2025

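[Paper Review] IF-VidCap: Can Video Caption Models Follow Instructions?

The goal is to present a new benchmark that evaluates how well multimodal large language models (MLLMs) follow specific user instructions (e.g., output format, length, content constraints) in video captioning.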

#Review #Video Captioning #Instruction Following #MLLMs #Benchmark #Controllable Generation #Multimodal Evaluation #Fine-tuning

October 22, 2025