#Multi-modal Generation

4 posts

[Paper Review] MMaDA-VLA: Large Diffusion Vision-Language-Action Model with Unified Multi-Modal Instruction and Generation

Existing VLA models rely on hierarchical architectures or autoregressive paradigms, and as a result face three limitations: architectural overhead, a lack of long-horizon temporal consistency, and the absence of an explicit mechanism for capturing environment dynamics.

#Review #Vision-Language-Action (VLA) #Discrete Diffusion #Multi-modal Generation #Robotic Manipulation #Action Chunking #World Model #Hybrid Attention

April 1, 2026

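[Paper Review] SkyReels-V4: Multi-modal Video-Audio Generation, Inpainting and Editing model

This paper aims to develop a multi-modal video foundation model that handles diverse inputs, including text, image, video, mask, and audio references, and supports video-audio generation, inpainting, and editing within a single unified framework.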

#Review #Multi-modal Generation #Video-Audio Synthesis #Video Inpainting #Video Editing #Diffusion Transformer #MMLM #Super-resolution #Frame Interpolation

February 25, 2026

[Paper Review] MultiRef: Controllable Image Generation with Multiple Visual References

This work addresses the limitations of existing image generation models that rely on text prompts or a single image reference, and focuses on a new problem: controllable image generation using multiple visual references.

#Review #Controllable Image Generation #Multi-modal Generation #Visual References #Image-to-Image #Benchmark #Dataset #MLLM-as-a-Judge

August 20, 2025

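[Paper Review] OmniNWM: Omniscient Driving Navigation World Models

This paper aims to develop a comprehensive, omniscient panoramic navigation world model for autonomous driving by addressing the problems of existing driving world models: limited state modalities, short sequence lengths, imprecise action control, and a lack of reward awareness.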

#Review #Autonomous Driving #World Models #Multi-modal Generation #3D Occupancy #Plücker Ray-maps #Action Control #Dense Rewards #Long-term Forecasting

October 23, 2025