#Multimodal Model

3개의 포스트

[논문리뷰] DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing

본 논문은 현재 대규모(~10B 이상) 파라미터를 요구하는 멀티모달 이미지 생성 및 편집 모델의 높은 훈련 비용과 배포 한계를 극복하는 것을 목표로 합니다. 경량의 5B 파라미터 모델(DeepGen 1.0) 을 통해 훨씬 큰 모델과 동등하거나 이를 능가하는 포괄적인 생성 및 편집 능력을 달성하고자 합니다.

#Review #Multimodal Model #Image Generation #Image Editing #Diffusion Models #VLM-DiT Architecture #Stacked Channel Bridging #Reinforcement Learning #Lightweight Models

2026년 2월 12일

[논문리뷰] Qwen3-Omni Technical Report

본 논문은 텍스트, 이미지, 오디오, 비디오 등 다양한 모달리티 전반에 걸쳐 단일 멀티모달 모델(Qwen3-Omni) 이 기존 단일 모달 모델과 비교하여 성능 저하 없이 최첨단 성능을 유지 하는 것을 목표로 합니다. 또한, 교차 모달 추론 능력 과 실시간 시청각 상호작용 을 향상시키는 것을 주된 연구 목적으로 삼습니다.

#Review #Multimodal Model #Thinker-Talker Architecture #Mixture-of-Experts #Low-latency #Audio Understanding #Cross-modal Reasoning #State-of-the-Art #Real-time Interaction

2025년 9월 23일

[논문리뷰] Emu3.5: Native Multimodal Models are World Learners

본 논문은 비전과 언어에 걸쳐 다음 상태를 예측하는 대규모 멀티모달 월드 모델인 Emu3.5 를 소개합니다. 자연스러운 멀티모달 능력 을 통해 긴 시퀀스 비전-언어 생성, X2I(Any-to-Image) 생성, 복잡한 텍스트 기반 이미지 생성 및 일반화 가능한 월드 모델링 능력 을 향상시키는 것을 목표로 합니다.

#Review #Multimodal Model #World Model #Vision-Language #Next-Token Prediction #Reinforcement Learning #Discrete Diffusion Adaptation #Image Generation #Any-to-Image

2025년 10월 31일