#Unified Architecture

13개의 포스트

[논문리뷰] HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers

본 논문은 기존 Multimodal Large Language Models(MLLMs)가 Visual Encoder와 LLM 사이의 불균형 및 정보 정렬(Alignment) 미흡으로 인해 발생하는 성능 저하 문제를 해결합니다.

#Review #Multimodal Learning #Visual Tokenizer #Unified Architecture #Large Language Models #Representation Learning #Vision-Language Integration

2026년 6월 11일

[논문리뷰] ARM: An AutoRegressive Large Multimodal Model with Unified Discrete Representations

본 연구는 기존 멀티모달 모델들이 시각적 인코더와 언어 모델을 단순히 결합하는 방식에서 벗어나, 모달리티 간의 진정한 통합을 달성하고자 합니다.

#Review #Autoregressive Model #Large Multimodal Model #Discrete Representation #Visual Tokenization #Unified Architecture

2026년 6월 9일

[논문리뷰] UniGenDet: A Unified Generative-Discriminative Framework for Co-Evolutionary Image Generation and Generated Image Detection

본 논문은 현대 AI 생태계에서 이미지 생성과 생성된 이미지 탐지가 서로 밀접하게 연관되어 있음에도 불구하고, 기존 연구들이 이들을 독립적으로 최적화한다는 점을 핵심 문제로 정의합니다.

#Review #Multimodal Large Language Models #AI-Generated Image Detection #Image Generation #Co-evolutionary Learning #Unified Architecture #Feature Alignment

2026년 4월 23일

[논문리뷰] Context Unrolling in Omni Models

본 논문은 다양한 모달리티를 원천 학습하여 모델이 스스로 추론 경로를 구조화하도록 유도하는 Context Unrolling 프레임워크를 제안한다. 모델은 작업 관련 컨텍스트를 선택적으로 활성화하여 공유 작업 공간에 투입하며, 이는 최종 예측 전후로 긴밀하게 작동한다 .

#Review #Multimodal Foundation Model #Context Unrolling #Unified Architecture #Cross-modal Reasoning #Spatial Intelligence #Mixture-of-Experts

2026년 4월 23일

[논문리뷰] LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model

본 연구는 통합된 multimodal 이해와 생성을 위해 독립적인 아키텍처 대신 dLLM 기반의 단일 프레임워크를 구축하는 것을 목표로 합니다.

#Review #Multimodal Foundation Model #Diffusion Large Language Model #SigLIP-VQ #Unified Architecture #Block-wise Masked Diffusion

2026년 4월 22일

[논문리뷰] UniMixer: A Unified Architecture for Scaling Laws in Recommendation Systems

본 논문은 기존 Recommendation 시스템의 Scaling 아키텍처들이 서로 파편화되어 최적의 효율성을 달성하지 못하는 문제를 해결합니다.

#Review #Recommendation Systems #Scaling Laws #UniMixer #Feature Interaction #TokenMixer #Unified Architecture

2026년 4월 1일

[논문리뷰] DreamLite: A Lightweight On-Device Unified Model for Image Generation and Editing

최근 diffusion model은 T2I generation과 text-guided editing 분야에서 비약적인 발전을 이루었으나, 대부분 수십억 개의 파라미터를 필요로 하여 온디바이스 환경에서의 배포에 한계가 있다.

#Review #Diffusion Models #On-device AI #Image Generation #Image Editing #Unified Architecture #Task-progressive Pretraining

2026년 3월 30일

[논문리뷰] Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device

본 논문은 기존의 멀티모달 모델들이 데이터 학습량 이 많고 배포에 필요한 리소스 가 커서 엣지 디바이스에 적용하기 어렵다는 문제점을 해결하고자 합니다. 통합된 멀티모달 아키텍처 를 통해 시각적 이해와 생성을 동시에 수행하면서, 모바일 기기에서 실시간 추론 이 가능하도록 효율적인 모델 을 구축하는 것을 목표로 합니다.

#Review #Multimodal AI #Vision-Language Models #Diffusion Models #Mobile Devices #Edge Computing #Model Efficiency #Unified Architecture #Real-time Inference

2026년 2월 23일

[논문리뷰] ERNIE 5.0 Technical Report

ERNIE 5.0은 텍스트, 이미지, 비디오, 오디오에 걸쳐 통합된 멀티모달 이해 및 생성 을 위한 본질적으로 자기회귀(autoregressive) 기반 파운데이션 모델 을 개발하는 것을 목표로 합니다.

#Review #Multimodal Foundation Model #Autoregressive #Mixture-of-Experts #Elastic Training #Reinforcement Learning #Unified Architecture #Sparse MoE #Efficient Deployment

2026년 2월 4일

[논문리뷰] MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer

기존 통합 멀티모달 LLM이 시각적 이해와 생성 능력 사이의 성능 트레이드오프, 특히 텍스트가 풍부한 벤치마크에서의 저하를 겪는 문제를 해결하는 것을 목표로 합니다.

#Review #Multimodal LLM #Hybrid Tokenizer #Text-to-Image Generation #Visual Question Answering #Autoregressive Model #Diffusion Decoder #Unified Architecture #Model Scaling

2025년 9월 22일

[논문리뷰] Skywork UniPic: Unified Autoregressive Modeling for Visual Understanding and Generation

본 논문은 이미지 이해, 텍스트-투-이미지 생성, 이미지 편집 기능을 단일 아키텍처 내에서 통합하는 1.5억 개 파라미터 의 자기회귀 모델 인 Skywork UniPic 을 소개합니다.

#Review #Autoregressive Models #Multimodal AI #Image Generation #Image Editing #Visual Understanding #Unified Architecture #Parameter Efficiency

2025년 8월 6일

[논문리뷰] Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation

본 연구는 Ming-Omni 의 업그레이드 버전인 Ming-Flash-Omni 를 제안하여, 희소한 Mixture-of-Experts (MoE) 아키텍처를 기반으로 시각, 음성, 언어 전반에 걸쳐 더욱 강력하고 통합된 멀티모달 지능을 구현하는 것을 목표로 합니다.

#Review #Multimodal AI #Sparse MoE #Unified Architecture #Perception #Generation #Contextual ASR #Image Editing #Generative Segmentation

2025년 10월 30일

[논문리뷰] Patch-as-Decodable-Token: Towards Unified Multi-Modal Vision Tasks in MLLMs

기존 MLLM이 시각 작업을 위해 텍스트로 좌표를 생성하는 등 간접적인 표현 방식 에 의존하여 성능이 제한되고 분할(Segmentation)과 같은 밀집 예측(Dense Prediction) 작업 이 어려웠던 문제를 해결하는 것입니다.

#Review #Multimodal Large Language Models (MLLMs)#Visual Reference Tokens (VRTs)#Dense Prediction #Referring Expression Comprehension (REC)#Open-Vocabulary Detection (OVD)#Image Captioning #Unified Architecture #Autoregressive Generation

2025년 10월 9일