#Mixture-of-Transformers

6개의 포스트

[논문리뷰] Cosmos 3: Omnimodal World Models for Physical AI

Physical AI 에이전트 학습을 위한 기존의 파편화된 파이프라인은 이해(Understanding)와 생성(Generation) 모듈이 분리되어 있어 데이터 효율성과 확장성이 낮습니다.

#Review #World Model #Physical AI #Mixture-of-Transformers #Omnimodal #Data-Driven Specialization #Synthetic Data #Action-Conditioned Generation

2026년 6월 3일

[논문리뷰] EVA01: Unified Native 3D Understanding and Generation via Mixture-of-Transformers

본 논문은 기존의 Diffusion 기반 3D 생성 모델들이 의미론적 이해(semantic understanding)와 기하학적 추론(geometric reasoning)을 분리하여 처리함으로써 발생하는 한계를 해결하고자 합니다.

#Review #Multimodal Large Language Models #Mixture-of-Transformers #3D Native Generation #Context-aware Editing #Flow Matching #Sparse Voxel Representation

2026년 6월 1일

[논문리뷰] Representation Forcing for Bottleneck-Free Unified Multimodal Models

본 논문은 기존 UMM이 frozen VAE에 의존하여 발생하는 structural bottleneck 문제를 해결하기 위해 Representation Forcing (RF)을 제안한다 .

#Review #Unified Multimodal Models #Representation Forcing #Pixel-space Diffusion #Vector Quantization #End-to-End Learning #Bottleneck-Free #Mixture-of-Transformers

2026년 5월 31일

[논문리뷰] HY-Embodied-0.5: Embodied Foundation Models for Real-World Agents

본 논문은 모달리티 적응형 컴퓨팅을 위한 MoT 아키텍처와 비전-언어 연결을 강화하는 Visual Latent Tokens를 핵심 방법론으로 제안합니다 . 시각적 인지 능력 향상을 위해 HY-ViT 2.0 인코더를 탑재하고, 고품질 embodied 데이터를 활용한 반복적인 사후 학습 패러다임을 설계했습니다.

#Review #Embodied Foundation Models #Mixture-of-Transformers #Visual Latent Tokens #On-policy Distillation #Chain-of-Thought #Real-world Agents

2026년 4월 9일

[논문리뷰] UniDriveVLA: Unifying Understanding, Perception, and Action Planning for Autonomous Driving

본 논문은 VLA 모델을 자율주행에 적용할 때 발생하는 공간 인지와 의미론적 추론 간의 근본적인 충돌 문제를 해결하고자 합니다. 기존의 VLA 시스템들은 주로 사전 학습된 2D VLM을 기반으로 하는데, 이는 강력한 의미론적 이해 능력을 갖춘 반면 자율주행에 필수적인 공간 인지 능력이 부족하다는 한계를 지닙니다.

#Review #Vision-Language-Action Models #Autonomous Driving #Mixture-of-Transformers #Sparse Perception #Representation Interference #End-to-End Planning

2026년 4월 2일

[논문리뷰] Video-As-Prompt: Unified Semantic Control for Video Generation

이 논문은 비디오 생성 분야에서 통합적이고 일반화 가능한 의미론적 제어라는 중요한 과제를 해결하고자 합니다. 기존 방법론들이 부적절한 픽셀 단위 사전 정보를 강요하여 아티팩트를 생성하거나, 특정 조건에 대한 파인튜닝이나 태스크별 아키텍처에 의존하여 일반화가 어렵다는 문제를 극복하는 것을 목표로 합니다.

#Review #Video Generation #Semantic Control #Diffusion Transformers #In-Context Learning #Mixture-of-Transformers #Video-As-Prompt #Controllable Generation #Large-scale Dataset

2025년 10월 27일