#World Modeling

16 posts

[Paper Review] UniVR: Thinking in Visual Space for Unified Visual Reasoning

This paper points out the limitations that arise because current AI models carry out reasoning and planning mainly in text space. Text is only an abstract representation and cannot fully capture physical laws, complex dynamics, or spatial relationships, so models struggle to act consistently in real visual environments.

#Review #Visual Reasoning #Reinforcement Learning #Unified Generative Model #Long-horizon Planning #Physical Consistency #World Modeling

July 16, 2026

[Paper Review] Learning Transferable Dynamics Priors from Action to World Modeling

This paper aims to learn general-purpose dynamics priors from large-scale robot data and to use them to improve both simulators and policy performance in robot learning.

#Review #Robot Learning #World Modeling #Diffusion Models #Dynamics Priors #Action-Conditioned #Policy Evaluation #Sim-to-Real

June 29, 2026

[Paper Review] Next Forcing: Causal World Modeling with Multi-Chunk Prediction

This paper addresses the high latency and computational inefficiency that existing autoregressive models incur when generating long sequences. Because conventional models must generate tokens one at a time, they hit a bottleneck when simulating complex environments or generating long contexts.

#Review #World Modeling #Multi-Chunk Prediction #Causal Modeling #Autoregressive Generation #Sequence Modeling

June 9, 2026

[Paper Review] World-Language-Action Model for Unified World Modeling, Language Reasoning, and Action Synthesis

This paper sets out to overcome the limitations of the existing WAM (World-Action Model) and VLA (Vision-Language-Action Model).

#Review #Embodied AI #World Modeling #Language Reasoning #Action Synthesis #Autoregressive Transformer #Test-Time Scaling #Cross-Embodiment

June 4, 2026

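[Paper Review] Policy and World Modeling Co-Training for Language Agents

This paper addresses the problem that LLM agents, during standard RL training, focus solely on reward optimization and therefore lack the ability to predict how the environment will respond. Prior work required a separate simulator, complex multi-stage training, or extra computation at inference time, all of which increased system complexity.

#Review #Language Agents #Reinforcement Learning #World Modeling #Co-Training #On-policy RL #Clipped MAE #Reward-adaptive Loss

June 1, 2026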

[Paper Review] UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling

This paper proposes UniT, which aligns heterogeneous actions into a common latent space through visual anchoring. UniT has a tri-branch architecture consisting of visual, action, and fusion branches, and every branch is quantized into a shared codebook via residual quantization (RQ-VAE).

#Review #Humanoid Robotics #Vision-Language-Action Models #Cross-Embodiment Transfer #Latent Action Tokenizer #World Modeling #Visual Anchoring #Cross-Reconstruction

April 23, 2026

[Paper Review] WildWorld: A Large-Scale Dataset for Dynamic World Modeling with Actions and Explicit State toward Generative ARPG

Existing video world models struggle to learn action-conditioned dynamics because current datasets do not meet the requirements for doing so.

#Review #World Modeling #Action-Conditioned Generation #Dataset #Generative ARPG #Explicit State Annotation #Video Generation #Long-Horizon Consistency

March 24, 2026

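[Paper Review] DreamWorld: Unified World Modeling in Video Generation

The goal is to address the limitation that existing video generation models pursue only visual realism and lack a consistent understanding of the world. The paper seeks to integrate heterogeneous world knowledge, such as physical common sense and 3D and temporal consistency, into the video generator, and to mitigate the visual instability and temporal flickering that this integration causes.

#Review #Video Generation #World Modeling #Diffusion Models #Multi-modal Integration #Temporal Consistency #Spatial Geometry #Semantic Consistency #Constraint Annealing

March 5, 2026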

[Paper Review] Beyond Language Modeling: An Exploration of Multimodal Pretraining

This paper aims to go beyond the limits of conventional language modeling by exploring the design space of unified multimodal pretraining, in which vision signals are integrated as first-class citizens, and to provide empirical clarity about that space.

#Review #Multimodal Pretraining #Vision-Language Models #Mixture-of-Experts (MoE) #Representation Autoencoders (RAE) #World Modeling #Scaling Laws #Diffusion Models #Unified Architectures

March 3, 2026

[Paper Review] FRAPPE: Infusing World Modeling into Generalist Policies via Multiple Future Representation Alignment

This paper addresses two main problems that Vision-Language-Action (VLA) models face when improving their world modeling ability: an overemphasis on pixel-level reconstruction, and error accumulation caused by reliance on predicted future observations.

#Review #World Modeling #Generalist Policies #Representation Alignment #Diffusion Models #Robotics #Fine-tuning #Egocentric Data #VLA

February 19, 2026

[Paper Review] Self-Improving World Modelling with Latent Actions

This paper aims to improve the intrinsic world modeling ability of LLMs (Large Language Models) and VLMs (Vision-Language Models) from state-only sequences in which actions are treated as latent variables.

#Review #World Modeling #Latent Actions #Self-Improvement #Reinforcement Learning #LLMs #VLMs #Inverse Dynamics Model #Forward World Modelling

February 8, 2026

[Paper Review] X-Streamer: Unified Human World Modeling with Audiovisual Interaction

The paper addresses the context mismatch, latency, and error accumulation caused by conventional modular pipelines in multimodal interactive human agent systems that span computer vision, speech, and text.

#Review #Digital Human #Multimodal AI #Real-time Streaming #Video Generation #Diffusion Models #Transformer Architecture #Audiovisual Synchronization #World Modeling

September 29, 2025

[Paper Review] SpatialVID: A Large-Scale Video Dataset with Spatial Annotations

This paper addresses the lack of explicit spatial information and rich semantic annotations in large-scale, real-world dynamic video datasets. This gap holds back progress in AI/ML areas such as 3D reconstruction, world modeling, and dynamic scene synthesis, and it underscores the need for a core resource for training physically consistent models.

#Review #Video Dataset #Spatial Annotation #Camera Pose Estimation #Depth Map #Structured Caption #Motion Instruction #3D Vision #World Modeling

September 12, 2025

[Paper Review] CVD-STORM: Cross-View Video Diffusion with Spatial-Temporal Reconstruction Model for Autonomous Driving

To build a comprehensive world model for autonomous driving, the paper aims to generate long-horizon multi-view video under a variety of control inputs while also providing 4D scene reconstruction. In particular, it seeks to overcome the limitation that existing video generation models cannot handle explicit 3D information, which makes them hard to apply to autonomous driving scenarios.

#Review #Autonomous Driving #Video Generation #Diffusion Models #Spatial-Temporal Reconstruction #3D Gaussian Splatting #Variational Autoencoder #World Modeling #Multi-View Video

October 16, 2025

[Paper Review] Agent Learning via Early Experience

This paper addresses the difficulty language agents face in learning and improving from their own experience in environments where rewards are absent or unclear.

#Review #Language Agents #Early Experience #Reward-Free Learning #World Modeling #Self-Reflection #Imitation Learning #Reinforcement Learning #Out-of-Domain Generalization

October 10, 2025

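[Paper Review] StaMo: Unsupervised Learning of Generalizable Robot Motion from Compact State Representation

The core goal is to develop an expressive and compact state representation for efficient world modeling and decision-making in robot systems. The paper seeks to overcome the limitations of existing methods, which suffer from excessive redundancy or a lack of essential information, and to learn a representation that effectively summarizes the robot's visual information and can be connected directly to actions.

#Review #Robot Learning #State Representation #Motion Representation #Diffusion Models #Unsupervised Learning #World Modeling #Vision-Language Models #Latent Action

October 9, 2025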