Review

[Paper Review] Calibri: Enhancing Diffusion Transformers via Parameter-Efficient Calibration

The paper 'Calibri: Enhancing Diffusion Transformers via Parameter-Efficient Calibration' by Danil Tokhchukov, Aysel Mirzoeva, Andrey Kuznetsov, and Konstantin Sobolev from MSU and FusionBrain Lab, AXXX, discusses a new method called…

March 26, 2026

[Paper Review] BioVITA: Biological Dataset, Model, and Benchmark for Visual-Textual-Acoustic Alignment

Understanding animal species through multimodal data (visual, textual, acoustic) is a growing challenge at the intersection of computer vision and ecology.

March 26, 2026

[Paper Review] AVControl: Efficient Framework for Training Audio-Visual Controls

Fine-grained control over video and audio generation is essential for real-world creative applications. However, the range of possible controls across modalities such as depth, pose, camera trajectories, and audio transformations is vast.

#Review #Audio-Visual Generation #Video Control #LoRA #Parallel Canvas Conditioning #Diffusion Models #Modularity #Efficiency

March 26, 2026

[Paper Review] When Models Judge Themselves: Unsupervised Self-Evolution for Multimodal Reasoning

Recent multimodal large language models (MLLMs) have shown strong performance on reasoning tasks, but this progress relies mainly on high-quality annotated data or teacher-model distillation, which is costly and hard to scale.

#Review #Unsupervised Self-Evolution #Multimodal Reasoning #Consistency-Based Reward #Judge Modulation #Group Relative Policy Optimization (GRPO) #Policy Updates #Mathematical Reasoning #Large Language Models

March 25, 2026

[Paper Review] Unleashing Spatial Reasoning in Multimodal Large Language Models via Textual Representation Guided Reasoning

Existing Multimodal Large Language Models (MLLMs) struggle with 3D spatial reasoning because they are overly anchored to 2D visual signals and fail to build structured abstractions of 3D environments.

#Review #Multimodal Large Language Models (MLLMs) #Spatial Reasoning #Textual Representation #Allocentric Context #Egocentric Video #Prompting Methods #VSI-Bench #OST-Bench

March 25, 2026

[Paper Review] UI-Voyager: A Self-Evolving GUI Agent Learning via Failed Experience

Interest in autonomous mobile GUI agents is growing alongside advances in Multimodal Large Language Models (MLLMs), but existing methods suffer from inefficient learning from failed trajectories and from ambiguous credit assignment caused by sparse rewards in long-horizon GUI tasks.

#Review #GUI Agent #Self-Evolving Learning #Rejection Fine-Tuning (RFT) #Group Relative Self-Distillation (GRSD) #Credit Assignment #Sparse Rewards #Mobile Automation #Multimodal Large Language Models (MLLMs)

March 25, 2026

[Paper Review] Toward Physically Consistent Driving Video World Models under Challenging Trajectories

In autonomous driving simulation, video world models are gaining importance as an alternative to costly real-world data collection and high-fidelity physics simulators. Existing driving world models are typically trained on real driving datasets, which consist mostly of safe, ordinary scenarios.

#Review #Driving World Models #Physical Consistency #Video Generation #Challenging Trajectories #Autonomous Driving #Heterogeneous Dataset

March 25, 2026

[Paper Review] T-MAP: Red-Teaming LLM Agents with Trajectory-aware Evolutionary Search

Prior LLM red-teaming research has focused mainly on eliciting harmful text outputs from models, but this overlooks the new safety risks of LLM agents that can perform multi-step tool execution through unified standards such as the Model Context Protocol (MCP).

#Review #LLM Agents #Red-Teaming #Vulnerability Discovery #Trajectory-aware Search #MAP-Elites #Tool Call Graph #Attack Realization Rate

March 25, 2026

[Paper Review] StreamingClaw Technical Report

Applications such as embodied intelligence, AI hardware, autonomous driving, and intelligent cockpits depend heavily on a real-time perception-decision-action closed loop, which imposes strict requirements on real-time streaming video understanding.

#Review #Streaming Video Understanding #Embodied Intelligence #Multi-agent Systems #Long-term Memory #Proactive Interaction #Real-time Inference #OpenClaw

March 25, 2026

[Paper Review] PLDR-LLMs Reason At Self-Organized Criticality

This work addresses the core question of how reasoning ability emerges in Large Language Models (LLMs) and how it can be quantified effectively.

#Review #PLDR-LLMs #Self-Organized Criticality #Reasoning #Deductive Outputs #Order Parameter #Phase Transitions #Generalization #Attention Mechanism

March 25, 2026

[Paper Review] OmniWeaving: Towards Unified Video Generation with Free-form Composition and Reasoning

Proprietary systems such as Seedance-2.0 have achieved remarkable success in omni-capable video generation, but open-source alternatives lag considerably behind.

#Review #Unified Video Generation #Multimodal Composition #Reasoning-Augmented #IntelligentVBench #MLLM #MMDiT #DeepStacking #Free-form Inputs

March 25, 2026

[Paper Review] LagerNVS: Latent Geometry for Fully Neural Real-time Novel View Synthesis

Novel View Synthesis (NVS) is an important task that generates images from new viewpoints based on existing views.

#Review #Novel View Synthesis (NVS) #Latent Geometry #Real-time Rendering #3D Inductive Biases #Encoder-Decoder #VGGT #Generalization #Diffusion Models

March 25, 2026

[Paper Review] GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents

Multimodal Large Language Models (MLLMs) are increasingly used as the perceptual backbone of autonomous agents in 3D environments, from robotics to virtual worlds.

March 25, 2026

[Paper Review] EVA: Efficient Reinforcement Learning for End-to-End Video Agent

Existing video understanding systems built on multimodal large language models (MLLMs) act as passive recognizers, processing entire videos or uniformly sampled frames without any adaptive reasoning.

#Review #Video Agent #Reinforcement Learning #MLLM #Planning-before-Perception #Tool Use #KTO #GRPO

March 25, 2026

[Paper Review] CarePilot: A Multi-Agent Framework for Long-Horizon Computer Task Automation in Healthcare

Multimodal agentic pipelines are transforming human-computer interaction, but most focus on short-horizon or general-purpose applications, and long-horizon automation, particularly in healthcare, remains largely unexplored.

#Review #Multi-Agent Framework #Healthcare Automation #Long-Horizon Tasks #Actor-Critic #Tool Grounding #Dual-Memory #CareFlow #GUI Agents

March 25, 2026

[Paper Review] Can LLM Agents Be CFOs? A Benchmark for Resource Allocation in Dynamic Enterprise Environments

Recent advances in Large Language Models (LLMs) have enabled agent systems that can reason, plan, and act on complex tasks, but it remains unclear whether they can allocate resources effectively in uncertain environments. Resource allocation is fundamentally different from short-term reactive decision-making.

#Review #LLM Agents #Resource Allocation #Enterprise Simulation #Financial Management #Uncertainty #Long-Horizon Decision-Making #CFO

March 25, 2026

[Paper Review] CUA-Suite: Massive Human-annotated Video Demonstrations for Computer-Use Agents

The vision of intelligent agents automating complex desktop workflows has been held back by a shortage of continuous, high-quality human demonstration videos.

#Review #Computer-Use Agents #Video Demonstrations #Human Annotation #Desktop Applications #Visual Grounding #Action Prediction #Multi-layered Reasoning #Foundation Action Models

March 25, 2026

[Paper Review] 6Bit-Diffusion: Inference-Time Mixed-Precision Quantization for Video Diffusion Models

Video Diffusion Transformers (DiTs) show outstanding video generation capability, but their high memory usage and heavy computational cost severely constrain real-world deployment.

#Review #Video Diffusion Transformers #Mixed-Precision Quantization #Inference Acceleration #Temporal Delta Cache #NVFP4 #INT8 #Post-Training Quantization #Memory Reduction

March 25, 2026

[Paper Review] WildWorld: A Large-Scale Dataset for Dynamic World Modeling with Actions and Explicit State toward Generative ARPG

Existing video world models struggle to learn action-conditioned dynamics because current datasets do not meet the requirements of that task.

#Review #World Modeling #Action-Conditioned Generation #Dataset #Generative ARPG #Explicit State Annotation #Video Generation #Long-Horizon Consistency

March 24, 2026

[Paper Review] VISion On Request: Enhanced VLLM efficiency with sparse, dynamically selected, vision-language interactions

Existing approaches to improving the efficiency of Large Vision-Language Models (LVLMs) rely mainly on visual token reduction.

#Review #LVLM Efficiency #Sparse Interaction #Cross-Attention #Self-Attention #Adaptive Inference #Visual Feature Refinement #Computational Cost Reduction

March 24, 2026