#Temporal Reasoning

19 posts

[Paper Review] Bridging VideoQA and Video-Guided Agentic Tasks via Generalized Keyframe Extraction

This paper argues that modern MLLMs (Multimodal Large Language Models) excel at recognizing superficial visual cues, as in VideoQA, but fall short at acquiring deep procedural knowledge from video tutorials and generalizing it to complex downstream tasks.

#Review #VideoQA #Video-Guided Agent #Keyframe Extraction #In-Context Learning #GUI Agents #Procedural Knowledge #Temporal Reasoning

June 29, 2026

[Paper Review] OmniGUI: Benchmarking GUI Agents in Omni-Modal Smartphone Environments

This paper addresses a limitation of existing GUI agent benchmarks: because they are built mostly around static screenshots, they cannot evaluate the dynamic audio and video processing abilities that real-time environments demand.

#Review #GUI Agents #Multimodal Benchmark #Smartphone Environments #Temporal Reasoning #Auditory Processing #Action Grounding

May 19, 2026

[Paper Review] ResAdapt: Adaptive Resolution for Efficient Multimodal Reasoning

Recent MLLMs improve performance by raising the fidelity of their input, but this causes an excessive increase in visual tokens, making it infeasible to maintain both high resolution and long temporal context at the same time.

#Review #Multimodal Large Language Models (MLLMs) #Input-side Adaptation #Contextual Bandit #Cost-Aware Policy Optimization (CAPO) #Visual Budgeting #Efficient Inference #Temporal Reasoning

March 30, 2026

[Paper Review] Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders

This paper addresses the problem of model size scaling, which hinders deployment of VLMs (Vision Language Models) in compute-constrained environments such as mobile and edge devices.

#Review #Vision Language Model (VLM) #LLM-based Vision Encoder #Efficient AI #Multimodal Understanding #Generative Pretraining #Resource-constrained Deployment #Temporal Reasoning

March 8, 2026

[Paper Review] Towards Multimodal Lifelong Understanding: A Dataset and Agentic Baseline

The paper points out that existing video understanding datasets consist mostly of short clips and fail to reflect natural, long-term daily life, and it aims to rigorously define a true Multimodal Lifelong Understanding task.

#Review #Multimodal Lifelong Understanding #Video Dataset #Agentic AI #Dynamic Memory Management #Long-Context MLLMs #Temporal Reasoning #Concept Drift

March 5, 2026

[Paper Review] RIVER: A Real-Time Interaction Benchmark for Video LLMs

The paper addresses the problem that most Multimodal Large Language Models (MLLMs) operate in an offline paradigm and therefore lack real-time interaction capabilities.

#Review #Multimodal LLMs #Real-time Interaction #Video Understanding #Benchmark #Temporal Reasoning #Long-term Memory #Proactive Response

March 4, 2026

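[Paper Review] Chain of World: World Model Thinking in Latent Motion

The paper aims to overcome the limitations of existing VLA (Vision-Language-Action) models, which lack predictive ability and are inefficient because they reconstruct visually redundant content, and to address the shortcomings of latent action models in continuous dynamics modeling and world knowledge.

#Review #Vision-Language-Action Models #World Models #Latent Motion #Embodied Intelligence #Temporal Reasoning #Disentangled Representation #Robotics #Pretraining

March 3, 2026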

[Paper Review] CoPE-VideoLM: Codec Primitives For Efficient Video Language Models

The goal is to address two problems in existing Video Language Models (VideoLMs): the high computational overhead of dense RGB frame encoding and the limited temporal coverage of sparse keyframe sampling.

#Review #Video Language Models #Codec Primitives #Efficient Tokenization #Motion Vectors #Residuals #Temporal Reasoning #Long-Context Understanding #Video Compression

February 15, 2026

[Paper Review] How Much 3D Do Video Foundation Models Encode?

This paper aims to quantitatively investigate whether global 3D understanding emerges naturally in Video Foundation Models (VidFMs) pretrained on large-scale video data.

#Review #Video Foundation Models #3D Understanding #3D Reconstruction #Model Agnostic #Feature Probing #Diffusion Models #Temporal Reasoning

December 25, 2025

[Paper Review] Streaming Video Instruction Tuning

This paper aims to develop Streamo, a general-purpose interactive AI assistant that understands real-time video streams and responds to dynamic instructions.

#Review #Streaming Video Understanding #Large Language Models (LLMs) #Instruction Tuning #Multi-task Learning #Real-time AI Assistant #Temporal Reasoning #Focal Loss #Video Question Answering

December 24, 2025

[Paper Review] HERBench: A Benchmark for Multi-Evidence Integration in Video Question Answering

The paper addresses the problem that existing VideoQA benchmarks tend to rely on a single cue or on language priors, and so fail to properly evaluate multi-evidence integration.

#Review #Video Question Answering #Multi-evidence Integration #Video-LLMs #Benchmark #Temporal Reasoning #Frame Selection #Evidential Requirement #MRFS

December 21, 2025

[Paper Review] HiF-VLA: Hindsight, Insight and Foresight through Motion Representation for Vision-Language-Action Models

The goal is to address the temporal myopia and lack of coherence that most Vision-Language-Action (VLA) models suffer on long-horizon tasks because they assume the Markov property.

#Review #Vision-Language-Action #Motion Representation #Temporal Reasoning #Long-Horizon Manipulation #Hindsight #Foresight #Robotics

December 10, 2025

[Paper Review] Unified Video Editing with Temporal Reasoner

The paper aims to resolve the trade-off in existing video editing models between precision (expert models) and unified, mask-free editing (in-context learning models).

#Review #Video Editing #Diffusion Models #Temporal Reasoning #Chain-of-Thought #In-Context Learning #ROPE #Multi-instance Editing

December 8, 2025

[Paper Review] StreamGaze: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos

This study aims to evaluate how effectively multimodal large language models (MLLMs) use human gaze signals for temporal reasoning and proactive understanding in streaming video settings.

#Review #Streaming Video Understanding #Gaze-Guided AI #Temporal Reasoning #Proactive AI #MLLMs #Eye Tracking #Benchmark #Human-Computer Interaction

December 1, 2025

[Paper Review] VIR-Bench: Evaluating Geospatial and Temporal Understanding of MLLMs via Travel Video Itinerary Reconstruction

This study points out that existing video benchmarks do not adequately cover macro-scale geospatial-temporal scenarios such as long-distance travel and multi-day activities, and presents VIR-Bench, a new benchmark for evaluating the extended geospatial and temporal understanding of MLLMs (Multimodal Large Language Models).

#Review #Multimodal LLMs #Video Understanding #Geospatial Reasoning #Temporal Reasoning #Travel Itinerary Reconstruction #Benchmark #Agent System #VLOG

September 24, 2025

[Paper Review] Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs

This paper aims to uncover the mechanism by which Video Large Language Models (VideoLLMs) internally extract and propagate video-text information (spatiotemporal inputs) to perform temporal reasoning in video question answering (VideoQA) tasks.

#Review #Video Large Language Models #VideoQA #Mechanistic Interpretability #Attention Knockout #Temporal Reasoning #Information Flow #Model Interpretability #Logit Lens

October 27, 2025

[Paper Review] MINED: Probing and Updating with Multimodal Time-Sensitive Knowledge for Large Multimodal Models

This study addresses the difficulty that large multimodal models (LMMs) have in accurately understanding factual knowledge that changes over time.

#Review #Large Multimodal Models (LMMs) #Time-Sensitive Knowledge #Temporal Reasoning #Knowledge Editing #Multimodal Benchmarking #Temporal Awareness #Dynamic Knowledge

October 23, 2025

[Paper Review] OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM

This study aims to build OmniVinci, a strong open-source omni-modal LLM that can perceive and reason about the world across multiple modalities as humans do.

#Review #Omni-Modal LLM #Multimodal Understanding #Vision-Audio Alignment #Temporal Reasoning #Data Curation #Foundation Models #Contrastive Learning #Rotary Time Embedding

October 20, 2025

[Paper Review] ChronoEdit: Towards Temporal Reasoning for Image Editing and World Simulation

This paper aims to address the lack of physical consistency in existing image editing models, in particular ensuring that edited objects remain physically consistent with the scene context in world-simulation tasks.

#Review #Image Editing #Video Generation #Temporal Reasoning #World Simulation #Physical Consistency #Diffusion Models #Generative Models

October 7, 2025