#Temporal Grounding

13 posts

[Paper Review] OmniCap-IF: Benchmarking and Improving Instruction Following Abilities for Omni-Video Captioning

This paper addresses the lack of a systematic evaluation tool for instruction following, the ability of omni-modal models to comply with complex user instructions.

#Review #Omni-modal Large Language Models #Instruction Following #Video Captioning #Temporal Grounding #Constraint Framework #Format-Content Tradeoff

June 8, 2026

[Paper Review] Watch, Remember, Reason: Human-View Video Understanding with MLLMs

This study addresses the shift in video understanding away from short clips and toward long, complex videos that run for minutes or more and interleave multiple modalities.

#Review #Multimodal Large Language Models #Video Understanding #Temporal Grounding #Memory Modeling #Long-video Reasoning #Efficient Perception

June 7, 2026

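[Paper Review] Towards One-to-Many Temporal Grounding

This study addresses a limitation of existing Temporal Grounding research, which focuses mainly on single-segment retrieval (One-to-One) and therefore cannot handle the repeated event structures found in the real world.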

#Review #Temporal Grounding #MLLM #One-to-Many #Reinforcement Learning #Event Cardinality #Benchmark

June 4, 2026

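[Paper Review] Towards Universal Video MLLMs with Attribute-Structured and Quality-Verified Instructions

This study aims to solve the problem that existing video-instruction data is incomplete and lacks fine-grained information and reliable annotations, which constrains the performance of general-purpose video understanding MLLMs.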

#Review #Video Understanding #Multimodal Large Language Models (MLLMs) #Instruction Tuning #Data Curation #Attribute-Structured Data #Quality Verification #Temporal Grounding #Video Captioning

February 15, 2026

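[Paper Review] VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice

The paper re-evaluates the necessity and benefits of Chain-of-Thought (CoT) reasoning in video understanding tasks and points out that existing CoT approaches sometimes underperform direct answering and are inefficient. Building on this, the goal is to develop an adaptive video reasoning framework that reasons only when necessary, improving efficiency and accuracy together.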

#Review #Video Understanding #Chain-of-Thought (CoT) #Reinforcement Learning (RL) #Adaptive Reasoning #Early Exit #Multimodal LLM #Video QA #Temporal Grounding

January 8, 2026

[Paper Review] Factorized Learning for Temporally Grounded Video-Language Models

The paper aims to address the limitations that existing video-language models (VLLMs) face in accurate event-level temporal grounding and in text response generation.

#Review #Video-Language Models #Temporal Grounding #Factorized Learning #Preference Optimization #Evidence Referencing #Video Understanding #Dense Captioning

December 31, 2025

[Paper Review] Video-BrowseComp: Benchmarking Agentic Video Research on Open Web

This paper addresses the problem that existing benchmarks focus on text and static multimodal information seeking while overlooking dynamic web video content.

#Review #Agentic AI #Video Understanding #Web Browsing #Benchmark #Multimodal LLMs #Temporal Grounding #Cross-Source Reasoning #Information Seeking

December 29, 2025

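[Paper Review] LongVideoAgent: Multi-Agent Reasoning with Long Videos

This paper aims to address three problems that existing MLLMs (Multimodal Large Language Models) face with long videos: information loss from compression, a limited tool set, and insufficient fine-grained temporal reasoning.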

#Review #Multi-Agent System #Long Video Understanding #Video Question Answering #Reinforcement Learning #Large Language Models #Temporal Grounding #Multimodal Reasoning #Tool-Augmented AI

December 23, 2025

[Paper Review] LongVT: Incentivizing 'Thinking with Long Videos' via Native Tool Calling

The paper aims to address the hallucination and inaccurate reasoning that arise when large multimodal models (LMMs) process hours-long videos in which evidence is sparse and temporally dispersed.

#Review #Long Video Understanding #Multimodal LLMs #Tool Calling #Reinforcement Learning #Chain-of-Thought #Temporal Grounding #Video Question Answering

December 1, 2025

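[Paper Review] VideoSSR: Video Self-Supervised Reinforcement Learning

This study aims to improve the video understanding ability of Multimodal Large Language Models (MLLMs) by overcoming the limitations of existing video datasets: high annotation cost, insufficient complexity, and bias introduced during annotation.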

#Review #Video Understanding #Self-Supervised Learning #Reinforcement Learning #MLLMs #Pretext Tasks #Verifiable Rewards #Data Generation #Temporal Grounding

November 11, 2025

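[Paper Review] TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs

This paper aims to improve the efficiency of multimodal large language models (MLLMs) on video temporal grounding tasks. It addresses the problem that existing reinforcement learning (RL) methods, GRPO in particular, suffer from inefficient exploration and unstable policy updates in large temporal search spaces.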

#Review #Video LLMs #Temporal Grounding #Reinforcement Learning #Off-policy Learning #Reward Shaping #Chain-of-Thought #Multimodal LLMs

September 23, 2025

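[Paper Review] When and What: Diffusion-Grounded VideoLLM with Entity Aware Segmentation for Long Video Understanding

This paper aims to overcome the limitations of existing Video-LLMs: ambiguous time encoding, poor frame-level continuity, and language-vision misalignment for entities of interest. In particular, it seeks to improve video understanding through precise temporal localization of events in long videos and robust entity-level alignment.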

#Review #Video-LLM #Diffusion Model #Temporal Grounding #Object Segmentation #Long Video Understanding #Multimodal AI #Video Question Answering

August 22, 2025

[Paper Review] Video-Thinker: Sparking 'Thinking with Videos' via Reinforcement Learning

This paper aims to extend the 'Thinking with Images' paradigm, which has been used successfully in image reasoning, to video reasoning tasks.

#Review #Video Reasoning #Multimodal Large Language Models #Reinforcement Learning #Chain-of-Thought #Video Understanding #Temporal Grounding #Video Captioning #Autonomous Tool Use

October 30, 2025