#Long Video Understanding

9 posts

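[Paper Review] Small Vision-Language Models are Smart Compressors for Long Video Understanding

The authors propose Tempo, a framework that uses a small vision-language model (SVLM) as a local compressor to convert long videos into query-dependent memory tokens. For each segment, Tempo performs cross-modal distillation that combines the query with the visual information, and it uses Adaptive Token Allocation (ATA) to strictly respect the inference-time token budget (e.g., 4K/8K).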

#Review #Multimodal Large Language Models #Long Video Understanding #Visual Token Compression #Adaptive Token Allocation #Cross-modal Distillation

April 9, 2026

[Paper Review] VideoDetective: Clue Hunting via both Extrinsic Query and Intrinsic Relevance for Long Video Understanding

A detailed review of the paper 'VideoDetective: Clue Hunting via both Extrinsic Query and Intrinsic Relevance for Long Video Understanding', posted on arXiv.

#Review #Long Video Understanding #Multimodal Large Language Models #Video Question Answering #Graph Neural Networks #Active Inference #Belief Propagation #Spatio-Temporal Graph

March 23, 2026

[Paper Review] LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding

A detailed review of the paper 'LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding', posted on arXiv.

#Review #Long Video Understanding #MLLM Agent #Active Learning #Reinforcement Learning #Chain-of-Thought #Video Navigation #Computational Efficiency

March 1, 2026

[Paper Review] LongVideoAgent: Multi-Agent Reasoning with Long Videos

A detailed review of the paper 'LongVideoAgent: Multi-Agent Reasoning with Long Videos', posted on arXiv by Renjie Pi.

#Review #Multi-Agent System #Long Video Understanding #Video Question Answering #Reinforcement Learning #Large Language Models #Temporal Grounding #Multimodal Reasoning #Tool-Augmented AI

December 23, 2025

[Paper Review] LongVT: Incentivizing 'Thinking with Long Videos' via Native Tool Calling

A detailed review of the paper "LongVT: Incentivizing 'Thinking with Long Videos' via Native Tool Calling", posted on arXiv.

#Review #Long Video Understanding #Multimodal LLMs #Tool Calling #Reinforcement Learning #Chain-of-Thought #Temporal Grounding #Video Question Answering

December 1, 2025

[Paper Review] TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding

A detailed review of the paper 'TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding', posted on arXiv.

#Review #Long Video Understanding #Hybrid Mamba-Transformer #Vision-Language Model #Token Compression #Vision-to-Text Aggregation #Efficient LLM #Multimodal AI

November 20, 2025

[Paper Review] Video-MTR: Reinforced Multi-Turn Reasoning for Long Video Understanding

A detailed review of the paper 'Video-MTR: Reinforced Multi-Turn Reasoning for Long Video Understanding', posted on arXiv by Lionel Ni.

#Review #Long Video Understanding #Reinforcement Learning #Multi-Turn Reasoning #MLLMs #Video Segment Selection #Bi-level Reward #Question Answering

September 5, 2025

[Paper Review] ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding

A detailed review of the paper 'ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding', posted on arXiv by Xuanyu Zheng.

#Review #Long Video Understanding #Hallucination #Semantic Aggregation #Video MLLM #Benchmark #DPO #Positional Encoding #VideoQA

September 3, 2025

[Paper Review] When and What: Diffusion-Grounded VideoLLM with Entity Aware Segmentation for Long Video Understanding

A detailed review of the paper 'When and What: Diffusion-Grounded VideoLLM with Entity Aware Segmentation for Long Video Understanding', posted on arXiv by Rui Guo.

#Review #Video-LLM #Diffusion Model #Temporal Grounding #Object Segmentation #Long Video Understanding #Multimodal AI #Video Question Answering

August 22, 2025