#Video Captioning

5 posts

[Paper Review] OmniCap-IF: Benchmarking and Improving Instruction Following Abilities for Omni-Video Captioning

This paper aims to address the lack of systematic evaluation tools for instruction following, the ability of omni-modal models to comply with complex user instructions.

#Review #Omni-modal Large Language Models #Instruction Following #Video Captioning #Temporal Grounding #Constraint Framework #Format-Content Tradeoff

June 8, 2026

[Paper Review] Towards Universal Video MLLMs with Attribute-Structured and Quality-Verified Instructions

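This study aims to address the problem that existing video-instruction data is incomplete and lacks fine-grained information and reliable annotations, which limits the performance of general-purpose video-understanding MLLMs.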

#Review #Video Understanding #Multimodal Large Language Models (MLLMs) #Instruction Tuning #Data Curation #Attribute-Structured Data #Quality Verification #Temporal Grounding #Video Captioning

February 15, 2026

[Paper Review] TimeChat-Captioner: Scripting Multi-Scene Videos with Time-Aware and Structural Audio-Visual Captions

This paper aims to address two limitations of existing audio-visual captioning: the absence of temporal grounding and a vision-centric bias.

#Review #Video Captioning #Multi-Scene Videos #Time-Aware #Structural Captions #Audio-Visual Understanding #Large Language Models #Reinforcement Learning #OmniDCBench

February 11, 2026

[Paper Review] Video-Thinker: Sparking 'Thinking with Videos' via Reinforcement Learning

This paper aims to extend the 'Thinking with Images' paradigm, which has been used successfully in image reasoning, to video reasoning tasks.

#Review #Video Reasoning #Multimodal Large Language Models #Reinforcement Learning #Chain-of-Thought #Video Understanding #Temporal Grounding #Video Captioning #Autonomous Tool Use

October 30, 2025

[Paper Review] IF-VidCap: Can Video Caption Models Follow Instructions?

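The goal is to present a new benchmark that evaluates how well multimodal large language models (MLLMs) follow specific user instructions (e.g., output format, length, and content constraints) in video captioning.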

#Review #Video Captioning #Instruction Following #MLLMs #Benchmark #Controllable Generation #Multimodal Evaluation #Fine-tuning

October 22, 2025