#Adaptive Token Allocation

1 post

[Paper Review] Small Vision-Language Models are Smart Compressors for Long Video Understanding

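The authors propose Tempo, a framework that uses a small vision-language model (SVLM) as a local compressor to convert long videos into query-dependent memory tokens. For each segment, Tempo performs cross-modal distillation that combines the query with visual information, and it uses Adaptive Token Allocation (ATA) to stay strictly within the inference-time token budget (e.g., 4K/8K).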

#Review #Multimodal Large Language Models #Long Video Understanding #Visual Token Compression #Adaptive Token Allocation #Cross-modal Distillation

April 9, 2026
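
The summary above says only that ATA keeps the total number of memory tokens within a fixed inference-time budget. It does not describe the allocation rule. As a rough illustration of what budget-constrained allocation can look like, the sketch below splits a budget across video segments in proportion to per-segment relevance scores, using largest-remainder rounding so the total matches the budget exactly. The function name `allocate_tokens`, the `scores` input, and the `min_tokens` floor are all hypothetical; this is not the paper's ATA procedure.

```python
from fractions import Fraction
from math import floor


def allocate_tokens(scores, budget, min_tokens=0):
    """Split `budget` tokens across segments in proportion to `scores`.

    Hypothetical illustration only, not the ATA rule from the paper.
    Every segment receives at least `min_tokens`; the rest is shared
    proportionally and rounded with the largest-remainder method, so the
    allocations always sum to exactly `budget`.
    """
    n = len(scores)
    if n == 0:
        return []
    if any(s < 0 for s in scores):
        raise ValueError("scores must be non-negative")
    if min_tokens * n > budget:
        raise ValueError("budget too small for the per-segment minimum")

    remaining = budget - min_tokens * n
    # Exact rational arithmetic avoids float rounding pushing us over budget.
    exact = [Fraction(s) for s in scores]
    total = sum(exact)
    if total == 0:
        # No segment is preferred: fall back to a uniform split.
        shares = [Fraction(remaining, n)] * n
    else:
        shares = [remaining * s / total for s in exact]

    alloc = [min_tokens + floor(share) for share in shares]
    leftover = budget - sum(alloc)  # always between 0 and n - 1

    # Give the leftover tokens to the segments with the largest fractional parts.
    by_remainder = sorted(range(n), key=lambda i: shares[i] - floor(shares[i]), reverse=True)
    for i in by_remainder[:leftover]:
        alloc[i] += 1
    return alloc


if __name__ == "__main__":
    # Three segments with made-up relevance scores and a 4K budget.
    print(allocate_tokens([0.5, 0.3, 0.2], budget=4096, min_tokens=16))
```

Proportional allocation with a floor is just one simple choice; a real system might instead cap each segment or derive the scores from the query-conditioned compressor itself.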