#Multimodal Retrieval

14 posts

[Paper Review] Rethinking RAG in Long Videos: What to Retrieve and How to Use It?

This paper addresses two problems facing VideoRAG systems: opaque evaluation and the lack of an optimal retrieval strategy.

#Review #VideoRAG #Egocentric Video #V-RAGBench #CARVE #Chunk-Adaptive Reranking #Multimodal Retrieval #Long-form Video Understanding

June 14, 2026

[Paper Review] Benchmarking Composed Image Retrieval for Applied Earth Observation

This paper addresses the limitations of conventional single-modality (image-only or text-only) retrieval, which makes it hard to reflect a user's specific intent when exploring Earth Observation (EO) archives.

#Review #Remote Sensing Image Retrieval #Composed Image Retrieval #Multimodal Retrieval #Vision-Language Models #Earth Observation #Benchmarking

May 31, 2026

[Paper Review] Your Embedding Model is SMARTer Than You Think

This paper addresses the fundamental information bottleneck that arises when a single-vector multimodal retriever compresses a rich, sequential token sequence into a single global representation.

#Review #Multimodal Retrieval #Single-Vector Embeddings #Multi-Vector Embeddings #Late Interaction #Information Bottleneck #Hidden States #Contrastive Learning #Plug-and-Play

May 25, 2026

[Paper Review] V-Retrver: Evidence-Driven Agentic Reasoning for Universal Multimodal Retrieval

Existing MLLM-based retrieval systems rely on static visual encoding and cannot actively verify visual evidence, which leads to reasoning errors in visually ambiguous cases. This paper aims to improve the accuracy and reliability of universal multimodal retrieval through an evidence-driven agentic reasoning process grounded in visual inspection.

#Review #Multimodal Retrieval #Agentic AI #Large Language Models (LLMs) #Visual Tools #Chain-of-Thought (CoT) #Reinforcement Learning #Curriculum Learning #Evidence-Driven Reasoning

February 5, 2026

[Paper Review] OS-Symphony: A Holistic Framework for Robust and Generalist Computer-Using Agent

This paper aims to solve the difficulties that Vision-Language Model (VLM)-based Computer-Using Agents (CUAs) face in staying robust over long workflows and generalizing to new domains.

#Review #Computer-Using Agent (CUA) #Multi-Agent Framework #Long-horizon Tasks #Memory Management #Multimodal Retrieval #Reflection #Generalization

January 12, 2026

[Paper Review] Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking

This paper introduces the Qwen3-VL-Embedding and Qwen3-VL-Reranker model series, which perform high-precision multimodal retrieval by unifying diverse modalities such as text, images, document images, and video.

#Review #Multimodal Retrieval #Multimodal Ranking #Foundation Models #Embedding Models #Reranking Models #Contrastive Learning #Knowledge Distillation #Matryoshka Representation Learning #Quantization-Aware Training

January 11, 2026

[Paper Review] Vote-in-Context: Turning VLMs into Zero-Shot Rank Fusers

This work seeks to overcome a limitation of existing rank-based fusion methods: when fusing candidates from heterogeneous retrievers, they ignore content and rely only on rank or score signals.

#Review #Video Retrieval #Vision-Language Models (VLMs) #Zero-Shot Learning #List-wise Reranking #Rank Fusion #Prompt Engineering #S-Grid #Multimodal Retrieval

November 9, 2025

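[Paper Review] X-CoT: Explainable Text-to-Video Retrieval via LLM-based Chain-of-Thought Reasoning

This paper aims to address the limitations of existing embedding-model-based text-to-video retrieval systems, namely the impact of low-quality data and the lack of explanations for ranking results. In particular, it proposes X-CoT, an explainable retrieval system that makes ranking results interpretable so that retrieval model behavior and text-video data quality can be evaluated.

#Review #Text-to-Video Retrieval #LLM #Chain-of-Thought #Explainable AI #Multimodal Retrieval #Bradley-Terry Model #Video Annotation

September 29, 2025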

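[Paper Review] MetaEmbed: Scaling Multimodal Retrieval at Test-Time with Flexible Late Interaction

Existing multimodal retrieval methods either hit the expressiveness limits of single-vector embeddings or, in multi-vector approaches, face scalability constraints from the computational cost of handling many tokens. The main goal of this paper is to develop a multimodal retrieval paradigm that scales while maintaining high accuracy, through flexible test-time control of embedding granularity.

#Review #Multimodal Retrieval #Late Interaction #Meta Tokens #Matryoshka Representation Learning #Test-Time Scaling #Vision-Language Models #Dense Retrieval #Efficiency

September 23, 2025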

[Paper Review] VisR-Bench: An Empirical Study on Visual Retrieval-Augmented Generation for Multilingual Long Document Understanding

Going beyond the English-only and single-page restrictions of existing benchmarks, this paper proposes VisR-Bench, a new benchmark for evaluating question-driven multimodal retrieval in multilingual long documents.

#Review #Multimodal Retrieval #Retrieval-Augmented Generation #Long Document Understanding #Multilingual NLP #Visual QA #Benchmark #MLLMs #Table Understanding

August 12, 2025

[Paper Review] Towards Mixed-Modal Retrieval for Universal Retrieval-Augmented Generation

This work addresses the problem that existing RAG systems focus only on unimodal text or limited multimodal settings, and therefore struggle to handle the mixed-modal queries and documents found in real-world environments.

#Review #Universal RAG #Multimodal Retrieval #Mixed-Modal Data Generation #Vision-Language Models #Contrastive Learning #Matryoshka Representation Learning

October 21, 2025

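[Paper Review] FG-CLIP 2: A Bilingual Fine-grained Vision-Language Alignment Model

Existing vision-language models (VLMs) are good at large-scale global alignment but struggle to capture fine-grained details such as object attributes, spatial relations, and subtle linguistic expressions, and they lack multilingual support for non-English settings (Chinese in particular). This paper aims to address these problems.

#Review #Vision-Language Alignment #Fine-grained Understanding #Bilingual Model #Contrastive Learning #Multimodal Retrieval #Open-Vocabulary Detection #Region-Text Matching

October 16, 2025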

[Paper Review] MRMR: A Realistic and Expert-Level Multidisciplinary Benchmark for Reasoning-Intensive Multimodal Retrieval

This paper aims to overcome the limitations of existing multimodal retrieval benchmarks (general domains, simple semantic matching, single-image or single-modality documents) and to build a realistic multimodal retrieval benchmark that demands expert-level multidisciplinary knowledge and in-depth reasoning.

#Review #Multimodal Retrieval #Benchmark #Reasoning #Multidisciplinary #Expert-Level #Image-Text Interleaving #Contradiction Retrieval

October 13, 2025

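[Paper Review] TalkPlay-Tools: Conversational Music Recommendation with LLM Tool Calling

This paper seeks to overcome the limited recommendation behavior of existing large language model (LLM)-based recommender systems and the limitations of relying on a single retrieval method. It targets a unified retrieval-reranking pipeline that interprets complex user intent and integrates diverse data sources to deliver refined music recommendations.

#Review #Conversational Recommendation #LLM Tool Calling #Music Recommendation #Multimodal Retrieval #Information Retrieval #Retrieval-Reranking #Semantic IDs

October 6, 2025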