#Cross-modal Alignment

8 posts

[Paper Review] PlatonicNav: Unveiling Semantic Correspondence in Navigation with Platonic Topological Maps

This paper argues that existing Embodied Navigation research treats Vision-Language Navigation (VLN) and Object Goal Navigation (ObjNav) as separate problems, and relies on heavy cross-modal training or large-scale VLMs to connect them.

#Review #Embodied Navigation #Platonic Representation Hypothesis #Topological Map #Blind Matching #Zero-shot Navigation #Cross-modal Alignment

June 2, 2026

[Paper Review] LoMo: Local Modality Substitution for Deeper Vision-Language Fusion

This paper addresses carrier sensitivity: the performance drop that recent VLMs show when a text question is replaced with a rendered image of the same question.

#Review #Vision-Language Models #Modality Gap #Carrier Sensitivity #Local Modality Substitution #Supervised Fine-Tuning #Cross-modal Alignment

May 28, 2026

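[Paper Review] LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning

This paper addresses a limitation of existing explicit text CoT-based MLLMs: compressing high-dimensional audio-visual information through the narrow bottleneck of text causes them to miss fine-grained temporal alignment and semantic connections across modalities.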

#Review #Multimodal Large Language Models #Audio-Visual Reasoning #Latent Reasoning #Cross-modal Alignment #Chain-of-Thought #Instruction Tuning

May 21, 2026

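[Paper Review] T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation

Evaluation of text-to-audio-video (T2AV) generation models is fragmented, relies on unimodal metrics, and fails to capture cross-modal alignment, instruction following, and perceptual realism under complex prompts. This work aims to present a unified benchmark for comprehensive evaluation of T2AV systems.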

#Review #Text-to-Audio-Video Generation #Multimodal Evaluation #Benchmark #MLLM-as-a-Judge #Cross-modal Alignment #Instruction Following #Perceptual Realism #Audio Realism

December 24, 2025

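[Paper Review] Multimodal Reasoning for Science: Technical Report and 1st Place Solution to the ICML 2025 SeePhys Challenge

This paper aims to overcome the limitations of multimodal reasoning, a fundamental challenge in artificial intelligence. In particular, it seeks to bridge the gap between the visual and textual modalities and achieve robust reasoning performance in scientific multimodal scenarios, where even state-of-the-art models such as GPT-o3 struggle to integrate visual information.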

#Review #Multimodal Reasoning #Science AI #Caption-assisted Reasoning #SeePhys Challenge #Large Language Models #Visual Question Answering #Physics Problems #Cross-modal Alignment

September 17, 2025

[Paper Review] Scaling Language-Centric Omnimodal Representation Learning

This paper investigates the underlying reasons why MLLM (Multimodal Large Language Model)-based embedding models outperform traditional CLIP-style models.

#Review #Multimodal Embeddings #MLLMs #Contrastive Learning #Cross-modal Alignment #Generative Pretraining #Representation Learning #Scaling Laws

October 15, 2025

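[Paper Review] Multimodal Prompt Optimization: Why Not Leverage Multiple Modalities for MLLMs

This paper addresses the limitation that existing prompt optimization methods are confined to the text modality and therefore fail to fully exploit the potential of Multimodal Large Language Models (MLLMs).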

#Review #Multimodal AI #Prompt Optimization #MLLMs #Bayesian Optimization #Cross-modal Alignment #Prompt Engineering #Generative AI #Exploration-Exploitation

October 13, 2025

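[Paper Review] Discrete Diffusion Models with MLLMs for Unified Medical Multimodal Generation

This paper aims to resolve the modality-specific fragmentation of existing medical AI models and to develop a general-purpose medical AI agent capable of unified generation across medical images (radiology, pathology) and clinical reports.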

#Review #Discrete Diffusion Models #Multimodal Large Language Models (MLLMs) #Medical Image Generation #Medical Report Generation #Multimodal Generation #Medical AI #Cross-modal Alignment

October 8, 2025