#Multimodal Foundation Models

9 posts

[Paper Review] MedPMC: A Systematic Framework for Scaling High-Fidelity Medical Multimodal Data for Foundation Models

This paper addresses the shortage of large-scale, high-quality medical multimodal data, a key factor limiting the performance of medical AI models.

#Review #Multimodal Foundation Models #Medical Data Curation #PubMed Central #Image-Text Pairs #Vision-Language Models #Clinical Transfer Validation #High-Fidelity Pipeline

July 12, 2026

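[Paper Review] Wan-Streamer v0.1: End-to-end Real-time Interactive Foundation Models

This paper proposes Wan-Streamer to address the fragmentation of real-time audio-video interaction and the latency between modules. Prior work uses a cascaded approach that chains VAD, ASR, LLM, and TTS components, and therefore suffers from waiting time at module boundaries and from error accumulation.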

#Review #End-to-End #Real-time Interaction #Multimodal Foundation Models #Full-duplex #Streaming Inference #Block-causal Attention #Thinker-Performer Pipeline

June 24, 2026

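[Paper Review] GGT-100K: Generative Ground Truth for Generalizable Real-World Image Restoration

Real-world image restoration (IR) models suffer from a chronic bottleneck: a shortage of training data causes their generalization to real environments to drop sharply. Synthetic data fails to model the complex degradation processes of the real world, while actually captured data is limited in cost, scalability, and scene diversity.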

#Review #Image Restoration #Generative Ground Truth #Multimodal Foundation Models #Generalization #Dataset Construction #Quality Control

May 31, 2026

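[Paper Review] A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook

Despite rapid progress in large audio language models (LALMs), the field lacks a unified framework for performance evaluation criteria and general-purpose use. This paper sets out to address that gap.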

#Review #Large Audio Language Models #Audio-Language Pretraining #Multimodal Foundation Models #Audio Reasoning #Model Alignment #Generalization #Trustworthiness

May 20, 2026

[Paper Review] Think in Strokes, Not Pixels: Process-Driven Image Generation via Interleaved Reasoning

Building on BAGEL-7B, a unified multimodal model, this paper constructs a process-driven architecture that autoregressively generates text tokens and visual tokens. The proposed model runs a four-stage loop (Plan → Sketch → Inspect → Refine) in which it evaluates and revises the intermediate visual state produced at each stage.

#Review #Multimodal Foundation Models #Process-Driven Generation #Interleaved Reasoning #Chain-of-Thought #Visual Grounding #Image Generation

April 8, 2026

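[Paper Review] Learning Situated Awareness in the Real World

This paper addresses the problem that existing multimodal foundation model (MFM) benchmarks focus only on environment-centric spatial relations and overlook observer-centric situated awareness, which depends on the agent's viewpoint, pose, and motion.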

#Review #Situated Awareness #Egocentric Vision #Spatial Reasoning #Multimodal Foundation Models #Video Understanding #Benchmark #Real-world Data

February 18, 2026

[Paper Review] MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs

This paper proposes MedXIAOHE, a medical vision-language foundation model for advancing general-purpose medical understanding and reasoning in real-world clinical applications.

#Review #Medical LLMs #Multimodal Foundation Models #Continual Pre-training #Entity-Aware Learning #Reinforcement Learning #Medical Diagnosis #Instruction Following #Unified Benchmarking

February 15, 2026

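[Paper Review] OmniFusion: Simultaneous Multilingual Multimodal Translations via Modular Fusion

This paper addresses the latency of text-only translation LLMs and their inability to use multimodal context, as well as the limited multilingual translation performance and coverage of multimodal foundation models (MMFMs).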

#Review #Multimodal Translation #Speech Translation #Simultaneous Translation #Large Language Models #Multimodal Foundation Models #Modular Fusion #End-to-End #Gated Fusion #OCR

December 1, 2025

[Paper Review] Scaling Spatial Intelligence with Multimodal Foundation Models

This study aims to address the lack of spatial intelligence in recent multimodal foundation models (MLLMs) and to explore, through the SenseNova-SI family of models, how large-scale data scaling can effectively cultivate spatial intelligence.

#Review #Spatial Intelligence #Multimodal Foundation Models #Data Scaling #Perspective-taking #Visual Question Answering #Emergent Capabilities #Embodied AI #Benchmark Evaluation

November 20, 2025