#Multi-modal Learning

12개의 포스트

[논문리뷰] M2Retinexformer: Multi-Modal Retinexformer for Low-Light Image Enhancement

본 논문은 기존의 Retinex 기반 딥러닝 기법들이 RGB 정보에만 의존하여 장면의 기하학적 구조나 조명 분포를 효과적으로 해석하지 못한다는 한계를 해결하고자 합니다.

#Review #Low-light Image Enhancement #Retinex Theory #Multi-modal Learning #Transformer #Cross-attention #Depth Estimation #Semantic Features

2026년 5월 13일

[논문리뷰] MetricAnything: Scaling Metric Depth Pretraining with Noisy Heterogeneous Sources

이 논문은 이질적인 센서 노이즈, 카메라 의존적 편향, 그리고 노이즈가 많은 교차 소스 3D 데이터의 모호성으로 인해 확장이 어려웠던 Metric Depth Estimation 의 문제를 해결하고자 합니다.

#Review #Metric Depth Estimation #Pretraining #Foundation Models #Sparse Prompts #Heterogeneous Data #Zero-Shot Learning #Multi-modal Learning

2026년 1월 29일

[논문리뷰] Beyond Memorization: A Multi-Modal Ordinal Regression Benchmark to Expose Popularity Bias in Vision-Language Models

본 논문은 최신 Vision-Language Models (VLMs)에 내재된 인기도 편향(popularity bias)을 탐구하고 노출하는 것을 목표로 합니다.

#Review #Vision-Language Models (VLMs)#Popularity Bias #Ordinal Regression #Building Age Estimation #Multi-modal Learning #Benchmark Dataset #Explainable AI

2025년 12월 24일

[논문리뷰] An Anatomy of Vision-Language-Action Models: From Modules to Milestones and Challenges

본 논문은 급변하는 Vision-Language-Action (VLA) 모델 분야에 대한 명확하고 구조화된 가이드를 제공하는 것을 목표로 합니다.

#Review #Vision-Language-Action Models #Embodied Intelligence #Robotics #Foundation Models #Multi-modal Learning #Reinforcement Learning #Sim-to-Real Transfer #Human-Robot Interaction

2025년 12월 21일

[논문리뷰] UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation

기존 비디오 생성 모델들이 단일 모달리티 조건화 및 제한된 모달 다양성으로 인해 세계를 총체적으로 이해하는 데 한계 가 있음을 지적하며, 이를 극복하기 위해 다중 모달리티(세분화 마스크, 인간 골격, DensePose, 광학 흐름, 깊이 맵) 및 다중 훈련 패러다임 을 통합하여 세계 인식 비디오 생성 을 향상시키는 것을 목표로 합니다.

#Review #Video Generation #Multi-modal Learning #Multi-task Learning #Zero-shot Generalization #Diffusion Models #World Models #Video Understanding

2025년 12월 8일

[논문리뷰] RynnVLA-002: A Unified Vision-Language-Action and World Model

본 논문은 기존 VLA 모델(액션 다이내믹스 이해 부족, 상상력 및 물리 지식 결여)과 월드 모델(직접적인 액션 생성 불가)의 한계를 극복하기 위해, VLA 모델과 월드 모델을 단일 프레임워크로 통합 하는 것을 목표로 합니다.

#Review #Vision-Language-Action (VLA) Model #World Model #Robotics #Unified Framework #Multi-modal Learning #Action Generation #Attention Mask #Continuous Control

2025년 11월 23일

[논문리뷰] Large Language Models Meet Extreme Multi-label Classification: Scaling and Multi-modal Framework

본 연구는 Extreme Multi-label Classification (XMC)에서 Large Language Models (LLMs) 의 잠재력을 효과적으로 활용하고, 시각적 정보 를 효율적으로 통합하여 성능을 향상하는 것을 목표로 합니다.

#Review #Extreme Multi-label Classification (XMC)#Large Language Models (LLMs)#Multi-modal Learning #Dual-decoder Learning #Vision Transformers #Contrastive Learning #Prompt Engineering

2025년 11월 18일

[논문리뷰] SciReasoner: Laying the Scientific Reasoning Ground Across Disciplines

이 논문은 이질적인 과학적 표현과 자연어를 통합하여 다양한 과학 분야에 걸친 복잡한 과학적 추론을 수행하는 최초의 과학 추론 대규모 언어 모델(LLM) 인 SciReasoner 를 제안합니다.

#Review #Scientific Reasoning #Foundation Models #Multi-modal Learning #Cross-domain Generalization #Chain-of-Thought #Reinforcement Learning #Scientific Discovery #Molecular Design

2025년 9월 26일

[논문리뷰] PANORAMA: The Rise of Omnidirectional Vision in the Embodied AI Era

본 논문은 기존 핀홀(pinhole) 비전에 비해 연구가 뒤처진 옴니디렉셔널(omnidirectional) 비전의 잠재력을 발현하고, 데이터 병목 현상, 모델 역량 한계, 애플리케이션 공백과 같은 주요 문제를 해결하여 신체화된 AI(Embodied AI) 시대에 포괄적인 환경 인식을 달성하는 것을 목표로 합니다.

#Review #Omnidirectional Vision #Embodied AI #Panoramic Perception #Multi-modal Learning #Dataset Development #Robot Navigation #Spatial Reasoning #System Architecture

2025년 9월 18일

[논문리뷰] Drawing2CAD: Sequence-to-Sequence Learning for CAD Generation from Vector Drawings

본 연구는 2D 벡터 엔지니어링 도면(SVG 형식)으로부터 파라메트릭 CAD 모델을 자동으로 생성 하는 문제를 해결하는 것을 목표로 합니다.

#Review #CAD Generation #Vector Graphics #Sequence-to-Sequence Learning #Transformer Architecture #Engineering Drawings #Multi-modal Learning #Soft Target Loss #Dual Decoder

2025년 9월 5일

[논문리뷰] Collaborative Multi-Modal Coding for High-Quality 3D Generation

본 논문은 기존 3D 생성 모델들이 단일 모달리티(예: RGB 이미지)에 의존하여 훈련 데이터의 범위가 제한되고 멀티모달 데이터의 상호 보완적 이점을 간과하는 문제를 해결하고자 합니다.

#Review #3D Generation #Multi-modal Learning #Diffusion Models #Triplane Representation #Collaborative Coding #Image-to-3D #Latent Space

2025년 8월 29일

[논문리뷰] A Coarse-to-Fine Approach to Multi-Modality 3D Occupancy Grounding

논문은 기존 바운딩 박스 기반 시각 그라운딩의 한계를 극복하고, 자율주행 환경에서 자연어 설명을 기반으로 객체의 정확한 3D 점유(occupancy) 정보 를 파악하는 것을 목표로 합니다.

#Review #3D Occupancy Grounding #Multi-modal Learning #Natural Language Understanding #Autonomous Driving #Voxel-based Prediction #Benchmark Dataset #Coarse-to-Fine

2025년 8월 7일