#Cross-Modal Attention

4개의 포스트

[논문리뷰] UniX: Unifying Autoregression and Diffusion for Chest X-Ray Understanding and Generation

의료 영상 이해(semantic abstraction)와 생성(pixel-level reconstruction)이라는 근본적으로 상충하는 목표를 기존 파라미터 공유 방식의 단일 모델에서 통합할 때 발생하는 성능 저하 문제를 해결하고자 합니다.

#Review #Chest X-Ray #Medical Foundation Model #Autoregressive Model #Diffusion Model #Multimodal Learning #Image Understanding #Image Generation #Cross-Modal Attention

2026년 1월 20일

[논문리뷰] LTX-2: Efficient Joint Audio-Visual Foundation Model

기존 텍스트-투-비디오(T2V) 모델이 오디오 정보 없이 '침묵하는' 영상을 생성하는 한계를 해결하고자 합니다. 이 연구는 고품질의 시간적으로 동기화된 오디오-비주얼 콘텐츠를 텍스트 프롬프트로부터 생성하는 오픈 소스 통합 파운데이션 모델(T2AV) 인 LTX-2 를 개발하는 것을 목표로 합니다.

#Review #Multimodal AI #Text-to-Audio-Video #Diffusion Transformer #Cross-Modal Attention #Classifier-Free Guidance #Efficient Inference #Foundation Model

2026년 1월 6일

[논문리뷰] Architecture Decoupling Is Not All You Need For Unified Multimodal Model

본 논문은 통합 멀티모달 모델(UMM)에서 시각 생성 및 이해 태스크 간의 내재된 충돌을 완화하면서도 모델 아키텍처 디커플링에 과도하게 의존하지 않고 성능을 향상시키는 것을 목표로 합니다. 과도한 디커플링이 통합 모델의 상호작용적 추론 능력과 지식 전이 능력을 저해하는 문제를 해결하고자 합니다.

#Review #Unified Multimodal Models #Architecture Decoupling #Cross-Modal Attention #Attention Interaction Alignment (AIA) Loss #Task Conflicts #Image Generation #Image Understanding

2025년 11월 30일

[논문리뷰] D-HUMOR: Dark Humor Understanding via Multimodal Open-ended Reasoning

온라인 밈(meme)에서 암묵적이고 문화적으로 민감한 다크 유머를 이해하고 탐지하는 문제를 해결하는 것을 목표로 합니다. 기존 자원 및 방법론의 부족을 다루기 위해 다중모드 콘텐츠에서 다크 유머의 존재, 타겟 범주 및 강도를 식별하는 포괄적인 프레임워크를 제시합니다.

#Review #Dark Humor Detection #Multimodal Reasoning #Vision-Language Models (VLMs)#Iterative Reasoning Refinement #Meme Analysis #Content Moderation #Cross-Modal Attention #Dataset Annotation

2025년 9월 9일