#Image Captioning

5 posts

[Paper Review] GDCNet: Generative Discrepancy Comparison Network for Multimodal Sarcasm Detection

This paper aims to overcome the limitations of existing methods in order to detect sarcasm effectively in image-text pairs.

#Review #Multimodal Sarcasm Detection #Large Language Models #Multimodal LLMs #Discrepancy Modeling #Image Captioning #Gated Fusion #Semantic Incongruity

January 28, 2026

[Paper Review] CaptionQA: Is Your Caption as Useful as the Image Itself?

This paper points out that existing MLLM evaluation methods overlook the practical utility of captions, that is, their ability to stand in for the image in downstream tasks.

#Review #Image Captioning #Caption Evaluation #Multimodal LLM #Utility-based Benchmark #Question Answering (QA) #Domain-specific Taxonomy #Hallucination #MLLM Evaluation

November 30, 2025

[Paper Review] CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning

This work seeks to overcome the limitations of existing SFT (Supervised Fine-Tuning) based image captioning models: costly data and limited generalization and diversity.

#Review #Image Captioning #Reinforcement Learning #Verifiable Rewards #LVLMs #VQA #Data Curation #Caption Quality

September 29, 2025

[Paper Review] Patch-as-Decodable-Token: Towards Unified Multi-Modal Vision Tasks in MLLMs

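This paper aims to address the problem that existing MLLMs rely on indirect representations for vision tasks, such as generating coordinates as text, which limits performance and makes dense prediction tasks such as segmentation difficult.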

#Review #Multimodal Large Language Models (MLLMs) #Visual Reference Tokens (VRTs) #Dense Prediction #Referring Expression Comprehension (REC) #Open-Vocabulary Detection (OVD) #Image Captioning #Unified Architecture #Autoregressive Generation

October 9, 2025

[Paper Review] From Denoising to Refining: A Corrective Framework for Vision-Language Diffusion Model

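This paper aims to solve the error cascade problem caused by the train-inference discrepancy in vision-language diffusion models. In particular, it targets the case in parallel decoding where early token errors contaminate the entire generation context, leading to syntactic errors and semantic hallucinations.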

#Review #Discrete Diffusion Models #Vision-Language Models #Error Cascades #Self-Correction #Refinement Framework #Parallel Generation #Image Captioning #Hallucination Mitigation

October 27, 2025