#Hallucination

18개의 포스트

[논문리뷰] Online Self-Calibration Against Hallucination in Vision-Language Models

본 논문은 기존의 offline 선호도 정렬 방식이 LVLM의 hallucination 문제를 해결하는 데 오히려 역효과를 낼 수 있다는 Supervision-Perception Mismatch 문제를 제기한다.

#Review #Vision-Language Models #Hallucination #Monte Carlo Tree Search #Preference Alignment #DPO #Generative-Discriminative Gap #Online Learning

2026년 5월 3일

[논문리뷰] Trust but Verify: Introducing DAVinCI -- A Framework for Dual Attribution and Verification in Claim Inference for Language Models

본 논문은 LLM의 유창함 이면에 존재하는 사실적 부정확성 및 환각(Hallucination) 문제를 해결하기 위해 DAVinCI 프레임워크를 제안한다.

#Review #Attribution #Verification #Dual Framework #Hallucination #Confidence Calibration #Natural Language Inference

2026년 4월 23일

[논문리뷰] Chain-of-Thought Degrades Visual Spatial Reasoning Capabilities of Multimodal LLMs

저자들은 다양한 규모의 MRM 및 MLM 백본을 대상으로 CoT와 Non-CoT 프롬프트를 비교 평가하는 방법론을 수행하였습니다. 실험 결과, 17개 중 대다수의 모델에서 CoT 프롬프트를 사용했을 때 시각적 공간 추론 정확도가 평균적으로 하락하는 경향이 관찰되었습니다 .

#Review #Multimodal Reasoning Models #Chain-of-Thought #Visual Spatial Reasoning #Shortcut Learning #Hallucination #No-Image Ablation

2026년 4월 21일

[논문리뷰] TRUST-SQL: Tool-Integrated Multi-Turn Reinforcement Learning for Text-to-SQL over Unknown Schemas

기존 Text-to-SQL 파싱 방법론들은 Full Schema Assumption 하에서 Large Language Models (LLMs) 의 발전과 함께 remarkable progress를 이루었습니다.

#Review #Text-to-SQL #Unknown Schema #Multi-Turn Reinforcement Learning #Tool Integration #POMDP #Dual-Track GRPO #Schema Grounding #Hallucination

2026년 3월 17일

[논문리뷰] Thinking to Recall: How Reasoning Unlocks Parametric Knowledge in LLMs

본 논문은 복잡한 논리적 분해가 필요 없는 단순한 단일 홉 사실 질문에서 LLM의 추론이 어떻게 파라메트릭 지식 회상에 영향을 미치는지 밝히는 것을 목표로 합니다. 추론이 직관과 달리 모델의 지식 경계를 확장하는 메커니즘을 이해하고, 이를 통해 모델 정확도를 개선할 수 있는 실용적인 전략을 제시하고자 합니다.

#Review #LLMs #Reasoning #Parametric Knowledge #Factual Recall #Hallucination #Computational Buffer #Factual Priming #Chain-of-Thought

2026년 3월 10일

[논문리뷰] CAR-bench: Evaluating the Consistency and Limit-Awareness of LLM Agents under Real-World Uncertainty

기존 LLM 에이전트 벤치마크가 이상적인 설정에서의 태스크 완료에만 초점을 맞추고 실제 환경에서의 신뢰성, 일관성, 한계 인식 을 간과하는 문제를 해결하고자 합니다.

#Review #LLM Agents #Benchmarks #Tool-use #Consistency #Uncertainty Handling #Hallucination #In-car Assistant #Policy Adherence

2026년 2월 5일

[논문리뷰] CaptionQA: Is Your Caption as Useful as the Image Itself?

본 논문은 기존 MLLM 평가 방식이 캡션의 실제 활용성, 즉 다운스트림 태스크에서 이미지를 대체할 수 있는 능력 을 간과한다고 지적합니다.

#Review #Image Captioning #Caption Evaluation #Multimodal LLM #Utility-based Benchmark #Question Answering (QA)#Domain-specific Taxonomy #Hallucination #MLLM Evaluation

2025년 11월 30일

[논문리뷰] MR-Align: Meta-Reasoning Informed Factuality Alignment for Large Reasoning Models

본 연구는 Large Reasoning Models (LRMs)에서 발생하는 '추론-답변 불일치(reasoning-answer hit gap)' 문제를 해결하는 것을 목표로 합니다. 이는 모델이 추론 과정에서 올바른 사실을 식별함에도 불구하고 최종 답변에 이를 통합하지 못하여 사실적 정확도가 저하되는 현상을 말합니다.

#Review #Large Reasoning Models #Factuality Alignment #Meta-Reasoning #Kahneman-Tversky Optimization #Chain-of-Thought #Hallucination #Process-Level Alignment

2025년 11월 9일

[논문리뷰] FlagEval Findings Report: A Preliminary Evaluation of Large Reasoning Models on Automatically Verifiable Textual and Visual Questions

본 논문은 최신 대규모 추론 모델(LRMs) 을 자동으로 검증 가능한 텍스트 및 시각 질문 에 대해 오염 없는(contamination-free) 방식으로 평가하는 예비 보고서입니다.

#Review #Large Reasoning Models #LLM Evaluation #Multimodal AI #Reasoning Behaviors #Hallucination #Contamination-Free #AI Safety #Instruction Following

2025년 9월 23일

[논문리뷰] Improving Context Fidelity via Native Retrieval-Augmented Reasoning

논문은 대규모 언어 모델(LLMs)이 제공된 컨텍스트에 대한 충실도(context fidelity)를 유지하지 못하고, 질문에 대한 답변 생성 시 일관성 없는 결과를 내거나 환각(hallucination)을 일으키는 문제를 해결하고자 합니다.

#Review #Context Fidelity #Retrieval-Augmented Generation (RAG)#Large Language Models (LLMs)#Reinforcement Learning (RL)#Supervised Fine-Tuning (SFT)#Hallucination #Question Answering #In-context Retrieval #Curriculum Learning

2025년 9월 18일

[논문리뷰] Measuring Epistemic Humility in Multimodal Large Language Models

본 논문은 멀티모달 대규모 언어 모델(MLLM)의 환각(hallucination) 문제를 해결하고, 특히 모델이 불확실한 상황에서 잘못된 정보를 확신하지 않고 '모르는 것을 모른다고 인정하는' 능력 , 즉 인식론적 겸손(epistemic humility) 을 측정하는 새로운 벤치마크를 제시하는 것을 목표로 합니다.

#Review #Multimodal Large Language Models #Hallucination #Epistemic Humility #Benchmark #False-Option Rejection #Visual Question Answering #Scene Graph

2025년 9월 16일

[논문리뷰] Why Language Models Hallucinate

본 논문은 대규모 언어 모델(LLM)이 '환각' 현상, 즉 그럴듯하지만 틀린 정보를 자신감 있게 생성하는 이유를 통계적으로 분석하고, 이러한 문제가 최신 모델에서도 지속되는 근본적인 원인을 밝히는 것을 목표로 합니다.

#Review #Language Models #Hallucination #Pretraining #Post-training #Evaluation Metrics #Binary Classification #Uncertainty Quantification #Calibration

2025년 9월 8일

[논문리뷰] ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding

Video MLLM(Multimodal Large Language Models)이 긴 비디오에서 보이는 Semantic Aggregation Hallucination (SAH) 문제를 해결하는 데 목표를 둡니다.

#Review #Long Video Understanding #Hallucination #Semantic Aggregation #Video MLLM #Benchmark #DPO #Positional Encoding #VideoQA

2025년 9월 3일

[논문리뷰] ReportBench: Evaluating Deep Research Agents via Academic Survey Tasks

본 논문은 대규모 언어 모델(LLM) 기반의 심층 연구(Deep Research) 에이전트가 생성하는 연구 보고서의 내용 품질을 체계적으로 평가하기 위한 벤치마크인 ReportBench 를 제안합니다.

#Review #Deep Research Agents #LLM Evaluation #Academic Survey #Factual Accuracy #Citation Verification #Report Generation #Benchmark #Hallucination

2025년 8월 27일

[논문리뷰] SpotEdit: Evaluating Visually-Guided Image Editing Methods

이 논문은 기존 벤치마크의 단순성과 실제 편집 과제에 대한 낮은 대표성이라는 한계를 극복하기 위해, 시각적으로 안내되는 이미지 편집(Visually-Guided Image Editing) 모델을 체계적이고 세밀하게 평가하기 위한 포괄적인 벤치마크인 SpotEdit 을 소개합니다.

#Review #Visually-Guided Image Editing #Multimodal Models #Benchmark #Hallucination #Diffusion Models #Autoregressive Models #Evaluation Metrics

2025년 8월 26일

[논문리뷰] Beyond Solving Math Quiz: Evaluating the Ability of Large Reasoning Models to Ask for Information

본 논문은 기존 수학 벤치마크가 잘 정의된 문제 해결 능력에만 초점을 맞추는 한계를 지적하며, Large Reasoning Models (LRMs) 이 정보가 불충분한 문제에 직면했을 때 능동적으로 정보를 요청하는 능력 을 평가하는 것을 목표로 합니다.

#Review #Large Reasoning Models (LRMs)#Information Seeking #Incomplete Problems #Mathematical Reasoning #Supervised Fine-tuning (SFT)#Overthinking #Hallucination #CRITIC-math

2025년 8월 19일

[논문리뷰] CRAG-MM: Multi-modal Multi-turn Comprehensive RAG Benchmark

본 논문은 웨어러블 AI 시나리오를 위한 Multi-Modal Retrieval-Augmented Generation (MM-RAG) 시스템의 포괄적인 평가를 위한 벤치마크가 부족하다는 문제를 해결합니다.

#Review #Multi-modal RAG #Benchmark #Wearable AI #Multi-turn Conversation #Egocentric Images #Knowledge Graph #Web Search #Hallucination

2025년 10월 31일

[논문리뷰] FinTrust: A Comprehensive Benchmark of Trustworthiness Evaluation in Finance Domain

본 논문은 금융 도메인에서 대규모 언어 모델(LLM)의 신뢰성을 종합적으로 평가하기 위한 FINTRUST 벤치마크를 제시합니다.

#Review #LLM Trustworthiness #Finance Domain #Benchmark #Alignment Evaluation #Financial AI #Hallucination #Privacy #Fairness

2025년 10월 20일