#DPO

7 posts

[Paper Review] Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment

This study shows that the theoretical equivalence between DPO and RLHF does not hold in every case; it is a conditional equivalence that depends on specific assumptions.

#Review #DPO #RLHF #Constrained Preference Optimization #Bradley-Terry Model #Alignment #Soft Margin Ranking #Absolute Advantage

May 20, 2026

[Paper Review] Online Self-Calibration Against Hallucination in Vision-Language Models

This paper raises the Supervision-Perception Mismatch problem: existing offline preference alignment methods can actually backfire when used to address hallucination in LVLMs.

#Review #Vision-Language Models #Hallucination #Monte Carlo Tree Search #Preference Alignment #DPO #Generative-Discriminative Gap #Online Learning

May 3, 2026

[Paper Review] OmniGAIA: Towards Native Omni-Modal AI Agents

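This study aims to move beyond the limits of current multimodal LLMs, which are confined to bimodal interaction, by developing and evaluating native omni-modal AI agents that, like human intelligence, perceive and reason in an integrated way across video, audio, and image modalities and use external tools.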

#Review #Omni-modal AI #Multi-modal Agents #Tool-Integrated Reasoning #Benchmark #Event Graph #Active Perception #Trajectory Synthesis #DPO

February 26, 2026

[Paper Review] TreeCUA: Efficiently Scaling GUI Automation with Tree-Structured Verifiable Evolution

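This study aims to solve the scalability problem of GUI planning, a core challenge in GUI automation. It presents a methodology for efficiently synthesizing high-quality, large-scale GUI trajectory data, overcoming the step redundancy and low trajectory diversity of existing approaches as well as the data scarcity caused by reliance on human annotation.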

#Review #GUI Automation #Computer-Use Agents #Trajectory Synthesis #Tree-Structured Exploration #Multi-Agent Framework #Reinforcement Learning #DPO #Data Efficiency

February 10, 2026

[Paper Review] ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding

The paper aims to address Semantic Aggregation Hallucination (SAH), a problem that Video MLLMs (Multimodal Large Language Models) exhibit on long videos.

#Review #Long Video Understanding #Hallucination #Semantic Aggregation #Video MLLM #Benchmark #DPO #Positional Encoding #VideoQA

September 3, 2025

[Paper Review] Personalized Safety Alignment for Text-to-Image Diffusion Models

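The paper seeks to address a limitation of the safety mechanisms in current text-to-image (T2I) diffusion models: they apply a uniform standard without accounting for users' differing preferences, such as age, mental health, and personal beliefs.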

#Review #Personalized Safety Alignment #Text-to-Image Diffusion Models #DPO #User Preferences #Content Moderation #Generative AI #Cross-Attention #Safety Alignment

August 5, 2025

[Paper Review] DRIFT: Learning from Abundant User Dissatisfaction in Real-World Preference Learning

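The goal is to develop a scalable and efficient preference learning methodology that improves model performance in large language model (LLM) deployment settings by effectively exploiting abundant implicit user dissatisfaction (DSAT) signals instead of scarce explicit satisfaction (SAT) feedback.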

#Review #Preference Learning #LLMs #User Feedback #Dissatisfaction Signals #DPO #Iterative Training #RLHF #Exploration

October 8, 2025