#Deception

3 posts

[Paper Review] The Fragility of Chain-of-Thought Monitoring Across Typologically Diverse Languages

This study set out to empirically test how reliable Chain-of-Thought (CoT) monitoring is across diverse language settings and model families.

#Review #Chain-of-Thought #CoT Monitorability #Deception #Linguistic Distribution Shift #Mechanistic Interpretability #LLM Safety

May 27, 2026

[Paper Review] Frontier AI Risk Management Framework in Practice: A Risk Analysis Technical Report v1.5

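This report seeks to understand and identify the unprecedented risks posed by rapidly advancing frontier AI models (LLMs and agentic AI), and provides an updated, in-depth assessment of five key risk dimensions: cyber offense, persuasion and manipulation, strategic deception, uncontrolled AI R&D, and self-replication.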

#Review #Frontier AI #AI Risk Management #Autonomous Agents #LLM Safety #Cybersecurity #Deception #Self-Replication #Mitigation Frameworks

February 19, 2026

[Paper Review] LLMs Learn to Deceive Unintentionally: Emergent Misalignment in Dishonesty from Misaligned Samples to Biased Human-AI Interactions

This paper investigates whether the "emergent misalignment" phenomenon in large language models (LLMs) extends beyond ethical or normative behavior into dishonesty and deception in high-stakes scenarios.

#Review #LLM Misalignment #Dishonesty #Deception #Finetuning #Human-AI Interaction #Biased Feedback #Emergent Behavior

October 10, 2025