#Safety Alignment

11 posts

[Paper Review] THINKSAFE: Self-Generated Safety Alignment for Reasoning Models

A detailed review of the paper 'THINKSAFE: Self-Generated Safety Alignment for Reasoning Models', posted on arXiv by Minki Kang.

February 2, 2026

[Paper Review] GateBreaker: Gate-Guided Attacks on Mixture-of-Expert LLMs

A detailed review of the paper 'GateBreaker: Gate-Guided Attacks on Mixture-of-Expert LLMs', posted on arXiv.

December 31, 2025

[Paper Review] OmniSafeBench-MM: A Unified Benchmark and Toolbox for Multimodal Jailbreak Attack-Defense Evaluation

A detailed review of the paper 'OmniSafeBench-MM: A Unified Benchmark and Toolbox for Multimodal Jailbreak Attack-Defense Evaluation', posted on arXiv by Simeng Qin.

December 9, 2025

[Paper Review] Too Good to be Bad: On the Failure of LLMs to Role-Play Villains

A detailed review of the paper 'Too Good to be Bad: On the Failure of LLMs to Role-Play Villains', posted on arXiv.

November 10, 2025

[Paper Review] Investigating Safety Vulnerabilities of Large Audio-Language Models Under Speaker Emotional Variations

A detailed review of the paper 'Investigating Safety Vulnerabilities of Large Audio-Language Models Under Speaker Emotional Variations', posted on arXiv.

October 24, 2025

[Paper Review] The Alignment Waltz: Jointly Training Agents to Collaborate for Safety

A detailed review of the paper 'The Alignment Waltz: Jointly Training Agents to Collaborate for Safety', posted on arXiv.

October 10, 2025

[Paper Review] Refusal Falls off a Cliff: How Safety Alignment Fails in Reasoning?

A detailed review of the paper 'Refusal Falls off a Cliff: How Safety Alignment Fails in Reasoning?', posted on arXiv.

October 8, 2025

[Paper Review] Imperceptible Jailbreaking against Large Language Models

A detailed review of the paper 'Imperceptible Jailbreaking against Large Language Models', posted on arXiv.

October 7, 2025

[Paper Review] Persuasion Dynamics in LLMs: Investigating Robustness and Adaptability in Knowledge and Safety with DuET-PD

A detailed review of the paper 'Persuasion Dynamics in LLMs: Investigating Robustness and Adaptability in Knowledge and Safety with DuET-PD', posted on arXiv by Roy Ka-Wei Lee.

August 29, 2025

[Paper Review] BiasGym: Fantastic Biases and How to Find (and Remove) Them

A detailed review of the paper 'BiasGym: Fantastic Biases and How to Find (and Remove) Them', posted on arXiv by Arnav Arora.

August 13, 2025

[Paper Review] Personalized Safety Alignment for Text-to-Image Diffusion Models

A detailed review of the paper 'Personalized Safety Alignment for Text-to-Image Diffusion Models', posted on arXiv by Kaidong Yu.

August 5, 2025