[Paper Review] THINKSAFE: Self-Generated Safety Alignment for Reasoning Models (February 1, 2026)
A detailed review of the paper 'THINKSAFE: Self-Generated Safety Alignment for Reasoning Models', posted on arXiv by Minki Kang.
#Review #Large Reasoning Models #Safety Alignment #Self-Distillation #Refusal Steering #Distributional Shift #Chain-of-Thought #Reinforcement Learning

[Paper Review] GateBreaker: Gate-Guided Attacks on Mixture-of-Expert LLMs (December 30, 2025)
A detailed review of the paper 'GateBreaker: Gate-Guided Attacks on Mixture-of-Expert LLMs', posted on arXiv.
#Review #MoE LLM #Safety Alignment #Adversarial Attack #Neuron Pruning #Gate-level Profiling #Transfer Attack #Vision Language Model

[Paper Review] OmniSafeBench-MM: A Unified Benchmark and Toolbox for Multimodal Jailbreak Attack-Defense Evaluation (December 8, 2025)
A detailed review of the paper 'OmniSafeBench-MM: A Unified Benchmark and Toolbox for Multimodal Jailbreak Attack-Defense Evaluation', posted on arXiv by Simeng Qin.
#Review #Multimodal LLMs #Jailbreak Attack #Attack-Defense Evaluation #Benchmark #Safety Alignment #Vulnerability Analysis #Risk Taxonomy #Evaluation Metrics

[Paper Review] Too Good to be Bad: On the Failure of LLMs to Role-Play Villains (November 9, 2025)
A detailed review of the paper 'Too Good to be Bad: On the Failure of LLMs to Role-Play Villains', posted on arXiv.
#Review #LLM #Role-playing #Safety Alignment #Villain #Persona Simulation #Moral Alignment #Benchmark #Character Fidelity

[Paper Review] Investigating Safety Vulnerabilities of Large Audio-Language Models Under Speaker Emotional Variations (October 24, 2025)
A detailed review of the paper 'Investigating Safety Vulnerabilities of Large Audio-Language Models Under Speaker Emotional Variations', posted on arXiv.
#Review #LALM Safety #Speaker Emotion #Safety Alignment #Jailbreaking #Audio-Language Models #Emotional Variation #Unsafe Rate #Non-refusal Rate

[Paper Review] The Alignment Waltz: Jointly Training Agents to Collaborate for Safety (October 10, 2025)
A detailed review of the paper 'The Alignment Waltz: Jointly Training Agents to Collaborate for Safety', posted on arXiv.
#Review #LLM Safety #Multi-agent Reinforcement Learning #Safety Alignment #Overrefusal #Adversarial Attacks #Feedback Agent #Conversation Agent #Dynamic Improvement Reward

[Paper Review] Refusal Falls off a Cliff: How Safety Alignment Fails in Reasoning? (October 8, 2025)
A detailed review of the paper 'Refusal Falls off a Cliff: How Safety Alignment Fails in Reasoning?', posted on arXiv.
#Review #Safety Alignment #Large Reasoning Models #Mechanistic Interpretability #Refusal Cliff #Attention Heads #Data Selection #Linear Probing

[Paper Review] Imperceptible Jailbreaking against Large Language Models (October 7, 2025)
A detailed review of the paper 'Imperceptible Jailbreaking against Large Language Models', posted on arXiv.
#Review #Large Language Models #Jailbreaking #Imperceptible Attacks #Unicode Variation Selectors #Adversarial Suffixes #Safety Alignment #Prompt Injection

[Paper Review] Persuasion Dynamics in LLMs: Investigating Robustness and Adaptability in Knowledge and Safety with DuET-PD (August 29, 2025)
A detailed review of the paper 'Persuasion Dynamics in LLMs: Investigating Robustness and Adaptability in Knowledge and Safety with DuET-PD', posted on arXiv by Roy Ka-Wei Lee.
#Review #Persuasion Dynamics #Large Language Models (LLMs) #Robustness #Gullibility #Receptiveness #Direct Preference Optimization (DPO) #Safety Alignment #Multi-turn Dialogue

[Paper Review] BiasGym: Fantastic Biases and How to Find (and Remove) Them (August 13, 2025)
A detailed review of the paper 'BiasGym: Fantastic Biases and How to Find (and Remove) Them', posted on arXiv by Arnav Arora.
#Review #Bias Mitigation #LLMs #Mechanistic Interpretability #Fine-tuning #Attention Steering #Stereotype Analysis #Safety Alignment

[Paper Review] Personalized Safety Alignment for Text-to-Image Diffusion Models (August 5, 2025)
A detailed review of the paper 'Personalized Safety Alignment for Text-to-Image Diffusion Models', posted on arXiv by Kaidong Yu.
#Review #Personalized Safety Alignment #Text-to-Image Diffusion Models #DPO #User Preferences #Content Moderation #Generative AI #Cross-Attention #Safety Alignment