[논문리뷰] MUSE: A Run-Centric Platform for Multimodal Unified Safety Evaluation of Large Language ModelsYiran Chen이 arXiv에 게시한 'MUSE: A Run-Centric Platform for Multimodal Unified Safety Evaluation of Large Language Models' 논문에 대한 자세한 리뷰입니다.#Review#Multimodal LLMs#Safety Evaluation#Red Teaming#Adversarial Attacks#Modality Switching#LLM Alignment#Compliance#ASR2026년 3월 4일댓글 수 로딩 중
[논문리뷰] Exposing the Systematic Vulnerability of Open-Weight Models to Prefill AttacksarXiv에 게시된 'Exposing the Systematic Vulnerability of Open-Weight Models to Prefill Attacks' 논문에 대한 자세한 리뷰입니다.#Review#Large Language Models#Prefill Attacks#AI Safety#Red Teaming#Vulnerability#Open-Weight Models#Jailbreaking#Generative AI2026년 2월 16일댓글 수 로딩 중
[논문리뷰] SEMA: Simple yet Effective Learning for Multi-Turn Jailbreak AttacksarXiv에 게시된 'SEMA: Simple yet Effective Learning for Multi-Turn Jailbreak Attacks' 논문에 대한 자세한 리뷰입니다.#Review#Multi-Turn Jailbreaks#LLM Safety#Red Teaming#Reinforcement Learning#Intent Drift#Response-Agnostic Generation#Self-Tuning2026년 2월 8일댓글 수 로딩 중
[논문리뷰] TriPlay-RL: Tri-Role Self-Play Reinforcement Learning for LLM Safety AlignmentarXiv에 게시된 'TriPlay-RL: Tri-Role Self-Play Reinforcement Learning for LLM Safety Alignment' 논문에 대한 자세한 리뷰입니다.#Review#LLM Safety Alignment#Reinforcement Learning#Self-Play#Red Teaming#Adversarial Training#Multi-Role Framework#Reward Hacking Mitigation2026년 1월 27일댓글 수 로딩 중
[논문리뷰] Adaptive Attacks on Trusted Monitors Subvert AI Control ProtocolsMaksym Andriushchenko이 arXiv에 게시한 'Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols' 논문에 대한 자세한 리뷰입니다.#Review#AI Control Protocols#LLM Monitors#Adaptive Attacks#Prompt Injection#Jailbreaking#Red Teaming#Scalable Oversight2025년 10월 13일댓글 수 로딩 중
[논문리뷰] Jailbreaking Commercial Black-Box LLMs with Explicitly Harmful PromptsLiming Fang이 arXiv에 게시한 'Jailbreaking Commercial Black-Box LLMs with Explicitly Harmful Prompts' 논문에 대한 자세한 리뷰입니다.#Review#LLM Jailbreaking#Red Teaming#Malicious Content Detection#Developer Messages#D-Attack#DH-CoT#Adversarial Attacks#Dataset Cleaning2025년 8월 25일댓글 수 로딩 중