#Jailbreak Attacks

2 posts

[Paper Review] AdvEvo-MARL: Shaping Internalized Safety through Adversarial Co-Evolution in Multi-Agent Reinforcement Learning

A detailed review of the paper 'AdvEvo-MARL: Shaping Internalized Safety through Adversarial Co-Evolution in Multi-Agent Reinforcement Learning', posted to arXiv by Zeliang Zhang.

#Review #Multi-Agent Reinforcement Learning #Adversarial Co-evolution #LLM Safety #Jailbreak Attacks #Internalized Safety #Public Baseline #System Robustness

October 7, 2025

[Paper Review] OffTopicEval: When Large Language Models Enter the Wrong Chat, Almost Always!

A detailed review of the paper 'OffTopicEval: When Large Language Models Enter the Wrong Chat, Almost Always!', posted to arXiv.

#Review #Large Language Models (LLMs) #Operational Safety #Out-of-Domain (OOD) #Prompt Steering #Jailbreak Attacks #Evaluation Benchmark #Refusal Rate

October 1, 2025