#Adversarial Attacks

9 posts

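[Paper Review] MUSE: A Run-Centric Platform for Multimodal Unified Safety Evaluation of Large Language Models

This study aims to overcome the limitations of existing text-centric safety evaluation and red-teaming efforts by providing a unified platform for systematically testing whether the alignment of multimodal LLMs generalizes to audio, image, and video inputs. In particular, it seeks to characterize how modality switching affects multi-turn attacks.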

#Review #Multimodal LLMs #Safety Evaluation #Red Teaming #Adversarial Attacks #Modality Switching #LLM Alignment #Compliance #ASR

March 4, 2026

[Paper Review] Visual Memory Injection Attacks for Multi-Turn Conversations

This paper aims to address security vulnerabilities of large vision-language models (LVLMs) in multi-turn conversation settings.

#Review #LVLM #Adversarial Attacks #Multi-Turn Conversations #Visual Memory Injection #Stealthy Attacks #Benign Anchoring #Context-Cycling

February 18, 2026

[Paper Review] Few Tokens Matter: Entropy Guided Attacks on Vision-Language Models

This paper challenges the prevailing assumption that every token contributes equally to model instability during the autoregressive generation process of Vision-Language Models (VLMs).

#Review #Vision-Language Models #Adversarial Attacks #Entropy-Guided Attacks #Token Vulnerability #Harmful Content #Cross-Model Transferability #Autoregressive Generation

January 8, 2026

[Paper Review] M-ErasureBench: A Comprehensive Multimodal Evaluation Benchmark for Concept Erasure in Diffusion Models

This paper aims to evaluate how vulnerable concept erasure methods for text-to-image diffusion models are to input modalities other than text prompts, and to propose a new inference-time defense mechanism that mitigates these vulnerabilities.

#Review #Diffusion Models #Concept Erasure #Multimodal Evaluation #Adversarial Attacks #Robustness #Textual Inversion #Latent Inversion #Cross-Attention

January 5, 2026

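[Paper Review] Pay Less Attention to Function Words for Free Robustness of Vision-Language Models

This paper aims to resolve the trade-off between robustness and performance in Vision-Language Models (VLMs) and, in particular, to test the hypothesis that function words cause the vulnerability of VLMs to cross-modal adversarial attacks.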

#Review #Vision-Language Models #Adversarial Robustness #Function Words #Cross-Attention #Adversarial Attacks #Differential Attention #Vision-Language Alignment

December 10, 2025

[Paper Review] Hail to the Thief: Exploring Attacks and Defenses in Decentralised GRPO

This paper explores security vulnerabilities in decentralized Group Relative Policy Optimization (GRPO) systems used for post-training Large Language Models (LLMs).

#Review #Decentralized RL #GRPO #LLM Post-training #Adversarial Attacks #Data Poisoning #Defense Mechanisms #In-context Attack #Out-of-context Attack

November 13, 2025

[Paper Review] Jailbreaking Commercial Black-Box LLMs with Explicitly Harmful Prompts

This paper aims to develop an effective jailbreak attack methodology against commercial black-box LLMs and to improve the accuracy of LLM evaluation by resolving the problem of unsuitable prompts (Benign, Non-obvious Harmful, Non-Triggering harmful-response) in existing red-teaming datasets.

#Review #LLM Jailbreaking #Red Teaming #Malicious Content Detection #Developer Messages #D-Attack #DH-CoT #Adversarial Attacks #Dataset Cleaning

August 25, 2025

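[Paper Review] The Alignment Waltz: Jointly Training Agents to Collaborate for Safety

This paper aims to resolve the fundamental tension between large language models (LLMs) being helpful and operating safely. In particular, it seeks to address two problems: vulnerability to adversarial attacks that leads to the generation of harmful content, and overrefusal of benign but sensitive prompts.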

#Review #LLM Safety #Multi-agent Reinforcement Learning #Safety Alignment #Overrefusal #Adversarial Attacks #Feedback Agent #Conversation Agent #Dynamic Improvement Reward

October 10, 2025

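[Paper Review] WAInjectBench: Benchmarking Prompt Injection Detections for Web Agents

This paper systematically benchmarks detection methods for prompt injection attacks targeting web agents, with the goal of comprehensively evaluating and understanding detection performance in web agent environments.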

#Review #Prompt Injection #Web Agents #Multimodal AI #Adversarial Attacks #Detection Benchmarking #Large Language Models #Image-based Detection #Text-based Detection

October 6, 2025