#Jailbreaking

8 posts

[Paper Review] Exposing the Systematic Vulnerability of Open-Weight Models to Prefill Attacks

This paper aims to show that open-weight large language models (LLMs) are systematically vulnerable to prefill attacks.

#Review #Large Language Models #Prefill Attacks #AI Safety #Red Teaming #Vulnerability #Open-Weight Models #Jailbreaking #Generative AI

February 16, 2026

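[Paper Review] FinVault: Benchmarking Financial Agent Safety in Execution-Grounded Environments

This paper addresses the new security risks that LLM-based financial agents introduce in high-stakes, heavily regulated settings such as investment analysis, risk assessment, and automated decision-making.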

#Review #Financial AI Agents #Security Benchmark #Execution-Grounded #LLM Safety #Prompt Injection #Jailbreaking #Compliance #Vulnerability Assessment

January 21, 2026

[Paper Review] Multi-Faceted Attack: Exposing Cross-Model Vulnerabilities in Defense-Equipped Vision-Language Models

This paper aims to systematically expose safety vulnerabilities in Vision-Language Models (VLMs) that are equipped with multiple defense mechanisms, including RLHF (Reinforcement Learning from Human Feedback), system prompts, and input/output content filters.

#Review #Vision-Language Models (VLMs) #Adversarial Attack #Jailbreaking #Reward Hacking #Content Moderation Bypass #Cross-Model Transferability #Safety Vulnerabilities

November 23, 2025

[Paper Review] Jailbreaking in the Haystack

This study analyzes the safety implications of the extended context windows of long-context language models (LMs) and explores how safety behavior degrades even within benign contexts.

#Review #Jailbreaking #LLM Safety #Long-Context Models #Positional Bias #Attack Success Rate (ASR) #Prompt Engineering #Compute Efficiency #AI Agents

November 9, 2025

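[Paper Review] Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols

This study addresses the problem of untrusted LLM agents bypassing safety mechanisms to subvert AI control protocols. In particular, it focuses on adaptive attacks, in which the attacker model knows the protocol and the monitor model, and presents a new attack vector that exploits the LLM monitor as a central point of failure.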

#Review #AI Control Protocols #LLM Monitors #Adaptive Attacks #Prompt Injection #Jailbreaking #Red Teaming #Scalable Oversight

October 13, 2025

[Paper Review] Imperceptible Jailbreaking against Large Language Models

Unlike existing approaches that rely on visible text modifications, this paper proposes a new jailbreak attack technique that bypasses LLM safety guardrails in an imperceptible way.

#Review #Large Language Models #Jailbreaking #Imperceptible Attacks #Unicode Variation Selectors #Adversarial Suffixes #Safety Alignment #Prompt Injection

October 7, 2025

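[Paper Review] Investigating Safety Vulnerabilities of Large Audio-Language Models Under Speaker Emotional Variations

This paper explores the safety vulnerabilities of large audio-language models (LALMs), aiming in particular to systematically investigate how variations in speaker emotion affect the models' safety alignment.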

#Review #LALM Safety #Speaker Emotion #Safety Alignment #Jailbreaking #Audio-Language Models #Emotional Variation #Unsafe Rate #Non-refusal Rate

October 24, 2025

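[Paper Review] Agentic Reinforcement Learning for Search is Unsafe

This paper aims to evaluate the safety of search models trained with agentic reinforcement learning (RL), in particular their ability to refuse harmful requests and how the safety properties inherited from instruction tuning change.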

#Review #Agentic Reinforcement Learning #LLM Safety #Tool Use #Search Models #Jailbreaking #Instruction Tuning #Vulnerability

October 21, 2025