#Adversarial Robustness

10 posts

[Paper Review] Empirical Evidence for Simply Connected Decision Regions in Image Classifiers

This paper investigates whether the decision regions learned by modern deep neural networks are not merely path connected but also satisfy the stronger topological property of being simply connected.

#Review #Deep Neural Networks #Decision Regions #Topology #Simply Connected #Coons Patches #Adversarial Robustness

May 10, 2026

[Paper Review] The Cognitive Penalty: Ablating System 1 and System 2 Reasoning in Edge-Native SLMs for Decentralized Consensus

The authors build an evaluation framework called Sentinel-Bench on top of Qwen-3.5-9B to quantitatively compare System 1 and System 2 performance. Holding the parameters fixed and changing only the reasoning toggle, they run 840 independent inferences.

#Review #Small Language Models #Decentralized Autonomous Organizations #Inference-time Compute #System 1 vs System 2 #Sentinel-Bench #Adversarial Robustness #Cognitive Collapse

April 21, 2026

[Paper Review] Statistical Estimation of Adversarial Risk in Large Language Models under Best-of-N Sampling

This paper addresses the problem that safety evaluations of large language models (LLMs) focus only on single-shot or low-budget attacks and therefore underestimate real-world threats.

#Review #LLM Safety #Adversarial Robustness #Best-of-N Sampling #Statistical Estimation #Beta-Binomial Model #Jailbreak #Risk Amplification

February 1, 2026

[Paper Review] On the Evidentiary Limits of Membership Inference for Copyright Auditing

This paper examines whether membership inference attacks (MIAs) can serve as reliable technical evidence in copyright audits of large language model (LLM) training data.

#Review #Membership Inference Attacks #Copyright Auditing #Large Language Models #Adversarial Robustness #Paraphrasing #Sparse Autoencoders #Semantic Preservation #LLM Security

January 20, 2026

[Paper Review] A Safety Report on GPT-5.2, Gemini 3 Pro, Qwen3-VL, Doubao 1.8, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5

This paper aims to evaluate, comprehensively and along multiple dimensions, the safety of seven recent AI models: GPT-5.2, Gemini 3 Pro, Qwen3-VL, Doubao 1.8, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5.

#Review #AI Safety #Large Language Models #Multimodal LLMs #Benchmark Evaluation #Adversarial Robustness #Multilingual Evaluation #Regulatory Compliance #Image Generation Safety

January 15, 2026

[Paper Review] COMPASS: A Framework for Evaluating Organization-Specific Policy Alignment in LLMs

Going beyond general-purpose harmfulness evaluation, this paper proposes the COMPASS framework for systematically evaluating how well LLMs comply with the allowlist and denylist policies specific to a company or organization.

#Review #LLM Evaluation #Policy Alignment #Organizational Policies #AI Safety #Adversarial Robustness #Refusal Behavior #Prompt Engineering #Fine-tuning

January 5, 2026

[Paper Review] Robust and Calibrated Detection of Authentic Multimedia Content

This paper aims to overcome two limitations of existing deepfake detection methods: high false-positive rates caused by the resynthesis indistinguishability of generative models, and vulnerability to adversarial attacks.

#Review #Deepfake Detection #Content Authenticity #Generative Models #Adversarial Robustness #Image Inversion #Plausible Deniability #Diffusion Models #Multimedia Forensics

December 17, 2025

[Paper Review] Pay Less Attention to Function Words for Free Robustness of Vision-Language Models

This paper aims to resolve the trade-off between robustness and performance in vision-language models (VLMs) and, in particular, to test the hypothesis that function words cause VLMs' vulnerability to cross-modal adversarial attacks.

#Review #Vision-Language Models #Adversarial Robustness #Function Words #Cross-Attention #Adversarial Attacks #Differential Attention #Vision-Language Alignment

December 10, 2025

[Paper Review] LiveSecBench: A Dynamic and Culturally-Relevant AI Safety Benchmark for LLMs in Chinese Context

This study proposes LiveSecBench, a dynamic and culturally relevant benchmark for evaluating the safety of large language models (LLMs) in Chinese-language settings.

#Review #LLM Safety #AI Safety Benchmark #Chinese Context #Dynamic Evaluation #Cultural Relevance #Adversarial Robustness #ELO Rating System

November 9, 2025

[Paper Review] Soft Instruction De-escalation Defense

This paper aims to address the vulnerability to prompt injection attacks of LLM-based agentic systems that interact with external environments. In particular, it proposes a defense mechanism that effectively neutralizes malicious instructions in untrusted data without compromising the agent's utility.

#Review #Prompt Injection #LLM Security #Agentic Systems #Iterative Sanitization #Instruction Control #Adversarial Robustness #Large Language Models

October 27, 2025