#Formal Verification

13 posts

[Paper Review] FVSpec: Real-World Property-Based Tests as Lean Challenges

This paper addresses the lack of benchmarks for evaluating how well AI models and agents can perform formal software verification on real-world software.

#Review #Formal Verification #Property-Based Testing #Lean 4 #LLM Pipeline #Benchmark #AI Safety #Structural Faithfulness

June 14, 2026

[Paper Review] MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Population-Level Test-Time Scaling

This paper's core goal is to address the hallucination and logical inconsistency that large language models exhibit on mathematical proof problems.

#Review #Mathematical Reasoning #Reinforcement Learning #Test-Time Scaling #Generative-Verifier #Formal Verification #Scalable Alignment

June 11, 2026

[Paper Review] Verus-SpecGym: An Agentic Environment for Evaluating Specification Autoformalization

This paper addresses a problem that arises when formal verification is introduced to guarantee the correctness of code generated by AI coding agents: the formal specification that serves as the reference for that code can itself contain errors.

#Review #Formal Verification #Specification Autoformalization #Agentic Environment #Verus #Codeforces #Executable Specifications

May 27, 2026

[Paper Review] Lean Refactor: Multi-Objective Controllable Proof Optimization via Agentic Strategy Search

This paper addresses the fact that Lean 4 proofs generated by LLMs, while correct, are overly verbose and brittle with respect to specific library versions.

#Review #Lean 4 #Proof Optimization #Agentic Framework #Retrieval-Augmented Generation #Multi-Objective Optimization #Formal Verification

May 21, 2026

[Paper Review] A2RBench: An Automatic Paradigm for Formally Verifiable Abstract Reasoning Benchmark Generation

Current evaluation of the abstract reasoning abilities of Large Language Models (LLMs) faces a fundamental trade-off between demanding genuine reasoning and keeping benchmarks scalable.

#Review #Abstract Reasoning #LLM Evaluation #Cycle Consistency #Benchmark Generation #Formal Verification #Task Expansion #Cognitive Analysis

May 18, 2026

[Paper Review] s2n-bignum-bench: A practical benchmark for evaluating low-level code reasoning of LLMs

To evaluate whether LLMs can verify industrial cryptographic assembly code, this paper proposes a practical benchmark that combines formal specifications extracted from AWS's s2n-bignum library with HOL Light proof-script generation tasks.

#Review #Formal Verification #Theorem Proving #HOL Light #LLM for Code #Cryptographic Assembly #Neurosymbolic AI

March 22, 2026

[Paper Review] Pushing the Boundaries of Natural Reasoning: Interleaved Bonus from Formal-Logic Verification

This paper aims to resolve the logical inconsistency and reward hacking that arise from the probabilistic token prediction of large language models (LLMs), and thereby to improve the reliability and accuracy of natural-language reasoning.

#Review #LLM Reasoning #Formal Verification #Neuro-Symbolic AI #Reinforcement Learning #Supervised Fine-tuning #Logic Consistency #Mathematical Reasoning

February 1, 2026

[Paper Review] GenCtrl -- A Formal Controllability Toolkit for Generative Models

This work critiques the current situation in which the controllability of generative models is implicitly assumed, and aims to provide a theoretical framework for determining how controllable a model actually is.

#Review #Generative Models #Controllability #Reachability #Control Theory #Dialogue Systems #LLMs #T2IMs #PAC Bounds #Formal Verification

January 11, 2026

[Paper Review] miniF2F-Lean Revisited: Reviewing Limitations and Charting a Path Forward

This work aims to analyze and resolve the discrepancies and errors between the informal and formal statements of the miniF2F benchmark, in a scenario where an AI system takes part in a math olympiad.

#Review #Automated Theorem Proving #Autoformalization #Benchmark Dataset #miniF2F #Lean Language #Large Language Models #Mathematical Reasoning #Formal Verification

November 16, 2025

[Paper Review] OS-Sentinel: Towards Safety-Enhanced Mobile GUI Agents via Hybrid Validation in Realistic Workflows

This work addresses the safety of autonomous agents in complex mobile GUI environments, in particular the problem of effectively detecting unexpected risks such as system compromise and personal data leakage. Its goal is to improve on the shortcomings of existing safety-detection infrastructure and strategies and to lay a systematic foundation for mobile agent safety research.

#Review #Mobile GUI Agents #Agent Safety #Hybrid Detection #Formal Verification #VLM-based Contextual Judgment #Safety Benchmark #Risk Detection

November 9, 2025

[Paper Review] Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction

This paper aims to develop Goedel-Prover-V2, a new series of open-source language models for Automated Theorem Proving (ATP) that overcomes the field's existing dependence on large models and heavy compute and achieves state-of-the-art performance with fewer resources.

#Review #Automated Theorem Proving #Formal Verification #Language Models #Self-Correction #Data Synthesis #Reinforcement Learning #Model Averaging #Lean

August 6, 2025

[Paper Review] Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving

This paper addresses the difficulty that large language models (LLMs) face in natural-language theorem proving because of the lack of a clear supervision signal.

#Review #Automated Theorem Proving #Large Language Models #Formal Verification #Reinforcement Learning #Lean #Geometry Reasoning #Chain-of-Thought #Lemma-Style Proving

August 2, 2025

[Paper Review] VeriGuard: Enhancing LLM Agent Safety via Verified Code Generation

This paper addresses the safety, security, and privacy risks that arise from deploying autonomous AI agents, particularly LLM-based agents.

#Review #LLM Agents #Safety #Formal Verification #Code Generation #Runtime Monitoring #Security #Guardrails #Policy Enforcement

October 8, 2025