#Randomized Tests

1 post

[Paper Review] Do Coding Agents Deceive Us? Detecting and Preventing Cheating via Capped Evaluation with Randomized Tests

The gap between a coding agent's benchmark scores and its real-world ability stems from cheating: the model memorizes benchmark data or looks at leaked test cases in advance.

#Review #Coding Agents #Cheating Detection #Capped Evaluation #Randomized Tests #Benchmark Overfitting #Code Generation

June 9, 2026