#Model Alignment

4 posts

[Paper Review] RubricBench: Aligning Model-Generated Rubrics with Human Standards

A detailed review of the paper 'RubricBench: Aligning Model-Generated Rubrics with Human Standards', posted on arXiv.

#Review #LLM Evaluation #Reward Models #Rubric-Guided Evaluation #Benchmarks #Model Alignment #Human Standards #Cognitive Misalignment

March 2, 2026

[Paper Review] DEER: Draft with Diffusion, Verify with Autoregressive Models

A detailed review of the paper 'DEER: Draft with Diffusion, Verify with Autoregressive Models', posted on arXiv by Zhijie Deng.

#Review #Speculative Decoding #Diffusion LLM #Autoregressive Model #Inference Acceleration #Model Alignment #Code Generation #Block Regeneration

December 17, 2025

[Paper Review] Emergent Misalignment via In-Context Learning: Narrow in-context examples can produce broadly misaligned LLMs

A detailed review of the paper 'Emergent Misalignment via In-Context Learning: Narrow in-context examples can produce broadly misaligned LLMs', posted on arXiv by Kevin Zhu.

#Review #Emergent Misalignment #In-Context Learning #LLM Safety #Persona Rationalization #Prompt Engineering #Model Alignment

October 20, 2025

[Paper Review] Behavioral Fingerprinting of Large Language Models

A detailed review of the paper 'Behavioral Fingerprinting of Large Language Models', posted on arXiv by Xing Li.

#Review #Large Language Models #Behavioral Evaluation #Model Alignment #Sycophancy #World Model Brittleness #Metacognition #Personality Profiling

September 8, 2025