#Scale Folding

1 post

[Paper Review] RAMP: Reinforcement Adaptive Mixed Precision Quantization for Efficient On Device LLM Inference

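Large Language Models (LLMs) have recently transformed natural language processing, but their enormous memory requirements create a "Memory Wall" problem: a Llama-2-13B model in FP16 format, for example, requires 26 GB of memory, which makes deployment on consumer GPUs and edge devices difficult.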
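The 26 GB figure follows from simple arithmetic: roughly 13 billion parameters at 2 bytes each. The minimal sketch below reproduces that estimate. The helper name `weight_memory_gb`, the nominal parameter count of 13e9, and the 8-bit and 4-bit rows are illustrative assumptions rather than figures from the paper, and the estimate covers weights only (no KV cache, activations, or quantization metadata such as scales).

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Nominal parameter count, assumed from the model name "Llama-2-13B".
LLAMA_2_13B_PARAMS = 13e9

# 16 bits corresponds to FP16; the lower bit widths are illustrative only.
for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_memory_gb(LLAMA_2_13B_PARAMS, bits):.1f} GB")
```

Under these assumptions the 16-bit row matches the 26 GB quoted above, and halving the bit width halves the weight footprint, which is the motivation for quantizing before on-device deployment.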

#Review #Mixed-Precision Quantization #Reinforcement Learning #Post-Training Quantization #Large Language Models #Policy Transfer #Scale Folding #GGUF #On-Device Inference

March 18, 2026