Latest Posts

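[llm-compressor] SmoothQuant: Shifting Quantization Difficulty from Activations to Weights

An analysis of how the activation smoothing technique from the SmoothQuant paper is implemented in llm-compressor, including how per-channel scales are determined and how they are absorbed into RMSNorm.

#llm-compressor #SmoothQuant #Quantization #W8A8

April 13, 2026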

[llm-compressor] AWQ: Implementing Activation-Aware Weight Quantization

An analysis of how the salient-weight scaling idea from the AWQ paper is implemented in llm-compressor through mappings and dynamic_mappings.

#llm-compressor #AWQ #Quantization #PTQ

April 13, 2026

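[llm-compressor] GPTQ: Implementing Post-Training Quantization with Second-Order Information

An analysis of how the Hessian-based quantization from the GPTQ paper is implemented in llm-compressor, covering the block_size, dampening_frac, and actorder parameters and the way quantize_weight is called when a sequential epoch ends.

#llm-compressor #GPTQ #Quantization #PTQ

April 13, 2026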

[llm-compressor] Group Size Validation: Checking Group Size Compatibility

An analysis of how the validate_group_size function in group_size_validation.py checks that a layer's shape is compatible with group_size and reports error messages.

#llm-compressor #Quantization #Validation

April 13, 2026

[llm-compressor] Quantization Calibration: update_weight_zp_scale and Observer Registration

An analysis of how helper functions in calibration.py, such as update_weight_zp_scale and update_weight_global_scale, invoke observers module by module to determine scales.

#llm-compressor #Quantization #Calibration

April 13, 2026

[llm-compressor] Quantization Base: QuantizationModifier and QuantizationMixin

An analysis of how QuantizationModifier manages the PTQ/QAT lifecycle and how QuantizationMixin handles observer registration, calibration, and termination.

#llm-compressor #Quantization #Modifier

April 13, 2026

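[SGLang] Disaggregated Decode Server: Implementing a Decode-Only Server

This post analyzes SGLang's Disaggregated Decode server. It walks through the code for the decode-only server's KV cache reception, its token generation loop, and the state handoff from the Prefill server.

#sglang #Disaggregated Decode #Token Generation #Decode Server

April 13, 2026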

[llm-compressor] iMatrix Observer: MSE Weighted by Input-Channel Importance

An analysis of how IMatrixMSEObserver collects E[x^2] of the input through a forward pre-hook to compute per-channel importance, then uses those weights to run an MSE grid search.

#llm-compressor #Observer #iMatrix #Quantization

April 13, 2026

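[llm-compressor] Moving Average Observer: An Online Observer Based on Exponential Moving Averages

An analysis of how MovingAverageObserverBase accumulates min/max values across multiple batches with an exponential moving average to provide stable scales.

#llm-compressor #Observer #MovingAverage

April 13, 2026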

[llm-compressor] MSE Observer: Minimizing Quantization Error with Grid Search

An analysis of the grid search logic in which MemorylessMSEObserver and MovingAverageMSEObserver progressively shrink the min/max range to minimize quantization MSE.

#llm-compressor #Observer #MSE #Quantization

April 13, 2026

[llm-compressor] MinMax Observer: Three min/max Computation Policies

A code analysis of how the three variants, MemorylessMinMaxObserver, StaticMinMaxObserver, and MinMaxObserver, each aggregate min/max values.

#llm-compressor #Observer #Quantization #MinMax

April 13, 2026

[llm-compressor] Observers Base: The Abstract Foundation for Scale and Zero-Point Computation

An analysis of how the Observer base class computes scales and zero points through the get_min_max hook and calls calculate_qparams from compressed-tensors.

#llm-compressor #Observer #Quantization

April 13, 2026

[llm-compressor] Modifier Interface: Abstract Contract and Type Checking

An analysis of the initialized/finalized properties and the initialize/finalize/update_event abstract methods defined by the ModifierInterface ABC.

#llm-compressor #Modifier #Interface #ABC

April 13, 2026

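[llm-compressor] Modifier Factory: Creating Modifier Instances from String Names

An analysis of the mechanism by which ModifierFactory recursively scans packages to register Modifier subclasses and builds actual instances from the string names in a recipe YAML.

#llm-compressor #Modifier #Factory #Registry

April 13, 2026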

[llm-compressor] Modifier Base: The Base Class Every Modifier Inherits

An analysis of the Modifier class's lifecycle methods (initialize/update_event/finalize), its start/end hooks, and its should_start/should_end condition checks.

#llm-compressor #Modifier #Base

April 13, 2026

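[llm-compressor] Intermediates Cache: An Offload Cache for Subgraph Activations

An analysis of how IntermediatesCache manages memory by offloading and onloading per-batch intermediate activations between CPU and GPU, along with its prefetch mechanism.

#llm-compressor #Pipeline #Memory #Offload

April 13, 2026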

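[llm-compressor] Data-Free & Independent Pipeline: A Pipeline Without Data and Per-Modifier Execution

An analysis of DataFreePipeline's structure, which runs no forward pass, and of IndependentPipeline's logic for automatically selecting a pipeline for each Modifier.

#llm-compressor #Pipeline #DataFree #Independent

April 13, 2026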

[llm-compressor] Sequential Pipeline: Layer-Wise Subgraph Calibration

An analysis of how SequentialPipeline splits the model into subgraphs and runs GPTQ/SparseGPT while offloading intermediate activations.

#llm-compressor #Pipeline #Sequential #Calibration

April 13, 2026

[llm-compressor] Basic Pipeline: Calibration in a Single Forward Pass

An analysis of how BasicPipeline calibrates by traversing the whole model in a single forward pass, and of how it handles the loss mask and dispatch_model.

#llm-compressor #Pipeline #Calibration

April 13, 2026

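[SGLang] Disaggregated Prefill Server: Implementing a Prefill-Only Server

This post analyzes SGLang's Disaggregated Prefill server. It walks through the code for the server implementation optimized for prefill only, KV cache generation and transfer, and coordination with the Decode server.

#sglang #Disaggregated Prefill #KV Transfer #Prefill Server

April 13, 2026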