#LLM Serving

5 posts

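[vllm] [vLLM] MiniMax-M2 MoE Gate Optimization: Improving Serving Performance by 32% with a Fused FP32 Kernel

An analysis of a case in which the MoE gate computation of the MiniMax-M2 model in vLLM was optimized with a fused kernel, substantially improving performance in low-latency settings.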

#vLLM #CUDA #MoE #Optimization #MiniMax-M2 #LLM Serving

May 30, 2026

[Paper Review] KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving

This paper addresses a major bottleneck in disaggregated LLM serving, where KV cache communication accounts for up to 60% of total end-to-end latency.

#Review #LLM Serving #KV Cache Compression #Disaggregated Inference #Bayesian Optimization #Service-Aware Control

May 21, 2026

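[SGLang] FastAPI-Based HTTP Server: The Entry Point for Asynchronous Inference Serving

An analysis of SGLang's FastAPI-based HTTP server implementation. It walks through route registration, middleware configuration, OpenAI-compatible handler initialization, and the asynchronous request-processing flow, with code.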

#sglang #HTTP Server #FastAPI #LLM Serving

April 9, 2026

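[Ray Serve] Switching the SGLang Server's Sequential Batch Processing to Concurrent Execution

A fix that replaces the for loop that processed multiple prompts one at a time in the completions endpoint with SGLang's native batch call, improving concurrent processing performance.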

#Ray #Python #Performance #SGLang #LLM Serving

March 24, 2026

[Paper Review] FlowPrefill: Decoupling Preemption from Prefill Scheduling Granularity to Mitigate Head-of-Line Blocking in LLM Serving

This paper addresses the head-of-line (HoL) blocking problem that arises during the compute-intensive prefill stage in LLM serving systems.

#Review #LLM Serving #Head-of-Line Blocking #Preemption #Prefill Scheduling #Time-to-First-Token (TTFT) #SLO-aware Scheduling #Operator-Level Preemption #Event-Driven Scheduling

February 24, 2026