[SGLang] Project-Wide Architecture Analysis - Overview and Table of Contents

April 9, 2026 (updated: April 9, 2026)

What Is SGLang?

SGLang is an open-source LLM inference and serving project developed by LMSYS (Large Model Systems Organization). It describes itself as a "Fast serving framework for large language models and vision language models" and stands out by providing both a frontend DSL (Domain-Specific Language) and a high-performance runtime.

Key paper: SGLang: Efficient Execution of Structured Language Model Programs (2023)

Key Performance Figures

Technique	Improvement	Source
RadixAttention (prefix caching)	5x faster inference	LMSYS Blog 2024/01
Compressed FSM (JSON decoding)	3x faster structured output	LMSYS Blog 2024/02
DeepSeek MLA optimization	7x faster attention	LMSYS Blog 2024/09
torch.compile integration	1.5x speedup from compilation	LMSYS Blog 2024/09
GB200 NVL72 PD+EP	3.8x prefill, 4.8x decode throughput	LMSYS Blog 2025/09

Key Differences: SGLang vs vLLM

Aspect	vLLM	SGLang
Core caching	PagedAttention (block allocation)	RadixAttention (radix-tree prefix sharing)
Frontend DSL	None	SGL Language (gen, select, function)
Scheduler	Async Scheduling	Zero-Overhead CPU Scheduler
Disaggregation	KV Transfer Connectors	Separate prefill and decode servers
Custom kernels	External dependencies	sgl-kernel (in-house C++/CUDA)
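
To make the caching row concrete, here is a minimal sketch of prefix sharing. It is an illustration under stated assumptions, not SGLang's implementation: `ToyPrefixCache` and its methods are hypothetical names, and a plain token-level trie stands in for a compressed radix tree. The point it tries to show is that a request sharing a prefix with an earlier one only needs to prefill the tokens past the shared part.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TrieNode:
    # One child per next token ID. A real radix tree would merge
    # single-child chains into one edge; this toy trie does not.
    children: dict[int, TrieNode] = field(default_factory=dict)


class ToyPrefixCache:
    """Hypothetical stand-in for a prefix cache (not SGLang's RadixCache API)."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def match_prefix(self, token_ids: list[int]) -> int:
        """Return how many leading tokens are already cached."""
        node, matched = self.root, 0
        for tok in token_ids:
            child = node.children.get(tok)
            if child is None:
                break
            node, matched = child, matched + 1
        return matched

    def insert(self, token_ids: list[int]) -> int:
        """Cache a sequence and return the number of newly added tokens."""
        node, added = self.root, 0
        for tok in token_ids:
            if tok not in node.children:
                node.children[tok] = TrieNode()
                added += 1
            node = node.children[tok]
        return added


cache = ToyPrefixCache()
system_prompt = [101, 7, 7, 42]           # shared prefix, e.g. a system prompt
request_a = system_prompt + [5, 6]
request_b = system_prompt + [8, 9, 10]

cache.insert(request_a)
reused = cache.match_prefix(request_b)    # tokens whose KV entries could be reused
to_prefill = len(request_b) - reused      # tokens that still need a prefill pass
print(reused, to_prefill)
```

In this toy run the second request reuses the four cached prefix tokens and would only prefill its three new ones.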

Overall Architecture

 ┌──────────────────────────────────────────────────────────────────────────┐
 │                       Client (HTTP / gRPC / SDK)                        │
 └────────────────────────────────┬─────────────────────────────────────────┘
                                  │
                                  ▼
 ┌──────────────────────────────────────────────────────────────────────────┐
 │  1. Entry Point                    python/sglang/srt/entrypoints/       │
 │     HTTP Server (FastAPI) / Engine / OpenAI·Anthropic·Ollama API        │
 └────────────────────────────────┬─────────────────────────────────────────┘
                                  │
                                  ▼
 ┌──────────────────────────────────────────────────────────────────────────┐
 │  2. Frontend Language                   python/sglang/lang/             │
 │     SGL DSL: gen() → IR → Interpreter → Backend                        │
 └────────────────────────────────┬─────────────────────────────────────────┘
                                  │
                                  ▼
 ┌──────────────────────────────────────────────────────────────────────────┐
 │  3. TokenizerManager                                                    │
 │     Async tokenization: GenerateReqInput → TokenizedGenerateReqInput    │
 │                                                                         │
 │              ──── ZMQ IPC ────▶                                         │
 │                                                                         │
 │  4. Scheduler (Zero-Overhead CPU)       python/sglang/srt/managers/     │
 │       ┌─────────────┐   ┌──────────────┐                               │
 │       │SchedulePolicy│──▶│ ScheduleBatch│                               │
 │       └──────┬───────┘   └──────┬───────┘                               │
 │              │                  │                                        │
 │              ▼                  │                                        │
 │  ┌─────────────────────┐       │                                        │
 │  │ 5. RadixCache       │       │                                        │
 │  │    (KV Cache)       │       │                                        │
 │  └─────────────────────┘       │                                        │
 └────────────────────────────────┼────────────────────────────────────────┘
                                  │
                                  ▼
 ┌──────────────────────────────────────────────────────────────────────────┐
 │  6. Attention Backends          python/sglang/srt/layers/attention/     │
 │     FlashAttention │ FlashInfer │ MLA │ NSA │ Mamba │ GDN │ FLA │ ...  │
 ├─────────────────────────────────────────────────────────────────────────┤
 │  7. TP Worker → Model Runner                                           │
 │     ForwardBatch → Model Forward → CUDA Graphs                         │
 ├─────────────────────────────────────────────────────────────────────────┤
 │  8~9. Model Layers                                                      │
 │     Quantization (FP8/FP4/AWQ/INT8) │ MoE (Fused/EP/CUTLASS)          │
 └────────────────────────────────┬─────────────────────────────────────────┘
                                  │
                                  ▼
 ┌──────────────────────────────────────────────────────────────────────────┐
 │  10. Speculative Decoding (EAGLE/N-gram/DFlash)                         │
 │  11. Constrained Decoding (XGrammar/Outlines/LLGuidance)                │
 │                                                                         │
 │  Sampler → BatchTokenIDOutput                                           │
 └────────────────────────────────┬─────────────────────────────────────────┘
                                  │
                                  ▼
 ┌──────────────────────────────────────────────────────────────────────────┐
 │  DetokenizerManager ◀── ZMQ IPC ── tokens → text → HTTP Response        │
 └──────────────────────────────────────────────────────────────────────────┘

 ┌──────────────────────────────────────────────────────────────────────────┐
 │  Cross-Cutting Concerns                                                 │
 │  12. Distributed (TP/PP/DP/EP) │ 13. Disaggregation (PD split)          │
 │  14. LoRA Adapter Serving      │ 16. Multimodal (Vision/Audio)          │
 └──────────────────────────────────────────────────────────────────────────┘

Process Architecture

SGLang uses a multi-process architecture built on IPC (Inter-Process Communication). The processes talk to each other over ZMQ sockets, which keeps CPU-bound work separate from GPU-bound work.

Main Process (HTTP/Engine)
├── FastAPI Server (port 30000)
└── TokenizerManager
        │
        ├──(ZMQ)──▶ Scheduler (subprocess)
        │               ├── SchedulePolicy
        │               ├── RadixCache
        │               └── TP Workers (one subprocess per GPU)
        │                       ├── Model Runner
        │                       └── CUDA Graph Runner
        │
        └──(ZMQ)──▶ DetokenizerManager (subprocess)
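
The message flow in the tree above can be sketched with the standard library alone. This is a loose analogy and not SGLang code: threads stand in for the subprocesses, in-memory `queue.Queue` objects stand in for the ZMQ sockets, and the vocabulary, the function names, and the "model" (which just reverses the prompt) are all made up for illustration.

```python
import queue
import threading

# In-memory queues stand in for the ZMQ sockets between processes.
to_scheduler: queue.Queue = queue.Queue()
to_detokenizer: queue.Queue = queue.Queue()
to_main: queue.Queue = queue.Queue()      # return path to the main process

VOCAB = ["hello", "world", "!"]           # toy vocabulary (assumption)
STOP = None                               # sentinel that shuts a stage down


def tokenize(text: str) -> list:
    """Main-process role: turn text into token IDs."""
    return [VOCAB.index(word) for word in text.split()]


def scheduler_loop() -> None:
    """Scheduler role: receive token IDs, run the fake model, forward output IDs."""
    while (msg := to_scheduler.get()) is not STOP:
        rid, token_ids = msg
        output_ids = token_ids[::-1]      # fake model: echo the prompt reversed
        to_detokenizer.put((rid, output_ids))
    to_detokenizer.put(STOP)              # propagate shutdown downstream


def detokenizer_loop() -> None:
    """Detokenizer role: turn output token IDs back into text."""
    while (msg := to_detokenizer.get()) is not STOP:
        rid, output_ids = msg
        to_main.put((rid, " ".join(VOCAB[i] for i in output_ids)))


workers = [
    threading.Thread(target=scheduler_loop),
    threading.Thread(target=detokenizer_loop),
]
for worker in workers:
    worker.start()

to_scheduler.put(("req-0", tokenize("hello world !")))
to_scheduler.put(STOP)
for worker in workers:
    worker.join()

result = to_main.get()
print(result)
```

Each stage only sees messages, never another stage's internal state, which is the property that lets the real system place tokenization, scheduling, and detokenization in separate processes.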

Series Table of Contents

1. Entry Point & Server

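FastAPI-Based HTTP Server - FastAPI app initialization, route registration, asynchronous request handling
Engine: Multi-Process Orchestrator - process creation, ZMQ IPC, lifecycle management
OpenAI-Compatible API - Chat, Completions, Embedding, and Tokenize endpoints
Anthropic/Ollama-Compatible API - multi-protocol compatibility layer
gRPC Server - gRPC communication for distributed inference
Function Calling & Tool Use - format parsers for 20+ models
Speech Recognition & ASR Integration - Whisper and Qwen3-ASR adapters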

2. Frontend Language (SGL DSL)

SGL Language: Designing a DSL for LLM Programming - gen, select, and the function decorator
Intermediate Representation (IR) - SglGen, SglSelect, SglExpr
Interpreter: The SGL Program Execution Engine - StreamExecutor, ProgramState
Multiple Backends - OpenAI, Anthropic, VertexAI, LiteLLM
Chat Template Management - Jinja templates, per-model mapping

3. Tokenizer & Detokenizer

TokenizerManager - asynchronous tokenization pipeline
DetokenizerManager - streaming detokenization
IO Data Structures - GenerateReqInput, TokenizedGenerateReqInput
Multi-Tokenizer - managing tokenizers for multiple models

4. Scheduler

Zero-Overhead CPU Scheduler - the core of batch scheduling
ScheduleBatch & Req - analysis of the batch data structures
Scheduling Policies - FCFS, SJF, Age-Penalty
Continuous Batching & Chunked Prefill - dynamic batching
Pipeline Parallelism Scheduler - PP mixin
Data Parallel Attention Scheduler - DP Attention
Prefill Delayer - strategic prefill delay

5. Memory & KV Cache

RadixAttention - radix-tree-based prefix caching
C++ Radix Tree - high-performance cache implementation
GPU Memory Pool - block-based memory allocation
Allocator - token-to-KV pool allocation strategy
HiRadixCache - hierarchical GPU/CPU/disk cache
Sliding Window Attention Cache - SWA optimization
Mamba Radix Cache - caching for SSM models
Cache Eviction Policies - comparing LRU, LFU, and FIFO
Hybrid Cache Controller - GPU/CPU hybrid
Session-Aware Cache - per-user partitioning
External Storage Backends - LMCache, 3FS, Mooncake, NIXL
Multimodal Cache - caching vision encoder outputs

6. Attention Backends

RadixAttention Layer - unified attention interface
Attention Registry - dynamic backend selection
FlashAttention - IO-aware tiled attention
FlashInfer - ragged-tensor attention
Multi-head Latent Attention (MLA) - attention with a compressed KV cache
NSA (Native Sparse Attention) - DeepSeek sparse attention
Double Sparsity - token sparsity + channel sparsity
Hybrid Attention - dynamic dense-sparse switching
Triton Attention - Triton kernels
Mamba (SSM) - linear-time sequence modeling
GDN Linear Attention - Gated DeltaNet
KDA - Kimi Delta Attention
Lightning Attention - linear attention
FLA Operations - Flash Linear Attention

7. Model Runner & Worker

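TP Worker - per-GPU tensor-parallel worker
Model Runner - forward-pass execution
ForwardBatch - conversion to GPU tensors
CUDA Graphs - reducing kernel launch overhead
Piecewise CUDA Graph - compiling CUDA graphs in pieces
Model Loader - weight-loading infrastructure
torch.compile & Inductor - PyTorch compiler integration
Warmup - GPU initialization & JIT precompilation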

8. Quantization

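FP8 - 8-bit floating-point quantization
FP4 - 4-bit floating point (NVIDIA NVFP4)
AWQ - activation-aware weight quantization
Block-wise INT8 - block-level INT8 quantization
BitsAndBytes - QLoRA & NF4
AutoRound - automatic rounding optimization
Compressed Tensors - unified quantization framework
Mixed-Precision Schemes - W4A8, W8A8, W4A4
MoE-Specific Quantization - per-expert quantization strategies
Hardware-Specific Quantization Tuning - B200, H100, MI300X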

9. Mixture of Experts

Fused MoE (Triton) - fused routing and expert computation
CUTLASS MoE - optimized GEMM kernels
Expert Parallel MoE - distributed expert layers
MoE Routing - token-to-expert assignment algorithms
Elastic Expert Parallelism - dynamic scaling
EPLB - Expert Parallelism Load Balancer
FlashInfer + TensorRT-LLM MoE - hybrid MoE

10. Speculative Decoding

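Speculative Decoding Overview - principles and implementation
EAGLE - hidden-state-based draft model
EAGLE v2 - improved drafting
Multi-Layer EAGLE - multi-layer drafting
N-gram Draft - model-free speculative decoding
DFlash - Flash-based drafting
EAGLE CUDA Graph - accelerating the draft
Tree Search & Verification - tree search and verification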

11. Constrained Decoding

Grammar Manager - structured output generation
XGrammar - JSON/regex constraint backend
Outlines - FSM-based constraints & Jump-Forward
LLGuidance - additional grammar backend
Reasoner Grammar - constraining reasoning chains

12. Distributed Systems

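Parallel State - managing TP/PP/DP/EP parallelism
Communication Ops - AllReduce, Broadcast, AllGather
Custom All-Reduce - optimizations beyond NCCL
NCCL & MSCCL++ - collective communication libraries
Data Parallel Controller - coordinating multiple instances
Ray Integration - distributed engine & scheduler
Shared Memory Broadcast - inter-process communication
Hardware-Specific Communication - HPU, NPU, XPU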

13. Disaggregated Serving

Prefill-Decode Disaggregation Overview - the PD-disaggregated architecture
Disaggregated Prefill Server - prefill-only server
Disaggregated Decode Server - decode-only server
KV Cache Offloading - offloading during decode
Disaggregation Connectors - Mooncake, NIXL, MORI
Staging Buffer - transfer buffer management

14. LoRA Adapter Serving

LoRA Manager - adapter lifecycle management
LoRA Layers - QKV and Gate/Up projections
LoRA Backends - PyTorch, Triton, Chunked
LoRA Triton Kernels - SGMV, SGEMM
LoRA + MoE Fusion - combining adapters with experts
LoRA Eviction - adapter cache management

15. Sampling & Output

Sampler - the logits-to-token sampling pipeline
Sampling Parameters - a rundown of all parameters
PenaltyLib - repetition/frequency/presence penalties
Custom Logit Processor - custom logit processing

16. Multimodal Processing

Multimodal Processing Pipeline Overview - Vision/Audio/Video
Vision-Language Models - CLIP, InternVL, LLaVA
Audio Models - Whisper, Qwen3-ASR, GLM-ASR
ViT CUDA Graph - vision encoder acceleration
Efficient Vision Sampling - image compression

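17. Model Layers & Miscellaneous

Linear Layer - quantization-integrated linear layer
Activation Functions - custom SiLU and GELU implementations
RoPE Variants - rotary position encoding
Deep GEMM Wrapper - optimized matrix multiplication
Sparsity Algorithms - QUEST, DeepSeek NSA
Batch Overlap - overlapping computation and communication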

Appendix

Model Configuration System - configuration management
Server Args - a rundown of 300+ server arguments
sgl-kernel - custom C++/CUDA kernel library
Observability - tracing, metrics, profiling
Debug Utils - tensor comparison, schedule simulator
Reasoning & Code Completion Parser - reasoning parser
Hardware Backends - MLX, NPU, XPU

Core Data Flow

GenerateReqInput → TokenizedGenerateReqInput → Req → ScheduleBatch → ForwardBatch → BatchTokenIDOutput → BatchStrOutput

Stage	Data structure	Location
HTTP input	`GenerateReqInput`	`python/sglang/srt/managers/io_struct.py`
Tokenization result	`TokenizedGenerateReqInput`	`python/sglang/srt/managers/io_struct.py`
Scheduler request	`Req`	`python/sglang/srt/managers/schedule_batch.py`
Batch construction	`ScheduleBatch`	`python/sglang/srt/managers/schedule_batch.py`
GPU tensors	`ForwardBatch`	`python/sglang/srt/model_executor/forward_batch_info.py`
Token output	`BatchTokenIDOutput`	`python/sglang/srt/managers/io_struct.py`
Text output	`BatchStrOutput`	`python/sglang/srt/managers/io_struct.py`
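
The chain of stages in the table can be walked through with toy stand-ins. Everything here is an assumption made for illustration: the `Toy*` classes, their fields, the one-token-per-character tokenizer, and the fake "model" are not SGLang's definitions, and the real classes carry many more fields. The sketch only tries to show how a request changes shape from one stage to the next.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyGenerateReqInput:            # HTTP input
    rid: str
    text: str


@dataclass
class ToyTokenizedReqInput:           # tokenization result
    rid: str
    input_ids: list[int]


@dataclass
class ToyReq:                         # scheduler-side request
    rid: str
    input_ids: list[int]
    output_ids: list[int] = field(default_factory=list)


@dataclass
class ToyScheduleBatch:               # batch of requests
    reqs: list[ToyReq]


@dataclass
class ToyForwardBatch:                # flattened, model-facing view
    input_ids: list[int]
    seq_lens: list[int]


@dataclass
class ToyBatchTokenIDOutput:          # token output
    rids: list[str]
    output_ids: list[list[int]]


@dataclass
class ToyBatchStrOutput:              # text output
    rids: list[str]
    output_strs: list[str]


def tokenize(req: ToyGenerateReqInput) -> ToyTokenizedReqInput:
    # Toy tokenizer: one token per character (its code point).
    return ToyTokenizedReqInput(req.rid, [ord(ch) for ch in req.text])


def to_forward_batch(batch: ToyScheduleBatch) -> ToyForwardBatch:
    # Concatenate all sequences and remember each sequence length.
    flat = [tok for req in batch.reqs for tok in req.input_ids]
    return ToyForwardBatch(flat, [len(req.input_ids) for req in batch.reqs])


def fake_forward(fb: ToyForwardBatch) -> list[int]:
    # Fake model: the next token is the last input token plus one.
    next_ids, end = [], 0
    for n in fb.seq_lens:
        end += n
        next_ids.append(fb.input_ids[end - 1] + 1)
    return next_ids


http_inputs = [ToyGenerateReqInput("r0", "ab"), ToyGenerateReqInput("r1", "xy")]
tokenized = [tokenize(r) for r in http_inputs]
batch = ToyScheduleBatch([ToyReq(t.rid, t.input_ids) for t in tokenized])
forward_batch = to_forward_batch(batch)
for req, tok in zip(batch.reqs, fake_forward(forward_batch)):
    req.output_ids.append(tok)
token_out = ToyBatchTokenIDOutput(
    [r.rid for r in batch.reqs], [r.output_ids for r in batch.reqs]
)
str_out = ToyBatchStrOutput(
    token_out.rids, ["".join(map(chr, ids)) for ids in token_out.output_ids]
)
print(str_out)
```

The per-request objects exist only up to the scheduler. From `ToyForwardBatch` onward the data is batch-shaped, and the request IDs are what tie outputs back to their callers.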

Paper	Applicable layer
SGLang: Efficient Execution of Structured Language Model Programs	Overall architecture
FlashAttention: Fast and Memory-Efficient Exact Attention	Attention backends
Efficient Memory Management for LLM Serving with PagedAttention	Memory management
EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty	Speculative decoding
DeepSeek-V2: A Strong, Economical, and Efficient MoE Language Model	MLA, MoE
Sparse Flash Attention	Sparse attention

