[SGLang] Zero-Overhead CPU Scheduler: The Core Design of Batch Scheduling

April 10, 2026 (updated: April 10, 2026)

Introduction

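In an LLM serving system, the scheduler is the core component that decides which requests are sent to the GPU and when. A traditional scheduler waits for the GPU computation to finish before it decides the next batch, so time spent on CPU scheduling turns directly into GPU idle time. The Zero-Overhead CPU Scheduler introduced in SGLang v0.4 addresses this problem.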

This post analyzes the main loop structure and the overlap scheduling design, focusing on the Scheduler class in python/sglang/srt/managers/scheduler.py.

Architecture Diagram

                        Scheduler (CPU)
                        ┌──────────────────────────┐
                        │                          │
  ZMQ recv_requests()──>│  waiting_queue (deque)   │
                        │         │                │
                        │         v                │
                        │  get_next_batch_to_run() │
                        │    ┌─────────┐           │
                        │    │ Prefill │ (first)   │
                        │    │  Batch  │           │
                        │    └────┬────┘           │
                        │         │ (if none)      │
                        │    ┌────v────┐           │
                        │    │ Decode  │           │
                        │    │  Batch  │           │
                        │    └────┬────┘           │
                        │         │                │
                        │         v                │
                        │    run_batch() ──────────┼──> GPU Worker
                        │         │                │
                        │         v                │
                        │  process_batch_result()  │
                        └──────────────────────────┘

Scheduler Class Definition

Scheduler inherits from a number of mixins, which separates concerns by feature.

class Scheduler(
    SchedulerOutputProcessorMixin,
    SchedulerUpdateWeightsMixin,
    SchedulerProfilerMixin,
    SchedulerMetricsMixin,
    SchedulerDisaggregationDecodeMixin,
    SchedulerDisaggregationPrefillMixin,
    SchedulerMultiplexMixin,
    SchedulerRuntimeCheckerMixin,
    SchedulerPPMixin,
    SchedulerDPAttnMixin,
    SchedulerDllmMixin,
):
    """A scheduler that manages a tensor parallel GPU worker."""

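The mixin composition above can be illustrated with a small, self-contained sketch. MetricsMixin, ProfilerMixin, and MiniScheduler are hypothetical names invented for this example and are not SGLang classes; the point is only that each feature lives in its own class while the core class keeps the loop state.

```python
class MetricsMixin:
    """Hypothetical mixin: counts batches and tokens that were processed."""

    def record_batch(self, num_tokens: int) -> None:
        self.num_batches = getattr(self, "num_batches", 0) + 1
        self.num_tokens = getattr(self, "num_tokens", 0) + num_tokens


class ProfilerMixin:
    """Hypothetical mixin: toggles a profiling flag."""

    def start_profile(self) -> None:
        self.profiling = True

    def stop_profile(self) -> None:
        self.profiling = False


class MiniScheduler(MetricsMixin, ProfilerMixin):
    """The core class owns the state; the features come from the mixins."""

    def __init__(self) -> None:
        self.profiling = False

    def step(self, num_tokens: int) -> None:
        # The core loop only calls into the mixin; it does not implement it.
        self.record_batch(num_tokens)


sched = MiniScheduler()
sched.start_profile()
sched.step(4)
sched.step(2)
```

Adding another feature in this style means writing one more mixin and listing it in the class bases, without touching step().
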
The main state set up at initialization is as follows.

def __init__(self, server_args, port_args, gpu_id, tp_rank, ...):
    self.schedule_policy = server_args.schedule_policy
    self.enable_overlap = not server_args.disable_overlap_schedule
    self.waiting_queue = deque()  # requests waiting to be scheduled
    self.running_batch = ScheduleBatch(...)  # batch currently running

The enable_overlap flag is the key. Depending on its value, either the Normal loop or the Overlap loop is selected.
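
As a minimal sketch of that selection, the function below maps the flag to the name of the loop that would run. select_event_loop is a hypothetical helper written for this post, not an SGLang function, and SimpleNamespace stands in for the real server_args object; only disable_overlap_schedule and the two loop names come from the excerpt above.

```python
from types import SimpleNamespace


def select_event_loop(server_args) -> str:
    """Return the name of the loop chosen by the overlap flag (illustrative)."""
    enable_overlap = not server_args.disable_overlap_schedule
    return "event_loop_overlap" if enable_overlap else "event_loop_normal"


# Stand-in for server_args with overlap scheduling left enabled.
default_args = SimpleNamespace(disable_overlap_schedule=False)
chosen = select_event_loop(default_args)
```

Because the flag is a negated "disable" option, overlap scheduling is the path taken unless the option is set explicitly.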

Core Code Analysis: Normal Loop

The most basic loop, event_loop_normal, has an intuitive three-step structure.

def event_loop_normal(self):
    """A normal scheduler loop."""
    while True:
        # 1. Receive requests
        recv_reqs = self.recv_requests()
        self.process_input_requests(recv_reqs)

        # 2. Decide the next batch
        batch = self.get_next_batch_to_run()
        self.cur_batch = batch

        # 3. Run the batch and process its result
        if batch:
            result = self.run_batch(batch)
            self.process_batch_result(batch, result)
        else:
            self.self_check_during_idle()

        self.last_batch = batch

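In this loop the CPU blocks until the GPU computation (run_batch) completes, so the CPU does nothing while the GPU is busy. Conversely, the GPU sits idle while the CPU receives requests, schedules the next batch, and processes results.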

Core Code Analysis: Overlap Loop

event_loop_overlap, the heart of the Zero-Overhead design, pipelines GPU computation with CPU scheduling.

def event_loop_overlap(self):
    """A scheduler loop that overlaps the CPU processing
       and GPU computation."""
    self.result_queue: Deque[Tuple[ScheduleBatch, ...]] = deque()

    while True:
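        # 1. Receive requests (CPU)
        recv_reqs = self.recv_requests()
        self.process_input_requests(recv_reqs)

        # 2. Decide the next batch (CPU)
        batch = self.get_next_batch_to_run()
        self.cur_batch = batch

        # 3. Launch the current batch on the GPU without waiting for
        #    its result; the result is queued for a later iteration
        if batch:
            result = self.run_batch(batch)
            self.result_queue.append((batch, result))

        # 4. Process the previous batch's result (CPU) while the GPU
        #    runs the batch launched above. Gating on last_batch keeps
        #    result processing one step behind the launch.
        if self.last_batch:
            tmp_batch, tmp_result = self.result_queue.popleft()
            self.process_batch_result(tmp_batch, tmp_result)

        self.last_batch = batch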
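
The one-step-delayed result handling can be tried out with a toy version that needs no GPU. In the sketch below a single-worker ThreadPoolExecutor stands in for the GPU worker and fake_forward for the forward pass; both are assumptions made for illustration and say nothing about how SGLang launches kernels. The recorded event order shows that batch N is launched before the result of batch N-1 is processed.

```python
from collections import deque
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Deque, List, Optional, Tuple


def fake_forward(step: int) -> str:
    """Stand-in for a GPU forward pass (hypothetical)."""
    return f"tokens-{step}"


def overlap_loop(num_steps: int) -> List[str]:
    events: List[str] = []
    result_queue: Deque[Tuple[int, Future]] = deque()
    last_batch: Optional[int] = None

    with ThreadPoolExecutor(max_workers=1) as gpu:
        for step in range(num_steps):
            # Launch the current batch without waiting for its result.
            events.append(f"launch {step}")
            result_queue.append((step, gpu.submit(fake_forward, step)))

            # Process the previous batch while the current one runs.
            if last_batch is not None:
                prev, future = result_queue.popleft()
                events.append(f"process {prev} -> {future.result()}")
            last_batch = step

        # Drain the batch that is still in flight after the last launch.
        while result_queue:
            prev, future = result_queue.popleft()
            events.append(f"process {prev} -> {future.result()}")
    return events


events = overlap_loop(3)
```

The CPU-side order is deterministic: the first iteration only launches, every later iteration launches and then processes the previous result, and the final result is drained at the end.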

Before / After Comparison

[Before: Normal Loop - GPU idle time occurs]

 CPU:  |--recv--|--schedule--|--------idle--------|--process--|
 GPU:  |--------idle---------|======forward=======|---idle----|
       t0      t1           t2                    t3         t4

[After: Overlap Loop - GPU idle time minimized]

 CPU:  |--recv--|--schedule--|--process prev--|--recv--|--schedule--|
 GPU:  |======forward(N-1)===|=====forward(N)=======|===forward(N+1)===|
       t0                   t1               t2                      t3

 * CPU scheduling runs concurrently with the GPU forward pass
 * GPU idle time -> close to 0

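In the Normal Loop, schedule -> forward -> process run sequentially, so the GPU waits during the CPU phases. In the Overlap Loop, while the forward pass of step N runs on the GPU, the CPU processes the result of step N-1 and prepares the batch for step N+1.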
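
To put rough numbers on the two timelines, here is a simplified timing model. The function simulate and every duration in it are made up for illustration (1 ms to receive and schedule, 10 ms per forward pass, 1 ms to process a result); they are not measurements of SGLang, and the model ignores any dependency of scheduling on the previous result.

```python
from typing import Optional, Tuple


def simulate(num_steps: int, sched_ms: int, forward_ms: int,
             process_ms: int, overlap: bool) -> Tuple[int, float]:
    """Return (total wall time in ms, GPU utilization) for a toy timeline."""
    cpu_t = 0                        # time at which the CPU becomes free
    gpu_free = 0                     # time at which the GPU becomes free
    prev_done: Optional[int] = None  # finish time of the batch not yet processed

    for _ in range(num_steps):
        cpu_t += sched_ms                         # recv + schedule on the CPU
        done = max(cpu_t, gpu_free) + forward_ms  # forward pass on the GPU
        gpu_free = done
        if overlap:
            # Post-process the previous batch while this one is running.
            if prev_done is not None:
                cpu_t = max(cpu_t, prev_done) + process_ms
            prev_done = done
        else:
            # Block until this batch finishes, then post-process it.
            cpu_t = done + process_ms

    if overlap and prev_done is not None:
        cpu_t = max(cpu_t, prev_done) + process_ms  # drain the last batch

    return cpu_t, num_steps * forward_ms / cpu_t


normal_total, normal_util = simulate(100, 1, 10, 1, overlap=False)
overlap_total, overlap_util = simulate(100, 1, 10, 1, overlap=True)
```

With these assumed durations, 100 steps take 1200 ms in the normal loop and 1002 ms in the overlap loop, because the CPU work is hidden behind the forward passes except at the very start and end. The gain grows as the CPU share of each step grows, which is the case for small models and short forward passes.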

get_next_batch_to_run: Prefill-First Strategy

The core batch-selection logic tries prefill before decode: a decode batch runs only when no new prefill batch can be formed.

def get_next_batch_to_run(self) -> Optional[ScheduleBatch]:
    # Merge the previous prefill batch into running_batch
    if self.last_batch and self.last_batch.forward_mode.is_extend():
        self.last_batch.filter_batch(...)
        if not self.last_batch.is_empty():
            self.running_batch.merge_batch(self.last_batch)

    # Try prefill first
    new_batch = self.get_new_batch_prefill()

    if new_batch is not None:
        ret = new_batch  # run the prefill batch
    else:
        # Run decode
        if not self.running_batch.is_empty():
            self.running_batch = self.update_running_batch(
                self.running_batch
            )
            ret = self.running_batch
        else:
            ret = None
    return ret

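The reasoning behind this design is straightforward. Prefill is the stage that fills the KV cache for a new request, so the sooner it runs, the lower the overall TTFT (Time To First Token). Requests in a decode batch are already running, so delaying them by one step has little impact.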
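
The prefill-first policy can be sketched with a toy scheduler. ToyReq, PrefillFirstScheduler, and max_prefill_reqs are hypothetical names for this example only; unlike the excerpt above, the sketch merges newly prefilled requests into the running list immediately and does no memory accounting.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Optional, Tuple


@dataclass
class ToyReq:
    rid: str
    remaining: int  # decode steps still needed


@dataclass
class PrefillFirstScheduler:
    waiting_queue: Deque[ToyReq] = field(default_factory=deque)
    running: List[ToyReq] = field(default_factory=list)
    max_prefill_reqs: int = 2  # hypothetical per-step admission cap

    def get_next_batch(self) -> Optional[Tuple[str, List[str]]]:
        # Prefill first: admit waiting requests whenever there are any.
        if self.waiting_queue:
            n = min(self.max_prefill_reqs, len(self.waiting_queue))
            new_reqs = [self.waiting_queue.popleft() for _ in range(n)]
            self.running.extend(new_reqs)
            return ("prefill", [r.rid for r in new_reqs])
        # Otherwise advance every running request by one decode step.
        if self.running:
            rids = [r.rid for r in self.running]
            for r in self.running:
                r.remaining -= 1
            self.running = [r for r in self.running if r.remaining > 0]
            return ("decode", rids)
        return None


sched = PrefillFirstScheduler()
sched.waiting_queue.extend([ToyReq("a", 1), ToyReq("b", 2), ToyReq("c", 1)])
trace = [sched.get_next_batch() for _ in range(5)]
```

In this trace the three waiting requests are prefilled over two steps before any decode step runs, which mirrors the trade-off described above: new requests get their first token sooner, and running requests absorb a short delay.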

Why This Design

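1. CPU-GPU pipeline separation: All scheduling logic (policy computation, queue management, memory budget calculation) runs on the CPU, and the GPU performs only the forward pass. This separation is what allows the two processors to work at the same time.

2. Mixin-based extensibility: The Scheduler class composes its features from 11 mixins. DP Attention, Pipeline Parallelism, Disaggregation, and others exist as independent mixins, so features can be added without modifying the core loop.

3. batch_is_full flag: The ScheduleBatch.batch_is_full flag lets the scheduler skip unnecessary prefill checks. If the previous step determined that there was not enough token capacity left, the next step skips the attempt to add new requests altogether, which reduces CPU overhead.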
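
The batch_is_full short-circuit in item 3 can be sketched as follows. ToyRunningBatch, max_reqs, and admission_checks are hypothetical names for this example; the real flag tracks token and memory capacity rather than a request count, and when it is reset here (on finish) is an assumption.

```python
from collections import deque
from typing import Deque, List


class ToyRunningBatch:
    def __init__(self, max_reqs: int) -> None:
        self.max_reqs = max_reqs      # hypothetical capacity limit
        self.reqs: List[str] = []
        self.batch_is_full = False
        self.admission_checks = 0     # counts trips through the slow path

    def try_admit(self, waiting: Deque[str]) -> List[str]:
        # Fast path: an earlier step already found that nothing fits.
        if self.batch_is_full or not waiting:
            return []
        self.admission_checks += 1
        admitted: List[str] = []
        while waiting and len(self.reqs) < self.max_reqs:
            rid = waiting.popleft()
            self.reqs.append(rid)
            admitted.append(rid)
        if waiting:
            # Requests are left over, so capacity ran out: remember that.
            self.batch_is_full = True
        return admitted

    def finish(self, rid: str) -> None:
        self.reqs.remove(rid)
        self.batch_is_full = False    # room again, so re-enable admission


batch = ToyRunningBatch(max_reqs=2)
waiting: Deque[str] = deque(["a", "b", "c"])
first = batch.try_admit(waiting)    # admits what fits and sets the flag
second = batch.try_admit(waiting)   # skipped via the fast path
checks_while_full = batch.admission_checks
batch.finish("a")
third = batch.try_admit(waiting)    # flag was cleared, so admission resumes
```

While the flag is set, repeated scheduling steps cost one boolean check instead of a full admission pass, which matters because the scheduler loop runs once per forward step.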

The SGLang v0.4 blog post reports that the zero-overhead scheduler improves throughput by about 1.1x over the previous SGLang version and by up to 1.3x over other baselines.
