[sglang] Introducing Piecewise CUDA Graph and a Torch Compile Backend to SGLang

October 12, 2025 | Updated: October 12, 2025

PR link: sgl-project/sglang#10062 | Status: Merged | Changes: +2706 / -19

Introduction

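In LLM inference serving, CUDA kernel launch overhead is a bottleneck that cannot be ignored. CUDA Graph is a key technique for addressing it, but when the model contains operations with dynamic shapes, such as attention, the entire forward pass cannot be captured as a single graph. This PR introduces a piecewise CUDA graph approach: it adds a backend to SGLang that splits graph capture at the attention operations and optimizes the remaining segments with torch.compile.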

Core Code Analysis

1. Registering the Attention Operation as a Custom Op

Previously, the attention forward was called directly. For torch.compile to trace the graph, attention has to be registered as an explicit custom op.

Before:

return forward_batch.attn_backend.forward(
    q, k, v, self, forward_batch, save_kv_cache, **kwargs,
)

After:

if forward_batch.forward_mode.is_extend() and get_forward_context() is not None:
    output = torch.zeros_like(q)
    torch.ops.sglang.unified_attention_with_output(
        q, k, v, output, save_kv_cache, self.layer_id
    )
    return output
else:
    return forward_batch.attn_backend.forward(
        q, k, v, self, forward_batch, save_kv_cache, **kwargs,
    )

unified_attention_with_output, registered through direct_register_custom_op, serves as the splitting point at which the graph traced by torch.compile is divided. A fake implementation is registered alongside it to make symbolic tracing possible.

2. Graph Splitting Logic (split_graph)

SGLangBackend.__call__ splits the FX graph at the attention op.

self.split_gm, self.piecewise_graphs = split_graph(
    graph, ["sglang.unified_attention_with_output"]
)

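The split_graph function walks each node of the FX graph and increments the subgraph ID whenever it encounters the attention op, cutting the graph into pieces. The key detail is keep_original_order=True: it keeps the nodes in their original order, which preserves mutation semantics for in-place writes such as the one into output.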
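
The ID-bumping walk can be sketched in plain Python over a list of node names. This is a simplified illustration, not the PR's split_graph: split_by_ops and the node names are hypothetical, and giving the attention op a subgraph of its own is an assumption about how the pieces are formed.

```python
from typing import Dict, List, Sequence, Tuple

ATTN = "sglang.unified_attention_with_output"


def split_by_ops(
    nodes: Sequence[str], splitting_ops: Sequence[str]
) -> Tuple[Dict[int, List[str]], List[int]]:
    """Assign each node to a subgraph, isolating every splitting op in its own piece.

    Returns (subgraph_id -> node names, ids of the subgraphs holding a splitting op).
    """
    pieces: Dict[int, List[str]] = {}
    split_ids: List[int] = []
    subgraph_id = 0
    for name in nodes:  # original node order is preserved
        if name in splitting_ops:
            # Close the current piece, give the splitting op its own id,
            # then start a fresh piece for whatever follows.
            subgraph_id += 1
            pieces.setdefault(subgraph_id, []).append(name)
            split_ids.append(subgraph_id)
            subgraph_id += 1
        else:
            pieces.setdefault(subgraph_id, []).append(name)
    return pieces, split_ids


# Hypothetical two-layer model, flattened into a node list.
nodes = [
    "norm0", "qkv0", ATTN, "o_proj0", "mlp0",
    "norm1", "qkv1", ATTN, "o_proj1", "mlp1",
]
pieces, split_ids = split_by_ops(nodes, [ATTN])

# Pieces without a splitting op are the candidates for compilation and capture.
compile_ids = [i for i in pieces if i not in split_ids]
```

With two attention calls, the walk yields three capturable pieces (before, between, and after the attention ops) and two attention-only pieces that stay outside the captured graphs.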

3. Enlarging the Custom AllReduce Buffer

In piecewise CUDA graph mode, the custom allreduce needs a larger buffer.

Before:

self.rank_data = torch.empty(
    8 * 1024 * 1024, dtype=torch.uint8, device=self.device
)

After:

if torch_compile is not None and torch_compile:
    ca_max_size = 256 * 1024 * 1024
else:
    ca_max_size = 8 * 1024 * 1024

The buffer grows 32x, from 8 MB to 256 MB, to prevent illegal CUDA memory accesses. The reason is that several subgraphs reference the allreduce buffer at the same time during piecewise graph capture.
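
As a quick check of the sizes involved, the branch above can be restated as a small standalone function. The helper name ca_max_size_bytes is hypothetical; only the two constants and the truthiness test on torch_compile come from the diff.

```python
from typing import Optional

MIB = 1024 * 1024  # bytes per MiB


def ca_max_size_bytes(torch_compile: Optional[bool]) -> int:
    """Mirror the branch in the diff: 256 MiB when torch_compile is truthy, else 8 MiB."""
    return 256 * MIB if torch_compile else 8 * MIB


small = ca_max_size_bytes(None)   # default path
large = ca_max_size_bytes(True)   # torch.compile / piecewise path
ratio = large // small
```

Here small is 8,388,608 bytes, large is 268,435,456 bytes, and ratio is 32, matching the 32x figure.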

4. PiecewiseCompileInterpreter

An interpreter that compiles each subgraph with Inductor and wraps it in a CUDAPiecewiseBackend.

self.module.__dict__[target] = CUDAPiecewiseBackend(
    submod,
    self.compile_config,
    self.inductor_config,
    self.graph_pool,
    index,
    len(self.compile_submod_names),
    sym_shape_indices,
    compiled_graph_for_dynamic_shape,
    self.sglang_backend,
)

Each piece is compiled independently and, at runtime, captured and replayed while sharing a CUDA graph pool.
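
The capture-then-replay behavior of the wrapped pieces can be mimicked with a toy class. ToyPiecewiseBackend is hypothetical and does not reproduce CUDAPiecewiseBackend: it assumes each piece captures once per configured input size, replays on later calls with that size, and falls back to a dynamic-shape path for other sizes. No real CUDA graphs are involved, so the toy simply re-runs the function where a real replay would relaunch the recorded kernels.

```python
from typing import Callable, Iterable, List, Set


class ToyPiecewiseBackend:
    """Hypothetical per-piece wrapper that records capture/replay decisions."""

    def __init__(
        self,
        fn: Callable[[List[float]], List[float]],
        capture_sizes: Iterable[int],
        shared_pool: List[str],
        index: int,
    ) -> None:
        self.fn = fn
        self.capture_sizes: Set[int] = set(capture_sizes)
        self.shared_pool = shared_pool  # one pool object shared by every piece
        self.index = index
        self.captured: Set[int] = set()
        self.log: List[str] = []

    def __call__(self, xs: List[float]) -> List[float]:
        size = len(xs)
        if size not in self.capture_sizes:
            self.log.append("eager")  # dynamic-shape fallback, nothing captured
        elif size not in self.captured:
            self.captured.add(size)
            self.shared_pool.append(f"piece{self.index}/size{size}")
            self.log.append("capture")
        else:
            self.log.append("replay")
        return self.fn(xs)


pool: List[str] = []
toy_pieces = [
    ToyPiecewiseBackend(lambda xs: [x + 1.0 for x in xs], [2, 4], pool, 0),
    ToyPiecewiseBackend(lambda xs: [x * 2.0 for x in xs], [2, 4], pool, 1),
]


def run(xs: List[float]) -> List[float]:
    # Pieces run back to back; in the real model attention would sit between them.
    for piece in toy_pieces:
        xs = piece(xs)
    return xs


first = run([1.0, 2.0])        # size 2: both pieces capture
second = run([3.0, 4.0])       # size 2 again: both pieces replay
third = run([1.0, 2.0, 3.0])   # size 3 is not a capture size: eager fallback
```

After these three calls each piece's log reads capture, replay, eager, and the shared pool holds one entry per piece for size 2.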

Why This Is Good

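Reduced kernel launch overhead: operations between attention calls, such as MLP and LayerNorm, are bundled into a single CUDA graph, which sharply cuts the number of launches.
Dynamic shape support: attention runs outside the graph, so variable sequence lengths can be handled.
Inductor optimizations: torch.compile optimizations such as operator fusion and memory planning are applied to each subgraph.
This brings a pattern already proven in vLLM to SGLang, so real-world performance gains are expected.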
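
The launch-count argument can be made concrete with back-of-the-envelope arithmetic. The numbers below are purely hypothetical (32 layers, 10 kernels per non-attention piece) and are not measurements from the PR. The model also assumes one launch per attention call, one graph launch per captured piece, and one more piece than there are attention calls.

```python
from typing import Dict


def launches_per_forward(num_layers: int, kernels_per_piece: int) -> Dict[str, int]:
    """Count launches per forward pass under the simplified model described above."""
    attention_calls = num_layers   # one attention op per layer
    pieces = num_layers + 1        # segments before, between, and after attention
    eager = pieces * kernels_per_piece + attention_calls
    piecewise = pieces + attention_calls  # one graph replay per piece
    return {"eager": eager, "piecewise": piecewise}


counts = launches_per_forward(num_layers=32, kernels_per_piece=10)
```

Under these assumptions the count drops from 362 launches to 65 per forward pass, and attention still runs outside the captured graphs.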

Summary

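Piecewise CUDA graph breaks the "capture everything vs. capture nothing" dichotomy.
The core is a three-stage pipeline: custom op registration + FX graph splitting + Inductor compilation.
When building an inference engine, an effective strategy is to make dynamic operations such as attention the graph boundaries and optimize everything else.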

References

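vLLM Piecewise CUDA Graph implementation: the original implementation that SGLang referenced
PyTorch torch.compile official documentation: how to write a torch.compile backend
CUDA Graphs overview: NVIDIA's official description of CUDA Graphs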

⚠️ Notice: This analysis was written by AI based on the actual code diff.
