[sglang] SGLang: Faster CUDA Graph Replay by Removing a Host-Device Synchronization

June 27, 2026 (updated: June 27, 2026)

PR link: sgl-project/sglang#29415 | Status: Merged | Changes: +35 / -10

Introduction

Pull Request (PR) #29415 in the SGLang project addresses a performance bottleneck in language model inference. It removes an unnecessary host-to-device (H2D) synchronization inside the _apply_cuda_graph_metadata function, with the goal of reducing overhead during CUDA graph replay and raising GPU utilization. This post walks through the code changes in the PR, explains why they improve performance, and draws out a general optimization lesson.

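Code Change Analysis

The core of this PR is twofold: it sets the needs_cpu_seq_lens attribute to False on the DeepseekSparseAttnBackend and DeepseekSparseAttnMultiStepBackend classes, and inside _apply_cuda_graph_metadata it replaces torch.tensor([num_draft_tokens] * bs, device=cuda) with torch.full((bs,), num_draft_tokens, ..., device=self.device).

1. `needs_cpu_seq_lens` attribute change

First, the attribute needs_cpu_seq_lens: bool = False was added to the DeepseekSparseAttnBackend class. It states explicitly that this backend does not need the host-side sequence lengths (seq_lens_cpu) during CUDA graph replay, so, as the code comment puts it, the backend can opt out of the D2H sync that would produce them.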

--- a/python/sglang/srt/layers/attention/dsa_backend.py
+++ b/python/sglang/srt/layers/attention/dsa_backend.py
@@ -303,6 +303,11 @@ def topk_transform(
 class DeepseekSparseAttnBackend(
     DeepseekSparseAttnBackendMTPPrecomputeMixin, AttentionBackend
 ):
+    # Decode/verify/draft graph replay rebuilds metadata from static buffers
+    # (page-table width) and never reads seq_lens_cpu / seq_lens_sum; opt out of
+    # the D2H sync. The eager fallback derives lengths from GPU seq_lens.
+    needs_cpu_seq_lens: bool = False
+
     def __init__(
         self,
         model_runner: ModelRunner,

The same change was made to the DeepseekSparseAttnMultiStepBackend class.

--- a/python/sglang/srt/layers/attention/dsa_backend.py
+++ b/python/sglang/srt/layers/attention/dsa_backend.py
@@ -2583,6 +2583,10 @@ def _compute_flashmla_metadata(self, cache_seqlens: torch.Tensor, seq_len_q: int
 
 class DeepseekSparseAttnMultiStepBackend:
 
+    # Per-step draft decode replays from precomputed GPU metadata; opt out so
+    # decide_needs_cpu_seq_lens' OR over the backends stays False.
+    needs_cpu_seq_lens: bool = False
+
     def __init__(
         self, model_runner: ModelRunner, topk: int, speculative_num_steps: int
     ):

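The decide_needs_cpu_seq_lens function ORs the needs_cpu_seq_lens values of the backends to reach its final decision. With both of these backends explicitly opting out, the result stays False, which prevents an unnecessary transfer of sequence lengths to the host.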
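
The OR-over-backends rule can be sketched in a few lines. This is only an illustration under stated assumptions, not SGLang's actual implementation: the two backend classes, the function name decide_needs_cpu_seq_lens_sketch, and the choice to treat a missing attribute as True are all hypothetical.

```python
from typing import Iterable


class HostLengthsBackend:
    """Hypothetical backend that still reads host-side sequence lengths."""

    needs_cpu_seq_lens: bool = True


class DeviceOnlyBackend:
    """Hypothetical backend that rebuilds its metadata purely on the GPU."""

    needs_cpu_seq_lens: bool = False


def decide_needs_cpu_seq_lens_sketch(backends: Iterable[object]) -> bool:
    # OR over all backends: one backend that needs the host copy is enough
    # to force it. Treating a missing attribute as True is an assumption
    # made here so that unknown backends stay on the safe side.
    return any(getattr(b, "needs_cpu_seq_lens", True) for b in backends)


all_opted_out = decide_needs_cpu_seq_lens_sketch(
    [DeviceOnlyBackend(), DeviceOnlyBackend()]
)
one_still_needs_it = decide_needs_cpu_seq_lens_sketch(
    [DeviceOnlyBackend(), HostLengthsBackend()]
)
```

The sketch shows why both classes had to be changed: under an OR, a single backend left at True would keep the host copy alive for everyone.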

2. Removing the H2D synchronization in `_apply_cuda_graph_metadata`

The most important change is how _apply_cuda_graph_metadata builds and uses extend_seq_lens_cpu. The previous code created a Python list, converted it to a PyTorch tensor, and copied it to the GPU:

Before:

            extend_seq_lens_cpu = [self.speculative_num_draft_tokens] * bs

            seqlens_expanded = seqlens_expand_triton(
                torch.tensor(
                    extend_seq_lens_cpu, dtype=torch.int32, device=self.device
                ),
                cache_seqlens,
                self.speculative_num_draft_tokens * bs,
                self.speculative_num_draft_tokens,
            )

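Calling torch.tensor(list, device=cuda) this way copies the list's data from host memory to GPU memory, which is an H2D transfer. As the comment in the new code notes, this is a pageable H2D copy that blocks the host on the whole queued stream, that is, until the work already queued on that stream has finished. On a path such as CUDA graph replay, where GPU work is already in flight, this introduces an unexpected stall. According to the PR description, the synchronization cost about 8.5 ms per step on GB200.

The revised code changes this part as follows.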

After:

            # Fill the constant per-req qo lengths (num_draft_tokens) on-device;
            # torch.tensor(list, device=cuda) does a pageable H2D copy that
            # blocks the host on the whole queued stream.
            extend_seq_lens = torch.full(
                (bs,),
                self.speculative_num_draft_tokens,
                dtype=torch.int32,
                device=self.device,
            )
            seqlens_expanded = seqlens_expand_triton(
                extend_seq_lens,
                cache_seqlens,
                self.speculative_num_draft_tokens * bs,
                self.speculative_num_draft_tokens,
            )

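The new code uses torch.full to create a tensor of identical values directly on the GPU. Unlike torch.tensor(list, device=cuda), torch.full((bs,), value, ..., device=self.device) does not stage the data in host memory; the tensor is allocated and filled on the device. This removes the H2D copy and its associated synchronization, which substantially reduces the latency incurred during CUDA graph replay.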
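
The two construction styles produce the same tensor, which is what makes the swap safe. The following is a minimal sketch, assuming PyTorch is installed; the helper names and the example values (bs = 4, three draft tokens) are hypothetical, and the code falls back to the CPU when no GPU is present, in which case it only demonstrates equivalence, not the synchronization behavior.

```python
import torch


def lengths_via_list(bs: int, num_draft_tokens: int, device: torch.device) -> torch.Tensor:
    # Old pattern: build a Python list on the host, then copy it to the device.
    return torch.tensor([num_draft_tokens] * bs, dtype=torch.int32, device=device)


def lengths_on_device(bs: int, num_draft_tokens: int, device: torch.device) -> torch.Tensor:
    # New pattern: allocate and fill the constant tensor on the target device.
    return torch.full((bs,), num_draft_tokens, dtype=torch.int32, device=device)


# Use the GPU when available; otherwise run on the CPU for illustration only.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

old_style = lengths_via_list(4, 3, device)
new_style = lengths_on_device(4, 3, device)
same_result = torch.equal(old_style, new_style)
```

The swap works here because every request has the same query length (num_draft_tokens), so there is no per-request host data that would have to be copied.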

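A related change was made in init_forward_metadata: when seq_lens_cpu is None, max_seqlen_k is now computed from the GPU-side seq_lens, which keeps the function consistent with needs_cpu_seq_lens=False. Calling .item() on a GPU tensor still synchronizes, but per the code comment only the eager fallback (e.g. over-capture-bs) takes this branch, because graph replay uses the static page-table width instead.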

--- a/python/sglang/srt/layers/attention/dsa_backend.py
+++ b/python/sglang/srt/layers/attention/dsa_backend.py
@@ -642,8 +647,13 @@ def init_forward_metadata(self, forward_batch: ForwardBatch):
 
         cache_seqlens_int32 = (forward_batch.seq_lens + draft_token_num).to(torch.int32)
         cu_seqlens_k = compute_cu_seqlens(cache_seqlens_int32)
-        assert forward_batch.seq_lens_cpu is not None
-        max_seqlen_k = int(forward_batch.seq_lens_cpu.max().item() + draft_token_num)
+        if forward_batch.seq_lens_cpu is not None:
+            max_seqlen_k = int(forward_batch.seq_lens_cpu.max().item() + draft_token_num)
+        else:
+            # needs_cpu_seq_lens=False nulls the host mirror for spec-v2 relay
+            # batches; graph replay uses the static page-table width, so only this
+            # eager (e.g. over-capture-bs) fallback needs a length here.
+            max_seqlen_k = int(forward_batch.seq_lens.max().item()) + draft_token_num
         # [b, max_seqlen_k]
         page_table = self.req_to_token_pool.req_to_token[
             forward_batch.req_pool_indices, :max_seqlen_k
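
The branch in the diff above can be isolated into a small function to check that both paths give the same length. This is a sketch under assumptions, not the PR's code: pick_max_seqlen_k and the example values are hypothetical, and PyTorch is assumed to be installed.

```python
from typing import Optional

import torch


def pick_max_seqlen_k(
    seq_lens: torch.Tensor,
    seq_lens_cpu: Optional[torch.Tensor],
    draft_token_num: int,
) -> int:
    if seq_lens_cpu is not None:
        # Host mirror available: reading it does not wait on the GPU.
        return int(seq_lens_cpu.max().item()) + draft_token_num
    # No host mirror: fall back to the device tensor. On a GPU, .item()
    # copies the value to the host and therefore synchronizes.
    return int(seq_lens.max().item()) + draft_token_num


# Use the GPU when available; otherwise run on the CPU for illustration only.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
seq_lens = torch.tensor([5, 9, 7], dtype=torch.int32).to(device)

with_mirror = pick_max_seqlen_k(seq_lens, seq_lens.cpu(), draft_token_num=4)
without_mirror = pick_max_seqlen_k(seq_lens, None, draft_token_num=4)
```

Both calls return the same value, so dropping the host mirror changes where the maximum is read from, not the result.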

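Why is this good?

The changes in this PR are beneficial for the following reasons:

Performance: The most direct effect is removing the unnecessary H2D synchronization overhead during CUDA graph replay. According to the PR description, this synchronization cost about 8.5 ms per step on GB200. With that delay removed, the GPU runs more efficiently and overall inference gets faster. The NVTX profiling results also show that the synchronization event has disappeared.
Higher GPU utilization: Host synchronization is one of the main reasons a GPU sits idle waiting for work. Removing it lets the GPU start the next operation sooner, which raises utilization and improves overall throughput.
Code clarity and maintainability: Setting needs_cpu_seq_lens explicitly to False makes it clear at the code level that these backends do not depend on CPU sequence lengths. This improves readability and reduces confusion when the related logic is modified or debugged later.
A generalizable lesson: This optimization fixes a pattern that is common in PyTorch GPU code. Building a tensor from host data with torch.tensor(list, device=cuda) can trigger an implicit synchronization. Where possible, create the tensor directly on the GPU with functions such as torch.full, torch.zeros, or torch.ones, or allocate it on the device with something like torch.empty_like, to avoid unnecessary H2D copies and synchronization.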

Performance numbers

Latency reduction: about 8.5 ms of synchronization delay per step removed on GB200.
Test result: the GSM8K benchmark score held at 0.985 with the optimization applied.

https://pytorch.org/docs/stable/generated/torch.full.html

⚠️ Note: This analysis was written by AI based on the actual code diff.
