[sglang] NPU compatibility fix: resolving the conflict between empty_cache and memory_saver

March 31, 2026 (updated: March 31, 2026)

PR link: sgl-project/sglang#21507 | Status: Merged | Changes: +5 / -3

Introduction

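SGLang supports Ascend NPUs in addition to NVIDIA GPUs. On NPU, memory has to be released with torch.npu.empty_cache() after the model weights are loaded, but the call's previous location conflicted with memory_saver_adapter.region. This PR resolves the conflict by moving the empty_cache call to a more suitable place.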

Core Code Analysis

1. Moving the empty_cache call

Before (loader.py):

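def load_weights_and_postprocess(model, weights, target_device):
    for module in model.modules():
        # Simplified excerpt: only modules that carry a quant method are post-processed
        quant_method = getattr(module, "quant_method", None)
        if quant_method is not None:
            with device_loading_context(module, target_device):
                quant_method.process_weights_after_loading(module)
            if _is_npu:
                # Removed by this PR: flushed the NPU cache after every module
                torch.npu.empty_cache()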

After (model_runner.py):

def load_model(self):
    # ... after model loading has finished
    if _is_npu:
        torch.npu.empty_cache()
    monkey_patch_vllm_parallel_state(reverse=True)

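Previously, empty_cache was called right after each module's weight post-processing, immediately following the device_loading_context block. Weight loading was still in progress at that point, so the memory_saver_adapter region was active and the call conflicted with it. The PR moves the call to the final stage of load_model, after all weights have been loaded, which avoids the conflict.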

2. Adding ascend to the backends without Triton support

Before:

def support_triton(backend: str) -> bool:
    return backend not in ["torch_native", "intel_amx"]

After:

def support_triton(backend: str) -> bool:
    return backend not in ["torch_native", "intel_amx", "ascend"]
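A hedged sketch of how a caller might branch on this check. `support_triton` is copied from the After version above; `pick_kernel_impl` and its return strings are hypothetical names introduced only for illustration and are not part of SGLang.

```python
def support_triton(backend: str) -> bool:
    # Copied from the After version: ascend joins the non-Triton backends.
    return backend not in ["torch_native", "intel_amx", "ascend"]


def pick_kernel_impl(backend: str) -> str:
    # Hypothetical helper: take a Triton path only when the backend supports it.
    return "triton" if support_triton(backend) else "native_fallback"
```

With this change, a call such as `pick_kernel_impl("ascend")` takes the fallback branch instead of attempting a Triton kernel.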

Why This Is Better

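Context manager safety: empty_cache is now called only after weight loading has finished, outside the loading context managers, so it no longer collides with memory region management.
Efficiency: empty_cache is called once instead of once per module, which reduces memory-management overhead on the NPU.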
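The difference in call count can be sketched with a mock. This is an assumption-laden toy: `per_module_flush` and `single_flush` are hypothetical names, a `Mock` stands in for `torch.npu.empty_cache`, and every module is assumed to trigger a flush in the old pattern.

```python
from unittest.mock import Mock


def per_module_flush(modules, empty_cache):
    # Old pattern: one flush per post-processed module.
    for _ in modules:
        empty_cache()


def single_flush(modules, empty_cache):
    # New pattern: post-process everything, then flush once.
    for _ in modules:
        pass
    empty_cache()


modules = [f"layer_{i}" for i in range(32)]
old_flush, new_flush = Mock(), Mock()
per_module_flush(modules, old_flush)
single_flush(modules, new_flush)
print(old_flush.call_count, new_flush.call_count)  # 32 1
```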

Summary

This is a small five-line PR, but it is an important fix: it resolves a subtle interaction between memory managers in NPU environments.

⚠️ Note: This analysis was written by an AI based on the actual code diff.
