[triton] Add a Reproduction Test for Proton CUPTI Graph Replay Heap Growth

March 31, 2026 | Updated: March 31, 2026

PR link: triton-lang/triton#9881 | Status: Merged | Changes: +1001 / -0

Introduction

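Memory leaks in GPU profiling tools can cause serious problems for long-running workloads. This PR adds a test script that systematically reproduces a steady increase in heap memory tied to NVIDIA's CUPTI library (specifically the cupti13 version). The growth occurs during CUDA graph replay while Proton profiling is enabled.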

Key Code Analysis

The test is designed so that the generic and Blackwell versions of CUPTI can be compared.

Core structure:

# Build the CUDA graph that will be replayed repeatedly
torch.cuda.synchronize()
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    for _ in range(graph_ops):
        add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=BLOCK_SIZE)
torch.cuda.synchronize()

# Replay the graph repeatedly under Proton profiling
proton.start(...)
for step in range(total_steps):
    g.replay()
    # Periodically capture a heap snapshot
    if should_checkpoint(elapsed):
        capture_heap_snapshot(output_dir, label, checkpoint_name)
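
The excerpt calls `should_checkpoint(elapsed)` without showing its body. As a rough illustration only, here is one way such scheduling could work, assuming each checkpoint fires once when the elapsed time first reaches a fixed offset. The class name `CheckpointSchedule`, its `due` method, and the mapping of `t1` and `t3` to one and three hours are assumptions made for this sketch, not code from the PR.

```python
class CheckpointSchedule:
    """Fire each named checkpoint once, the first time its offset is reached.

    Hypothetical helper for illustration; not taken from the PR.
    """

    def __init__(self, offsets_s):
        # offsets_s maps a checkpoint name to seconds since the run started.
        self._pending = sorted(offsets_s.items(), key=lambda item: item[1])

    def due(self, elapsed_s):
        """Return the names of checkpoints reached by elapsed_s and retire them."""
        fired = []
        while self._pending and self._pending[0][1] <= elapsed_s:
            fired.append(self._pending.pop(0)[0])
        return fired


# Assumed offsets: start of run, one hour in, three hours in.
schedule = CheckpointSchedule({"t0": 0, "t1": 3600, "t3": 3 * 3600})
```

Inside the replay loop, a caller would pass the elapsed wall-clock seconds to `schedule.due(...)` and take one snapshot per returned name, so a checkpoint is never captured twice even if the loop runs past its offset.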

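At each timing checkpoint (t0, t1, t3), the script collects a jemalloc heap profile along with /proc/self/status, smaps_rollup, and related data. Diffing the t+1h and t+3h heaps then makes it possible to trace the call stack responsible for the memory growth.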
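
To make the /proc part of this collection concrete, the sketch below reads the kB counters from `/proc/self/status` and `/proc/self/smaps_rollup` and computes the growth between two snapshots. It is a minimal illustration that assumes a Linux host. The function names `parse_kb_fields`, `read_memory_snapshot`, and `diff_snapshots` are hypothetical and do not come from the PR, and the jemalloc heap profile dump is not covered here.

```python
from pathlib import Path


def parse_kb_fields(text):
    """Parse 'Key:   123 kB' lines into {key: kilobytes}; skip everything else."""
    fields = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and len(parts) == 2 and parts[1] == "kB" and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    return fields


def read_memory_snapshot():
    """Collect kB counters for the current process; unreadable files are skipped."""
    snapshot = {}
    for name in ("status", "smaps_rollup"):
        try:
            text = (Path("/proc/self") / name).read_text()
        except OSError:
            continue  # Not on Linux, or the file is unavailable.
        for key, value in parse_kb_fields(text).items():
            snapshot[f"{name}.{key}"] = value
    return snapshot


def diff_snapshots(earlier, later):
    """Return per-field growth in kB for fields present in both snapshots."""
    return {key: later[key] - earlier[key] for key in earlier.keys() & later.keys()}
```

Taking one snapshot at t+1h and another at t+3h and passing both to `diff_snapshots` would show which counters, such as `status.VmRSS` or `smaps_rollup.Rss`, kept growing between the two checkpoints.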

Key stack identified:

at::cuda::CUDAGraph::replay
cuGraphLaunch
cudaGraphLaunch@@libcudart.so.13
cuptiEnableAllDomains@@libcupti.so.13
cuptiOpenMpInitialize_v2

This analysis was written by AI based on the actual code diff.
