[triton] Providing a default allocator for profile scratch

March 3, 2026 | Updated: March 3, 2026

PR link: triton-lang/triton#9596 | Status: Merged | Changes: +143 / -141

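Introduction

Triton's ConSan (Concurrency Sanitizer) uses profile_scratch memory. Previously, users had to register a separate allocator with triton.set_allocator() before they could use it, which made the tool less convenient. This PR introduces the createThirdPartyScratchAlloc helper so that the default allocator is used automatically.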

Core code analysis

1. Introducing the createThirdPartyScratchAlloc helper

Before:

from typing import Optional
import torch
import triton

def alloc_fn(size: int, alignment: int, stream: Optional[int]):
    return torch.empty(size, device="cuda", dtype=torch.int8)
triton.set_allocator(alloc_fn)

After (C++ internals):

gpu::GlobalScratchAllocOp
createThirdPartyScratchAlloc(OpBuilder &b, Location loc, Type ptrType,
                             int64_t sizeInBytes, int64_t alignment) {
  return gpu::GlobalScratchAllocOp::create(b, loc, ptrType, sizeInBytes,
                                           alignment, b.getUnitAttr());
}
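
To make the role of the helper concrete, here is a toy Python analogue. It is not Triton code: `ScratchAlloc`, its `third_party` field, `create_third_party_scratch_alloc`, and the two example buffers are hypothetical names. The sketch assumes that the trailing `b.getUnitAttr()` argument in the C++ helper sets a flag marking the allocation as instrumentation-owned, so routing every caller through one wrapper means no call site can forget that flag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScratchAlloc:
    """Toy stand-in for a scratch allocation op (hypothetical, not a Triton class)."""
    nbytes: int
    alignment: int
    third_party: bool = False  # assumed meaning of the unit attribute


def create_third_party_scratch_alloc(nbytes: int, alignment: int) -> ScratchAlloc:
    """Single entry point for instrumentation code: the flag is always set."""
    return ScratchAlloc(nbytes, alignment, third_party=True)


# Hypothetical call sites: each one passes only a size and an alignment.
barrier_state = create_third_party_scratch_alloc(1024, 16)
write_log = create_third_party_scratch_alloc(4096, 8)
```

The design point is the same as in the C++ version: the wrapper has fewer parameters than the underlying constructor, so the one argument that must never vary is no longer the caller's responsibility.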

2. Removing the allocator setup from the ConSan tests

Before:

def run_failing_kernel(device, enable_consan, mode):
    triton.set_allocator(alloc_fn)
    # ...

After:

def run_failing_kernel(device, enable_consan, mode):
    # no allocator setup needed
    # ...
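
The effect on callers can be pictured as a simple fallback. The sketch below is a toy model under stated assumptions, not Triton's runtime: `default_allocator` and `resolve_allocator` are hypothetical names, and a host `bytearray` stands in for device memory. Only the `(size, alignment, stream)` signature is taken from the `alloc_fn` shown earlier; how Triton actually selects its allocator is not shown in this post.

```python
from typing import Callable, Optional

# Same shape as alloc_fn in the "Before" snippet: (size, alignment, stream).
Allocator = Callable[[int, int, Optional[int]], bytearray]


def default_allocator(size: int, alignment: int, stream: Optional[int]) -> bytearray:
    # A zero-filled host buffer stands in for device memory in this toy model.
    return bytearray(size)


def resolve_allocator(registered: Optional[Allocator]) -> Allocator:
    # Fall back to the built-in allocator when nothing was registered.
    return registered if registered is not None else default_allocator


# Nothing registered: the instrumented launch still gets its buffer.
scratch = resolve_allocator(None)(256, 16, None)
```

With a fallback like this in place, the test body no longer needs a setup step, which is what the "After" version reflects.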

3. Test verifying that the profile scratch size is used

def test_consan_uses_profile_scratch(device, fresh_knobs):
    fresh_knobs.compilation.instrumentation_mode = "consan"
    compiled = failing_kernel.warmup(input, grid=(1, ))
    assert compiled.metadata.profile_scratch_size > 0
    assert compiled.metadata.global_scratch_size == 0
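
The two assertions in this test only make sense if the sizes are tracked in separate pools. A minimal sketch of that bookkeeping follows, assuming each request carries a size, an alignment, and a flag saying whether it belongs to instrumentation. `align_up`, `ScratchSizes`, and `account` are hypothetical; only the two field names are borrowed from the metadata checked in the test, and Triton's real allocation pass may compute sizes differently.

```python
from dataclasses import dataclass
from typing import List, Tuple


def align_up(offset: int, alignment: int) -> int:
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment


@dataclass
class ScratchSizes:
    global_scratch_size: int = 0
    profile_scratch_size: int = 0


def account(allocs: List[Tuple[int, int, bool]]) -> ScratchSizes:
    """Place (nbytes, alignment, is_profile) requests into two separate pools."""
    sizes = ScratchSizes()
    for nbytes, alignment, is_profile in allocs:
        if is_profile:
            start = align_up(sizes.profile_scratch_size, alignment)
            sizes.profile_scratch_size = start + nbytes
        else:
            start = align_up(sizes.global_scratch_size, alignment)
            sizes.global_scratch_size = start + nbytes
    return sizes


# A kernel whose only scratch requests come from instrumentation.
sizes = account([(100, 16, True), (64, 8, True)])
```

Because every request in this example is flagged as instrumentation, the profile pool grows while the global pool stays at zero, which mirrors the pair of assertions in the test.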

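Why this is good

Ease of use: using ConSan no longer requires setting up a separate allocator.
Clear separation: the roles of global scratch (user data) and profile scratch (instrumentation data) are now clearly distinguished.
Code cleanup: the GlobalScratchAllocOp::create calls that were scattered across four places are consolidated into a single helper.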

Summary

This PR removes the need for a separate allocator setup when instrumentation tools use profile scratch memory. It improves the user experience and tidies up the internal code at the same time.

References

triton-lang/triton#9596

This post was written with the help of AI and is an analysis based on the code changes in the original PR.
