[Triton] Add Proton profiler tests for tensor descriptors and two-CTA mode

December 23, 2025 | Updated: December 23, 2025

PR link: triton-lang/triton#9070 | Status: Merged | Changes: +93 / -8

Introduction

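Proton is Triton's built-in profiling tool; it analyzes per-operation performance inside a kernel. Triton recently gained tensor descriptor based TMA support and a two-CTA (2-CTA) mode, but there were no tests verifying that profiling works correctly with these new features.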

This PR adds Proton tests for kernels that use tensor descriptors and two-CTA mode.

Core code analysis

Tensor descriptor test

@triton.jit
def tensor_desc_kernel(desc, out_desc):
    """Tensor descriptor based TMA load/store kernel"""
    with proton.scope("tma_load"):
        data = tl.load(desc)
    with proton.scope("compute"):
        result = data * 2
    with proton.scope("tma_store"):
        tl.store(out_desc, result)

def test_proton_tensor_descriptor():
    # Verify that Proton correctly collects per-scope cycles
    # for a tensor descriptor kernel
    proton.activate()
    tensor_desc_kernel[grid](in_desc, out_desc)
    proton.deactivate()
    # Check that each scope is present in the profile result
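
The final step of the test, checking that each scope shows up in the profile result, can be sketched in plain Python. This is a minimal sketch rather than the PR's actual assertion code: the helper name `collect_scope_names`, the tree layout (`frame`, `metrics`, `children` keys), and the cycle values are all assumptions made for illustration, with a hand-written dictionary standing in for a real profile dump.

```python
import json


def collect_scope_names(node):
    """Return the set of frame names found anywhere in a nested profile tree."""
    names = set()
    stack = [node]
    while stack:
        current = stack.pop()
        frame = current.get("frame", {})
        if "name" in frame:
            names.add(frame["name"])
        # Visit children iteratively so deep trees do not hit the recursion limit.
        stack.extend(current.get("children", []))
    return names


# Hand-written sample (assumed layout, made-up numbers), not real Proton output.
sample_profile = json.loads("""
{
  "frame": {"name": "ROOT"},
  "children": [
    {
      "frame": {"name": "tensor_desc_kernel"},
      "children": [
        {"frame": {"name": "tma_load"}, "metrics": {"cycles": 120}, "children": []},
        {"frame": {"name": "compute"}, "metrics": {"cycles": 40}, "children": []},
        {"frame": {"name": "tma_store"}, "metrics": {"cycles": 95}, "children": []}
      ]
    }
  ]
}
""")

expected_scopes = {"tma_load", "compute", "tma_store"}
missing_scopes = expected_scopes - collect_scope_names(sample_profile)
```

In a test, asserting that `missing_scopes` is empty gives a failure message that names exactly which scope disappeared, which is more useful than a bare boolean check.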

Two-CTA mode test

def test_proton_two_cta():
    """Verify that profiling works in two-CTA mode"""
    # two-CTA mode: two CTAs cooperate to process a single tile
    # Test that Proton correctly collects and merges
    # the profiling data from both CTAs
    proton.activate()
    two_cta_kernel[grid](...)
    proton.deactivate()
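
To make "collects and merges the profiling data from both CTAs" concrete, here is a toy model in plain Python. It is not Proton's implementation: the record format `(cta_id, scope_name, start_cycle, end_cycle)`, the function name `merge_cta_records`, and the choice to sum cycles per scope are all assumptions for illustration, and the cycle numbers are invented.

```python
from collections import defaultdict


def merge_cta_records(records):
    """Sum per-scope cycle counts across CTAs (toy model, assumed record format).

    records: iterable of (cta_id, scope_name, start_cycle, end_cycle).
    Returns {scope_name: {"cycles": total_cycles, "ctas": sorted CTA ids}}.
    """
    totals = defaultdict(int)
    ctas = defaultdict(set)
    for cta_id, scope, start, end in records:
        if end < start:
            raise ValueError(f"scope {scope!r} on CTA {cta_id} ends before it starts")
        totals[scope] += end - start
        ctas[scope].add(cta_id)
    return {s: {"cycles": totals[s], "ctas": sorted(ctas[s])} for s in totals}


# Invented sample: two CTAs working on the same tile.
records = [
    (0, "tma_load", 100, 180),
    (1, "tma_load", 105, 175),
    (0, "compute", 180, 400),
    (1, "compute", 175, 410),
]
merged = merge_cta_records(records)
```

A check in this spirit would assert that every scope reports both CTA ids, since a scope seen on only one CTA suggests that half of the cooperating pair was dropped during collection.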

Improvements to the existing test structure

# Before: only a simple matmul kernel was tested
@pytest.mark.parametrize("mode", ["matmul"])
def test_proton_kernel(mode):
    ...

# After: tensor descriptor and two-CTA are included
@pytest.mark.parametrize("mode", [
    "matmul",
    "tensor_descriptor",
    "two_cta"
])
def test_proton_kernel(mode):
    ...
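
The body of such a parametrized test has to branch on `mode` somehow. One common way is a dispatch table, sketched below in plain Python without pytest so it runs on its own. The runner functions, the `MODE_RUNNERS` table, and the scope names returned for the `matmul` and `two_cta` modes are hypothetical placeholders; the PR's real test body is elided in the snippet above and is not reproduced here.

```python
def run_matmul():
    # Placeholder: a real runner would launch the kernel under the profiler.
    return {"matmul"}


def run_tensor_descriptor():
    # Scope names taken from the tensor descriptor kernel shown earlier.
    return {"tma_load", "compute", "tma_store"}


def run_two_cta():
    # Placeholder scope name, assumed for illustration.
    return {"two_cta"}


# Maps each parametrized mode string to the function that exercises it.
MODE_RUNNERS = {
    "matmul": run_matmul,
    "tensor_descriptor": run_tensor_descriptor,
    "two_cta": run_two_cta,
}


def run_mode(mode):
    """Look up and run the scenario for a mode, failing loudly on unknown modes."""
    try:
        runner = MODE_RUNNERS[mode]
    except KeyError:
        raise ValueError(f"unknown mode: {mode}") from None
    return runner()


results = {mode: run_mode(mode) for mode in MODE_RUNNERS}
```

Keeping the table next to the parameter list means that adding a new mode to one without the other fails immediately with a clear error instead of silently skipping the scenario.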

Why this is good

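Broader profiling coverage: it ensures that Proton works correctly with the new Triton features (TMA, 2-CTA).
Regression prevention: if a future TMA or cluster related change breaks profiling, these tests will catch it.
Reflects real usage patterns: tensor descriptor based kernels are what real high-performance kernels mostly use, so the profiling tests should cover them too.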

Summary

This PR adds tensor descriptor and two-CTA mode kernel tests for the Proton profiler, securing profiling coverage for the latest Triton features.

References

This post was written with the help of AI (Claude). The core code and explanations are based on the actual PR diff.
