[Triton] MXGEMM Gluon Kernel Update for GFX1250

March 18, 2026

Introduction

The MXGEMM (Microscaled GEMM) Gluon kernel for AMD gfx1250 received three improvements: the hip.init(0) workaround was removed, the TDM (Tensor Data Mover) now loads the whole tensor at once to reduce SGPR pressure, and the padded layout was fixed.

Key Code Analysis

Before: per-slice loads and subtile management

# Per-slice load
BLOCK_K_PACKED_A = BLOCK_K // self.DIV_FACTOR_A // NUM_SUBTILES_K
# ...
self.shared_layout_a = gl.constexpr(
    gl.PaddedSharedLayout.with_identity_for(
        [[BLOCK_K_PACKED_A, 16]],
        [BLOCK_M // NUM_SUBTILES_M, BLOCK_K_PACKED_A], [1, 0]))

Each subtile was loaded individually, and the shared memory layout was sized to match the subtile.

After: loading the whole tensor at once

BLOCK_K_PACKED_A = BLOCK_K // self.DIV_FACTOR_A
PAD_INTERVAL_A = 256 if BLOCK_K_PACKED_A <= 256 else BLOCK_K_PACKED_A

self.shared_layout_a = gl.constexpr(
    gl.PaddedSharedLayout.with_identity_for(
        [[PAD_INTERVAL_A, 16]], [BLOCK_M, BLOCK_K_PACKED_A], [1, 0]))

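The subtile split (// NUM_SUBTILES_K) is gone, and the whole block is loaded at once. The layout shape grows from the subtile size to [BLOCK_M, BLOCK_K_PACKED_A], and the padding interval becomes 256 when BLOCK_K_PACKED_A is at most 256, and BLOCK_K_PACKED_A otherwise. This became possible after an LDS indexing bug in the LLVM backend was fixed.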
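To make the difference concrete, the two parameter computations can be replayed in plain Python. This is only a sketch: it does not import Triton, the helper names (layout_before, layout_after, padded_offset) are hypothetical, the tile sizes used in the usage note are made-up examples, and padded_offset assumes that an [interval, padding] pair means "insert `padding` extra elements after every `interval` elements".

```python
# Pure-Python sketch of the layout parameters; no Triton import required.
# Helper names and the padding rule are assumptions for illustration only.

def layout_before(block_m, block_k, div_factor, num_subtiles_m, num_subtiles_k):
    """Old scheme: the shared layout is sized to a single subtile."""
    k_packed = block_k // div_factor // num_subtiles_k
    return {
        "interval": k_packed,  # pad interval equals the packed K extent
        "padding": 16,
        "shape": [block_m // num_subtiles_m, k_packed],
    }

def layout_after(block_m, block_k, div_factor):
    """New scheme: the shared layout covers the whole block."""
    k_packed = block_k // div_factor
    # Mirrors PAD_INTERVAL_A in the kernel: 256 unless the packed row is wider.
    interval = 256 if k_packed <= 256 else k_packed
    return {"interval": interval, "padding": 16, "shape": [block_m, k_packed]}

def padded_offset(offset, interval, padding):
    """Assumed rule: `padding` elements are inserted after every `interval` elements."""
    return offset + (offset // interval) * padding
```

With the hypothetical sizes BLOCK_M=128, BLOCK_K=256, DIV_FACTOR_A=2 and two subtiles per dimension, layout_before yields a [64, 64] shape with interval 64, while layout_after yields a [128, 128] shape with interval 256. Under the assumed padding rule, linear offset 256 then maps to 272.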

The pipeline schedule also changed:

# Before: lds and tdm+wmma in separate stages
with gl.amd.warp_pipeline_stage("lds", priority=1): ...
with gl.amd.warp_pipeline_stage("tdm+wmma", priority=0): ...

# After: tdm and lds merged, wmma alone in its own stage
with gl.amd.warp_pipeline_stage("tdm+lds", priority=1): ...
with gl.amd.warp_pipeline_stage("wmma", priority=0): ...
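The regrouping can be visualized with a small recorder that imitates the shape of the stage blocks. This is a sketch only: ScheduleRecorder is a hypothetical stand-in and not gl.amd.warp_pipeline_stage, the operation labels ("tdm_load", "lds_read", "wmma") are placeholders inferred from the stage names, and their order inside a stage is an assumption because the stage bodies are elided above.

```python
from contextlib import contextmanager

class ScheduleRecorder:
    """Hypothetical stand-in that records which ops are placed in which stage."""

    def __init__(self):
        self.stages = []

    @contextmanager
    def stage(self, name, priority):
        ops = []
        yield ops
        # Store the stage once its body has finished.
        self.stages.append((name, priority, tuple(ops)))

def schedule_before():
    rec = ScheduleRecorder()
    with rec.stage("lds", priority=1) as ops:
        ops.append("lds_read")
    with rec.stage("tdm+wmma", priority=0) as ops:
        ops.append("tdm_load")
        ops.append("wmma")
    return rec.stages

def schedule_after():
    rec = ScheduleRecorder()
    with rec.stage("tdm+lds", priority=1) as ops:
        ops.append("tdm_load")
        ops.append("lds_read")
    with rec.stage("wmma", priority=0) as ops:
        ops.append("wmma")
    return rec.stages
```

Comparing the two results shows the only structural change: the TDM load moves out of the priority-0 stage and joins the LDS read in the priority-1 stage, leaving the matrix instructions alone in their own stage.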