[triton] Add Tensor Async Scatter support to Gluon for AMD gfx1250

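Non-contiguous writes: data can be written to arbitrary rows selected by index, which makes scatter patterns efficient.
Asynchronous execution: the TDM engine transfers data in parallel with the GPU cores.
Full stack: the PR implements the whole stack, from the Python API through the MLIR op to the LLVM lowering.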

January 26, 2026 | Updated: January 26, 2026

PR link: triton-lang/triton#9299 | Status: Merged | Changes: +766 / -32

Introduction

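TDM (Tensor Data Movement) scatter is an operation that asynchronously writes data from shared memory to non-contiguous rows in global memory. It is useful, for example, when Flash Attention scatters results to varying sequence positions. This PR makes the feature available from the Gluon frontend.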
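To make the data movement concrete, here is a minimal pure-Python reference model of a row scatter. It is only a sketch of the semantics described above: the function name `scatter_rows_reference` and the list-of-lists representation are assumptions for illustration, and the model ignores the asynchronous execution and any hardware-specific bounds handling.

```python
def scatter_rows_reference(dst, src, row_indices, col_offset):
    """Write src[i] into dst[row_indices[i]], starting at column col_offset.

    Hypothetical reference model; not part of Triton or the PR.
    """
    if len(src) != len(row_indices):
        raise ValueError("need exactly one destination row index per source row")
    for src_row, dst_row in zip(src, row_indices):
        if not 0 <= dst_row < len(dst):
            raise IndexError(f"destination row {dst_row} is out of range")
        if col_offset < 0 or col_offset + len(src_row) > len(dst[dst_row]):
            raise IndexError("source row does not fit at the given column offset")
        # Each source row lands in an arbitrary (non-contiguous) destination row.
        dst[dst_row][col_offset:col_offset + len(src_row)] = src_row
    return dst


# A 6x4 stand-in for the global tensor and a 3x2 stand-in for the shared-memory tile.
global_tensor = [[0] * 4 for _ in range(6)]
shared_tile = [[1, 2], [3, 4], [5, 6]]

scatter_rows_reference(global_tensor, shared_tile, row_indices=[4, 0, 2], col_offset=1)
for row in global_tensor:
    print(row)
```

Source rows 0, 1, and 2 end up in destination rows 4, 0, and 2, and the rows in between stay untouched. That access pattern is what a contiguous block store cannot express.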

Key Code Analysis

Python API

@builtin
def async_scatter(desc: tensor_descriptor, dst_row_indices: ttgl.tensor,
                  dst_col_offset, src: shared_memory_descriptor,
                  mbarrier=None, _semantic=None) -> None:
    """Scatter data from shared memory to non-contiguous rows.
    
    Depending on the dtype of dst_row_indices:
    - int16: up to 16 rows at a time
    - int32: up to 8 rows at a time
    """
    ndim = len(desc.block_shape)
    assert ndim == 2, "TDM scatter only supports 2D tensors"
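The per-call row limits in the docstring mean that a longer index list has to be covered by several scatters. The sketch below shows one way a caller could chunk the indices. `split_row_indices` and `MAX_ROWS_PER_SCATTER` are hypothetical names, not part of the Gluon API, and the excerpt does not say whether the PR's lowering performs such splitting itself; this assumes the caller does it.

```python
# Per-transfer row limits quoted in the async_scatter docstring.
MAX_ROWS_PER_SCATTER = {"int16": 16, "int32": 8}


def split_row_indices(row_indices, index_dtype):
    """Split row_indices into chunks that each fit in one scatter.

    Hypothetical helper for illustration; index_dtype is a plain string here.
    """
    if index_dtype not in MAX_ROWS_PER_SCATTER:
        raise ValueError(f"unsupported index dtype: {index_dtype}")
    limit = MAX_ROWS_PER_SCATTER[index_dtype]
    return [row_indices[i:i + limit] for i in range(0, len(row_indices), limit)]


rows = list(range(20))
chunks_i16 = split_row_indices(rows, "int16")
chunks_i32 = split_row_indices(rows, "int32")
print([len(c) for c in chunks_i16])  # [16, 4]
print([len(c) for c in chunks_i32])  # [8, 8, 4]
```

With 20 destination rows, int16 indices need two transfers while int32 indices need three, so choosing the narrower index type halves the number of scatters whenever the row numbers fit in 16 bits.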

MLIR Op Definition

def AsyncTDMScatterOp : TT_AMDGPU_Op<"async_tdm_scatter"> {
  let arguments = (ins
    Arg<TT_TensorDescType, "", [MemWrite<GlobalMemory>]>:$desc,
    TensorOf<[I16, I32]>:$dst_row_indices,
    I32:$dst_col_offset,
    Arg<TTG_MemDescType, "", [MemRead<SharedMemory>]>:$src,
    Optional<TTG_MemDescType>:$barrier
  );
}