[Triton] Support scaled batched matrix multiply in the frontend

December 18, 2025 | Updated: December 18, 2025

PR link: triton-lang/triton#9000 | Status: Merged | Changes: +147 / -5

Introduction

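Triton's dot_scaled operation performs a scaled dot product on formats such as FP8 and FP4. An earlier PR (#8564) added dimension validation for the scale tensors, but it compared the full shape, so validation failed for BMM (Batched Matrix Multiply) operands, which carry a leading batch dimension. This PR changes the check to compare only the last two dimensions, which makes BMM operands valid.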

Core code analysis

Before: full-shape comparison

LogicalResult DotScaledOp::verify() {
  auto aShape = this->getA().getType().getShape();
  int64_t rank = aShape.size();
  // With rank >= 3, the batch dimension makes validation fail
  auto k = aShape[rank - 1];
  // ...
}

Scale shape validation error message: "lhs_scale must be a tensor of shape [32, 2]. Got ['32', '4']"

After: compare only the last two dimensions + rank check

LogicalResult DotScaledOp::verify() {
  auto aShape = this->getA().getType().getShape();
  int64_t rank = aShape.size();
  if (rank < 2)
    return this->emitError("operands must be at least 2D");
  // Validate k and the scale shape against the last two dimensions
  auto k = aShape[rank - 1];
  // ...
}

Scale shape validation error message: "lhs_scale must be a tensor of shape [..., 32, 2]. Got ['32', '4']"
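The check described above can be sketched in plain Python. This is an illustration, not the verifier code from the PR. The function names are hypothetical, and it assumes one scale per 32 elements along K (so a [32, 64] operand expects a [32, 2] scale, matching the error message) and ignores FP4 packing.

```python
GROUP_SIZE = 32  # assumption: one scale value per 32 elements along K


def expected_scale_tail(operand_shape, group_size=GROUP_SIZE):
    """Return the expected last two dims of the scale tensor (hypothetical helper)."""
    rank = len(operand_shape)
    if rank < 2:
        raise ValueError("operands must be at least 2D")
    # Only the trailing [M, K] dims matter; leading batch dims are ignored.
    m, k = operand_shape[-2], operand_shape[-1]
    return (m, k // group_size)


def check_lhs_scale(operand_shape, scale_shape, group_size=GROUP_SIZE):
    """Raise ValueError if the scale's last two dims do not match the operand."""
    expected = expected_scale_tail(operand_shape, group_size)
    if len(scale_shape) < 2 or tuple(scale_shape[-2:]) != expected:
        raise ValueError(
            f"lhs_scale must be a tensor of shape "
            f"[..., {expected[0]}, {expected[1]}]. Got {list(scale_shape)}"
        )


# 2D and batched 3D operands both pass, because only the tail is compared.
check_lhs_scale((32, 64), (32, 2))
check_lhs_scale((4, 32, 64), (4, 32, 2))
```

Comparing only the trailing dimensions leaves 2D behavior unchanged while letting any number of leading batch dimensions through.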

Batched MXFP matmul test

@triton.jit
def batched_mxfp_matmul(a_ptr, b_ptr, output_ptr, a_scale, b_scale,
                         M, N, K, ...,
                         BLOCK_BATCH_SIZE: tl.constexpr,
                         BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                         BLOCK_K: tl.constexpr):
    offs_batch = (batch_id * BLOCK_BATCH_SIZE
                  + tl.arange(0, BLOCK_BATCH_SIZE)) % BATCH_SIZE
    # 3D tensor: [BLOCK_BATCH_SIZE, BLOCK_M, BLOCK_K]
    a = tl.load(a_ptrs)
    b = tl.load(b_ptrs)
    scale_a = tl.load(a_scale_ptr)
    scale_b = tl.load(b_scale_ptr)
    accumulator = tl.dot_scaled(a, scale_a, "e5m2",
                                b, scale_b, "e5m2", accumulator)
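To see what a batched scaled matmul computes, here is a minimal NumPy reference. It is not the reference used in the PR's test. It assumes E8M0 scales (a uint8 exponent with bias 127), one scale per 32 elements along K, an lhs scale of shape [B, M, K/32], and an rhs scale of shape [B, N, K/32]. The operands are assumed to be already decoded to float32. All function names are hypothetical.

```python
import numpy as np

GROUP_SIZE = 32  # assumption: MX block size along K


def decode_e8m0(scale_u8):
    """Decode E8M0 scales: value = 2 ** (exponent - 127)."""
    return np.exp2(scale_u8.astype(np.float32) - 127.0)


def batched_scaled_matmul_ref(a, a_scale, b, b_scale):
    """Reference for out[i] = (a[i] * sa[i]) @ (b[i] * sb[i]).

    a:       [B, M, K] float32 (already decoded from FP8/FP4)
    a_scale: [B, M, K // GROUP_SIZE] uint8
    b:       [B, K, N] float32
    b_scale: [B, N, K // GROUP_SIZE] uint8 (assumed layout)
    """
    # Expand each scale to cover its group of GROUP_SIZE elements along K.
    sa = np.repeat(decode_e8m0(a_scale), GROUP_SIZE, axis=-1)  # [B, M, K]
    sb = np.repeat(decode_e8m0(b_scale), GROUP_SIZE, axis=-1)  # [B, N, K]
    sb = np.swapaxes(sb, -1, -2)                               # [B, K, N]
    # np.matmul treats leading dims as batch dims.
    return np.matmul(a * sa, b * sb)


B, M, N, K = 2, 4, 8, 64
a = np.ones((B, M, K), dtype=np.float32)
b = np.ones((B, K, N), dtype=np.float32)
a_scale = np.full((B, M, K // GROUP_SIZE), 128, dtype=np.uint8)  # scale 2.0
b_scale = np.full((B, N, K // GROUP_SIZE), 127, dtype=np.uint8)  # scale 1.0
out = batched_scaled_matmul_ref(a, a_scale, b, b_scale)
```

As in the kernel, the scales keep the operand's leading batch dimension, and only their last two dimensions are tied to M (or N) and K.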