[triton] Folding tcgen05.cp into the Generic Matrix Descriptor Lowering

October 2, 2025 | Updated: October 2, 2025

PR link: triton-lang/triton#8338 | Status: Merged | Changes: +154 / -290

Key Code Analysis

Before - separate descriptor loading logic

// Before: tcgen05.cp and MMA each built their SMEM descriptors independently
static DotOpMmaSmemLoader
build(Location loc, RewriterBase &rewriter, gpu::MemDescType memTy,
      Value smemBase, ArrayRef<unsigned> instrShape, int mmaVersion, ...);

After - unified pseudoinverse-based logic

// Compute llInv as a pseudoinverse so it can be reused across MMA versions
static DotOpMmaSmemLoader
build(Location loc, RewriterBase &rewriter, const LinearLayout &ll,
      int bitwidth, Value smemBase, ArrayRef<unsigned> instrShapeArray,
      int mmaVersion, ...);

auto llInv = toLinearLayout(memTy).pseudoinvert();
// FP4 handling
if (isFp4) {
  llInv = LinearLayout::identity1D(2, dims[trans ? 0 : 1], kOffset) * llInv;
  bitwidth /= 2;
}
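One way to read the FP4 branch is sketched below with a toy model. It assumes that the layout product places the left operand in the low bits, so the identity layout of size 2 adds a new lowest coordinate bit along the chosen dimension and shifts every existing offset up by one bit, which matches halving `bitwidth`. `ToyInvLayout`, `offsetOf`, and `unpackAlongCols` are hypothetical names for illustration only and are not part of Triton.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical toy model of an inverted layout: each coordinate bit
// contributes a fixed offset, and contributions are combined with XOR.
struct ToyInvLayout {
  std::vector<uint32_t> rowBases; // offset contributed by each row bit
  std::vector<uint32_t> colBases; // offset contributed by each column bit
};

uint32_t offsetOf(const ToyInvLayout &l, uint32_t row, uint32_t col) {
  uint32_t off = 0;
  for (std::size_t i = 0; i < l.rowBases.size(); ++i)
    if ((row >> i) & 1u)
      off ^= l.rowBases[i];
  for (std::size_t i = 0; i < l.colBases.size(); ++i)
    if ((col >> i) & 1u)
      off ^= l.colBases[i];
  return off;
}

// Assumed reading of `identity1D(2, colDim, offset) * llInv`:
// a new lowest column bit maps to offset bit 0, and all existing
// offsets are doubled, so offsets now count half-width elements.
ToyInvLayout unpackAlongCols(const ToyInvLayout &packed) {
  ToyInvLayout out;
  for (uint32_t b : packed.rowBases)
    out.rowBases.push_back(b << 1);
  out.colBases.push_back(1u);
  for (uint32_t b : packed.colBases)
    out.colBases.push_back(b << 1);
  return out;
}
```

For a 4x4 row-major tile of bytes (`rowBases = {4, 8}`, `colBases = {1, 2}`), the unpacked layout addresses a 4x8 tile of 4-bit elements, and element `(r, c)` lands at twice the byte offset of `(r, c / 2)` plus `c % 2`.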

Why this is better

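Less code: the PR removes 290 lines and adds 154, merging the duplicated descriptor generation logic into a single path.
Linear layouts: the descriptor is derived from the pseudoinverse of the memory layout rather than computed by hand, which avoids manual calculation errors.
Shared MMA v3/v5 path: MMAv3 (Hopper) and MMAv5 (Blackwell) both go through the same code path.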
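To make the pseudoinverse idea concrete, here is a minimal self-contained sketch. It is not Triton's `LinearLayout` implementation: `ToyLayout` and `toyPseudoinvert` are hypothetical names, and the layout is flattened to a single input and a single output. It assumes a layout is a linear map over GF(2), where each input bit contributes a basis vector and contributions are combined with XOR, and it recovers a map from outputs back to inputs by Gauss-Jordan elimination.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy linear layout over GF(2): bases[i] is the output produced by input
// bit i; the output for any input is the XOR of the bases of its set bits.
struct ToyLayout {
  std::vector<uint32_t> bases;

  uint32_t apply(uint32_t x) const {
    uint32_t out = 0;
    for (std::size_t i = 0; i < bases.size(); ++i)
      if ((x >> i) & 1u)
        out ^= bases[i];
    return out;
  }
};

// Returns `inv` such that inv.apply(a.apply(x)) == x when `a` is injective.
// `outBits` is the number of output bits of `a`.
ToyLayout toyPseudoinvert(const ToyLayout &a, int outBits) {
  // Each row tracks an output vector and the input combination producing it.
  struct Row {
    uint32_t vec;
    uint32_t comb;
    int pivot;
  };
  std::vector<Row> rows;
  for (std::size_t i = 0; i < a.bases.size(); ++i) {
    Row r{a.bases[i], 1u << i, -1};
    // Clear every existing pivot bit from the new row.
    for (const Row &k : rows)
      if ((r.vec >> k.pivot) & 1u) {
        r.vec ^= k.vec;
        r.comb ^= k.comb;
      }
    if (r.vec == 0)
      continue; // linearly dependent basis: nothing new to invert
    int p = 0;
    while (((r.vec >> p) & 1u) == 0)
      ++p;
    r.pivot = p;
    // Clear the new pivot bit from all earlier rows (Gauss-Jordan step).
    for (Row &k : rows)
      if ((k.vec >> p) & 1u) {
        k.vec ^= r.vec;
        k.comb ^= r.comb;
      }
    rows.push_back(r);
  }
  ToyLayout inv;
  inv.bases.assign(static_cast<std::size_t>(outBits), 0u);
  for (const Row &k : rows)
    inv.bases[k.pivot] = k.comb; // non-pivot output bits map to 0
  return inv;
}
```

For example, with `bases = {2, 1, 5}` and three output bits, applying the layout and then its toy pseudoinverse returns the original input for all eight inputs. The point of the sketch is only that the offset-to-coordinate relationship falls out of the layout mechanically, which is the property the bullet above attributes to the PR.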

Summary

This is a cleanup that generalizes SMEM descriptor generation through the pseudoinverse of a linear layout, so that tcgen05.cp and MMA share the same code.

References

PR #8338

This post was written with the help of AI (Claude) and is an analysis based on the code changes in the original PR.
