[Triton] Introducing PartitionedSharedEncodingAttr for AMD: reducing shared memory partition conflicts

February 2, 2026 | Updated: February 2, 2026

PR link: triton-lang/triton#9314 | Status: Merged | Changes: +622 / -64

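Introduction

GPU shared memory is physically divided into multiple partitions (banks). When several accesses hit the same partition at the same time, they are serialized as a bank conflict and performance drops. This PR adds PartitionedSharedEncodingAttr to Triton's AMD backend, which lets a tensor's data be explicitly spread across multiple physical partitions.

Key code analysis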

Before (Allocation.h)

// Value -> Explicit Buffer (a single buffer)
using ValueBufferMapT = llvm::MapVector<Value, BufferT *>;

BufferId getBufferId(Value value) const {
  if (valueBuffer.count(value)) {
    return valueBuffer.lookup(value)->id;
  } else {
    return InvalidBufferId;
  }
}

After (Allocation.h)

// Value -> Explicit Buffers (a vector, to support partitioned tensors)
using ValueBufferMapT = llvm::MapVector<Value, SmallVector<BufferT *>>;

SmallVector<BufferId> getBufferIds(Value value) const {
  SmallVector<BufferId> bufferIds;
  auto it = valueBuffer.find(value);
  if (it == valueBuffer.end())
    return bufferIds;
  for (auto *buffer : it->second) {
    bufferIds.push_back(buffer->id);
  }
  return bufferIds;
}

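The map is extended so that a single Value can own multiple physical buffers. As a consequence, a lookup for an unknown Value now returns an empty list instead of InvalidBufferId.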
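
To make the caller-side effect of this change concrete, here is a minimal standalone sketch. It is not the Triton code: `ValueKey` and `Buffer` are hypothetical stand-ins for `mlir::Value` and `BufferT`, and `std::map` / `std::vector` replace the LLVM containers so the snippet compiles without LLVM.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-ins: ValueKey replaces mlir::Value, Buffer replaces BufferT.
using ValueKey = std::string;
using BufferId = int;

struct Buffer {
  BufferId id;
};

// One value maps to a list of buffers (one entry per partition).
using ValueBufferMap = std::map<ValueKey, std::vector<Buffer *>>;

// Returns every buffer id registered for `value`.
// An unknown value yields an empty vector rather than a sentinel id.
std::vector<BufferId> getBufferIds(const ValueBufferMap &valueBuffer,
                                   const ValueKey &value) {
  std::vector<BufferId> ids;
  auto it = valueBuffer.find(value);
  if (it == valueBuffer.end())
    return ids;
  for (const Buffer *buffer : it->second)
    ids.push_back(buffer->id);
  return ids;
}
```

A non-partitioned value simply has a one-element list, so code that previously compared a single id against a sentinel would instead check whether the returned vector is empty and iterate over it.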

Partitioning strategy (TritonGPUAttrDefs.td)

numPartitions=2, numGroups=4, partitionDim=0 on a [128, 32] tensor:
- Buffer 0 (Partition 0): [Piece0 | Piece2 | Piece4 | Piece6]
- Buffer 1 (Partition 1): [Piece1 | Piece3 | Piece5 | Piece7]

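Buffers that belong to different partitions are always placed in different physical shared memory slots, so accesses that target different partitions cannot conflict with each other.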
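
The layout in the example above can be reproduced with a few lines of index arithmetic. The sketch below assumes that pieces are dealt to partitions round-robin (which matches the Piece0 / Piece2 / ... pattern shown) and that the partitioned dimension divides evenly into numPartitions * numGroups pieces. The function names are illustrative and do not come from the PR.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct PieceLocation {
  int partition; // which partition buffer holds the piece
  int slot;      // position of the piece inside that buffer
};

// Assumption: pieces are assigned to partitions round-robin.
PieceLocation locatePiece(int piece, int numPartitions) {
  assert(numPartitions > 0 && piece >= 0);
  return {piece % numPartitions, piece / numPartitions};
}

// Extent of one piece along partitionDim, assuming an even split.
int64_t pieceExtent(int64_t dimSize, int numPartitions, int numGroups) {
  int64_t numPieces = static_cast<int64_t>(numPartitions) * numGroups;
  assert(numPieces > 0 && dimSize % numPieces == 0);
  return dimSize / numPieces;
}

// Lists the piece indices stored in each partition buffer, in slot order.
std::vector<std::vector<int>> piecesPerPartition(int numPartitions,
                                                 int numGroups) {
  std::vector<std::vector<int>> result(numPartitions);
  for (int piece = 0; piece < numPartitions * numGroups; ++piece)
    result[locatePiece(piece, numPartitions).partition].push_back(piece);
  return result;
}
```

With numPartitions=2 and numGroups=4, `piecesPerPartition` yields {0, 2, 4, 6} for buffer 0 and {1, 3, 5, 7} for buffer 1. Under the even-split assumption, each piece of the [128, 32] tensor would cover 128 / 8 = 16 rows along dimension 0.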

Why this is good

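- Fewer bank conflicts: data in different partitions lives in physically different memory slots, so simultaneous accesses to different partitions do not collide.
- Extensible design: four parameters (numPartitions, numGroups, partitionDim, partitionLayout) are enough to express a range of partitioning strategies.
- Allocator integration: through the neighbors field, the allocator is forced to place the different partition buffers of the same Value in different physical partitions.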
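
The neighbors constraint can be pictured as a post-allocation check. The sketch below is only an illustration under a strong assumption: it models shared memory as equal, contiguous regions of `partitionBytes` each and classifies a buffer by its start offset alone. The real hardware mapping and allocator logic may differ, and the `Buffer` struct and helper names here are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Buffer {
  int id;
  std::size_t offset;         // byte offset assigned by the allocator
  std::vector<int> neighbors; // ids of sibling partition buffers of the same value
};

// Hypothetical hardware model: partition index = offset / partitionBytes.
std::size_t physicalPartition(std::size_t offset, std::size_t partitionBytes) {
  assert(partitionBytes > 0);
  return offset / partitionBytes;
}

// Returns true if no buffer shares a physical partition with any of its neighbors.
bool neighborsAreSeparated(const std::vector<Buffer> &buffers,
                           std::size_t partitionBytes) {
  for (const Buffer &buf : buffers) {
    for (int neighborId : buf.neighbors) {
      for (const Buffer &other : buffers) {
        if (other.id != neighborId)
          continue;
        if (physicalPartition(buf.offset, partitionBytes) ==
            physicalPartition(other.offset, partitionBytes))
          return false;
      }
    }
  }
  return true;
}
```

In this model, two sibling buffers at offsets 0 and 65536 pass the check with 64 KiB regions, while offsets 0 and 1024 fail it. An allocator honoring the constraint would have to reject or shift the second placement.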

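Summary

Shared memory bank conflicts are one of the major bottlenecks in GPU kernel performance. By letting a data distribution strategy be stated explicitly at the tensor level as an MLIR attribute, this PR lays the groundwork for compiler optimizations that exploit the hardware's physical memory layout.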

References

This article was written with the help of AI tools.
