[Triton] Reworking the AMD ds_read_tr lowering for b8/b16 types

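Instruction modeling: the lowering now models the instruction's behavior directly instead of transforming the output layout, which improves correctness.
Substantial code reduction: the diff is +322/-727, a net reduction of about 400 lines. The complex rotate/swap logic is gone.
Extensibility: support for PaddedSharedEncoding follows naturally from the new approach.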

November 7, 2025

Introduction

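On AMD GPUs, ds_read_tr is a dedicated instruction that transposes data while reading it from LDS (Local Data Share). The previous lowering worked by transforming the LinearLayout (LL) of the requested output, which required complex layout transformations for the b8/b16 types. This PR reworks the lowering to model the LL of the instruction itself, similar to how ldmatrix is handled.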

Key Code Analysis

Before

// Before: the 90-line chooseLLDsReadTrLayout rotated prefixes of the output LL
auto rotatePrefixes = [](BaseTy &regBase, std::size_t numReg,
                         BaseTy &laneBase, std::size_t numLane) {
    BaseTy baseUnit(laneBase.begin(), laneBase.begin() + numLane);
    llvm::append_range(baseUnit, ...);
    std::copy(baseUnit.begin(), baseUnit.begin() + numReg, regBase.begin());
    std::copy(baseUnit.begin() + numReg, baseUnit.end(), laneBase.begin());
};

This logic "rotated" the register and lane bases of the output LL, and b8 needed an additional swap on top of that.

After

// After: model the instruction's LL directly, with PaddedSharedEncoding support
auto paddedEnc = dyn_cast<PaddedSharedEncodingAttr>(srcTy.getEncoding());
LinearLayout cvtDstLL = LinearLayout::empty();
if (paddedEnc) {
    const auto &sharedLL = paddedEnc.getLinearComponent();
    cvtDstLL = toLinearLayout(dstTy).invertAndCompose(sharedLL);
} else {
    auto sharedLL = toLinearLayout(srcTy);
    cvtDstLL = toLinearLayout(dstTy).invertAndCompose(sharedLL);
}
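
To make the role of invertAndCompose in the snippet above more concrete, here is a minimal sketch of the underlying idea on a toy model. It assumes a layout is a GF(2)-linear map given by one basis value per input bit, and that the shared-memory layout is a bijection. ToyLayout and invertAndComposeToy are hypothetical names invented for this illustration; they are not Triton's LinearLayout API, and padding is not modeled at all. The goal is only to show that the result maps each register/lane index to the shared-memory offset holding the same tensor element.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical toy stand-in for a linear layout: bases[i] is the image of
// input bit i, and the map is linear over GF(2) (XOR).
struct ToyLayout {
    std::vector<uint32_t> bases;

    // Apply the map: XOR together the bases selected by the set bits of x.
    uint32_t apply(uint32_t x) const {
        uint32_t out = 0;
        for (std::size_t i = 0; i < bases.size(); ++i)
            if ((x >> i) & 1u) out ^= bases[i];
        return out;
    }
};

// Toy version of "self.invertAndCompose(outer)": returns a map C such that
// outer.apply(C.apply(x)) == self.apply(x) for every x.
// Assumption: outer is a bijection on its input range, so every image of a
// self basis has exactly one preimage. The search is brute force, which is
// fine for a small illustration only.
ToyLayout invertAndComposeToy(const ToyLayout &self, const ToyLayout &outer) {
    ToyLayout result;
    const uint32_t domain = 1u << outer.bases.size();
    for (uint32_t target : self.bases) {
        bool found = false;
        for (uint32_t y = 0; y < domain; ++y) {
            if (outer.apply(y) == target) {
                result.bases.push_back(y);
                found = true;
                break;
            }
        }
        if (!found)
            throw std::invalid_argument("outer layout does not cover target");
    }
    return result;
}
```

In this toy setting, self plays the role of toLinearLayout(dstTy) (register/lane bits to tensor index) and outer plays the role of sharedLL (shared-memory offset bits to tensor index). Because both maps are linear, it is enough to find the preimage of each basis vector; the result then agrees on every input.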