#Inline Assembly

2 posts

[Triton] Forcing mul.bf16x2 in the MXFP4→BF16 conversion: 1% MoE performance gain

Works around an LLVM auto-vectorization failure so that ptxas emits the HMUL2 instruction

#Triton #NVIDIA #Performance #PTX #Inline Assembly

December 11, 2025

[Triton] Fixing the WGMMA wait op to select its output constraint by type

Uses =h instead of the incorrect =r constraint for 16-bit types such as f16, removing unnecessary cvt instructions

#Triton #NVIDIA #Bug Fix #Inline Assembly #WGMMA

October 29, 2025