#Pipelining

5 posts

[triton] Fixing an OOM bug in Padded Layout Async Copy on AMD GFX950

Analyzes a fix that handles the case where the padding interval is larger than the contiguous dimension at small tile sizes, preventing OOM during pipelining.

#Triton #AMD #GPU #GFX950 #Pipelining #BugFix

February 18, 2026

[Triton] Limiting WGMMA rs-dot splitting to 2: a 1% MoE performance gain

Optimizes in-register pipelining by fixing the number of K-dimension splits at 2 instead of K/instrK.

#Triton #NVIDIA #Performance #WGMMA #Pipelining

January 7, 2026

[Triton] Fixing the barrier placement logic in SWP loop lowering

Determines the barrier position between an MMA's non-pipelined operand and tmem_load accurately, based on the linearized schedule.

#Triton #NVIDIA #Pipelining #SWP #Bug Fix

December 22, 2025

[Triton] Fixing a missing wait insertion in WGMMA register pipelining

Fixes a bug where the wgmma wait required for accumulator access in the persistent matmul epilogue was missing.

#Triton #NVIDIA #MLIR #Bug Fix #Pipelining

December 11, 2025

[Triton] Nested loop support for Warp Specialization

Extends the partition-schedule pass recursively and hoists tmem_alloc to the top level, enabling end-to-end support for nested loops.

#Triton #NVIDIA #Warp Specialization #Nested Loop #Pipelining

December 2, 2025