#Triton

305 posts

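[sglang] SGLang DFlash Optimization: Improving Inference Performance by Removing Host-Device Synchronization

Removes DFlash's per-step host synchronization so the CPU can run one step ahead, achieving up to 13% higher token throughput.

#SGLang #LLM #CUDA #Optimization #Triton

July 18, 2026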

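[axolotl] Axolotl's SinkGD Optimization: Maximizing Performance with Triton Kernels and Spectral Normalization

An analysis of the training efficiency and speed gains from the Triton kernels, spectral normalization, and MD-sphere technique introduced in the SinkGD optimizer.

#Axolotl #SinkGD #Triton #DeepLearning #Optimization

July 12, 2026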

[sglang] SGLang MoE Shared Expert Optimization: Fusing 3 Kernels into 1 to Eliminate GPU Overhead

Improves performance in SGLang by fusing the three GPU kernels used for MoE shared-expert processing into one.

#SGLang #MoE #Kernel Fusion #Triton #GPU Optimization #AMD AITER

July 8, 2026

[triton] Triton: TMEM Load-Reduce Fusion Optimization for the Blackwell Architecture

Improves performance on Blackwell sm103+ GPUs by fusing the TMEM load and row reduction into a single PTX instruction.

#Triton #Blackwell #GPU #Optimization #Compiler

July 7, 2026

[vllm] [vLLM] Boosting Unlimited-OCR Performance 3.7x with Triton Kernel Optimization: An Efficient R-SWA Implementation

Analyzes how implementing Unlimited-OCR's R-SWA mask directly in the TritonAttention backend achieved up to a 3.7x speedup over FlexAttention.

#vLLM #Triton #LLM Optimization #Attention #R-SWA #OCR

July 3, 2026

[triton] Triton Kernel Optimization: Removing Unnecessary Tensor Memory Allocation

Analyzes a case that improved memory efficiency by fixing the creation of large intermediate tensors in Triton's reduce_launch_metadata.

#Triton #GPU #Optimization #MemoryManagement #DeepLearning

July 2, 2026

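[sglang] [NPU] GLM-4.7-Flash Performance Optimization: Resolving Compute Bottlenecks with a Fused Triton Kernel

Significantly improves NPU inference performance of the GLM-4.7-Flash model by introducing a fused kernel that combines the Split and RMSNorm operations.

#NPU #Triton #Optimization #DeepSeek-V2 #SGLang #LLM Inference

June 30, 2026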

[vllm] vLLM GLM5.2 Performance Optimization: Improving E2E Throughput via Triton Kernel Fusion

Improves inference performance by up to 3.3% by using Triton kernel fusion to combine Q RoPE, FP8 quantization, and scale folding.

#vLLM #Triton #LLM #Optimization #FP8

June 27, 2026

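[sglang] Maximizing Qwen3.5 Performance in SGLang: An Analysis of the Fused QK GemmaRMSNorm + RoPE Kernel Optimization

Introduces an optimization that consolidates the Qwen3.5 attention-layer operations into a Triton kernel, reducing memory accesses and improving inference performance by up to 9.4%.

#SGLang #Triton #LLM #Optimization #Qwen3.5

June 25, 2026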

[sglang] SGLang: An Analysis of DeepSeek-V4 Performance Optimization on AMD GPUs

A case study of optimizing MLA GEMM and RoPE operations on AMD GPUs, improving inference performance by up to 8.8%.

#SGLang #AMD #DeepSeek-V4 #Triton #GEMM #RoPE

June 20, 2026

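[triton] Triton Autotuner Optimization: Skipping Unnecessary Benchmarks When Only One Pruned Config Remains

Analyzes a case that improved performance by skipping the unnecessary benchmarking step when the Triton Autotuner's configs are pruned down to one.

#Triton #Autotuner #Performance #Optimization #Compiler

June 18, 2026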

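[sglang] AMD GPU Optimization: Improving Qwen2 MoE Shared-Expert Gating Performance via Triton Kernel Fusion

An analysis of a PR that improved performance by fusing the shared-expert gating operations of the Qwen2 MoE model into a Triton kernel on AMD GPUs.

#AMD #Triton #Triton Kernel Fusion #Qwen2 MoE #Performance Optimization #SGLang

June 16, 2026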

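[triton] Fixing a Race Condition in Triton's AMD StreamK GEMM Kernel: An Analysis of the Synchronization Logic Optimization

Analyzes the code changes that fixed a synchronization defect (race condition) in the StreamK GEMM kernel on AMD GPUs and restored stability.

#Triton #AMD #GEMM #StreamK #GPU #Concurrency

June 13, 2026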

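[triton] Optimizing i8 Matrix Multiplication in Triton: Reducing Register Pressure and Improving Performance

Analyzes an optimization that reduces register pressure and improves performance in Triton's i8 matrix multiplication.

#Triton #AI #Optimization #Matrix Multiplication #GPU

June 12, 2026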

[sglang] Qwen3-Next FP8 MoE Optimization in SGLang: Shared-Expert Fusion for H200

An analysis of Shared-Expert Fusion and Triton kernel optimizations for maximizing Qwen3-Next FP8 MoE performance on H200.

#SGLang #LLM #MoE #FP8 #Triton #H200

June 11, 2026

[Paper Review] Flash-GMM: A Memory-Efficient Kernel for Scalable Soft Clustering

This paper addresses the out-of-memory (OOM) problems and excessive HBM bandwidth requirements that arise when training GMMs on large datasets.

#Review #Gaussian Mixture Models #GMM #Triton #IVF #Approximate Nearest Neighbor #Memory-Efficient #Soft Clustering

June 11, 2026

[triton] [AMD Triton] Avoiding the LLVM InstCombine Trap: TDM Tensor Clamping Optimization

Examines an improvement to the TDM descriptor generation logic that prevents the unnecessary VALU operations and v_readfirstlane overhead caused by LLVM's InstCombine.

#Triton #AMD #LLVM #GPU #Optimization #Codegen

June 8, 2026

[axolotl] ScatterMoE LoRA Optimization: Implementing Grouped-Gram and Sync-Free Backpropagation

Improves performance by up to 2.2x by introducing a Grouped-Gram kernel and a synchronization-free backward path to remove bottlenecks in LoRA training of large MoE models.

#PyTorch #Triton #MoE #LoRA #PerformanceOptimization

June 7, 2026

[sglang] [SGLang] An Analysis of the Triton Kernel Optimization That Boosts Diffusion Attention Performance 7x on Blackwell (B200)

Overcomes the mask-handling limitations of PyTorch SDPA with Triton kernel fusion and Varlen FlashAttention, achieving up to a 21% performance improvement on B200.

#Triton #FlashAttention #Diffusion #CUDA #Performance Optimization #SGLang

May 28, 2026

[triton] [Triton] An Analysis of the Refined Shared Memory Calculation That Improved Persistent Matmul Performance by 13%

Analyzes a case that improved the shared memory calculation heuristic to enable 4-stage pipelining in TF32 matmul, raising GB200 performance by 13%.

#Triton #GPU #CUDA #Matmul #Optimization #Deep Learning

May 27, 2026

[vllm] [vLLM] Resolving Compatibility Issues for W4A16 Quantized Models: Implementing a CUDA Fallback with Triton Kernels

Adds a Triton kernel fallback so that W4A16 models that could not run because of the Marlin kernel's alignment constraints are now supported on CUDA.

#vLLM #CUDA #Triton #Quantization #LLM Inference #W4A16

May 27, 2026

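[vllm] vLLM DeepSeek V4 ROCm MTP Support: Hardware Optimization and Improved Inference Performance

An analysis of a vLLM PR that significantly improved inference performance by adding ROCm MTP support for the DeepSeek V4 model.

#vLLM #ROCm #DeepSeekV4 #MTP #SpeculativeDecoding #Triton #FP8 #Optimization

May 24, 2026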

[triton] Triton Reduce Kernel Performance Optimization: Introducing Subtiling and RowIdxs

An analysis of the code changes that introduced subtiling and rowidxs to improve the performance of the Triton reduce kernel.

#Triton #Performance Optimization #CUDA #Deep Learning #Kernel Tuning

May 24, 2026

[LlamaFactory] LlamaFactory Adopts a Triton-Based Fused MoE Kernel: Over 40% Performance Improvement

A fused MoE kernel implemented in Triton dramatically improves training speed for MoE models such as Mixtral.

#LlamaFactory #Triton #MoE #DeepLearning #Optimization

May 20, 2026

[sglang] Triton Kernel Fusion Optimization for Improving NPU Performance of Qwen3.5 and Qwen3_Next Models

An analysis of a Triton kernel fusion optimization that maximizes attention-layer performance for Qwen3.5 and Qwen3_Next models on NPUs.

#NPU #Triton #Kernel Fusion #Optimization #Qwen3.5 #Qwen3_Next #LLM

May 20, 2026

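[triton] An Analysis of a Triton PR That Optimized Performance by Eliminating Unnecessary Warp Loads on AMD GPUs

Analyzes an optimization that prevents unnecessary data loads on AMD GPU architectures, reducing VGPR usage by up to 35%.

#Triton #AMD GPU #Optimization #LLVM #Compiler

May 19, 2026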

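[sglang] Adding Fused MoE Triton Kernel Support for DeepSeekV4: A Performance Optimization Analysis

An analysis of a PR that improved inference performance by adding fused MoE Triton kernel support for the DeepSeekV4 model.

#AI #LLM #Optimization #Triton #DeepSeekV4 #MoE

May 18, 2026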

[Paper Review] AgentKernelArena: Generalization-Aware Benchmarking of GPU Kernel Optimization Agents

This work starts from the observation that although GPU kernel optimization is central to the efficiency of deep learning systems, existing benchmarks do not cover it adequately.

#Review #GPU Kernel Optimization #AI Coding Agents #Generalization #Performance Benchmarking #Triton #HIP #LLM Evaluation

May 18, 2026

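[sglang] LTX2 Split Rotary Kernel Optimization: 2x Performance Improvement with Head Batching

An analysis of a code optimization that doubled performance by introducing head batching in the LTX2 split rotary kernel.

#Triton #Performance Optimization #LLM Kernel #RoPE #SGLang

May 16, 2026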

[sglang] Optimizing MLA KV Cache Writes in SGLang: Introducing TMA Bulk-Store

A technical walkthrough of how TMA Bulk-Store and Triton kernel optimizations improved MLA KV cache write performance by up to 12x.

#SGLang #CUDA #Triton #LLM #Optimization #TMA

May 15, 2026

[triton] Triton Kernel Optimization: Accelerating Reduction with Mask Sorting

Analyzes a technique that sorts rows by mask and tightens loop bounds to cut unnecessary loop iterations in Triton's reduction operations.

#Triton #GPU Optimization #Deep Learning #CUDA #Kernel Programming

May 15, 2026

[vllm] Introducing Tensor Descriptor Optimization to vLLM's Triton Unified Attention Kernel

Introduces tensor descriptors into vLLM's Triton unified attention kernel to improve 2D block read performance on Intel XPU.

#vLLM #Triton #Optimization #Deep Learning #LLM

May 13, 2026

[flashinfer] FlashInfer Mamba SSU Kernel Optimization: A Performance Breakthrough via Async State Prefetching and Vectorized Loads

FlashInfer's Mamba SSU kernel achieved dramatic performance gains through async state prefetching, vectorized loads, and related changes.

#FlashInfer #Mamba #SSU #Kernel Optimization #Triton #CUDA #Performance

May 13, 2026

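[vllm] vLLM Mamba2 SSD Kernel Warmup: The Secret Behind a 91% Reduction in First-Request Latency

An analysis of the Triton kernel warmup optimization that cut first-request latency for vLLM Mamba2 models by 91%.

#vLLM #Mamba2 #Triton #Kernel Optimization #Latency Reduction #Deep Learning Inference

May 12, 2026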

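[vllm] vLLM DeepSeek-V4 K Cache Kernel Optimization: Improving Performance by Adopting CuteDSL

An analysis of a PR that improved performance by raising the memory bandwidth utilization of the K cache kernel for the DeepSeek-V4 model in vLLM.

#vLLM #DeepSeek-V4 #Performance Optimization #GPU Kernel #CuteDSL #Triton

May 11, 2026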

[sglang] Optimizing SGLang's MHC Pipeline: Introducing Kernel Fusion and DeepGemm

Improves performance in the MHC pipeline by using kernel fusion and DeepGemm to maximize compute efficiency and minimize HBM accesses.

#SGLang #CUDA #Triton #DeepGemm #Optimization

May 10, 2026

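[sglang] SGLang: Fixing a MoE Performance Regression Caused by a Triton Version Upgrade and Automating Config Selection

How to resolve the Triton version compatibility issue that appeared after the PyTorch 2.11 upgrade and make the MoE kernel config lookup logic dynamic to prevent performance regressions.

#SGLang #Triton #DeepSeek #MoE #PerformanceOptimization

May 9, 2026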

[sglang] SGLang Performance Optimization: Accelerating DSV3.2/GLM-5 with PDL and Safe CUDA Synchronization

Improves inference latency and ensures stability by introducing PDL (Programmatic Dependent Launch) and fixing memory barriers in CUDA kernels.

#CUDA #SGLang #Performance Optimization #LLM Inference #Triton

May 9, 2026

[flashinfer] An Analysis of MoE Short-Decode Optimization for NVIDIA Blackwell SM120

Examines the micro-kernel optimizations in FlashInfer's SM120 MoE kernel update that maximize single-token decoding performance.

#CUDA #MoE #Blackwell #Performance #Triton

May 7, 2026

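[sglang] Improving HunyuanVideo VAE Decoding Performance: GroupNorm SiLU Kernel Optimization

An analysis of a Triton kernel optimization that dramatically improved the performance of the GroupNorm SiLU operation during HunyuanVideo VAE decoding.

#AI #Deep Learning #Optimization #Triton #HunyuanVideo #VAE

May 2, 2026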

[vllm] [vLLM] An Analysis of MLA Optimizations for Maximizing DeepSeek-V2/V3 Performance on ROCm

Examines KV cache layout shuffling, FP8 Sparse MLA support, and metadata builder optimizations for improving MLA performance of DeepSeek models on ROCm.

#vLLM #ROCm #DeepSeek #MLA #Performance Optimization #Triton

May 1, 2026

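[vllm] Fixing vLLM's First-Inference Latency Problem: forward_native Sampler Kernel Warmup Optimization

Analyzes a case that resolved the JIT compilation delay introduced by adopting FlashInfer in the vLLM v1 engine by improving the sampler warmup logic.

#vLLM #LLM #Triton #Performance #JIT

May 1, 2026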

[vllm] Fixing the Hidden State (h) Layout Mismatch Bug in vLLM's chunk_kda Kernel and Improving Accuracy

Fixes an h-matrix layout mismatch bug in vLLM's chunk_kda kernel, significantly improving model accuracy.

#vLLM #CUDA #Triton #Kernel #Bugfix #Deep Learning #Optimization

April 30, 2026

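[triton] Optimizing Ragged Matmul Metadata Computation in Triton: Efficient Profiling without CPU Synchronization

Dramatically reduces overhead by consolidating ragged matmul metadata computation from many Torch kernels into a single Triton kernel.

#Triton #GPU #Performance #Profiling #Matmul

April 29, 2026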

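[sglang] Performance Optimization on AMD ROCm: Implementing Fused QK GemmaRMSNorm with Triton

Analyzes a case that improved QK normalization performance on the ROCm platform by consolidating four separate kernels into a single Triton kernel.

#SGLang #Triton #ROCm #Performance Optimization #LLM

April 25, 2026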

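[triton] An Analysis of Performance Optimization through Autotuning of the Triton Gluon Attention Kernel

Introduces autotuning logic that dynamically selects kernel configs in the Triton Gluon example, improving performance across a range of scenarios.

#Triton #GPU #Optimization #Attention #DeepLearning

April 23, 2026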

[sglang] SGLang Triton Kernel Optimization: Adopting libdevice.tanh and Supporting 2D Strided Tensors

Analyzes a case that adopted libdevice.tanh to resolve numerical instability in a Triton kernel and restructured it to support 2D strided tensors.

#Triton #CUDA #LLM #SGLang #Optimization #DeepLearning

April 22, 2026

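[vllm] vLLM Optimizes the Gemma4 Routing Function with a Triton Kernel for a Major Performance Gain

vLLM significantly improved serving performance by optimizing the Gemma4 model's routing function with a Triton kernel.

#vLLM #Gemma4 #Triton #Optimization #Performance Improvement #AI Model Serving

April 19, 2026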

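[triton] Triton AMD Kernel Optimization: Improving Performance through Better TDM Load Pipelining

A case study of an optimization that adjusts TDM load timing in Triton's AMD gfx1250 GEMM kernel to maximize pipeline efficiency.

#Triton #AMD #GPU #Optimization #GEMM #HPC

April 18, 2026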

[vllm] vLLM TurboQuant: Maximizing LLM Serving Efficiency with KV Cache Compression

vLLM's TurboQuant compresses the KV cache to reduce memory usage and improve LLM serving efficiency.

#vLLM #LLM #KV Cache #Quantization #Optimization #Triton #GPU Memory

April 15, 2026

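[sglang] [AMD] An Analysis of Qwen3.5 MoE Routing Optimization via Triton Kernel Fusion

Examines an optimization that consolidates four kernel calls into a single Triton kernel, improving serving performance of the Qwen3.5 MoE model by up to 4.16%.

#Triton #MoE #Qwen3.5 #Kernel-Fusion #SGLang #AMD

April 15, 2026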

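[triton] Speeding Up Triton Tests: Switching from Python Loops to Vectorized NumPy

An analysis of a PR that cut a slow Triton test from 200 seconds to 3.3 seconds by switching from Python loops to vectorized NumPy.

#Triton #Optimization #Testing #NumPy #Performance

April 14, 2026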

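[SGLang] LoRA Backends: Comparing the PyTorch, Triton, and Chunked Implementations

Analyzes SGLang's LoRA backends. Compares the implementation and performance differences of the three backends (the PyTorch baseline, the Triton-optimized version, and chunked batch processing) with code.

#sglang #LoRA Backend #PyTorch #Triton #Chunked

April 13, 2026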

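[SGLang] Fused MoE (Triton): Fusing Routing and Expert Computation

Analyzes SGLang's fused MoE Triton implementation. Walks through the structure that fuses routing and expert GEMM into a single kernel, the 200+ pre-tuned configs, and the memory optimizations, with code.

#sglang #Fused MoE #Triton #Expert Fusion #GEMM

April 12, 2026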

[SGLang] Triton Attention Kernels: Writing GPU Kernels in Python

Analyzes SGLang's Triton attention backend. Covers the advantages of writing GPU kernels in Python with Triton and the kernel implementations for each of the prefill, decode, and extend stages, with code.

#sglang #Triton #GPU Kernel #Attention Kernel

April 11, 2026

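[vllm] vLLM Performance Optimization: Improving Inference Throughput by Resolving the H2D Memory Copy Bottleneck

Optimizes inference performance for multimodal models by using a caching strategy to eliminate unnecessary host-to-device (H2D) memory transfers in the Triton attention kernel.

#vLLM #CUDA #Performance #Triton #DeepLearning

April 10, 2026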

[vllm] Introducing a Triton-Based W4A16 Kernel for AMD ROCm: An Analysis of MI300X Performance Optimization

vLLM adds a ROCm-specific Triton W4A16 kernel, achieving up to a 122% performance improvement on MI300X.

#vLLM #ROCm #Triton #Quantization #MI300X #Performance

April 10, 2026

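[vllm] [vLLM] Removing a GPU-CPU Synchronization Bottleneck: An Analysis of the prepare_chunk_indices Optimization

Analyzes a case that improved inference efficiency by removing the GPU-CPU synchronization bottleneck caused by .tolist() calls during GDN prefill.

#vLLM #CUDA #Performance-Optimization #Deep-Learning #Triton

April 3, 2026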

[sglang] Fusing the Temperature and Softmax Kernels to Improve SGLang Decode Performance

Uses a Triton kernel to fuse temperature scaling and softmax into one, optimizing memory access and reducing decode latency by up to 4x or more.

#SGLang #Triton #CUDA #LLM #Optimization

April 2, 2026

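[triton] Hardening Asymmetric Fat Pointer Handling for arith.select in AMD Canonicalize Pointers

Analyzes a PR that makes the pointer canonicalization pass in Triton's AMD backend safely handle an arith.select in which only one side has been split into base+offset.

#Triton #AMD #Compiler #Bug Fix #MLIR

April 1, 2026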

[triton] Adding a Reproduction Test for Heap Growth in Proton CUPTI Graph Replay

Analyzes a test script that systematically reproduces and profiles a memory leak in the CUPTI library during CUDA graph replay.

#Triton #Proton #Profiling #CUDA #MemoryLeak

March 31, 2026

[Triton] Adding Tensor Descriptor-Based GEMM Tests for AMD gfx1250

Adds test coverage for FP16 and MXFP GEMM and fused attention using tensor descriptor mode on AMD GFX1250.

#Triton #AMD #gfx1250 #GEMM #Tensor Descriptor #Testing

March 31, 2026

[Paper Review] Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization

In modern large-model systems and scientific computing, high-performance GPU kernel optimization is a key factor in turning hardware capability into real throughput.

#Review #GPU Kernel Optimization #Large Language Models #Evolutionary Algorithms #Reinforcement Learning #Triton #MetaX MACA #System Optimization

March 30, 2026

[triton] Adding a Descriptor Encoding Optimization Pass for AMD GPUs

Analyzes a PR that adds the OptimizeDescriptorEncoding pass, which optimizes the shared memory encoding of tensor descriptors to a padded form on the AMD GFX1250 target.

#Triton #AMD GPU #Tensor Descriptor #Shared Memory #Optimization

March 30, 2026

[SGLang] Fusing GDN's kkt + solve_tril into a Single Triton Kernel

Merges the Gated Delta Network's K@K^T computation and triangular solve into a single Triton kernel, eliminating HBM round trips.

#SGLang #Triton #Kernel Fusion #Linear Attention

March 29, 2026

[triton] Partition-Aware Splitting and Multi-Intrinsic Support for AMD TDM

Analyzes an improvement that aligns TDM warp distribution to partition boundaries in PartitionedSharedEncoding and correctly handles multi-TDM instruction generation and wait count computation.

#Triton #AMD #GPU #TDM #WarpDistribution

March 28, 2026

[triton] 2-10x Speedup by Removing Duplicate Shadow Updates Using GSan AxisInfo

Analyzes a PR that uses AxisInfo's contiguity property in Triton's Global Sanitizer to remove duplicate shadow updates, achieving up to a 10x speedup on FP16 matmul.

#Triton #GPU #Sanitizer #Optimization #MLIR

March 27, 2026

[triton] Fixing a Shared Memory Ordering Bug in AMD GFX9 Async Copy

Analyzes a fix for the shared memory order being set incorrectly when threads exactly cover the contiguous dimension, ensuring data correctness.

#Triton #AMD #GPU #SharedMemory #AsyncCopy

March 27, 2026

[triton] Re-enabling Prefetch for MMAv2 dot: Redesigned with Loop Prologue Separation

Analyzes a PR that redesigns the prefetch optimization for Triton's MMAv2 dot operation around loop prologue separation and re-enables it.

#Triton #NVIDIA #Prefetch #MMAv2 #Pipeline

March 27, 2026

[triton] Fixing Warp Free Variable and Register Zero Base Bugs in AMD Async Wait Count

Analyzes a fix for cases where non-regular warps skip the async copy and where a register zero base inflates the instruction count.

#Triton #AMD #GPU #AsyncCopy #WarpSpecialization

March 26, 2026

[triton] Adding Concurrency Sanitizer (ConSan) Support to the AMD Backend

Analyzes the implementation of MBarrierOpInterface, target hooks, capture count estimation, and more to support ConSan, which detects GPU concurrency bugs, on AMD GPUs.

#Triton #AMD #GPU #ConSan #Sanitizer #Concurrency

March 26, 2026

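[triton] Triton AMD Backend Optimization: Improving GEMM Performance through SGPR Usage and Loop Optimization

Analyzes a case that improved performance in Triton's AMD GPU kernels by removing VGPR dependencies and optimizing loop branches.

#Triton #AMD #GPU #Optimization #GEMM

March 25, 2026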

[SGLang] Multi-Head Parallel Processing Optimization for the Diffusion Triton Rotary Embedding

Restructures the Triton rotary embedding kernel to process multiple heads per token at once, reducing the number of kernel launches.

#SGLang #Triton #Diffusion #Rotary Embedding

March 26, 2026

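[triton] Improving AMD WMMA Utilization: Unroll Removal and Constant Folding

Analyzes a PR that prevents register spilling caused by a loop unrolling problem in LLVM code generation and reduces VALU operations through constant folding, improving WMMA utilization.

#Triton #AMD #WMMA #Gluon #Optimization

March 25, 2026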

[triton] Switching GSan Tests from nanosleep to Atomic-Based Synchronization

Analyzes a case that greatly improved test stability by replacing nondeterministic nanosleep-based synchronization with atomic polling in the GPU Sanitizer tests.

#Triton #GSan #Testing #GPU #Synchronization

March 24, 2026

[Paper Review] Scaling DoRA: High-Rank Adaptation via Factored Norms and Fused Kernels

Existing implementations of Weight-Decomposed Low-Rank Adaptation (DoRA) suffer from severe memory and performance bottlenecks, especially in high-rank settings.

#Review #DoRA #Low-Rank Adaptation #Parameter-Efficient Fine-Tuning #Fused Kernels #Memory Optimization #Performance Scaling #Triton

March 23, 2026

[triton] Optimizing Output Stores with TDM Store in the AMD MXFP FA Example

Analyzes a case that simplified the code and improved memory access efficiency by replacing buffer_store-based manual layout management with TDM store.

#Triton #AMD #GPU #TDM #FlashAttention

March 23, 2026

[Axolotl] Adding bias, dropout, DoRA, and embedding Support to the LoRA Kernels

An analysis of extending Axolotl's Triton LoRA kernels to support bias parameters, dropout, DoRA (Weight-Decomposed LoRA), and embedding layers.

#Axolotl #LoRA #DoRA #Triton #LLM Training #Performance #PEFT

March 22, 2026

[Axolotl] Adding Liger Kernel Support for Qwen 3.5 Models and a Fused RMSNorm+Gated Kernel

An analysis of adding Liger FLCE kernel support for the Qwen 3.5 / Qwen 3.5 MoE models and a fused RMSNorm+SiLU gate Triton kernel to Axolotl.

#Axolotl #Liger Kernel #Qwen 3.5 #RMSNorm #Triton #LLM Training #Performance

March 22, 2026

[Axolotl] Shrinking the Autotune Search Space of the ScatterMoE LoRA Triton Kernels

An analysis of removing unnecessarily large block sizes from the autotune configs of the ScatterMoE LoRA Triton kernels to shorten compile time and avoid exceeding shared memory.

#Axolotl #Triton #ScatterMoE #LoRA #Autotune #Performance #GPU

March 21, 2026

[Triton] Propagating Buffer Cache Modifiers to LLVM IR on AMD RDNA3

Fixes an issue where the .cg/.cs/.cv/.wt cache modifiers were ignored on the RDNA3 target, enabling non-temporal memory access.

#Triton #AMD #RDNA3 #Cache Optimization #LLVM IR

March 21, 2026

[triton] Adding Partial Support for TMA and cp.async Operations to the Global Sanitizer

An analysis of a PR that adds tensor descriptor decoding and memory access tracking for TMA/cp.async operations to Triton's Global Sanitizer.

#Triton #GSan #Sanitizer #TMA #AsyncCopy #Debugging

March 20, 2026

[Axolotl] ScatterMoE LoRA Optimization: Benchmarks, Kernel Splitting, and autograd Integration

An analysis of adding a benchmarking tool to the ScatterMoE LoRA Triton kernels and optimizing automatic fused/split forward selection and autograd integration for large-expert models.

#Axolotl #ScatterMoE #LoRA #Triton #MoE #Benchmark #GPU #Performance

March 19, 2026

[triton] Support for Custom DSL Plugin Ops

Analyzes a PR that adds custom op registration to the Triton plugin system so that third parties can integrate their own DSL operations into the Triton frontend.

#Triton #Plugin System #DSL #Extensibility #Frontend

March 19, 2026

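[triton] Simplifying and Restoring the getTranspositionSelectors Algorithm

Analyzes a case that fixed a correctness problem with multiple mixed transpositions and clarified the mathematical decomposition of the prmt selector algorithm.

#Triton #GPU #LinearLayout #Optimization #Algorithm

March 19, 2026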

[triton] Adding Multi-CTA Support to ConSan

Analyzes a PR that adds support for multi-CTA cluster environments to Triton's Concurrency Sanitizer (ConSan) so that it correctly tracks scratch memory state shared by multiple CTAs in a cluster.

#Triton #GPU Compiler #Concurrency Sanitizer #Multi-CTA #CUDA

March 19, 2026

[axolotl] Axolotl: Optimizing Entropy and Selective Log Softmax with Triton Kernels

An analysis of a PR that significantly improved training performance in Axolotl by using Triton kernels to optimize the entropy and selective log softmax computations.

#Triton #PyTorch #Optimization #Deep Learning #Performance #GPU

March 19, 2026

[axolotl] Stabilizing the Triton LoRA Kernel Autotune Tests: A Module Isolation Strategy for pytest-xdist

Analyzes a case that fixed flaky tests caused by shared sys.modules under parallel pytest-xdist runs by patching _find_lora_ops_module directly.

#Axolotl #Triton #Testing #pytest #LoRA

March 19, 2026

[axolotl] Axolotl Custom Triton Kernels: Up to 5x Faster entropy/softmax

Fuses entropy_from_logits and selective_log_softmax with Triton kernels to accelerate RLHF training.

#Triton #RLHF #Kernel Optimization #Axolotl

March 19, 2026

[triton] Masking AsyncCopy with OOB Shared Memory Addresses on GFX1250

Analyzes a GFX1250 optimization that masks async copies efficiently by using out-of-range LDS addresses instead of branch-based masking.

#Triton #AMD #GPU #AsyncCopy #GFX1250

March 18, 2026

[triton] Adding String Argument Support to the triton-ext Plugin API

Analyzes a PR that extends the addPass API of Triton extension plugins to accept string arguments, making custom passes more configurable.

#Triton #Plugin #API #MLIR #Extension

March 18, 2026

[triton] Unifying the Padded Layout Heuristics of the Async Copy and TDM Paths on AMD gfx1250

An analysis of a PR that unifies the padded shared memory layout selection heuristics used by the async copy and TDM load paths on AMD gfx1250 GPUs.

#Triton #AMD #gfx1250 #SharedMemory #Padding #BankConflict

March 17, 2026

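[triton] Fixing Intermittent SIGABRT Crashes in Forked Subprocesses

An analysis of a PR that resolved intermittent SIGABRTs, caused by LLVM's internal parallelism not being fork-safe, by disabling the LLVM thread pool.

#Triton #LLVM #Fork #SIGABRT #Threading #BugFix

March 16, 2026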

[triton] Enabling Buffer Atomic Operations on AMD GFX1250

Analyzes a case that added buffer atomic RMW/CAS support on the GFX1250 architecture and implemented the SCOPE_DEV cache policy and packed bf16 fadd.

#Triton #AMD #GPU #GFX1250 #Atomics

March 16, 2026

[triton] Fixing a PTX Codegen Segfault on Consumer Blackwell (sm_120)

Analyzes a fix for a runtime segfault caused by using the sm_120a suffix on consumer Blackwell GPUs such as the RTX 5070 Ti.

#Triton #NVIDIA #GPU #Blackwell #PTX #BugFix

March 16, 2026

[triton] Fixing the Thread Predicate for Tensor Operands in AMD AtomicCAS

Analyzes a case in the AMD backend that correctly applies the thread predicate to tensor-based atomic CAS operations, preventing redundant threads from incorrectly executing atomics.

#Triton #AMD #GPU #Atomics #BugFix

March 14, 2026

[triton] Fixing a Buffer Race with TDM Loads in AMD Pipelined Loops

An analysis of a PR that fixed a data race caused by having too few buffers when TDM loads are used in pipelined loops on AMD GPUs.

#Triton #AMD #TDM #Pipeline #BufferRace #BugFix

March 14, 2026

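[triton] Optimizing High-Performance 2CTA Block-Scaled Matrix Multiplication with Triton Gluon

How Triton Gluon's 2CTA warp specialization raises the arithmetic intensity of matrix multiplication and optimizes SMEM usage.

#Triton #GPU #CUDA #MatMul #HighPerformanceComputing

March 13, 2026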

[triton] Triton 2CTA Block-Scaled Matmul: A Performance Comparison with cuBLAS

The 2CTA warp-specialized block-scaled matmul implemented in Triton Gluon supports mxfp8/mxfp4/nvfp4.

#Triton #CUDA #Matrix Multiplication #FP8 #Blackwell

March 13, 2026

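[triton] Major Refactoring of the AMD GFX1250 MXFP Flash Attention Example Kernel

Analyzes a refactoring that simplified the GFX1250 FA example and improved its performance by removing the preshuffle logic, introducing TDM store, switching to expand_dims, and more.

#Triton #AMD #GPU #FlashAttention #GFX1250 #Refactoring

March 12, 2026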

[triton] Refactoring the Concurrency Sanitizer onto Vendor Target Hooks

An analysis of a PR that refactored Triton's Concurrency Sanitizer onto a vendor-independent interface so that GPU vendors other than NVIDIA can be supported.

#Triton #ConSan #Sanitizer #Refactoring #VendorHooks #Architecture

March 9, 2026

[triton] Extending Padded Layout Selection for AMD GFX9 AsyncCopy

An analysis of a PR that extends padded layout selection for async copy on AMD CDNA4 (GFX9) GPUs to 8-bit data types and wider kWidth, reducing bank conflicts.

#Triton #AMD #CDNA4 #AsyncCopy #PaddedLayout #BankConflict

March 9, 2026

[PyTorch] Inductor mixed-order reduction Optimization

Disables multi-stage for mix-order-reduction by default to avoid exceeding shared memory.

#PyTorch #Inductor #Triton #Compiler

March 9, 2026

[triton] Fixing MFMA/WMMA Encoding Compatibility in AMD FpSan dot Emulation

Analyzes a PR that ensures correctness in the FP Sanitizer's dot emulation by using an optimized blocked layout instead of MFMA/WMMA encodings and adding a cross-warp barrier.

#Triton #AMD #FpSan #Bug Fix #MFMA

March 6, 2026

[axolotl] Unifying ScatterMoE Kernel Routing: Softmax/Sigmoid-Based Routing and Autotune Telemetry

Analyzes a case that consolidated the various MoE routing strategies (Softmax TopK, Sigmoid TopK) into a unified function and added a telemetry callback that automatically collects Triton autotune results.

#Axolotl #MoE #ScatterMoE #Triton #Routing #Telemetry

March 6, 2026

[triton] Adding a Multi-CTA Tutorial: CGA-Based Cooperative Computation

Analyzes the addition of a multi-CTA programming tutorial that uses the CGA (Cooperative Grid Array) on NVIDIA Hopper/Blackwell.

#Triton #NVIDIA #GPU #MultiCTA #Tutorial #Blackwell

March 6, 2026

[triton] Supporting Standalone Use of the Triton CUDA Backend without PyTorch

Analyzes a PR that removes the PyTorch dependency from Triton's CUDA backend so that GPU kernels can be compiled and run in a pure Python environment.

#Triton #CUDA #PyTorch #Runtime #Independence

March 5, 2026

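[triton] Storing the Program ID in Shared Memory to Avoid Recomputation in the Multi-CTA Example

Analyzes an optimization that stores the planar snake ID in shared memory in the CLC tile scheduler, eliminating recomputation across the consumer and epilogue partitions.

#Triton #Gluon #GPU #MultiCTA #Optimization

March 5, 2026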

[sglang] MoE Inference Optimization: 28% TTFT Improvement via Triton Kernel Fusion

Improves TTFT by 28% in MoE model inference by fusing the `fused_moe_triton` and `moe_sum_all_reduce` kernels.

#MoE #Triton #Kernel Fusion #GPU Optimization #LLM Inference #SGLang

March 4, 2026

[triton] Providing a Default Allocator for Profile Scratch

Analyzes a PR that lets instrumentation such as ConSan use profile scratch memory through the driver's default allocator even when the user has not configured a separate allocator.

#Triton #Instrumentation #Memory Allocation #ConSan #Developer Experience

March 3, 2026

[triton] Automatically Inserting Fence + Cluster Relaxed in MultiCTA Membar

An analysis of a PR that automatically inserts fence_mbarrier_init and cluster arrive/wait for cross-CTA mbarriers in Triton's MultiCTA environment to ensure synchronization correctness.

#Triton #NVIDIA #MultiCTA #Membar #Fence #ClusterBarrier

March 3, 2026

[triton] Fixing a Crash in the AMD Software Warp Pipeline

An analysis of a PR that fixed a crash in the AMD GPU ConvertWarpPipeline pass caused by failing to recognize AsyncWaitOp as a barrier, and improved the barrier ordering logic.

#Triton #AMD #WarpPipeline #AsyncWait #BugFix #SWP

March 3, 2026

[triton] Fixing a Crash on non-MFMA dot in the AMD BlockPingpong Pass

Analyzes a PR that used safe type casting to fix a crash caused by the AMD BlockPingpong optimization being applied to FMA-based dot operations.

#Triton #AMD #Bug Fix #Pingpong #MFMA

March 3, 2026

[triton] Adding a Fence to Work Around the AMD GFX1250 MachineSink Issue

Analyzes a PR that inserts a compiler fence on the AMD GFX1250 target to work around a bug in which LLVM's MachineSink optimization moves LDS loads across a barrier.

#Triton #AMD GPU #LLVM #Compiler Bug #Workaround

March 3, 2026

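[Triton] Adding Async Read Dependencies to FenceAsync: Ensuring Consistency between st.shared and copy_local_to_global

Fixes a bug where fences were not inserted for async proxy read operations, ensuring data consistency between shared memory writes and global copies.

#Triton #MLIR #NVIDIA #Memory Fence #GPU Compiler

March 2, 2026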

[triton] Automatic Register Layout Inference in Gluon tmem_load

Analyzes a BC-breaking change that removes the get_tmem_reg_layout call and infers the register layout automatically from the tensor memory descriptor.

#Triton #Gluon #NVIDIA #Blackwell #TensorMemory

February 28, 2026

[triton] Recognizing AsyncWaitOp and Fixing Barrier Ordering in AMD ConvertWarpPipeline

An analysis of a PR that makes the AMD GPU warp pipeline conversion recognize AsyncWaitOp as a barrier and fixes an ordering bug in the bars array.

#Triton #AMD #WarpPipeline #AsyncWait #BugFix

February 27, 2026

[triton] Running NVIDIA inval_barrier Only on the Leader CTA

Analyzes a PR that changes the inval_barrier operation for broadcasted barriers in multi-CTA environments to run only on the leader CTA, ensuring correct barrier invalidation.

#Triton #NVIDIA #Multi-CTA #Barrier #mbarrier

February 27, 2026

[triton] Adding Invalidation of Initialized Barriers in WSSpecialize

Analyzes a case that correctly invalidates the mbarriers created by the WarpSpecialize pass after use, preventing hardware consistency problems on reuse.

#Triton #NVIDIA #GPU #WarpSpecialize #Barrier

February 26, 2026

[triton] Passing More Metadata to the Proton Kernel Launcher

Analyzes an improvement that passes additional metadata such as numThreads and sharedMemBytes to Proton's metric kernel launches for finer control over GPU resource usage.

#Triton #Proton #Profiling #GPU #KernelLaunch

February 26, 2026

[triton] Unifying Per-Backend global_scratch_alloc Allocation

Analyzes a case that unified global scratch memory management by moving the Proton profiler's scratch memory into a separate pool and adding third-party allocation support.

#Triton #GPU #MemoryAllocation #Proton #Refactoring

February 26, 2026

[triton] Exposing the 3D Dot FMA Operation in Gluon

An analysis of a PR that extends the Triton Gluon frontend to support batched (3D) matrix multiplication as an FMA dot operation.

#Triton #Gluon #DotFMA #BatchedMatMul #3D #GPU

February 25, 2026

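[triton] Optimizing Multi-CTA Matrix Multiplication on the Blackwell Architecture with Triton Gluon

An analysis of matrix multiplication performance optimization and memory layout improvements using CLC (Cluster Launch Control) in the multi-CTA environment of Blackwell GPUs.

#Triton #Blackwell #GPU #MatMul #HPC

February 24, 2026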

[Triton] Ensuring active_mode Is Reset When AsyncCompileMode Raises an Error

Sets active_mode to None even when an exception occurs in the context manager's exit, so that subsequent compilations are not blocked.

#Triton #Python #Bug Fix #Error Handling #Async Compilation

February 24, 2026

[triton] Fixing the Scale Layout in AMD Batched WMMA Scaled

An analysis of a PR that fixed a bug in handling the dimension order of scale tensors in batched WMMA scaled operations on AMD gfx1250 GPUs and added batched tests.

#Triton #AMD #WMMA #Scale #BatchedMatMul #BugFix

February 23, 2026

[Triton] 2CTA Block Scale MMA with tcgen05.cp: Cooperative Matrix Multiplication across Two CTAs

Implements the full path (TMA→cp→MMA→commit) of a Block Scale MMA in which two CTAs cooperate, using the tcgen05.cp instruction.

#Triton #NVIDIA #Blackwell #2CTA #MMA #tcgen05

February 23, 2026

[triton] Making the Cache Tests Device Agnostic

Analyzes a fix that replaces the hardcoded device index 0 with the actual current device ID so that the cache tests work on every GPU backend.

#Triton #Testing #Cache #DeviceAgnostic

February 23, 2026

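[triton] Updating the AMD gfx1250 MXFP Flash Attention Example Kernel

An analysis of a PR that substantially improved layout selection, shared memory management, and the TDM load abstraction in the MXFP Flash Attention Gluon example for AMD gfx1250 GPUs.

#Triton #AMD #gfx1250 #FlashAttention #MXFP #Gluon

February 20, 2026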

[triton] Fixing the Shared Memory Size Calculation for AMD TensorDescType

Analyzes a fix that computes the size of TensorDescType correctly in WarpSpecialize capture, preventing shared memory allocation errors.

#Triton #AMD #GPU #WarpSpecialize #SharedMemory

February 20, 2026

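[triton] Triton AMD GPU: Optimizing Address Computation for Buffer Loads in Loops

Improves compute efficiency by changing offset-based address computation for buffer loads inside loops to base pointer increments.

#Triton #AMD #Compiler Optimization #MLIR #GPU

February 20, 2026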

[triton] Supporting Non-CTA Dimension Slicing in MemDescSubslice

Analyzes an improvement to the memory slicing verification logic so that it correctly handles broadcasted CTAs and CTA dimension splits in multi-CTA layouts.

#Triton #GPU #MultiCTA #SharedMemory #LinearLayout

February 20, 2026

[triton] Fixing the Cluster Barrier Logic in Async TMA Lowering

An analysis of a PR that fixed the conditions for using cluster barriers and the cross-CTA mbarrier init synchronization in Triton's TMA async copy.

#Triton #NVIDIA #TMA #ClusterBarrier #MultiCTA #BugFix

February 19, 2026

[triton] Adding 16/32-bit Elementwise Vectorization Support to AMD TargetInfo

Analyzes a PR that enables supportBitwidth16Elementwise and supportBitwidth32Elementwise in the AMD GPU TargetInfo to optimize reduction code generation.

#Triton #AMD #Vectorization #Reduction #GFX1250

February 19, 2026

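[Triton] Fixing Nondeterministic Failures in the Module Unload Test: Stabilizing by Disabling GC

Fixes a nondeterministic failure in which the Python garbage collector unexpectedly invokes the module_unload callback during the test.

#Triton #Python #Testing #Garbage Collection #Bug Fix

February 19, 2026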

[triton] Fixing an OOM Bug in Padded Layout Async Copy on AMD GFX950

Analyzes a fix that handles the case where the padding interval is larger than the contiguous dimension at small tile sizes, preventing OOM during pipelining.

#Triton #AMD #GPU #GFX950 #Pipelining #BugFix

February 18, 2026

[triton] Enabling Floating-Point Sanitizer (FPSan) Support in the AMD Backend

Analyzes a case that extended the tests to support FPSan on AMD GPUs (CDNA3/CDNA4/GFX1250) and resolved layout problems caused by differences in warp size.

#Triton #AMD #GPU #FPSan #Testing

February 17, 2026

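[triton] Supporting Explicit Unloading of Compiled Kernel Modules

Analyzes a PR that adds a __del__ method and an unload_module driver function so that compiled kernel modules can be explicitly unloaded in the Triton runtime.

#Triton #Runtime #Memory Management #CUDA #HIP

February 17, 2026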

[Triton] Adding an import torch Guard in HIPBackend: Restoring JAX Compatibility

Fixes an ImportError when using the AMD backend in environments without torch (jax-triton).

#Triton #AMD #Python #Bug Fix #Compatibility

February 17, 2026

[triton] NVIDIA TMA im2col Mode Gluon Tutorial: Implementing a Convolution Kernel

Analyzes a tutorial PR that uses Triton Gluon to implement a convolution kernel with the TMA im2col mode of NVIDIA Blackwell GPUs.

#Triton #NVIDIA #TMA #Convolution #Gluon

February 16, 2026

[triton] Supporting TDM Software Pipelining on AMD GFX1250

Analyzes a PR that integrates asynchronous copies based on Tensor Descriptor Memory (TDM) into software pipelining on the AMD GFX1250 target, improving matmul performance.

#Triton #AMD GPU #GFX1250 #TDM #Software Pipelining

February 17, 2026

[triton] Making CLCTryCancel Use the Async Proxy

An analysis of a PR in the Triton NVIDIA backend that makes CLCTryCancelOp be recognized as an async proxy write so that proxy fences are inserted correctly.

#Triton #NVIDIA #CLC #ProxyFence #AsyncCopy #BugFix

February 16, 2026

[Triton] Making the Blackwell 2D Activation-Scale Layout Work without Ragged Metadata

Falls back to batched mode for the combination of 2D input and ragged_metadata=None, preventing layout construction failures.

#Triton #NVIDIA #Blackwell #MXFP #Bug Fix

February 11, 2026

[Triton] Fixing the min/max ms Return Order in the grouped_gemm Benchmark

Fixes flipped error bars in perf_report by correcting the order of the returned values.

#Triton #Tutorial #Bug Fix #Benchmark

February 11, 2026

[triton] Triton AMD Backend: An Analysis of the 8-Wave PingPong Attention Kernel Implementation

Examines the implementation of the 8-Wave PingPong attention kernel and its pipelining optimizations for better performance on AMD GPUs.

#Triton #AMD #GPU #Attention #Optimization

February 10, 2026

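[triton] AMD: Implementing Shared Memory Partitioning with LLVM Lowering Support for PartitionedSharedEncodingAttr

An analysis of the LLVM IR lowering for PartitionedSharedEncodingAttr, which splits a tensor across multiple physical shared memory partitions to reduce partition conflicts.

#Triton #AMD #LLVM #Shared Memory #Partitioning #MLIR

February 10, 2026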

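[Triton] Adding a Cross-CTA Barrier at Kernel End: Ensuring Cluster Memory Consistency

Inserts a cluster-level barrier at the end of kernels with outstanding reads/writes to ensure memory consistency across CTAs.

#Triton #NVIDIA #Cluster #Memory Barrier #Correctness

February 10, 2026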

[triton] Triton NVIDIA GPU Backend: Optimizing WarpGroupDotWaitOp and Improving Synchronization

Adds a warpGroupLocal attribute to WarpGroupDotWaitOp to remove unnecessary barrier synchronization and improve performance.

#Triton #NVIDIA #GPU #Optimization #Compiler

February 9, 2026

[triton] Extending the Membar Pass for Cluster Environments

Analyzes a PR that extends Triton's membar analysis for cluster environments by adding a buffer ID to AllocationSlice and supporting fine-grained filters at the slice/op level.

#Triton #Memory Barrier #Cluster #Shared Memory #Static Analysis

February 9, 2026

[Triton] TMA im2col Mode: Gluon API Implementation

The Gluon DSL API implementation in the TMA im2col series, which makes im2col-mode TMA copies directly usable from Python.

#Triton #NVIDIA #TMA #im2col #Gluon #Convolution

February 9, 2026

[triton] Switching AMD Async Load to ROCDL Ops

An analysis of an NFC (No Functional Change) PR that replaces string-based LLVM intrinsic calls with type-safe ROCDL ops in the async load operations for AMD GPUs.

#Triton #AMD #ROCDL #AsyncCopy #NFC #Refactoring

February 9, 2026

[triton] Fix an FPSan crash when using warp specialization with TMem

An analysis of the fix for a crash caused by the Floating-point Sanitizer referencing out-of-scope values when accessing tensor memory inside a WarpSpecialize partition.

#Triton #FPSan #NVIDIA #WarpSpecialize #TensorMemory #BugFix

February 9, 2026

[triton] Fix smem offsets for function calls in membar analysis

An analysis of the PR that makes Triton's membar analysis apply the allocation offset correctly when translating a callee's shared memory accesses into the caller's context.

#Triton #Memory Barrier #Shared Memory #Function Call #Bug Fix

February 9, 2026

[triton] Generic multi-CTA convert_layout support

An analysis of the PR that extends Triton's convert_layout operation to handle multi-CTA environments generically, including how it uses cluster barriers and distributed shared memory to move data between CTAs.

#Triton #GPU Compiler #Multi-CTA #Layout Conversion #MLIR

February 9, 2026

[triton] Persistent kernel workload balancing with Blackwell GPU Cluster Launch Control support

An analysis of the PR that adds the CLC (Cluster Launch Control) feature of NVIDIA Blackwell SM100+ GPUs to Triton Gluon, enabling dynamic work distribution in persistent kernels.

#Triton #NVIDIA #Blackwell #GPU #Gluon

February 6, 2026

[triton] Introducing FpSan, the Floating Point Sanitizer

An analysis of the PR that introduces FpSan (Floating Point Sanitizer) to Triton to detect floating-point errors in GPU kernels at runtime. An MLIR pass rewrites FP operations into an integer payload form.

#Triton #GPU Compiler #Floating Point #Sanitizer #MLIR

February 6, 2026

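[triton] Triton compiler optimization: introducing in-thread tree reduction

Converts Triton's reduction operations into a tree structure and applies in-thread vectorization, improving Gluon attention kernel performance.

#Triton #Compiler #Optimization #LLVM #GPU

February 6, 2026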

[Triton] TMA im2col mode: LLVM lowering implementation

The fifth PR in the TMA im2col series, implementing im2col descriptor creation and the LLVM IR lowering of TMA copies.

#Triton #NVIDIA #TMA #im2col #LLVM #Compiler

February 6, 2026

[Paper Review] Dr. Kernel: Reinforcement Learning Done Right for Triton Kernel Generations

This paper sets out to study systematically a robust reinforcement learning (RL) methodology that addresses problems such as reward hacking and lazy optimization, which arise when large language models (LLMs) are used to generate high-quality GPU kernel code, and that leads to real performance gains.

#Review #Reinforcement Learning #Kernel Generation #Triton #GPU Optimization #LLMs #Reward Hacking #Multi-turn Interaction #Code Generation

February 5, 2026

[triton] Add a warp-pipeline f16 GEMM example for AMD GFX1250

An analysis of the added f16 GEMM kernel example that uses TDM and warp pipelining on the AMD GFX1250 architecture.

#Triton #AMD #GPU #GFX1250 #GEMM #WarpPipeline

February 5, 2026

[Triton] Fix the AsyncCopy shared layout order on AMD GFX9

Uses getContigPerThread instead of getElementsPerThread and clamps vecSize to the hardware-supported range, ensuring coalesced direct-to-LDS writes.

#Triton #AMD #GFX9 #Async Copy #Bug Fix

February 5, 2026

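[triton] ConSan compile time cut from 19 minutes to 34 seconds: a large-scale optimization

An analysis of a large PR that makes Triton Concurrency Sanitizer compilation about 33x faster. It includes a range of optimizations such as IR size reduction, warp-local layouts, and deduplication of helper functions.

#Triton #ConSan #Compile Time #MLIR #Optimization

February 5, 2026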

[triton] Triton Stream-K kernel optimization for AMD GFX1250: 4/8-warp implementation

An analysis of 4/8-warp parallel processing and atomic lock optimizations aimed at maximizing Stream-K kernel performance on the AMD GFX1250 architecture.

#Triton #AMD #GFX1250 #Stream-K #GPU-Optimization

February 4, 2026

[Triton] Shared memory partitioning support through the new AMD PartitionedSharedEncodingAttr

Adds a new encoding attribute that distributes a tensor across multiple physical shared memory partitions to reduce bank conflicts.

#Triton #AMD #MLIR #Shared Memory #Memory Optimization

February 4, 2026

[Triton] Support AMD TDM AsyncWait in UpdateAsyncWaitCount

Supports accurate waitcnt computation when a TDM scatter/gather generates multiple intrinsics.

#Triton #AMD #TDM #Async Wait #Compiler

February 2, 2026

[Triton] Introducing AMD PartitionedSharedEncodingAttr: fewer shared memory partition conflicts

Distributes a tensor across multiple physical shared memory partitions to reduce bank conflicts.

#Triton #AMD #MLIR #Shared Memory #Architecture

February 2, 2026

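[triton] AMD: MoveUpPrologueLoads fully replaces the ReorderInstructions pass

An analysis of the PR that replaces ReorderInstructions, whose optimizations had been removed over several rounds, with the single-purpose MoveUpPrologueLoads pass to make the code clearer.

#Triton #AMD #Refactoring #Compiler #Pipeline

February 1, 2026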

[triton] Add Tensor Async Gather (TDM) support to Gluon for AMD gfx1250

An analysis of the PR that adds to Gluon the ability to read asynchronously from non-contiguous global memory rows using the TDM gather mode of AMD gfx1250 GPUs.

#Triton #AMD #gfx1250 #Gluon #TDM #Gather

February 1, 2026

[triton] Triton AMD GPU backend: layout conversion optimization using the v_perm instruction

How the v_perm instruction is used on AMD GPUs to speed up 8-bit data layout conversions and reduce the instruction count.

#Triton #AMD #GPU #LLVM #Optimization

January 30, 2026

[triton] Add unpadded batch size handling to the reduce kernel

An analysis of the PR that adds unpadded batch size support to the reduce kernel in Triton Kernels so that it skips unnecessary work on padded batches.

#Triton #TritonKernels #Reduce #Padding #BatchSize #Performance

January 30, 2026

[triton] Driver support for NVIDIA TMA im2col mode

An analysis of the PR that adds Python driver-level support for NVIDIA TMA's im2col mode, covering the cuTensorMapEncodeIm2col API binding and the descriptor creation logic.

#Triton #NVIDIA #TMA #Im2col #Driver

January 28, 2026

[Triton] TMA im2col mode: tma load op fixes

The third PR in the NVIDIA TMA im2col series, fixing type matching and offset handling in the tma load op.

#Triton #NVIDIA #TMA #im2col #Convolution

January 26, 2026

[triton] Tensor descriptor support for NVIDIA TMA im2col mode

An analysis of the PR that integrates NVIDIA TMA's im2col mode into Triton's tensor descriptor system. It introduces TensorDescInterface and adds TensorDescIm2ColType to support convolution-friendly memory access patterns.

#Triton #NVIDIA #TMA #Im2col #Convolution #MLIR

January 26, 2026

[triton] Add Tensor Async Scatter support to Gluon for AMD gfx1250

An analysis of the PR that adds to Gluon the ability to write asynchronously to non-contiguous global memory rows using the TDM scatter mode of AMD gfx1250 GPUs.

#Triton #AMD #gfx1250 #Gluon #TDM #Scatter

January 26, 2026

[Triton] Add the AMD PrepareIfCombining pass: scf.if merging optimization

Moves instructions that sit between adjacent scf.if operations with the same condition so that the canonicalizer can merge the ifs.

#Triton #AMD #MLIR #Compiler Optimization #Control Flow

January 24, 2026

[Triton] Enable AMD TDM features and add the ConvertToTensorOps pass

Enables TDM (Tensor Data Mover) features and adds the ConvertToTensorOps conversion pass.

#Triton #AMD #TDM #Tensor Descriptor #Compiler Pass

January 23, 2026

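[triton] Restoring NVIDIA canSkipBarSync for an 18 GBps MoE kernel speedup

An analysis of the PR that conservatively redesigns and restores the barrier-skip optimization that was disabled while Blackwell support was being added, improving persistent MoE kernel performance.

#Triton #NVIDIA #Membar #Optimization #MoE

January 22, 2026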

[triton] Triton Hopper kernel optimization: removing epilogue overlap in persistent matmul

An analysis of how disabling epilogue overlap in Triton's persistent Hopper matmul yielded a 150 GBps performance improvement.

#Triton #GPU #Optimization #HPC #Matmul

January 22, 2026

[triton] Take bufferID into account in the AMD membarFilter

An analysis of the PR that makes the AMD backend's membar analysis consider buffer IDs, reducing unnecessary barrier insertion while correctly inserting barriers that were missing between reused allocations.

#Triton #AMD GPU #Memory Barrier #Shared Memory #Optimization

January 22, 2026

[Triton] Better documentation of the AxisInfo divisibility initialization logic

Clearly documents why divisibility is reset to 1 in MulIOp when contiguity > 1.

#Triton #Documentation #MLIR #AxisInfo #Compiler Analysis

January 22, 2026

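[triton] Eliminating kernel launch overhead with a pre-compiled variadic CUDA launcher

An analysis of the PR that switches Triton's CUDA/HIP kernel launcher from Python string substitution to a C-based variadic-argument approach, eliminating launch overhead.

#Triton #CUDA #HIP #Runtime #Performance

January 21, 2026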

[Triton] Remove unnecessary lock acquisition in the Proton profiler

A lock optimization that splits out PhaseStore and uses atomic operations to reduce profiling overhead.

#Triton #Proton #Profiler #Performance #Concurrency

January 21, 2026

[triton] Triton compile-time optimization: speedup from skipping the alias matrix

An analysis of the optimization that removes unnecessary alias matrix construction in Triton's CONSAN mode, cutting compile time by about 15%.

#Triton #Compiler #Optimization #LLVM #Performance

January 20, 2026

[triton] Triton kernel optimization: performance gains from a high-occupancy persistent matmul

An analysis of how optimizing SM occupancy in Triton's persistent matmul kernel achieved a 15% performance improvement on H200.

#Triton #GPU #CUDA #Optimization #Matmul

January 20, 2026

[Triton] Add M=64 2CTA mode support

Supports 2CTA mode for the M=64 instruction shape on the Blackwell architecture, increasing TensorMemory layout flexibility.

#Triton #NVIDIA #Blackwell #CTA #TensorMemory

January 18, 2026

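[triton] [Blackwell] Analysis of Triton's tcgen05.ld.red optimization for NVIDIA's next-generation architecture

An analysis of how the Blackwell architecture's ability to perform a TMEM load and a reduction at the same time was implemented in Triton Gluon to improve performance.

#Triton #Blackwell #NVIDIA #GPU #Optimization #MLIR

January 16, 2026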

[Triton] TritonGPU barrier redesign: per-address-space memory visibility guarantees

Replaces gpu.barrier with a TritonGPU-specific barrier op to give fine-grained control over shared/global memory visibility.

#Triton #MLIR #GPU Barrier #Memory Visibility #Compiler Infrastructure

January 16, 2026

[triton] Warp specialization: an improved partition scheduling pass based on a data flow graph

An analysis of the rewrite of partition scheduling on top of a data flow graph and incremental heuristic merging, which makes it more general.

#Triton #Warp Specialization #Partition Scheduling #Data Flow Graph #Compiler #MLIR

January 16, 2026

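[triton] Revert of the PR that removed the moveUpTranspose optimization: preventing a regression

An analysis of the PR that reverts the removal of moveUpTranspose, which caused performance regressions in some workloads, and restores the TransposeOp reordering optimization.

#Triton #AMD #Revert #Performance #Regression

January 15, 2026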

[Triton] Add AMD fine-grained cluster barriers and expose them in Gluon

Adds cluster barrier arrive/wait operations to the AMD backend for synchronizing execution across CTAs.

#Triton #AMD #Gluon #Multi-CTA #Synchronization

January 15, 2026

[Triton] Selective kernel metadata recording and custom metrics in Proton

Adds include/exclude filters and arbitrary metric support to LaunchHook, making profiling more flexible.

#Triton #Proton #Profiler #Metadata #Performance

January 15, 2026

[triton] AMD: apply the padded shared layout to smaller block sizes to eliminate bank conflicts

An analysis of the change that supports a bank-conflict-free layout using LDS padding even for small blocks under 16KB.

#Triton #AMD #GPU #LDS #Bank Conflict #Shared Memory

January 13, 2026

[Triton] Improve and simplify ReduceOp lowering with LinearLayout

Redesigns ReduceOp lowering around LinearLayout, taking advantage of shmem swizzling and removing unnecessary round trips.

#Triton #MLIR #Compiler Optimization #LinearLayout #Refactoring

January 12, 2026

[Triton] Choosing the optimal layout for small async_cp

Selects the coalesced encoding independently for async copies of small tensors, removing unnecessary convert_layout ops.

#Triton #MLIR #Compiler Optimization #GPU #Async Copy

January 9, 2026

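[triton] Remove the no-op sinkDotConversion optimization from AMD ReorderInstructions

An analysis of the PR that reduces code complexity by removing the sinkDotConversion optimization, which had no effect because it ran after ConvertLayout had already been replaced by local_load.

#Triton #AMD #Refactoring #Dead Code #MLIR

January 9, 2026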

[Triton] Expose TDM L2 prefetch in the AMD Gluon DSL: user-level prefetch control

Exposes the TDM L2 prefetch feature of AMD GPUs through the Gluon DSL API so that users can control prefetching directly from their kernels.

#Triton #AMD #Gluon #L2 Cache #Prefetch #GPU Optimization

January 8, 2026

[triton] 1-2 GBps speedup from applying ex2.approx.ftz to the SwiGLU kernel

An analysis of the PR that replaces the exp operation in Triton's SwiGLU kernel with CUDA ex2.approx.ftz inline assembly, improving throughput while preserving numerical safety.

#Triton #Kernel #SwiGLU #PTX #Optimization

January 8, 2026

[Triton] Retire Proton's GlobalScratchAllocOp: merged into the shared TritonGPU op

Replaces the Proton-specific GlobalScratchAllocOp with TritonGPU's shared op and distinguishes allocation policies with a backend attribute.

#Triton #Proton #MLIR #Refactoring #Op Deprecation

January 7, 2026

[triton] Stronger Gluon TMA op verifier and a new illegal-instruction sanitize mode

An analysis of the PR that strengthens the TMA op verifier in Triton Gluon, adds a check that the descriptor and tensor element counts match, and adds an illegal-instruction sanitize mode.

#Triton #Gluon #TMA #Verifier #Sanitizer #MLIR

January 7, 2026

[triton] Fix self-latency and MMA handling in AutoWS when mixing TMA and non-TMA loads

An analysis of the PR that sets the MMA self-latency correctly when TMA and regular loads are mixed under warp specialization and makes lowerMMA handle warp-specialized MMAs.

#Triton #NVIDIA #AutoWS #TMA #Pipeline

January 7, 2026

[Triton] Limit WGMMA rs-dot splitting to 2: 1% MoE speedup

Fixes the number of K-dimension splits at 2 instead of K/instrK to optimize in-register pipelining.

#Triton #NVIDIA #Performance #WGMMA #Pipelining

January 7, 2026

[Triton] Pass explicit captures to WarpSpecializePartitionsOp: a more consistent IR structure

Moves the explicit captures of WarpSpecializeOp to WarpSpecializePartitionsOp, which actually consumes them, making the IR structure consistent.

#Triton #MLIR #Warp Specialization #IR Design #Compiler

January 7, 2026

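[triton] Lower overhead by simplifying how Proton correlates runtime and metrics

An analysis of the redesign of the Proton profiler's Data/Metric interface, which removes double locking and unnecessary lookups and reduces profiling overhead.

#Triton #Proton #Profiling #Performance #Refactoring

January 4, 2026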

[Triton] Add backend support for AMD TDM L2 prefetch

Implements the MLIR op definition and LLVM lowering for the TDM L2 prefetch hardware feature of AMD GPUs.

#Triton #AMD #L2 Cache #Prefetch #MLIR #LLVM Lowering

December 31, 2025

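[triton] Remove the ineffective sinkSecondLoad optimization from AMD ReorderInstructions

An analysis of the PR that simplifies ReorderInstructions by removing the sinkSecondLoad optimization, which triggered only in limited cases and had no performance impact.

#Triton #AMD #Refactoring #Dead Code #Cleanup

December 30, 2025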

[triton] AMD: generalize the CTA field of the WMMA layout to a LinearLayout to support swizzled warp layouts

An analysis of the change that replaces the warpsPerCTA/tilesPerWarp parameters with a LinearLayout-based ctaLayout so that more complex arrangements, such as gfx1250's swizzled warp layout, can be expressed.

#Triton #AMD #WMMA #LinearLayout #GPU Layout #gfx1250

December 29, 2025

[Triton] Fix a compiler crash on atomic-cas with non-integer types on AMD

Wraps float-typed atomic CAS in integer bitcasts to prevent a core dump when generating the LLVM cmpxchg instruction.

#Triton #AMD #Bug Fix #Atomic Operations #LLVM

December 27, 2025

[Triton] Fix missing kernel arguments in LLVM debug information

Fixes a bug where pointee information for pointer types was lost when lowering a Triton FuncOp to LLVM IR, leaving kernel arguments out of the debug info.

#Triton #LLVM #Debug Info #Bug Fix

December 25, 2025

[Triton] More robust ext slice rematerialization: preserve the original on failure

Fixes a bug in the layout conversion removal pass where the original data was corrupted when the ext backward slice search failed.

#Triton #MLIR #Compiler Optimization #Layout Conversion #Bug Fix

December 24, 2025

[Triton] Add Proton profiler tests for tensor descriptors and two-CTA mode

Adds Proton profiler tests for kernels that use tensor descriptors and two-CTA mode, extending profiling coverage.

#Triton #Proton #Testing #Tensor Descriptor #Two-CTA

December 23, 2025

[Triton] Enable AsyncCopy by default on AMD gfx950/gfx1250: better pipeline performance

Enables asynchronous copy by default on the gfx950 and gfx1250 architectures to improve memory pipeline efficiency.

#Triton #AMD #AsyncCopy #GPU Pipeline #Performance

December 23, 2025

[Triton] Fix barrier placement logic in SWP loop lowering

Determines the barrier position between an MMA's non-pipelined operand and tmem_load accurately, based on the linearized schedule.

#Triton #NVIDIA #Pipelining #SWP #Bug Fix

December 22, 2025

[Triton] Tuning matmul_ogs settings on AMD RDNA: up to 46% faster

Adjusts the block_m/block_n/block_k settings on RDNA3/4 GPUs to resolve register spilling.

#Triton #AMD #RDNA #Performance #Kernel Tuning

December 22, 2025

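[triton] Benchmarking mxfp8 and nvfp4 block-scaled matrix multiplication against cuBLAS in Triton

Introduces a cuBLAS-based baseline to validate Triton's block-scaled matrix multiplication performance and improves the tutorial.

#Triton #cuBLAS #mxfp8 #nvfp4 #Performance

December 19, 2025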

[triton] Triton AMD backend optimization: faster GEMM through subtiling

An analysis of a GEMM optimization that introduces subtiling on AMD GPUs to reduce shared memory usage and eliminate register spills.

#Triton #AMD #GEMM #GPU #Optimization

December 19, 2025

[triton] Triton PROTON: much better performance from lower CUDA graph profiling overhead and a new MsgPack API

An analysis of the code changes that reduce the CUDA graph profiling overhead of the Triton PROTON library and add a MsgPack serialization API, improving performance by 3x to 10x.

#Triton #PROTON #CUDA #Profiling #Optimization #MsgPack #C++ #Python

December 19, 2025

[triton] Split CGAEncodingAttr::getDefault into get1CTALayout/get1DLayout for multi-CTA support

An analysis of the change that splits the 1CTA-only getDefault function into two clearly named functions and fixes the coalesce utilities for multi-CTA environments.

#Triton #MLIR #CGA #Multi-CTA #Encoding #Compiler

December 18, 2025

[Triton] Fix false-positive deadlock detection in ConSan when a barrier has multiple arrivals

Models barrier_expect as an arrive, resolving false-positive deadlocks that occur when several TMA copies share the same barrier.

#Triton #ConSan #Concurrency Sanitizer #Bug Fix #TMA

December 19, 2025

[Triton] Move Gluon validation logic into the C++ verifier: support for dimension-reducing loads

Moves Python assert-based validation into the C++ verifier so that dimension-reducing loads are supported correctly.

#Triton #Gluon #MLIR #Verifier #Refactoring

December 18, 2025

[Triton] Support scaled batched matrix multiply in the frontend

Changes dot_scaled shape validation to use the last two dimensions so that BMM operations are handled correctly.

#Triton #Frontend #BMM #MXFP #Bug Fix

December 18, 2025

[Triton] Fix an AMD bug with a missing scf.if else branch: deduceMinCountBetweeOps

Fixes a bug where the async wait count was computed incorrectly when an scf.if has no else region.

#Triton #AMD #MLIR #Bug Fix #Compiler

December 18, 2025

[triton] Analysis of 4-warp scheduling optimization in the Triton GFX1250 MXFP GEMM kernel

A look at the performance optimization of Triton's AMD GFX1250 MXFP GEMM kernel through 4-warp scheduling and asynchronous copy (Async Copy).

#Triton #AMD #GEMM #GPU #Optimization

December 18, 2025

[triton] Overlapping MMA and the epilogue by deferring wgmma wait(0) to the first use of the accumulator

An analysis of the PR that defers the wait(0) after a pipelined wgmma loop to the first use of the accumulator, overlapping epilogue work with the MMA.

#Triton #NVIDIA #WGMMA #Pipeline #Optimization

December 17, 2025

[Triton] Add async_copy_local_to_global for gfx1250

Adds the op definition, lowering, and tests that support GFX1250 shared-to-global asynchronous copy in Gluon.

#Triton #AMD #gfx1250 #Gluon #Async Copy

December 16, 2025

[triton] Add explicit semantics documentation for async operations

An analysis of the PR that documents explicit semantics and synchronization requirements for Triton's async_copy, async_commit_group, and async_wait operations.

#Triton #AsyncOps #Documentation #MLIR #Semantics #CopyAsync

December 16, 2025

[triton] Triton AMD kernel optimization: better performance through loop unrolling

An analysis of applying loop unrolling (unroll_factor=2) to the Triton AMD FlashAttention kernel to make register rotation more efficient and reduce compute overhead.

#Triton #AMD #GPU #Optimization #FlashAttention

December 15, 2025

[Triton] Stronger Gluon dialect verifier and better error messages

Adds NVMMASharedEncoding validation and TMA function verifiers, and makes DotOpMMASmemLoader fallible to prevent illegal instructions.

#Triton #Gluon #MLIR #Verifier #Error Handling

December 14, 2025

[triton] AMD: add warp pipeline support, from the Gluon frontend down to LLVM lowering

An analysis of the full pipeline, from the Gluon API down to LLVM IR, that supports warp-pipelined loops in which different warps execute staggered stages on AMD GPUs.

#Triton #AMD #Warp Pipeline #Gluon #LLVM #GPU Optimization

December 11, 2025

[Triton] Add buffer aliasing support to ConSan: stronger memory safety analysis

Adds BufferRegion-based aliasing analysis to ConSan (Concurrency Sanitizer) to detect concurrency bugs between overlapping buffers.

#Triton #ConSan #Aliasing #Memory Safety #Static Analysis

December 11, 2025

[Triton] Fix a missing wait insertion in WGMMA register pipelining

Fixes a bug where the wgmma wait required before accessing the accumulator in the persistent matmul epilogue was missing.

#Triton #NVIDIA #MLIR #Bug Fix #Pipelining

December 11, 2025

[Triton] Force mul.bf16x2 in MXFP4→BF16 conversion: 1% MoE speedup

Works around an LLVM auto-vectorization failure so that ptxas emits the HMUL2 instruction.

#Triton #NVIDIA #Performance #PTX #Inline Assembly

December 11, 2025

[Triton] Add an optional device argument to preload

Adds a device argument to the preload method of JIT functions so that a kernel can be preloaded on a specific device.

#Triton #JIT #Frontend #Python

December 9, 2025

[Triton] Adjust num_stages for bf16/fp16 x mxfp combinations: avoid exceeding shared memory

Resolves the shared memory overflow caused by weight upcasting in mixed bf16/fp16 and mxfp matrix multiplication by adjusting num_stages.

#Triton #MXFP #Shared Memory #Matrix Multiplication #Performance Tuning

December 9, 2025

[triton] X scale swizzling optimization for ragged mode in Triton

An optimization that supports X scale swizzling for MXFP8 operations in Triton's ragged mode to reduce matrix multiplication latency.

#Triton #GPU #Optimization #MXFP8 #MatMul

December 8, 2025

[triton] Add defensive handling of corrupted cache files

An analysis of a change that wraps JSON cache file reads in try-except to handle parsing errors, preventing crashes caused by a corrupted cache.

#Triton #Cache #Robustness #BugFix

December 6, 2025

[triton] Free symmetric memory in benchmarks

An analysis of the PR that explicitly frees the symmetric memory pool after each run in distributed benchmarks and tests to prevent memory leaks.

#Triton #Benchmark #Distributed #Memory Management

December 5, 2025

[Triton] Fix the small-batch-size benchmark on Hopper

Adjusts the num_warps setting and adds test cases for the small-batch MLP benchmark on Hopper GPUs.

#Triton #Benchmark #Hopper #MLP #Bug Fix

December 4, 2025

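[Triton] Partial rollback of the SwiGLU exp2 optimization: numerical accuracy first

Temporarily rolls back the exp2_ftz optimization because it caused numerical differences in some models.

#Triton #Kernel #Numerical Stability #Revert #SwiGLU

December 4, 2025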

[Triton] Disable stack trace generation in performance diagnostics tests

Removes the stacktraces option from the diagnostics context, cutting test time from 15 minutes to under 1 second.

#Triton #Testing #Performance #Developer Experience

December 3, 2025

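[triton] MXFP8 input scale swizzling optimization for the Blackwell architecture in Triton

Introduces input scale swizzling and TMA for MXFP8 matrix multiplication on Blackwell GPUs, reducing the relative overhead from 1.7x to 1.1x.

#Triton #Blackwell #GPU #Optimization #MXFP8

December 2, 2025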

[Triton] Nested loop support for warp specialization

Extends the partition-schedule pass recursively and hoists tmem_alloc to the top level, providing end-to-end support for nested loops.

#Triton #NVIDIA #Warp Specialization #Nested Loop #Pipelining

December 2, 2025

[Triton] Fix two bugs in matmul with MXFP-format output

Fixes the scale mask computation and a shared memory overflow in the MXFP downcast epilogue.

#Triton #MXFP #Matmul #Bug Fix

December 1, 2025

[triton] Triton JIT compiler optimization: 10,000x speedup from removing `inspect.getclosurevars`

Removing `inspect.getclosurevars` from the Triton JIT compiler made captured-scope lookups 10,000x faster.

#Triton #JIT #Performance Optimization #Python #Compiler #inspect

November 25, 2025

[Triton] Add multi-CTA and multicast support to AMD TDM operations

Automatically sets the multicast mask on TDM loads/stores based on the CGALayout, enabling data sharing across the cluster.

#Triton #AMD #TDM #Multi-CTA #Multicast

November 24, 2025

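[triton] Refactoring matrix multiplication in Triton Kernel: better readability and maintainability

An analysis of a refactoring that tidies up Triton's matrix multiplication modules and improves variable naming conventions for more consistent and maintainable code.

#Triton #GPU #Kernel #Refactoring #MatrixMultiplication

November 23, 2025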

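[triton] Out-of-tree TTIR/TTGIR pass plugin system

An analysis of the PR that introduces a plugin system to Triton so that TTIR/TTGIR compiler passes can be registered and run from outside the tree. It covers dynamic library loading and the C API-based extension mechanism.

#Triton #Plugin System #MLIR #Compiler Pass #Extensibility

November 22, 2025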

[Triton] TensorMemory layout support in Gluon's to_linear_layout

Extends the to_linear_layout function to handle TensorMemory encodings in addition to Distributed and Shared.

#Triton #Gluon #NVIDIA #TensorMemory #LinearLayout

November 21, 2025

[Triton] Apply the clamp optimization to scalars too, using fmin.xorsign.abs

On Hopper and later, optimizes the clamp(x, -limit, limit) pattern to min.xorsign.abs for scalar values as well.

#Triton #NVIDIA #Compiler Optimization #PTX #Scalar

November 21, 2025

[triton] Allow redundant copies along the block dimension in AMD async copy

An analysis of the PR that allows redundant copies along the block dimension in async_copy_global_to_local on AMD GPUs so that, in multi-CTA environments, each CTA correctly copies data into its own shared memory.

#Triton #AMD GPU #Async Copy #Multi-CTA

November 20, 2025

[triton] Reimplement tl.cat as permute+reshape+join to guarantee deterministic behavior

An analysis of the change that removes CatOp from Triton's tl.cat operation and replaces it with a combination of permute, reshape, and join to guarantee deterministic results.

#Triton #Compiler #MLIR #Tensor Operations #Determinism

November 19, 2025

[Triton] Introduce a pip cache directory in AMD CI: coping with network failures

Uses a pip cache directory in the AMD GPU CI environment to prevent build failures caused by network delays.

#Triton #AMD #CI/CD #GitHub Actions #DevOps

November 19, 2025

[triton] Add block scaled matmul support on AMD GPUs

An analysis of the PR that adds AMD CDNA4 GPU support to Triton's block scaled matrix multiplication tutorial and documents the scale preshuffling logic.

#Triton #AMD #CDNA4 #MatMul #MXFP #GPU

November 19, 2025

[Triton] Add multicast support to tt.LoadOp on AMD gfx1250

Implements multicast, which uses cluster_load to load into the registers of multiple CTAs at once.

#Triton #AMD #gfx1250 #Multicast #Load

November 18, 2025

[Triton] Fix loss of cp_async alignment information in the pipeliner

Adds optional contiguity information to the async_copy op so that alignment information survives compiler transformations.

#Triton #Compiler #Pipeliner #Async Copy #Bug Fix

November 18, 2025

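[Triton] Add a test for safely passing JIT functions to kernels

Verifies that a JIT function (higher-order function) can be passed to a kernel as a constexpr argument and that the cache key is updated correctly.

#Triton #Compiler

November 18, 2025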

[Triton] async_copy multicast support on gfx1250

Adds cluster-load-based multicast to async_copy_global_to_local on the AMD gfx1250 target to support data sharing between CTAs.

#Triton #AMD #Multicast #Async Copy #gfx1250

November 16, 2025

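[triton] AMD: optimizing memory-bound kernels with a new custom scheduler option in the LLVM backend

An analysis of how adding a schedule_hint option that selects the iterative-ilp scheduler to the AMD HIP backend improved the performance of a memory-bound Flash Attention kernel.

#Triton #AMD #LLVM #Scheduler #Flash Attention #Performance

November 14, 2025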

[Triton] Automatic cleanup of language patches in TRITON_INTERPRET mode

Makes interpreter mode automatically restore triton.language to its original state after patching it.

#Triton #Interpreter #Python #Refactoring

November 14, 2025

[Triton] Add a coalesced layout to Gluon: optimizing memory access efficiency

Introduces a coalesced layout to the Gluon DSL that automatically guarantees coalescing of global memory accesses.

#Triton #Gluon #Memory Coalescing #Layout #GPU Optimization

November 13, 2025

[Triton] Fix tuple/constexpr handling in JIT specialization data serialization

Fixes JSON serialization so that tuple and constexpr values round-trip correctly.

#Triton #Compiler

November 12, 2025

[Triton] Add LDS memory barrier support for AMD gfx1250

Implements the LDS memory barrier op of the gfx1250 architecture and exposes it in the Gluon DSL.

#Triton #AMD #LDS #Memory Barrier #gfx1250 #Gluon

November 11, 2025

[Triton] Fix a Proton memory leak and remove unused variables

Fixes a memory leak in the Proton profiler and cleans up unused variables to improve resource management.

#Triton #Proton #Memory Leak #Bug Fix #Code Cleanup

November 11, 2025

[Triton] Add TMA store verification to the Concurrency Sanitizer

Triton's concurrency sanitizer (CONSAN) now also tracks the memory accesses of TMA store operations to detect data races.

#Triton #Sanitizer #TMA #Concurrency #NVIDIA

November 10, 2025

[Triton] Add MemoryCounterWaitOp and its ROCDL lowering for AMD

Introduces MemoryCounterWaitOp, an abstraction over waiting on hardware memory counters, to manage architecture-specific waitcnt encodings in one place.

#Triton #AMD #ROCDL #Synchronization #ISA

November 10, 2025

[Triton] Add a custom scheduler option to the AMD LLVM backend

Extends schedule_hint so that LLVM scheduling strategies such as memory-bound-attention can be specified.

#Triton #Compiler

November 10, 2025

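[Triton] Clearer explanation of Proton's default buffer size: improved documentation and code comments

Clarifies the documentation and code comments for the Proton profiler's default buffer size setting.

#Triton #Proton #Documentation #Profiling #Developer Experience

November 8, 2025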

[triton] Triton PROTON: lower profiling overhead through FinalizeOp optimization

Refactors FinalizeOp in Triton PROTON to write in parallel at warp granularity, cutting profiling overhead by more than 2x in the best case.

#Triton #GPU #Optimization #Compiler #Profiling

November 7, 2025

[triton] AMD/Gluon: add async_copy runtime tests on gfx1250 and enable UpdateAsyncWaitCnt

An analysis of the change that adds runtime tests for async_copy across various shared memory layout combinations on the AMD gfx1250 architecture and enables UpdateAsyncWaitCnt.

#Triton #AMD #Gluon #gfx1250 #Async Copy #Testing

November 6, 2025

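[triton] Optimizing MXFP conversion performance in Triton with TMA and vectorized operations

An analysis of how Triton's MXFP8/MXFP4 conversion kernels were substantially accelerated through TMA, vectorized stores, and tiling tuning.

#Triton #MXFP #GPU #Optimization #HPC

November 6, 2025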

[triton] Global memory support for Proton's intra-kernel profiler

An analysis of the PR that adds global memory buffer support to Triton Proton's intra-kernel profiler, making profiling possible even when shared memory is scarce.

#Triton #Proton #Profiler #Global Memory #GPU Performance

November 5, 2025

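[triton] Tutorials: show units in benchmark result tables

An analysis of the change that includes the ylabel unit in the columns of the benchmark result tables in the Triton tutorials, making the results easier to read.

#Triton #Tutorial #Benchmark #UX #Python

November 4, 2025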

[Triton] Optimizing s_xxx instruction placement in AMD FAv3 pingpong

Improves the placement of scalar instructions between the memory cluster and the compute cluster to raise GPU pipeline utilization.

#Triton #AMD #Scheduling #Performance #FlashAttention

November 3, 2025

[Triton] Add the Gluon async_copy API for gfx1250

Supports async global-to-shared copy through the Gluon frontend on the AMD gfx1250 target.

#Triton #Compiler

November 3, 2025

[triton] Simplify the warp specialization pipeline by merging rewrite-partition-dependencies into insert-aref

An analysis of the PR that simplifies the compilation pipeline by merging Triton warp specialization's partition dependency rewriting pass into the insert-aref pass.

#Triton #WarpSpecialization #MLIR #Compiler #Refactoring

November 3, 2025

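[triton] AMD: fix a performance regression by including BufferLoadToLocal in UpdateAsyncWaitCount

An analysis of the fix that includes the buffer_load_to_local instruction in the async wait count computation, resolving the slowdown caused by conservative waits.

#Triton #AMD #Async #Buffer Operations #Performance

November 2, 2025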

[Triton] Make async_wait commit-group-based in AMD Gluon

Changes the async_wait semantics to count commit groups rather than hardware instructions, making Gluon kernels easier to write.

#Triton #AMD #Gluon #Async Wait #Compiler

November 1, 2025

[Triton] Reland of the aggregate cache key change

The PR that includes aggregate members in the cache key, fixed and reapplied after a revert.

#Triton #Compiler

October 30, 2025

[Triton] Initial multi-CTA support in Gluon

Implements the TMEM load/store encoding computation for multi-CTA layouts, together with the PlanCTA pass.

#Triton #Compiler

October 30, 2025

[triton] Separate split-K reduction from inter-expert reduction in matmul

An analysis of the PR that separates split-k reduction from inter-expert reduction in matmul_ogs in Triton Kernels, making the MoE pipeline more flexible.

#Triton #MatMul #SplitK #MoE #Reduction #Refactoring

October 29, 2025

[Triton] Clarify async transaction semantics with the new AMD amdgpu.async_wait op

Defines an async_wait based on the AMD hardware instruction count as a separate op, distinct from the commit-group-based semantics of ttg.async_wait.

#Triton #AMD #MLIR #Async Wait #IR Design

October 29, 2025

[Triton] Fix per-type branching of output constraints for the WGMMA wait op

Uses =h instead of the incorrect =r constraint for 16-bit types such as f16, removing unnecessary cvt instructions.

#Triton #NVIDIA #Bug Fix #Inline Assembly #WGMMA

October 29, 2025

[Triton] vLLM-compatible CUDA Graph tracing for Expert Parallelism

Improves symmetric memory pool initialization and CUDA Graph compatibility for Expert Parallelism.

#Triton #Compiler

October 28, 2025

[Triton] Temporary revert of the aggregate cache key change

A PR that temporarily reverts the earlier aggregate cache key change because it caused problems in CI.

#Triton #Compiler

October 28, 2025

[triton] Improve memory descriptor consistency by resetting alloc_shape in memdesc_index

An analysis of the PR that resets alloc_shape in the Triton compiler's MemDescIndexOp, resolving memory descriptor type mismatches when creating subviews.

#Triton #Compiler #MLIR #MemoryDescriptor #Backend

October 27, 2025

[Triton] Include aggregate members in the cache key

Reflects the members of aggregate types passed to JIT functions in the cache key to guarantee cache consistency.

#Triton #Frontend #Cache #JIT

October 24, 2025

[triton] AMD: ttg.async_wait lowering and asynccnt-based synchronization on gfx1250

An analysis of the async_wait lowering and UpdateAsyncWaitCnt implementation that account for async loads using a separate asynccnt counter on the AMD gfx1250 architecture.

#Triton #AMD #gfx1250 #Async #LLVM #GPU Architecture

October 24, 2025

[Triton] Return the correct shared memory size for gfx1250

Refactors TargetInfo to use a switch statement so that it returns the correct shared memory size on the AMD gfx1250 target.

#Triton #AMD #GPU #Shared Memory

October 23, 2025

[Triton] More robust handling of unrealized_conversion_cast in AxisInfo

Falls back to the pessimistic state on a rank mismatch to prevent a crash.

#Triton #Compiler

October 22, 2025

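[triton] [NVIDIA] Analysis of native FP4 scaled matmul support and performance optimization for SM120

An analysis of how implementing hardware acceleration for the FP4 data type in Triton improved Llama3-8B benchmark performance by about 2x.

#Triton #NVIDIA #FP4 #GPU #Optimization #LLM

October 20, 2025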

[Triton] Better Gluon layout validation error messages

Provides clearer error messages when layout validation of a TMA copy operation fails.

#Triton #Gluon #NVIDIA #Error Handling #DX

October 20, 2025

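[triton] Add a basic Expert Parallelism implementation and a reduce kernel

An analysis of the PR that adds a basic implementation of Expert Parallelism and a standalone reduce kernel to the Triton Kernels library, supporting distributed processing of MoE workloads.

#Triton #ExpertParallelism #MoE #Reduce #Distributed #GPU

October 16, 2025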

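[triton] More flexible layout support by relaxing restrictions on the AMD ds_read_tr instruction

An analysis of the PR that removes unnecessary restrictions on the ds_read_tr instruction of AMD GPUs and makes it usable with arbitrary linear layouts.

#Triton #AMD #LDS #LinearLayout #SharedMemory #Optimization

October 16, 2025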

[triton] Warp specialization: preserving annotations by swapping the order of OptimizePartitionWarps and SWP

An analysis of the change that moves the OptimizePartitionWarps pass to run after SWP (software pipelining), fixing the problem of it dropping the loop annotations on local_load.

#Triton #Warp Specialization #Compiler Pass #MLIR #Pipeline

October 14, 2025

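[triton] AMD: fix range analysis bugs and strengthen the dependence of buffer-ops on range analysis

An analysis of the change that fixes fundamental range analysis bugs, such as ignoring the control-flow relationship of tl.assume and computing wrong make_range bounds, and makes buffer-ops perform correct range validation.

#Triton #AMD #Range Analysis #Buffer Operations #Large Tensor #Bug Fix

October 12, 2025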

[Triton] Add an m*n constraint to split_k

Adds tests and logic that validate a constraint on the m*n size when split_k is used in matmul.

#Triton #Compiler

October 11, 2025

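[triton] Add an mma_scaled helper and execution tests to Gluon

An analysis of the PR that adds a helper function supporting the Blackwell tcgen05_mma_scaled operation, along with execution tests, to the Triton Gluon frontend.

#Triton #Gluon #Blackwell #MMA #Scaled #TensorCore

October 9, 2025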

[Triton] Add TDM store support on gfx1250

Implements asynchronous shared-to-global store operations through the Tensor Data Mover on the AMD gfx1250 target.

#Triton #AMD #gfx1250 #TDM #Async

October 9, 2025

[Triton] Fix a Blackwell barrierSlice typing bug

Fixes a type mismatch bug in barrierSlice creation when numStages is 1.

#Triton #NVIDIA #Blackwell #Bug Fix #Barrier

October 9, 2025

[Triton] PaddedLayout + AsyncCopy pipelining support on gfx950

Extends pipeline lowering so that padded shared memory layouts can be used together with AsyncCopy on the AMD CDNA architecture.

#Triton #AMD #AsyncCopy #Padding #Pipeline

October 7, 2025

[Triton] swizzling=0 matrix descriptor support and generalized WGMMA lowering

Implements matrix descriptor creation for the swizzling=0 case and WGMMA lowering based on SharedLinearEncoding.

#Triton #NVIDIA #WGMMA #Hopper #SharedLayout

October 6, 2025

[Triton] Limit the vec size to the min interval for ds_read_tr + padded layouts

Fixes ds_read_tr on padded shared memory layouts so that its vector size does not exceed the padding interval.

#Triton #AMD #Shared Memory #Padding #Bug Fix

October 6, 2025

[triton] Triton GPU compiler optimization: folding layout conversions into TMEM stores

An analysis of the optimization that removes unnecessary layout conversions from Triton's TMEM store operation, fixing a Flex Attention performance regression.

#Triton #Compiler #Optimization #MLIR #GPU

October 3, 2025

[Triton] Simplify the debuginfo tests: remove subprocess

Refactors the debug info tests, which used to spawn a separate process, to use pytest parametrize and monkeypatch.

#Triton #Testing #Refactoring #Python

October 3, 2025

[Triton] TMEM store layout conversion optimization: restoring FlexAttention performance

Folds unnecessary layout conversions into the TMEM store, fixing a FlexAttention performance regression.

#Triton #MLIR #FlexAttention #Compiler Optimization #NVIDIA

October 3, 2025

[triton] Merge tcgen05.cp into the generic matrix descriptor lowering

An analysis of the PR that reduces code duplication in the Triton NVIDIA backend by routing SMEM descriptor loading for the tcgen05.cp instruction through the generic matrix descriptor lowering path.

#Triton #NVIDIA #Blackwell #MatrixDescriptor #LLVM #Backend

October 2, 2025

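[triton] ConSan: invalidate the JIT cache by guaranteeing kernel recompilation on state changes

An analysis of the change that includes the Concurrency Sanitizer state in the compile options so that kernels are automatically recompiled when it is enabled or disabled.

#Triton #ConSan #JIT #Cache #Sanitizer #Debugging

October 1, 2025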