#MLIR

39 posts

[triton] AMD Canonicalize Pointers: Hardening asymmetric fat pointer handling for arith.select

An analysis of a PR that fixes pointer canonicalization in the Triton AMD backend so that an arith.select with only one operand split into base+offset is handled safely.

#Triton #AMD #Compiler #Bug Fix #MLIR

April 1, 2026

[triton] GSan: 2-10x speedup from AxisInfo-based deduplication of shadow updates

An analysis of a PR that uses the contiguity property from AxisInfo in Triton's Global Sanitizer to eliminate redundant shadow updates, achieving up to a 10x speedup on FP16 matmul.

#Triton #GPU #Sanitizer #Optimization #MLIR

March 27, 2026

[triton] Adding string argument support to the triton-ext Plugin API

An analysis of a PR that extends the addPass API of Triton extension plugins to accept string arguments, making custom passes more configurable.

#Triton #Plugin #API #MLIR #Extension

March 18, 2026

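[Triton] Adding async read dependencies to FenceAsync: ensuring consistency between st.shared and copy_local_to_global

Fixes a bug where fence insertion was missing for async proxy read operations, ensuring data consistency between shared memory writes and copies to global memory.

#Triton #MLIR #NVIDIA #Memory Fence #GPU Compiler

March 2, 2026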

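[triton] Triton AMD GPU: Optimizing address computation for buffer loads inside loops

Improves efficiency by changing buffer loads inside loops from offset-based address computation to base-pointer increments.

#Triton #AMD #Compiler Optimization #MLIR #GPU

February 20, 2026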

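[triton] AMD: Implementing shared memory partitioning with LLVM lowering support for PartitionedSharedEncodingAttr

An analysis of the LLVM IR lowering for PartitionedSharedEncodingAttr, which splits a tensor across multiple physical shared memory partitions to reduce partition conflicts.

#Triton #AMD #LLVM #Shared Memory #Partitioning #MLIR

February 10, 2026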

[triton] Generic multi-CTA convert_layout support

An analysis of a PR that extends Triton's convert_layout operation to handle multi-CTA environments generically. It looks at how cluster barriers and distributed shared memory are used to transfer data between CTAs.

#Triton #GPU Compiler #Multi-CTA #Layout Conversion #MLIR

February 9, 2026

[triton] FpSan: Introducing the Floating Point Sanitizer

An analysis of the PR that introduces FpSan (Floating Point Sanitizer) to Triton, which detects floating-point errors in GPU kernels at runtime. It rewrites FP operations into an integer-payload form through an MLIR pass.

#Triton #GPU Compiler #Floating Point #Sanitizer #MLIR

February 6, 2026

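[triton] ConSan compile time cut from 19 minutes to 34 seconds: a large-scale optimization

An analysis of a large PR that improves the compile time of the Triton Concurrency Sanitizer by 33x. It includes a range of optimizations such as IR size reduction, warp-local layouts, and deduplication of helper functions.

#Triton #ConSan #Compile Time #MLIR #Optimization

February 5, 2026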

[Triton] Shared memory partitioning support via the new AMD PartitionedSharedEncodingAttr

Adds a new encoding attribute that distributes a tensor across multiple physical shared memory partitions to reduce bank conflicts.

#Triton #AMD #MLIR #Shared Memory #Memory Optimization

February 4, 2026

[Triton] Introducing AMD PartitionedSharedEncodingAttr: reducing shared memory partition conflicts

Reduces bank conflicts by distributing a tensor across multiple physical shared memory partitions.

#Triton #AMD #MLIR #Shared Memory #Architecture

February 2, 2026

[triton] Tensor descriptor support for NVIDIA TMA im2col mode

An analysis of a PR that integrates NVIDIA TMA's im2col mode into Triton's tensor descriptor system. By introducing TensorDescInterface and adding TensorDescIm2ColType, it supports convolution-friendly memory access patterns.

#Triton #NVIDIA #TMA #Im2col #Convolution #MLIR

January 26, 2026

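[Triton] Adding the AMD PrepareIfCombining pass: an scf.if merging optimization

Moves instructions out from between adjacent scf.if operations that share the same condition so that the canonicalizer can merge the ifs.

#Triton #AMD #MLIR #Compiler Optimization #Control Flow

January 24, 2026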

[Triton] Improved documentation of AxisInfo's divisibility initialization logic

Clearly documents why MulIOp resets divisibility to 1 when contiguity > 1.

#Triton #Documentation #MLIR #AxisInfo #Compiler Analysis

January 22, 2026

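[triton] [Blackwell] Analyzing Triton's tcgen05.ld.red optimization for NVIDIA's next-generation architecture

An analysis of a case where the Blackwell architecture's ability to perform a TMEM load and a reduction simultaneously was implemented in Triton Gluon to improve performance.

#Triton #Blackwell #NVIDIA #GPU #Optimization #MLIR

January 16, 2026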

[Triton] TritonGPU barrier redesign: guaranteeing memory visibility per address space

Replaces gpu.barrier with a TritonGPU-specific barrier op to control shared/global memory visibility at a fine granularity.

#Triton #MLIR #GPU Barrier #Memory Visibility #Compiler Infrastructure

January 16, 2026

[triton] Warp Specialization: An improved partition scheduling pass based on a data flow graph

An analysis of the rewrite of the existing partition scheduling around a data flow graph and incremental heuristic merging, which makes it more general.

#Triton #Warp Specialization #Partition Scheduling #Data Flow Graph #Compiler #MLIR

January 16, 2026

[Triton] Improving and simplifying ReduceOp lowering with LinearLayout

Redesigns ReduceOp lowering around LinearLayout, taking advantage of shmem swizzling and removing unnecessary round-trips.

#Triton #MLIR #Compiler Optimization #LinearLayout #Refactoring

January 12, 2026

[Triton] Choosing the optimal layout for small async_cp

Selects the coalesced encoding independently for async copies of small tensors, removing unnecessary convert_layout ops.

#Triton #MLIR #Compiler Optimization #GPU #Async Copy

January 9, 2026

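[triton] Removing the no-op sinkDotConversion optimization from AMD ReorderInstructions

An analysis of a PR that reduces code complexity by removing the sinkDotConversion optimization, which had no effect because it ran after ConvertLayout had already been replaced by local_load.

#Triton #AMD #Refactoring #Dead Code #MLIR

January 9, 2026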

[Triton] Retiring Proton's GlobalScratchAllocOp: consolidation into the shared TritonGPU op

Replaces the Proton-specific GlobalScratchAllocOp with TritonGPU's shared op and distinguishes allocation policies through a backend attribute.

#Triton #Proton #MLIR #Refactoring #Op Deprecation

January 7, 2026

[triton] Hardening the Gluon TMA op verifiers and adding an illegal instruction sanitize mode

An analysis of a PR that strengthens the verifiers for TMA operations in Triton Gluon, checks that the element counts of the descriptor and the tensor match, and adds an illegal instruction sanitize mode.

#Triton #Gluon #TMA #Verifier #Sanitizer #MLIR

January 7, 2026

[Triton] Passing explicit captures to WarpSpecializePartitionsOp: a more consistent IR structure

Moves the explicit captures of WarpSpecializeOp onto WarpSpecializePartitionsOp, the op that actually consumes them, making the IR structure consistent.

#Triton #MLIR #Warp Specialization #IR Design #Compiler

January 7, 2026

[Triton] Adding backend support for AMD TDM L2 prefetch

Implements the MLIR op definition and LLVM lowering for the TDM L2 prefetch hardware feature on AMD GPUs.

#Triton #AMD #L2 Cache #Prefetch #MLIR #LLVM Lowering

December 31, 2025

[Triton] More robust ext slice rematerialization: preserving the original on failure

Fixes a bug in the layout conversion removal pass where the original data was corrupted when the ext backward slice search failed.

#Triton #MLIR #Compiler Optimization #Layout Conversion #Bug Fix

December 24, 2025

[triton] Splitting CGAEncodingAttr::getDefault into get1CTALayout/get1DLayout for multi-CTA support

An analysis of splitting the 1CTA-only getDefault function into two clearly named functions and fixing the coalesce utilities for multi-CTA environments.

#Triton #MLIR #CGA #Multi-CTA #Encoding #Compiler

December 18, 2025

[Triton] Moving Gluon validation logic into the C++ verifier: support for dimension-reducing loads

Moves Python assert-based validation into the C++ verifier so that dimension-reducing loads are supported correctly.

#Triton #Gluon #MLIR #Verifier #Refactoring

December 18, 2025

[Triton] Fixing an AMD bug with a missing scf.if else branch: deduceMinCountBetweeOps

Fixes a bug where the async wait count was computed incorrectly when an scf.if has no else region.

#Triton #AMD #MLIR #Bug Fix #Compiler

December 18, 2025

[triton] Adding explicit semantics documentation for async operations

An analysis of a PR that documents explicit semantics and synchronization requirements for Triton's async_copy, async_commit_group, and async_wait operations.

#Triton #AsyncOps #Documentation #MLIR #Semantics #CopyAsync

December 16, 2025

[Triton] Hardening the Gluon dialect verifiers and improving error messages

Prevents illegal instructions by validating NVMMASharedEncoding, adding verifiers for TMA functions, and making DotOpMMASmemLoader fallible.

#Triton #Gluon #MLIR #Verifier #Error Handling

December 14, 2025

[Triton] Fixing a missing wait insertion in WGMMA register pipelining

Fixes a bug where the wgmma wait required before accessing the accumulator in a persistent matmul epilogue was missing.

#Triton #NVIDIA #MLIR #Bug Fix #Pipelining

December 11, 2025

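[triton] A plugin system for out-of-tree TTIR/TTGIR passes

An analysis of a PR that introduces a plugin system to Triton so that TTIR/TTGIR compiler passes can be registered and run from outside the tree. It looks at dynamic library loading and the C API-based extension mechanism.

#Triton #Plugin System #MLIR #Compiler Pass #Extensibility

November 22, 2025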

[triton] Reimplementing tl.cat as permute+reshape+join to guarantee deterministic behavior

An analysis of a change that removes CatOp from Triton's tl.cat operation and replaces it with a combination of permute, reshape, and join to guarantee deterministic results.

#Triton #Compiler #MLIR #Tensor Operations #Determinism

November 19, 2025

[triton] Simplifying the Warp Specialization pipeline by merging rewrite-partition-dependencies into insert-aref

An analysis of a PR that simplifies the compilation pipeline by merging Triton Warp Specialization's partition dependency rewriting pass into the insert-aref pass.

#Triton #WarpSpecialization #MLIR #Compiler #Refactoring

November 3, 2025

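[Triton] Clarifying async transaction semantics with the new AMD amdgpu.async_wait op

Defines a separate op for an async_wait based on the AMD hardware instruction count, decoupled from the commit group-based semantics of ttg.async_wait.

#Triton #AMD #MLIR #Async Wait #IR Design

October 29, 2025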

[triton] Improving memory descriptor consistency by resetting alloc_shape in memdesc_index

An analysis of a PR that resets alloc_shape in the Triton compiler's MemDescIndexOp to resolve memory descriptor type mismatches when creating subviews.

#Triton #Compiler #MLIR #MemoryDescriptor #Backend

October 27, 2025

[triton] Warp Specialization: Preserving annotations by swapping the order of OptimizePartitionWarps and SWP

An analysis of moving the OptimizePartitionWarps pass to run after SWP (software pipelining) to fix a problem where it dropped the loop annotations on local_load.

#Triton #Warp Specialization #Compiler Pass #MLIR #Pipeline

October 14, 2025

[triton] Triton GPU compiler optimization: Folding layout conversions into TMEM Store

An analysis of an optimization that removes unnecessary layout conversions in Triton's TMEM Store operation to resolve a Flex Attention performance regression.

#Triton #Compiler #Optimization #MLIR #GPU

October 3, 2025

[Triton] TMEM Store layout conversion optimization: restoring FlexAttention performance

Resolves a FlexAttention performance regression by folding unnecessary layout conversions into TMEM Store.

#Triton #MLIR #FlexAttention #Compiler Optimization #NVIDIA

October 3, 2025