PR Analysis

[triton] Switching AMD Async Load to ROCDL Ops

An analysis of an NFC (No Functional Change) PR that replaces string-based LLVM intrinsic calls with type-safe ROCDL ops in AMD GPU async load operations.

#Triton #AMD #ROCDL #AsyncCopy #NFC #Refactoring

February 9, 2026

[Open WebUI] Eliminating N+1 Queries When Batch-Adding Knowledge Files

Resolves the N+1 problem in the batch file-add endpoint by replacing per-file queries with a single IN-clause query.

#Open WebUI #Python #Performance #Database #N+1 Query

February 9, 2026
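
The difference between per-item queries and a single IN-clause query can be sketched as follows. This is a minimal illustration using the standard-library `sqlite3` module; the `file` table and both function names are hypothetical and are not taken from Open WebUI's code.

```python
import sqlite3

# Hypothetical schema for illustration only; not Open WebUI's actual models.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE file (id TEXT PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO file VALUES (?, ?)",
    [("a", "a.txt"), ("b", "b.txt"), ("c", "c.txt")],
)


def fetch_files_n_plus_one(conn, file_ids):
    """Issue one SELECT per ID: N round trips for N files."""
    rows = []
    for file_id in file_ids:
        row = conn.execute(
            "SELECT id, name FROM file WHERE id = ?", (file_id,)
        ).fetchone()
        if row is not None:
            rows.append(row)
    return rows


def fetch_files_batched(conn, file_ids):
    """Issue a single SELECT with an IN clause for all IDs."""
    if not file_ids:
        return []  # an empty IN () list is not valid SQL
    placeholders = ", ".join("?" for _ in file_ids)
    query = f"SELECT id, name FROM file WHERE id IN ({placeholders})"
    return conn.execute(query, list(file_ids)).fetchall()
```

Both functions return the same rows for existing IDs; the batched version does so in one query regardless of how many IDs are passed.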

[Ray Serve] Removing the pop-all/re-add Cycle in stop_replicas()

Instead of popping every replica and re-adding them, a single-pass removal based on a set of IDs yields up to a 6x speedup.

#Ray #Python #Performance #Serve #Algorithm

February 9, 2026
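
A single-pass, ID-set-based removal might look roughly like the sketch below. The `Replica` dataclass and `stop_replicas_single_pass` are hypothetical stand-ins for illustration; Ray Serve's real replica container and method signature are not shown in this summary.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    # Hypothetical stand-in for a replica record; not Ray Serve's class.
    replica_id: str


def stop_replicas_single_pass(replicas, ids_to_stop):
    """Partition `replicas` in one pass using O(1) set membership checks.

    Mutates `replicas` in place to hold only the kept replicas and
    returns the stopped ones, preserving the original order of both.
    """
    ids = set(ids_to_stop)
    kept, stopped = [], []
    for replica in replicas:
        if replica.replica_id in ids:
            stopped.append(replica)
        else:
            kept.append(replica)
    replicas[:] = kept
    return stopped
```

The set lookup keeps the cost linear in the number of replicas, and no replica is removed and reinserted just to be kept.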

[Ray Serve] Caching the cloudpickle Deserialization Result in AutoscalingPolicy

Caches the result of the cloudpickle.loads() call that was repeated on every autoscaling tick, for an 8x speedup.

#Ray #Python #Performance #Serve #Caching

February 9, 2026
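
The caching idea can be sketched as below. To stay self-contained, this example uses the standard-library `pickle` module as a stand-in for cloudpickle, and the `CachedPolicy` class, its `load_count` counter, and `get_policy()` are hypothetical names, not Ray Serve's API.

```python
import pickle

_UNSET = object()  # sentinel so that a deserialized None is still cached


class CachedPolicy:
    """Deserialize a serialized policy once and reuse the result."""

    def __init__(self, serialized: bytes):
        self._serialized = serialized
        self._cached = _UNSET
        self.load_count = 0  # for illustration: counts real deserializations

    def get_policy(self):
        if self._cached is _UNSET:
            # pickle stands in for cloudpickle here (assumption for the sketch).
            self._cached = pickle.loads(self._serialized)
            self.load_count += 1
        return self._cached


# Simulate several autoscaling ticks with a trivially picklable "policy".
policy_holder = CachedPolicy(pickle.dumps(max))
for _ in range(5):
    decision = policy_holder.get_policy()([1, 3, 2])
```

After the loop, the payload has been deserialized only once even though the policy was requested on every simulated tick.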

[triton] Fixing an FPSan Crash When Using Warp Specialization + TMem

An analysis of a fix for a crash that occurred when the Floating-point Sanitizer referenced values from outside the scope while accessing tensor memory inside a WarpSpecialize partition.

#Triton #FPSan #NVIDIA #WarpSpecialize #TensorMemory #BugFix

February 9, 2026

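[pytorch] CI: Speeding Up CI by Caching TIMM Pretrained Models in a Shared HF Cache

An analysis of a change that adds logic to PyTorch CI to detect TIMM pretrained model weights in a shared HuggingFace cache directory and to enable online downloads only when the weights are not cached.

#PyTorch #CI #TIMM #HuggingFace #Caching #GitHub Actions

February 9, 2026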

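[Ray Serve] Optimizing Cache Refresh by Fixing a ClusterNodeInfoCache Sorting Bug and Removing Redundant GCS RPCs

An analysis of an optimization that fixes three problems at once: an ignored sorted() return value, a duplicate GCS connection, and static data being rebuilt on every tick.

#Ray #Python #Performance #Cache #Distributed Systems

February 9, 2026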
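
The "ignored sorted() return value" class of bug is easy to reproduce in isolation. The sketch below uses hypothetical function names and made-up node IDs; it illustrates the general Python pitfall, not the actual ClusterNodeInfoCache code.

```python
def node_ids_buggy(nodes):
    ids = [node_id for node_id, _ in nodes]
    sorted(ids)  # bug: sorted() returns a NEW list, which is discarded here
    return ids


def node_ids_fixed(nodes):
    ids = [node_id for node_id, _ in nodes]
    return sorted(ids)  # use the returned list (or call ids.sort() in place)


# Made-up (node_id, address) pairs for illustration.
nodes = [("node-b", "10.0.0.2"), ("node-a", "10.0.0.1")]
```

With this input, `node_ids_buggy(nodes)` keeps the original order, while `node_ids_fixed(nodes)` returns the IDs in sorted order.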

[triton] Fixing smem Offsets at Function Calls in Membar Analysis

An analysis of a PR that fixes Triton's membar analysis so that allocation offsets are correctly applied when a callee function's shared memory accesses are translated into the caller's context.

#Triton #Memory Barrier #Shared Memory #Function Call #Bug Fix

February 9, 2026

[triton] Extending the Membar Pass for Cluster Environments

An analysis of a PR that extends Triton's membar analysis for cluster environments by adding a buffer ID to AllocationSlice and supporting fine-grained filters at the slice and op level.

#Triton #Memory Barrier #Cluster #Shared Memory #Static Analysis

February 9, 2026

[triton] Generic Multi-CTA convert_layout Support

An analysis of a PR that extends Triton's convert_layout operation to handle multi-CTA environments generically. It looks at how cluster barriers and distributed shared memory are used to transfer data between CTAs.

#Triton #GPU Compiler #Multi-CTA #Layout Conversion #MLIR

February 9, 2026

[Triton] TMA im2col Mode: Gluon API Implementation

The Gluon DSL API implementation in the TMA im2col series, which makes im2col-mode TMA copies directly usable from Python.

#Triton #NVIDIA #TMA #im2col #Gluon #Convolution

February 9, 2026

[ACE-Step-1.5] Transforming 5Hz LM Inference Speed on Apple Silicon MacBooks with a Native MLX Backend

Introduces a native MLX backend that uses the Metal GPU on Apple Silicon MacBooks to dramatically improve 5Hz LM inference speed.

#MLX #Apple Silicon #Metal GPU #LLM Inference #Performance Optimization #ACE-Step

February 8, 2026

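[Loki] Adding Object Storage Latency Simulation to the LogQL Benchmarks

An analysis of a PR that adds a flag to the Loki LogQL benchmarks for simulating the latency of object storage such as S3/GCS, enabling performance measurements closer to a production environment.

#Grafana Loki #Go #Benchmarking #Object Storage #Latency Simulation #LogQL

February 7, 2026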

[triton] Workload Balancing for Persistent Kernels with Blackwell GPU Cluster Launch Control Support

An analysis of a PR that adds the CLC (Cluster Launch Control) feature of NVIDIA Blackwell SM100+ GPUs to Triton Gluon, enabling dynamic work distribution in persistent kernels.

#Triton #NVIDIA #Blackwell #GPU #Gluon

February 6, 2026

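[Ray] Memory Monitor Refactoring: Gaining Testability Through cgroup Path Injection

An analysis of a PR that refactors Ray's memory monitor so that the cgroup path can be injected, allowing memory usage to be mocked with a fake cgroup.

#Ray #C++ #Memory Monitor #Testability #Dependency Injection #Resource Isolation

February 6, 2026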

[triton] Introducing FpSan, a Floating Point Sanitizer

An analysis of a PR that introduces FpSan (Floating Point Sanitizer) to Triton to detect floating-point errors in GPU kernels at runtime. An MLIR pass rewrites FP operations into an integer-payload form.

#Triton #GPU Compiler #Floating Point #Sanitizer #MLIR

February 6, 2026

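[Loki] Slicing Support for memory.Bitmap: Handling Unaligned Offsets

An analysis of a PR that adds slicing to Loki's memory.Bitmap and extends it to support operations on bitmaps that are not aligned to word boundaries.

#Grafana Loki #Go #Bitmap #Memory #Data Structure #Performance

February 6, 2026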

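[triton] Triton Compiler Optimization: Introducing In-thread Tree Reduction

Improves Gluon attention kernel performance by converting Triton's reduction operations into a tree structure and applying in-thread vectorization.

#Triton #Compiler #Optimization #LLVM #GPU

February 6, 2026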

[Triton] TMA im2col Mode: LLVM Lowering Implementation

The fifth PR in the TMA im2col series, implementing im2col descriptor creation and the LLVM IR lowering of TMA copies.

#Triton #NVIDIA #TMA #im2col #LLVM #Compiler

February 6, 2026

[triton] Adding a Warp-Pipeline f16 GEMM Example for AMD GFX1250

An analysis of a change that adds an f16 GEMM kernel example using TDM and warp pipelining on the AMD GFX1250 architecture.

#Triton #AMD #GPU #GFX1250 #GEMM #WarpPipeline

February 5, 2026