PR Analysis

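[Ray] 57% reduction in Actor Pool Map Operator scheduler overhead

An analysis of a PR that achieved a 57% performance improvement in Ray Data's actor pool scheduler for environments with 500+ actors, through protobuf enum caching, fewer dict lookups, and constant hoisting.

#Ray #Ray Data #Actor Pool #Python Optimization #Protobuf #Performance

March 23, 2026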

[vllm] ViT Full CUDA Graph: full CUDA Graph support for the vision encoder

Introduces EncoderCudaGraphManager to implement CUDA Graph capture/replay for the ViT encoder, accelerating vision model inference.

#vllm #Performance

March 23, 2026

[Ultralytics] Faster training by vectorizing the preprocess step of the detect/obb loss computation

Replaces the per-batch for loop with scatter_add-based vector operations to speed up the preprocess step of the detect/obb loss.

#Ultralytics #YOLO #PyTorch #Vectorization #Performance

March 22, 2026

[sglang] SM120 FP8 Blockwise GEMM performance optimization in SGLang: introducing the Pingpong schedule

Introduces the Pingpong schedule for FP8 Blockwise GEMM on the SM120 architecture, improving performance by roughly 2x at small M sizes.

#CUDA #CUTLASS #GEMM #FP8 #SGLang #SM120

March 22, 2026

[Axolotl] Adding bias, dropout, DoRA, and embedding support to the LoRA kernels

An analysis of how Axolotl's Triton LoRA kernels were extended to support bias parameters, dropout, DoRA (Weight-Decomposed LoRA), and embedding layers.

#Axolotl #LoRA #DoRA #Triton #LLM Training #Performance #PEFT

March 22, 2026

[Axolotl] Liger kernel support for Qwen 3.5 models and a new fused RMSNorm+Gated kernel

An analysis of the addition of Liger FLCE kernel support for Qwen 3.5 / Qwen 3.5 MoE models and a fused RMSNorm+SiLU gate Triton kernel to Axolotl.

#Axolotl #Liger Kernel #Qwen 3.5 #RMSNorm #Triton #LLM Training #Performance

March 22, 2026

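[Open WebUI] Adding a confirmation dialog when deleting memory items

Adds a confirmation dialog to individual memory deletion, a UX improvement that prevents accidental deletes.

#Open WebUI #Svelte #UX #Performance

March 21, 2026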

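[Axolotl] Shrinking the autotune search space of the ScatterMoE LoRA Triton kernels

An analysis of removing unnecessarily large block sizes from the autotune configs of the ScatterMoE LoRA Triton kernels, which shortens compile time and avoids exceeding shared memory limits.

#Axolotl #Triton #ScatterMoE #LoRA #Autotune #Performance #GPU

March 21, 2026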

[ray] Ray Data's next-generation data source API: DataSourceV2 design and optimization strategy

Analyzes how Ray Data's new DataSourceV2 architecture achieves per-data-source optimization and extensibility.

#Ray #DataEngineering #DistributedSystems #Python #PyArrow

March 21, 2026

[Triton] Propagating buffer cache modifiers to LLVM IR on AMD RDNA3

Fixes an issue where the .cg/.cs/.cv/.wt cache modifiers were ignored on RDNA3 targets, enabling non-temporal memory access.

#Triton #AMD #RDNA3 #Cache Optimization #LLVM IR

March 21, 2026

[triton] Partial support for TMA and cp.async operations in the Global Sanitizer

An analysis of a PR that adds tensor descriptor decoding and memory access tracking for TMA/cp.async operations to Triton's Global Sanitizer.

#Triton #GSan #Sanitizer #TMA #AsyncCopy #Debugging

March 20, 2026

[axolotl] Fixing the Context Parallel double sequence-splitting bug: preventing duplicate application with a noop context manager

Analyzes how a noop context manager patch resolved an issue where accelerate and axolotl each split the sequence during Context Parallel training, splitting it twice.

#Axolotl #Context Parallel #Distributed Training #Bug Fix

March 20, 2026

[PaddleOCR] Fixing the MCP server to parse all OCR result batches

Fixes a bug where only the first batch of local OCR results was processed, so that the full set of results is parsed correctly.

#PaddleOCR #MCP #Bug Fix #OCR #Python

March 20, 2026

[Ultralytics] Optimizing the keypoint batch loop in Pose Loss with vector operations

Replaces the for loop that organizes keypoints per batch during Pose model training with scatter_add-based vectorization.

#Ultralytics #YOLO #Pose Estimation #Vectorization #PyTorch

March 20, 2026

[axolotl] Fixing the Tensor Parallelism batch_size calculation bug: switching to dp_world_size

Analyzes a fix for a bug where batch_size and total_num_steps were miscalculated under Tensor Parallelism. The calculation is now based on dp_world_size, and parameterized tests were added.

#Axolotl #Tensor Parallelism #Distributed Training #Bug Fix

March 20, 2026

[axolotl] Improving the Gemma 3 QLoRA config: freezing the Vision Tower and removing model_type

Analyzes the pattern of removing the unnecessary explicit model_type from the Gemma 3 QLoRA training config and freezing the Vision Tower via unfrozen_parameters.

#Axolotl #Gemma3 #QLoRA #Fine-tuning #Configuration

March 20, 2026

[Axolotl] ScatterMoE LoRA optimization: benchmarks, kernel splitting, and autograd integration

An analysis of adding a benchmark tool to the ScatterMoE LoRA Triton kernels and optimizing automatic fused/split forward selection and autograd integration for large-expert models.

#Axolotl #ScatterMoE #LoRA #Triton #MoE #Benchmark #GPU #Performance

March 19, 2026

[triton] Custom DSL Plugin Ops support

Analyzes a PR that adds custom op registration to the Triton plugin system so that third parties can integrate their own DSL operations into the Triton frontend.

#Triton #Plugin System #DSL #Extensibility #Frontend

March 19, 2026

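[triton] Simplifying and restoring the getTranspositionSelectors algorithm

Analyzes a change that resolves correctness issues with multiple mixed transpositions and clarifies the mathematical decomposition of the prmt selector algorithm.

#Triton #GPU #LinearLayout #Optimization #Algorithm

March 19, 2026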

[triton] Adding multi-CTA support to ConSan

Analyzes a PR that adds multi-CTA cluster support to Triton's Concurrency Sanitizer (ConSan) so that it correctly tracks the scratch memory state shared by multiple CTAs within a cluster.

#Triton #GPU Compiler #Concurrency Sanitizer #Multi-CTA #CUDA

March 19, 2026