#Performance Optimization

21 posts

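[sglang] Performance Optimization on AMD ROCm: Implementing Fused QK GemmaRMSNorm with Triton

Analyzes a case where four separate kernels were fused into a single Triton kernel on the ROCm platform to improve QK normalization performance.

#SGLang #Triton #ROCm #Performance Optimization #LLM

April 25, 2026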

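[flashinfer] High-Performance Distributed Computation in FlashInfer: An Analysis of the All-Gather Matmul Optimization

The All-gather Matmul operation added to FlashInfer uses a Push-Wait algorithm to maximize GEMM performance in distributed environments.

#FlashInfer #Distributed Computing #CUDA #GEMM #Performance Optimization

April 24, 2026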

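[cpython] Improving the Python JIT Shim Build Process: From Runtime Compilation to Build-Time Linking

Switches the Python JIT shim from runtime compilation to build-time linking, improving performance and ease of debugging.

#Python #JIT #Performance Optimization #Build System #CPython #Compiler

April 23, 2026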

[vllm] vLLM Adds a CUTLASS-Based SM100 Kernel for MXFP4-Quantized MoE Models, Improving Performance

vLLM added a new CUTLASS kernel on SM100 for inference with MXFP4-quantized MoE models, improving performance.

#vLLM #MXFP4 #MoE #Quantization #CUTLASS #Performance Optimization #SM100

April 18, 2026

[open-webui] Open WebUI Performance Improvement: Optimizing Profile Image Loading by Reusing DB Sessions

Open WebUI improved performance by avoiding redundant DB session creation when loading profile images.

#Python #FastAPI #SQLAlchemy #Performance Optimization #Database

April 17, 2026

[vllm] vLLM ROCm Aiter Backend Performance Optimization: Removing Unnecessary Zero-Filling

Removes unnecessary GPU kernel launches in the vLLM ROCm Aiter backend to improve decode performance.

#vLLM #ROCm #Aiter #Performance Optimization #GPU Computing #LLM

April 10, 2026

[sglang] SGLang Performance Optimization on AMD GPUs: Eliminating LayerNorm Overhead with Aiter CK Kernels

Analyzes an optimization that improves performance on AMD GPUs by reducing unnecessary LayerNorm kernel calls.

#SGLang #AMD #ROCm #Performance Optimization #LayerNorm

April 9, 2026

[sglang] SGLang Ngram Speculative Decoding Optimization: Faster Incremental MatchState Updates

Analyzes a case where removing unnecessary heap allocations during MatchState updates in Ngram-based Speculative Decoding improved performance by 1.4x.

#SGLang #Speculative Decoding #C++ #Performance Optimization #Trie

April 6, 2026

[cpython] Optimizing PySet_Contains in CPython: Improving Performance with Lock-Free Lookup

An analysis of the PR that introduces lock-free lookup into CPython's PySet_Contains function to improve performance.

#CPython #Python Internals #Performance Optimization #Lock-Free #Concurrency

April 3, 2026

[sglang] A Guide to Performance Optimization with Ring-SP on Ascend NPUs in SGLang

Presents a case where Ring-SP improved inference performance of the Wan2.1 model by about 1.88x on Ascend NPUs, along with a benchmarking guide.

#SGLang #Ascend NPU #Ring-SP #Performance Optimization #Diffusion Models

April 1, 2026

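[sglang] SGLang: Kernel Fusion Optimization to Maximize Qwen3-VL Decoding Performance on ROCm

An analysis of an optimization that fuses four separate kernel calls into a single HIP kernel, substantially reducing decoding latency for the Qwen3-VL model.

#SGLang #ROCm #Kernel Fusion #LLM #Performance Optimization

April 1, 2026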

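[sglang] Introducing CUDA Graph for the Whisper Model in SGLang: A Performance Optimization Analysis

Analyzes the optimization technique and implementation details behind SGLang's introduction of CUDA Graph for the Whisper model, which improved throughput by 36%.

#SGLang #Whisper #CUDA Graph #Performance Optimization #LLM

March 28, 2026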

[Paper Review] veScale-FSDP: Flexible and High-Performance FSDP at Scale

A detailed review of the paper 'veScale-FSDP: Flexible and High-Performance FSDP at Scale', posted on arXiv by Cong Xie.

#Review #FSDP #Distributed Training #LLM #GPU Scaling #Memory Optimization #Performance Optimization #Structure-Aware Training #RaggedShard

February 26, 2026

[ACE-Step-1.5] Introducing a Native MLX DiT Backend for Apple Silicon: 2-3x Performance Improvement

A native MLX backend implementation that removes PyTorch MPS overhead and speeds up DiT inference on Apple Silicon by 2-3x.

#Apple Silicon #MLX #Diffusion Transformer #Performance Optimization #PyTorch

February 11, 2026

[ACE-Step-1.5] Transforming 5Hz LM Inference Speed on Apple Silicon MacBooks with a Native MLX Backend

Introduces a native MLX backend that uses the Metal GPU on Apple Silicon MacBooks to substantially improve 5Hz LM inference speed.

#MLX #Apple Silicon #Metal GPU #LLM Inference #Performance Optimization #ACE-Step

February 8, 2026

[Paper Review] TimeBill: Time-Budgeted Inference for Large Language Models

A detailed review of the paper 'TimeBill: Time-Budgeted Inference for Large Language Models', posted on arXiv by Yehan Ma.

#Review #LLM Inference #Time Budgeting #KV Cache Eviction #Response Length Prediction #Execution Time Estimation #Real-time AI #Performance Optimization

December 28, 2025

[Paper Review] TurboDiffusion: Accelerating Video Diffusion Models by 100-200 Times

A detailed review of the paper 'TurboDiffusion: Accelerating Video Diffusion Models by 100-200 Times', posted on arXiv.

#Review #Video Generation #Diffusion Models #Acceleration #Quantization #Attention #Step Distillation #Performance Optimization #RTX 5090

December 24, 2025

[Paper Review] SCALE: Selective Resource Allocation for Overcoming Performance Bottlenecks in Mathematical Test-time Scaling

A detailed review of the paper 'SCALE: Selective Resource Allocation for Overcoming Performance Bottlenecks in Mathematical Test-time Scaling', posted on arXiv.

#Review #LLM Reasoning #Test-time Scaling #Resource Allocation #Dual-process Theory #Mathematical Reasoning #Adaptive Computation #Performance Optimization

December 1, 2025

[Paper Review] Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation

A detailed review of the paper 'Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation', posted on arXiv by Jiahao He.

#Review #World Simulation #Video Generation #Block Diffusion #Semi-Autoregressive #KV Cache Management #Inference Engine #Long Video Generation #Performance Optimization

November 26, 2025

[Paper Review] Workload Schedulers -- Genesis, Algorithms and Differences

A detailed review of the paper 'Workload Schedulers -- Genesis, Algorithms and Differences', posted on arXiv by Vladimir Getov.

#Review #Workload Scheduling #Process Scheduling #Job Scheduling #Big Data Processing #Resource Management #Distributed Systems #Scheduling Algorithms #Performance Optimization

November 16, 2025

[Paper Review] Trove: A Flexible Toolkit for Dense Retrieval

A detailed review of the paper 'Trove: A Flexible Toolkit for Dense Retrieval', posted on arXiv.

#Review #Dense Retrieval #Retrieval Toolkit #Data Management #Distributed Training #Model Customization #Hard Negative Mining #Hugging Face Integration #Performance Optimization

November 9, 2025