#Attention

7 posts

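[triton] Performance Optimization Analysis of the Triton Gluon Attention Kernel via Autotuning

Autotuning logic that dynamically selects kernel configurations was introduced into the Triton Gluon example, improving performance across a variety of scenarios.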

#Triton #GPU #Optimization #Attention #DeepLearning

April 23, 2026

[Paper Review] TriAttention: Efficient Long Reasoning with Trigonometric KV Compression

A detailed review of the paper 'TriAttention: Efficient Long Reasoning with Trigonometric KV Compression', posted on arXiv.

#Review #KV Cache #LLM #Attention #RoPE #Compression #Reasoning

April 6, 2026

[sglang] Prefill Batch Support for the TRT-LLM Sparse MLA Kernel

Fixes the TRT-LLM sparse MLA kernel to use the correct page table conversion for prefill batches, improving accuracy.

#SGLang #TRT-LLM #MLA #DeepSeek #Attention

April 1, 2026

[faster-qwen3-tts] Fixing BF16 StaticCache Hidden-State Divergence by Switching to SDPA

Switching from eager attention to SDPA resolves the BF16 hidden-state divergence that varied with the StaticCache padding length.

#faster-qwen3-tts #TTS #CUDA Graphs #Attention

March 4, 2026

[pytorch] MPS: Fixing Memory Corruption in 2-pass SDPA by Forcing a Float Accumulator

Analyzes how a memory corruption bug caused by a half-precision accumulator in the Apple MPS backend's 2-pass Scaled Dot-Product Attention was resolved by forcing float32.

#PyTorch #MPS #SDPA #Attention #Precision #Apple Silicon #Bug Fix

February 24, 2026

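[triton] Triton AMD Backend: Analysis of the 8-Wave PingPong Attention Kernel Implementation

A look at the 8-Wave PingPong Attention kernel implementation and the pipelining optimization techniques used to improve performance on AMD GPUs.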

#Triton #AMD #GPU #Attention #Optimization

February 10, 2026

[Paper Review] TurboDiffusion: Accelerating Video Diffusion Models by 100-200 Times

A detailed review of the paper 'TurboDiffusion: Accelerating Video Diffusion Models by 100-200 Times', posted on arXiv.

#Review #Video Generation #Diffusion Models #Acceleration #Quantization #Attention #Step Distillation #Performance Optimization #RTX 5090

December 24, 2025