#attention

5 posts

[sglang] LTX2.3 HQ Denoising Performance Optimization: Efficient Model Calls with Attention Skip

Improves performance by skipping unnecessary attention computation during LTX2.3 HQ guided denoising.

#sglang #optimization #performance #deep learning #denoising #attention

May 3, 2026

[vLLM] Other Attention Backends: GDN, Flex, Triton, DiffKV, MLA Sparse, CPU/ROCm

Analyzes vLLM's various attention backends, covering the implementation characteristics of GatedDeltaNet, FlexAttention, Triton, DiffKV, MLA Sparse, ROCm AIter, and others.

#vllm #attention #backends #triton #rocm

April 8, 2026

[vLLM] RoPE Variants: 15+ Rotary Position Encodings

Surveys the 15+ RoPE variants implemented in vLLM and analyzes the code structure, from the base implementation through YaRN and Llama3 RoPE.

#vllm #rope #positional-encoding #attention

April 7, 2026

[vLLM] FlashInfer: An Attention Engine Specialized for LLM Serving

Analyzes how vLLM integrates the FlashInfer backend, which optimizes prefill and decode separately and supports multiple KV cache formats.

#vllm #flashinfer #attention #decode-optimization

April 7, 2026

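[vLLM] FlashAttention: How IO-Aware Tiling Accelerates Attention

Analyzes how vLLM integrates FlashAttention, which resolves the IO bottleneck of attention computation with a tiling technique that accounts for the GPU memory hierarchy.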

#vllm #flash-attention #gpu-optimization #attention

April 7, 2026