#flash-attention

2 posts

[vLLM] MTP & DFlash: Multi-Token Prediction and Flash-Based Drafting

An analysis of vLLM's DFlash speculative decoding implementation. The post walks through the core logic of DFlashProposer, which implements multi-token prediction (MTP) on top of Flash Attention.

#vllm #speculative-decoding #mtp #dflash #flash-attention

April 8, 2026

[vLLM] FlashAttention: How IO-Aware Tiling Accelerates Attention Computation

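An analysis of how FlashAttention is integrated into vLLM. FlashAttention uses a tiling technique that accounts for the GPU memory hierarchy to resolve the IO bottleneck in attention computation.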

#vllm #flash-attention #gpu-optimization #attention

April 7, 2026