#triton

4 posts

[vLLM] Tree Attention: Tree Attention for Speculative Decoding

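An analysis of vLLM's Tree Attention backend, covering how it builds attention masks to verify tree-structured draft tokens in speculative decoding and how its Triton-based unified attention works.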

#vllm #tree-attention #speculative-decoding #triton

April 8, 2026

[vLLM] Other Attention Backends: GDN, Flex, Triton, DiffKV, MLA Sparse, CPU/ROCm

An analysis of vLLM's other attention backends, examining the implementation characteristics of GatedDeltaNet, FlexAttention, Triton, DiffKV, MLA Sparse, ROCm AIter, and more.

#vllm #attention #backends #triton #rocm

April 8, 2026

[vLLM] Lightning & Linear Attention: Implementing Linear Attention

An analysis of vLLM's linear attention backend and its Lightning Attention implementation, covering SSM-style state management and diagonal-block computation in Triton kernels.

#vllm #linear-attention #lightning-attention #ssm #triton

April 8, 2026

[vLLM] Context Parallelism: Parallelizing Across the Context

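A code-level analysis of vLLM's Decode Context Parallelism (DCP) implementation: how ranks exchange attention outputs and LSE (log-sum-exp) values via All-to-All communication and then combine them with a Triton kernel.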

#vllm #context parallelism #distributed #all-to-all #triton

April 7, 2026