#Decoding Speedup

1개의 포스트

[논문리뷰] TPLA: Tensor Parallel Latent Attention for Efficient Disaggregated Prefill & Decode Inference

본 논문은 DeepSeek-V2 에서 도입된 Multi-Head Latent Attention (MLA) 이 Tensor Parallelism (TP) 환경에서 KV 캐시 메모리 절감 효과를 잃는 문제를 해결하고자 합니다.

#Review #LLM Inference #Tensor Parallelism #KV Cache Optimization #Latent Attention #Memory Efficiency #Decoding Speedup #Prefill/Decode Separation #Reparameterization

2025년 8월 25일