#Decoding Throughput

1 post

[Paper Review] NOSA: Native and Offloadable Sparse Attention

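This paper aims to resolve the memory bottleneck that arises during long-context decoding in large language models (LLMs), specifically the problem that KV cache size limits batch size and decoding throughput.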

#Review #Sparse Attention #KV Cache Offloading #LLMs #Decoding Throughput #Locality Constraint #Memory Optimization #Trainable Sparse Attention

October 16, 2025