#KV Cache

16 posts

[sglang] Optimizing FP8 KV Cache Writes on AMD GPUs: Better Performance Through Triton Kernel Fusion

Fuses Triton kernels to reduce overhead and improve FP8 KV cache write performance on AMD GPUs.

#AMD GPU #FP8 #Triton Kernel #KV Cache #Optimization #SGLang

April 25, 2026

[Paper Review] KV Packet: Recomputation-Free Context-Independent KV Caching for LLMs

A detailed review of the paper 'KV Packet: Recomputation-Free Context-Independent KV Caching for LLMs', published on arXiv.

#Review #LLM #KV Cache #RAG #Recomputation-Free #Soft-token Adapter #Self-Supervised Distillation #Attention Dynamics

April 16, 2026

[vllm] vLLM TurboQuant: Maximizing LLM Serving Efficiency with KV Cache Compression

vLLM's TurboQuant compresses the KV cache to reduce memory usage and improve LLM serving efficiency.

#vLLM #LLM #KV Cache #Quantization #Optimization #Triton #GPU Memory

April 15, 2026

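[SGLang] RadixAttention: The Core of Radix Tree-Based Prefix Caching

An analysis of RadixAttention, SGLang's core innovation. The post walks through, with code, how KV cache prefixes are shared using a radix tree data structure and what lies behind the 5x performance improvement over PagedAttention.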

#sglang #RadixAttention #Prefix Caching #Radix Tree #KV Cache

April 10, 2026

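[sglang] SGLang NIXL: Fixing a Disaggregated KV Cache Transfer Bug and Improving Performance in Heterogeneous TP Environments

Fixes a KV cache transfer issue in SGLang NIXL under heterogeneous TP configurations, improving the stability of disaggregated serving.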

#SGLang #NIXL #KV Cache #Disaggregation #TP Heterogeneous #Optimization

April 7, 2026

[Paper Review] TriAttention: Efficient Long Reasoning with Trigonometric KV Compression

A detailed review of the paper 'TriAttention: Efficient Long Reasoning with Trigonometric KV Compression', published on arXiv.

#Review #KV Cache #LLM #Attention #RoPE #Compression #Reasoning

April 6, 2026

[Paper Review] Universal YOCO for Efficient Depth Scaling

A detailed review of the paper 'Universal YOCO for Efficient Depth Scaling', published on arXiv.

#Review #Large Language Models #Recursive Computation #YOCO #Depth Scaling #Inference Efficiency #KV Cache #Decoder-Decoder Architecture

April 1, 2026

[sglang] Introducing HiSparse: Efficient KV Cache Management for Sparse Attention Models

HiSparse stores idle KV cache in CPU memory to maximize batch size and throughput for sparse attention models such as DeepSeek-V3.

#SGLang #LLM #KV Cache #Sparse Attention #CUDA

March 23, 2026

[Paper Review] DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models

A detailed review of the paper 'DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models', published on arXiv.

#Review #Diffusion Models #Vision Language Models #Autoregressive Models #Diffusion Finetuning #Block Diffusion #Multimodal AI #KV Cache

December 17, 2025

[Paper Review] BlockVid: Block Diffusion for High-Quality and Consistent Minute-Long Video Generation

A detailed review of the paper 'BlockVid: Block Diffusion for High-Quality and Consistent Minute-Long Video Generation', published on arXiv.

#Review #Block Diffusion #Video Generation #Temporal Consistency #KV Cache #Semi-Autoregressive #Video Quality Metrics #Long Video Generation

December 2, 2025

[Paper Review] Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout

A detailed review of the paper 'Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout', posted to arXiv by Pinar Yanardag.

#Review #Autoregressive Video Generation #Rotary Positional Embedding #Infinite Video Generation #Action Control #Cinematic Transitions #Video Diffusion Models #KV Cache

December 1, 2025

[Paper Review] Latent Collaboration in Multi-Agent Systems

A detailed review of the paper 'Latent Collaboration in Multi-Agent Systems', published on arXiv.

#Review #Multi-Agent Systems #Large Language Models #Latent Space #Latent Reasoning #Latent Communication #KV Cache #Computational Efficiency #Training-Free

November 26, 2025

[Paper Review] TiDAR: Think in Diffusion, Talk in Autoregression

A detailed review of the paper 'TiDAR: Think in Diffusion, Talk in Autoregression', published on arXiv.

#Review #Hybrid LLM Architecture #Diffusion-Autoregressive #Parallel Token Generation #Speculative Decoding #Structured Attention Masks #LLM Inference Acceleration #KV Cache

November 12, 2025

[Paper Review] Attention Is All You Need for KV Cache in Diffusion LLMs

A detailed review of the paper 'Attention Is All You Need for KV Cache in Diffusion LLMs', published on arXiv.

#Review #Diffusion LLMs #KV Cache #Adaptive Caching #Inference Optimization #Attention Mechanism #Latency Reduction #Generative AI

October 17, 2025

[Paper Review] d^2Cache: Accelerating Diffusion-Based LLMs via Dual Adaptive Caching

A detailed review of the paper 'd^2Cache: Accelerating Diffusion-Based LLMs via Dual Adaptive Caching', posted to arXiv by Jiarui Wang.

#Review #Diffusion Models #Large Language Models (LLMs) #Inference Acceleration #KV Cache #Bidirectional Attention #Adaptive Caching #Token Selection

October 1, 2025

[Paper Review] LongLive: Real-time Interactive Long Video Generation

A detailed review of the paper 'LongLive: Real-time Interactive Long Video Generation', published on arXiv.

#Review #Long Video Generation #Real-time #Interactive AI #Autoregressive Models #KV Cache #Streaming Tuning #Attention Sink #Diffusion Models

September 29, 2025