#Sparse Attention

36개의 포스트

[논문리뷰] LVSA: Training-Free Sparse Attention for Long Video Diffusion

본 논문은 video diffusion transformers의 긴 영상 생성 과정에서 발생하는 dense self-attention의 연산 효율성 저하와 품질 저하 문제를 해결합니다.

#Review #Video Diffusion Transformers #Sparse Attention #Long Video Generation #Training-Free #FlashInfer #Attention Optimization

2026년 6월 1일

[논문리뷰] OSP-Next: Efficient High-Quality Video Generation with Sparse Sequence Parallelism, HiF8 Quantization, and Reinforcement Learning

본 논문은 기존 Diffusion Transformers(DiTs) 기반 비디오 생성 모델이 가진 2차 복잡도의 연산 비용 문제를 해결하고, 고해상도 비디오 생성 효율을 높이는 것을 목표로 한다.

#Review #Video Generation #Diffusion Transformers #Sparse Attention #Sequence Parallelism #Quantization #Reinforcement Learning

2026년 5월 27일

[논문리뷰] Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps

본 논문은 Long-context 추론 시 발생하는 full attention의 이차 비용(quadratic cost) 문제를 해결하기 위해 효율적인 스파스(sparse) 구조로의 전환을 제안한다.

#Review #Long-context LLM #Sparse Attention #Head Specialization #Dynamic Top-pp Selection #Efficient Inference #Self-distillation

2026년 5월 21일

[논문리뷰] MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference

본 논문은 Long-context LLM Inference에서 indexer 연산이 전체 비용의 지배적인 비중을 차지하는 문제를 해결하기 위해 MISA를 제안한다.

#Review #Large Language Models #Long-Context #Sparse Attention #Mixture of Experts #Indexer #Inference Efficiency #Retrieval

2026년 5월 10일

[SGLang] Double Sparsity: H-Sparsity와 T-Sparsity의 이중 최적화

SGLang의 Double Sparsity 백엔드를 분석한다. Head-level과 Token-level 두 가지 희소성을 동시에 활용하는 이중 최적화, Dense Attention 대비 메모리 절감 효과를 코드와 함께 살펴본다.

#sglang #Double Sparsity #H-Sparsity #T-Sparsity #Sparse Attention

2026년 4월 11일

[SGLang] NSA (Narrow Sparse Attention): DeepSeek의 스파스 어텐션

SGLang의 NSA 백엔드를 분석한다. DeepSeek의 Narrow Sparse Attention이 선택적 토큰만 어텐션하는 원리, 인덱서 구조, Triton/TileLang 커널을 코드와 함께 살펴본다.

#sglang #NSA #Sparse Attention #DeepSeek #Selective Attention

2026년 4월 11일

[논문리뷰] Understand and Accelerate Memory Processing Pipeline for Disaggregated LLM Inference

본 논문은 현대 LLM 추론에서 필수적인 긴 컨텍스트 처리 기법들이 파편화된 메모리 처리 구조로 인해 상당한 성능 저하를 일으킨다는 문제를 해결하고자 한다. 기존 LLM 최적화 방법들은 주로 개별적인 알고리즘 개선에 집중해 왔으며, 하드웨어 수준에서의 체계적인 가속 프레임워크가 부족하다는 한계가 있다.

#Review #LLM Inference #Memory Processing Pipeline #Heterogeneous Systems #GPU-FPGA #Sparse Attention #Retrieval-Augmented Generation

2026년 4월 1일

[논문리뷰] HISA: Efficient Hierarchical Indexing for Fine-Grained Sparse Attention

최근 Long-context LLM 환경에서 Token-level sparse attention 은 필수적인 연산 효율화 기법으로 자리 잡았으나, 이를 위한 핵심 모듈인 indexer가 여전히 full-prefix scan 을 수행하며 𝒪(L²) 의 연산 병목을 유발합니다.

#Review #Sparse Attention #Hierarchical Indexing #Long Context #LLM Inference #Computational Efficiency #DeepSeek

2026년 3월 30일

[sglang] HiSparse 도입: Sparse Attention 모델을 위한 효율적인 KV 캐시 관리

HiSparse는 CPU 메모리를 활용해 유휴 KV 캐시를 저장함으로써, DeepSeek-V3와 같은 Sparse Attention 모델의 배치 사이즈와 처리량을 극대화합니다.

#SGLang #LLM #KV Cache #Sparse Attention #CUDA

2026년 3월 23일

[논문리뷰] FlashPrefill: Instantaneous Pattern Discovery and Thresholding for Ultra-Fast Long-Context Prefilling

Large Language Models (LLMs)의 장문 컨텍스트 처리 시 자기회귀(self-attention)의 2차 복잡도로 인한 성능 병목현상 , 특히 계산 집약적인 프리필(prefilling) 단계의 높은 오버헤드 를 해결하는 것이 목표입니다.

#Review #Long-Context LLMs #Prefilling #Sparse Attention #Pattern Discovery #Dynamic Thresholding #Attention Speedup #Transformer Optimization

2026년 3월 8일

[논문리뷰] SpargeAttention2: Trainable Sparse Attention via Hybrid Top-k+Top-p Masking and Distillation Fine-Tuning

이 논문은 비디오 확산 모델에서 높은 희소성(sparsity)에서도 생성 품질 저하 없이 효율적인 학습 가능한(trainable) 스파스 어텐션 을 구현하는 것을 목표로 합니다.

#Review #Sparse Attention #Diffusion Models #Video Generation #Hybrid Masking #Distillation Fine-Tuning #Model Acceleration #Top-k #Top-p

2026년 2월 19일

[논문리뷰] Geometry-Aware Rotary Position Embedding for Consistent Video World Model

본 논문은 카메라 제어가 가능한 시각적 월드 모델(predictive visual world models)이 긴 궤적(long trajectories)에서 안정적인 장면 구조를 유지하지 못하고 기하학적 표류(geometric drift)를 겪는 문제 를 해결하는 것을 목표로 합니다.

#Review #Video World Model #Generative AI #Transformer #Positional Encoding #3D Consistency #View Synthesis #Sparse Attention #Loop Closure

2026년 2월 17일

[논문리뷰] GLM-5: from Vibe Coding to Agentic Engineering

본 논문은 AI 모델이 인간의 지시(vibe coding)에 의존하는 것을 넘어 자율적인 계획, 구현 및 반복 이 가능한 Agentic Engineering 패러다임으로 전환하는 것을 목표로 합니다.

#Review #Foundation Model #Agentic AI #Reinforcement Learning #Sparse Attention #Software Engineering #Long-Context Models #GPU Optimization

2026년 2월 17일

[논문리뷰] OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence

본 논문은 현대 비전 아키텍처가 시각 신호의 본질적인 중복성과 변별 정보의 희소성을 효율적으로 다루지 못한다는 문제의식에서 출발합니다.

#Review #Multimodal AI #Video Understanding #Sparse Attention #Vision Transformer #Codec-Aligned Processing #Self-Supervised Learning #Predictive Coding #Efficient AI

2026년 2월 15일

[논문리뷰] HySparse: A Hybrid Sparse Attention Architecture with Oracle Token Selection and KV Cache Sharing

본 논문은 기존 희소 어텐션(sparse attention) 방법론의 두 가지 근본적인 한계를 해결하고자 합니다. 첫째, 토큰 중요도 예측에 추가적인 프록시(proxy)를 사용하는 복잡성과 성능 저하 문제.

#Review #Sparse Attention #KV Cache Sharing #Hybrid Attention #Long-Context LLMs #Memory Optimization #Token Selection #Transformer Architecture

2026년 2월 4일

[논문리뷰] FASA: Frequency-aware Sparse Attention

대규모 언어 모델(LLMs)이 긴 입력 시퀀스를 처리할 때 발생하는 KV 캐시의 막대한 메모리 사용량과 연산 병목 현상 을 해결하는 것이 목표입니다.

#Review #Sparse Attention #KV Cache Optimization #Rotary Positional Embedding (RoPE)#Frequency Chunks (FCs)#LLMs #Long-Context #Training-Free

2026년 2월 4일

[논문리뷰] Token Sparse Attention: Efficient Long-Context Inference with Interleaved Token Selection

대규모 언어 모델(LLMs)에서 O(L²) 의 복잡성을 가지는 어텐션 메커니즘이 긴 컨텍스트 추론의 병목이 되는 문제를 해결하고자 합니다.

#Review #Sparse Attention #Long-Context Inference #LLMs #Token Selection #Efficiency #Transformer #Dynamic Sparsity

2026년 2월 3일

[논문리뷰] Elastic Attention: Test-time Adaptive Sparsity Ratios for Efficient Transformers

표준 어텐션 메커니즘의 이차적인 복잡도로 인한 대규모 언어 모델(LLM)의 긴 컨텍스트 시나리오에서의 확장성 병목 현상을 해결하고자 합니다.

#Review #Transformer #Sparse Attention #Adaptive Sparsity #Efficient LLM #Attention Router #Long-Context #Hybrid Attention

2026년 1월 26일

[논문리뷰] SALAD: Achieve High-Sparsity Attention via Efficient Linear Attention Tuning for Video Diffusion Transformer

비디오 Diffusion Transformer의 긴 입력 시퀀스로 인해 발생하는 높은 계산 지연 시간 문제를 해결하고, 기존의 스파스 어텐션 방식이 가진 제한된 스파시티 또는 과도한 학습 오버헤드 의 한계를 극복하고자 합니다.

#Review #Video Diffusion Models #Sparse Attention #Linear Attention #Computational Efficiency #Transformer Tuning #Video Generation #LoRA #Gating Mechanism

2026년 1월 25일

[논문리뷰] SnapGen++: Unleashing Diffusion Transformers for Efficient High-Fidelity Image Generation on Edge Devices

Diffusion Transformer (DiT) 모델은 최첨단 이미지 생성 품질을 제공하지만, 높은 계산 및 메모리 비용으로 인해 엣지 디바이스 에서의 배포가 비실용적인 문제를 해결하는 것이 목표입니다.

#Review #Diffusion Transformers #Edge AI #Efficient Image Generation #Sparse Attention #Elastic Training #Knowledge Distillation #Mobile AI #High-Fidelity

2026년 1월 13일

[논문리뷰] Sliding Window Attention Adaptation

본 논문은 Transformer 기반 LLM의 Self-Attention 메커니즘 이 입력 길이의 제곱에 비례하여 발생하는 높은 연산 비용 문제를 해결하고자 합니다.

#Review #Large Language Models #Sliding Window Attention #Model Adaptation #Long Context #Inference Optimization #Fine-tuning #Chain-of-Thought #Sparse Attention

2025년 12월 14일

[논문리뷰] DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

본 논문은 오픈 소스 대규모 언어 모델(LLM)과 상업용 LLM 간의 성능 격차를 줄이고자 DeepSeek-V3.2 를 소개합니다.

#Review #Large Language Models #Sparse Attention #Reinforcement Learning #Agentic AI #Tool Use #Open-source LLM #DeepSeek

2025년 12월 2일

[논문리뷰] Every Token Counts: Generalizing 16M Ultra-Long Context in Large Language Models

본 연구는 대규모 언어 모델(LLM)이 초장문 컨텍스트(ultra-long context) 를 효율적으로 처리하여 '기억하는 기계'를 구축하는 과제를 해결하고자 합니다.

#Review #Large Language Models #Long Context #Sparse Attention #Hierarchical Sparse Attention (HSA)#Length Generalization #Mixture of Experts (MoE)#Transformer

2025년 11월 30일

[논문리뷰] SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space

대규모 언어 모델(LLM)에서 quadratic 연산 복잡성 을 갖는 full attention 의 한계를 극복하기 위해, sparse attention 의 성능 저하 및 부족한 sparsity 문제를 해결하고자 합니다.

#Review #Sparse Attention #Full Attention #Large Language Models (LLMs)#Context Length #Attention Sparsity #Alignment Loss #Long-Context Extrapolation

2025년 11월 25일

[논문리뷰] HunyuanVideo 1.5 Technical Report

경량화되면서도 강력한 오픈소스 비디오 생성 모델 Hunyuan Video 1.5 를 개발하여, 8.3억 파라미터로 최첨단 시각 품질과 움직임 일관성을 달성하고, 소비자용 GPU에서 효율적인 추론을 가능하게 하는 것을 목표로 합니다.

#Review #Video Generation #Diffusion Transformer #Sparse Attention #Super-Resolution #Open-Source #Multimodal Understanding #Training Optimization #Efficient Inference

2025년 11월 24일

[논문리뷰] LiteAttention: A Temporal Sparse Attention for Diffusion Transformers

본 논문은 비디오 생성 Diffusion Transformers (DiT)의 Quadratic attention complexity 로 인한 과도한 지연 시간 문제를 해결하고자 합니다.

#Review #Diffusion Transformers #Sparse Attention #Temporal Coherence #Video Generation #Computational Efficiency #FlashAttention #CUDA Kernels

2025년 11월 16일

[논문리뷰] Orion-MSP: Multi-Scale Sparse Attention for Tabular In-Context Learning

본 논문은 기존의 테이블 인컨텍스트 학습(ICL) 모델들이 직면한 단일 스케일 피처 처리, 테이블 너비에 대한 Quadratic Scaling 의 조밀한 어텐션, 그리고 순차적 컴포넌트 처리의 한계를 해결하는 것을 목표로 합니다.

#Review #Tabular Data #In-Context Learning #Multi-Scale Attention #Sparse Attention #Foundation Models #Perceiver Architecture

2025년 11월 9일

[논문리뷰] SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention

본 논문은 Diffusion Transformer (DiT) 모델, 특히 비디오 생성에서 긴 시퀀스 길이로 인한 어텐션의 2차 시간 복잡도 문제를 해결하고자 합니다.

#Review #Diffusion Transformers #Sparse Attention #Linear Attention #Model Acceleration #Video Generation #Attention Mechanisms #Fine-tuning

2025년 9월 30일

[논문리뷰] Mixture of Contexts for Long Video Generation

본 논문은 Diffusion Transformer (DiT) 기반의 장시간 비디오 생성 모델에서 발생하는 quadratic cost의 self-attention 문제로 인한 연산 및 메모리 비효율성을 해결하고, 모델이 긴 시퀀스에 걸쳐 일관된 장기 기억 을 유지하면서 표류하거나 붕괴되지 않도록 하는 것을 목표로 합니다.

#Review #Long Video Generation #Diffusion Transformers (DiT)#Sparse Attention #Context Routing #Memory Management #Generative Models #Video Synthesis

2025년 8월 29일

[논문리뷰] Speed Always Wins: A Survey on Efficient Architectures for Large Language Models

본 설문조사 논문은 기존 Transformer 기반 대규모 언어 모델(LLMs)의 Quadratic 복잡성 과 높은 연산 및 메모리 요구사항 으로 인한 비효율성 문제를 해결하기 위한 혁신적인 아키텍처를 체계적으로 검토하는 것을 목표로 합니다.

#Review #Large Language Models #Efficient Architectures #Transformer Optimization #Linear Attention #State Space Models #Mixture-of-Experts #Sparse Attention #Diffusion LLMs

2025년 8월 19일

[논문리뷰] Less Is More: Training-Free Sparse Attention with Global Locality for Efficient Reasoning

본 논문은 대규모 추론 모델(LRMs)의 긴 토큰 생성 과정에서 발생하는 막대한 계산 오버헤드를 해결하는 것을 목표로 합니다.

#Review #Sparse Attention #LLMs #Reasoning Tasks #Efficiency #Training-Free #Global Locality #KV Cache Optimization

2025년 8월 12일

[논문리뷰] LongCat-Video Technical Report

본 논문은 효율적이고 고품질의 장시간 비디오 생성 에 중점을 둔 13.6B 파라미터 규모의 기반 비디오 생성 모델 LongCat-Video 를 제안합니다.

#Review #Video Generation #Diffusion Transformer #RLHF #Sparse Attention #Long Video Generation #Coarse-to-Fine Generation #Multi-task Learning #World Models

2025년 10월 28일

[논문리뷰] NOSA: Native and Offloadable Sparse Attention

본 논문은 대규모 언어 모델(LLM)의 긴 컨텍스트 디코딩 시 발생하는 메모리 병목 현상, 특히 KV 캐시 크기 가 배치 크기 및 디코딩 처리량을 제한하는 문제를 해결하는 것을 목표로 합니다.

#Review #Sparse Attention #KV Cache Offloading #LLMs #Decoding Throughput #Locality Constraint #Memory Optimization #Trainable Sparse Attention

2025년 10월 16일

[논문리뷰] FlashVSR: Towards Real-Time Diffusion-Based Streaming Video Super-Resolution

본 논문은 확산 모델 기반 비디오 초해상도(VSR) 기술을 현실 세계에 적용 가능하도록 효율성, 확장성 및 실시간 성능을 확보하는 것을 목표로 합니다. 특히 높은 지연 시간, 과도한 연산량, 초고해상도 비디오에 대한 일반화 능력 부족 등의 기존 확산 기반 VSR 모델의 한계를 극복하고자 합니다.

#Review #Video Super-Resolution (VSR)#Diffusion Models #Real-time VSR #Streaming VSR #Sparse Attention #Distillation #Conditional Decoder #High-resolution

2025년 10월 15일

[논문리뷰] MoGA: Mixture-of-Groups Attention for End-to-End Long Video Generation

본 논문은 Diffusion Transformers (DiTs) 기반의 긴 비디오 생성에서 발생하는 전체 어텐션의 2차 시간 복잡도 문제 를 해결하고자 합니다.

#Review #Long Video Generation #Sparse Attention #Diffusion Transformers #Mixture-of-Groups Attention #Token Routing #Computational Efficiency #Context Length

2025년 10월 22일

[SGLang] DeepSeek V3.2 지원 추가

SGLang에 DeepSeek V3.2 모델과 Native Sparse Attention(NSA) 백엔드를 추가한다

#SGLang #DeepSeek #Sparse Attention #Model Support

2025년 10월 6일