#Hybrid Attention

9개의 포스트

[SGLang] Hybrid Attention: Dense-Sparse 동적 전환 전략

SGLang의 Hybrid Attention 백엔드를 분석한다. Dense와 Sparse 어텐션을 동적으로 전환하는 전략, 전환 조건과 임계값 설계를 코드와 함께 살펴본다.

#sglang #Hybrid Attention #Dense-Sparse #Dynamic Switching

2026년 4월 11일

[논문리뷰] Flux Attention: Context-Aware Hybrid Attention for Efficient LLMs Inference

본 논문은 기존 long-context LLM 추론에서 발생하는 quadratic computational complexity와 기존 하이브리드 어텐션 기법들의 한계를 해결하고자 합니다.

#Review #Large Language Models #Long-context Inference #Hybrid Attention #Dynamic Routing #Layer-level Sparsity #Context-aware

2026년 4월 9일

[논문리뷰] MMaDA-VLA: Large Diffusion Vision-Language-Action Model with Unified Multi-Modal Instruction and Generation

기존 VLA 모델들은 hierarchical 구조나 autoregressive 패러다임에 의존함으로써 발생하는 아키텍처 오버헤드, 장기적 시간 일관성 결여, 그리고 환경 역학(environment dynamics)을 파악하는 명시적 메커니즘 부족이라는 한계에 직면해 있습니다.

#Review #Vision-Language-Action (VLA)#Discrete Diffusion #Multi-modal Generation #Robotic Manipulation #Action Chunking #World Model #Hybrid Attention

2026년 4월 1일

[논문리뷰] HyTRec: A Hybrid Temporal-Aware Attention Architecture for Long Behavior Sequential Recommendation

본 논문은 생성형 추천 시스템에서 초장기 사용자 행동 시퀀스(ultra-long user behavior sequences) 모델링 시 발생하는 효율성과 정확도 간의 근본적인 트레이드오프를 해결하는 것을 목표로 합니다.

#Review #Sequential Recommendation #Hybrid Attention #Temporal-Aware #Long Sequences #Generative Recommendation #Linear Attention #Softmax Attention

2026년 2월 25일

[논문리뷰] Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters

본 논문은 11B 활성화 파라미터 를 가진 196B Mixture-of-Experts (MoE) 모델 인 Step 3.5 Flash 를 소개하며, 첨단 에이전트 지능과 컴퓨팅 효율성 간의 격차를 해소하는 것을 목표로 합니다.

#Review #Mixture-of-Experts (MoE)#Sparse Models #Inference Efficiency #Hybrid Attention #Multi-Token Prediction (MTP)#Reinforcement Learning (RL)#Agentic AI #Long-Context Understanding

2026년 2월 11일

[논문리뷰] HySparse: A Hybrid Sparse Attention Architecture with Oracle Token Selection and KV Cache Sharing

본 논문은 기존 희소 어텐션(sparse attention) 방법론의 두 가지 근본적인 한계를 해결하고자 합니다. 첫째, 토큰 중요도 예측에 추가적인 프록시(proxy)를 사용하는 복잡성과 성능 저하 문제.

#Review #Sparse Attention #KV Cache Sharing #Hybrid Attention #Long-Context LLMs #Memory Optimization #Token Selection #Transformer Architecture

2026년 2월 4일

[논문리뷰] Elastic Attention: Test-time Adaptive Sparsity Ratios for Efficient Transformers

표준 어텐션 메커니즘의 이차적인 복잡도로 인한 대규모 언어 모델(LLM)의 긴 컨텍스트 시나리오에서의 확장성 병목 현상을 해결하고자 합니다.

#Review #Transformer #Sparse Attention #Adaptive Sparsity #Efficient LLM #Attention Router #Long-Context #Hybrid Attention

2026년 1월 26일

[논문리뷰] Native Hybrid Attention for Efficient Sequence Modeling

본 논문은 Transformer의 O(n²) 연산 복잡도와 선형 어텐션 모델의 낮은 정확도 문제를 해결하기 위해, 효율적이면서도 긴 컨텍스트에서 높은 정확도를 유지할 수 있는 새로운 하이브리드 어텐션 아키텍처를 개발하는 것을 목표로 합니다.

#Review #Sequence Modeling #Hybrid Attention #Transformer Architecture #Linear Attention #Sliding Window Attention #Long Context #Large Language Models (LLMs)#Efficiency

2025년 10월 9일

[논문리뷰] Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning

본 논문은 기존의 Softmax Attention 이 긴 시퀀스 길이에서 겪는 계산 및 I/O 오버헤드 문제 를 해결하고, 순수 Linear Attention 모델의 성능 한계를 극복하기 위해 효율적인 하이브리드 아키텍처를 제안합니다.

#Review #Long-Context LLM #Hybrid Attention #Linear Attention #Mixture-of-Experts #FP8 Training #GPU Optimization #Training-Inference Alignment #Reinforcement Learning

2025년 10월 23일