#Speculative Decoding

26 posts

[Paper Review] Speculative Pipeline Decoding: Higher-Accuracy and Zero-Bubble Speculation via Pipeline Parallelism

This paper aims to overcome the structural limitations of multi-token prediction, the core mechanism of existing Speculative Decoding.

#Review #Speculative Decoding #Pipeline Parallelism #LLM Inference #Feature Aggregation #Latency Hiding #Throughput

June 1, 2026

[Paper Review] Domino: Decoupling Causal Modeling from Autoregressive Drafting in Speculative Decoding

This paper aims to resolve the trade-off between draft quality and compute cost in speculative decoding.

#Review #Speculative Decoding #LLM Inference #Autoregressive Drafting #Parallel Drafting #Causal Modeling #Low-Rank Correction

June 1, 2026

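[sglang] The Pitfall of Performance Optimization: Analyzing SGLang's Emergency Rollback to Prevent a DeepSeek-V3.2 Accuracy Collapse

Analyzes why an optimization that skipped Softmax in the EAGLE draft model cut the accuracy of the DeepSeek-V3.2 MTP model by as much as 96%, and how the problem was resolved.

#SGLang #Speculative Decoding #DeepSeek-V3 #Performance Optimization #LLM Inference

May 26, 2026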

[sglang] SGLang EAGLE Decoding Optimization: Improving Performance by Removing an Unnecessary Softmax

SGLang EAGLE decoding now skips the unnecessary Softmax computation when topk=1, improving performance.

#SGLang #EAGLE #Speculative Decoding #Performance Optimization #Softmax #Top-k Sampling

May 25, 2026

[Paper Review] Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding

This paper aims to resolve the Pareto tradeoff between speed and accuracy (MAT) that existing tree-based Speculative Decoding faces.

#Review #Speculative Decoding #Tree Construction #Dynamic Pruning #Retrieval-based #GPU-resident #Budget Compensation #Long-context

May 19, 2026

[Paper Review] SpecBlock: Block-Iterative Speculative Decoding with Dynamic Tree Drafting

This paper is proposed to overcome the opposing limitations of existing Speculative Decoding drafters.

#Review #LLM Inference #Speculative Decoding #Tree-based Verification #Block-Iterative Drafting #Rank-guided Expansion #Serving-time Adaptation

May 10, 2026

[Paper Review] Fast Byte Latent Transformer

This paper aims to solve the chronic inference-speed problem of byte-level language models. Unlike subword models, byte-level models operate on much longer input sequences, so naive autoregressive decoding is very slow.

#Review #Byte-level Language Model #BLT #Diffusion #Inference Acceleration #Speculative Decoding #Latent Tokenization

May 10, 2026

[vllm] vLLM Applies Quantized Speculative Decoding to Gemma 4 Models: The Secret Behind the Speedup

Analyzes how vLLM introduced Speculative Decoding for Gemma 4 models to substantially improve inference speed.

#vLLM #Speculative Decoding #Gemma 4 #LLM Optimization #Quantization

May 6, 2026

[Paper Review] Speculative Decoding for Autoregressive Video Generation

This paper proposes SDVG, a framework that uses an image-quality router to either accept each drafted block or regenerate it with the target model. The drafter generates candidate blocks in four denoising steps, and the candidates are scored with ImageReward using worst-frame aggregation.

#Review #Speculative Decoding #Autoregressive Video Generation #Video Diffusion #Training-free #ImageReward

April 21, 2026

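[SGLang] Speculative Decoding Overview: Principles and Implementation Architecture

Analyzes the overall architecture of Speculative Decoding in SGLang. Covers the principle of the two-stage draft-verify pipeline, the 2-3x speedup over conventional autoregressive decoding, and SGLang's implementation, walked through with code.

#sglang #Speculative Decoding #Draft-Verify #Acceleration

April 12, 2026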

[sglang] SGLang Ngram Speculative Decoding Optimization: Faster Incremental MatchState Updates

Analyzes a case in which unnecessary heap allocations during MatchState updates in Ngram-based Speculative Decoding were removed, improving performance by 1.4x.

#SGLang #Speculative Decoding #C++ #Performance Optimization #Trie

April 6, 2026

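[sglang] SGLang Ngram Speculative Decoding: Performance Optimization by Integrating an External-Corpus Suffix Automaton

Introduces a Suffix Automaton built from an external corpus into SGLang's Ngram speculative decoding to improve performance.

#SGLang #Ngram #Speculative Decoding #Suffix Automaton #Performance Optimization #LLM #Python #C++

April 6, 2026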

[sglang] Migrating the Ngram Corpus from Torch cpp_extension to TVM FFI

Moves the ngram corpus module for speculative decoding from torch cpp_extension to a TVM FFI jit_kernel-based implementation, reducing build dependencies and unifying the JIT compilation path.

#SGLang #TVM FFI #JIT Kernel #Speculative Decoding

April 2, 2026

[Paper Review] ConFu: Contemplate the Future for Better Speculative Sampling

This paper aims to address the error accumulation that arises because existing speculative decoding draft models predict from the current prefix alone.

#Review #Speculative Decoding #LLM Inference Acceleration #Draft Model #Future Prediction #Contemplate Tokens #Mixture-of-Experts #Token Acceptance Rate #Speedup Ratio

March 10, 2026

[Paper Review] LK Losses: Direct Acceptance Rate Optimization for Speculative Decoding

This work proposes LK losses, a new training objective that directly optimizes the draft model's token acceptance rate in speculative decoding for inference acceleration.

#Review #Speculative Decoding #LLM Inference #Acceptance Rate #KL Divergence #Total Variation Distance #Loss Functions #Draft Model Training #Adaptive Learning

March 1, 2026

[Paper Review] RelayGen: Intra-Generation Model Switching for Efficient Reasoning

The goal is to reduce inference latency without significant accuracy loss by addressing the non-uniform generation difficulty that arises during the long reasoning processes of large reasoning models (LRMs).

#Review #LLM Inference Optimization #Model Switching #Efficient Reasoning #Speculative Decoding #Runtime Adaptation #Discourse-Level Cues #Latency Reduction

February 9, 2026

[Paper Review] Baichuan-M3: Modeling Clinical Inquiry for Reliable Medical Decision-Making

This paper addresses the passive question-answering behavior of existing medical LLMs and their hallucinations in open-ended clinical consultations. It develops Baichuan-M3, a clinical-grade decision-support system with proactive information acquisition, long-horizon reasoning, and adaptive hallucination suppression, with the aim of reliable medical decision-making.

#Review #Medical LLM #Clinical Decision Support #Reinforcement Learning #Hallucination Suppression #Multi-task Learning #Speculative Decoding #Quantization #Clinical Inquiry

February 8, 2026

[Paper Review] Scaling Embeddings Outperforms Scaling Experts in Language Models

This paper explores embedding scaling as a new dimension of sparsity scaling, with the goal of overcoming the efficiency limits of Mixture-of-Experts (MoE) architectures in large language models (LLMs).

#Review #Embedding Scaling #N-gram Embedding #Mixture-of-Experts (MoE) #Large Language Models (LLMs) #Parameter Efficiency #Inference Optimization #Speculative Decoding

January 29, 2026

[Paper Review] DEER: Draft with Diffusion, Verify with Autoregressive Models

This paper addresses the efficiency problems that the inherent latency of autoregressive (AR) decoding causes in LLM-based agents and reasoning systems. In particular, it aims to overcome the limited speedup of existing AR-based drafters, which results from step-by-step uncertainty accumulation and sequential decoding.

#Review #Speculative Decoding #Diffusion LLM #Autoregressive Model #Inference Acceleration #Model Alignment #Code Generation #Block Regeneration

December 17, 2025

[Paper Review] T-pro 2.0: An Efficient Russian Hybrid-Reasoning Model and Playground

This paper addresses the limitations of open-source Russian LLMs, in particular the lack of reasoning capability and of an integrated ecosystem for efficient inference.

#Review #Russian LLM #Hybrid Reasoning #Speculative Decoding #Cyrillic Tokenizer #Instruction Tuning #Reward Modeling #T-Math Benchmark

December 11, 2025

[Paper Review] TiDAR: Think in Diffusion, Talk in Autoregression

This work aims to combine the fast parallel generation of diffusion models with the high quality of autoregressive (AR) models in large language model (LLM) generation.

#Review #Hybrid LLM Architecture #Diffusion-Autoregressive #Parallel Token Generation #Speculative Decoding #Structured Attention Masks #LLM Inference Acceleration #KV Cache

November 12, 2025

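[Paper Review] Hyper-Bagel: A Unified Acceleration Framework for Multimodal Understanding and Generation

The main goal is to address the substantial computational overhead caused by the iterative processes of diffusion denoising and autoregressive decoding in unified multimodal models. The paper proposes Hyper-Bagel, a unified acceleration framework that speeds up multimodal understanding and generation tasks together while preserving the high-quality output of the original model.

#Review #Multimodal AI #Acceleration Framework #Speculative Decoding #Diffusion Distillation #Unified Models #Text-to-Image Generation #Image Editing #Computational Efficiency

September 24, 2025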

[Paper Review] AMBEDKAR-A Multi-level Bias Elimination through a Decoding Approach with Knowledge Augmentation for Robust Constitutional Alignment of Language Models

This paper addresses the problem that large language models (LLMs) reflect social biases from their training data, particularly caste- and religion-related biases in Indian society, and therefore generate harmful or biased outputs.

#Review #Bias Mitigation #Large Language Models #Speculative Decoding #Constitutional AI #Fairness #Inference-Time Control #Indian Sociocultural Context

September 3, 2025

[Paper Review] Speculative Jacobi-Denoising Decoding for Accelerating Autoregressive Text-to-image Generation

This paper targets the slow inference of autoregressive text-to-image models, whose sequential token-by-token decoding requires thousands of model forward passes. It accelerates inference for these models through parallel token decoding.

#Review #Autoregressive Models #Text-to-Image Generation #Inference Acceleration #Jacobi Decoding #Denoising Diffusion Models #Speculative Decoding #Multi-token Prediction #Fine-tuning

October 13, 2025

[Paper Review] AdaSPEC: Selective Knowledge Distillation for Efficient Speculative Decoders

This paper aims to resolve the mismatch between the draft model and the target model in Speculative Decoding (SD), which is used to speed up large language model (LLM) inference.

#Review #Speculative Decoding #Knowledge Distillation #LLM Inference #Model Acceleration #Token Filtering #Draft Model #Acceptance Rate

October 24, 2025

[Paper Review] When to Ensemble: Identifying Token-Level Points for Stable and Fast LLM Ensembling

This paper aims to resolve the instability and inefficiency that LLM (Large Language Model) ensembling suffers from in long-form generation.

#Review #LLM Ensembling #Token-level Ensembling #Speculative Decoding #Tokenization Mismatch #Probability Sharpening #Long-form Generation #KV Cache Management

October 21, 2025