#LLM Inference

21 posts

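[SGLang] Full Project Architecture Analysis - Overview and Table of Contents

Overview post for a series that analyzes SGLang's full architecture across 17 layers and catalogs 130 core modules and related papers.

#sglang #Architecture #LLM Inference #RadixAttention

April 9, 2026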

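[sglang] DeepSeek V3/R1 Inference Optimization: An Analysis of DeepEP Shared Expert Fusion

Examines an optimization that merges the shared expert into the MoE path under DeepEP, removing the overhead of a separate computation and improving inference performance.

#SGLang #DeepSeek #MoE #DeepEP #LLM Inference

April 9, 2026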

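[vLLM] Full Project Architecture Analysis - Overview and Table of Contents

Overview post for a series that analyzes vLLM's full architecture across 11 layers and catalogs 80+ pieces of core logic and 40+ related papers.

#vllm #Architecture #LLM Inference

April 7, 2026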

[Paper Review] Understand and Accelerate Memory Processing Pipeline for Disaggregated LLM Inference

A detailed review of the paper 'Understand and Accelerate Memory Processing Pipeline for Disaggregated LLM Inference', posted to arXiv by Jason Cong.

#Review #LLM Inference #Memory Processing Pipeline #Heterogeneous Systems #GPU-FPGA #Sparse Attention #Retrieval-Augmented Generation

April 1, 2026

[Paper Review] HISA: Efficient Hierarchical Indexing for Fine-Grained Sparse Attention

A detailed review of the paper 'HISA: Efficient Hierarchical Indexing for Fine-Grained Sparse Attention', posted to arXiv by Yuxuan Wang.

#Review #Sparse Attention #Hierarchical Indexing #Long Context #LLM Inference #Computational Efficiency #DeepSeek

March 30, 2026

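[sglang] Dumper Debug Utility Refactoring: Improved Config Structure and a Non-intrusive Mode

An analysis of a large-scale refactoring that syncs SGLang's dumper.py from upstream main and adds an improved config class structure, CLI key=value parsing support, and a non-intrusive mode, among other changes.

#SGLang #Debug #Refactoring #Python #LLM Inference

March 30, 2026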

[sglang] Codebase Cleanup: Removing the Unused BatchMultimodalOutput/DecodeReq

An analysis of a cleanup that removes the unused BatchMultimodalOutput and BatchMultimodalDecodeReq dataclasses from SGLang, eliminating 81 lines of dead code.

#SGLang #Cleanup #Dead Code #Python #LLM Inference

March 29, 2026

[Paper Review] Dynamic Model Routing and Cascading for Efficient LLM Inference: A Survey

A detailed review of the paper 'Dynamic Model Routing and Cascading for Efficient LLM Inference: A Survey', posted to arXiv by John D. Kelleher.

#Review #LLM Inference #Model Routing #Model Cascading #Efficiency Optimization #Dynamic Model Selection #Multi-LLM Systems #Cost-Performance Trade-off #Adaptive AI Systems

March 8, 2026

[sglang] MoE Model Inference Optimization: 28% TTFT Improvement via Triton Kernel Fusion

Fusing the `fused_moe_triton` and `moe_sum_all_reduce` kernels improves TTFT by 28% during MoE model inference.

#MoE #Triton #Kernel Fusion #GPU Optimization #LLM Inference #SGLang

March 4, 2026

[Paper Review] LK Losses: Direct Acceptance Rate Optimization for Speculative Decoding

A detailed review of the paper 'LK Losses: Direct Acceptance Rate Optimization for Speculative Decoding', posted to arXiv.

#Review #Speculative Decoding #LLM Inference #Acceptance Rate #KL Divergence #Total Variation Distance #Loss Functions #Draft Model Training #Adaptive Learning

March 1, 2026

[Paper Review] DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference

A detailed review of the paper 'DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference', posted to arXiv.

#Review #LLM Inference #KV-Cache #Storage Bottleneck #Agentic Workloads #Dual-Path Loading #PD Disaggregation #RDMA #Adaptive Scheduling

February 25, 2026

[ACE-Step-1.5] Transforming 5Hz LM Inference Speed on Apple Silicon MacBooks with a Native MLX Backend

Introduces a native MLX backend that uses the Metal GPU on Apple Silicon MacBooks to dramatically improve 5Hz LM inference speed.

#MLX #Apple Silicon #Metal GPU #LLM Inference #Performance Optimization #ACE-Step

February 8, 2026

[Paper Review] TimeBill: Time-Budgeted Inference for Large Language Models

A detailed review of the paper 'TimeBill: Time-Budgeted Inference for Large Language Models', posted to arXiv by Yehan Ma.

#Review #LLM Inference #Time Budgeting #KV Cache Eviction #Response Length Prediction #Execution Time Estimation #Real-time AI #Performance Optimization

December 28, 2025

[Paper Review] Intelligence per Watt: Measuring Intelligence Efficiency of Local AI

A detailed review of the paper 'Intelligence per Watt: Measuring Intelligence Efficiency of Local AI', posted to arXiv.

#Review #Local AI #LLM Inference #Intelligence per Watt #Edge Computing #Hybrid Cloud #AI Efficiency #Hardware Benchmarking #Query Routing

November 11, 2025

[Paper Review] AdaSPEC: Selective Knowledge Distillation for Efficient Speculative Decoders

A detailed review of the paper 'AdaSPEC: Selective Knowledge Distillation for Efficient Speculative Decoders', posted to arXiv.

#Review #Speculative Decoding #Knowledge Distillation #LLM Inference #Model Acceleration #Token Filtering #Draft Model #Acceptance Rate

October 24, 2025

[Paper Review] Direct Multi-Token Decoding

A detailed review of the paper 'Direct Multi-Token Decoding', posted to arXiv by Xifeng Yan.

#Review #LLM Inference #Multi-token Decoding #Transformer Architecture #Layer Specialization #Cyclical Refilling #Inference Speedup #Model Scaling

October 16, 2025

[sglang] Adding Piecewise CUDA Graph and a Torch Compile Backend to SGLang

Integrates piecewise CUDA graph capture and a torch.compile backend into the SGLang inference engine to improve LLM serving performance.

#CUDA Graph #torch.compile #LLM Inference #SGLang

October 12, 2025

[Paper Review] PersonaX: Multimodal Datasets with LLM-Inferred Behavior Traits

A detailed review of the paper 'PersonaX: Multimodal Datasets with LLM-Inferred Behavior Traits', posted to arXiv by Zhenhao Chen.

#Review #Multimodal Dataset #LLM Inference #Behavioral Traits #Causal Representation Learning #Big Five #Multimodal AI #Causal Discovery #Human-Computer Interaction

September 16, 2025

[Paper Review] Taming the Chaos: Coordinated Autoscaling for Heterogeneous and Disaggregated LLM Inference

A detailed review of the paper 'Taming the Chaos: Coordinated Autoscaling for Heterogeneous and Disaggregated LLM Inference', posted to arXiv by Chunlei Han.

#Review #LLM Inference #Autoscaling #Disaggregated Architecture #Heterogeneous Hardware #Resource Management #Topology-aware Scheduling #GPU Utilization

August 28, 2025

[Paper Review] TPLA: Tensor Parallel Latent Attention for Efficient Disaggregated Prefill & Decode Inference

A detailed review of the paper 'TPLA: Tensor Parallel Latent Attention for Efficient Disaggregated Prefill & Decode Inference', posted to arXiv by Di Yin.

#Review #LLM Inference #Tensor Parallelism #KV Cache Optimization #Latent Attention #Memory Efficiency #Decoding Speedup #Prefill/Decode Separation #Reparameterization

August 25, 2025

[Paper Review] Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference

A detailed review of the paper 'Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference', posted to arXiv by Fan Xia.

#Review #Diffusion Models #Language Models #Code Generation #Non-Autoregressive Inference #High-Speed Inference #Discrete Diffusion #LLM Inference

August 6, 2025