#Latency Reduction

10 posts

[Paper Review] Evaluating Temporal Semantic Caching and Workflow Optimization in Agentic Plan-Execute Pipelines

This paper addresses the high latency of agentic pipelines in industrial asset operations and the limitations of existing caching techniques.

#Review #Agentic Pipeline #Model Context Protocol #Temporal Semantic Caching #Workflow Optimization #Industrial Asset Operations #Latency Reduction

May 20, 2026

[vllm] vLLM Mamba2 SSD Kernel Warmup: The Secret Behind a 91% Reduction in First-Request Latency

An analysis of the Triton kernel warmup optimization that cut first-request latency for vLLM Mamba2 models by 91%.

#vLLM #Mamba2 #Triton #Kernel Optimization #Latency Reduction #Deep Learning Inference

May 12, 2026

[Paper Review] SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning

Agentic MLLMs have recently shown strong reasoning ability through repeated visual tool calls, but the cascaded loop of perception, reasoning, and tool calling incurs severe sequential overhead.

#Review #Agentic MLLMs #Speculative Perception #Speculative Planning #Cognitive Gating #Answer Separability #Throughput Acceleration #Latency Reduction #Heterogeneous Parallelism

March 24, 2026

[Paper Review] Accelerating Diffusion via Hybrid Data-Pipeline Parallelism Based on Conditional Guidance Scheduling

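The paper aims to reduce the inference latency caused by the high computational cost of diffusion models, and to overcome the generation artifacts and the limits on proportional speedup seen in existing distributed parallelization schemes. In particular, it seeks inference speedups beyond linear scaling for conditional diffusion models without degrading image quality.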

#Review #Diffusion Models #Distributed Parallelism #Conditional Guidance #Adaptive Scheduling #Generative AI #Latency Reduction #Multi-GPU

February 26, 2026

[Paper Review] SeaCache: Spectral-Evolution-Aware Cache for Accelerating Diffusion Models

To speed up slow diffusion model inference, the paper addresses the problem that existing caching methods rely only on raw feature differences, which entangle content and noise and therefore overlook spectral evolution.

#Review #Diffusion Models #Model Acceleration #Feature Caching #Spectral Analysis #Generative AI #Image Generation #Video Generation #Latency Reduction

February 25, 2026

[Paper Review] DLLM-Searcher: Adapting Diffusion Large Language Model for Search Agents

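This paper aims to resolve the high latency caused by the serial execution structure of search agents built on autoregressive models (ARMs), while also improving the weak reasoning and tool-calling abilities of Diffusion Large Language Models (dLLMs), so that dLLMs can serve as an efficient search-agent backbone.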

#Review #Diffusion Large Language Models #Search Agents #Latency Reduction #P-ReAct #Agentic Post-training #Supervised Fine-Tuning #Preference Optimization #Parallel Decoding

February 10, 2026

[Paper Review] RelayGen: Intra-Generation Model Switching for Efficient Reasoning

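The paper aims to reduce inference latency without significant accuracy loss by addressing the uneven generation difficulty that arises over the long reasoning traces of large reasoning models (LRMs).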

#Review #LLM Inference Optimization #Model Switching #Efficient Reasoning #Speculative Decoding #Runtime Adaptation #Discourse-Level Cues #Latency Reduction

February 9, 2026

[Paper Review] Fast and Accurate Causal Parallel Decoding using Jacobi Forcing

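This paper aims to resolve the high latency of sequential autoregressive (AR) decoding in large language models (LLMs) and to enable efficient parallel decoding while preserving the generation quality and causal inference properties of AR models.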

#Review #Parallel Decoding #Causal LLM #Jacobi Decoding #Consistency Distillation #Transformer Inference #Latency Reduction #Rejection Recycling #Multi-block Decoding

December 17, 2025

[Paper Review] Attention Is All You Need for KV Cache in Diffusion LLMs

This paper aims to resolve the high latency caused by excessive key-value (KV) cache recomputation during inference in diffusion large language models (DLMs).

#Review #Diffusion LLMs #KV Cache #Adaptive Caching #Inference Optimization #Attention Mechanism #Latency Reduction #Generative AI

October 17, 2025

[Paper Review] Cache-to-Cache: Direct Semantic Communication Between Large Language Models

This work aims to overcome the limitations of text-based (Text-to-Text, T2T) communication in existing multi-LLM systems, such as information loss, ambiguity, and token-by-token generation latency.

#Review #Large Language Models (LLMs) #Inter-model Communication #KV-Cache #Semantic Transfer #Multi-LLM Systems #Cache Fusion #Latency Reduction #Knowledge Sharing

October 9, 2025