#Memory Optimization

22 posts

[Paper Review] KVpop -- Key-Value Cache Compression with Predictive Online Pruning

In real-time LLM inference, the KV cache grows linearly with context length, which makes it a bottleneck for long-context inference.

#Review #KV Cache Compression #Large Language Models #Sparse Attention #Predictive Pruning #Inference Efficiency #Memory Optimization

July 6, 2026

[axolotl] Axolotl: Hidden State Offloading Optimization for Long-Context Training (with Non-Reentrant Checkpointing Support)

Axolotl's new Hidden State Offloading improves both memory efficiency and performance.

#Axolotl #PyTorch #Gradient Checkpointing #Activation Offloading #LLM Training #Memory Optimization

July 2, 2026

[Paper Review] Meta-Cognitive Memory Policy Optimization for Long-Horizon LLM Agents

This paper addresses the performance degradation that memory-based LLM agents suffer on long-horizon tasks.

#Review #LLM Agents #Long-Horizon Reasoning #Belief Entropy #Memory Optimization #Reinforcement Learning #Metacognition

June 4, 2026

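[SGLang] Sliding Window Attention Cache: SWA Optimization Design

An analysis of SGLang's Sliding Window Attention cache. It walks through the strategy of keeping only the KV cache that falls within a fixed window size, support for SWA models such as Mistral, and the resulting memory savings, alongside the code.

#sglang #Sliding Window Attention #SWA Cache #Memory Optimization

April 10, 2026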

[ACE-Step-1.5] MLX VAE Decoding Memory Optimization: 56% Lower Peak Memory on Apple Silicon

Reducing the MLX VAE decoding chunk size cut peak memory on Apple Silicon by 56%.

#MLX #Apple Silicon #VAE #Memory Optimization #Performance

April 7, 2026

[Loki] Removing Unnecessary Shuffle Sharding for Kafka Partitions

Skips unnecessary shuffle shard creation when ShardSize is 0, reducing memory usage.

#Grafana Loki #Go #Performance #Kafka #Memory Optimization

April 1, 2026

[Loki] Preventing OOM by Aborting Early When the Cache Max Size Is Exceeded

Incremental encoding and size checks eliminate unnecessary buffering of large responses.

#Grafana Loki #Cache #Memory Optimization #Performance

April 1, 2026

[Paper Review] Scaling DoRA: High-Rank Adaptation via Factored Norms and Fused Kernels

Existing implementations of Weight-Decomposed Low-Rank Adaptation (DoRA) suffer from severe memory and performance bottlenecks, especially in high-rank settings.

#Review #DoRA #Low-Rank Adaptation #Parameter-Efficient Fine-Tuning #Fused Kernels #Memory Optimization #Performance Scaling #Triton

March 23, 2026

[Loki] 50% Memory Savings by Skipping Shuffle Shard When the Shard Factor Is 1

Skips the unnecessary ShuffleShard call when a single partition is assigned, substantially reducing CPU and memory usage.

#Grafana Loki #Go #Performance #Memory Optimization #Kafka

March 18, 2026

[axolotl] FSDP CPU RAM Efficient Loading Patch: Preventing Unnecessary Weight Initialization on Non-Rank-0 Processes

An analysis of a monkeypatch that fixes non-rank-0 processes reinitializing weights when cpu_ram_efficient_loading is used in FSDP distributed training.

#Axolotl #FSDP #Distributed Training #Memory Optimization #Monkeypatch

March 16, 2026

[Paper Review] Flash-KMeans: Fast and Memory-Efficient Exact K-Means

This paper addresses the inefficiency of existing GPU-based K-means implementations in online systems, which stems from memory I/O bottlenecks and atomic-write contention.

#Review #K-Means Clustering #GPU Acceleration #Memory Optimization #IO-Aware Computing #Online Primitive #Hardware-Aware Algorithms #Contention-Free Operations #AI Workloads

March 11, 2026

[Axolotl] Preventing OOM with Synchronous Weight Loading

A fix that disables asynchronous tensor transfers when loading MoE models to prevent GPU OOM.

#Axolotl #MoE #OOM #Memory Optimization #Quantization

March 7, 2026

[Paper Review] Helios: Real Real-Time Long Video Generation Model

This paper aims to develop Helios, the first 14B video generation model that generates minute-scale video in real time at 19.5 FPS on a single NVIDIA H100 GPU while maintaining strong quality without conventional anti-drifting heuristics or acceleration techniques.

#Review #Video Generation #Real-Time #Long Video #Diffusion Transformers #Anti-Drifting #Memory Optimization #Distillation #Autoregressive Models

March 4, 2026

[Paper Review] veScale-FSDP: Flexible and High-Performance FSDP at Scale

This paper addresses the limitations that existing FSDP (Fully Sharded Data Parallel) systems face in structure-aware training, which uses block-wise quantized training or non-element-wise optimizers such as Shampoo and Muon.

#Review #FSDP #Distributed Training #LLM #GPU Scaling #Memory Optimization #Performance Optimization #Structure-Aware Training #RaggedShard

February 26, 2026

[Paper Review] Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization

This paper tackles KV-cache memory, a major bottleneck for auto-regressive video generation models.

#Review #Auto-Regressive Video Generation #KV-Cache Quantization #Memory Optimization #Long Video Generation #Video Diffusion Models #Semantic-Aware Smoothing #Progressive Residual Quantization

February 4, 2026

[Paper Review] HySparse: A Hybrid Sparse Attention Architecture with Oracle Token Selection and KV Cache Sharing

This paper addresses two fundamental limitations of existing sparse attention methods. The first is the added complexity and performance degradation that come from using an extra proxy to predict token importance.

#Review #Sparse Attention #KV Cache Sharing #Hybrid Attention #Long-Context LLMs #Memory Optimization #Token Selection #Transformer Architecture

February 4, 2026

[Triton] Shared Memory Partitioning Support via the New AMD PartitionedSharedEncodingAttr

Adds a new encoding attribute that reduces bank conflicts by distributing a tensor across multiple physical shared memory partitions.

#Triton #AMD #MLIR #Shared Memory #Memory Optimization

February 4, 2026

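[Loki] 93% Higher Throughput from Optimizing the Data Object Plain Value Decoder

An analysis of an optimization that raised decoding throughput by 93% by rewriting the Plain Value decoder in Grafana Loki's dataobj with an Arrow-style memory representation, []byte-based decoding, and minimal pointer indirection.

#Grafana Loki #Go #Performance #Decoder #Memory Optimization #Benchmark

January 15, 2026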

[Paper Review] Reg-DPO: SFT-Regularized Direct Preference Optimization with GT-Pair for Improving Video Generation

This paper aims to preserve the efficiency of Direct Preference Optimization (DPO) in video generation while resolving challenges specific to video tasks that existing methods face: expensive data construction, unstable training, and excessive memory consumption.

#Review #Video Generation #Direct Preference Optimization #SFT Regularization #GT-Pair #Memory Optimization #Diffusion Models #I2V #T2V

November 9, 2025

[Paper Review] MoBE: Mixture-of-Basis-Experts for Compressing MoE-based LLMs

The paper addresses the deployment bottleneck caused by the enormous memory requirements of large MoE-based LLMs (e.g., DeepSeek-V3-0324, Kimi-K2-Instruct).

#Review #Mixture-of-Experts (MoE) #LLM Compression #Matrix Decomposition #Parameter Efficiency #Deep Learning #Memory Optimization

August 12, 2025

[Paper Review] LeanK: Learnable K Cache Channel Pruning for Efficient Decoding

The goal is to address the growing GPU memory usage and slow inference caused by the increasing size of the Key-Value (KV) cache in large language models (LLMs).

#Review #LLM #KV Cache Optimization #Model Pruning #Efficient Decoding #Memory Optimization #Static Sparsity #Transformer

August 7, 2025

[Paper Review] NOSA: Native and Offloadable Sparse Attention

This paper aims to resolve the memory bottleneck in long-context decoding for large language models (LLMs), in particular the way KV cache size limits batch size and decoding throughput.

#Review #Sparse Attention #KV Cache Offloading #LLMs #Decoding Throughput #Locality Constraint #Memory Optimization #Trainable Sparse Attention

October 16, 2025