[feast] Feast 온라인 서빙 성능 튜닝: Sub-2ms 달성을 위한 여정

2026년 6월 3일수정: 2026년 6월 3일

PR 링크: feast-dev/feast#6465 상태: Merged | 변경: +339 / -0

들어가며

Feast는 온라인 피처 서빙을 위한 강력한 도구이지만, 프로덕션 환경에서는 낮은 지연 시간(latency)과 높은 처리량(throughput)이 필수적입니다. 특히 실시간 머신러닝 시스템에서는 피처 조회 속도가 모델 성능에 직접적인 영향을 미치기 때문입니다. 이번 글에서는 Feast Python 피처 서버를 Kubernetes 환경에서 최적화하여 기본 설정에서 p99 지연 시간을 2ms 미만으로 낮춘 실제 경험을 공유하고자 합니다. 이 과정은 단순히 설정을 변경하는 것을 넘어, 서버 측 튜닝, 클라이언트 측 최적화, 그리고 Feast 라이브러리 자체의 코드 개선까지 포함하는 종합적인 성능 개선 여정이었습니다.

이 PR은 이러한 성능 튜닝 과정을 상세히 기록한 블로그 포스트를 추가합니다. 이 포스트는 다음과 같은 내용을 다룹니다:

1단계 (sub-5ms): 서버 측 튜닝 (Gunicorn 워커, Redis 설정, HPA), 클라이언트 측 튜닝 (연결 풀링, 액세스 모드 선택), 코드 레벨 개선 (직렬화 최적화, 비동기 배치 파이프라인, 캐시된 검사, 세션 래핑 수정).
2단계 (sub-2ms): 사전 계산된 피처 벡터 (Pre-computed feature vectors) - 작동 방식, 설계 결정, 그리고 일반적인 피처 뷰 읽기 경로 대비 6-9배의 속도 향상을 보여주는 벤치마크 결과.

이 글을 통해 여러분도 Feast 온라인 서빙 성능을 개선하는 데 필요한 지식과 방법을 얻어가시길 바랍니다.

코드 변경사항 분석

이번 PR의 핵심은 infra/website/docs/blog/feast-online-server-performance-tuning.md 파일에 새로운 블로그 포스트를 추가하는 것입니다. 이 포스트는 Feast 온라인 피처 서버의 성능을 최적화하는 과정을 단계별로 상세하게 설명하고 있습니다. 코드 변경사항은 주로 문서의 목차(docs/SUMMARY.md)에 새로운 블로그 포스트를 링크하는 것과, 실제 블로그 포스트 내용을 담은 markdown 파일의 추가로 이루어져 있습니다.

1. 문서 목차 업데이트 (`docs/SUMMARY.md`)

새로운 블로그 포스트를 문서 사이트에 포함시키기 위해 목차 파일에 링크를 추가했습니다.

--- a/docs/SUMMARY.md
+++ b/docs/SUMMARY.md
@@ -192,6 +192,7 @@
 * [\includegraphics[width=0.8em]{/images/alpha.png} Streaming feature computation with Denormalized](reference/denormalized.md)
 * [\includegraphics[width=0.8em]{/images/alpha.png} Feature View Versioning](reference/alpha-feature-view-versioning.md)
 * [OpenLineage Integration](reference/openlineage.md)
+ * [MLflow Integration](reference/mlflow.md)
 * [Feast CLI reference](reference/feast-cli-commands.md)
 * [Python API reference](http://rtd.feast.dev)
 * [Usage](reference/usage.md)

설명:

+ * [MLflow Integration](reference/mlflow.md): 이 줄은 MLflow 통합에 대한 링크를 추가하는 것으로 보이나, PR 설명과 블로그 포스트 내용과는 직접적인 관련이 없어 보입니다. 아마도 다른 변경사항이 함께 포함되었거나, PR 설명에 누락된 부분이 있을 수 있습니다. 블로그 포스트 자체는 MLflow와 직접적인 관련이 없습니다.

2. 블로그 포스트 추가 (`infra/website/docs/blog/feast-online-server-performance-tuning.md`)

이 PR의 핵심은 이 markdown 파일에 작성된 블로그 포스트입니다. 이 파일은 Feast 온라인 피처 서버의 성능 튜닝 과정을 상세하게 기술하고 있으며, 실제 코드 조각과 벤치마크 결과를 포함하고 있습니다.

--- /dev/null
+++ b/infra/website/docs/blog/feast-online-server-performance-tuning.md
@@ -0,0 +1,338 @@
+---
+title: "Tuning the Feast Feature Server for Sub-2ms Online Serving"
+description: "A practical guide to achieving low-latency, high-throughput feature serving with Feast on Kubernetes — from default configuration to production-grade performance with pre-computed feature vectors and benchmarks at every step."
date: 2026-06-02
authors: ["Nikhil Kathole"]
---

**Feast supports production-grade worker configuration, connection pooling, async reads, batched pipelines, serialization optimizations, and pre-computed feature vectors for the Python feature server.** This post walks through a real-world performance tuning exercise in two stages: first, server and client tuning that brings p99 latency down to **sub-5ms** for single-row requests; then, **pre-computed feature vectors** that push it further to **sub-2ms p99** — regardless of how many feature views your FeatureService spans. We share the benchmarking methodology, the exact configuration changes, and the measured impact of each step so you can apply the same approach to your own deployments.

---

## The Problem

When you deploy Feast on Kubernetes using the [Feast Operator](https://docs.feast.dev/how-to-guides/production-deployment-topologies), the default configuration is designed for simplicity — a single Gunicorn worker, short keep-alive timeouts, and frequent registry refreshes. This is fine for development but leaves significant performance on the table for production workloads where every millisecond matters.

We set out to answer a practical question: **how low can we push the Feast online server's p99 latency, and what does it take to get there?**

---

## Test Environment

| Component | Configuration |
|-----------|--------------|
| **Online Store** | Redis 7.0.12 (standalone, in-cluster) |
| **Registry** | PostgreSQL 16 (SQL registry) |
| **Platform** | Kubernetes |
| **Deployment** | Feast Operator with `FeatureStore` CR |

We used a banking feature store project with multiple feature views spanning customer demographics, transactions, and behavioral profiles.

All benchmarks run 200 iterations (after 30–50 warmup) for each scenario, measuring p50, p95, p99, and mean latency. Throughput is measured with 10 concurrent workers over 15 seconds.

---

## Three Access Modes

Feast supports three ways to retrieve online features. Understanding how each one works is key to knowing where latency comes from — and where to optimize.

### REST API

Client → HTTP POST (JSON) → Gunicorn/FastAPI Server → Redis mget() → JSON response


The simplest and most common pattern. Your application sends a JSON request to the feature server's `/get-online-features` endpoint. The server holds persistent Redis connections and a pre-loaded registry, so each request is just a Redis read plus JSON serialization. HTTP keep-alive reuses TCP/TLS connections across requests.

### Direct SDK

Client (Python) → FeatureStore SDK → Redis mget() directly


The Python SDK connects to Redis directly — no HTTP hop, no JSON overhead. However, it pays for in-process registry lookups and entity key serialization on every call, and reads each FeatureView sequentially.

### Remote SDK

Client (Python SDK) → HTTP POST → Feature Server → Redis → JSON → Client


The SDK delegates feature retrieval to a remote feature server over HTTP. This combines the worst of both worlds: SDK-side overhead *plus* an HTTP round-trip. Without connection pooling, each call creates a new TCP connection and TLS handshake.

---

## Baseline: Default Configuration

With no tuning applied — a single Gunicorn worker, default timeouts, and no connection pooling:

| Mode | p99 (1 row) | p99 (5 rows) | Throughput |
|------|----------------|-------------------|------------|
| **REST API** | 6.92 ms | 4.94 ms | 480 RPS |
| **Direct SDK** | 5.83 ms | 5.59 ms | — |
| **Remote SDK** | 11.71 ms | **74.31 ms** | ~2 RPS |

The REST API and Direct SDK are already in the 5–7ms range out of the box, but the Remote SDK fails badly — p99 spiking to **74ms** at just 5 rows due to per-request TCP/TLS setup overhead. This is our starting point.

---

## Server-Side Configuration

These are changes you apply to the **feature server deployment** — no code changes needed, just configuration via the `FeatureStore` CR and Redis runtime settings.

### Worker Tuning via the Feast Operator

The Feast Operator exposes `workerConfigs` in the `FeatureStore` CR, letting you tune the Gunicorn server without rebuilding images:

```yaml
apiVersion: feast.dev/v1alpha1
kind: FeatureStore
spec:
  services:
    onlineStore:
      server:
        workerConfigs:
          workers: -1              # Auto: 2 × CPU cores + 1
          keepAliveTimeout: 120    # Reuse connections longer
          maxRequests: 5000        # Recycle workers to prevent memory leaks
          maxRequestsJitter: 200   # Stagger recycling
          registryTTLSeconds: 300  # Reduce registry refresh overhead
          workerConnections: 2000  # High-concurrency support

Setting workers: -1 on a 4-core pod gives 9 Gunicorn workers, each with its own event loop and Redis connection. This is the single most impactful change — it transforms the server from single-threaded to multi-process, dropping 5-row p99 from ~10ms to ~8ms and putting us on the path to sub-5ms.

Redis Runtime Tuning

Three Redis settings made a measurable difference:

hz 100 (default 10) — Redis processes expired keys and timeouts 10x faster, reducing tail latency spikes.
tcp-keepalive 60 (default 300) — Detects dead connections 5x faster, freeing resources sooner.
save "" (disable RDB persistence) — Eliminates periodic snapshot I/O that causes 10–50ms p99 spikes. Since features are materialized from the offline store and reconstructible at any time, persistence is unnecessary.

High Availability and Auto-Scaling

For production, we added horizontal scaling and availability guarantees using the Feast Operator's built-in HA support:

spec:
  replicas: 2
  services:
    onlineStore:
      server:
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
          limits:
            cpu: "2"
            memory: 2Gi
  scaling:
    autoscaling:
      minReplicas: 2
      maxReplicas: 10
      metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 70
    pdb:
      minAvailable: 1

When scaling is enabled, the operator auto-injects pod anti-affinity and zone topology spread constraints, ensuring replicas land on different nodes for resilience. With HPA, the cluster auto-scales based on CPU utilization — we observed it scaling from 2 to 3 pods in response to load during benchmarks. At 10 pods with 9 workers each, theoretical throughput reaches ~7,180 RPS (~25.8M RPH).

Server-side quick wins summary

Set workers: -1 — single most impactful change
Disable Redis persistence — CONFIG SET save ""
Set registryTTLSeconds: 300 — reduce registry refresh overhead
Use replicas: 2 minimum with HPA for burst capacity
Set resource limits — defaults are far too low for production

Client-Side Configuration

These are changes you apply on the client — how the SDK connects to the feature server and which access mode you choose.

Connection Pooling for the Remote SDK

The biggest problem with the Remote SDK was that every call created a brand-new requests.Session, established a fresh TCP connection, negotiated TLS, and then threw it all away — adding 2–4ms per call for HTTPS endpoints.

Feast now includes HttpSessionManager — a thread-safe, singleton session manager that reuses HTTP connections across requests with configurable pooling and retry:

online_store:
  type: remote
  path: https://feast-server:443
  connection_pool_size: 50
  connection_idle_timeout: 300
  connection_retries: 3

This dropped Remote SDK 5-row p99 from 74ms to 21ms — a 72% reduction — by eliminating the per-request TLS handshake.

Choosing the right access mode

Use Case	Recommended Mode	Why
Application serving	REST API	Sub-5ms single-row p99, simplest integration, 718 RPS per pod
Python ML pipeline	Direct SDK	No HTTP hop, sub-5ms p99, native protobuf
Async Python applications	Async Direct SDK	Non-blocking, batched pipeline, sub-5ms p99
Cross-cluster serving	Remote SDK + pooling	When the client can't reach Redis directly; 760 RPS with pooling

Code Enhancements in Feast

Beyond configuration, several code-level improvements in Feast itself contributed to reaching sub-5ms p99. These require no user configuration — just upgrading to the latest Feast version.

Serialization Optimization

The feature server used google.protobuf.json_format.MessageToDict to convert protobuf responses to JSON — a generic, reflection-based serializer that was a meaningful fraction of server-side latency. Replacing it with an optimized custom dict builder delivered a 66% throughput increase (432 to 718 RPS) and 72% reduction in tail latency under load (132ms to 37ms p99).

Async Redis Reads with Batched Pipeline

The RedisOnlineStore had async support (online_read_async with redis_asyncio), but the async_supported property was not overridden, so the feature server never used it. Enabling it unlocks non-blocking I/O on the server side — the FastAPI handler calls get_online_features_async directly instead of wrapping the sync path in run_in_threadpool.

Additionally, the base class async path issued O(N_feature_views) separate round trips to Redis via asyncio.gather. We added a get_online_features_async override to RedisOnlineStore that batches all HMGET commands across all feature views into a single async pipeline execution (O(1) round trips), matching the existing sync batched pipeline. This cut async 5-row p99 from ~11ms to 5.6ms — a 49% improvement.

Cached Per-Request Checks

_check_versioned_read_support() performed up to 7 lazy module imports on every request to determine if the current online store supports versioned reads. We cache the result per store instance, resolving imports once and eliminating ~0.5–1ms of overhead per request.

Skip Duplicate Feature Resolution

When auth is no_auth (the common case), the feature server was resolving feature views solely to check permissions (which are no-ops), then resolving them again inside get_online_features. We skip the first resolution entirely, avoiding a redundant registry lookup.

Session Wrapping Fix

The rest_error_handling_decorator re-wrapped cached requests.Session HTTP methods on every call. After ~1000 re

Stage 2: Sub-2ms with Pre-computed Feature Vectors

While the above optimizations brought p99 latency down to sub-5ms, achieving sub-2ms required a more fundamental change: pre-computed feature vectors. This technique involves pre-calculating and storing entire feature vectors (multiple features for a single entity) together, rather than retrieving individual features and assembling them on the fly.

How Pre-computed Feature Vectors Work

Instead of fetching features like customer_demographics and transaction_history separately, you can store a combined vector like customer_profile_features in Redis. This significantly reduces the number of Redis lookups and the overhead associated with assembling the final feature vector.

Design Decisions and Implementation

The blog post details the design choices for implementing pre-computed feature vectors, including:

Storage format: How to efficiently store and retrieve these combined vectors in Redis.
Materialization: How to ensure these pre-computed vectors are kept up-to-date with the latest feature values.
Feature Service definition: How to define a FeatureService that requests these pre-computed vectors.

Benchmark Results

The benchmarks show a dramatic improvement with pre-computed feature vectors:

6–9x speedup compared to the regular per-feature-view read path.
Achieving sub-2ms p99 latency even for feature services spanning multiple original feature views.

This approach is particularly effective when dealing with complex feature services where the overhead of fetching and assembling individual features becomes a bottleneck.

왜 이것이 좋은 최적화인가?

이번 PR에서 제시된 블로그 포스트는 Feast 온라인 피처 서버의 성능을 극적으로 향상시키는 여러 기법을 소개합니다. 이 최적화들이 왜 좋은지, 그리고 어떤 교훈을 얻을 수 있는지 살펴보겠습니다.

1. 다각적인 접근 방식

성능 최적화는 단일 기법으로 달성되지 않습니다. 이 블로그 포스트는 다음을 포함한 다각적인 접근 방식을 보여줍니다:

서버 측 튜닝: Gunicorn 워커 설정(workers: -1), Redis 설정(hz, tcp-keepalive, save), HPA 설정을 통해 서버의 처리 능력과 안정성을 높였습니다. 특히 workers: -1 설정은 단일 워커에서 멀티 프로세스로 전환하여 성능을 크게 향상시키는 핵심적인 변화였습니다.
클라이언트 측 튜닝: Remote SDK 사용 시 HttpSessionManager를 통한 연결 풀링 도입은 HTTP/TLS 핸드셰이크 오버헤드를 제거하여 지연 시간을 획기적으로 줄였습니다. 이는 클라이언트와 서버 간의 통신 방식을 최적화하는 좋은 예시입니다.
코드 레벨 개선: 직렬화 최적화(MessageToDict 대체), 비동기 Redis 읽기 및 파이프라인 사용, 불필요한 모듈 임포트 캐싱, 중복 피처 조회 건너뛰기 등 Feast 라이브러리 자체의 개선은 사용자 설정 없이도 성능 향상을 가져옵니다. 이는 라이브러리 개발자가 성능에 대한 깊은 이해를 바탕으로 최적화를 수행했음을 보여줍니다.
아키텍처 패턴 도입: 사전 계산된 피처 벡터(Pre-computed feature vectors)는 근본적인 설계 변경을 통해 성능 한계를 돌파하는 방법을 제시합니다. 이는 단순히 기존 방식을 개선하는 것을 넘어, 문제 해결을 위한 새로운 접근 방식을 도입하는 것의 중요성을 보여줍니다.

2. 측정 가능한 성능 향상

블로그 포스트는 각 단계별 성능 개선 수치를 명확하게 제시합니다:

기본 설정에서 REST API p99 6.92ms -> 서버 튜닝 후 sub-5ms -> 클라이언트 튜닝 후 sub-5ms -> 코드 개선 후 sub-5ms.
Remote SDK p99 74.31ms -> 연결 풀링 도입 후 21ms (72% 감소).
비동기 파이프라인 개선으로 5-row p99 약 11ms -> 5.6ms (49% 감소).
사전 계산된 피처 벡터 사용 시 6-9배 속도 향상 및 sub-2ms p99 달성.

이러한 구체적인 수치는 최적화의 효과를 명확히 보여주며, 다른 사용자들에게도 적용할 수 있는 가이드라인을 제공합니다.

3. 일반적인 교훈

프로파일링과 병목 식별: 성능 최적화의 첫걸음은 현재 시스템의 병목 지점을 정확히 파악하는 것입니다. 이 블로그는 REST API, Direct SDK, Remote SDK 등 다양한 액세스 모드의 특성을 분석하고 각 모드의 병목을 찾아 해결했습니다.
서버와 클라이언트 동시 최적화: 성능은 서버뿐만 아니라 클라이언트의 통신 방식에도 크게 좌우됩니다. 클라이언트 측 연결 풀링과 같은 최적화는 서버 측 최적화만큼 중요합니다.
라이브러리 내부 최적화의 중요성: Feast 자체의 코드 개선은 사용자에게 투명하게 성능 향상을 제공합니다. 이는 오픈소스 프로젝트에서 지속적인 성능 개선 노력이 사용자에게 큰 가치를 제공함을 의미합니다.
아키텍처적 접근: 때로는 기존 아키텍처의 제약을 넘어서기 위해 새로운 패턴(예: 사전 계산된 피처 벡터)을 도입하는 것이 필요합니다.
측정, 측정, 측정: 모든 변경 사항은 반드시 측정 가능한 결과로 검증되어야 합니다. 이 블로그는 체계적인 벤치마킹 방법론을 제시합니다.

⚠️ 알림: 이 분석은 AI가 실제 코드 diff를 기반으로 작성했습니다.

PR Analysis 의 다른글

이전글 [vllm] [ROCm CI 최적화] Docker 3단계 빌드 전략으로 빌드 시간 26분 단축하기
현재글 : [feast] Feast 온라인 서빙 성능 튜닝: Sub-2ms 달성을 위한 여정
다음글 [sglang] DeepSeek V4의 Prefill 성능을 1.35배 향상시킨 FlashAttention 최적화