[Ray Serve] Adding Router Queue Wait Time Metrics

December 16, 2025 (updated: December 16, 2025)

PR #59233 - [Serve][3/n] Add router queue latency

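Introduction

Until now, Ray Serve had no way to measure how long a request waits between arriving at the router and actually being assigned to a replica. This PR adds two new metrics that make the wait time in the router queue and the per-replica queue length observable.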

Core Code Analysis

Newly added metrics

# Histogram of queue wait time
self.queue_wait_time_ms_histogram = metrics.Histogram(
    "serve_request_router_fulfillment_time_ms",
    description="Time in milliseconds that a request spent waiting in the "
                "queue before being assigned to a replica.",
    boundaries=DEFAULT_LATENCY_BUCKET_MS,
    tag_keys=("deployment", "actor_id", "application", "handle_source"),
)

# Gauge of per-replica queue length
self.router_queue_len_gauge = metrics.Gauge(
    "serve_request_router_queue_len",
    description="The number of requests currently running on a replica "
                "as tracked by the router's queue length cache.",
    tag_keys=("deployment", "replica_id", "actor_id", "application", "handle_source"),
)
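To make the role of the boundaries argument concrete, here is a minimal sketch of how a bucketed histogram counts observations. It does not use Ray at all: BucketCounter and EXAMPLE_BOUNDARIES_MS are hypothetical names, the boundary values are made up (the actual contents of DEFAULT_LATENCY_BUCKET_MS are not shown in this post), and the upper-bound-inclusive bucket rule is an assumption borrowed from Prometheus-style histograms.

```python
import bisect

# Hypothetical bucket boundaries in milliseconds; NOT the real
# DEFAULT_LATENCY_BUCKET_MS values, which this post does not show.
EXAMPLE_BOUNDARIES_MS = [1, 5, 10, 50, 100, 500, 1000]


class BucketCounter:
    """Toy stand-in for a histogram: counts observations per bucket."""

    def __init__(self, boundaries):
        self.boundaries = sorted(boundaries)
        # One extra slot for values above the last boundary (the "+Inf" bucket).
        self.counts = [0] * (len(self.boundaries) + 1)

    def observe(self, value_ms):
        # Index of the first boundary >= value_ms, so a value equal to a
        # boundary lands in that boundary's bucket (assumed semantics).
        index = bisect.bisect_left(self.boundaries, value_ms)
        self.counts[index] += 1


counter = BucketCounter(EXAMPLE_BOUNDARIES_MS)
for wait_ms in (0.4, 5, 7.5, 120, 4000):
    counter.observe(wait_ms)
print(counter.counts)
```

With coarse boundaries, a wait of 120 ms and one of 480 ms fall into the same bucket, so the bucket layout limits how precisely percentiles can be read back from the metric.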

Recording the wait time

def _record_queue_wait_time(self, pending_request: PendingRequest):
    """Records the time a request spent in the queue."""
    queue_wait_time_ms = (time.time() - pending_request.created_at) * 1000
    self.queue_wait_time_ms_histogram.observe(queue_wait_time_ms)

This method is called at the two places where a request is assigned to a replica (the matching path of _fulfill_next_pending_request and the FIFO path).
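The measurement itself can be sketched without Ray. In the toy below, ToyRouter, its methods, and the injectable clock are all hypothetical and exist only for illustration. Only the created_at field and the "(now - created_at) * 1000" calculation come from the snippet above, and a plain list stands in for the histogram's observe() call.

```python
import time
from collections import deque
from dataclasses import dataclass


@dataclass
class PendingRequest:
    """Hypothetical stand-in; only created_at matters for this sketch."""
    request_id: str
    created_at: float


class ToyRouter:
    """Illustrative FIFO queue that records wait time on assignment."""

    def __init__(self, clock=time.time):
        self._clock = clock          # injectable so the example is deterministic
        self._pending = deque()
        self.observed_ms = []        # stand-in for histogram.observe()

    def enqueue(self, request_id):
        # Stamp the request when it enters the queue.
        self._pending.append(PendingRequest(request_id, self._clock()))

    def fulfill_next(self):
        # Pop the oldest request and record how long it waited, in ms.
        pending_request = self._pending.popleft()
        wait_ms = (self._clock() - pending_request.created_at) * 1000
        self.observed_ms.append(wait_ms)
        return pending_request.request_id


# Usage with a fake clock: the request waits 0.25 simulated seconds.
fake_now = [100.0]
router = ToyRouter(clock=lambda: fake_now[0])
router.enqueue("req-1")
fake_now[0] += 0.25
assigned = router.fulfill_next()
print(assigned, router.observed_ms)
```

Because the timestamp is taken at enqueue time and read at assignment time, the recorded value covers only the time spent queued in the router, not the time the replica spends processing the request.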

Updating the queue length

def _update_router_queue_len_gauge(self, replica_id: ReplicaID, queue_len: int):
    self.router_queue_len_gauge.set(queue_len, tags={"replica_id": replica_id.unique_id})

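The gauge is updated at three points: when a request is sent to a replica, when queue length information is received, and when the queue length is probed.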
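The three update points can be pictured with a small Ray-free sketch. ToyGauge, ToyQueueLenTracker, and the on_* method names are hypothetical. How the real router adjusts its cached value when a request is sent is not shown in the post, so the "+1" here is an assumption made only to keep the example concrete.

```python
class ToyGauge:
    """Stand-in for a gauge: keeps the last value set per tag combination."""

    def __init__(self):
        self.values = {}

    def set(self, value, tags):
        self.values[tuple(sorted(tags.items()))] = value


class ToyQueueLenTracker:
    """Illustrative cache of per-replica queue lengths mirrored to a gauge."""

    def __init__(self, gauge):
        self._gauge = gauge
        self._cache = {}

    def _update(self, replica_id, queue_len):
        # Every path funnels through one place so the gauge never lags the cache.
        self._cache[replica_id] = queue_len
        self._gauge.set(queue_len, tags={"replica_id": replica_id})

    def on_request_sent(self, replica_id):
        # Assumption: sending a request bumps the cached length by one.
        self._update(replica_id, self._cache.get(replica_id, 0) + 1)

    def on_queue_len_reported(self, replica_id, queue_len):
        # The replica reported its queue length alongside a response.
        self._update(replica_id, queue_len)

    def on_probe_result(self, replica_id, queue_len):
        # The router actively probed the replica for its queue length.
        self._update(replica_id, queue_len)


gauge = ToyGauge()
tracker = ToyQueueLenTracker(gauge)
tracker.on_request_sent("replica-a")
tracker.on_request_sent("replica-a")
tracker.on_queue_len_reported("replica-a", 1)
tracker.on_probe_result("replica-b", 3)
print(gauge.values)
```

Routing all three paths through a single helper mirrors the shape of _update_router_queue_len_gauge above: the gauge always reflects the router's latest view, whichever event produced it.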

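Why this is good

Identifying bottlenecks: a high serve_request_router_fulfillment_time_ms points to too few replicas or imbalanced throughput, which gives a concrete basis for scaling decisions.
Per-replica load visibility: the serve_request_router_queue_len gauge makes it possible to detect requests piling up on specific replicas.
99th-percentile tail latency improvement: in the benchmark, the 99th percentile improved from 320 ms to 310 ms, and the maximum from 2995 ms to 1171 ms.
Extending the existing tag scheme: the handle_source tag is added, enabling analysis broken down by internal and external handle source.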
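As an illustration of the per-replica visibility point, the sketch below flags replicas whose queue length is far above the average. It works on a plain dict of scraped gauge values. The function name, the sample data, and the "2x the mean" threshold are all assumptions chosen for the example, not anything defined by Ray Serve.

```python
from statistics import mean


def find_overloaded_replicas(queue_lens, factor=2.0):
    """Return replica IDs whose queue length exceeds factor * mean.

    queue_lens maps replica_id -> last observed queue length.
    The threshold is an arbitrary illustrative heuristic.
    """
    if not queue_lens:
        return []
    average = mean(queue_lens.values())
    return sorted(
        replica_id
        for replica_id, queue_len in queue_lens.items()
        if queue_len > factor * average
    )


# Hypothetical snapshot of the queue length gauge, keyed by replica_id.
sample = {"replica-a": 1, "replica-b": 2, "replica-c": 12}
overloaded = find_overloaded_replicas(sample)
print(overloaded)
```

In practice the same comparison would more likely be expressed as an alerting rule in the monitoring system; the Python form is only meant to show the logic.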

