[vllm] Multi-node distributed inference support with the MP Executor

November 16, 2025 | Updated: November 16, 2025

PR link: vllm-project/vllm#23691 | Status: Merged | Changes: +930/-82

Introduction

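The vLLM V1 engine can now run distributed inference across multiple nodes. MultiprocExecutor, which previously supported only multiple GPUs within a single node, has been extended to support tensor parallelism (TP) and pipeline parallelism (PP) that span several nodes.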

Core code analysis

Configuration

def create_vllm_config(
    tensor_parallel_size: int = 1,
    pipeline_parallel_size: int = 1,
    distributed_executor_backend: str = "mp",
    nnodes: int = 1,
    node_rank: int = 0,
    master_port: int = 0,
) -> VllmConfig:
    # ... (excerpt: the code that builds vllm_config is omitted here)

    # Multi-node settings: record the node count, this node's rank,
    # and the master port.
    if nnodes > 1 or node_rank > 0:
        vllm_config.parallel_config.nnodes = nnodes
        vllm_config.parallel_config.node_rank = node_rank
        vllm_config.parallel_config.master_port = master_port
    # Custom all-reduce is turned off whenever more than one node is used.
    if nnodes > 1:
        vllm_config.parallel_config.disable_custom_all_reduce = True

    return vllm_config

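Custom all_reduce is disabled in multi-node runs, since that kernel targets GPUs that sit in the same node. Communication between nodes goes through NCCL, which can use high-speed interconnects such as NVLink and InfiniBand (IB).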
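The branching above can be tried in isolation with a small stand-in. This is a minimal sketch, not vLLM code: `ParallelSettings` and `apply_multi_node` are hypothetical names introduced here, and only the field names (`nnodes`, `node_rank`, `master_port`, `disable_custom_all_reduce`) are taken from the excerpt. The `world_size` property assumes the usual TP x PP product.

```python
from dataclasses import dataclass


@dataclass
class ParallelSettings:
    """Hypothetical stand-in for the parallel section of a vLLM config."""

    tensor_parallel_size: int = 1
    pipeline_parallel_size: int = 1
    nnodes: int = 1
    node_rank: int = 0
    master_port: int = 0
    disable_custom_all_reduce: bool = False

    @property
    def world_size(self) -> int:
        # Total number of workers across all nodes (assumed to be TP x PP).
        return self.tensor_parallel_size * self.pipeline_parallel_size


def apply_multi_node(
    settings: ParallelSettings,
    nnodes: int = 1,
    node_rank: int = 0,
    master_port: int = 0,
) -> ParallelSettings:
    """Mirror the two branches from the excerpt on the stand-in object."""
    # Only touch the multi-node fields when the caller asked for them.
    if nnodes > 1 or node_rank > 0:
        settings.nnodes = nnodes
        settings.node_rank = node_rank
        settings.master_port = master_port
    # More than one node: fall back from custom all-reduce.
    if nnodes > 1:
        settings.disable_custom_all_reduce = True
    return settings
```

With the defaults, nothing changes and custom all_reduce stays enabled. Passing `nnodes=2` fills in the three multi-node fields and sets the flag, which is the behavior the excerpt describes.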

Worker management and health checks

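# Worker initialization and RPC-based communication
# (the assertions assume vllm_config was built with TP x PP = 2)
executor = MultiprocExecutor(vllm_config=vllm_config)
assert executor.world_size == 2
assert len(executor.workers) == 2

# Collective RPC call and health check
executor.check_health()
assert not executor.is_failed

# Register a failure callback (`callback` is a callable supplied by the caller)
executor.register_failure_callback(callback)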

The executor handles worker process creation, monitoring, and failure detection. When a worker fails, the registered callback is invoked, which makes a graceful shutdown possible.
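The monitor-and-callback pattern can be illustrated with a small standalone sketch. This is not vLLM's implementation: `TinyExecutor` is a hypothetical class written for this post, and it watches plain subprocesses from background threads. Only the method names `check_health`, `is_failed`, and `register_failure_callback` echo the excerpt above.

```python
import subprocess
import threading
from typing import Callable, Optional


class TinyExecutor:
    """Hypothetical miniature executor that watches worker processes."""

    def __init__(self, worker_commands: list[list[str]]) -> None:
        self.workers = [subprocess.Popen(cmd) for cmd in worker_commands]
        self.is_failed = False
        self._shutting_down = False
        self._callback: Optional[Callable[[], None]] = None
        self._lock = threading.Lock()
        # One monitor thread per worker blocks until that worker exits.
        self._monitors = [
            threading.Thread(target=self._watch, args=(w,), daemon=True)
            for w in self.workers
        ]
        for monitor in self._monitors:
            monitor.start()

    def _watch(self, worker: subprocess.Popen) -> None:
        exit_code = worker.wait()
        with self._lock:
            # Exits during shutdown, or clean exits, are not failures.
            if self._shutting_down or exit_code == 0:
                return
            first_failure = not self.is_failed
            self.is_failed = True
            callback = self._callback
        # Fire the callback once, outside the lock.
        if first_failure and callback is not None:
            callback()

    def register_failure_callback(self, callback: Callable[[], None]) -> None:
        with self._lock:
            already_failed = self.is_failed
            if not already_failed:
                self._callback = callback
        # A worker may have died before registration; report it right away.
        if already_failed:
            callback()

    def check_health(self) -> None:
        if self.is_failed:
            raise RuntimeError("a worker process has failed")

    def shutdown(self) -> None:
        with self._lock:
            self._shutting_down = True
        for worker in self.workers:
            if worker.poll() is None:
                worker.terminate()
        for monitor in self._monitors:
            monitor.join()
```

A caller would typically register a callback that sets an event or flips a flag, then call `shutdown()` once it fires so the surviving workers are terminated in an orderly way.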