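[SGLang] EPLB: The Expert-Parallel Load Balancing Algorithm

April 12, 2026 (updated: April 12, 2026)

Introduction

In an Expert Parallel setup, when tokens concentrate on a few experts, the GPUs hosting those experts are overloaded while the rest sit idle. EPLB (Expert-Parallel Load Balancing) monitors expert usage at runtime and replicates popular experts across multiple GPUs so that the load is spread evenly.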

Architecture

┌─────────────────────────────────────────────────────┐
│                   EPLBManager                       │
│  (triggers a rebalance every N iterations)          │
├─────────────────────────────────────────────────────┤
│                                                     │
│  ExpertDistributionRecorder ──► logical_count       │
│          │                                          │
│          ▼                                          │
│  eplb_algorithms.rebalance_experts()                │
│  ┌─────────────────────────────────────┐            │
│  │ deepseek / deepseek_vec /           │            │
│  │ elasticity_aware                    │            │
│  └────────────┬────────────────────────┘            │
│               │                                     │
│               ▼                                     │
│  ExpertLocationMetadata                             │
│  ┌─────────────────────────────────────┐            │
│  │ physical_to_logical_map             │            │
│  │ logical_to_all_physical_map         │            │
│  │ logical_to_rank_dispatch_physical   │            │
│  └────────────┬────────────────────────┘            │
│               │                                     │
│               ▼                                     │
│  model_runner.update_expert_location()              │
└─────────────────────────────────────────────────────┘

Key Code Walkthrough

1. EPLBManager: the rebalancing orchestrator

EPLBManager uses a generator to trigger a rebalance every N iterations.

class EPLBManager:
    def __init__(self, model_runner):
        self._model_runner = model_runner
        self._server_args = model_runner.server_args
        self._rebalance_num_iterations = self._server_args.eplb_rebalance_num_iterations
        self._main_generator = self._entrypoint()

    def on_forward_pass_end(self):
        next(self._main_generator)

    def _entrypoint(self):
        while True:
            for _ in range(self._rebalance_num_iterations):
                yield
            yield from self.rebalance()

on_forward_pass_end is called after every forward pass; once N passes have been counted, the next call runs rebalance.
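
The scheduling behaviour can be reproduced in isolation. The following is a minimal sketch, not SGLang code: IterationTrigger, run_demo, and fake_rebalance are hypothetical names, and the rebalance step is replaced by a stub that only records when it ran.

```python
class IterationTrigger:
    """Hypothetical stand-in for the scheduling part of EPLBManager."""

    def __init__(self, num_iterations, action):
        self._num_iterations = num_iterations
        self._action = action  # callable returning a generator, like rebalance()
        self._main_generator = self._entrypoint()

    def on_forward_pass_end(self):
        # Advance the generator by exactly one step per forward pass.
        next(self._main_generator)

    def _entrypoint(self):
        while True:
            for _ in range(self._num_iterations):
                yield
            yield from self._action()


def run_demo(num_iterations=3, num_passes=8):
    """Return the 1-based forward-pass indices at which the action ran."""
    fired_at = []
    current_pass = 0

    def fake_rebalance():
        fired_at.append(current_pass)
        yield from ()  # yields nothing, but makes this function a generator

    trigger = IterationTrigger(num_iterations, fake_rebalance)
    for current_pass in range(1, num_passes + 1):
        trigger.on_forward_pass_end()
    return fired_at
```

With num_iterations=3 and eight passes, run_demo() returns [4, 7]: the first N calls only consume the counting yields, so the action runs on call N+1 and then every N calls after that, as long as the action itself does not yield.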

2. rebalance: performing the rebalance

def rebalance(self):
    dump_record_output = get_global_expert_distribution_recorder().dump_record(
        output_mode="object"
    )
    logical_count = dump_record_output["logical_count"]
    average_utilization_rate = dump_record_output[
        "average_utilization_rate_over_window"
    ]

    if not self._check_rebalance_needed(average_utilization_rate):
        return

    expert_location_metadata = ExpertLocationMetadata.init_by_eplb(
        self._server_args, self._model_runner.model_config, logical_count
    )

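    # update_layer_ids_chunks: the layer IDs to update, split into chunks
    # (its construction is omitted from this excerpt)
    for chunk_index, update_layer_ids in enumerate(update_layer_ids_chunks):
        if len(update_layer_ids_chunks) > 1:
            yield  # yield between layer chunks to keep serving interruptions short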
        self._model_runner.update_expert_location(
            expert_location_metadata, update_layer_ids=update_layer_ids,
        )

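If average GPU utilization is already at or above the threshold (eplb_min_rebalancing_utilization_threshold), the load is considered balanced enough and the rebalance is skipped. Otherwise the layers are updated one chunk at a time, yielding between chunks, which keeps serving interruptions short.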
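
The skip check and the chunked update can be sketched together as follows. This is an illustration under assumptions, not the SGLang implementation: check_rebalance_needed, chunk_layer_ids, and rebalance_steps are hypothetical helpers, the comparison follows the "at or above" wording of this post, and the actual weight update is replaced by appending the chunk to a list.

```python
def check_rebalance_needed(average_utilization_rate, threshold):
    """Rebalance only while utilization is below the threshold (assumed rule)."""
    return average_utilization_rate < threshold


def chunk_layer_ids(num_layers, layers_per_chunk):
    """Split layer IDs 0..num_layers-1 into consecutive chunks."""
    layer_ids = list(range(num_layers))
    if layers_per_chunk is None:
        return [layer_ids]  # a single chunk: update everything at once
    return [
        layer_ids[start:start + layers_per_chunk]
        for start in range(0, num_layers, layers_per_chunk)
    ]


def rebalance_steps(num_layers, layers_per_chunk, utilization, threshold, updated):
    """Generator with the same control flow as the rebalance excerpt."""
    if not check_rebalance_needed(utilization, threshold):
        return
    chunks = chunk_layer_ids(num_layers, layers_per_chunk)
    for chunk in chunks:
        if len(chunks) > 1:
            yield  # hand control back to serving between chunks
        updated.append(chunk)  # stand-in for update_expert_location(...)
```

Each yield hands control back to the caller, so with several chunks the update is spread over several forward passes instead of stalling a single one.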

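3. ExpertLocationMetadata: physical-to-logical mapping

The core data structure is the mapping between physical experts (the slots actually placed on GPUs) and logical experts (the expert indices defined by the model).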

from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class ExpertLocationMetadata:
    physical_to_logical_map: torch.Tensor    # (layers, num_physical_experts)
    logical_to_all_physical_map: torch.Tensor # (layers, num_logical_experts, X)
    logical_to_all_physical_map_num_valid: torch.Tensor
    logical_to_rank_dispatch_physical_map: Optional[torch.Tensor]

num_physical_experts = num_logical_experts + ep_num_redundant_experts, so the physical slots include spare slots reserved for replicas. For example, 256 logical experts + 64 redundant slots = 320 physical slots.
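
The slot arithmetic can be checked with a few lines. This is a sketch under assumptions: physical_slot_layout is a hypothetical helper, the slots are assumed to divide evenly across expert-parallel ranks, and the ep_size of 8 in the usage note is an illustrative value that does not come from this post.

```python
def physical_slot_layout(num_logical_experts, num_redundant_experts, ep_size):
    """Return (total physical slots, physical slots per EP rank)."""
    num_physical_experts = num_logical_experts + num_redundant_experts
    if num_physical_experts % ep_size != 0:
        # Assumption of this sketch: every rank holds the same number of slots.
        raise ValueError("physical slots must divide evenly across EP ranks")
    return num_physical_experts, num_physical_experts // ep_size
```

For the example above, physical_slot_layout(256, 64, 8) gives 320 slots in total and 40 slots per rank.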

4. init_by_eplb: algorithm-driven initialization

@staticmethod
def init_by_eplb(server_args, model_config, logical_count):
    physical_to_logical_map, logical_to_all_physical_map, expert_count = (
        eplb_algorithms.rebalance_experts(
            tokens_per_expert=logical_count,
            num_physical_experts=num_physical_experts,
            num_local_physical_experts=num_physical_experts // ep_size,
            num_groups=num_groups,
            num_nodes=num_nodes,
            algorithm=eplb_algorithms.compute_algorithm(...),
        )
    )
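
To give a feel for what a rebalancing routine has to produce from tokens_per_expert, here is a deliberately simplified greedy sketch. It is not the algorithm SGLang runs: greedy_replicate is a hypothetical function that handles a single layer, ignores GPU and node placement, and simply gives each spare slot to the expert with the highest load per replica.

```python
def greedy_replicate(tokens_per_expert, num_physical_experts):
    """Assign spare physical slots to the currently hottest experts.

    Returns (physical_to_logical, replica_count) for one layer.
    """
    num_logical = len(tokens_per_expert)
    if num_physical_experts < num_logical:
        raise ValueError("need at least one physical slot per logical expert")

    physical_to_logical = list(range(num_logical))  # one slot per expert first
    replica_count = [1] * num_logical

    for _ in range(num_physical_experts - num_logical):
        # Load per replica if the expert's tokens are split evenly.
        hottest = max(
            range(num_logical),
            key=lambda e: tokens_per_expert[e] / replica_count[e],
        )
        physical_to_logical.append(hottest)
        replica_count[hottest] += 1

    return physical_to_logical, replica_count
```

With eight experts where experts 2 and 5 receive most of the tokens and two spare slots, this sketch appends replicas of 2 and 5, which is the layout used in the rebalancing example at the end of this post.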

5. Choosing the rebalancing algorithm

from enum import Enum, auto


class EplbAlgorithm(Enum):
    deepseek = auto()
    deepseek_hierarchical = auto()
    deepseek_vec = auto()
    deepseek_vec_hierarchical = auto()
    elasticity_aware = auto()
    elasticity_aware_hierarchical = auto()

def compute_algorithm(raw_algorithm, num_groups, num_nodes):
    if raw_algorithm != "auto":
        return EplbAlgorithm[raw_algorithm]
    if (num_groups is not None) and (num_groups % num_nodes == 0):
        return EplbAlgorithm.deepseek_hierarchical
    else:
        return EplbAlgorithm.deepseek

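The hierarchical variants first try to place replicas within the same node in order to reduce inter-node communication. With "auto", the hierarchical version is chosen only when num_groups is set and divisible by num_nodes; otherwise the plain deepseek algorithm is used.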

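6. Finding the nearest expert

When routing a token, the nearest physical expert is preferred in the order same GPU > same node > remote node. The helper below covers the first two cases and returns -1 when no candidate sits on the local GPU or node.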

def _find_nearest_expert(candidate_physical_expert_ids,
                          num_local_gpu_physical_experts, moe_ep_rank, ...):
    # 1. A single candidate is returned immediately
    if len(candidate_physical_expert_ids) == 1:
        return candidate_physical_expert_ids[0]

    # 2. Prefer an expert on the same GPU
    same_gpu = [pid for pid in candidate_physical_expert_ids
                if _compute_gpu_id_of_physical_expert(pid, ...) == moe_ep_rank]
    if len(same_gpu) > 0:
        return same_gpu[0]

    # 3. Otherwise, an expert on the same node
    same_node = [pid for pid in candidate_physical_expert_ids
                 if _compute_node_id_of_physical_expert(pid, ...) == node_rank]
    if len(same_node) > 0:
        return same_node[0]

    return -1  # no local candidate found
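
A self-contained version of the same lookup makes the locality rule easy to try out. This is a sketch under assumptions: the function names are hypothetical, and it assumes physical slots are laid out contiguously, so that a slot's GPU is slot // slots_per_gpu and a GPU's node is gpu // gpus_per_node. The elided arguments in the excerpt above are not reconstructed here.

```python
def gpu_of_slot(slot, slots_per_gpu):
    """GPU index hosting a physical slot (contiguous layout assumed)."""
    return slot // slots_per_gpu


def node_of_slot(slot, slots_per_gpu, gpus_per_node):
    """Node index hosting a physical slot (contiguous layout assumed)."""
    return gpu_of_slot(slot, slots_per_gpu) // gpus_per_node


def find_nearest_expert(candidates, slots_per_gpu, gpus_per_node, my_gpu):
    """Pick a replica on the same GPU, else the same node, else return -1."""
    if len(candidates) == 1:
        return candidates[0]

    same_gpu = [s for s in candidates if gpu_of_slot(s, slots_per_gpu) == my_gpu]
    if same_gpu:
        return same_gpu[0]

    my_node = my_gpu // gpus_per_node
    same_node = [
        s for s in candidates
        if node_of_slot(s, slots_per_gpu, gpus_per_node) == my_node
    ]
    if same_node:
        return same_node[0]

    return -1  # only remote replicas exist; the caller must decide
```

For example, with 2 slots per GPU and 2 GPUs per node, the replicas [2, 7] sit on GPU 1 (node 0) and GPU 3 (node 1), so GPU 0 resolves to slot 2 and GPU 2 resolves to slot 7.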

7. ExpertDistributionRecorder: collecting usage statistics

from abc import ABC

import torch


class ExpertDistributionRecorder(ABC):
    def on_select_experts(self, topk_ids: torch.Tensor):
        pass

    def on_deepep_dispatch_normal(self, local_physical_count_of_layer, ...):
        pass

    def start_record(self):
        ...

    def dump_record(self, output_mode="file"):
        ...

on_select_experts is called on every routing step and accumulates the token count per expert. A circular buffer keeps the statistics for the most recent N windows.
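
The windowed accumulation can be sketched with a bounded deque. This is an illustration, not the recorder's real interface: WindowedExpertCounter and its methods are hypothetical, a window is taken to be one forward pass, and plain Python lists stand in for tensors.

```python
from collections import deque


class WindowedExpertCounter:
    """Keeps per-expert token counts for the most recent `window_size` passes."""

    def __init__(self, num_experts, window_size):
        self._num_experts = num_experts
        # deque(maxlen=...) drops the oldest entry automatically: a circular buffer.
        self._windows = deque(maxlen=window_size)

    def on_forward_pass(self, topk_ids):
        """topk_ids: one list of selected expert IDs per token."""
        counts = [0] * self._num_experts
        for experts_of_token in topk_ids:
            for expert_id in experts_of_token:
                counts[expert_id] += 1
        self._windows.append(counts)

    def logical_count(self):
        """Sum of the counts over all windows still in the buffer."""
        total = [0] * self._num_experts
        for counts in self._windows:
            for expert_id, count in enumerate(counts):
                total[expert_id] += count
        return total
```

Because old windows fall out of the buffer, the counts handed to the rebalancer reflect recent traffic rather than the whole history of the server.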

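Rebalancing example

Logical experts: [0, 1, 2, 3, 4, 5, 6, 7]  (8 experts)
Physical slots:  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]  (10 slots, 2 redundant)

Token distribution: experts 2 and 5 are hot

After rebalancing: physical→logical = [0,1,2,3,4,5,6,7,2,5]
                                                       ↑ ↑ replicas
Result: Expert 2 lives in physical slots 2 and 8
        Expert 5 lives in physical slots 5 and 9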
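
The reverse lookup in this example can be derived mechanically from the physical→logical map. The sketch below uses hypothetical helper names and plain lists for a single layer; padding the rows with -1 and returning a separate valid-count list is an assumption modelled on the field names logical_to_all_physical_map and logical_to_all_physical_map_num_valid.

```python
def invert_physical_to_logical(physical_to_logical, num_logical_experts):
    """For each logical expert, list the physical slots that hold a copy."""
    logical_to_all_physical = [[] for _ in range(num_logical_experts)]
    for slot, expert in enumerate(physical_to_logical):
        logical_to_all_physical[expert].append(slot)
    return logical_to_all_physical


def pad_rows(rows, pad_value=-1):
    """Pad rows to equal length; also return how many entries are valid."""
    width = max(len(row) for row in rows)
    num_valid = [len(row) for row in rows]
    padded = [row + [pad_value] * (width - len(row)) for row in rows]
    return padded, num_valid
```

Applied to [0,1,2,3,4,5,6,7,2,5], expert 2 maps to slots [2, 8] and expert 5 to slots [5, 9], while every other expert has a single slot followed by one padding entry.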
