[SGLang] Model Configuration System: Managing Model Settings

April 14, 2026 (updated: April 14, 2026)

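Introduction

SGLang supports hundreds of models, and they differ in architecture, attention mechanism, context length, quantization, and other properties. The ModelConfig class reads the HuggingFace config and exposes every attribute the SGLang runtime needs through one consistent interface.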

Structure diagram

ServerArgs (server arguments)
    │
    ▼
ModelConfig.from_server_args()
    │
    ├── get_config()          ── load the HuggingFace PretrainedConfig
    ├── get_hf_text_config()  ── extract the text model config
    ├── get_context_length()  ── determine the context length
    │
    ▼
ModelConfig
    ├── hf_config             ── original HuggingFace config
    ├── hf_text_config        ── text model config
    ├── context_len           ── final context length
    ├── num_hidden_layers     ── number of layers
    ├── head_dim              ── attention head dimension
    ├── num_attention_heads   ── number of attention heads
    ├── num_key_value_heads   ── number of KV heads
    ├── dtype                 ── data type
    ├── is_generation         ── whether the model is generative
    ├── is_multimodal         ── whether the model is multimodal
    └── attention_arch        ── MLA / MHA

Core code analysis

ModelConfig initialization

The ModelConfig constructor loads the HuggingFace config from the model path and derives a range of attributes from it.

class ModelConfig:
    def __init__(self, model_path, trust_remote_code=True, revision=None,
                 context_length=None, dtype="auto", quantization=None, ...):
        # Load the raw HuggingFace config, applying any model override arguments.
        self.hf_config = get_config(self.model_path,
            trust_remote_code=trust_remote_code, revision=revision,
            model_override_args=self.model_override_args)
        # Extract the text (language model) config from the full config.
        self.hf_text_config = get_hf_text_config(self.hf_config)

        # Classify the model: generative or not, multimodal or not.
        self.is_generation = is_generation_model(self.hf_config.architectures, ...)
        self.is_multimodal = enable_multimodal and (
            is_multimodal_model(self.hf_config.architectures)
            or has_multimodal_subconfig)
        # Resolve the requested dtype against what the text config supports.
        self.dtype = _get_and_verify_dtype(self.hf_text_config, dtype)

        # Derive the remaining attributes, then validate quantization settings.
        self._derive_context_length(context_length)
        self._derive_model_shapes()
        self._derive_hybrid_model()
        self._verify_quantization()
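
To make the idea of "deriving attributes" concrete, here is a minimal sketch of how shape attributes such as head_dim and num_key_value_heads could be computed from a HuggingFace-style text config. The helper name derive_model_shapes and the fallback rules are assumptions for illustration; this is not the body of SGLang's _derive_model_shapes().

```python
from types import SimpleNamespace


def derive_model_shapes(text_config):
    """Derive attention shapes from a HuggingFace-style text config (illustrative only)."""
    num_attention_heads = text_config.num_attention_heads

    # Some configs state head_dim explicitly; otherwise fall back to
    # hidden_size divided by the number of attention heads (assumed rule).
    head_dim = getattr(text_config, "head_dim", None)
    if head_dim is None:
        head_dim = text_config.hidden_size // num_attention_heads

    # Without an explicit KV head count, assume plain multi-head attention,
    # where every query head has its own KV head.
    num_key_value_heads = getattr(text_config, "num_key_value_heads", None)
    if num_key_value_heads is None:
        num_key_value_heads = num_attention_heads

    return {
        "head_dim": head_dim,
        "num_attention_heads": num_attention_heads,
        "num_key_value_heads": num_key_value_heads,
        "num_hidden_layers": text_config.num_hidden_layers,
    }


# A stand-in config with made-up values, used in place of a real PretrainedConfig.
cfg = SimpleNamespace(hidden_size=4096, num_attention_heads=32,
                      num_key_value_heads=8, num_hidden_layers=32)
shapes = derive_model_shapes(cfg)
print(shapes)
```

The point of the sketch is that the runtime reads one normalized set of names, regardless of which optional fields a given model's config.json happens to include.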

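from_server_args: the factory method

The factory method that builds a ModelConfig from ServerArgs forwards the quantization and override settings. For a draft model it passes the draft-specific quantization setting instead of the main one.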

@staticmethod
def from_server_args(server_args, model_path=None, is_draft_model=False, **kwargs):
    quantization = (
        server_args.speculative_draft_model_quantization
        if is_draft_model else server_args.quantization)
    return ModelConfig(
        model_path=model_path or server_args.model_path,
        trust_remote_code=server_args.trust_remote_code,
        context_length=server_args.context_length,
        model_impl=server_args.model_impl,
        quantization=quantization, ...)
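
The selection logic can be tried in isolation with toy stand-ins. In the sketch below, ToyServerArgs and ToyModelConfig are hypothetical classes that carry only the fields needed to show the pattern; they are not the real SGLang classes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyServerArgs:
    """Hypothetical stand-in holding only the fields this example needs."""
    model_path: str
    quantization: Optional[str] = None
    speculative_draft_model_quantization: Optional[str] = None


@dataclass
class ToyModelConfig:
    """Hypothetical stand-in for ModelConfig."""
    model_path: str
    quantization: Optional[str] = None
    is_draft_model: bool = False

    @staticmethod
    def from_server_args(server_args, model_path=None, is_draft_model=False):
        # A draft model uses its own quantization setting, not the target's.
        quantization = (
            server_args.speculative_draft_model_quantization
            if is_draft_model else server_args.quantization)
        return ToyModelConfig(
            # An explicit model_path wins over the one in the server arguments.
            model_path=model_path or server_args.model_path,
            quantization=quantization,
            is_draft_model=is_draft_model)


args = ToyServerArgs("target-model", quantization="fp8")
target = ToyModelConfig.from_server_args(args)
draft = ToyModelConfig.from_server_args(
    args, model_path="draft-model", is_draft_model=True)
print(target)
print(draft)
```

Here the target config inherits fp8 from the server arguments, while the draft config gets no quantization because the draft-specific field was left unset.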

Attention architecture classification

SGLang distinguishes between MLA (Multi-head Latent Attention) and MHA (Multi-Head Attention). Models such as DeepSeek V2/V3 use the MLA architecture.

from enum import IntEnum, auto

class AttentionArch(IntEnum):
    MLA = auto()
    MHA = auto()
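
As a rough illustration of how such an enum might be assigned, the sketch below classifies a model by its first architecture name. The MLA_ARCHITECTURES set and the classify_attention_arch helper are assumptions made for this example; the actual check in SGLang covers more architectures and is not reproduced here.

```python
from enum import IntEnum, auto


class AttentionArch(IntEnum):
    MLA = auto()
    MHA = auto()


# Hypothetical, deliberately short list: only the DeepSeek V2/V3 names.
MLA_ARCHITECTURES = {"DeepseekV2ForCausalLM", "DeepseekV3ForCausalLM"}


def classify_attention_arch(architectures):
    """Return MLA for known MLA architectures, MHA for everything else."""
    if architectures and architectures[0] in MLA_ARCHITECTURES:
        return AttentionArch.MLA
    return AttentionArch.MHA


print(classify_attention_arch(["DeepseekV3ForCausalLM"]).name)
print(classify_attention_arch(["LlamaForCausalLM"]).name)
```

Treating MHA as the default keeps the classifier safe for unknown architectures: only models explicitly listed take the MLA code path.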

Draft model configuration

In speculative decoding, the draft model's architecture name is rewritten so that a lightweight variant of the model is loaded.

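def _config_draft_model(self):
    # Read the draft flag from the instance (assumed to be stored as self.is_draft_model).
    is_draft_model = self.is_draft_model
    # DeepSeek V3 style targets are mapped to the NextN draft architecture.
    if is_draft_model and self.hf_config.architectures[0] in [
        "DeepseekV3ForCausalLM", "GlmMoeDsaForCausalLM"]:
        self.hf_config.architectures[0] = "DeepseekV3ForCausalLMNextN"
    # MiMo targets are mapped to the MTP draft architecture.
    if is_draft_model and self.hf_config.architectures[0] == "MiMoForCausalLM":
        self.hf_config.architectures[0] = "MiMoMTP"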
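
The same renaming can be expressed as a lookup table, which makes the mapping easy to inspect. This is an alternative sketch, not SGLang's code: DRAFT_ARCH_REMAP and remap_draft_architecture are hypothetical names, and the table contains only the three pairs shown in the excerpt above.

```python
# Target architecture -> draft architecture, limited to the pairs in the excerpt.
DRAFT_ARCH_REMAP = {
    "DeepseekV3ForCausalLM": "DeepseekV3ForCausalLMNextN",
    "GlmMoeDsaForCausalLM": "DeepseekV3ForCausalLMNextN",
    "MiMoForCausalLM": "MiMoMTP",
}


def remap_draft_architecture(architectures, is_draft_model):
    """Return a copy of the architecture list with the first entry remapped for drafts."""
    remapped = list(architectures or [])
    if is_draft_model and remapped:
        # Unknown architectures fall through unchanged.
        remapped[0] = DRAFT_ARCH_REMAP.get(remapped[0], remapped[0])
    return remapped


print(remap_draft_architecture(["DeepseekV3ForCausalLM"], is_draft_model=True))
print(remap_draft_architecture(["DeepseekV3ForCausalLM"], is_draft_model=False))
print(remap_draft_architecture(["MiMoForCausalLM"], is_draft_model=True))
```

Unlike the in-place version, this helper returns a new list, so the target model's config is left untouched when the same architecture list is shared.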

Hybrid SWA model detection

Hybrid models that use Sliding Window Attention (Gemma4, MiMo, and others) distinguish SWA layers from full-attention layers on a per-layer basis.

def _derive_hybrid_model(self):
    self.is_hybrid_swa = (
        is_hybrid_swa_model(self.hf_config.architectures)
        and not self.disable_hybrid_swa_memory)
    if self.is_hybrid_swa:
        self.swa_attention_layer_ids, self.full_attention_layer_ids = (
            get_hybrid_layer_ids(self.hf_config.architectures, self.hf_text_config))
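
To show what the two ID lists look like, the sketch below splits layers by a per-layer type label. The input format (a list of "sliding_attention" / "full_attention" strings) and the helper split_hybrid_layer_ids are assumptions for illustration; get_hybrid_layer_ids in SGLang may derive the split differently for each architecture.

```python
def split_hybrid_layer_ids(layer_types):
    """Split layer indices into (SWA layers, full-attention layers)."""
    swa_ids = [i for i, kind in enumerate(layer_types)
               if kind == "sliding_attention"]
    full_ids = [i for i, kind in enumerate(layer_types)
                if kind == "full_attention"]
    return swa_ids, full_ids


# Made-up 12-layer pattern: five sliding-window layers, then one full layer, twice.
layer_types = (["sliding_attention"] * 5 + ["full_attention"]) * 2
swa_attention_layer_ids, full_attention_layer_ids = split_hybrid_layer_ids(layer_types)
print(swa_attention_layer_ids)
print(full_attention_layer_ids)
```

With the two lists in hand, a runtime can size the KV cache separately for each group, since sliding-window layers only need to keep a bounded window of tokens.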

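NSA detection

A DeepSeek NSA (Native Sparse Attention) model is detected by checking that its architecture is one of the listed DeepSeek names and that the config carries an index_topk attribute.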

def is_deepseek_nsa(config) -> bool:
    architectures = getattr(config, "architectures", None)
    index_topk = getattr(config, "index_topk", None)
    return (architectures is not None
            and architectures[0] in ["DeepseekV3ForCausalLM", ...]
            and index_topk is not None)
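
The check can be exercised with stand-in config objects. In this sketch the architecture list is restricted to the single name visible in the excerpt (the original list is longer and elided), and the index_topk value is made up.

```python
from types import SimpleNamespace

# Only the name visible in the excerpt; the real list contains more entries.
NSA_ARCHITECTURES = ["DeepseekV3ForCausalLM"]


def is_deepseek_nsa(config) -> bool:
    architectures = getattr(config, "architectures", None)
    index_topk = getattr(config, "index_topk", None)
    return (architectures is not None
            and architectures[0] in NSA_ARCHITECTURES
            and index_topk is not None)


# Same architecture, without and with the index_topk attribute.
dense = SimpleNamespace(architectures=["DeepseekV3ForCausalLM"])
sparse = SimpleNamespace(architectures=["DeepseekV3ForCausalLM"], index_topk=2048)
# A different architecture is rejected even if it carries index_topk.
other = SimpleNamespace(architectures=["LlamaForCausalLM"], index_topk=2048)

print(is_deepseek_nsa(dense), is_deepseek_nsa(sparse), is_deepseek_nsa(other))
```

Because both conditions must hold, the same architecture name can cover a dense and a sparse-attention variant, with the config attribute deciding which one is loaded.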

Configuration override hierarchy

Priority	Source	Example
1 (highest)	ServerArgs	`--context-length 32768`
2	model_override_args	`--json-model-override-args '{}'`
3	override_config_file	`--override-config-file config.json`
4 (lowest)	HuggingFace config	`config.json` defaults
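
The priority order in the table amounts to "first source that has a value wins". The sketch below models each source as a plain dict and walks them from highest to lowest priority. The resolve_setting helper and the dict representation are assumptions for illustration; SGLang applies these overrides at different stages of loading rather than through a single function like this.

```python
def resolve_setting(key, server_args, model_override_args,
                    override_config_file, hf_config):
    """Return the value from the highest-priority source that defines `key`."""
    sources = (server_args, model_override_args, override_config_file, hf_config)
    for source in sources:
        value = source.get(key)
        if value is not None:
            return value
    return None


# Made-up values for each level of the table.
hf_config = {"context_length": 8192, "rope_theta": 10000.0}
override_config_file = {"rope_theta": 500000.0}
model_override_args = {}
server_args = {"context_length": 32768}

context_length = resolve_setting("context_length", server_args,
                                 model_override_args, override_config_file, hf_config)
rope_theta = resolve_setting("rope_theta", server_args,
                             model_override_args, override_config_file, hf_config)
print(context_length, rope_theta)
```

In this example the context length comes from the server arguments, while rope_theta falls through to the override config file because neither higher-priority source sets it.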

References

Source code: python/sglang/srt/configs/model_config.py
HuggingFace Transformers PretrainedConfig documentation
