[llm-compressor] The Oneshot Entrypoint: A Compression Pipeline That Finishes in a Single Call

April 13, 2026 (updated: April 13, 2026)

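Introduction

Most llm-compressor users start compressing with a single line, from llmcompressor import oneshot, without knowing anything about the internals. Pass three things (a HuggingFace model path, a calibration dataset, and a recipe YAML) and a quantized checkpoint is written to disk. Behind this "single call" is a seven-step flow: argument parsing, model loading, calibration dataloader creation, session initialization, pipeline selection, modifier application, and checkpoint saving.

This post dissects the Oneshot class and the oneshot() function, the entrypoint of llm-compressor, at the code level and traces which submodule each step is delegated to. For a bird's-eye view of the whole pipeline, see the project architecture overview.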


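Core Structure and Code Analysis

`oneshot()` Function: A Flat Kwargs Entrypoint

The user-facing entrypoint is the top-level function oneshot() in src/llmcompressor/entrypoints/oneshot.py. It exposes more than 40 parameters as flat keyword arguments, so users can start with a single function call without importing any dataclass. The listing below shows only part of that signature.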

def oneshot(
    # Model arguments
    model: str | PreTrainedModel,                 # HF model path or an already-loaded model instance
    config_name: str | None = None,               # separate config path
    tokenizer: str | PreTrainedTokenizerBase | None = None,  # tokenizer (if different from the model)
    processor: str | ProcessorMixin | None = None,# multimodal processor
    precision: str = "auto",                      # dtype used to load weights (auto/fp16/bf16)
    tie_word_embeddings: bool = True,             # whether to keep lm_head and embed_tokens tied
    save_compressed: bool = True,                 # save in compressed-tensors format
    # Recipe arguments
    recipe: str | list[str] | None = None,        # recipe YAML path or list of instances
    recipe_args: list[str] | None = None,         # overrides for recipe variables (key=val)
    stage: str | None = None,                     # stage to run in a multi-stage recipe
    # Dataset arguments
    dataset: str | Dataset | DataLoader | None = None,   # calibration data
    batch_size: int = 1,                          # calibration batch size
    num_calibration_samples: int = 512,           # number of calibration samples (128 to 512 recommended for GPTQ/AWQ)
    max_seq_length: int = 384,                    # maximum tokenized length
    pipeline: str | None = "independent",         # pipeline type (basic/sequential/data_free/independent)
    sequential_targets: list[str] | None = None,  # layer types to split on in the sequential pipeline
    sequential_offload_device: str = "cpu",       # device for offloading intermediate activations
    quantization_aware_calibration: bool = True,  # use already-quantized weights during forward
    sequential_prefetch: bool = False,            # prefetch the next batch (when GPU memory allows)
    moe_calibrate_all_experts: bool = True,       # feed tokens to every expert in MoE models
    output_dir: str | None = None,                # output directory (nothing is saved if None)
    log_dir: str | None = None,                   # log file directory
    **kwargs,
) -> PreTrainedModel:
    local_args = {
        k: v for k, v in locals().items() if k not in ("local_args", "kwargs")
    }
    one_shot = Oneshot(**local_args, **kwargs)
    one_shot()
    return one_shot.model

The parameters fall into four groups.

Group	Key parameters	Purpose
Model	`model`, `precision`, `tie_word_embeddings`, `save_compressed`	Controls model loading and the save format
Recipe	`recipe`, `recipe_args`, `stage`	Declares which Modifiers to apply and how
Dataset	`dataset`, `num_calibration_samples`, `max_seq_length`, `pipeline`, `sequential_targets`	Calibration data and pipeline selection
Misc	`output_dir`, `log_dir`	Output path and log output

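The function body is remarkably simple. It gathers every argument into one dictionary via locals() and hands it straight to the Oneshot class. Thanks to this thin-wrapper pattern, the internal dataclass structure can be refactored freely without changing the user-facing API.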
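
The locals()-based delegation described above can be tried in isolation. The following is a minimal sketch using only the standard library; ToyOneshot and toy_oneshot are hypothetical stand-ins written for this illustration, not part of llm-compressor.

```python
class ToyOneshot:
    """Hypothetical stand-in for the real class: stores every keyword it receives."""

    def __init__(self, **kwargs):
        self.settings = kwargs
        self.model = kwargs["model"]
        self.ran = False

    def __call__(self):
        # A real implementation would calibrate and save here.
        self.ran = True


def toy_oneshot(model, recipe=None, output_dir=None, **kwargs):
    # Snapshot the named parameters. "kwargs" is filtered out so that it can
    # be unpacked separately instead of being passed along as a nested dict.
    local_args = {
        k: v for k, v in locals().items() if k not in ("local_args", "kwargs")
    }
    runner = ToyOneshot(**local_args, **kwargs)
    runner()
    return runner


runner = toy_oneshot("my-model", recipe="recipe.yaml", batch_size=4)
print(sorted(runner.settings))
# ['batch_size', 'model', 'output_dir', 'recipe']
```

Named parameters and extra keywords end up in the same flat dictionary, so the wrapper never needs to know which internal structure eventually consumes each key.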

`Oneshot` Class: A Three-Phase Lifecycle

The actual logic is split between Oneshot.__init__ and Oneshot.__call__. As the class docstring states, the lifecycle has three phases: Preprocessing → Oneshot Calibration → Postprocessing.

class Oneshot:
    def __init__(
        self,
        log_dir: str | None = None,  # directory for log files
        **kwargs,                    # every argument of oneshot() arrives here
    ):
        # 1) Suppress tokenizer parallelism warnings (avoids the FastTokenizer / datasets.map conflict)
        if TOKENIZERS_PARALLELISM_ENV not in os.environ:
            os.environ[TOKENIZERS_PARALLELISM_ENV] = "false"

        # 2) Attach a file logger (optional)
        log_file = os.environ.get("LLM_COMPRESSOR_LOG_FILE", "").strip()
        if log_file:
            logger.add(str(Path(log_file).expanduser()), level="DEBUG")
        elif log_dir:
            date_str = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
            logger.add(f"{log_dir}/oneshot_{date_str}.log", level="DEBUG")

        # 3) Parse the flat kwargs into three Arguments dataclasses
        model_args, dataset_args, recipe_args, output_dir = parse_args(**kwargs)

        self.model_args = model_args
        self.dataset_args = dataset_args
        self.recipe_args = recipe_args
        self.output_dir = output_dir

        # 4) Initialize the model and processor (weight loading, lm_head untie patch, etc.)
        pre_process(model_args, dataset_args, output_dir)

        self.model = self.model_args.model
        self.processor = self.model_args.processor
        self.recipe = self.recipe_args.recipe

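__init__ has many side effects. Environment variable manipulation, logger attachment, and even model weight loading all happen in the constructor. This is because llm-compressor has to support both "running as a script" and "being called as a Python library". From a CLI script's point of view, it is convenient when two statements, inst = Oneshot(**kwargs) followed by inst(), take care of the entire setup.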
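
The first two constructor steps are easy to reason about when pulled out into pure functions. The sketch below is an illustration, not library code: resolve_log_path and default_tokenizer_parallelism are hypothetical helpers, and it assumes the TOKENIZERS_PARALLELISM_ENV constant resolves to the HuggingFace variable name TOKENIZERS_PARALLELISM.

```python
import os
from datetime import datetime
from pathlib import Path


def resolve_log_path(log_dir=None, env=None, now=None):
    """Return the log file path, or None when file logging is disabled."""
    env = os.environ if env is None else env
    now = now or datetime.now()
    log_file = env.get("LLM_COMPRESSOR_LOG_FILE", "").strip()
    if log_file:
        # An explicit file from the environment wins over log_dir.
        return Path(log_file).expanduser()
    if log_dir:
        stamp = now.strftime("%Y-%m-%d_%H-%M-%S")
        return Path(log_dir) / f"oneshot_{stamp}.log"
    return None


def default_tokenizer_parallelism(env):
    # setdefault only writes when the user has not already chosen a value.
    env.setdefault("TOKENIZERS_PARALLELISM", "false")
    return env["TOKENIZERS_PARALLELISM"]


fixed = datetime(2026, 4, 13, 9, 30, 0)
from_dir = resolve_log_path("logs", env={}, now=fixed)
from_env = resolve_log_path("logs", env={"LLM_COMPRESSOR_LOG_FILE": " run.log "})
print(from_dir.name, from_env.name)
# oneshot_2026-04-13_09-30-00.log run.log
```

Passing env and now as parameters keeps the helpers testable without touching the real process environment, which the constructor shown above modifies directly.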

parse_args(), defined in src/llmcompressor/args/utils.py, is a dispatcher that splits the flat kwargs into three dataclasses (ModelArguments, DatasetArguments, RecipeArguments) plus output_dir. The actual model loading is done by pre_process(), which internally calls the transformers.AutoModelForCausalLM.from_pretrained family of APIs.
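
One way to picture such a dispatcher is to route each key to the dataclass that declares a field with the same name. This is a sketch under that assumption only; the Toy* dataclasses and toy_parse_args are hypothetical, carry just a few fields, and do not reproduce how the library actually parses its arguments.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ToyModelArgs:
    model: str
    precision: str = "auto"


@dataclass
class ToyDatasetArgs:
    dataset: Optional[str] = None
    num_calibration_samples: int = 512


@dataclass
class ToyRecipeArgs:
    recipe: Optional[str] = None
    stage: Optional[str] = None


def toy_parse_args(**kwargs):
    """Split flat kwargs into three dataclasses plus output_dir."""
    output_dir = kwargs.pop("output_dir", None)
    parsed = []
    for cls in (ToyModelArgs, ToyDatasetArgs, ToyRecipeArgs):
        names = {f.name for f in fields(cls)}
        # Move every key this dataclass declares out of the flat dict.
        parsed.append(cls(**{k: kwargs.pop(k) for k in list(kwargs) if k in names}))
    if kwargs:
        raise TypeError(f"unexpected arguments: {sorted(kwargs)}")
    return (*parsed, output_dir)


model_args, dataset_args, recipe_args, output_dir = toy_parse_args(
    model="my-model",
    recipe="recipe.yaml",
    num_calibration_samples=128,
    output_dir="out",
)
print(model_args.precision, dataset_args.num_calibration_samples, recipe_args.recipe)
# auto 128 recipe.yaml
```

Rejecting leftover keys is a design choice made for this sketch; it surfaces typos in argument names early instead of silently ignoring them.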

`__call__`: Calibration and Recipe Application

Because the constructor has already taken care of the side effects, the call itself is very thin.

def __call__(self):
    # 1) Build the calibration dataloader: c4/wikitext/custom etc., chosen from DatasetArguments
    calibration_dataloader = get_calibration_dataloader(
        self.dataset_args, self.processor
    )

    # 2) Apply every Modifier in the recipe to the model
    self.apply_recipe_modifiers(
        calibration_dataloader=calibration_dataloader,
        recipe_stage=self.recipe_args.stage,  # stage to run in a multi-stage recipe
    )

    # 3) Save the result: serialized in compressed-tensors format depending on save_compressed
    post_process(
        model_args=self.model_args,
        recipe_args=self.recipe_args,
        output_dir=self.output_dir,
    )

The boundaries between the three steps are clear. get_calibration_dataloader delegates to the dataset utilities under src/llmcompressor/datasets/, and post_process wraps the HuggingFace save_pretrained call. The compression itself happens in apply_recipe_modifiers.
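
To make the roles of num_calibration_samples, max_seq_length, and batch_size concrete, here is a toy batching function over already-tokenized samples. It is a simplified sketch, not the library's loader: toy_calibration_batches is a hypothetical name, and a real loader would also tokenize and collate the data.

```python
def toy_calibration_batches(samples, num_calibration_samples=512,
                            max_seq_length=384, batch_size=1):
    """Yield lists of truncated token-id sequences."""
    # Keep only the first N samples, then cap each one at max_seq_length.
    selected = samples[:num_calibration_samples]
    truncated = [ids[:max_seq_length] for ids in selected]
    for start in range(0, len(truncated), batch_size):
        yield truncated[start:start + batch_size]


samples = [list(range(n)) for n in (3, 10, 5, 8, 2)]
batches = list(
    toy_calibration_batches(
        samples, num_calibration_samples=4, max_seq_length=6, batch_size=2
    )
)
print([[len(ids) for ids in batch] for batch in batches])
# [[3, 6], [5, 6]]
```

The fifth sample is dropped by the sample limit, and the two long sequences are cut to six tokens before batching.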

`apply_recipe_modifiers`: Where the Session Meets the Pipeline

def apply_recipe_modifiers(
    self,
    calibration_dataloader: DataLoader | None,  # calibration batch iterator (None for data_free)
    recipe_stage: str | None = None,            # name of the recipe stage to run
):
    session = active_session()   # global CompressionSession singleton
    session.reset()              # clear leftover Lifecycle state from a previous run

    with norm_calibration_context(self.model), moe_calibration_context(
        self.model,
        calibrate_all_experts=self.dataset_args.moe_calibrate_all_experts,
    ):
        # 1) Inject model, recipe, and data into the session and run on_initialize on the Modifiers
        session.initialize(
            model=self.model,
            start=-1,
            recipe=self.recipe,
            recipe_stage=recipe_stage,
            recipe_args=self.recipe_args.recipe_args,
            calib_data=calibration_dataloader,
            sequential_targets=self.dataset_args.sequential_targets,
        )

        # 2) Select a CalibrationPipeline based on the Modifiers in the recipe
        user_pipeline = self.dataset_args.pipeline
        pipeline = CalibrationPipeline.from_modifiers(
            session.lifecycle.recipe.modifiers, user=user_pipeline
        )

        # 3) Run the pipeline: basic/sequential/data_free/independent
        pipeline(
            self.model,
            calibration_dataloader,
            self.dataset_args,
        )

    session.finalize()  # run every Modifier's on_finalize hook → finalize the weights

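This method is llm-compressor's "ignition key". It is wrapped in two context managers (norm_calibration_context and moe_calibration_context) so that the behavior of RMSNorm and the MoE router is patched only while calibration is running. In MoE models, default routing alone can leave some experts without any tokens, which makes their quantization scales inaccurate, so moe_calibrate_all_experts=True forces tokens through every expert.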

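Three calls form the core.

session.initialize(...): prepares the CompressionSession. Inside this call the recipe is parsed and each Modifier's on_initialize hook is invoked.
CalibrationPipeline.from_modifiers(...): the pipeline registry inspects the list of Modifiers and picks a suitable pipeline. If GPTQ is included, sequential is chosen automatically; if there is only simple PTQ, basic is chosen.
pipeline(...): the actual forward loop. The Modifiers' on_event hooks are called for each batch, accumulating statistics or modifying weights.

Finally, session.finalize() calls each Modifier's on_finalize hook to finalize the weights (for example, GPTQ's Hessian H update followed by the weight W update).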
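
The order in which these hooks fire, and the idea of inferring a pipeline from the modifier list, can be sketched with a toy model. Everything here is hypothetical: ToyModifier, its requires_sequential flag, choose_pipeline, and run_toy_session are invented for illustration, and the sketch does not model how an explicit user-supplied pipeline value is reconciled with the inferred one.

```python
class ToyModifier:
    """Records which lifecycle hooks fired; requires_sequential is a made-up flag."""

    def __init__(self, name, requires_sequential=False):
        self.name = name
        self.requires_sequential = requires_sequential
        self.calls = []

    def on_initialize(self):
        self.calls.append("initialize")

    def on_event(self, batch):
        self.calls.append(f"event:{batch}")

    def on_finalize(self):
        self.calls.append("finalize")


def choose_pipeline(modifiers):
    # Assumed rule: any modifier that needs layer-by-layer passes selects
    # "sequential"; otherwise fall back to "basic".
    if any(m.requires_sequential for m in modifiers):
        return "sequential"
    return "basic"


def run_toy_session(modifiers, batches):
    for m in modifiers:        # corresponds to session.initialize(...)
        m.on_initialize()
    pipeline = choose_pipeline(modifiers)
    for batch in batches:      # corresponds to pipeline(...), the forward loop
        for m in modifiers:
            m.on_event(batch)
    for m in modifiers:        # corresponds to session.finalize()
        m.on_finalize()
    return pipeline


gptq_like = ToyModifier("gptq-like", requires_sequential=True)
print(run_toy_session([gptq_like], batches=[0, 1]))
# sequential
print(gptq_like.calls)
# ['initialize', 'event:0', 'event:1', 'finalize']
```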

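Why This Design

1. A dual API: flat function vs. class. Most users want to start with one line, oneshot(model=..., recipe=...), while advanced users want to access attributes on an Oneshot(...) instance (oneshot.model, oneshot.recipe). The dual API satisfies both. The function is only a thin wrapper around Oneshot, so the maintenance burden is minimal.

2. Side effects are allowed in __init__. Putting heavy IO in a constructor is generally an anti-pattern, but llm-compressor is used in "run once and discard" scripts, so there is little motivation for lazy initialization. In return, a single Oneshot(**kwargs) line leaves the model already loaded on the GPU, which makes debugging easy.

3. parse_args splits the flat kwargs. Internally they must be divided into ModelArguments, DatasetArguments, and RecipeArguments so that each component can use only the settings it needs. The two requirements, a flat API and structured internal state, are bridged by a single dispatcher, parse_args.

4. Automatic pipeline selection. Even when pipeline="independent" is specified, CalibrationPipeline.from_modifiers inspects the Modifier types and promotes the run to sequential when needed. This guards against misconfiguration in which an algorithm such as GPTQ, which needs layer-by-layer passes over the data, ends up running on the basic pipeline.

5. Two calibration contexts. norm_calibration_context temporarily adjusts the RMSNorm scale, and moe_calibration_context replaces the MoE router with route-to-all-experts routing. Because both are context managers rather than global state, the original behavior is restored automatically when calibration ends, and the caller does not have to worry about cleanup code.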
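
The patch-then-restore behavior in the last point is what a context manager with a try/finally block provides. Below is a minimal sketch with a toy top-1 router; ToyRouter and toy_moe_calibration_context are hypothetical and only mimic the idea, not the library's actual patching mechanism.

```python
from contextlib import contextmanager


class ToyRouter:
    """Routes each token to a single expert (top-1), chosen by token id."""

    def __init__(self, num_experts):
        self.num_experts = num_experts

    def route(self, token):
        return [token % self.num_experts]


@contextmanager
def toy_moe_calibration_context(router, calibrate_all_experts=True):
    if not calibrate_all_experts:
        yield router
        return
    # Instance-level override: every token is sent to every expert.
    router.route = lambda token: list(range(router.num_experts))
    try:
        yield router
    finally:
        # Drop the override even if calibration raised, so the class-level
        # route method becomes visible again.
        del router.route


router = ToyRouter(num_experts=4)
with toy_moe_calibration_context(router):
    during = router.route(5)
after = router.route(5)
print(during, after)
# [0, 1, 2, 3] [1]
```

Because the restore step sits in finally, an exception inside the with block still leaves the router in its original state.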

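Wrapping Up

oneshot() is a thin but important "facade" for llm-compressor. This file contains not a single line of actual compression logic, yet it wires every external component together in the right order, from model loading to saving the result. The next post looks at the CompressionSession that this function initializes and the Lifecycle state machine inside it.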

References

Official documentation: llm-compressor README
Example: examples/quantization_w4a16/llama3_example.py
