[llm-compressor] Pruning Overview: Structure of the OBCQ-Family Modifiers

April 13, 2026 (updated: April 13, 2026)

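Introduction

Pruning, alongside quantization, is one of the two main techniques for compressing LLMs. It sets a subset of the weights to zero to produce a sparse matrix, and the inference engine exploits that sparsity to improve speed and memory use. llm-compressor provides three pruning algorithms.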

SparseGPT: Hessian-based optimal pruning (paper)
Wanda: weight magnitude × activation L2 norm (paper)
Magnitude Pruning: simple magnitude-based pruning

This post dissects the shared base class behind these algorithms and the directory structure of src/llmcompressor/modifiers/pruning/ and src/llmcompressor/modifiers/obcq/.

Core Structure and Code Analysis

Directory Layout

modifiers/
├── obcq/
│   ├── __init__.py
│   └── sgpt_base.py              # shared SparseGPT/Wanda base
│
└── pruning/
    ├── helpers.py
    ├── constant/                  # constant-mask Modifier
    │   └── base.py
    ├── magnitude/                 # magnitude-based
    │   └── base.py
    ├── sparsegpt/                 # SparseGPT implementation
    │   ├── base.py
    │   ├── sgpt_base.py
    │   └── sgpt_sparsify.py
    ├── wanda/                     # Wanda implementation
    │   ├── base.py
    │   └── wanda_sparsify.py
    └── utils/
        └── pytorch/
            ├── layer_mask.py
            └── mask_factory.py

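obcq (OBC Quantization, where OBC stands for Optimal Brain Compression) and pruning are separate directories for historical reasons. The sparsification logic originally started out under the OBCQ name, but the implementations now live mostly under pruning/. obcq/sgpt_base.py and pruning/sparsegpt/sgpt_base.py are two paths to the same base class, and current code uses the latter.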

Common Base: `SGPTBaseModifier`

from abc import abstractmethod
from typing import Dict

import torch
from pydantic import Field, PrivateAttr

from llmcompressor.modifiers import Modifier


class SGPTBaseModifier(Modifier):
    """
    Base class for SparseGPT/Wanda style pruning modifiers.
    Shares hessian/activation accumulation and layer-by-layer pruning lifecycle.
    """
    sparsity: float | None = None                # target sparsity (0.0 to 1.0)
    sparsity_profile: str | None = None           # N:M profile such as "2:4"
    mask_structure: str = "unstructured"          # unstructured / 2:4 / 4:8 / block
    targets: list[str] | None = None              # target layers
    ignore: list[str] = Field(default_factory=list)  # pydantic Field, not dataclasses.field

    # Private per-module calibration state, excluded from the modifier's schema
    _hessians: Dict[torch.nn.Module, torch.Tensor] = PrivateAttr(default_factory=dict)
    _num_samples: Dict[torch.nn.Module, torch.Tensor] = PrivateAttr(default_factory=dict)

    @abstractmethod
    def _calibrate_and_prune_module(self, module):
        """Subclass implements the actual pruning algorithm"""
        raise NotImplementedError()

SparseGPT and Wanda both inherit from this base and differ only in how they implement _calibrate_and_prune_module. What they share:

Hessian/activation accumulation buffers (_hessians)
Forward pre-hook registration
Handling of the SEQUENTIAL_EPOCH_END event
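
The shared lifecycle listed above can be sketched in plain PyTorch. This is a minimal sketch, not the llm-compressor implementation: `HessianPruner`, `DiagonalPruner`, `attach`, and `finish` are hypothetical names, the Hessian is assumed to be the running sum of X^T X over calibration tokens, and `finish` merely stands in for whatever the modifier does when the end-of-epoch event arrives.

```python
from abc import ABC, abstractmethod
from typing import Dict, List

import torch


class HessianPruner(ABC):
    """Hypothetical stand-in for the shared base; not the library class."""

    def __init__(self) -> None:
        self.hessians: Dict[torch.nn.Module, torch.Tensor] = {}
        self.num_samples: Dict[torch.nn.Module, int] = {}
        self._handles: list = []

    def attach(self, modules: List[torch.nn.Linear]) -> None:
        # Register a forward pre-hook on every target layer.
        for module in modules:
            self._handles.append(module.register_forward_pre_hook(self._accumulate))

    def _accumulate(self, module: torch.nn.Module, args: tuple) -> None:
        # Flatten (batch, seq, features) inputs to (tokens, features).
        x = args[0].detach().reshape(-1, args[0].shape[-1]).float()
        h = self.hessians.get(module)
        if h is None:
            h = torch.zeros(x.shape[1], x.shape[1])
        self.hessians[module] = h + x.T @ x
        self.num_samples[module] = self.num_samples.get(module, 0) + x.shape[0]

    def finish(self) -> None:
        # Analogue of the end-of-epoch step: prune, then release hooks and buffers.
        for handle in self._handles:
            handle.remove()
        for module in list(self.hessians):
            self.prune_module(module)
        self.hessians.clear()
        self.num_samples.clear()
        self._handles.clear()

    @abstractmethod
    def prune_module(self, module: torch.nn.Module) -> None:
        """Subclasses supply the algorithm-specific step."""


class DiagonalPruner(HessianPruner):
    """Toy subclass: scores each weight by w^2 * H_jj and zeroes the lower half."""

    def prune_module(self, module: torch.nn.Module) -> None:
        diag = torch.diag(self.hessians[module])
        score = module.weight.detach() ** 2 * diag
        k = score.numel() // 2
        threshold = score.flatten().kthvalue(k).values
        module.weight.data[score <= threshold] = 0.0


torch.manual_seed(0)
layer = torch.nn.Linear(16, 8)
pruner = DiagonalPruner()
pruner.attach([layer])
for _ in range(4):
    layer(torch.randn(2, 5, 16))          # calibration forward passes
tokens_seen = pruner.num_samples[layer]   # 4 batches * 2 * 5 tokens
pruner.finish()
sparsity = (layer.weight == 0).float().mean().item()
```

The point of the split is that only `prune_module` changes between algorithms, while hook registration, accumulation, and cleanup live in one place.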

Mask Structure: Unstructured vs N:M

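Structure	Meaning	Inference speedup
`"unstructured"`	Zeros at arbitrary positions	Helps compressed storage only; negligible GPU compute speedup
`"2:4"`	At least 2 of every 4 weights are zero	Up to 2x (theoretical) on NVIDIA Ampere and later
`"4:8"`	At least 4 of every 8 weights are zero	Supported on some hardware
`"block"`	Zeros in whole blocks	Structured; hardware dependent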
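2:4 sparsity has hardware support on NVIDIA Ampere (A100) and later GPUs. When at least 2 of every 4 consecutive weights are zero, the matrix can be stored in a compressed form with index metadata, and the sparse Tensor Cores skip the computation for the zeroed positions. In theory this gives up to a 2x speedup. llm-compressor produces this pattern when mask_structure="2:4" is specified.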
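
To make the N:M pattern concrete, here is a minimal sketch of building and checking a 2:4 mask in plain PyTorch. `n_m_mask` and `satisfies_n_m` are hypothetical helpers written for this illustration, not functions from llm-compressor, and the sketch assumes groups run along the last (input) dimension of a 2-D weight.

```python
import torch


def n_m_mask(importance: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Return a 0/1 mask that zeroes the n least important entries in each group of m."""
    rows, cols = importance.shape
    if cols % m != 0:
        raise ValueError(f"columns ({cols}) must be divisible by m ({m})")
    groups = importance.reshape(rows, cols // m, m)
    # Indices of the n smallest scores inside every group.
    drop = torch.topk(groups, n, dim=-1, largest=False).indices
    mask = torch.ones_like(groups)
    mask.scatter_(-1, drop, 0.0)
    return mask.reshape(rows, cols)


def satisfies_n_m(weight: torch.Tensor, n: int = 2, m: int = 4) -> bool:
    """Check that every group of m consecutive weights has at least n zeros."""
    zeros = (weight.reshape(weight.shape[0], -1, m) == 0).sum(dim=-1)
    return bool((zeros >= n).all())


weight = torch.tensor([[0.9, -0.1, 0.4, -0.7, 0.2, 0.3, -0.8, 0.05]])
mask = n_m_mask(weight.abs())   # magnitude used as the importance score
pruned = weight * mask
```

In the first group of four, 0.9 and -0.7 survive; in the second, 0.3 and -0.8. Unlike unstructured pruning, the threshold is local to each group, so a weight that is large globally can still be dropped if its three neighbours are larger.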

`mask_factory.py`: Mask Generation Utilities

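import torch


def generate_mask(
    importance: torch.Tensor,        # weight importance tensor (e.g. |W| or |W| * ||X||)
    sparsity: float,                 # 0.0 to 1.0
    mask_structure: str = "unstructured",
) -> torch.Tensor:
    """
    Generate binary mask (1=keep, 0=prune) based on importance.
    """
    if mask_structure == "unstructured":
        # Drop the lowest `sparsity` fraction across the whole tensor
        k = int(importance.numel() * (1 - sparsity))
        if k == 0:
            # Nothing survives; topk with k=0 would have no last element to index
            return torch.zeros_like(importance, dtype=torch.float)
        threshold = torch.topk(importance.flatten(), k, largest=True).values[-1]
        # Ties at the threshold are all kept, so the mask can hold more than k ones
        return (importance >= threshold).float()

    elif mask_structure == "2:4":
        # Keep the top 2 in every group of 4
        return _apply_2_4_sparsity(importance)   # helper not shown in this excerpt

    elif mask_structure == "4:8":
        return _apply_4_8_sparsity(importance)   # helper not shown in this excerpt

    # Fail loudly instead of silently returning None for other structures
    raise ValueError(f"Unsupported mask_structure: {mask_structure}")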

Each sparsification algorithm (SparseGPT, Wanda) computes its own importance score and passes it to this factory. SparseGPT uses w_i^2 / [H^-1]_{ii} (OBS-based), and Wanda uses |w_i| * ||x_i||_2 (weight × activation).
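
The two scores can be compared on a toy layer. The following is a minimal sketch under stated assumptions: `wanda_score` and `obs_score` are hypothetical helper names, the Hessian is taken to be X^T X over the calibration inputs, and the damping factor 0.01 is an assumed value chosen only to keep the matrix invertible. The real implementations differ in how they invert and update the Hessian.

```python
import torch


def wanda_score(weight: torch.Tensor, inputs: torch.Tensor) -> torch.Tensor:
    """|w_ij| * ||x_j||_2, with the norm taken over all calibration tokens."""
    col_norm = inputs.norm(p=2, dim=0)   # shape: (in_features,)
    return weight.abs() * col_norm       # broadcast over output rows


def obs_score(weight: torch.Tensor, inputs: torch.Tensor, damp: float = 0.01) -> torch.Tensor:
    """w_ij^2 / [H^-1]_jj with H = X^T X plus diagonal damping."""
    hessian = inputs.T @ inputs
    # Damping keeps H invertible; the 0.01 factor is an assumption of this sketch.
    hessian += damp * hessian.diagonal().mean() * torch.eye(hessian.shape[0])
    h_inv_diag = torch.linalg.inv(hessian).diagonal()
    return weight.pow(2) / h_inv_diag


# One output row, two input channels: a small weight on a loud channel
# and a large weight on a quiet channel.
weight = torch.tensor([[0.1, 1.0]])
inputs = torch.tensor([[10.0, 0.1], [10.0, -0.1]])

magnitude = weight.abs()
wanda = wanda_score(weight, inputs)
obs = obs_score(weight, inputs)
```

Magnitude alone would prune the 0.1 weight first, while both activation-aware scores rank it as the more important one because its input channel carries far more signal. When H is diagonal and undamped, [H^-1]_jj = 1 / H_jj, so the OBS score reduces to the square of the Wanda score; the two methods diverge through the off-diagonal terms.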

`MagnitudePruningModifier`: The Simplest Case

class MagnitudePruningModifier(Modifier):
    sparsity: float = 0.5
    mask_structure: str = "unstructured"
    targets: list[str] = ...

    def on_initialize(self, state: State, **kwargs) -> bool:
        # No calibration data needed: data_free
        for name, module in match_named_modules(state.model, self.targets):
            importance = module.weight.detach().abs()   # |W| is the importance score
            mask = generate_mask(importance, self.sparsity, self.mask_structure)
            module.weight.data *= mask                  # zero the pruned weights in place

        return True

MagnitudePruningModifier needs no calibration data. Because it decides using only the absolute values of the weights, it runs in the Data-Free Pipeline. The implementation is roughly 30 lines.
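
The data-free nature is easy to see on a bare `torch.nn.Linear`, with no model state or calibration loader involved. This is a minimal sketch: `magnitude_prune_` is a hypothetical helper written for this example, not part of llm-compressor, and it selects exactly the requested number of weights by index rather than by threshold.

```python
import torch


def magnitude_prune_(linear: torch.nn.Linear, sparsity: float) -> torch.Tensor:
    """Zero the smallest-|w| fraction of weights in place and return the mask."""
    importance = linear.weight.detach().abs()
    num_prune = int(importance.numel() * sparsity)
    mask = torch.ones_like(importance)
    if num_prune > 0:
        # Indices of the num_prune smallest magnitudes across the whole matrix.
        drop = torch.topk(importance.flatten(), num_prune, largest=False).indices
        mask.view(-1)[drop] = 0.0
    with torch.no_grad():
        linear.weight.mul_(mask)
    return mask


torch.manual_seed(0)
layer = torch.nn.Linear(32, 16)           # 512 weights
mask = magnitude_prune_(layer, sparsity=0.5)
achieved = (layer.weight == 0).float().mean().item()
```

Selecting by index avoids the tie behaviour of a threshold comparison, at the cost of an extra scatter into the mask.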

`ConstantPruningModifier`: Preserving the Mask

class ConstantPruningModifier(Modifier):
    """
    Maintain existing sparsity mask. Used to prevent fine-tuning
    from un-pruning previously pruned weights.
    """
    targets: list[str] = ...

    def on_initialize(self, state: State, **kwargs) -> bool:
        # Store the current nonzero pattern as the mask (0 marks a pruned weight)
        for name, module in match_named_modules(state.model, self.targets):
            mask = (module.weight.data != 0).float()
            module.register_buffer("_pruning_mask", mask)
        return True

    def on_event(self, state: State, event: Event, **kwargs):
        # Runs every step: re-apply the mask so pruned weights stay zero
        if event.type_ == EventType.OPTIM_POST_STEP:
            for name, module in match_named_modules(state.model, self.targets):
                module.weight.data *= module._pruning_mask

This Modifier keeps the mask from being undone when an already pruned model is fine-tuned. It is unnecessary in PTQ-only scenarios, but essential for sparse fine-tuning.
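
The same idea can be shown with a plain PyTorch training loop. This is a minimal sketch that assumes a standard SGD optimizer and a toy loss. It mirrors the "snapshot the mask once, re-apply after every optimizer step" pattern without using any llm-compressor event machinery.

```python
import torch

torch.manual_seed(0)
layer = torch.nn.Linear(8, 4)

# Pretend the layer was pruned earlier: zero every other input column.
with torch.no_grad():
    layer.weight[:, ::2] = 0.0

# Snapshot the sparsity pattern once, before training starts.
mask = (layer.weight != 0).to(layer.weight.dtype)

optimizer = torch.optim.SGD(layer.parameters(), lr=0.1)
for _ in range(3):
    loss = layer(torch.randn(16, 8)).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Without this step the gradient update would make pruned weights nonzero.
    with torch.no_grad():
        layer.weight.mul_(mask)

sparsity = (layer.weight == 0).float().mean().item()
```

Pruned weights still receive gradients, so the optimizer moves them away from zero on every step. Multiplying by the mask afterwards is what keeps the pattern fixed.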

Why This Design

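1. A common base, SGPTBaseModifier. SparseGPT and Wanda differ only in how they compute importance; the rest of the lifecycle (hooks, sequential epochs, mask application) is identical. Moving the shared logic into a base class removes the duplication.

2. A unified mask_factory interface. Every pruning algorithm goes through a single "importance tensor → mask tensor" function, so structures such as 2:4 and unstructured are supported in one place.

3. Magnitude is data-free. As the simplest Modifier, it only needs the data_free pipeline. Even if a user mistakenly runs it under the sequential pipeline, _infer_pipeline handles the choice.

4. ConstantPruningModifier is kept separate. "Prune now" and "preserve an existing mask" are different intents, so they live in separate classes rather than one Modifier, which keeps each one clear.

5. The obcq/pruning split is a legacy. pruning/sparsegpt/ is now the main implementation and obcq/ is a deprecated path, registered in the Modifier Factory's list of deprecated paths.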

Wrapping Up

This overview covered how the pruning layer is structured. The next post looks at the first implementation built on that structure, SparseGPT.
