[llm-compressor] Events: Batch Lifecycle Hooks and Epoch Calculation Logic

April 13, 2026 (updated: April 13, 2026)

Introduction

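Lifecycle's event() method broadcasts an event to every Modifier. What that "event" is gets defined by the EventType enum and the Event dataclass in src/llmcompressor/core/events/event.py. The file is small but interesting: it folds the batch, epoch, and optimizer lifecycles into a single enum, and its should_update method decides, in one condition check, whether a given Modifier should act on a given event.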

Core Structure / Code Analysis

`EventType` enum

from enum import Enum, unique

@unique
class EventType(Enum):
    # training lifecycle: Modifier lifecycle
    INITIALIZE = "initialize"                         # session initialization
    FINALIZE = "finalize"                             # session teardown

    # batch lifecycle: per-batch hooks
    BATCH_START = "batch_start"                       # batch starts
    LOSS_CALCULATED = "loss_calculated"               # loss has been computed
    BATCH_END = "batch_end"                           # batch ends

    # calibration lifecycle: calibration only
    CALIBRATION_EPOCH_START = "calibration_epoch_start"  # calibration epoch starts
    SEQUENTIAL_EPOCH_END = "sequential_epoch_end"        # a layer epoch of the sequential pipeline ends
    CALIBRATION_EPOCH_END = "calibration_epoch_end"      # calibration epoch ends

    # step lifecycle: optimizer steps
    OPTIM_PRE_STEP = "optim_pre_step"                 # just before optimizer.step()
    OPTIM_POST_STEP = "optim_post_step"               # just after optimizer.step()

Group	Events	Typical scenarios
training	INITIALIZE / FINALIZE	All scenarios
batch	BATCH_START / LOSS_CALCULATED / BATCH_END	QAT, calibration loops
calibration	CALIBRATION_EPOCH_START / CALIBRATION_EPOCH_END / SEQUENTIAL_EPOCH_END	PTQ only
step	OPTIM_PRE_STEP / OPTIM_POST_STEP	Fine-tuning

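Notably, SEQUENTIAL_EPOCH_END is used only by the Sequential Pipeline. It signals that one layer has been run over the entire calibration dataset, which is the point at which a Modifier such as GPTQ finalizes that layer's weights. That is a different meaning from the end of a single batch (BATCH_END).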
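To make the difference between BATCH_END and SEQUENTIAL_EPOCH_END concrete, here is a minimal sketch. The enum is a trimmed copy of the one above, while ToyLayerModifier and its on_event method are hypothetical names invented for illustration and are not part of llm-compressor.

```python
from enum import Enum, unique


@unique
class EventType(Enum):
    # Trimmed copy of the enum above; only the members used in this sketch.
    BATCH_END = "batch_end"
    SEQUENTIAL_EPOCH_END = "sequential_epoch_end"


class ToyLayerModifier:
    """Hypothetical modifier: accumulates per batch, finalizes per layer."""

    def __init__(self) -> None:
        self.batches_seen = 0
        self.finalized_layers = 0

    def on_event(self, event_type: EventType) -> None:
        if event_type == EventType.BATCH_END:
            # One calibration batch finished: just accumulate statistics.
            self.batches_seen += 1
        elif event_type == EventType.SEQUENTIAL_EPOCH_END:
            # The whole calibration set has passed through the layer:
            # finalize it and reset the per-layer counter.
            self.finalized_layers += 1
            self.batches_seen = 0


modifier = ToyLayerModifier()
for _ in range(3):
    modifier.on_event(EventType.BATCH_END)
batches_before_finalize = modifier.batches_seen
modifier.on_event(EventType.SEQUENTIAL_EPOCH_END)
```

In this sketch the per-batch branch only counts, and the expensive "finalize" work happens once per layer, which is the role the article attributes to SEQUENTIAL_EPOCH_END.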

`Event` dataclass

from dataclasses import dataclass
from typing import Optional

@dataclass
class Event:
    type_: Optional[EventType] = None            # event type
    steps_per_epoch: Optional[int] = None        # steps per epoch (training)
    batches_per_step: Optional[int] = None       # batches per step (grad accumulation)
    invocations_per_step: int = 1                # step-wrapper invocations per step (older AMP compatibility)
    global_step: int = 0                         # global step counter
    global_batch: int = 0                        # global batch counter

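An Event object records not only what kind of event occurred but also when it occurred, as numbers. In a training scenario, "we are at epoch 1.5" corresponds to global_step=15000 and steps_per_epoch=10000, which gives epoch_full = 15000 / 10000 = 1.5.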

Epoch calculation properties

@property
def epoch_based(self) -> bool:
    return self.steps_per_epoch is not None       # epoch based if steps_per_epoch is set

@property
def epoch(self) -> int:
    if not self.epoch_based:
        raise ValueError("Event is not epoch based")
    return self.global_step // self.steps_per_epoch   # integer epoch

@property
def epoch_full(self) -> float:
    if not self.epoch_based:
        raise ValueError("Event is not epoch based")
    return self.global_step / float(self.steps_per_epoch)   # fractional epoch

@property
def epoch_step(self) -> int:
    if not self.epoch_based:
        raise ValueError("Event is not epoch based")
    return self.global_step % self.steps_per_epoch   # step index within the current epoch

@property
def epoch_batch(self) -> int:
    if not self.epoch_based:
        raise ValueError("Event is not epoch based")
    batches_per_epoch = (
        self.steps_per_epoch * self.batches_per_step
        if self.batches_per_step
        else self.steps_per_epoch
    )
    return self.global_batch % batches_per_epoch

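These properties reflect the idea that the position within an epoch is derived from global counters, not stored separately. A caller sets steps_per_epoch and global_step, and Event computes the epoch, the fractional epoch, and the step within the epoch. epoch_batch additionally reads global_batch and, when it is set, batches_per_step.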
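The arithmetic can be checked with a small stand-in. MiniEvent below is a hypothetical, trimmed re-implementation of the properties shown above (without the epoch_based guards), not the library class, and the counter values are made up for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MiniEvent:
    # Hypothetical stand-in that mirrors the Event fields used by the properties.
    steps_per_epoch: Optional[int] = None
    batches_per_step: Optional[int] = None
    global_step: int = 0
    global_batch: int = 0

    @property
    def epoch(self) -> int:
        return self.global_step // self.steps_per_epoch

    @property
    def epoch_full(self) -> float:
        return self.global_step / float(self.steps_per_epoch)

    @property
    def epoch_step(self) -> int:
        return self.global_step % self.steps_per_epoch

    @property
    def epoch_batch(self) -> int:
        # Falls back to one batch per step when batches_per_step is unset.
        batches_per_epoch = self.steps_per_epoch * (self.batches_per_step or 1)
        return self.global_batch % batches_per_epoch


# 15000 steps with 10000 steps per epoch and 4 accumulated batches per step.
event = MiniEvent(
    steps_per_epoch=10000,
    batches_per_step=4,
    global_step=15000,
    global_batch=60000,
)
print(event.epoch, event.epoch_full, event.epoch_step, event.epoch_batch)
```

With these numbers the event is in epoch 1, at fractional epoch 1.5, 5000 steps and 20000 batches into the current epoch.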

`current_index`: epoch based or step based

@property
def current_index(self) -> float:
    if not self.epoch_based:
        return self.global_step
    epoch_full = self.epoch_full
    if epoch_full - self.epoch > 1.0:
        raise ValueError("Too many steps per epoch for epoch based event")
    return epoch_full

@current_index.setter
def current_index(self, value: float):
    if not self.epoch_based:
        self.global_step = int(value)
        self.global_batch = (
            self.global_step
            if self.batches_per_step is None or self.batches_per_step < 2
            else self.global_step * self.batches_per_step
        )
    else:
        self.global_step = int(value * self.steps_per_epoch)
        self.global_batch = (
            self.global_step
            if self.batches_per_step is None or self.batches_per_step < 2
            else self.global_step * self.batches_per_step
        )

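current_index is the single scalar a Modifier uses to decide whether now is the time to act. It returns the fractional epoch when the event is epoch based and the global step otherwise. The setter goes the other way: given an index to treat as "now", it back-computes global_step and global_batch. Because the setter converts with int(), a value that does not map to a whole step is truncated.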
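A round trip through the getter and setter can be sketched as follows. MiniEvent is again a hypothetical stand-in that condenses the two setter branches shown above into one; it assumes non-negative batches_per_step and omits the getter's "too many steps" check.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MiniEvent:
    # Hypothetical stand-in; only the fields current_index touches.
    steps_per_epoch: Optional[int] = None
    batches_per_step: Optional[int] = None
    global_step: int = 0
    global_batch: int = 0

    @property
    def current_index(self) -> float:
        if self.steps_per_epoch is None:
            return self.global_step          # step based
        return self.global_step / float(self.steps_per_epoch)  # epoch based

    @current_index.setter
    def current_index(self, value: float) -> None:
        scale = 1 if self.steps_per_epoch is None else self.steps_per_epoch
        self.global_step = int(value * scale)
        # None, 0, and 1 all mean "one batch per step".
        self.global_batch = self.global_step * (self.batches_per_step or 1)


step_event = MiniEvent()
step_event.current_index = 42            # step based: index is the step itself

epoch_event = MiniEvent(steps_per_epoch=100, batches_per_step=4)
epoch_event.current_index = 1.5          # epoch based: index is a fractional epoch
```

Setting the index to 1.5 on the epoch-based event yields global_step 150 and global_batch 600, and reading current_index back returns 1.5.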

`should_update`: checking the trigger condition

def should_update(
    self, start: Optional[float], end: Optional[float], update: Optional[float]
) -> bool:
    current = self.current_index
    if start is not None and current < start:
        return False
    if end is not None and current > end:
        return False
    return update is None or update <= 0.0 or current % update < 1e-10

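Each Modifier declares a time range such as "act from start to end, every update interval". This method performs that range check in a few lines. The current % update < 1e-10 term tests whether current is a multiple of update, with a small tolerance for floating-point error. For example, with start=0.0, end=10.0, update=0.5, it returns True only when current_index is 0.0, 0.5, 1.0, 1.5, …, 10.0.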
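The behavior is easy to explore with a free-function version of the same logic. The function below copies the body shown above but takes current as an explicit argument, which is an adaptation for this sketch; steps_per_epoch=10 and the start/end/update values are arbitrary example numbers.

```python
from typing import Optional


def should_update(
    current: float,
    start: Optional[float],
    end: Optional[float],
    update: Optional[float],
) -> bool:
    # Same three checks as Event.should_update, with current passed in.
    if start is not None and current < start:
        return False
    if end is not None and current > end:
        return False
    return update is None or update <= 0.0 or current % update < 1e-10


# Walk three epochs at 10 steps per epoch and collect the indices that fire.
steps_per_epoch = 10
fired = [
    step / steps_per_epoch
    for step in range(0, 31)
    if should_update(step / steps_per_epoch, start=0.5, end=2.0, update=0.5)
]

# The tolerance absorbs a remainder slightly above zero ...
above = should_update(0.1 * 3, None, None, 0.1)
# ... but not a remainder slightly below update.
below = should_update(0.3, None, None, 0.1)
```

Here fired is [0.5, 1.0, 1.5, 2.0]. The last two calls show that the check is one-sided: 0.1 * 3 leaves a remainder of about 2.8e-17 and passes, while the literal 0.3 leaves a remainder of about 0.1 and does not, so intervals that are not exactly representable in binary deserve some care.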

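This pattern is a legacy of the SparseML era. It was designed for scenarios such as applying gradual pruning over a specific range of epochs during training. In oneshot PTQ a Modifier mostly acts only once, so start=end=0.0 is used.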

`new_instance`: creating a derived event

def new_instance(self, **kwargs) -> "Event":
    instance = deepcopy(self)
    for key, value in kwargs.items():
        setattr(instance, key, value)
    return instance

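It copies the state of an existing event and returns a new event with only some fields changed. Lifecycle uses it when, instead of simply constructing a fresh object with Event(type_=event_type), it wants to keep shared metadata such as steps_per_epoch from the existing Event.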
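The copy-then-override behavior can be sketched with a reduced dataclass. MiniEvent is a hypothetical stand-in with three fields, and type_ is a plain string here instead of an EventType member to keep the example short.

```python
from copy import deepcopy
from dataclasses import dataclass
from typing import Optional


@dataclass
class MiniEvent:
    # Hypothetical stand-in for Event with only three fields.
    type_: Optional[str] = None
    steps_per_epoch: Optional[int] = None
    global_step: int = 0

    def new_instance(self, **kwargs) -> "MiniEvent":
        # Same body as the method shown above: deep copy, then override.
        instance = deepcopy(self)
        for key, value in kwargs.items():
            setattr(instance, key, value)
        return instance


base = MiniEvent(type_="batch_start", steps_per_epoch=100, global_step=7)
derived = base.new_instance(type_="batch_end")
```

The derived event carries the new type_ but keeps steps_per_epoch and global_step, and the original is left untouched. One consequence of using setattr is that a misspelled keyword does not raise; it silently creates a new attribute on the copy.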

Why this design

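1. One enum covers every hook. The training, batch, calibration, and step lifecycles all live in a single EventType enum, so a Modifier handles only the hooks it needs with a simple branch such as if event.type_ == EventType.BATCH_START. Defining a new kind of event only requires adding a value to the enum.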

2. Automatic epoch_based detection. Whether an event is epoch based or step based is determined solely by whether steps_per_epoch is None, so users can handle training and oneshot scenarios through the same API.

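3. Symmetric current_index getter and setter. A Modifier reads its current position from event.current_index and, when needed, moves that position forward or back with event.current_index = x. This makes it easy in tests to reproduce the behavior at a specific epoch.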

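4. Handling floating-point error. The current % update < 1e-10 check guards against the fact that a computed index may not be an exact multiple of update in floating point; for example, 0.1 * 3 evaluates to 0.30000000000000004, not 0.3. With an exact comparison against zero, a Modifier could miss an "every update" trigger. Note that the tolerance only covers a remainder slightly above zero, not one slightly below update.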

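5. Separate calibration events. CALIBRATION_EPOCH_START/END and SEQUENTIAL_EPOCH_END follow a different lifecycle from the training batch events and are not included in Lifecycle's _event_order ordering validation. As a result, the two scenarios do not interfere with each other.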