[vLLM] Other Model Layers: Pooler, Resampler, Vocab Parallel Embedding, and More

April 8, 2026 (updated: April 8, 2026)

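Introduction

Beyond attention and FFN, LLMs rely on a variety of other layers. vLLM implements them under vllm/model_executor/layers/, integrated with tensor parallelism, quantization, and related features. This post focuses on VocabParallelEmbedding, Resampler, and Pooler.

Core Structure and Code Analysis

VocabParallelEmbedding: Distributed Embedding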

DEFAULT_VOCAB_PADDING_SIZE = 64

class UnquantizedEmbeddingMethod(QuantizeMethodBase):
    def create_weights(self, layer, input_size_per_partition, output_partition_sizes,
                       input_size, output_size, params_dtype, **extra_weight_attrs):
        weight = Parameter(
            torch.empty(sum(output_partition_sizes), input_size_per_partition,
                        dtype=params_dtype),
            requires_grad=False,
        )
        set_weight_attrs(weight, {"input_dim": 1, "output_dim": 0})
        layer.register_parameter("weight", weight)

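VocabParallelEmbedding splits the vocabulary across the tensor-parallel workers. output_partition_sizes determines how many vocabulary rows each worker holds, and the attributes input_dim=1 (the hidden dimension) and output_dim=0 (the vocab dimension) record which axis of the weight is which, so the vocab axis can be sharded. The vocabulary size is padded to a multiple of 64 (DEFAULT_VOCAB_PADDING_SIZE) to keep the shards aligned for GPU memory access.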
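As a rough illustration of the padding and partitioning described above, the following sketch computes a padded vocabulary size and the index range owned by each worker. The helper names pad_to_multiple and shard_range are hypothetical and written only for this example; they are not presented as vLLM's own functions, and the real layer may handle details this sketch ignores.

```python
DEFAULT_VOCAB_PADDING_SIZE = 64  # value taken from the excerpt above


def pad_to_multiple(vocab_size: int, pad_to: int = DEFAULT_VOCAB_PADDING_SIZE) -> int:
    """Round vocab_size up to the next multiple of pad_to (hypothetical helper)."""
    return ((vocab_size + pad_to - 1) // pad_to) * pad_to


def shard_range(padded_vocab: int, rank: int, world_size: int) -> tuple[int, int]:
    """Return the [start, end) vocab indices owned by one worker (hypothetical helper)."""
    if padded_vocab % world_size != 0:
        raise ValueError("padded vocab must divide evenly across workers")
    per_worker = padded_vocab // world_size
    return rank * per_worker, (rank + 1) * per_worker


# Assumed example: a 32001-token vocabulary split across 4 workers.
padded = pad_to_multiple(32001)
ranges = [shard_range(padded, rank, 4) for rank in range(4)]
print(padded, ranges)
```

With padding in place, every worker receives a shard of the same size, which is what makes the even split in shard_range possible.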

from vllm.distributed import (
    divide,
    get_tensor_model_parallel_rank,
    get_tensor_model_parallel_world_size,
    tensor_model_parallel_all_reduce,
)

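With these distributed communication utilities, the forward pass combines the per-worker results with an all-reduce. Each worker looks up only the token IDs that fall inside its own vocabulary shard and outputs zeros for the rest, so summing across workers reconstructs the full embedding.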
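The masked lookup followed by a sum can be simulated on a single process. The sketch below uses plain Python lists in place of tensors and a local sum in place of a real all-reduce; local_lookup and all_reduce_sum are hypothetical names for this example, and the table sizes are assumptions chosen to keep it small.

```python
HIDDEN = 2
VOCAB = 8
WORLD_SIZE = 2

# Full embedding table, then one contiguous shard of rows per simulated worker.
full_table = [[float(i), float(i * 10)] for i in range(VOCAB)]
per_worker = VOCAB // WORLD_SIZE
shards = [full_table[r * per_worker:(r + 1) * per_worker] for r in range(WORLD_SIZE)]


def local_lookup(token_ids, shard, start):
    """Look up tokens owned by this shard; emit zero vectors for all others."""
    end = start + len(shard)
    out = []
    for t in token_ids:
        if start <= t < end:
            out.append(list(shard[t - start]))
        else:
            out.append([0.0] * HIDDEN)
    return out


def all_reduce_sum(partials):
    """Element-wise sum of the per-worker outputs (stand-in for an all-reduce)."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


tokens = [1, 6, 3]
partials = [local_lookup(tokens, shards[r], r * per_worker) for r in range(WORLD_SIZE)]
embeddings = all_reduce_sum(partials)
print(embeddings)
```

Because exactly one worker contributes a non-zero row for each token, the sum is equal to a lookup in the unsharded table.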

Resampler: Multimodal Feature Resampling

"""
Shared resampler perceiver network used in multimodal models.
Example models: Qwen (Qwen-VL), MiniCPM-V 2.0
"""

def get_abs_pos(abs_pos: torch.Tensor, tgt_size: torch.Tensor | int) -> torch.Tensor:
    src_size = int(math.sqrt(abs_pos.size(0)))
    dtype = abs_pos.dtype
    if isinstance(tgt_size, int):
        tgt_size = (tgt_size, tgt_size)
    if src_size == tgt_size[0] and src_size == tgt_size[1]:
        return abs_pos

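The Resampler is based on the Perceiver architecture and resamples the variable-length feature sequence from the vision encoder into a fixed-length one. get_abs_pos interpolates the position embeddings when the source and target resolutions differ; the excerpt above shows only the early return taken when the sizes already match. It is used by models such as Qwen-VL and MiniCPM-V 2.0.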
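The reason the output length is fixed can be seen in a stripped-down cross-attention: the number of output rows equals the number of queries, regardless of how many input features there are. The sketch below is a single-head attention in plain Python that assumes a fixed set of query vectors; it deliberately omits the projections, normalization, and position embeddings of a real resampler, and cross_attend is a hypothetical name.

```python
import math


def cross_attend(queries, features):
    """Single-head attention: each query produces one weighted average of features."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        # Scaled dot-product scores between this query and every input feature.
        scores = [sum(qi * fi for qi, fi in zip(q, f)) / math.sqrt(dim) for f in features]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(weights)
        weights = [w / total for w in weights]
        outputs.append([sum(w * f[j] for w, f in zip(weights, features)) for j in range(dim)])
    return outputs


# Two queries stand in for the fixed-size query set (assumed values).
queries = [[1.0, 0.0], [0.0, 1.0]]
short_features = [[1.0, 2.0]] * 3
long_features = [[float(i), 1.0] for i in range(9)]

out_short = cross_attend(queries, short_features)
out_long = cross_attend(queries, long_features)
print(len(out_short), len(out_long))
```

Both calls return two rows, even though the inputs have 3 and 9 feature vectors respectively.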

Pooler: Extracting Sequence-Level Representations

# vllm/model_executor/layers/pooler/abstract.py
class Pooler(nn.Module, ABC):
    """Abstract pooler that extracts a sequence-level representation."""

# vllm/model_executor/layers/pooler/tokwise/poolers.py
class TokenPooler(Pooler):
    """Token-based pooling (extracts the hidden state of specific tokens)."""

# vllm/model_executor/layers/pooler/activations.py
class PoolerActivation(nn.Module, ABC): ...
class PoolerIdentity(PoolerActivation): ...
class PoolerNormalize(PoolerActivation): ...
class PoolerMultiLabelClassify(PoolerActivation): ...
class PoolerClassify(PoolerActivation): ...

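The Pooler system is designed in layers:

Pooler: which tokens to select (CLS, last, mean, and so on)
PoolerHead: the transformation applied to the selected tokens (Linear and so on)
PoolerActivation: the final activation (Identity, Normalize, Classify, and so on)

Combining these pieces supports a range of pooling models, including embedding models and classification models.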
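The three stages listed above can be sketched as plain function composition. This is a minimal illustration, not vLLM's class hierarchy: select_tokens, pool, and the activation functions are hypothetical names, hidden states are plain lists, and softmax is used here only as an assumed stand-in for a classification activation.

```python
import math


def select_tokens(hidden_states, how):
    """Stage 1: reduce a [num_tokens][hidden] sequence to a single vector."""
    if how == "cls":
        return list(hidden_states[0])
    if how == "last":
        return list(hidden_states[-1])
    if how == "mean":
        n = len(hidden_states)
        return [sum(col) / n for col in zip(*hidden_states)]
    raise ValueError(f"unknown selection: {how}")


def identity(vec):
    return list(vec)


def normalize(vec):
    """L2-normalize the vector, leaving an all-zero vector unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def softmax(vec):
    top = max(vec)
    exps = [math.exp(x - top) for x in vec]
    total = sum(exps)
    return [e / total for e in exps]


def pool(hidden_states, how, head=identity, activation=identity):
    """Stage 1 (select) -> stage 2 (head) -> stage 3 (activation)."""
    return activation(head(select_tokens(hidden_states, how)))


states = [[3.0, 4.0], [1.0, 0.0]]
embedding = pool(states, "cls", activation=normalize)
probs = pool(states, "mean", activation=softmax)
print(embedding, probs)
```

Swapping only the activation argument changes the model's role while the selection and head stages stay untouched.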

Quantization Integration

class UnquantizedEmbeddingMethod(QuantizeMethodBase):
    def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
        if current_platform.is_cpu():
            from vllm.model_executor.layers.utils import dispatch_cpu_unquantized_gemm

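Weight-bearing layers such as the embedding are integrated with quantization through QuantizeMethodBase. Even when no quantization is applied, UnquantizedEmbeddingMethod provides the same interface, and on CPU platforms it uses a dedicated GEMM dispatch.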
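The idea of a layer delegating to a quantization method object can be sketched without any framework. All class names below (QuantMethodSketch, UnquantizedSketch, Int8Sketch, EmbeddingSketch) are hypothetical, the method signatures are simplified compared with the excerpt above, and the int8 scheme is an assumed toy example rather than anything vLLM ships.

```python
from abc import ABC, abstractmethod


class QuantMethodSketch(ABC):
    """Simplified stand-in for a quantization method interface."""

    @abstractmethod
    def create_weights(self, layer, num_rows, num_cols):
        ...

    def process_weights_after_loading(self, layer):
        return None  # default: nothing to do after loading


class UnquantizedSketch(QuantMethodSketch):
    def create_weights(self, layer, num_rows, num_cols):
        layer.weight = [[0.0] * num_cols for _ in range(num_rows)]


class Int8Sketch(QuantMethodSketch):
    def create_weights(self, layer, num_rows, num_cols):
        layer.weight = [[0.0] * num_cols for _ in range(num_rows)]
        layer.scale = 1.0

    def process_weights_after_loading(self, layer):
        # Toy symmetric quantization: map the largest magnitude to 127.
        max_abs = max(abs(x) for row in layer.weight for x in row) or 1.0
        layer.scale = max_abs / 127
        layer.weight = [[round(x / layer.scale) for x in row] for row in layer.weight]


class EmbeddingSketch:
    """The layer never branches on quantization; it only calls quant_method."""

    def __init__(self, num_rows, num_cols, quant_method):
        self.quant_method = quant_method
        self.quant_method.create_weights(self, num_rows, num_cols)

    def load(self, values):
        self.weight = [list(row) for row in values]
        self.quant_method.process_weights_after_loading(self)


values = [[127.0, -64.0], [1.0, 0.0]]
plain = EmbeddingSketch(2, 2, UnquantizedSketch())
plain.load(values)
quantized = EmbeddingSketch(2, 2, Int8Sketch())
quantized.load(values)
print(plain.weight, quantized.weight, quantized.scale)
```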

Weight Attribute System

set_weight_attrs(weight, {"input_dim": 1, "output_dim": 0})
set_weight_attrs(weight, extra_weight_attrs)

set_weight_attrs attaches metadata to a tensor. input_dim and output_dim indicate which axis is split under tensor parallelism and which dimension corresponds to the partition when weights are loaded.
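How such metadata can drive weight loading is sketched below with plain Python objects. ParamSketch, attach_attrs, and load_shard are hypothetical names for this example; the sketch assumes a 2-D weight stored as nested lists and is not a reproduction of vLLM's loader.

```python
class ParamSketch:
    """Minimal stand-in for a parameter: a 2-D list plus free-form attributes."""

    def __init__(self, rows, cols):
        self.data = [[0.0] * cols for _ in range(rows)]


def attach_attrs(param, attrs):
    """Attach metadata as attributes, refusing to overwrite an existing one."""
    if attrs is None:
        return
    for key, value in attrs.items():
        if hasattr(param, key):
            raise AttributeError(f"attribute {key!r} is already set")
        setattr(param, key, value)


def load_shard(param, full_weight, rank, world_size):
    """Copy this worker's slice of full_weight, cut along param.output_dim."""
    if param.output_dim == 0:
        per_worker = len(full_weight) // world_size
        rows = full_weight[rank * per_worker:(rank + 1) * per_worker]
        param.data = [list(row) for row in rows]
    else:
        per_worker = len(full_weight[0]) // world_size
        param.data = [list(row[rank * per_worker:(rank + 1) * per_worker])
                      for row in full_weight]


full = [[0, 1], [2, 3], [4, 5], [6, 7]]  # 4 vocab rows, hidden size 2
param = ParamSketch(2, 2)
attach_attrs(param, {"input_dim": 1, "output_dim": 0})
load_shard(param, full, rank=1, world_size=2)
print(param.data)
```

The loader itself knows nothing about embeddings; it only reads output_dim from the parameter, which is the point of attaching the metadata to the weight.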

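Why This Design

QuantizeMethodBase integration: regular and quantized layers share the same interface. Model code does not need to branch on whether quantization is enabled; it only calls quant_method.create_weights() and quant_method.process_weights_after_loading().
Shared Resampler implementation: Qwen-VL and MiniCPM-V use the same Perceiver Resampler, so the common implementation lives in layers/resampler.py and is shared. Bug fixes then happen in one place.
Three-stage Pooler pipeline: separating token selection -> transformation -> activation lets each stage be swapped independently. For example, the same CLS-token pooling can switch easily between Identity (embedding), Normalize (normalized embedding), and Classify (classification).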
