[faster-qwen3-tts] Porting to the official Qwen3-TTS base, with a substantial benchmark improvement

February 20, 2026 (updated: February 20, 2026)

PR link: andimarafioti/faster-qwen3-tts#11 | Status: Merged | Changes: +459 / -89

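Introduction

faster-qwen3-tts was initially developed on top of the community Qwen3-TTS-streaming fork. This PR switches the base dependency to the official QwenLM/Qwen3-TTS repository and ports the code to match the upstream API changes. Alongside the port, it vectorizes the repetition penalty, which noticeably improves the benchmark numbers.

Core code analysis

Dependency switch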

# pyproject.toml
# Before
dependencies = [
    "qwen-tts @ git+https://github.com/dffdeeq/Qwen3-TTS-streaming.git",
]

# After
dependencies = [
    "qwen-tts @ git+https://github.com/QwenLM/Qwen3-TTS.git",
]
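
After reinstalling, it can be useful to confirm which repository the qwen-tts distribution actually came from. The following is a minimal sketch, assuming the package was installed by a pip version that records PEP 610 metadata (direct_url.json). The helper name installed_source_url is hypothetical and not part of the PR.

```python
import json
from importlib import metadata


def installed_source_url(dist_name):
    """Return the URL a distribution was installed from, or None if unknown.

    Relies on the PEP 610 file direct_url.json, which installers write for
    direct URL installs (for example git+https requirements).
    """
    try:
        dist = metadata.distribution(dist_name)
    except metadata.PackageNotFoundError:
        return None  # the distribution is not installed at all
    raw = dist.read_text("direct_url.json")
    if raw is None:
        return None  # installed from an index, or metadata not recorded
    return json.loads(raw).get("url")


# "qwen-tts" is the distribution name used in pyproject.toml above.
print(installed_source_url("qwen-tts"))
```

If the result still points at the fork, the environment most likely has a stale install that needs to be removed first.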

Repetition penalty vectorization

The key performance improvement in this PR is vectorizing the repetition penalty inside the decode loop.

Before:

if repetition_penalty != 1.0 and len(all_codec_ids) > 0:
    n_recent = min(50, len(all_codec_ids))
    recent = torch.stack([c[0] for c in all_codec_ids[-n_recent:]])
    for prev_tok in recent.unique():
        s = logits[0, 0, prev_tok]
        logits[0, 0, prev_tok] = s / rep_pen if s > 0 else s * rep_pen

After:

if repetition_penalty != 1.0 and len(all_codec_ids) > 0:
    n_recent = min(50, len(all_codec_ids))
    recent = torch.stack([c[0] for c in all_codec_ids[-n_recent:]])
    unique_toks = recent.unique()
    tok_logits = logits[0, 0, unique_toks]
    logits[0, 0, unique_toks] = torch.where(
        tok_logits > 0,
        tok_logits / repetition_penalty,
        tok_logits * repetition_penalty,
    )
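
To check that the two variants produce the same result, here is a small self-contained sketch. The function names, the vocabulary size, and the random inputs are made up for illustration; only the penalty rule itself (divide positive logits, multiply non-positive ones) comes from the snippets above.

```python
import torch


def apply_penalty_loop(logits, recent, penalty):
    """Per-token Python loop, mirroring the 'Before' snippet."""
    out = logits.clone()
    for tok in recent.unique():
        s = out[0, 0, tok]
        out[0, 0, tok] = s / penalty if s > 0 else s * penalty
    return out


def apply_penalty_vectorized(logits, recent, penalty):
    """Single indexed torch.where call, mirroring the 'After' snippet."""
    out = logits.clone()
    unique_toks = recent.unique()
    tok_logits = out[0, 0, unique_toks]
    out[0, 0, unique_toks] = torch.where(
        tok_logits > 0,
        tok_logits / penalty,
        tok_logits * penalty,
    )
    return out


torch.manual_seed(0)
vocab_size = 32                                # hypothetical, for illustration only
logits = torch.randn(1, 1, vocab_size)         # same [batch, step, vocab] layout as above
recent = torch.randint(0, vocab_size, (50,))   # stand-in for the last 50 codec ids
penalty = 1.3

loop_out = apply_penalty_loop(logits, recent, penalty)
vec_out = apply_penalty_vectorized(logits, recent, penalty)
print(torch.allclose(loop_out, vec_out))
```

Because recent.unique() contains no duplicate indices, the indexed assignment writes each position exactly once, so the vectorized form is a drop-in replacement for the loop rather than an approximation.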

Upstream API compatibility

# Use the official API if it exists, otherwise fall back
if hasattr(self.model, "_build_assistant_text"):
    input_texts = [self.model._build_assistant_text(text)]
else:
    input_texts = [f"<|im_start|>assistant\n{text}<|im_end|>\n..."]
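
The same feature-detection pattern can be tried in isolation. The sketch below uses two hypothetical stand-in classes and placeholder output strings. It does not reproduce the real prompt template, which is elided in the snippet above.

```python
class ModelWithHelper:
    """Hypothetical stand-in for a model that exposes the helper method."""

    def _build_assistant_text(self, text):
        return f"[helper]{text}"


class ModelWithoutHelper:
    """Hypothetical stand-in for a model that lacks the helper method."""


def build_input_texts(model, text):
    # Look the helper up once; getattr with a default avoids a second lookup.
    builder = getattr(model, "_build_assistant_text", None)
    if callable(builder):
        return [builder(text)]
    # Placeholder fallback format, NOT the real template.
    return [f"[fallback]{text}"]


print(build_input_texts(ModelWithHelper(), "hello"))
print(build_input_texts(ModelWithoutHelper(), "hello"))
```

Checking for the attribute at call time keeps a single code path working against either version of the dependency, at the cost of relying on a private, underscore-prefixed method that upstream may rename.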

Benchmark improvements

GPU	Before RTF	After RTF	Improvement
RTX 4090 (0.6B)	4.56	5.56	+22%
H100 (0.6B)	3.47	4.19	+21%
Jetson Orin (0.6B)	1.38	1.57	+14%
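
The improvement column can be reproduced from the two RTF columns. This is a short sketch of the arithmetic, assuming (as the table implies) that a higher RTF is better and that the percentages are rounded to whole numbers.

```python
# (GPU, RTF before, RTF after), copied from the table above
rows = [
    ("RTX 4090 (0.6B)", 4.56, 5.56),
    ("H100 (0.6B)", 3.47, 4.19),
    ("Jetson Orin (0.6B)", 1.38, 1.57),
]


def improvement_percent(before, after):
    """Relative gain of `after` over `before`, rounded to a whole percent."""
    return round((after / before - 1.0) * 100)


gains = {name: improvement_percent(before, after) for name, before, after in rows}
for name, gain in gains.items():
    print(f"{name}: +{gain}%")
```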

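Why this is a good change

Stable foundation: building on the official repository makes upstream changes easier to track.
Performance: replacing the Python for loop with a vectorized torch.where operation saves a few ms per step. The effect is largest on fast GPUs.
Compatibility: the hasattr-based fallback supports both versions.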

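Summary

By switching the base library and, at the same time, removing a small inefficiency in the decode loop, the PR improves RTF by 14% to 22% depending on the GPU. It is a good example of how large a share of the per-step cost Python code can take in a decode loop built around CUDA graphs.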

This article was written with the help of AI (Claude). The code analysis and interpretation may contain errors, so please refer to the original PR for accurate details.
