[faster-qwen3-tts] Verifying performance parity across modes and documenting the benchmark comparison

February 21, 2026 | Updated: February 21, 2026

PR: andimarafioti/faster-qwen3-tts#18 | Status: Merged | Changes: +146 / -0

Introduction

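faster-qwen3-tts supports three modes: VoiceClone (xvec), VoiceClone (ICL), and CustomVoice. Each mode builds a different input during the prefill stage, but the decode stage, which is captured as a CUDA graph, is the same for all three. This PR adds a benchmark that quantitatively verifies performance parity across the modes and documents the results.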

Key code analysis

compare_modes.py benchmark

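import time

import numpy as np

# TTFA_RUNS, RTF_RUNS, and CHUNK_SIZE are not shown in this excerpt;
# they are assumed to be defined at module level in compare_modes.py.

def bench_stream(fn, label):
    # TTFA (time to first audio), streaming: wall-clock time in ms until
    # the generator yields its first audio chunk.
    ttfas = []
    for _ in range(TTFA_RUNS):
        t0 = time.perf_counter()
        gen = fn(max_new_tokens=512, chunk_size=CHUNK_SIZE, streaming=True)
        _chunk, _sr, _timing = next(gen)
        ttfas.append((time.perf_counter() - t0) * 1000)

    # RTF, non-streaming: audio duration divided by wall-clock time,
    # so values above 1 mean faster than real time.
    rtfs = []
    for _ in range(RTF_RUNS):
        t0 = time.perf_counter()
        audio_list, sr = fn(max_new_tokens=512, streaming=False)
        total = time.perf_counter() - t0
        rtfs.append(len(audio_list[0]) / sr / total)

    print(f"{label:>18} | TTFA={np.mean(ttfas):.0f}ms | RTF={np.mean(rtfs):.3f}")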
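The results table reports each metric as a mean with a +/- spread, while the excerpt above prints only the mean. The following is a minimal sketch of how per-run samples could be reduced to that form. It assumes the spread is a sample standard deviation, which the excerpt does not confirm. The helpers `summarize` and `format_metric` and the sample values are hypothetical and not part of the PR.

```python
import statistics

def summarize(samples):
    """Return (mean, sample standard deviation) for a list of measurements."""
    mean = statistics.fmean(samples)
    # stdev needs at least two samples; report 0.0 spread for a single run.
    spread = statistics.stdev(samples) if len(samples) > 1 else 0.0
    return mean, spread

def format_metric(samples, digits=0):
    """Format samples as 'mean +/- spread', matching the table's notation."""
    mean, spread = summarize(samples)
    return f"{mean:.{digits}f} +/- {spread:.{digits}f}"

# Made-up TTFA samples in milliseconds, for illustration only.
example_ttfas_ms = [150.0, 148.0, 152.0]
print(format_metric(example_ttfas_ms))  # prints "150 +/- 2"
```

Inside `bench_stream`, the same helper could be applied to the `ttfas` and `rtfs` lists before printing, with `digits=3` for RTF.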

Measured results (added to the README)

Mode	TTFA (ms)	RTF	ms/step
VoiceClone xvec	152 +/- 11	5.470 +/- 0.032	15.2 +/- 0.1
VoiceClone full ICL	149 +/- 1	5.497 +/- 0.026	15.2 +/- 0.1
CustomVoice	148 +/- 1	5.537 +/- 0.020	15.0 +/- 0.1
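
One way to read the table as a parity check is to compare, for each metric, the gap between the best and worst mode against the mean across modes. The sketch below does this with the mean values copied from the table. The `relative_range` helper and the 3% tolerance are assumptions chosen for illustration; the PR does not define a parity threshold.

```python
# Mean values copied from the results table above (spread columns omitted).
results = {
    "VoiceClone xvec":     {"ttfa_ms": 152.0, "rtf": 5.470, "ms_per_step": 15.2},
    "VoiceClone full ICL": {"ttfa_ms": 149.0, "rtf": 5.497, "ms_per_step": 15.2},
    "CustomVoice":         {"ttfa_ms": 148.0, "rtf": 5.537, "ms_per_step": 15.0},
}

def relative_range(values):
    """(max - min) / mean: the largest gap between modes as a fraction of the mean."""
    values = list(values)
    mean = sum(values) / len(values)
    return (max(values) - min(values)) / mean

# Hypothetical tolerance; the PR does not define a parity threshold.
TOLERANCE = 0.03

spreads = {
    metric: relative_range(row[metric] for row in results.values())
    for metric in ("ttfa_ms", "rtf", "ms_per_step")
}
for metric, spread in spreads.items():
    status = "within" if spread <= TOLERANCE else "outside"
    print(f"{metric:>12}: {spread:.1%} ({status} tolerance)")
```

With these numbers, the three modes differ by less than 3% on every metric, and by roughly 1% on RTF and ms/step. That is consistent with the shared decode stage dominating the runtime.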