[Ray Train] Including first-batch time in the benchmark for accurate throughput measurement

January 8, 2026 | Updated: January 8, 2026

PR link: ray-project/ray#59949 | Status: Merged | Changes: +4 / -4

Introduction

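When the Ray Train benchmark computed throughput, it excluded the time spent waiting for the first batch (iter_first_batch). As a result, with the preserve-order option enabled, the reported throughput came out unrealistically high even though the first batch took a very slow 67 seconds to arrive.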

Key code analysis

Before: iter_first_batch excluded

train_time = (
    metrics["train/dataset_creation_time"]
    + self._metrics["train/step"].get()
    # Exclude the time it takes to get the first batch.
    # + self._metrics["train/iter_first_batch"].get()
    + self._metrics["train/iter_batch"].get()
)

After: iter_first_batch included

train_time = (
    metrics["train/dataset_creation_time"]
    + self._metrics["train/step"].get()
    # Include the time it takes to get the first batch.
    + self._metrics["train/iter_first_batch"].get()
    + self._metrics["train/iter_batch"].get()
)

The same change was applied to the validation throughput calculation.
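To make the effect of that one uncommented line concrete, here is a minimal sketch, not the benchmark's actual code. It assumes throughput is simply rows divided by the summed time components. The `throughput` helper, the row count, and every timing except the 67-second first-batch wait are hypothetical values chosen for illustration.

```python
def throughput(num_rows, timings, include_first_batch=True):
    """Rows per second over the summed timing components (all in seconds).

    Hypothetical helper: mirrors the shape of the train_time sum above,
    but takes a plain dict instead of the benchmark's metric objects.
    """
    total = (
        timings["train/dataset_creation_time"]
        + timings["train/step"]
        + timings["train/iter_batch"]
    )
    if include_first_batch:
        # The term the PR adds back into the sum.
        total += timings["train/iter_first_batch"]
    return num_rows / total


# Illustrative timings; only the 67 s first-batch wait comes from the post.
timings = {
    "train/dataset_creation_time": 3.0,
    "train/step": 50.0,
    "train/iter_first_batch": 67.0,
    "train/iter_batch": 30.0,
}

old = throughput(12_000, timings, include_first_batch=False)  # 12000 / 83 s
new = throughput(12_000, timings, include_first_batch=True)   # 12000 / 150 s
print(f"excluding first batch: {old:.1f} rows/s")
print(f"including first batch: {new:.1f} rows/s")
```

With these made-up numbers, leaving out the first-batch wait inflates the reported throughput by roughly 80 percent, which is the kind of distortion the PR removes.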

Why this is a good change

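Accurate comparison: if iter_first_batch is 15 seconds without preserve-order and 67 seconds with it, excluding that time leads to the mistaken conclusion that preserve-order is actually faster.
Reflects the real user experience: from the user's point of view, the wait for the first batch is part of the total training time.
Benchmark reliability: a benchmark has to reflect real performance accurately before it can point optimization work in the right direction.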
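The ranking flip described in the first point can be sketched with a small comparison. Only the 15-second and 67-second first-batch times come from the post; the row count and the "rest" times (everything other than the first-batch wait) are hypothetical values picked so that preserve-order is slightly faster in steady state.

```python
ROWS = 100_000  # hypothetical number of rows processed per run

# first_batch values are from the post; "rest" values are assumptions.
runs = {
    "no preserve-order": {"first_batch": 15.0, "rest": 100.0},
    "preserve-order": {"first_batch": 67.0, "rest": 95.0},
}


def rows_per_sec(run, include_first_batch):
    """Throughput for one run, optionally counting the first-batch wait."""
    total = run["rest"] + (run["first_batch"] if include_first_batch else 0.0)
    return ROWS / total


def faster(include_first_batch):
    """Name of the configuration with the higher measured throughput."""
    return max(runs, key=lambda name: rows_per_sec(runs[name], include_first_batch))


print("first batch excluded:", faster(include_first_batch=False))
print("first batch included:", faster(include_first_batch=True))
```

Under these assumptions the excluded-time measurement picks preserve-order as the winner (100,000 / 95 s versus 100,000 / 100 s), while the included-time measurement picks the run without preserve-order (100,000 / 115 s versus 100,000 / 162 s).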

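The change touches only four lines, but it matters: it removes a distortion from the benchmark results and makes valid performance comparisons possible. It is also a good reminder that excluding a particular phase from a benchmark can skew the result.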

