[onnxruntime] ONNX Runtime CUDA MoE: 소규모 배치 디코딩을 위한 SoftmaxTopK 라우터 최적화

2026년 6월 12일수정: 2026년 6월 12일

PR 링크: microsoft/onnxruntime#29026 상태: Merged | 변경: +778 / -12

들어가며

최근 대규모 언어 모델(LLM) 분야에서는 Mixture-of-Experts (MoE) 아키텍처가 큰 주목을 받고 있습니다. MoE는 모델의 파라미터 수를 크게 늘리면서도 추론 시 활성화되는 파라미터 수를 제한하여 효율성을 높이는 장점이 있습니다. ONNX Runtime은 이러한 MoE 모델을 다양한 하드웨어에서 효율적으로 실행하기 위한 최적화를 지속적으로 진행하고 있습니다.

이번 글에서는 ONNX Runtime의 CUDA Provider에서 MoE 연산의 핵심 부분인 라우팅(routing) 단계, 특히 SoftmaxTopK 연산의 성능을 개선한 PR(#28980)에 대해 자세히 살펴보겠습니다. 이 PR은 소규모 배치(small-batch) 환경에서의 디코딩 성능 저하 문제를 해결하는 데 중점을 두고 있습니다. 기존 구현은 특정 조건에서 비효율적인 연산 패턴을 가지고 있었는데, 이를 최신 CUDA 기능을 활용하여 개선했습니다.

코드 변경사항 분석

이번 PR은 주로 onnxruntime/contrib_ops/cuda/moe/qmoe_kernels.cu 파일의 SoftmaxTopKKernel 함수와 관련된 최적화를 포함하고 있습니다. 또한, onnxruntime/contrib_ops/cuda/moe/moe.cc 및 onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc 파일에는 k 값에 대한 유효성 검사 로직이 추가되었습니다.

1. 유효성 검사 로직 추가 (`moe.cc`, `moe_quantization.cc`)

가장 먼저 눈에 띄는 변경은 MoE 및 QMoE 클래스의 ComputeInternal 함수에 k 값에 대한 유효성 검사가 추가된 것입니다. k는 각 입력 토큰이 라우팅될 상위 전문가(expert)의 수를 나타냅니다. 이 값은 반드시 0보다 커야 하고, 전체 전문가 수(num_experts)보다 작거나 같아야 합니다.

Before:

--- a/onnxruntime/contrib_ops/cuda/moe/moe.cc
+++ b/onnxruntime/contrib_ops/cuda/moe/moe.cc
@@ -55,6 +55,9 @@ Status MoE<T>::ComputeInternal(OpKernelContext* context) const {
       1,  //  no quantization so pack size is 1
       is_fused_swiglu,
       0));  // no block-wise quantization for regular MoE
+  ORT_RETURN_IF_NOT(k_ > 0 && k_ <= moe_params.num_experts,
+                    "MoE requires 0 < k <= num_experts, got k=", k_,
+                    " and num_experts=", moe_params.num_experts);
 
   using CudaT = typename OrtToCudaType<T>::type;

After: (위 diff와 동일)

  ORT_RETURN_IF_NOT(k_ > 0 && k_ <= moe_params.num_experts,
                    "MoE requires 0 < k <= num_experts, got k=", k_,
                    " and num_experts=", moe_params.num_experts);

이 변경은 런타임 중에 발생할 수 있는 잘못된 k 값으로 인한 잠재적인 오류를 방지하고, 연산의 안정성을 높입니다. 이는 직접적인 성능 개선은 아니지만, 견고한 소프트웨어 개발의 중요한 부분입니다.

2. `SoftmaxTopK` 커널 최적화 (`qmoe_kernels.cu`)

핵심적인 성능 개선은 qmoe_kernels.cu 파일에 구현된 SoftmaxTopK 연산에서 이루어졌습니다. 기존에는 SoftmaxTopKKernel이라는 단일 커널이 사용되었으나, 이번 PR에서는 CUDA의 최신 기능을 활용하여 두 가지 새로운 커널(SoftmaxTopKMergeKernel, SoftmaxTopKWarpBitonicKernel)을 도입하고, 기존 커널의 일부 로직을 개선했습니다.

2.1. 기존 `SoftmaxTopKKernel` 개선

기존 커널은 각 스레드가 행(row)의 일부 전문가(expert)를 처리하고, 이를 통해 Top-K를 찾는 방식이었습니다. 이번 PR에서는 수치적 안정성과 정확성을 높이기 위해 몇 가지 개선이 이루어졌습니다.

수치 안정성: Softmax 계산 시 max_val을 기준으로 빼주는 과정에서 -FLT_MAX 대신 onnxruntime::cuda::topk::kNegativeInfinity를 사용하고, expf 계산 시 inv_sum이 0보다 클 때만 연산을 수행하도록 SafeInvSum 함수를 도입했습니다.
정규화: normalize_scales 옵션이 활성화되었을 때, Top-K 스케일의 합이 kTopKNormalizeEpsilon (1e-6f)보다 클 경우에만 정규화를 수행하도록 조건을 추가했습니다. 이는 매우 작은 합으로 인한 오버플로우나 언더플로우를 방지합니다.

Before (부분 발췌):

--- a/onnxruntime/contrib_ops/cuda/moe/qmoe_kernels.cu
+++ b/onnxruntime/contrib_ops/cuda/moe/qmoe_kernels.cu
@@ -30,7 +31,7 @@
   int* row_indices = topk_indices + row * k;
 
   // 1. Find max for numerical stability
-  float max_val = -FLT_MAX;
+  float max_val = onnxruntime::cuda::topk::kNegativeInfinity;
   for (int i = 0; i < num_experts; ++i) {
     float val = static_cast<float>(row_logits[i]);
     if (val > max_val) max_val = val;
@@ -41,6 +42,7 @@
   for (int i = 0; i < num_experts; ++i) {
     sum_exp += expf(static_cast<float>(row_logits[i]) - max_val);
   }
+  const float inv_sum = SafeInvSum(sum_exp);
 
   // 3. Compute Softmax and find TopK
   // For small k, we can do a simple selection.
@@ -56,7 +58,7 @@
   }
 
   for (int i = 0; i < num_experts; ++i) {
-    float prob = expf(static_cast<float>(row_logits[i]) - max_val) / sum_exp;
+    float prob = SoftmaxScale(static_cast<float>(row_logits[i]), max_val, inv_sum);
 
     // Insert into top-k logic
     // Simple insertion sort for very small k (e.g. k=2)
@@ -80,7 +82,7 @@
     for (int i = 0; i < k; ++i) {
       scale_sum += row_scales[i];
     }
-    if (scale_sum > 1e-6f) {
+    if (scale_sum > kTopKNormalizeEpsilon) {
       for (int i = 0; i < k; ++i) {
         row_scales[i] /= scale_sum;
       }

After: (위 diff와 동일)

2.2. `SoftmaxTopKMergeKernel` 도입

이 커널은 CUB (CUDA UnBound) 라이브러리의 BlockMergeSort를 활용하여 Top-K 연산을 수행합니다. BlockMergeSort는 블록 내 스레드들이 협력하여 데이터를 정렬하는 데 효율적입니다. 특히, 각 스레드가 kItemsPerThread 만큼의 아이템을 처리하고, 이를 블록 단위로 정렬하는 방식입니다. 이는 전문가 수가 중간 정도(예: 32 ~ 1024)일 때 기존 방식보다 훨씬 빠를 수 있습니다.

핵심 아이디어: 각 스레드가 여러 개의 (logit, expert_index) 쌍을 로드하고, 블록 내에서 이들을 정렬한 후 상위 k개를 선택합니다. 이를 위해 cub::BlockMergeSort와 cub::BlockReduce를 사용합니다.
PackStableSortKey: 로그 확률(logit)과 전문가 인덱스(expert index)를 결합하여 정렬 키를 생성합니다. 이는 동일한 로그 확률을 가진 경우, 낮은 전문가 인덱스를 우선하도록 하여 안정적인 정렬을 보장합니다.
성능: 이 방식은 많은 수의 전문가를 처리해야 할 때, 각 스레드가 더 많은 작업을 병렬로 처리하고 효율적인 정렬 알고리즘을 사용하므로 성능 향상을 기대할 수 있습니다.

새로운 코드 (부분 발췌):

--- a/onnxruntime/contrib_ops/cuda/moe/qmoe_kernels.cu
+++ b/onnxruntime/contrib_ops/cuda/moe/qmoe_kernels.cu
@@ -121,6 +121,11 @@
   return (num_elements + block_size - 1) / block_size;
 }
 
+constexpr float kTopKNormalizeEpsilon = 1e-6f;
+
+__device__ __forceinline__ float SoftmaxScale(float logit, float max_val, float inv_sum) {
+  return (inv_sum > 0.0f) ? expf(logit - max_val) * inv_sum : 0.0f;
+}
+__device__ __forceinline__ float SafeInvSum(float sum) {
+  return (sum > 0.0f) ? (1.0f / sum) : 0.0f;
+}
+
+__device__ __forceinline__ float TopKNormalizeDenom(bool normalize_scales, float scale_sum) {
+  return (normalize_scales && scale_sum > kTopKNormalizeEpsilon) ? scale_sum : 1.0f;
+}
+
+__device__ __forceinline__ float WarpReduceMax(float value) {
+  constexpr int kWarpSize = onnxruntime::cuda::topk::kWarpSize;
+#pragma unroll
+  for (int offset = kWarpSize / 2; offset > 0; offset >>= 1) {
+    value = fmaxf(value, __shfl_xor_sync(0xFFFFFFFF, value, offset));
+  }
+  return value;
+}
+
+__device__ __forceinline__ float WarpReduceSum(float value) {
+  constexpr int kWarpSize = onnxruntime::cuda::topk::kWarpSize;
+#pragma unroll
+  for (int offset = kWarpSize / 2; offset > 0; offset >>= 1) {
+    value += __shfl_xor_sync(0xFFFFFFFF, value, offset);
+  }
+  return value;
+}
+
+template <typename BlockReduce>
+__device__ __forceinline__ float BlockReduceMax(float value, typename BlockReduce::TempStorage& temp_storage) {
+#if CUDART_VERSION >= 12090
+  return BlockReduce(temp_storage).Reduce(value, ::cuda::maximum());
+#else
+  return BlockReduce(temp_storage).Reduce(value, cub::Max());
+#endif
+}
+
+template <typename BlockReduce>
+__device__ __forceinline__ float BlockReduceSum(float value, typename BlockReduce::TempStorage& temp_storage) {
+#if CUDART_VERSION >= 12090
+  return BlockReduce(temp_storage).Reduce(value, ::cuda::std::plus());
+#else
+  return BlockReduce(temp_storage).Reduce(value, cub::Sum());
+#endif
+}
+
 template <typename T>
 __global__ void SoftmaxTopKKernel(const T* logits, float* topk_scales, int* topk_indices,
                                   int num_rows, int num_experts, int k, bool normalize_scales) {
@@ -195,6 +200,11 @@
   }
 }
 
+// Block-per-row softmax + top-k using a CUB block sort. Each block sorts one
+// row's logits (descending) and reads the first k. A full sort of 256 logits is
+// ~2.5x faster than k rounds of block-argmax on this size (benchmarked), and is
+// the layout onnxruntime-genai's top-k benchmarks also recommend (CUB block
+// merge) for sort sizes up to ~1024. The capacity (kBlockSize*kItemsPerThread) 
+// must be >= num_experts; padding lanes carry (-inf, INT_MAX) so valid -inf
+// expert scores sort ahead of padding. Tie-breaking matches the scalar kernel
+// (lower expert index first) via the same packed stable sort key used by the
+// warp merge path.
+
+template <typename T, int kBlockSize, int kItemsPerThread>
+__global__ void SoftmaxTopKMergeKernel(const T* logits, float* topk_scales, int* topk_indices,
+                                       int num_rows, int num_experts, int k, bool normalize_scales) {
+  const int row = blockIdx.x;
+  if (row >= num_rows) return;
+
+  const T* row_logits = logits + static_cast<size_t>(row) * num_experts;
+  const int tid = threadIdx.x;
+
+  using BlockMergeSort = cub::BlockMergeSort<uint64_t, kBlockSize, kItemsPerThread, cub::NullType>;
+  using BlockReduce = cub::BlockReduce<float, kBlockSize>;
+  __shared__ union {
+    typename BlockMergeSort::TempStorage merge;
+    typename BlockReduce::TempStorage reduce;
+  } temp;
+  __shared__ float s_topk[64];  // k <= 64
+  __shared__ float s_max;
+  __shared__ float s_sum;
+
+  // Load this thread's packed (logit, expert index) keys in a blocked
+  // arrangement: thread t owns indices [t*ipt, t*ipt+ipt).
+  uint64_t keys[kItemsPerThread];
+  float local_max = onnxruntime::cuda::topk::kNegativeInfinity;
+#pragma unroll
+  for (int j = 0; j < kItemsPerThread; ++j) {
+    const int idx = tid * kItemsPerThread + j;
+    const float logit = (idx < num_experts) ? static_cast<float>(row_logits[idx])
+                                            : onnxruntime::cuda::topk::kNegativeInfinity;
+    const int index = (idx < num_experts) ? idx : INT_MAX;
+    keys[j] = onnxruntime::cuda::topk::PackStableSortKey(logit, index);
+    local_max = fmaxf(local_max, logit);
+  }
+
+  // Softmax denominator over all experts (needed when normalize_scales is false; 
+  // when true it cancels in the top-k normalization but is still correct).
+  const float block_max = BlockReduceMax<BlockReduce>(local_max, temp.reduce);
+  if (tid == 0) s_max = block_max;
+  // Single barrier: publishes s_max to all threads and also separates the two
+  // BlockReduce uses that share temp.reduce.
+  __syncthreads();
+  const float max_val = s_max;
+
+  float local_sum = 0.0f;
+#pragma unroll
+  for (int j = 0; j < kItemsPerThread; ++j) {
+    const int idx = tid * kItemsPerThread + j;
+    if (idx < num_experts) {
+      local_sum += expf(onnxruntime::cuda::topk::UnpackStableSortScore(keys[j]) - max_val);
+    }
+  }
+  const float block_sum = BlockReduceSum<BlockReduce>(local_sum, temp.reduce);
+  if (tid == 0) s_sum = block_sum;
+  // Single barrier: publishes s_sum and separates temp.reduce from temp.merge.
+  __syncthreads();
+  const float inv_sum = SafeInvSum(s_sum);
+
+  // Sort packed (logit, index) keys descending. Result stays in a blocked
+  // layout, so sorted rank r lives in thread (r / ipt), item (r % ipt). Sort()
+  // leaves the sorted keys in each thread's registers and temp.merge is not
+  // reused afterwards, so no barrier is needed here; the shared s_topk writes
+  // below are published by the barrier that follows them.
+  BlockMergeSort(temp.merge).Sort(keys, onnxruntime::cuda::topk::Greater<uint64_t>());
+
+#pragma unroll
+  for (int j = 0; j < kItemsPerThread; ++j) {
+    const int rank = tid * kItemsPerThread + j;
+    if (rank < k) {
+      const uint64_t key = keys[j];
+      topk_indices[static_cast<size_t>(row) * k + rank] = 
+          onnxruntime::cuda::topk::UnpackStableSortIndex(key);
+      s_topk[rank] = SoftmaxScale(onnxruntime::cuda::topk::UnpackStableSortScore(key), max_val, inv_sum);
+    }
+  }
+  __syncthreads();
+
+  if (tid == 0) {
+    if (normalize_scales) {
+      float scale_sum = 0.0f;
+      for (int i = 0; i < k; ++i) scale_sum += s_topk[i];
+      const float denom = TopKNormalizeDenom(normalize_scales, scale_sum);
+      for (int i = 0; i < k; ++i) topk_scales[static_cast<size_t>(row) * k + i] = s_topk[i] / denom;
+    } else {
+      for (int i = 0; i < k; ++i) topk_scales[static_cast<size_t>(row) * k + i] = s_topk[i];
+    }
+  }
+}
+
+// Warp-bitonic softmax + top-k for num_experts <= 32. Each warp handles one
+// row, with lane `l` owning expert `l`. The whole softmax reduction and the
+// sort are done with warp shuffles (no shared memory). This is the fastest path
+// for tiny expert counts per the onnxruntime-genai top-k benchmark. Tie-breaking
+// (equal scores prefer the lower expert index) matches SoftmaxTopKMergeKernel.
+template <typename T, int kWarpsPerBlock>
+__global__ void SoftmaxTopKWarpBitonicKernel(const T* logits, float* topk_scales, int* topk_indices,
+                                             int num_rows, int num_experts, int k, bool normalize_scales) {
+  const int lane = threadIdx.x;
+  const int row = blockIdx.x * kWarpsPerBlock + threadIdx.y;
+  if (row >= num_rows) return;
+
+  const T* row_logits = logits + static_cast<size_t>(row) * num_experts;
+  const float logit = (lane < num_experts) ? static_cast<float>(row_logits[lane])
+                                           : onnxruntime::cuda::topk::kNegativeInfinity;
+
+  const float max_val = WarpReduceMax(logit);
+
+  // Warp-wide exp sum (softmax denominator over all experts).
+  const float sum_exp = WarpReduceSum((lane < num_experts) ? expf(logit - max_val) : 0.0f);
+  const float inv_sum = SafeInvSum(sum_exp);
+
+  // Sort (logit, expert index) descending; sorting by logit is equivalent to
+  // sorting by softmax probability since the mapping is monotonic.
+  float score = logit;
+  int index = (lane < num_experts) ? lane : INT_MAX;
+  onnxruntime::cuda::topk::WarpBitonicSortDescending(score, index);
+
+  // Lane r now holds the rank-r element. Compute the top-k probabilities.
+  float

2.3. `SoftmaxTopKWarpBitonicKernel` 도입

이 커널은 전문가 수가 매우 적은 경우 (예: num_experts <= 32)를 위해 설계되었습니다. 이 커널은 워프(warp) 수준의 연산(WarpReduceMax, WarpReduceSum, WarpBitonicSortDescending)만을 사용하여 Top-K를 계산합니다. 워프 내에서 모든 연산이 완료되므로 공유 메모리(shared memory) 접근이 없고, 스레드 간 통신 오버헤드가 최소화됩니다.

핵심 아이디어: 각 워프가 행(row)의 일부 전문가를 처리하며, 워프 내에서 모든 Softmax 계산과 정렬을 완료합니다. 이는 전문가 수가 적을 때 가장 효율적인 방식입니다.
WarpReduceMax, WarpReduceSum: 워프 내 스레드 간의 최대값 및 합계 계산을 위한 내장 함수(__shfl_xor_sync)를 활용합니다.
WarpBitonicSortDescending: 워프 내에서 비트onic 정렬을 수행하여 Top-K를 찾습니다.
성능: 전문가 수가 매우 적은 시나리오에서 병목 현상을 크게 줄여줍니다. 소규모 배치 디코딩 시, 각 토큰이 소수의 전문가에게만 라우팅될 가능성이 높으므로 이 최적화가 특히 효과적입니다.

새로운 코드 (부분 발췌): (위 SoftmaxTopKMergeKernel diff와 유사하게 SoftmaxTopKWarpBitonicKernel 관련 코드가 추가되었습니다.)

왜 이게 좋은가?

이번 PR의 핵심은 특정 사용 사례, 즉 소규모 배치(small-batch) 환경에서의 디코딩에 맞춰 SoftmaxTopK 연산을 최적화했다는 점입니다. MoE 모델은 일반적으로 배치 크기가 클 때 효율적이지만, 디코딩과 같이 배치 크기가 1 또는 매우 작은 경우에는 라우팅 단계에서 병목 현상이 발생할 수 있습니다.

상황별 최적화: SoftmaxTopKMergeKernel과 SoftmaxTopKWarpBitonicKernel은 전문가 수(num_experts)와 배치 크기(batch_size)에 따라 가장 효율적인 커널을 선택하여 사용할 수 있도록 합니다. 전문가 수가 적을 때는 워프 수준 연산이, 중간 정도일 때는 블록 정렬이, 많을 때는 기존 방식이나 더 최적화된 방식이 사용될 수 있습니다. 이는 다양한 시나리오에서 전반적인 성능을 향상시킵니다.
CUDA 기능 활용: 최신 CUDA 기능인 cub::BlockMergeSort, BlockReduce 및 워프 수준 통신(__shfl_xor_sync)을 적극적으로 활용하여 GPU 하드웨어의 성능을 최대한 끌어냈습니다. 특히, 공유 메모리 사용을 최소화하거나 제거함으로써 메모리 대역폭 병목 현상을 줄이고 스레드 간 통신 효율을 높였습니다.
수치 안정성 향상: Softmax 계산 및 스케일 정규화 과정에서 발생할 수 있는 부동 소수점 오류를 줄이기 위한 로직이 추가되었습니다. 이는 모델의 정확도를 유지하면서 성능을 개선하는 데 중요합니다.

PR 설명에는 구체적인 성능 수치가 포함되어 있지 않지만, 이러한 최적화는 특히 배치 크기가 작은 디코딩 시나리오에서 수 밀리초(ms)에서 수십 밀리초(ms)까지의 지연 시간(latency) 감소를 가져올 것으로 기대됩니다. 이는 실시간 응답이 중요한 LLM 애플리케이션에서 매우 의미 있는 개선입니다.

일반적인 교훈:

프로파일링의 중요성: 성능 병목 현상을 정확히 파악하고, 특정 사용 사례에 맞는 최적화 전략을 수립하는 것이 중요합니다.
하드웨어 특성 활용: GPU 아키텍처(워프, 블록, 공유 메모리 등)와 CUDA 라이브러리(CUB)의 특성을 잘 이해하고 활용하면 큰 성능 향상을 얻을 수 있습니다.
상황별 분기: 단일 솔루션이 모든 경우에 최적인 것은 아닙니다. 입력 크기, 데이터 분포 등 다양한 요인에 따라 최적의 알고리즘을 선택하는 것이 중요합니다.

결론

이번 ONNX Runtime PR은 MoE 모델의 라우팅 단계, 특히 SoftmaxTopK 연산의 성능을 소규모 배치 디코딩 시나리오에 맞춰 크게 개선했습니다. CUDA의 최신 기능을 활용한 새로운 커널 도입과 기존 로직의 개선을 통해, GPU 하드웨어의 잠재력을 최대한 발휘하고 모델의 추론 지연 시간을 단축하는 데 성공했습니다. 이러한 최적화는 ONNX Runtime이 LLM과 같은 복잡한 모델을 더욱 효율적으로 실행할 수 있도록 하는 데 기여하며, 앞으로도 지속적인 성능 개선이 기대됩니다.

참고 자료

⚠️ 알림: 이 분석은 AI가 실제 코드 diff를 기반으로 작성했습니다.

PR Analysis 의 다른글

이전글 [onnxruntime] WebGPU FlashAttention 최적화: 커널 퓨전과 가변 시퀀스 길이 지원으로 성능 극대화
현재글 : [onnxruntime] ONNX Runtime CUDA MoE: 소규모 배치 디코딩을 위한 SoftmaxTopK 라우터 최적화
다음글 [triton] Triton에서 i8 행렬 곱셈 최적화: 레지스터 압력 감소 및 성능 향상