[loki] Grafana Loki의 Shuffle Sharding 알고리즘 최적화: 성능 향상의 비결

2026년 6월 9일수정: 2026년 6월 9일

PR 링크: grafana/loki#22269 상태: Merged | 변경: +190 / -26

들어가며

분산 시스템에서 데이터 샤딩은 필수적인 요소입니다. Grafana Loki는 시계열 데이터를 효율적으로 저장하고 검색하기 위해 다양한 샤딩 전략을 사용하며, 그중 하나가 바로 ShuffleSharding입니다. 이 전략은 특정 키(예: 테넌트 ID 또는 세그멘테이션 키)를 기반으로 데이터를 여러 샤드로 분산시키는 역할을 합니다. 하지만 샤딩 과정에서 발생하는 계산 비용은 시스템의 전반적인 성능에 영향을 미칠 수 있습니다. 최근 Grafana Loki의 grafana/loki 레포지토리에서는 ShuffleSharding 알고리즘의 성능을 개선하기 위한 PR이 제출되었습니다. 이 글에서는 해당 PR의 코드 변경 사항을 상세히 분석하고, 왜 이러한 변경이 성능 향상으로 이어졌는지, 그리고 어떤 기술적 교훈을 얻을 수 있는지 살펴보겠습니다.

이 PR은 주로 ShuffleSharding 알고리즘이 파티션의 수(n)와 요청된 샤드의 수(k)에 따라 비효율적으로 동작하는 부분을 개선하는 데 초점을 맞추고 있습니다. 특히, k가 n에 가까울 때 발생하는 과도한 정렬 비용을 줄이는 것이 핵심 목표입니다.

코드 분석: 파일별 변경 사항

`pkg/distributor/rendezvous/shuffle_sharder.go`

이 파일은 ShuffleSharder의 핵심 로직을 담고 있으며, 이번 PR에서 가장 많은 변경이 이루어진 곳입니다.

1. `ShuffleShard` 함수의 조건부 로직 개선

기존에는 numShards가 0이거나 전체 파티션 수보다 클 경우, 전체 파티션 수를 사용하도록 처리했습니다. 새로운 코드에서는 numShards가 전체 파티션 수(n)와 같거나 클 경우, 단순히 원본 ShuffleSharder를 반환하여 불필요한 계산을 피합니다.

Before:

 func (r *ShuffleSharder) ShuffleShard(shuffleShardKey string, numShards int) *ShuffleSharder {
- 	if numShards == 0 || numShards > len(r.partitions) {
- 		numShards = len(r.partitions)
- 	}
+	 n := len(r.partitions)
+	 if numShards == 0 || numShards >= n {
+		 return r
+	 }
 
 	 key := crc64.Checksum([]byte(shuffleShardKey), table)
- 	scores := make([]partitionAndScore, len(r.partitions))
- 	for i, partition := range r.partitions {
- 		originalHash := r.hashes[i]
- 		scores[i] = partitionAndScore{
- 			partition,
- 			xorshiftMult64(key ^ originalHash),
- 			hash: originalHash,
- 		}
- 	}
 
- 	// This sort dominates the cost of shuffle sharding.
- 	// Possible future optimization - using a size-limited max heap to keep track of the k largest scores.
- 	sort.Slice(scores, func(i, j int) bool {
- 		return scores[i].score > scores[j].score
- 	})
 
- 	subpartitions := make([]int32, numShards)
- 	subHashesSet := make([]uint64, numShards)
- 	for i := 0; i < numShards; i++ {
- 		subpartitions[i] = scores[i].partition
- 		subHashesSet[i] = scores[i].hash // Avoid recalculating these hashes
- 	}
- 	return &ShuffleSharder{subpartitions, subHashesSet}
+ 
+	if numShards <= n-numShards {
+		// numShards <= n/2: select the top-numShards elements. O(n log numShards).
+		top := make([]indexAndScore, numShards)
+		selectTopKIndices(top, r.hashes, key, false)
+		partitions := make([]int32, numShards)
+		hashes := make([]uint64, numShards)
+		for i, s := range top {
+			partitions[i] = r.partitions[s.index]
+			hashes[i] = r.hashes[s.index]
+		}
+		return &ShuffleSharder{partitions, hashes}
+	}
+
+	// numShards > n/2: cheaper to find the bottom-(n-numShards) elements to exclude, then
+	// return everything else. O(n log (n-numShards)).
+	numExclude := n - numShards
+	bottom := make([]indexAndScore, numExclude)
+	selectTopKIndices(bottom, r.hashes, key, true)
+	excluded := make([]bool, n)
+	for _, b := range bottom {
+		excluded[b.index] = true
+	}
+	partitions := make([]int32, 0, numShards)
+	hashes := make([]uint64, 0, numShards)
+	for i := range n {
+		if !excluded[i] {
+			partitions = append(partitions, r.partitions[i])
+			hashes = append(hashes, r.hashes[i])
+		}
+	}
+	return &ShuffleSharder{partitions, hashes}
 }
 
 func (r *ShuffleSharder) Size() int {

2. `selectTopKIndices` 함수 도입 및 알고리즘 변경

기존 코드에서는 모든 파티션에 대한 스코어를 계산한 후, sort.Slice를 사용하여 전체를 정렬했습니다. 이 sort.Slice 호출이 O(N log N)의 시간 복잡도를 가지며, 특히 N이 클 때 성능 병목이 되었습니다. PR에서는 이 부분을 두 가지 경우로 나누어 최적화했습니다.

numShards <= n - numShards (즉, numShards <= n/2): 이 경우, 상위 numShards개의 요소를 찾는 것이 효율적입니다. 이를 위해 selectTopKIndices 함수를 사용하며, 이 함수는 크기가 numShards인 힙(heap)을 사용하여 O(N log k)의 시간 복잡도로 상위 k개의 요소를 찾습니다. 여기서 k는 numShards입니다.
numShards > n - numShards (즉, numShards > n/2): 이 경우, 상위 numShards개를 찾는 것보다 하위 n - numShards개를 찾아서 제외하는 것이 더 효율적입니다. selectTopKIndices 함수는 invertOrdering=true 옵션을 사용하여 하위 n-numShards개의 요소를 찾습니다. 이 또한 O(N log (n-k))의 시간 복잡도를 가집니다. 여기서 k는 n-numShards입니다.

selectTopKIndices 함수는 Go의 내장 container/heap 패키지를 사용하지 않고 직접 구현되었습니다. 이는 리뷰 댓글에서 언급된 것처럼, Go의 내장 힙 구현이 슬라이스를 힙에 할당할 때 메모리 누수(slice escaping to the heap)를 유발하여 성능 저하를 일으킬 수 있기 때문입니다. 직접 구현한 힙은 이러한 문제를 피하면서 O(N log k) 또는 O(N log (n-k))의 효율성을 유지합니다.

새로운 구조체 및 함수:

type indexAndScore struct {
	index int
	score uint64
}

// selectTopKIndices fills h (pre-allocated, len defines k) with the k items with the highest
// scores from hashes.
+// Pass invertOrdering=true to instead select the k lowest-scoring items.
+// Caller pre-allocates heap so its backing array doesn't escape to the heap. This is a significant
+// performance optimization - benchmarks are ~45% slower using built-in heap.
+func selectTopKIndices(heap []indexAndScore, hashes []uint64, key uint64, invertOrdering bool) {
+ k := len(heap)
+ // Put the first k elements into the heap
+ for i := range k {
+ heap[i] = indexAndScore{i, calculateScore(key, hashes[i], invertOrdering)}
+ }
+ // Make it into a valid heap
+ for i := k/2 - 1; i >= 0; i-- {
+ siftDown(heap, i)
+ }
+ // Push the rest of the elements into the heap
+ for i := k; i < len(hashes); i++ {
+ if score := calculateScore(key, hashes[i], invertOrdering); score > heap[0].score {
+ heap[0] = indexAndScore{i, score}
+ siftDown(heap, 0)
+ }
+ }
+}
+
+func calculateScore(key uint64, hash uint64, invertOrdering bool) uint64 {
+ score := xorshiftMult64(key ^ hash)
+ if invertOrdering {
+ score = score ^ (^uint64(0))
+ }
+ return score
+}
+
+func siftDown(h []indexAndScore, i int) {
+ n := len(h)
+ for {
+ smallest := i
+ if l := 2*i + 1; l < n && h[l].score < h[smallest].score {
+ smallest = l
+ }
+ if r := 2*i + 2; r < n && h[r].score < h[smallest].score {
+ smallest = r
+ }
+ if smallest == i {
+ break
+ }
+ h[i], h[smallest] = h[smallest], h[i]
+ i = smallest
+ }
+}
+
+// Original partitionAndScore struct removed as it's no longer used.
+// type partitionAndScore struct {
+// 	partition int32
+// 	score     uint64
+// 	hash      uint64
+// }
 
 // https://vigna.di.unimi.it/ftp/papers/xorshift.pdf

3. `ShuffleShard_MajorityPartitions` 테스트 추가

numShards > n/2인 경우의 로직을 검증하기 위한 새로운 테스트 케이스가 추가되었습니다. 이는 새로운 알고리즘이 올바르게 동작하는지 확인하는 데 중요합니다.

+func TestShuffleShard_MajorityPartitions(t *testing.T) {
+	// Exercises the max-heap (inverse) path where numShards > n/2.
+	partitions := []int32{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
+	s := NewShuffleSharder(partitions)
+	numShards := 8 // 8 > 10/2
+
+	result := s.ShuffleShard("some-key", numShards)
+	require.Len(t, result.partitions, numShards)
+
+	// Each partition in the result must come from the original set.
+	for _, p := range result.partitions {
+		assert.Contains(t, partitions, p)
+	}
+
+	// Over many shard keys, each partition should appear roughly equally.
+	counts := make(map[int32]int)
+	numKeys := 1000
+	for i := 0; i < numKeys; i++ {
+		sub := s.ShuffleShard(fmt.Sprintf("tenant-%d", i), numShards)
+		for _, p := range sub.partitions {
+			counts[p]++
+		}
+	}
+	expected := numKeys * numShards / len(partitions)
+	tolerance := float64(expected) * 0.3
+	for _, p := range partitions {
+		assert.InDelta(t, expected, counts[p], tolerance, "partition %d: appeared %d times, expected ~%d", p, counts[p], expected)
+	}
+}

왜 이게 좋은가: 성능 향상 및 일반적 교훈

성능 수치 분석

PR에서 제공된 벤치마크 결과는 이 최적화의 효과를 명확하게 보여줍니다.

Before:

BenchmarkShuffleShard/n=200,k1=200,k2=200-12         	   70806	     15539 ns/op
BenchmarkShuffleShard/n=200,k1=190,k2=180-12         	   78914	     15409 ns/op
BenchmarkShuffleShard/n=200,k1=20,k2=10-12           	  137419	      8435 ns/op
BenchmarkShuffleShard/n=200,k1=100,k2=50-12          	  105848	     11688 ns/op
BenchmarkShuffleShard/n=1000,k1=990,k2=980-12        	   10000	    115228 ns/op
BenchmarkShuffleShard/n=1000,k1=500,k2=250-12        	   16069	     73546 ns/op
BenchmarkShuffleShard/n=1000,k1=20,k2=10-12          	   24585	     48835 ns/op

After:

BenchmarkShuffleShard/n=200,k1=200,k2=200-12         	 5633262	       199.7 ns/op
BenchmarkShuffleShard/n=200,k1=190,k2=180-12         	  604243	      2005 ns/op
BenchmarkShuffleShard/n=200,k1=20,k2=10-12           	 1585666	       722.7 ns/op
BenchmarkShuffleShard/n=200,k1=100,k2=50-12          	  558837	      2198 ns/op
BenchmarkShuffleShard/n=1000,k1=990,k2=980-12        	  153055	      7991 ns/op
BenchmarkShuffleShard/n=1000,k1=500,k2=250-12        	   96634	     12535 ns/op
BenchmarkShuffleShard/n=1000,k1=20,k2=10-12          	  884674	      1399 ns/op

결과를 비교해보면, 특히 n이 크고 k가 n에 가까운 시나리오에서 엄청난 성능 향상이 있었습니다.

n=200, k1=200, k2=200의 경우: 15539 ns/op에서 199.7 ns/op으로 약 77배 빨라졌습니다.
n=1000, k1=990, k2=980의 경우: 115228 ns/op에서 7991 ns/op으로 약 14배 빨라졌습니다.

이러한 성능 향상은 기존의 O(N log N) 정렬 기반 접근 방식에서 O(N log k) 또는 O(N log (n-k))의 힙 기반 접근 방식으로 변경되었기 때문입니다. 특히 k가 n에 비해 작을 때, 또는 n-k가 n에 비해 작을 때 이득이 극대화됩니다.

일반적 교훈

문제 상황에 맞는 알고리즘 선택의 중요성: 모든 데이터를 정렬하는 것은 k개의 최댓값/최솟값을 찾는 데 있어 비효율적입니다. k개의 최댓값/최솟값을 찾는 문제는 힙(heap) 자료구조를 사용하여 훨씬 효율적으로 해결할 수 있습니다 (O(N log k)). 또한, k가 n의 절반보다 클 때는 하위 n-k개를 찾는 것이 더 효율적이라는 점을 간파한 것이 뛰어납니다.
성능 병목 식별 및 집중: PR 설명에서 "This sort dominates the cost of shuffle sharding."이라고 명시된 것처럼, 코드의 가장 비용이 많이 드는 부분을 정확히 식별하고 개선하는 것이 중요합니다. 벤치마크를 통해 이를 확인하고 개선 효과를 정량적으로 입증했습니다.
Go의 내장 자료구조 사용 시 성능 고려: Go의 container/heap 패키지는 편리하지만, 특정 상황(슬라이스 할당 시 메모리 누수 가능성)에서는 성능 저하를 유발할 수 있습니다. 리뷰어의 지적과 개발자의 실험 결과(~45% slower)는 이러한 미묘한 성능 차이를 인지하고, 필요하다면 직접 구현하는 것이 더 나은 선택일 수 있음을 보여줍니다. 이 PR에서는 이 점을 명확히 인지하고 직접 구현한 힙을 사용했으며, 그 이유를 코드 주석으로 남겨 다른 개발자들에게 정보를 전달했습니다.
다양한 시나리오 고려: numShards <= n/2와 numShards > n/2 두 가지 경우로 나누어 최적화한 것은, 입력 파라미터의 범위에 따라 최적의 알고리즘이 달라질 수 있음을 보여줍니다. 이는 일반적인 알고리즘 설계에서 중요한 고려 사항입니다.

References

Grafana Loki PR #22269
xxHash - 사용된 해싱 알고리즘
xorshift pseudo-random number generator - 사용된 난수 생성기
Go container/heap documentation - 비교 대상이 된 내장 힙 구현

참고 자료

⚠️ 알림: 이 분석은 AI가 실제 코드 diff를 기반으로 작성했습니다.

PR Analysis 의 다른글

이전글 [sglang] SGLang LTX-2 VAE 디코딩 성능 최적화: channels_last_3d 도입으로 4.5배 속도 향상
현재글 : [loki] Grafana Loki의 Shuffle Sharding 알고리즘 최적화: 성능 향상의 비결
다음글 [vllm] vLLM, DFlash 도입으로 추론 속도 1.2배 향상: MRV2와 CUDAGraph의 시너지

댓글

관련 포스트

PR Analysis 의 다른글