[triton] Reducing Overhead by Simplifying the Correlation Between Runtime and Metric in Proton

January 4, 2026 (updated: January 4, 2026)

PR link: triton-lang/triton#9132 | Status: Merged | Changes: +1073 / -884

Introduction

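A GPU profiler must minimize its impact (overhead) on kernel execution. Proton is Triton's built-in profiler, and collecting runtime metrics and updating its data structures sit on the critical path of profiling. This PR substantially redesigns the Data and Metric interfaces to remove unnecessary locking and redundant lookups.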

Key Code Analysis

Introducing the DataEntry Struct

Before:

// addOp and addMetric are separate calls, and each one needs its own lock
virtual size_t addOp(size_t scopeId, const std::string &opName) = 0;
virtual void addMetric(size_t scopeId, std::shared_ptr<Metric> metric) = 0;
// addOpAndMetric also existed as an optimization attempt, but it complicated the interface
virtual void addOpAndMetric(size_t scopeId, const std::string &opName,
                            std::shared_ptr<Metric> metric);

After:

struct DataEntry {
  size_t id{Scope::DummyScopeId};
  std::reference_wrapper<std::map<MetricKind, std::unique_ptr<Metric>>> metrics;

  void upsertMetric(std::unique_ptr<Metric> metric) {
    auto &metricsMap = metrics.get();
    auto it = metricsMap.find(metric->getKind());
    if (it == metricsMap.end()) {
      metricsMap.emplace(metric->getKind(), std::move(metric));
    } else {
      it->second->updateMetric(*metric);
    }
  }
};

virtual DataEntry addOp(const std::string &opName) = 0;
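
The insert-or-merge behavior of upsertMetric can be tried in isolation with a minimal sketch. The MetricKind values, the Metric class, and its summing merge rule below are hypothetical stand-ins chosen for illustration; Proton's real Metric types carry more state and define their own update semantics.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <memory>
#include <utility>

// Hypothetical stand-ins for Proton's types; the real classes carry more state.
enum class MetricKind { Kernel, Flexible };

class Metric {
public:
  Metric(MetricKind kind, double value) : kind(kind), value(value) {}
  MetricKind getKind() const { return kind; }
  double getValue() const { return value; }
  // Assumed merge rule for this sketch: accumulate by summing.
  void updateMetric(const Metric &other) { value += other.value; }

private:
  MetricKind kind;
  double value;
};

using MetricMap = std::map<MetricKind, std::unique_ptr<Metric>>;

struct DataEntry {
  std::size_t id;
  // Refers to a map owned elsewhere; the entry itself owns nothing.
  std::reference_wrapper<MetricMap> metrics;

  void upsertMetric(std::unique_ptr<Metric> metric) {
    auto &metricsMap = metrics.get();
    auto it = metricsMap.find(metric->getKind());
    if (it == metricsMap.end()) {
      // First metric of this kind: take ownership of it.
      metricsMap.emplace(metric->getKind(), std::move(metric));
    } else {
      // Kind already present: merge into the existing metric.
      it->second->updateMetric(*metric);
    }
  }
};

// Upserts two metrics of the same kind and reports (map size, merged value).
std::pair<std::size_t, double> demoUpsert() {
  MetricMap storage;
  DataEntry entry{1, std::ref(storage)};
  entry.upsertMetric(std::make_unique<Metric>(MetricKind::Kernel, 2.0));
  entry.upsertMetric(std::make_unique<Metric>(MetricKind::Kernel, 3.0));
  return {storage.size(), storage.at(MetricKind::Kernel)->getValue()};
}
```

Because the entry holds a std::reference_wrapper rather than the map itself, copying a DataEntry is cheap and every copy writes into the same underlying storage.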

Consolidating State into MetricKernelLaunchState

Before:

static thread_local void *tensorMetricKernel;
static thread_local void *scalarMetricKernel;
static thread_local void *metricKernelStream;

After:

struct MetricKernelLaunchState {
  MetricKernelLaunchConfig tensor{};
  MetricKernelLaunchConfig scalar{};
  void *stream{nullptr};
};
static thread_local MetricKernelLaunchState metricKernelLaunchState;
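
One practical effect of grouping the three thread-local values is that they can be checked and reset as a unit. The sketch below illustrates this under assumptions: the kernel field and the helper functions are hypothetical, and the field types are guesses; only the numThreads and sharedMemBytes names come from the PR description.

```cpp
#include <cassert>

// Hypothetical layout: only numThreads and sharedMemBytes are named in the
// article; the kernel field and all field types are assumptions.
struct MetricKernelLaunchConfig {
  void *kernel{nullptr};
  unsigned int numThreads{0};
  unsigned int sharedMemBytes{0};
};

struct MetricKernelLaunchState {
  MetricKernelLaunchConfig tensor{};
  MetricKernelLaunchConfig scalar{};
  void *stream{nullptr};
};

// Each thread gets its own independent copy of the whole state.
static thread_local MetricKernelLaunchState metricKernelLaunchState;

// Hypothetical helper: fill in all launch state for the calling thread.
void configure(void *tensorKernel, void *scalarKernel, void *stream) {
  metricKernelLaunchState.tensor.kernel = tensorKernel;
  metricKernelLaunchState.tensor.numThreads = 128;
  metricKernelLaunchState.scalar.kernel = scalarKernel;
  metricKernelLaunchState.scalar.numThreads = 1;
  metricKernelLaunchState.stream = stream;
}

// Hypothetical helper: true once both kernels have been set.
bool isConfigured() {
  return metricKernelLaunchState.tensor.kernel != nullptr &&
         metricKernelLaunchState.scalar.kernel != nullptr;
}

// Resetting is a single assignment instead of three separate ones.
void reset() { metricKernelLaunchState = MetricKernelLaunchState{}; }

// Configures, checks, resets, and checks again on the calling thread.
bool demoConfigureAndReset() {
  int tensorKernel = 0, scalarKernel = 0, stream = 0; // dummy addresses
  configure(&tensorKernel, &scalarKernel, &stream);
  bool configured = isConfigured();
  reset();
  return configured && !isConfigured() &&
         metricKernelLaunchState.stream == nullptr;
}
```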

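The key improvement is that addOp now returns a DataEntry, so the caller can add metrics to it directly. Previously, the caller obtained a scopeId from addOp and then passed it to addMetric, which had to look that scopeId up again; this meant two lock acquisitions and two hash lookups per operation. The new design also removes the clearCache() method, which eliminates the complexity of cache invalidation. Adding numThreads and sharedMemBytes to MetricKernelLaunchConfig makes launching the metric kernels more flexible.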
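
The cost difference between the two calling conventions can be made concrete with a toy model. ToyData below is not Proton's Data class; it is a hypothetical store that only counts lock acquisitions and id lookups, assuming each old-style call takes the lock once.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <mutex>
#include <string>
#include <unordered_map>

// Toy store, not Proton's Data class. It counts lock acquisitions and id
// lookups so the two calling conventions can be compared.
class ToyData {
public:
  using Metrics = std::map<std::string, double>;

  struct Entry {
    std::size_t id;
    std::reference_wrapper<Metrics> metrics;
  };

  // Old convention, step 1: lock, create the op node, hand back only its id.
  std::size_t addOpOld(const std::string &opName) {
    std::lock_guard<std::mutex> guard(mutex);
    ++lockCount;
    std::size_t id = nextId++;
    nodes[id].name = opName;
    return id;
  }

  // Old convention, step 2: lock again and find the node by id.
  void addMetricOld(std::size_t id, const std::string &metric, double value) {
    std::lock_guard<std::mutex> guard(mutex);
    ++lockCount;
    ++lookupCount;
    nodes.at(id).metrics[metric] += value;
  }

  // New convention: lock once and return a handle to the node's metrics.
  Entry addOpNew(const std::string &opName) {
    std::lock_guard<std::mutex> guard(mutex);
    ++lockCount;
    std::size_t id = nextId++;
    Node &node = nodes[id];
    node.name = opName;
    return Entry{id, std::ref(node.metrics)};
  }

  int lockCount = 0;
  int lookupCount = 0;

private:
  struct Node {
    std::string name;
    Metrics metrics;
  };
  std::mutex mutex;
  std::size_t nextId = 0;
  std::unordered_map<std::size_t, Node> nodes;
};

// Old path: two calls, so two lock acquisitions and one extra id lookup.
int locksWithOldConvention() {
  ToyData data;
  std::size_t id = data.addOpOld("matmul");
  data.addMetricOld(id, "time_ns", 100.0);
  return data.lockCount;
}

// New path: one call, then a direct write through the returned handle.
int locksWithNewConvention() {
  ToyData data;
  ToyData::Entry entry = data.addOpNew("matmul");
  entry.metrics.get()["time_ns"] += 100.0;
  return data.lockCount;
}
```

In this toy, the write through the handle happens after the lock is released, which is only safe if something else guarantees exclusive access to that node. How the real implementation ensures this is not covered here.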

Summary

The PR redesigns Proton's Data/Metric interfaces around DataEntry to remove the double locking and double lookup, and it consolidates thread-local state into the MetricKernelLaunchState struct, reducing profiling overhead.

References

triton-lang/triton#9132

This analysis was written by an AI based on the actual code diff.
