AI Infrastructure / Validation Project

Inference Runtime Regression Gate

A lightweight CI validation system for inference-style workloads. It benchmarks baseline and candidate runs, tracks p50/p95/p99 latency and throughput, and blocks meaningful p95 latency regressions before release.

Inference runtime regression gate overview

Key Features

  • Compares baseline and candidate inference-style benchmark runs
  • Tracks p50, p95, and p99 end-to-end latency
  • Measures throughput in requests per second
  • Runs in GitHub Actions for push, PR, and manual validation
  • Includes controlled failure detection for reproducible regression testing

Regression Policy

  • Compares candidate p95 latency against baseline p95 latency
  • Fails when candidate p95 regresses by more than 10 percent
  • Requires at least 1.0 ms absolute regression to avoid runner noise
  • Separates normal pass validation from controlled failure detection
  • Uses JSON configuration so thresholds can be tuned without changing code
Regression gate benchmark results

Latest GitHub Actions Results

  • Default gate passed with 0.290 ms absolute p95 change
  • Manual pass gate passed with 0.021 ms absolute p95 change
  • Controlled regression demo detected a 1.259 ms p95 regression
  • All workflows completed successfully with expected validation behavior
Regression Gate Output
=== Default Regression Gate ===

Workflow: push / pull request validation
Stage: end_to_end_ms
Metric: p95_ms

Baseline throughput:   233.24 req/s
Candidate throughput:  219.38 req/s

Baseline p50:          4.260 ms
Candidate p50:         4.537 ms

Baseline p95:          4.357 ms
Candidate p95:         4.648 ms
Change:                6.66%
Absolute change:       0.290 ms
Percent threshold:     10.00%
Minimum floor:         1.000 ms
Status:                PASS

=== Manual Pass Gate ===
Workflow: manual passing-path validation

Baseline p95:          4.362 ms
Candidate p95:         4.384 ms
Change:                0.49%
Absolute change:       0.021 ms
Percent threshold:     10.00%
Minimum floor:         1.000 ms
Status:                PASS

=== Controlled Regression Detection ===
Workflow: manual failure-path validation

Baseline throughput:   231.46 req/s
Candidate throughput:  157.55 req/s

Baseline p95:          5.257 ms
Candidate p95:         6.516 ms
Change:                23.95%
Absolute change:       1.259 ms
Percent threshold:     10.00%
Minimum floor:         1.000 ms
Checker status:        FAIL
Workflow result:       PASS

Controlled regression was detected as expected.

Highlights

  • Built baseline vs candidate benchmark flow for inference-style runtime validation
  • Measured p50, p95, p99 latency and throughput for each candidate run
  • Implemented configurable regression thresholds using percent and absolute latency floors
  • Added controlled regression detection to prove the gate catches meaningful p95 regressions
  • Integrated validation into GitHub Actions for push, PR, and manual workflow checks
  • Added smoke tests to validate benchmark result structure before trusting outputs
  • Documented passing and failure-path behavior with reproducible benchmark results

Tech

Python PyTorch GitHub Actions CI/CD JSON Config Latency Benchmarking
Performance icon

Performance Validation

  • Compare candidate runs against baseline runs
  • Track p95 latency as the primary regression signal
  • Use absolute floors to reduce CI runner noise
  • Block meaningful latency regressions before release
Learning icon

What I Learned

  • Performance gates need noise tolerance
  • Percent thresholds alone can be too sensitive
  • Controlled failures make CI behavior easier to verify
  • Functional correctness and performance validation solve different problems
Tech stack icon

Tech Stack

  • Python benchmark and comparison scripts
  • PyTorch synthetic inference-style workload
  • GitHub Actions workflow validation
  • JSON-driven regression threshold configuration