AI Infrastructure / Validation Project
Inference Runtime Regression Gate
A lightweight CI validation system for inference-style workloads. It benchmarks baseline and candidate runs, tracks p50/p95/p99 latency and throughput, and blocks meaningful p95 latency regressions before release.
Key Features
- Compares baseline and candidate inference-style benchmark runs
- Tracks p50, p95, and p99 end-to-end latency
- Measures throughput in requests per second
- Runs in GitHub Actions for push, PR, and manual validation
- Includes controlled failure detection for reproducible regression testing
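The latency and throughput metrics listed above could be produced along these lines. This is a minimal sketch, not the project's actual benchmark script: the function names `summarize` and `benchmark`, the result field names, and the use of `time.perf_counter` around a plain callable are all assumptions for illustration.

```python
import statistics
import time


def summarize(latencies_ms, wall_time_s):
    """Reduce per-request latencies to the metrics the gate tracks."""
    # quantiles(n=100) returns the 99 cut points p1..p99 (index 0 is p1).
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
        "throughput_rps": len(latencies_ms) / wall_time_s,
    }


def benchmark(workload, requests=200):
    """Time `workload` once per request and summarize the whole run.

    `workload` is a hypothetical zero-argument callable standing in for
    one inference-style request.
    """
    latencies_ms = []
    run_start = time.perf_counter()
    for _ in range(requests):
        start = time.perf_counter()
        workload()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    wall_time_s = time.perf_counter() - run_start
    return summarize(latencies_ms, wall_time_s)
```

Running `benchmark` once for the baseline and once for the candidate yields two result dictionaries that a comparison step can consume.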
Regression Policy
- Compares candidate p95 latency against baseline p95 latency
- Fails only when candidate p95 regresses by more than 10 percent relative to baseline
- Also requires at least 1.0 ms of absolute regression before failing, so small shifts from CI runner noise do not trip the gate
- Separates normal pass validation from controlled failure detection
- Uses JSON configuration so thresholds can be tuned without changing code
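The policy above can be sketched as a small comparison function. This is an illustration under stated assumptions, not the project's checker: the config keys (`metric`, `max_regression_pct`, `min_absolute_regression_ms`) and the function name `check_regression` are hypothetical, and the sketch assumes both conditions must hold for a failure.

```python
import json

# Hypothetical config layout; the real JSON keys may differ.
CONFIG_JSON = """
{
  "metric": "p95_ms",
  "max_regression_pct": 10.0,
  "min_absolute_regression_ms": 1.0
}
"""


def check_regression(baseline, candidate, config):
    """Compare one metric and apply the percent threshold plus absolute floor."""
    metric = config["metric"]
    delta_ms = candidate[metric] - baseline[metric]
    change_pct = delta_ms / baseline[metric] * 100.0
    # Fail only when BOTH the relative and the absolute limits are exceeded.
    failed = (
        change_pct > config["max_regression_pct"]
        and delta_ms >= config["min_absolute_regression_ms"]
    )
    return {
        "change_pct": change_pct,
        "absolute_change_ms": delta_ms,
        "status": "FAIL" if failed else "PASS",
    }


config = json.loads(CONFIG_JSON)
# p95 values taken from the reported default and controlled-regression runs.
default_run = check_regression({"p95_ms": 4.357}, {"p95_ms": 4.648}, config)
controlled_run = check_regression({"p95_ms": 5.257}, {"p95_ms": 6.516}, config)
```

With the reported p95 values, `default_run` has status PASS and `controlled_run` has status FAIL. Because the thresholds live in the JSON document, tuning them does not require touching the comparison code.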
Latest GitHub Actions Results
- Default gate passed with 0.290 ms absolute p95 change
- Manual pass gate passed with 0.021 ms absolute p95 change
- Controlled regression demo detected a 1.259 ms p95 regression
- All three workflows completed successfully; in the controlled regression demo the checker reported FAIL, which is the expected outcome for that workflow
Regression Gate Output
=== Default Regression Gate ===
Workflow: push / pull request validation
Stage: end_to_end_ms
Metric: p95_ms
Baseline throughput: 233.24 req/s
Candidate throughput: 219.38 req/s
Baseline p50: 4.260 ms
Candidate p50: 4.537 ms
Baseline p95: 4.357 ms
Candidate p95: 4.648 ms
Change: 6.66%
Absolute change: 0.290 ms
Percent threshold: 10.00%
Minimum floor: 1.000 ms
Status: PASS
=== Manual Pass Gate ===
Workflow: manual passing-path validation
Baseline p95: 4.362 ms
Candidate p95: 4.384 ms
Change: 0.49%
Absolute change: 0.021 ms
Percent threshold: 10.00%
Minimum floor: 1.000 ms
Status: PASS
=== Controlled Regression Detection ===
Workflow: manual failure-path validation
Baseline throughput: 231.46 req/s
Candidate throughput: 157.55 req/s
Baseline p95: 5.257 ms
Candidate p95: 6.516 ms
Change: 23.95%
Absolute change: 1.259 ms
Percent threshold: 10.00%
Minimum floor: 1.000 ms
Checker status: FAIL
Workflow result: PASS
Controlled regression was detected as expected.
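In the controlled regression run, the checker reports FAIL while the workflow itself reports PASS. One way to get that behavior is to invert the exit code when a failure is the expected outcome. The sketch below assumes a hypothetical `expect_regression` switch and function name; the project's workflows may wire this differently.

```python
def workflow_exit_code(checker_status, expect_regression=False):
    """Map the checker's PASS/FAIL status to a process exit code.

    In normal mode a FAIL blocks the workflow (non-zero exit).
    In controlled-regression mode the roles flip: the workflow succeeds
    only if the checker actually caught the injected regression.
    """
    if expect_regression:
        return 0 if checker_status == "FAIL" else 1
    return 0 if checker_status == "PASS" else 1


# Normal gate: a passing checker lets the workflow pass.
normal_code = workflow_exit_code("PASS")
# Controlled demo: a failing checker is the expected, successful outcome.
demo_code = workflow_exit_code("FAIL", expect_regression=True)
```

Keeping the two modes separate means a gate that silently stops detecting regressions shows up as a failed demo workflow rather than going unnoticed.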
Highlights
- Built baseline vs candidate benchmark flow for inference-style runtime validation
- Measured p50, p95, and p99 latency and throughput for each baseline and candidate run
- Implemented configurable regression thresholds using percent and absolute latency floors
- Added controlled regression detection to prove the gate catches meaningful p95 regressions
- Integrated validation into GitHub Actions for push, PR, and manual workflow checks
- Added smoke tests to validate benchmark result structure before trusting outputs
- Documented passing and failure-path behavior with reproducible benchmark results
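A smoke test of the benchmark result structure, as mentioned above, might look like the following. The required field names and the `validate_result` helper are assumptions for illustration; the actual result schema is not shown in this write-up.

```python
import math

# Hypothetical field names for a benchmark result dictionary.
REQUIRED_FIELDS = ("p50_ms", "p95_ms", "p99_ms", "throughput_rps")


def validate_result(result):
    """Return a list of problems; an empty list means the result is usable."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = result.get(field)
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{field}: missing or not a number")
        elif not math.isfinite(value) or value <= 0:
            problems.append(f"{field}: must be finite and positive")
    # Only check ordering once every field is known to be a valid number.
    if not problems and not (
        result["p50_ms"] <= result["p95_ms"] <= result["p99_ms"]
    ):
        problems.append("percentiles out of order: expected p50 <= p95 <= p99")
    return problems
```

Running a check like this before the comparison step keeps a malformed or partial benchmark file from producing a misleading PASS.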
Tech
- Python
- PyTorch
- GitHub Actions
- CI/CD
- JSON Config
- Latency Benchmarking
Performance Validation
- Compare candidate runs against baseline runs
- Track p95 latency as the primary regression signal
- Use absolute floors to reduce CI runner noise
- Block meaningful latency regressions before release
What I Learned
- Performance gates need noise tolerance
- Percent thresholds alone can be too sensitive
- Controlled failures make CI behavior easier to verify
- Functional correctness and performance validation solve different problems
Tech Stack
- Python benchmark and comparison scripts
- PyTorch synthetic inference-style workload
- GitHub Actions workflow validation
- JSON-driven regression threshold configuration