Systems / Performance Project
Lock-Free Ring Buffer Benchmark
A lock-free single-producer, single-consumer ring buffer in C++ with a comprehensive benchmark suite for analyzing contention strategies, queueing behavior, and latency tradeoffs.
Key Features
- Lock-free SPSC implementation using atomics
- Four wait strategies for contention analysis
- High-resolution latency metrics
- CPU affinity support for consistent scheduling
- Capacity sweep for backpressure effects
Architecture
- Fixed-size circular buffer backed by contiguous storage
- Separate producer and consumer indices for SPSC ownership
- Acquire and release memory ordering protects data visibility
- Cache-line alignment reduces false sharing between hot atomics
- Benchmark harness is separated from queue implementation logic
How It Works
- Producer writes an item into the current tail slot
- Producer release-stores the new tail so the consumer can see the write
- Consumer acquire-loads the tail before reading from the buffer
- Consumer reads from the current head slot and advances the head
- Failed pushes mean the buffer is full, while failed pops mean it is empty
Benchmark Output
=== Throughput: Wait Strategy Comparison === Benchmark msgs/s(M) fail_push fail_pop ----------------------------------------------------------------- Spin 16.41 19 1921676 Yield 27.38 204 32985 SpinThenYield 24.55 1499 551720 ExponentialBackoff 20.21 2645 327247 === Latency: Wait Strategy Comparison === Benchmark avg(ns) p50 p95 p99 max ------------------------------------------------------------------------------- Spin 93846.92 92441.60 106425.20 115458.20 134788.60 Yield 78351.52 77100.00 88166.40 96366.60 139583.20 SpinThenYield 92853.83 92175.00 100424.80 110566.40 128641.80 ExponentialBackoff 98410.68 95683.60 120675.00 131950.00 174291.60 === Throughput: Capacity Sweep === Capacity msgs/s(M) fail_push fail_pop ----------------------------------------------------------------- 64 16.54 24904 1261885 256 17.12 18212 813192 1024 22.49 1261 694032 8192 27.24 3653 207194 === Latency: Capacity Sweep === Capacity avg(ns) p50 p95 p99 max ------------------------------------------------------------------------------- 64 6313.10 5966.80 7533.20 8774.80 75908.40 256 23229.12 25991.20 25991.80 36441.60 40866.60 1024 92822.79 92116.80 99266.60 116816.80 148625.00 8192 765703.84 741950.00 934133.40 1067608.60 1311267.00
Highlights
- Lock-free SPSC queue built with C++20 atomics
- Benchmarks compare Spin, Yield, SpinThenYield, and ExponentialBackoff strategies
- Measures throughput, failed push/pop attempts, and tail latency percentiles
- Uses CPU affinity to reduce scheduling noise during benchmark runs
- Explores how queue capacity affects throughput and tail latency
- Demonstrates acquire/release synchronization in a real producer-consumer system
- Separates benchmark logic, wait strategies, formatting, and queue implementation
Tech
C++20
Atomics
Threads
CMake
Linux
Links
Performance Analysis
- Compare wait strategies under contention
- Measure throughput and tail latency
- Understand CPU scheduling impact
- Identify tradeoffs between latency and throughput
What I Learned
- Yielding improves throughput under contention
- Larger buffers increase latency due to queueing
- Backoff strategies are not always beneficial
- Thread affinity exposes true system behavior
Tech Stack
- C++20 (atomics, threads, chrono)
- CMake
- Linux (sched_setaffinity)
- Custom benchmark harness