# Benchmarks
Performance benchmarks comparing loon data structures against standard library alternatives.
| Component  | Best Latency | Peak Throughput | vs std::         |
|------------|--------------|-----------------|------------------|
| RingBuffer | 0.95 ns      | 2.1G ops/s      | 3.2x faster      |
| SPSC Queue | 2.40 ns      | 832M ops/s      | 18.7x faster     |
| LRU Cache  | 7.8 ns       | 130M ops/s      | O(1) ops         |
| Redis List | 0.33 ns      | 560M ops/s      | 22x vs std::list |
## RingBuffer vs std::queue

### Round-Trip Latency (push + pop)
| Message Size | RingBuffer | std::queue | Speedup |
|--------------|------------|------------|---------|
| 16 bytes     | 1.9 ns     | 6.1 ns     | 3.2x    |
| 64 bytes     | 2.3 ns     | 7.4 ns     | 3.2x    |
| 256 bytes    | 4.1 ns     | 12.8 ns    | 3.1x    |
### Throughput
| Message Size | RingBuffer | std::queue | Speedup |
|--------------|------------|------------|---------|
| 16 bytes     | 2.1G ops/s | 656M ops/s | 3.2x    |
| 64 bytes     | 1.7G ops/s | 541M ops/s | 3.1x    |
| 256 bytes    | 976M ops/s | 312M ops/s | 3.1x    |
## SPSC Queue vs Mutex Queue
Lock-free SPSC queue compared against a mutex-protected std::queue. Cache-line padding prevents false sharing between producer and consumer threads.
### Operation Latency
| Operation            | SpscQueue | MutexQueue | Speedup |
|----------------------|-----------|------------|---------|
| Interleaved push/pop | 2.40 ns   | 44.9 ns    | 18.7x   |
| Round-trip (16B)     | 2.42 ns   | 46.0 ns    | 19.0x   |
| Round-trip (64B)     | 4.21 ns   | 47.2 ns    | 11.2x   |
| Round-trip (256B)    | 13.3 ns   | 55.8 ns    | 4.2x    |
### Throughput
| Message Size | SpscQueue | MutexQueue | Speedup |
|--------------|-----------|------------|---------|
| 16 bytes     | 7.34 GiB/s | 719 MiB/s  | 10.5x   |
| 64 bytes     | 14.4 GiB/s | 2.74 GiB/s | 5.3x    |
| 256 bytes    | 24.7 GiB/s | 9.11 GiB/s | 2.7x    |
### Multi-threaded Producer/Consumer
Real-world scenario with separate producer and consumer threads:
| Items  | SpscQueue  | MutexQueue  | Speedup |
|--------|------------|-------------|---------|
| 1,024  | 103M ops/s | 22.3M ops/s | 4.6x    |
| 4,096  | 297M ops/s | 25.8M ops/s | 11.5x   |
| 32,768 | 485M ops/s | 24.9M ops/s | 19.5x   |
| 65,536 | 367M ops/s | 23.1M ops/s | 15.9x   |
## LRU Cache
O(1) cache operations with automatic eviction of least recently used items.
### Operation Latency
| Operation                   | Time    | Throughput |
|-----------------------------|---------|------------|
| get (hit)                   | 15.0 ns | 67M ops/s  |
| get (miss)                  | 10.6 ns | 95M ops/s  |
| put                         | 255 ns  | 4M ops/s   |
| exists                      | 7.8 ns  | 130M ops/s |
| Mixed (80% read, 20% write) | 68 ns   | 15M ops/s  |
### Value Size Impact
| Value Size | Time    | Throughput |
|------------|---------|------------|
| 16 bytes   | 13.6 ns | 1.1 GiB/s  |
| 64 bytes   | 14.2 ns | 4.2 GiB/s  |
| 256 bytes  | 16.3 ns | 14.7 GiB/s |
## LRU Cache vs std::unordered_map
Comparison with raw hash map (no LRU eviction):
| Operation | LRU Cache | unordered_map | Overhead |
|-----------|-----------|---------------|----------|
| get (hit) | 15.0 ns   | 7.5 ns        | 2x       |
| put       | 255 ns    | 49 ns         | 5x       |
| exists    | 7.8 ns    | 7.5 ns        | ~1x      |

| Operation        | Time   |
|------------------|--------|
| put (string key) | 448 ns |
| get (string key) | 108 ns |
## Redis List
Redis-compatible list with efficient O(1) operations at both ends.
### Operation Latency
| Operation            | Time    | Throughput |
|----------------------|---------|------------|
| lpush / rpush        | 4.4 ns  | 520M ops/s |
| lpop / rpop          | 28 ns   | 35M ops/s  |
| Interleaved push/pop | 3.6 ns  | 560M ops/s |
| lrange (10 elements) | 92 ns   | 109M ops/s |
| llen                 | 0.33 ns | 3B ops/s   |
### Value Size Impact
| Value Size | Time    | Throughput |
|------------|---------|------------|
| 16 bytes   | 3.8 ns  | 7.8 GiB/s  |
| 64 bytes   | 7.3 ns  | 16.3 GiB/s |
| 256 bytes  | 22.6 ns | 21.1 GiB/s |
## Redis List vs std::list
| Operation          | RedisList | std::list | Speedup |
|--------------------|-----------|-----------|---------|
| push (4096 items)  | 8.2 µs    | 564 µs    | 69x     |
| pop (4096 items)   | 11.8 µs   | 320 µs    | 27x     |
| Interleaved        | 3.6 ns    | 78.5 ns   | 22x     |
### Batch Operations
| Batch Size | lpop(n) | Throughput |
|------------|---------|------------|
| 1          | 394 µs  | 10M ops/s  |
| 10         | 44 µs   | 93M ops/s  |
| 100        | 12 µs   | 337M ops/s |
| 1000       | 5.2 µs  | 790M ops/s |
## Running Benchmarks

```sh
# Build with benchmarks enabled
make bench-build

# Run all benchmarks
make bench

# Or run directly with options
./build/Release/bench/loon_benchmarks

# Filter specific benchmarks
./build/Release/bench/loon_benchmarks --benchmark_filter="Spsc"

# Output to JSON for analysis
./build/Release/bench/loon_benchmarks --benchmark_format=json > results.json

# Longer runs for more stable results
./build/Release/bench/loon_benchmarks --benchmark_min_time=2s
```
## Test Environment
- CPU: Intel Core i7 @ 2.2 GHz (8 cores)
- Cache: L1 32 KiB, L2 256 KiB, L3 6 MiB shared
- Compiler: Apple Clang 14, C++23, -O3
- OS: macOS
## Why RingBuffer is Faster

- Stack allocation: Uses `std::array` with no heap allocations
- Contiguous memory: Cache-friendly access patterns
- No node overhead: Unlike `std::queue<T, std::deque<T>>`, which uses chunked storage
- Compile-time capacity: No runtime size checks or reallocation
## Why SPSC Queue is Faster

- Lock-free: Uses atomic operations instead of mutex locks
- No contention: Designed for exactly one producer and one consumer
- Cache-line padding: Producer and consumer indices are on separate cache lines (`alignas(64)`) to prevent false sharing
- Cached index optimization: Each thread caches the other thread's index locally, avoiding cross-cache-line reads on every operation
- Ever-increasing indices: No modulo on full/empty checks — pure subtraction
- Minimal synchronization: Only acquire/release memory ordering where needed
## Why LRU Cache is Fast

- Hash map lookup: O(1) key access via `std::unordered_map`
- Intrusive list: Recency updates move list nodes without allocation
- Single eviction: Only the LRU item is removed when full
- Reference semantics: `get` returns a reference wrapper, avoiding copies
## Why Redis List is Fast

- Deque backing: Uses `std::deque` for O(1) operations at both ends
- Cache-friendly: Contiguous chunk storage vs linked list nodes
- No allocation per element: Deque allocates in chunks
- Move semantics: Supports efficient moves for large values
## Understanding the Metrics

| Metric           | Description                             |
|------------------|-----------------------------------------|
| Time             | Wall-clock time per iteration (latency) |
| CPU              | CPU time per iteration                  |
| Iterations       | Number of benchmark runs                |
| items_per_second | Throughput in operations/sec            |
| bytes_per_second | Data throughput (for sized messages)    |