Picollo is a new library for serious performance work in .NET. Picollo stands for Performance Instrumentation and Continuous Observation for Low-Level Optimization. It allows you to measure code performance with cycle-level precision, moving the time resolution to the picosecond level.
The initial public release of Picollo v0.1.0 contains two components. First, a modern version of an HDR histogram, which is much faster for recording data and easier to use. It has a thread-local implementation that allows scaling metrics collection across many threads with minimal impact on performance of the existing code. Second, a set of APIs for `perf_event_open`, including fast-path reads, that give raw access to PMU counters from .NET on Linux, WSL included.
HDR Histogram
An HDR histogram is a very useful tool to analyze not only networking latency, but any measurements that behave like a floor value plus some random noise. It is a very good fit for measuring code execution time and CPU counters.
How it works
The main purpose of a histogram is to compress the storage size required for observation counters. This is true for both standard and HDR histograms. Compression comes at the cost of precision loss. A standard histogram usually keeps a bounded absolute precision, defined by a bin width. An HDR histogram maintains relative precision and can record a very large range of values with modest storage requirements.
A standard/legacy implementation has existed for many years and works well in many languages; you can find the details on its website. Here I focus on my implementation, which shares the core idea but differs in important details: it is much faster, easier to use, and much easier to understand.
Here is the full implementation of the core algorithm (GitHub):
```csharp
internal readonly struct HdrBuckets
{
    public readonly int BlockSize;
    public readonly int BlockScale;
    public readonly int BlockCount;

    public double RelativeError => 0.5 / BlockSize;

    public HdrBuckets(double relativeError = 0.001)
    {
        if (relativeError <= 0)
            relativeError = 0.001;
        else if (relativeError < 0.000_001)
            relativeError = 0.000_001;
        else if (relativeError > 0.1)
            relativeError = 0.1;

        BlockSize = (int)BitOperations.RoundUpToPowerOf2((uint)(0.5 / relativeError));
        BlockScale = BitOperations.TrailingZeroCount(BlockSize);
        BlockCount = 1 + 64 - BlockScale;
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public nuint GetLogicalIndexForValue(ulong value)
    {
        int blockIndex = 64 - BitOperations.LeadingZeroCount(value >> BlockScale);
        int stepScale = blockIndex - (blockIndex != 0 ? 1 : 0); // No branches, Roslyn recognizes it's just the result of !=
        ulong bucketIndexInBlock = (value >> stepScale) & ((1u << BlockScale) - 1);
        var index = (((nuint)(uint)blockIndex << BlockScale) + (nuint)bucketIndexInBlock);
        return index;
    }
}
```
It takes less than 10 important lines of code. Let’s review them.
Relative error or precision
It all starts with defining the maximum precision loss, or relative error, you can tolerate. Values between 0.1% and 1% are the most common. The code does some opinionated clamping of the provided value to keep it between 0.0001% and 10%. We will see later that this parameter matters most for the total counter storage size, so going below 0.1% only makes sense if your measurements are fundamentally that precise and you care about preserving that.
The relative error defines the block size. The value space is divided into blocks; each block has a fixed number of buckets, and every bucket in a block spans the same range of values, whose width I call a step. The step grows from block to block. The block size must be a power of 2 for efficient indexing. But why do we have 0.5 as the numerator?
The first block is somewhat special. The buckets start at value zero with step 1. Imagine that we have the first block as [0, 1, 2, 3]. We are going to double the step and the block range later, but it's hard to double a zero value, and the math does not work well without the second block that also has step 1: [4, 5, 6, 7]. Now we have 8 precise values without any error. The next block starts to quantize the values and looks like [8, 10, 12, 14], with a step of 2. So, for example, the value 11 belongs to the bucket [10, 12). This notation means 10 inclusive and 12 exclusive. Although we store integers, values such as time are infinitely divisible, so one can think of an integer as a computer artifact of rounding a real value down: 11.99 turns into 11.
The maximum relative error we get when recording with these 3 blocks is relative_error = 1/8, or 12.5%. Here 1 is step/2 and 8 is BlockSize*2. The first two blocks have no error; the third block starts to have some. The step is divided by 2 to get a midpoint: any value that falls into a bucket is within step/2 of the midpoint, so we can use the midpoint as the equivalent value for the bucket. If we observe the value 8, we record it into the bucket [8, 10) and then treat it as the value 9.
If we rewrite the formula for the first bucket in the third block as: relative_error = 1/(BlockSize*2), we get relative_error = 0.5/BlockSize. Then, after rearranging, we get BlockSize = 0.5/relative_error. The relative error defines the block size and the histogram layout remains fixed for the given precision.
Note the error is highest for the first bucket in a block. For the bucket [14, 16), the maximum error is 1/14, or 7.1%. If we add the next block [16, 20, 24, 28], with a step of 4, its first bucket has an error of 2/16 = 1/8, and its last bucket has an error of 2/28 = 1/14. These are exactly the same relative errors as we have seen for the previous block. The relative error remains bounded, while the absolute error doubles with each block.
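To make the layout concrete, here is a small illustrative sketch that mirrors the arithmetic described above (in Python for brevity, not the library's C# code), using the toy BlockSize = 4 from this section:

```python
# Toy reconstruction of the HDR layout from the text, with BlockSize = 4
# (i.e. BlockScale = 2). Not library code, just the arithmetic.
BLOCK_SCALE = 2
BLOCK_SIZE = 1 << BLOCK_SCALE

def bucket_bounds(value):
    block_index = (value >> BLOCK_SCALE).bit_length()
    step_scale = block_index - (1 if block_index != 0 else 0)
    lo = (value >> step_scale) << step_scale      # round down to the bucket start
    return lo, lo + (1 << step_scale)

# Blocks 0 and 1 are exact: [0,1,2,3] and [4,5,6,7], step 1.
assert bucket_bounds(5) == (5, 6)
# Block 2 quantizes with step 2: [8,10,12,14]; value 11 lands in [10, 12).
assert bucket_bounds(11) == (10, 12)
# Block 3 quantizes with step 4: [16,20,24,28].
assert bucket_bounds(17) == (16, 20)

# Once quantization starts (value >= 8; below that, buckets are exact),
# the midpoint error is bounded by 0.5 / BlockSize = 12.5%, and the bound
# is hit at the first bucket of each block.
worst = max(abs((lo + hi) / 2 - v) / v
            for v in range(8, 1 << 16)
            for (lo, hi) in [bucket_bounds(v)])
assert worst == 0.5 / BLOCK_SIZE   # 0.125
```

With the real default of relativeError = 0.001, BlockSize rounds up to 512 and the bound drops to 0.5/512 ≈ 0.098%.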
Bucket indexing
To calculate the linear bucket index in the histogram, we precompute the BlockScale, which is the exponent to which 2 must be raised to get the block size: BlockSize = 2^BlockScale.
To calculate the block, the value is first scaled down by the block size. Since BlockSize == 1 << BlockScale, shifting by BlockScale is integer division by BlockSize. However, BlockSize is not a compile-time constant, so real integer division would be very expensive, while the shift is among the fastest operations. Then 64 - LeadingZeroCount gives the bit width of the scaled value, and that bit width is the logarithmic block number. If the scaled value is zero, the result is zero, so values below BlockSize stay in block 0.
```csharp
int blockIndex = 64 - BitOperations.LeadingZeroCount(value >> BlockScale);
```
To calculate the bucket resolution for this block, we need blockIndex - 1 for all logarithmic blocks, but not for block 0. Without the correction, block 0 would get stepScale = -1, which is invalid. So the expression subtracts 1 only when blockIndex != 0. There is no branch instruction in this code and not even a cmov instruction. At the IL level, the blockIndex != 0 is the instruction cgt.un, which pushes either 1 or 0 onto the evaluation stack. The Roslyn compiler, not even the JIT, recognizes that the ternary operator returns the value that is already pushed onto the stack, so there is nothing else to do. The result of the comparison is directly used by the subsequent sub instruction.
```csharp
int stepScale = blockIndex - (blockIndex != 0 ? 1 : 0);
```

```
IL_0013: dup
IL_0014: dup
IL_0015: ldc.i4.0
IL_0016: cgt.un
IL_0018: sub
IL_0019: stloc.0 // stepScale
```
To calculate the bucket index inside the selected block, the original value is shifted right by stepScale. This applies the bucket resolution for that block: blocks 0 and 1 keep unit resolution, later blocks discard progressively more low bits. The mask BlockSize - 1 then keeps only the low BlockScale bits, which are the position inside the block.
```csharp
ulong bucketIndexInBlock = (value >> stepScale) & ((1u << BlockScale) - 1);
```
Now we have the block index and the index inside the block. The bucket index is just blockIndex*BlockSize + bucketIndexInBlock, with a shift by BlockScale used instead of multiplication by BlockSize.
```csharp
var index = (((nuint)(uint)blockIndex << BlockScale) + (nuint)bucketIndexInBlock);
```
The return type of nuint helps to improve codegen and avoid some wasteful instructions later.
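To experiment with the indexing, the four lines above can be ported directly. A Python sketch (illustrative, not the library API; Python's `int.bit_length` plays the role of `64 - LeadingZeroCount`), using BlockSize = 64, which is what WithRelativeError(0.01) produces:

```python
# Straight port of GetLogicalIndexForValue for experimentation.
# BlockSize = 64 corresponds to relativeError = 0.01: RoundUpToPowerOf2(0.5/0.01) = 64.
BLOCK_SCALE = 6                     # log2(BlockSize)
BLOCK_SIZE = 1 << BLOCK_SCALE

def logical_index(value):
    block_index = (value >> BLOCK_SCALE).bit_length()   # 64 - LeadingZeroCount
    step_scale = block_index - (1 if block_index else 0)
    bucket_in_block = (value >> step_scale) & (BLOCK_SIZE - 1)
    return (block_index << BLOCK_SCALE) + bucket_in_block

assert logical_index(0) == 0
assert logical_index(63) == 63        # block 0 is exact
assert logical_index(24_960) == 609   # the logical index 609 seen in the sample outputs below
# Nearby values collapse into the same bucket once quantization kicks in:
assert logical_index(24_900) == logical_index(25_000)
```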
Backing storage
You may be asking: if we care about reducing the backing storage size, why does the <10 LoC implementation above support the entire [0, UInt64.MaxValue] range? That is enough to measure 213 days in picoseconds or about 146 years in 4 GHz cycles!
The HdrBuckets implementation is responsible for virtual/logical layout that just exists for a given precision and is defined by the formulas above. If we know the range of observation values, we can compute the logical index of the minimum trackable value and store it as the first index offset. Then we can compute the logical index of the maximum trackable value, subtract the index of the minimum, add 1 to get the required storage length and allocate the storage. To get the storage index, we only need to adjust the logical index by the stored first index offset.
Note that the min/max trackable values are precise only to the bucket resolution: it’s a guaranteed range that can be slightly larger, not an exact one. If the exact range is required for application correctness, it should be enforced before recording a value. The purpose of the min/max trackable range is to reduce the storage space.
Instead of throwing on out-of-bounds access, there is a dedicated overflow counter in OverflowSlot that behaves like a normal counter. Throwing by default in this situation is not great. I have had many unpleasant moments with the legacy implementation where min/max assumptions differed between environments (Dev/Prod). Nobody would care about that until the next deployment, but an exception on a hot path just blocks everything. If one insists on errors, it’s trivial to check the overflow counter after recording a value and throw if it is not zero.
An extension method may be added to the library later, or a PR is welcome. Something like RecordInRange(ulong value, ulong min = 0, ulong max = ulong.MaxValue, bool throwOnOutOfRange = false) that uses the narrowest range from the histogram's instance limits and the method parameters. I'm also evaluating whether separating underflow and overflow is worth it and can be done at near-zero cost.
```csharp
internal HdrBuckets HdrBuckets;
internal UnsafeSpan<TCounter> Data;
internal TCounter OverflowSlot;
protected nuint _firstIndexOffset;

ctor()
{
    _firstIndexOffset = HdrBuckets.GetLogicalIndexForValue(MinTrackableValue);

    var lastLogicalIndex = (nuint)Math.Min((long)HdrBuckets.GetLogicalIndexForValue(MaxTrackableValue),
        HdrBuckets.BlockSize * HdrBuckets.BlockCount - 1);
    var storageSize = (int)(lastLogicalIndex + 1 - _firstIndexOffset);
    Data = new UnsafeSpan<TCounter>(new TCounter[storageSize]);
}

internal ref TCounter GetRef(ulong value)
{
    var storageIndex = HdrBuckets.GetLogicalIndexForValue(value) - _firstIndexOffset;
    if (storageIndex >= (nuint)Data.LongCount)
        return ref OverflowSlot;

    return ref Data.GetAtUnsafe(storageIndex);
}
```
The UnsafeSpan<TCounter> is a Span<T>-like structure that can be stored as a field and provides faster array indexing, even with bounds checks. Like Span<T>, the struct has a local length field that makes it possible to skip touching the Length slot inside the array object, a load operation that can be expensive in a very hot path. I believe that bounds checks are expensive not because of the branches, but because of touching the length slot of the array. The performance of the safe indexer vs. unsafe access without bounds checks in UnsafeSpan<TCounter> is basically identical, but using an array directly makes the code visibly slower.
The TCounter : unmanaged, IBinaryInteger<TCounter>, IUnsignedNumber<TCounter> is either uint or ulong and defines the storage type. It’s not related to the value range, but to the maximum counter value a single bucket can hold. It’s rare that ulong is required, but the implementation does not check for overflows. If a single bucket count can exceed 2 billion observations in theory, you should use ulong as the backing storage.
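The storage sizing is easy to check by hand. An illustrative sketch (Python, reusing the indexing arithmetic from earlier; the numbers correspond to the WithRelativeError(0.01), min 10,000, max 30,000 configuration used in the examples below):

```python
# Illustrative storage-size computation for relativeError = 0.01 (BlockSize = 64),
# min trackable 10,000 and max trackable 30,000. Not library code.
BLOCK_SCALE, BLOCK_SIZE = 6, 64

def logical_index(value):
    block_index = (value >> BLOCK_SCALE).bit_length()
    step_scale = block_index - (1 if block_index else 0)
    return (block_index << BLOCK_SCALE) + ((value >> step_scale) & (BLOCK_SIZE - 1))

first = logical_index(10_000)    # 526, stored as the first index offset
last = logical_index(30_000)     # 629
storage = last + 1 - first       # 104 buckets for the whole range

# With uint counters that is 104 * 4 = 416 bytes at ~0.78% relative error.
# Even the full 64-bit domain needs only (1 + 64 - 6) * 64 = 3,776 buckets, ~15 KB.
assert (first, last, storage) == (526, 629, 104)
assert (1 + 64 - BLOCK_SCALE) * BLOCK_SIZE == 3776
```

The storage index is the logical index minus the offset: value 24,960 has logical index 609 and storage index 609 - 526 = 83, which is exactly the "[83 / 609]" pair shown in the sample outputs below.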
Visualization
A dynamic widget is worth a thousand words. Here the entire HDR histogram layout for the value space 0-255 is shown. You can click the link below to open the widget in a new tab and play with the query string parameters. The visualization remains precise for values up to 2^53 (the JavaScript safe-integer limit) and seems to work even for larger values.
How to use
This section describes the public API and implementation details of the Picollo HDR histogram.
Summary
The entry point for a histogram is the factory Picollo.Metrics.HdrHistogram.Factory. From there, the usage is hopefully self-evident from IntelliSense. After the histogram is created, the main method is Record(value), which increments the counter for the value, accompanied by Record(value, count), which adds count to the counter. The other method that modifies recorded data is Reset(), which clears all counters.
A very useful method is HdrHistogramSummary GetSummary(HdrHistogramSummary? reuseInstance = null). The returned summary object has a predefined set of percentile ranks and aggregates such as total/overflow count, mean and the standard deviation. The summary has methods PrettyPrint/PrettyPrintDiff, with the outputs shown below. Frankly, this alone makes it worth it for me to reimplement the histogram, as I’m so tired of reinventing custom compact output or scanning a wall of text to find a specific percentile.
```csharp
public static void GettingStarted()
{
    var h =
        Picollo.Metrics.HdrHistogram.Factory
            .WithRelativeError(0.01)
            .WithUInt32Storage()
            .WithMinTrackableValue(10000)
            .WithMaxTrackableValue(30000)
            .Create();

    var rng = new Random(0);

    for (int i = 0; i < 1000_000; i++)
        h.Record(20_000 + (ulong)((0.5 - rng.NextDouble()) * 1000 + Math.Pow(rng.NextDouble(), 2) * 5000));

    h.Record(40000); // Outside the range

    // A summary has standard percentiles for ranks: [0, 1, 5, 10, 25, 50, 75, 90, 92.5, 95, 97.5, 99, 99.9, 99.99, 99.999, 100]
    // Plus total count, overflow count, mean, StDev, min/max trackable.
    // The percentiles such as summary.P99 have not only the value, but detailed info about the bucket where the percentile is found.
    var summary = h.GetSummary();
    Console.WriteLine($"P99: {summary.P99}");

    h.Reset();

    // Different distribution
    for (int i = 0; i < 2000_000; i++)
        h.Record(19_000 + (ulong)((0.5 - rng.NextDouble()) * 500 + Math.Pow(rng.NextDouble(), 3) * 10000));

    // h.GetSummary(summary); // Can avoid summary alloc and reuse in place
    var summary2 = h.GetSummary();
    summary.PrettyPrint("Histogram Before");
    summary.PrettyPrintDiff(summary2, "Getting Started Diff", "Before", "After");
}
```
The output is below. It does produce valid markdown tables, but the console output is also nice and more compact than rendered tables. The ± column shows the absolute precision for the percentile value. The D-value row at the end is the effect size computed as Cohen’s D. I believe the other columns and rows do not require explanation.
```
P99: P99=24,960 [83 / 609]: [24,832, 25,088) 14,190
##### Histogram Before
| Percentile | Value | ± | Count |
| :--------- | --------: | :----- | --------: |
| 0 | 20,096 | ±128 | 1 |
| 1 | 20,096 | ±128 | 10,000 |
| 5 | 20,096 | ±128 | 50,000 |
| 10 | 20,096 | ±128 | 100,000 |
| 25 | 20,352 | ±128 | 250,000 |
| 50 | 21,376 | ±128 | 500,000 |
| 75 | 22,912 | ±128 | 750,000 |
| 90 | 23,936 | ±128 | 900,000 |
| 92.5 | 24,192 | ±128 | 925,000 |
| 95 | 24,448 | ±128 | 950,000 |
| 97.5 | 24,704 | ±128 | 975,000 |
| 99 | 24,960 | ±128 | 990,000 |
| 99.9 | 25,472 | ±128 | 999,001 |
| 99.99 | 25,472 | ±128 | 999,900 |
| 99.999 | 25,472 | ±128 | 999,990 |
| 100 | 25,472 | ±128 | 1,000,000 |
| Overflow | | | 1 |
| | | | |
| Mean: | 21,696.54 | StDev: | 1,482.39 |
| Precision: | 0.7813% | Total: | 1,000,000 |
| Range Min: | 10,000 | Max: | 30,000 |

##### Getting Started Diff
| Percentile | Before | After | Δ% |
| :--------- | --------: | --------: | ------: |
| 0 | 20,096 | 19,072 | -5.1% |
| 1 | 20,096 | 19,072 | -5.1% |
| 5 | 20,096 | 19,072 | -5.1% |
| 10 | 20,096 | 19,072 | -5.1% |
| 25 | 20,352 | 19,328 | -5.0% |
| 50 | 21,376 | 20,352 | -4.8% |
| 75 | 22,912 | 23,168 | +1.1% |
| 90 | 23,936 | 26,240 | +9.6% |
| 92.5 | 24,192 | 27,008 | +11.6% |
| 95 | 24,448 | 27,520 | +12.6% |
| 97.5 | 24,704 | 28,288 | +14.5% |
| 99 | 24,960 | 28,800 | +15.4% |
| 99.9 | 25,472 | 29,056 | +14.1% |
| 99.99 | 25,472 | 29,312 | +15.1% |
| 99.999 | 25,472 | 29,312 | +15.1% |
| 100 | 25,472 | 29,312 | +15.1% |
| | | | |
| Mean: | 21,696.54 | 21,518.53 | -0.8% |
| StDev: | 1,482.39 | 2,822.16 | +90.4% |
| Precision: | 0.7813% | 0.7813% | 0.0% |
| Total: | 1,000,000 | 2,000,000 | +100.0% |
| D-value: | | | -0.07 |
```
Note that the rows with 92.5/97.5 may look ugly at first glance, as they stand out visually. However, they become very helpful for scanning the rows after just a few actual uses of the output. For me, they create anchors, or visual cues, saying: "Do not read me yet, you probably want the lines above, between, and below." I can focus on the Value column and still "feel" the rank of each row starting from P90 just from the shape of the first column in my peripheral vision. These ranks are also useful for a deeper understanding of the tail and are symmetrical to 25/75 for the last 10%.
Snapshots
The histogram supports snapshots that can be updated in place and do not allocate in steady-state operation. This feature is useful for getting a self-consistent state of a live histogram from a monitoring thread over some time interval.
The code below produces exactly the same result as above, but with a snapshot, without resetting the histogram. Avoiding the reset may be useful to monitor the metrics live and still have the totals by the end of some logical cycle, e.g. a day.
```csharp
public static void GettingStartedSnapshot()
{
    var h = Picollo.Metrics.HdrHistogram.Factory
        .WithRelativeError(0.01).WithUInt32Storage().WithMinTrackableValue(10000)
        .WithMaxTrackableValue(30000).Create();

    var rng = new Random(0);

    for (int i = 0; i < 1000_000; i++)
        h.Record(20_000 + (ulong)((0.5 - rng.NextDouble()) * 1000 + Math.Pow(rng.NextDouble(), 2) * 5000));

    h.Record(40000); // Outside the range

    var snapshot = h.GetSnapshot();
    var summary = snapshot.GetSummary();
    Console.WriteLine($"P99: {summary.P99}");

    // No reset: h.Reset();

    for (int i = 0; i < 2000_000; i++)
        h.Record(19_000 + (ulong)((0.5 - rng.NextDouble()) * 500 + Math.Pow(rng.NextDouble(), 3) * 10000));

    snapshot.Update(deltas: true); // The snapshot contains the deltas since the previous snapshot
    var summary2 = snapshot.GetSummary();

    summary.PrettyPrint("Histogram Before");
    summary.PrettyPrintDiff(summary2, "Getting Started Diff", "Before", "After");
}
```
The next snippet shows a more realistic example of a monitoring thread printing the summary of a live histogram every second.
```csharp
public static void GettingStartedMonitoring(int seconds = 10)
{
    var h = Picollo.Metrics.HdrHistogram.Factory
        .WithRelativeError(0.01).WithUInt64Storage().WithMinTrackableValue(10000)
        .WithMaxTrackableValue(30000).Create();

    var cts = new CancellationTokenSource(TimeSpan.FromSeconds(seconds));
    var rng = new Random(0);

    var writerTask = Task.Run(() => // No awaits here, so a plain lambda suffices
    {
        while (!cts.IsCancellationRequested)
        {
            var drift = (ulong)DateTime.UtcNow.TimeOfDay.TotalMilliseconds & 16383;
            h.Record(20_000 + (ulong)((0.5 - rng.NextDouble()) * 1000 + Math.Pow(rng.NextDouble(), 2) * drift));
        }
    });

    var monitoringTask = Task.Run(async () =>
    {
        var snapshot = h.GetSnapshot();
        var summary = snapshot.GetSummary();

        while (!cts.IsCancellationRequested)
        {
            await Task.Delay(1000);

            // The snapshot will contain the difference from the previous update, no allocation in steady-state
            snapshot.Update(deltas: true);

            snapshot.GetSummary(summary); // Reuse the summary object

            // Allocates a string, convenient for exploratory analysis, but for Prod summary should be used directly with existing zero-alloc logging
            var oneLineSummary = summary.ToString(nameof(GettingStartedMonitoring));
            Console.WriteLine($"{summary.TimestampUtc:HH:mm:ss.fff} | {oneLineSummary}");
        }
    });

    Task.WaitAll(writerTask, monitoringTask);
    h.GetSummary().PrettyPrint("Aggregate");
}
```
The last snippet produces the following output. Note that the data has a drift component that intentionally depends on time, so the distribution changes and produces overflows in this example.
```
14:06:02.381 | GettingStartedMonitoring: Total=10,195,966, Overflow=0, Mean=21,443.7, P0=20,096, P25=20,352, P50=21,120, P90=23,424, P95=23,936, P99=24,448, P999=24,960, P100=25,216
14:06:03.428 | GettingStartedMonitoring: Total=10,774,508, Overflow=0, Mean=21,785.5, P0=20,096, P25=20,352, P50=21,376, P90=24,192, P95=24,704, P99=25,472, P999=25,984, P100=26,240
14:06:04.437 | GettingStartedMonitoring: Total=10,334,588, Overflow=0, Mean=22,128.2, P0=20,096, P25=20,352, P50=21,632, P90=25,216, P95=25,728, P99=26,496, P999=26,752, P100=27,264
14:06:05.437 | GettingStartedMonitoring: Total=10,201,512, Overflow=0, Mean=22,461.9, P0=20,096, P25=20,608, P50=21,888, P90=25,984, P95=26,496, P99=27,264, P999=27,776, P100=28,288
14:06:06.439 | GettingStartedMonitoring: Total=10,365,553, Overflow=0, Mean=22,793.9, P0=20,096, P25=20,608, P50=22,144, P90=26,752, P95=27,520, P99=28,288, P999=28,800, P100=29,312
14:06:07.450 | GettingStartedMonitoring: Total=10,303,619, Overflow=131, Mean=23,128.4, P0=20,096, P25=20,608, P50=22,400, P90=27,520, P95=28,288, P99=29,312, P999=29,824, P100=30,080
14:06:08.465 | GettingStartedMonitoring: Total=10,217,114, Overflow=115,673, Mean=23,383.5, P0=20,096, P25=20,608, P50=22,656, P90=28,288, P95=29,056, P99=29,824, P999=30,080, P100=30,080
14:06:09.478 | GettingStartedMonitoring: Total=9,729,217, Overflow=524,463, Mean=23,422.9, P0=20,096, P25=20,608, P50=22,656, P90=28,288, P95=29,312, P99=30,080, P999=30,080, P100=30,080
14:06:10.479 | GettingStartedMonitoring: Total=9,187,004, Overflow=914,542, Mean=23,421.9, P0=20,096, P25=20,608, P50=22,656, P90=28,288, P95=29,312, P99=30,080, P999=30,080, P100=30,080
14:06:11.486 | GettingStartedMonitoring: Total=7,628,055, Overflow=1,073,055, Mean=23,423.0, P0=20,096, P25=20,608, P50=22,656, P90=28,288, P95=29,312, P99=30,080, P999=30,080, P100=30,080
##### Aggregate
| Percentile | Value | ± | Count |
| :--------- | --------: | :----- | ---------: |
| 0 | 20,096 | ±128 | 1 |
| 1 | 20,096 | ±128 | 989,974 |
| 5 | 20,096 | ±128 | 4,949,866 |
| 10 | 20,096 | ±128 | 9,899,731 |
| 25 | 20,608 | ±128 | 24,749,326 |
| 50 | 21,888 | ±128 | 49,498,651 |
| 75 | 24,192 | ±128 | 74,247,977 |
| 90 | 26,752 | ±128 | 89,097,572 |
| 92.5 | 27,264 | ±128 | 91,572,505 |
| 95 | 28,032 | ±128 | 94,047,437 |
| 97.5 | 29,056 | ±128 | 96,522,370 |
| 99 | 29,568 | ±128 | 98,007,329 |
| 99.9 | 30,080 | ±128 | 98,898,305 |
| 99.99 | 30,080 | ±128 | 98,987,403 |
| 99.999 | 30,080 | ±128 | 98,996,313 |
| 100 | 30,080 | ±128 | 98,997,302 |
| Overflow | | | 2,627,864 |
| | | | |
| Mean: | 22,704.70 | StDev: | 2,570.16 |
| Precision: | 0.7813% | Total: | 98,997,302 |
| Range Min: | 10,000 | Max: | 30,000 |
```
Percentiles
The full raw percentiles and bucket information are available via histogram properties that return enumerables: Buckets/BucketsWithValues/BucketPercentiles/BucketPercentilesWithValues.
```csharp
public static void GettingStartedPercentiles()
{
    var h = Picollo.Metrics.HdrHistogram.Factory
        .WithRelativeError(0.01).WithUInt32Storage().WithMinTrackableValue(10000)
        .WithMaxTrackableValue(30000).Create();

    var rng = new Random(0);

    for (int i = 0; i < 1000_000; i++)
        h.Record(20_000 + (ulong)((0.5 - rng.NextDouble()) * 1000 + Math.Pow(rng.NextDouble(), 2) * 5000));

    // Enumerables to get detailed bucket info
    // h.Buckets
    // h.BucketsWithValues
    // h.BucketPercentiles
    // h.BucketPercentilesWithValues

    foreach (var percentile in h.BucketPercentilesWithValues)
    {
        Console.WriteLine(percentile);
    }
}
```
With the output:
```
P9.1914=20,096 [64 / 590]: [19,968, 20,224) 183,827
P23.6734=20,352 [65 / 591]: [20,224, 20,480) 105,814
P33.271=20,608 [66 / 592]: [20,480, 20,736) 86,138
P40.836=20,864 [67 / 593]: [20,736, 20,992) 65,161
P46.8661=21,120 [68 / 594]: [20,992, 21,248) 55,441
P52.1414=21,376 [69 / 595]: [21,248, 21,504) 50,065
P56.9318=21,632 [70 / 596]: [21,504, 21,760) 45,744
P61.3189=21,888 [71 / 597]: [21,760, 22,016) 41,997
P65.393=22,144 [72 / 598]: [22,016, 22,272) 39,486
P69.2252=22,400 [73 / 599]: [22,272, 22,528) 37,157
P72.8369=22,656 [74 / 600]: [22,528, 22,784) 35,077
P76.2556=22,912 [75 / 601]: [22,784, 23,040) 33,297
P79.5267=23,168 [76 / 602]: [23,040, 23,296) 32,125
P82.6842=23,424 [77 / 603]: [23,296, 23,552) 31,025
P85.7174=23,680 [78 / 604]: [23,552, 23,808) 29,639
P88.6413=23,936 [79 / 605]: [23,808, 24,064) 28,840
P91.4852=24,192 [80 / 606]: [24,064, 24,320) 28,038
P94.2448=24,448 [81 / 607]: [24,320, 24,576) 27,153
P96.6624=24,704 [82 / 608]: [24,576, 24,832) 21,200
P98.4319=24,960 [83 / 609]: [24,832, 25,088) 14,190
P99.5121=25,216 [84 / 610]: [25,088, 25,344) 7,413
P99.9414=25,472 [85 / 611]: [25,344, 25,600) 1,173
```
To get your own set of percentiles, there are batch methods that compute multiple percentiles for given ranks and return internally consistent values. There are scalar methods that internally just call the batch methods, so they are not faster. The scalar methods may return values for different versions of the data if the histogram is actively updated.
```csharp
public static void GettingStartedCustomPercentiles()
{
    var h = Picollo.Metrics.HdrHistogram.Factory
        .WithRelativeError(0.01).WithUInt32Storage().WithMinTrackableValue(10000)
        .WithMaxTrackableValue(30000).Create();

    var rng = new Random(0);

    for (int i = 0; i < 1000_000; i++)
        h.Record(20_000 + (ulong)((0.5 - rng.NextDouble()) * 1000 + Math.Pow(rng.NextDouble(), 2) * 5000));

    ReadOnlySpan<double> ranks = [90.0, 91, 92, 93, 94, 95, 96, 97, 98];
    var percentiles = new Percentile[ranks.Length];
    var percentileValues = new ulong[ranks.Length];

    // Batch
    h.GetPercentiles(ranks, percentiles);
    h.GetPercentileValues(ranks, percentileValues);

    foreach (var percentile in percentiles)
    {
        Console.WriteLine(percentile);
    }

    // Scalar: each invocation calls the batch method with a single rank
    foreach (var rank in ranks)
    {
        var percentile = h.GetPercentile(rank);
        var percentileValue = h.GetPercentileValue(rank);
        Console.WriteLine($"{rank:N0}: {percentileValue:N0}");
    }
}
```
And the output:
```
P90=23,936 [79 / 605]: [23,808, 24,064) 28,840
P91=24,192 [80 / 606]: [24,064, 24,320) 28,038
P92=24,192 [80 / 606]: [24,064, 24,320) 28,038
P93=24,448 [81 / 607]: [24,320, 24,576) 27,153
P94=24,448 [81 / 607]: [24,320, 24,576) 27,153
P95=24,448 [81 / 607]: [24,320, 24,576) 27,153
P96=24,704 [82 / 608]: [24,576, 24,832) 21,200
P97=24,704 [82 / 608]: [24,576, 24,832) 21,200
P98=24,960 [83 / 609]: [24,832, 25,088) 14,190
90: 23,936
91: 24,192
92: 24,192
93: 24,448
94: 24,448
95: 24,448
```
Units and scopes
The HDR histogram in Picollo does not deal with units, such as stopwatch ticks or milli-/micro-/nano-/pico-seconds. Trying to add units makes the API confusing and error-prone. The value is always a 64-bit integer. Users are responsible for selecting the appropriate units.
However, to facilitate the common use case, there are 4 extension methods in Picollo that leverage the using pattern and record ticks/nanos/micros/millis over the lifetime of the scope:
```csharp
public static void GettingStartedScopes()
{
    var h = Picollo.Metrics.HdrHistogram.Factory
        .WithMaxTrackableValue(NanoScope.OneSecondValue * 2).Create();

    using (h.GetTickScope())
    {
        Thread.Sleep(1000);
    }

    using (h.GetNanoScope())
    {
        Thread.Sleep(1000);
    }

    using (h.GetMicroScope())
    {
        Thread.Sleep(1000);
    }

    using (h.GetMilliScope())
    {
        Thread.Sleep(1000);
    }

    h.GetSummary().PrettyPrint();
}
```
The histogram summary for this code is:
```
##### Histogram summary
| Percentile | Value | ± | Count |
| :--------- | -------------: | :------- | -------------: |
| 0 | 1,004 | ±0 | 1 | Millis
| 1 | 1,004 | ±0 | 1 |
| 5 | 1,004 | ±0 | 1 |
| 10 | 1,004 | ±0 | 1 |
| 25 | 1,004 | ±0 | 1 |
| 50 | 1,007,104 | ±512 | 2 | Micros
| 75 | 10,051,584 | ±8,192 | 3 | Ticks
| 90 | 1,010,302,976 | ±524,288 | 4 | Nanos
| 92.5 | 1,010,302,976 | ±524,288 | 4 |
| 95 | 1,010,302,976 | ±524,288 | 4 |
| 97.5 | 1,010,302,976 | ±524,288 | 4 |
| 99 | 1,010,302,976 | ±524,288 | 4 |
| 99.9 | 1,010,302,976 | ±524,288 | 4 |
| 99.99 | 1,010,302,976 | ±524,288 | 4 |
| 99.999 | 1,010,302,976 | ±524,288 | 4 |
| 100 | 1,010,302,976 | ±524,288 | 4 |
| | | | |
| Mean: | 255,340,667.00 | StDev: | 435,895,264.66 |
| Precision: | 0.0977% | Total: | 4 |
```
Under the hood, all scopes use Stopwatch.GetTimestamp and Stopwatch.Frequency to convert stopwatch ticks to time units. The stopwatch ticks correspond to clock_gettime(monotonic) on Linux. The GetTimestamp method gives the most precise time measurements when using the standard API.
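The scope pattern itself is simple to picture. Here is a minimal analogue as a Python context manager (illustrative only; the real scopes are C# disposable structs, and `record` here stands in for the histogram's Record method):

```python
import time
from contextlib import contextmanager

# Minimal analogue of a timing scope: measure elapsed time around a block and
# record it into a histogram-like sink on exit. `record` is a stand-in for h.Record.
@contextmanager
def nano_scope(record):
    start = time.perf_counter_ns()     # monotonic, like Stopwatch.GetTimestamp
    try:
        yield
    finally:
        record(time.perf_counter_ns() - start)

observations = []
with nano_scope(observations.append):
    sum(range(1000))                   # the code under measurement

assert len(observations) == 1 and observations[0] > 0
```

The design point is the same as in Picollo: the measurement is taken in the enclosing scope's exit path, so the measured block needs no timing code of its own.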
In the summary above, you can see that my modest daily-driver machine with an i5-12500 has ~10M ticks per second, or 100 ns per tick. This surprisingly matches the DateTime/TimeSpan tick resolution, so one could simply use the TimeSpan-based Stopwatch API, right!?
But 10 MHz is a ridiculously low frequency when modern CPUs, as modest as mine, run at 4+ GHz. The 10M comes not from a hardware limit, but from Windows. On that OS, .NET's Stopwatch uses the QueryPerformanceFrequency/QueryPerformanceCounter native methods. Microsoft explicitly says the QPC frequency is fixed at boot, but also that it often does not equal the actual hardware tick generator frequency; under hypervisor/reference-time paths, or in newer Windows configurations, it may be fixed at 10 MHz. The method docs link to an article, "Acquiring high-resolution time stamps", which is probably decades out of date if it still calls 10 MHz high resolution.
Let’s see if it is just a snarky emotional comment from my side or a deserved critique. Let’s measure the actual time resolution with this code:
```csharp
public static void GettingStartedClockResolution()
{
    var h = Picollo.Metrics.HdrHistogram.Factory.Create();

    for (long last = Stopwatch.GetTimestamp(), n = 0; n < 1000_000;)
    {
        long now;
        while ((now = Stopwatch.GetTimestamp()) == last) ;
        h.Record((ulong)(now - last));
        last = now;
        n++;
    }

    var summary = h.GetSummary();
    Console.WriteLine($"Stopwatch.Frequency: {Stopwatch.Frequency:N0}");
    Console.WriteLine(summary.P0);
    Console.WriteLine(summary.P50);
    Console.WriteLine(summary.P99);
    var meanMeasurableInNanos = summary.P50.Value * 1e9 / Stopwatch.Frequency;
    Console.WriteLine($"P50 measurable nanos: {meanMeasurableInNanos}");
}
```
On my Windows machine, the Frequency at least does not lie and I can observe single-tick changes:
```
Stopwatch.Frequency: 10,000,000
P0=1 [1 / 1]: [1, 2) 999,627
P50=1 [1 / 1]: [1, 2) 999,627
P99=1 [1 / 1]: [1, 2) 999,627
P50 measurable nanos: 100
```
If I run the same code on WSL on the same machine, you can see that I can never observe a single tick change, even at P0. But the advertised frequency is 1GHz, while the measurable time increment is 32 ns. This is 3 times better than the Windows host!
```
Stopwatch.Frequency: 1,000,000,000
P0=28 [28 / 28]: [28, 29) 9
P50=32 [32 / 32]: [32, 33) 554,723
P99=44 [44 / 44]: [44, 45) 2,422
P50 measurable nanos: 32
```
That measurement was done with turbo boost off, like every other measurement in this post. With turbo boost on, it is just 22 ns, almost 5 times better than on Windows. This difference indicates that the time resolution is limited by measurement overheads, such as P/Invoking the clock_gettime function.
```
Stopwatch.Frequency: 1,000,000,000
P0=19 [19 / 19]: [19, 20) 106
P50=22 [22 / 22]: [22, 23) 542,157
P99=30 [30 / 30]: [30, 31) 2,957
P50 measurable nanos: 22
```
Besides reminding us that Linux is better for serious work even in WSL, this highlights an important fact: .NET's TimeSpan can easily lose precision when used via the Stopwatch.GetElapsedTime method or the Stopwatch.Elapsed property.
That is why it is better to avoid unit conversion and record the most precise raw data available. If you want to record nanoseconds, do not use stopwatch.Elapsed.TotalNanoseconds, which is rounded to 100 ns because TimeSpan stores 100 ns ticks. Instead, use at least the manual conversion stopwatch.ElapsedTicks * 1_000_000_000 / Stopwatch.Frequency. Or, even better, use Picollo's NanoScope with a histogram, as in using(h.GetNanoScope()) {...}, the standalone NanoScope.TicksToNanos() static helper, or the extension property provided by Picollo:
```csharp
extension(Stopwatch stopwatch)
{
    public ulong ElapsedNanos => NanoScope.TicksToNanos(stopwatch.ElapsedTicks);
}
```
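A side note on the manual conversion above: ElapsedTicks * 1_000_000_000 overflows a signed 64-bit value once a stopwatch has run for roughly nine seconds at a 1 GHz tick frequency. I do not know how NanoScope.TicksToNanos handles this internally; a minimal sketch of an overflow-safe conversion (the explicit frequency parameter is mine, for testability) can widen to 128 bits first:

```csharp
using System;
using System.Diagnostics;

// Hypothetical stand-in for NanoScope.TicksToNanos: widen to UInt128 so
// ticks * 1_000_000_000 cannot overflow before the division.
static ulong TicksToNanos(long ticks, long frequency) =>
    (ulong)((UInt128)ticks * 1_000_000_000 / (UInt128)frequency);

// 10 QPC ticks at the Windows 10 MHz frequency are 1,000 ns:
Console.WriteLine(TicksToNanos(10, 10_000_000)); // 1000

// Convert a live Stopwatch reading without the 100 ns TimeSpan rounding:
var sw = Stopwatch.StartNew();
Console.WriteLine($"{TicksToNanos(sw.ElapsedTicks, Stopwatch.Frequency)} ns");
```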
Consistency and concurrency
There are two significantly different use cases for the histogram. One is exploratory quick-and-dirty analysis to feel the data and optimize code. The other is monitoring of production systems. In both cases, multi-threaded access to the histogram is normal. The implementation in Picollo should generally work well for the single-writer case without any additional configuration. For multiple writers, there are interlocked and thread-local implementations of the histogram.
For production monitoring, the thread-local version is the way to go. It is much faster than the interlocked one, at the cost of allocating backing storage per thread, plus accumulator storage. And by faster, I really do mean significantly faster. Interlocked operations may look fine in single-threaded or uncontended benchmarks, but when operations start to collide, they wreak havoc and may impact the surrounding code well beyond noise level.
For exploratory analysis, you probably should not care, unless you have multiple writers, or you use Reset from a monitoring thread concurrently with a single writer.
Reset old+1 problem
The Reset method on a histogram clears the storage. For the default non-concurrent implementation, it does something similar to the Span&lt;T&gt;.Clear method. If Reset is called concurrently with a writer on a non-concurrent histogram, the writer can write an old+1 value after Reset has already set the old value to 0. This is much worse than merely missing an event by overwriting a recent value from another writer. The latter is probably barely noticeable and can only make the tail slightly fatter, since collisions are likely on hot, typical counters. The former makes the entire histogram unusable after Reset: a single counter holding an old+1 value will be surrounded by zeros.
Consider the following code, where the main thread multiplies a static field by a small constant, 1.0001, while another thread flips the value in the field between +1 and -1 and instantly reads the result back.
```csharp
private static double _value;

public static void DetectStaleMultiplyStore(int seconds = 2)
{
    double maxBadSeen = 0;
    double expectedSign = 1;
    int bad = 0;

    long iterations = 1_000_000;
    var rounds = 0;
    Console.Write($"Running for {seconds} seconds");
    var cts = new CancellationTokenSource(TimeSpan.FromSeconds(seconds));

    var mre = new ManualResetEvent(false);

    Task cleaner = Task.Factory.StartNew(() =>
    {
        mre.Set();

        var sw = Stopwatch.StartNew();
        var elapsed = sw.Elapsed;
        while (!cts.IsCancellationRequested)
        {
            rounds++;
            for (long i = 0; i < iterations; i++)
            {
                expectedSign = -expectedSign;

                _value = expectedSign;
                var seen = Volatile.Read(ref _value);

                if ((seen < 0) != (expectedSign < 0))
                {
                    if (Math.Abs(seen) > maxBadSeen)
                        maxBadSeen = Math.Abs(seen);
                    bad++;
                }
            }

            if (sw.Elapsed - elapsed >= TimeSpan.FromSeconds(1))
            {
                Console.Write(".");
                elapsed = sw.Elapsed;
            }
        }
    }, TaskCreationOptions.LongRunning);

    mre.WaitOne();

    while (!cts.IsCancellationRequested)
    {
        var c = 1.0001;
        for (long i = 0; i < iterations; i++)
        {
            _value *= c; // load, multiply, store
        }
    }

    cleaner.Wait();
    Console.WriteLine();
    Console.WriteLine($"Iterations: {rounds * iterations:N0}");
    Console.WriteLine($"Bad signs: {bad:N0}");
    Console.WriteLine($"Max bad seen: {maxBadSeen:N2}");
}
```
After just one second, the typical result is that roughly every ~100 operations the “cleaner” thread sees a sign different from the one it has just written.
The operation _value *= c is a load, multiply, store sequence. In the non-concurrent histogram, a counter++ operation is load, add, store. The same race will eventually happen if Reset is used concurrently with writes.
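The old+1 interleaving itself can be replayed deterministically in a single thread by writing the three steps of counter++ out by hand, with the clear squeezed between the load and the store. This is a toy sketch, not Picollo code:

```csharp
using System;

long[] counts = { 40, 0, 0, 0 };  // a toy histogram with one hot bucket

// Writer: counts[0]++ is really three steps.
long loaded = counts[0];          // 1) load the old count (40)
Array.Clear(counts);              // 2) a concurrent Reset clears the storage
counts[0] = loaded + 1;           // 3) store old+1

Console.WriteLine(counts[0]);     // 41: a stale count surrounded by zeros
```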
Optimistic retry
The Reset method is problematic not only because of the old+1 race, which is unsolvable for the default implementation without synchronizing the writes (that is what the interlocked implementation is for). Reset can also leave the histogram in a half-cleared state as seen by a concurrent reader, rendering the histogram useless.
To avoid this internal inconsistency, all read methods take a temporary snapshot, with an optimistic retry on the Version field. The Reset method is implemented as:
```csharp
public override void Reset()
{
    lock (this)
    {
        Interlocked.Increment(ref Version);
        try
        {
            Clear();
            Interlocked.Increment(ref ResetCount);
        }
        finally
        {
            Interlocked.Increment(ref Version);
        }
    }
}
```
while all read methods contain the following block:
```csharp
var spinner = new SpinWait();
while (true)
{
    var version = Volatile.Read(ref Version);
    if ((version & 1) != 0) // While inside Reset, the version is odd
    {
        spinner.SpinOnce();
        continue;
    }

    // Do work

    if (version == Volatile.Read(ref Version)) // No new Reset started while doing the work
        return instance;

    spinner.Reset();
}
```
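For reference, the same versioned-snapshot pattern in a compact, self-contained form. The SeqlockBox below is my own illustration, not a Picollo type (and it glosses over the finer memory-ordering points around the payload): the writer bumps the version to odd, mutates, then bumps it back to even; readers retry until they observe the same even version before and after taking their snapshot.

```csharp
using System;
using System.Threading;

var box = new SeqlockBox();
box.Write(42);
Console.WriteLine(box.Read()); // 42

class SeqlockBox
{
    private long _version;  // odd while a write is in progress
    private double _value;

    public void Write(double v)
    {
        Interlocked.Increment(ref _version); // odd: readers back off
        _value = v;
        Interlocked.Increment(ref _version); // even: snapshots are valid again
    }

    public double Read()
    {
        var spinner = new SpinWait();
        while (true)
        {
            long v0 = Volatile.Read(ref _version);
            if ((v0 & 1) != 0) { spinner.SpinOnce(); continue; } // mid-write

            double snapshot = _value; // the "do work" part

            if (Volatile.Read(ref _version) == v0)
                return snapshot;      // no write overlapped the copy

            spinner.Reset();
        }
    }
}
```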
This is a standard optimistic retry pattern. In Picollo, it is used even for the non-concurrent implementation, to protect reads from structural updates and guarantee internal data consistency. But the main purpose of this retry loop is the thread-local implementation, which reuses the base implementation for the per-thread and accumulator storage. On the read path, this implementation is quite similar to .NET's ThreadLocal&lt;T&gt; class. However, it manages the local instances and their lifetimes differently, and allows fast allocation-free iteration over the local instances.
The thread-local implementation itself probably deserves a dedicated post, but it is out of scope here. However, I invite you to challenge the implementation and try to find failure modes, or to fundamentally improve the Reset behavior. I reviewed it several times over multiple days and asked the top 3 AIs to find issues. At this stage, it seems correct, though maybe not optimal.
Benchmarks
The design of the Picollo HDR histogram focuses on very fast, non-intrusive writes and good-enough, low-allocation reads, with zero-allocation steady-state paths for the normal usage of a single monitoring/reader thread. That is, writes are as fast as possible, even if that requires making reads slower (mainly due to conservative snapshotting).
The benchmarks were run on an i5-12500, on Windows and WSL, with turbo off, at 3 GHz. Tiered PGO is enabled. The code is here.
The shared benchmark setup is as follows:
```csharp
public class HdrHistogramBenches
{
    private static readonly long MaxValue = 7716549600; // The legacy default MaxTrackable used in their tests
    private static readonly int RandomPower = 3; // Skews values down
    private static readonly int SignificantDigits = 3; // Affects the footprint much more than max value
    private static readonly bool UseDoublePrecision = true; // Legacy allocates 2x more slots than needed to guarantee midpoint precision

    private static readonly int Rounds = 200;

    private static readonly long[] Values = InitValues();

    private static long[] InitValues()
    {
        Percentile.DefaultEquivalentValueSelection = EquivalentValueSelection.Midpoint;

        var count = 1_000_000;
        var values = new long[count];
        Random random = new Random(42);
        for (int i = 0; i < count; i++)
        {
            values[i] = (long)(Math.Pow(random.NextDouble(), RandomPower) * MaxValue);
            // (long)(20000 + random.NextDouble() * 2000 + random.NextDouble() * 1000 + random.NextDouble() * 5 + random.NextDouble() * 250);
        }

        random.Shuffle(values);
        return values;
    }
    ...
}
```
I measure the performance of value recording as the average time over many operations. No fancy histograms of cycles (yet). The workloads for Picollo and the legacy histogram are identical, apart from the API differences.
```csharp
private static void PicolloWorkload(Metrics.HdrHistogram h, int runs, int threadId = 0)
{
    var sw = Stopwatch.StartNew();

    for (int x = 0; x < runs; x++)
    {
        sw.Restart();

        for (int r = 0; r < Rounds; r++)
        {
            foreach (long value in Values)
            {
                h.Record((ulong)value);
            }
        }

        sw.Stop();

        var totalOps = Rounds * Values.Length;
        var elapsed = sw.Elapsed;
        var perOp = sw.ElapsedNanos /* Picollo extension property */ / (double)totalOps;
        Console.WriteLine($"[{threadId}] Elapsed: {elapsed}, perOp: {perOp:N2} ns");
    }
}
```
Single-threaded writes
The Picollo HDR histogram spends on average 2.7-3 ns per Record operation. The numbers are stable across Windows/WSL and do not depend much on the input parameters. For the legacy histogram, I noticed that the performance depends on the max trackable value in a surprisingly significant way, even though the footprint stays within L2. The range I observe is 3.5-14.6 ns per RecordValue operation.
Max trackable: 7716549600 // The default max in most tests in the legacy implementation
- Picollo: footprint 97,588, best time: 2.7 ns/ops
- Legacy: footprint 98,816, best time: 10.7 ns/ops
Max trackable: 30000
- Picollo: footprint 24,004, best time: 3.0 ns/ops
- Legacy: footprint 25,088, best time: 3.5 ns/ops
Max trackable: NanoScope.OneSecondValue = 1_000_000_000
- Picollo: footprint 85,572, best time: 2.8 ns/ops
- Legacy: footprint 98,816, best time: 7.0 ns/ops
Max trackable: long.MaxValue
- Picollo: footprint 221,300, best time: 2.8 ns/ops
- Legacy: footprint 221,300, best time: 14.6 ns/ops
These numbers are all good, and the difference is rarely relevant, e.g. for measuring network latencies. But it may become significant when instrumenting very hot code regions.
Multi-threaded writes
This benchmark is much more interesting. The workload is the same, but multiple threads write the values. The legacy histogram tracks the total count and has a reader-writer lock, which results in 3 interlocked operations per counter increment, two of which hit exactly the same locations. Picollo’s interlocked implementation has only counters, which are randomly hit at different locations. Picollo’s thread-local implementation does not have any write contention by design, but has some overheads to resolve the storage location from a thread-static slot.
Here are approximate numbers as before: the best runs observed within a reasonable amount of time. The interesting measurement axes are the number of threads and the max trackable value. With a smaller range, there should be more contention per counter.
Max trackable: long.MaxValue, 1 thread
- Picollo thread-local: footprint 442,608, best time: 5.5 ns/ops
- Picollo interlocked: footprint 221,304, best time: 9.6 ns/ops
- Legacy: footprint 225,792, best time: 31.6 ns/ops
Max trackable: long.MaxValue, 2 threads
- Picollo thread-local: footprint 663,912, best time: 5.5 ns/ops
- Picollo interlocked: footprint 221,304, best time: 12.1 ns/ops (1.3x worse than single threaded)
- Legacy: footprint 225,792, best time: 181 ns/ops (5.7x worse than single threaded)
Max trackable: long.MaxValue, 4 threads
- Picollo thread-local: footprint 1,106,520, best time: 5.6 ns/ops
- Picollo interlocked: footprint 221,304, best time: 14.9 ns/ops (1.6x worse than single threaded)
- Legacy: footprint 225,792, best time: 408 ns/ops (13x worse than single threaded)
Max trackable: 30000, 1 thread
- Picollo thread-local: footprint 48,016, best time: 5.3 ns/ops
- Picollo interlocked: footprint 24,008, best time: 9.6 ns/ops
- Legacy: footprint 25,088, best time: 29.1 ns/ops
Max trackable: 30000, 2 threads
- Picollo thread-local: footprint 72,024, best time: 5.2 ns/ops
- Picollo interlocked: footprint 24,008, best time: 20.4 ns/ops (2x worse than single threaded)
- Legacy: footprint 25,088, best time: 174 ns/ops (6x worse than single threaded)
Max trackable: 30000, 4 threads
- Picollo thread-local: footprint 120,040, best time: 5.4 ns/ops
- Picollo interlocked: footprint 24,008, best time: 38.7 ns/ops (4x worse than single threaded)
- Legacy: footprint 25,088, best time: 446 ns/ops (15x worse than single threaded)
As you can see, the thread-local version simply does not care about threads. Its footprint is shown as the total, but most of the time it will use the per-core L2 cache. The interlocked version benefits from multiple counter slots that reduce the contention, which is per cache line, i.e. per 8-16 counters. But when an interlocked version is forced to update the same locations, its overhead explodes even with 4 threads. You can imagine what it will do on a machine with dozens of cores and multiple NUMA nodes. (I will add such a benchmark later.)
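The "same locations" effect is easy to provoke yourself. Here is a small standalone sketch (mine, not part of the benchmark suite) where four threads do interlocked increments either on one shared slot or on slots padded 128 bytes apart; the counts stay exact in both cases, and only the throughput differs:

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

const int Threads = 4;
const int PerThread = 1_000_000;

long sharedSlot = 0;
long[] padded = new long[Threads * 16]; // 16 longs = 128 bytes between used slots

static TimeSpan Run(int threads, Action<int> body)
{
    var sw = Stopwatch.StartNew();
    Task.WaitAll(Enumerable.Range(0, threads)
        .Select(t => Task.Run(() => body(t))).ToArray());
    return sw.Elapsed;
}

var contended = Run(Threads, t =>
{
    for (int i = 0; i < PerThread; i++) Interlocked.Increment(ref sharedSlot);
});
var spread = Run(Threads, t =>
{
    for (int i = 0; i < PerThread; i++) Interlocked.Increment(ref padded[t * 16]);
});

// The increments never get lost, regardless of contention:
Console.WriteLine(sharedSlot == Threads * (long)PerThread);   // True
Console.WriteLine(padded.Sum() == Threads * (long)PerThread); // True
// Only the time differs; expect the contended run to be several times slower:
Console.WriteLine($"same line: {contended.TotalMilliseconds:F0} ms, padded: {spread.TotalMilliseconds:F0} ms");
```

Interlocked operations thus never corrupt the counts; contention shows up purely as throughput loss, which is exactly what the benchmark numbers above reflect.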
PerfEvent
This section will be tiny compared to the first one. I want to focus on how easy it is to use raw performance counters from .NET. That is not very surprising, given that .NET has one of the best FFI implementations, known as P/Invoke, and the heavy lifting is done by the perf_event_open function. That function is not directly exposed by libc; it is a numbered syscall, but this changes nothing substantial.
The man page for perf_event_open is intimidating in size and overwhelming in detail. How it works is definitely beyond the scope of this post, but I can show how to use it from .NET. Maybe I will describe the internals in a second post, if there is interest.
One thing worth highlighting is that PMU counters, at least the basic hardware ones, do work on WSL. There may be edge cases with vPMU support, but it is already so much better than nothing on Windows.
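To make “it is a numbered syscall” concrete before moving to Picollo's API: the sketch below is entirely my own, not Picollo code. It P/Invokes the syscall through libc, declares only the first 64 bytes (PERF_ATTR_SIZE_VER0) of perf_event_attr, and opens a CPU-cycles counter for the current thread. The syscall number 298 is specific to x86-64 Linux, and the call fails gracefully where perf events are unavailable:

```csharp
using System;
using System.Runtime.InteropServices;

static class RawPerf
{
    private const long SYS_perf_event_open = 298; // x86-64 Linux only

    [DllImport("libc", EntryPoint = "syscall", SetLastError = true)]
    private static extern long PerfEventOpenSyscall(long number, ref PerfEventAttr attr,
                                                    int pid, int cpu, int groupFd, ulong flags);

    [DllImport("libc", SetLastError = true)]
    private static extern long read(int fd, out ulong value, nuint count);

    [DllImport("libc")]
    private static extern int close(int fd);

    // First 64 bytes (PERF_ATTR_SIZE_VER0) of perf_event_attr; the kernel
    // treats all newer fields as zero when this size is passed.
    [StructLayout(LayoutKind.Sequential)]
    private struct PerfEventAttr
    {
        public uint Type;                    // PERF_TYPE_HARDWARE = 0
        public uint Size;                    // 64
        public ulong Config;                 // PERF_COUNT_HW_CPU_CYCLES = 0
        public ulong SamplePeriodOrFreq;
        public ulong SampleType;
        public ulong ReadFormat;
        public ulong Flags;                  // bitfield; bit 5 = exclude_kernel
        public uint WakeupEventsOrWatermark;
        public uint BpType;
        public ulong BpAddrOrConfig1;        // pads the struct to 64 bytes
    }

    public static void Main()
    {
        var attr = new PerfEventAttr
        {
            Size = (uint)Marshal.SizeOf<PerfEventAttr>(), // 64
            Flags = 1UL << 5,                             // exclude_kernel
        };

        // pid = 0, cpu = -1: count for this thread on whatever CPU runs it.
        int fd = (int)PerfEventOpenSyscall(SYS_perf_event_open, ref attr, 0, -1, -1, 0);
        if (fd < 0) { Console.WriteLine("perf_event_open unavailable"); return; }

        read(fd, out ulong before, (nuint)sizeof(ulong));
        for (int i = 0; i < 1_000_000; i++) { } // burn some cycles
        read(fd, out ulong after, (nuint)sizeof(ulong));
        close(fd);

        Console.WriteLine($"cycles: {after - before:N0}");
    }
}
```

Picollo's session API shown next wraps this same syscall; the point is only that nothing exotic is required from the .NET side.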
Getting started
Here are some getting-started examples. There is not much more to the API, since the goal is simply to extract the counters. The HDR histogram plays an important role in accumulating and displaying the results.
Known workload
```csharp
public static void PerfSessionGettingStarted()
{
    if (!PerfEventCounterSession.TryGetNumberOfCounters(out uint fixedCounters, out uint programmableCounters))
    {
        Console.WriteLine("CPU PMU: unavailable");
        return;
    }

    Console.WriteLine($"CPU PMU: fixedCounters={fixedCounters}, programmableCounters={programmableCounters}");

    var counterSession = PerfEventCounterSession.Factory
        .WithTarget((int)0, -1) // By default, it's the current OS thread on any CPU core
        .WithPinned(true) // Schedule all or nothing
        .WithKernel(false) // Depends; for pure C# userspace code it is better to have it false
        .WithEnabled(true) // Start as enabled, usually the right choice unless there are reasons not to, e.g. multiple sessions
        // .WithHardwareCounters(PerfHwId.CpuCycles)
        // .WithHardwareCounters(PerfHwId.Instructions)
        // .WithHardwareCounters(PerfHwId.RefCpuCycles)
        .WithFixedCounters() // Same as the 3 commented above. They are usually available and do not consume programmable counter slots
        .WithHardwareCounter(PerfHardwareCounterId.BranchMisses)
        .WithCacheCounter(PerfCacheCounterId.L1DReadMiss)
        .WithCacheCounter(PerfCacheCounterId.L1IReadMiss)
        // Often it's trial & error; even the standard counters may be unavailable
        // E.g. the next two do not work on WSL
        // .WithCacheCounter(PerfCacheCounterId.LLReadAccess) // Poor man's proxy for L2 misses
        // .WithCacheCounter(PerfCacheCounterId.LLReadMiss)
        .Create();

    for (int i = 0; i < 10_000; i++)
    {
        // Record the current state of the counters.
        // ForceSyscallRead forces the read to always use a syscall even if a RDPMC-based fast path is available.
        // It's more stable and atomic per session, but significantly slower.
        counterSession.Read(forceSyscallRead: true);

        // A helper that contains 256 simple instructions that cannot be reordered or optimized away.
        // The expected work is 256k cycles.
        CpuUtils.AddChain256(1000);

        // Record the counters again. The implementation keeps track of the previous read result and can calculate deltas.
        counterSession.Read(forceSyscallRead: true);

        // Record counter delta values into the per-counter HDR histograms.
        counterSession.Record(deltas: true);

        if (i == 1000)
            counterSession.Reset(); // Warm-up: discard the first 1000 samples
    }

    foreach (PerfEventCounter counter in counterSession.Counters)
    {
        counter.Histogram.GetSummary().PrettyPrint(counter.Name);
    }
}
```
The output from WSL:
```
CPU PMU: fixedCounters=4, programmableCounters=8
```
##### Hardware:Instructions
| Percentile | Value | ± | Count |
| :--------- | ---------: | :----- | -----: |
| 0 | 260,224 | ±128 | 1 |
| 1 | 260,224 | ±128 | 90 |
| 5 | 260,224 | ±128 | 450 |
| 10 | 260,224 | ±128 | 900 |
| 25 | 260,224 | ±128 | 2,250 |
| 50 | 260,224 | ±128 | 4,500 |
| 75 | 260,224 | ±128 | 6,750 |
| 90 | 260,992 | ±128 | 8,100 |
| 92.5 | 260,992 | ±128 | 8,325 |
| 95 | 260,992 | ±128 | 8,550 |
| 97.5 | 260,992 | ±128 | 8,775 |
| 99 | 260,992 | ±128 | 8,910 |
| 99.9 | 260,992 | ±128 | 8,991 |
| 99.99 | 263,936 | ±256 | 8,999 |
| 99.999 | 263,936 | ±256 | 8,999 |
| 100 | 263,936 | ±256 | 8,999 |
| | | | |
| Mean: | 260,351.06 | StDev: | 288.47 |
| Precision: | 0.0977% | Total: | 8,999 |

##### Hardware:CpuCycles
| Percentile | Value | ± | Count |
| :--------- | ---------: | :----- | -------: |
| 0 | 256,128 | ±128 | 1 |
| 1 | 256,128 | ±128 | 90 |
| 5 | 256,128 | ±128 | 450 |
| 10 | 256,128 | ±128 | 900 |
| 25 | 256,384 | ±128 | 2,250 |
| 50 | 256,640 | ±128 | 4,500 |
| 75 | 258,688 | ±128 | 6,750 |
| 90 | 260,480 | ±128 | 8,100 |
| 92.5 | 261,248 | ±128 | 8,325 |
| 95 | 262,400 | ±256 | 8,550 |
| 97.5 | 263,936 | ±256 | 8,775 |
| 99 | 268,544 | ±256 | 8,910 |
| 99.9 | 284,416 | ±256 | 8,991 |
| 99.99 | 401,664 | ±256 | 8,999 |
| 99.999 | 401,664 | ±256 | 8,999 |
| 100 | 401,664 | ±256 | 8,999 |
| | | | |
| Mean: | 257,888.51 | StDev: | 3,288.28 |
| Precision: | 0.0977% | Total: | 8,999 |

##### Hardware:RefCpuCycles
| Percentile | Value | ± | Count |
| :--------- | ---------: | :----- | -------: |
| 0 | 256,384 | ±128 | 1 |
| 1 | 256,384 | ±128 | 90 |
| 5 | 256,384 | ±128 | 450 |
| 10 | 256,384 | ±128 | 900 |
| 25 | 256,384 | ±128 | 2,250 |
| 50 | 256,640 | ±128 | 4,500 |
| 75 | 258,944 | ±128 | 6,750 |
| 90 | 260,736 | ±128 | 8,100 |
| 92.5 | 261,504 | ±128 | 8,325 |
| 95 | 262,400 | ±256 | 8,550 |
| 97.5 | 264,448 | ±256 | 8,775 |
| 99 | 269,568 | ±256 | 8,910 |
| 99.9 | 311,040 | ±256 | 8,991 |
| 99.99 | 402,176 | ±256 | 8,999 |
| 99.999 | 402,176 | ±256 | 8,999 |
| 100 | 402,176 | ±256 | 8,999 |
| | | | |
| Mean: | 258,128.22 | StDev: | 3,987.04 |
| Precision: | 0.0977% | Total: | 8,999 |

##### Hardware:BranchMisses
| Percentile | Value | ± | Count |
| :--------- | ------: | :----- | ----: |
| 0 | 8 | ±0 | 1 |
| 1 | 8 | ±0 | 90 |
| 5 | 8 | ±0 | 450 |
| 10 | 8 | ±0 | 900 |
| 25 | 9 | ±0 | 2,250 |
| 50 | 10 | ±0 | 4,500 |
| 75 | 14 | ±0 | 6,750 |
| 90 | 18 | ±0 | 8,100 |
| 92.5 | 19 | ±0 | 8,325 |
| 95 | 21 | ±0 | 8,550 |
| 97.5 | 25 | ±0 | 8,775 |
| 99 | 31 | ±0 | 8,910 |
| 99.9 | 46 | ±0 | 8,991 |
| 99.99 | 109 | ±0 | 8,999 |
| 99.999 | 109 | ±0 | 8,999 |
| 100 | 109 | ±0 | 8,999 |
| | | | |
| Mean: | 11.99 | StDev: | 5.21 |
| Precision: | 0.0977% | Total: | 8,999 |

##### HardwareCache:L1DReadMiss
| Percentile | Value | ± | Count |
| :--------- | ------: | :----- | ----: |
| 0 | 0 | ±0 | 1 |
| 1 | 0 | ±0 | 90 |
| 5 | 0 | ±0 | 450 |
| 10 | 0 | ±0 | 900 |
| 25 | 0 | ±0 | 2,250 |
| 50 | 0 | ±0 | 4,500 |
| 75 | 23 | ±0 | 6,750 |
| 90 | 30 | ±0 | 8,100 |
| 92.5 | 32 | ±0 | 8,325 |
| 95 | 37 | ±0 | 8,550 |
| 97.5 | 56 | ±0 | 8,775 |
| 99 | 105 | ±0 | 8,910 |
| 99.9 | 209 | ±0 | 8,991 |
| 99.99 | 421 | ±0 | 8,999 |
| 99.999 | 421 | ±0 | 8,999 |
| 100 | 421 | ±0 | 8,999 |
| | | | |
| Mean: | 11.29 | StDev: | 22.70 |
| Precision: | 0.0977% | Total: | 8,999 |

##### HardwareCache:L1IReadMiss
| Percentile | Value | ± | Count |
| :--------- | ------: | :----- | ----: |
| 0 | 0 | ±0 | 1 |
| 1 | 0 | ±0 | 90 |
| 5 | 0 | ±0 | 450 |
| 10 | 0 | ±0 | 900 |
| 25 | 0 | ±0 | 2,250 |
| 50 | 4 | ±0 | 4,500 |
| 75 | 47 | ±0 | 6,750 |
| 90 | 85 | ±0 | 8,100 |
| 92.5 | 90 | ±0 | 8,325 |
| 95 | 97 | ±0 | 8,550 |
| 97.5 | 110 | ±0 | 8,775 |
| 99 | 138 | ±0 | 8,910 |
| 99.9 | 300 | ±0 | 8,991 |
| 99.99 | 418 | ±0 | 8,999 |
| 99.999 | 418 | ±0 | 8,999 |
| 100 | 418 | ±0 | 8,999 |
| | | | |
| Mean: | 27.28 | StDev: | 38.77 |
| Precision: | 0.0977% | Total: | 8,999 |
You can see that the Hardware:CpuCycles P50 matches the expected value: 1,000 runs of the 256-instruction chain should take about 256k cycles, and the measured P50 of 256,640 is that plus a small loop and measurement overhead.
Sorted vs unsorted aggregation
A more fun example is to illustrate the most upvoted StackOverflow question: Why is processing a sorted array faster than processing an unsorted array?
The code below measures cycles and branch misses and prints summary diffs.
```csharp
public static void PerfSessionSorted()
{
    if (!PerfEventCounterSession.TryGetNumberOfCounters(out uint fixedCounters, out uint programmableCounters))
    {
        Console.WriteLine("CPU PMU: unavailable");
        return;
    }

    Console.WriteLine($"CPU PMU: fixedCounters={fixedCounters}, programmableCounters={programmableCounters}");

    var values = new byte[1_000_000];
    Random.Shared.NextBytes(values);

    var counterSession = PerfEventCounterSession.Factory
        .WithKernel(false)
        .WithEnabled(true)
        .WithHardwareCounter(PerfHardwareCounterId.CpuCycles)
        .WithHardwareCounter(PerfHardwareCounterId.BranchMisses)
        .Create();

    var summaryUnsortedCycles = new HdrHistogramSummary();
    var summaryUnsortedBranches = new HdrHistogramSummary();
    double acc = 0;

    for (int r = 0; r < 2; r++)
    {
        for (int i = 0; i < 1000; i++)
        {
            counterSession.Read(forceSyscallRead: true);

            foreach (byte value in values)
            {
                if (value >= 128)
                    acc += 100;
                else
                    acc *= 0.9999;
            }

            counterSession.Read(forceSyscallRead: true);

            counterSession.Record(deltas: true);
        }

        if (r == 0)
        {
            counterSession.Counters.Hardware.CpuCycles!.Histogram.GetSummary(summaryUnsortedCycles);
            counterSession.Counters.Hardware.BranchMisses!.Histogram.GetSummary(summaryUnsortedBranches);

            counterSession.Reset();
            Array.Sort(values);

            if (acc < 0)
                throw new Exception("Do not optimize");
            acc = 0;
        }
    }

    var summarySortedCycles = counterSession.Counters.Hardware.CpuCycles!.Histogram.GetSummary();
    var summarySortedBranches = counterSession.Counters.Hardware.BranchMisses!.Histogram.GetSummary();

    summaryUnsortedCycles.PrettyPrintDiff(summarySortedCycles, "CpuCycles", "Unsorted", "Sorted");
    summaryUnsortedBranches.PrettyPrintDiff(summarySortedBranches, "BranchMisses", "Unsorted", "Sorted");
}
```
The output is:
##### CpuCycles
| Percentile | Unsorted | Sorted | Δ% |
| :--------- | ------------: | -----------: | -----: |
| 0 | 16,211,968 | 8,511,488 | -47.5% |
| 1 | 16,244,736 | 8,511,488 | -47.6% |
| 5 | 16,310,272 | 8,527,872 | -47.7% |
| 10 | 16,343,040 | 8,527,872 | -47.8% |
| 25 | 16,408,576 | 8,560,640 | -47.8% |
| 50 | 16,621,568 | 8,708,096 | -47.6% |
| 75 | 17,481,728 | 9,248,768 | -47.1% |
| 90 | 18,923,520 | 9,658,368 | -49.0% |
| 92.5 | 19,447,808 | 9,707,520 | -50.1% |
| 95 | 19,906,560 | 9,789,440 | -50.8% |
| 97.5 | 20,398,080 | 9,838,592 | -51.8% |
| 99 | 20,627,456 | 9,871,360 | -52.1% |
| 99.9 | 34,045,952 | 9,969,664 | -70.7% |
| 99.99 | 34,045,952 | 9,969,664 | -70.7% |
| 99.999 | 34,045,952 | 9,969,664 | -70.7% |
| 100 | 34,045,952 | 9,969,664 | -70.7% |
| | | | |
| Mean: | 17,179,303.94 | 8,927,461.38 | -48.0% |
| StDev: | 1,241,956.59 | 438,585.23 | -64.7% |
| Precision: | 0.0977% | 0.0977% | 0.0% |
| Total: | 1,000 | 1,000 | 0.0% |
| D-value: | | | -8.86 |

##### BranchMisses
| Percentile | Unsorted | Sorted | Δ% |
| :--------- | ---------: | ------: | ------: |
| 0 | 499,456 | 17 | -100.0% |
| 1 | 499,456 | 20 | -100.0% |
| 5 | 499,456 | 21 | -100.0% |
| 10 | 499,456 | 22 | -100.0% |
| 25 | 499,968 | 25 | -100.0% |
| 50 | 499,968 | 27 | -100.0% |
| 75 | 499,968 | 31 | -100.0% |
| 90 | 500,480 | 36 | -100.0% |
| 92.5 | 500,480 | 39 | -100.0% |
| 95 | 500,480 | 42 | -100.0% |
| 97.5 | 500,480 | 48 | -100.0% |
| 99 | 500,480 | 57 | -100.0% |
| 99.9 | 674,304 | 71 | -100.0% |
| 99.99 | 674,304 | 71 | -100.0% |
| 99.999 | 674,304 | 71 | -100.0% |
| 100 | 674,304 | 71 | -100.0% |
| | | | |
| Mean: | 500,179.71 | 28.82 | -100.0% |
| StDev: | 5,516.70 | 6.94 | -99.9% |
| Precision: | 0.0977% | 0.0977% | 0.0% |
| Total: | 1,000 | 1,000 | 0.0% |
| D-value: | | | -128.21 |
Q.E.D. With 1,000,000 random bytes, about half of the branches mispredict, matching the unsorted P50 of ~500,000 misses; after sorting, the misses drop to a few dozen.
Next steps
The initial release of Picollo v0.1.0 should be quite usable, especially the HDR part and the basic counter extraction. But it is not yet stable, and the API may change (only a little, at least for the histogram). One issue I keep fighting is the stability of the fast-path counter reads. This part requires a native library with inline assembly, which is built automatically as part of Picollo, so at least the CI/CD side is trivially solved. But the values often wrap around, especially on WSL. The API for perf_event may change significantly, but the core shape will remain.
I would greatly appreciate practical feedback on the API shape and implementation, as well as bug reports and PRs with fixes.