Using HdrHistogram with Ruby

TL:DR HdrHistogram is a fast and lightweight tool to record a histogram of your applications performance. It also corrects for the problem of Coordinated Omission. To install it run gem install HdrHistogram

What is HdrHistogram?

HdrHistogram is a library for recording values to a histogram with a configurable precision.

require 'HDRHistogram'

# Create a histogram to record values between 1 and 1,000 milliseconds
hdr = HDRHistogram.new(1, 1000, 3)

1..800.times {|_| hdr.record(5)} # 80% Fast requests
1..150.times {|_| hdr.record(30)} # 15% Moderate requests
1..50.times {|_| hdr.record(200)} # 5% Slow requests

percentile80th = hdr.percentile(80)
# percentile80th 5

percentile85th = hdr.percentile(85)
# percentile85th 30

percentile99th = hdr.percentile(99)
# percentile99th 200

In this example we've created a histogram that records values from 1 to 1000 milliseconds with a precision of 3 significant digits. Notice that we're recording milliseconds in whole numbers. This is due to HdrHistogram being represented internally as whole numbers from 1 to an integer upper bound. To record floating point values we'll need to use the optional multiplier parameter.

# Create a histogram to record values between 1 and 1,000 milliseconds
hdr = HDRHistogram.new(0.001, 1, 3, multiplier: 0.001)

1..800.times {|_| hdr.record(0.005)} # 80% Fast requests
1..150.times {|_| hdr.record(0.03)} # 15% Moderate requests
1..50.times {|_| hdr.record(0.2)} # 5% Slow requests

percentile80th = hdr.percentile(80)
# percentile80th 0.005

percentile85th = hdr.percentile(85)
# percentile85th 0.03

percentile99th = hdr.percentile(99)
# percentile99th 0.2

The multiplier parameter converts recorded values by value * (1 / multiplier) so that a floating point value can be properly recorded. multiplier represents the smallest recordable value of the histogram. In the above example any value under 0.001 would not be recorded (.e.g. 0.0057 would be recorded as 0.005)

Benefits of HdrHistogram

Why bother with HdrHistogram when we can simply use an array to record values and then compute the average?

timings = []

1..800.times {|_| timings.push(0.005)} # 80% Fast requests
1..150.times {|_| timings.push(0.03)} # 15% Moderate requests
1..50.times {|_| timings.push(0.2)} # 5% Slow requests

average = timings.inject(:+) / timings.length
# average 0.0185

When you record values this way you introduce jitter into your measurements due to allocations and potentially garbage collection. Each push onto the array requires an allocation that not only adds time to the measurement but also increases memory pressure leading to a potential garbage collection.

HdrHistogram preallocates all the memory it will need to record values in a very compact way. This take away a source jitter that can potentially be very large (if a Garbage collection is triggered)

# Create a histogram to record values between 1 and 1,000 microseconds
hdr = HDRHistogram.new(0.001, 1, 3, multiplier: 0.001)

puts hdr.memsize
# => 16480 bytes before recording values

1..800.times {|_| hdr.record(0.005)} # 80% Fast requests
1..150.times {|_| hdr.record(0.03)} # 15% Moderate requests
1..50.times {|_| hdr.record(0.2)} # 5% Slow requests

hdr.memsize
# => 16480 bytes after recording values

HdrHistogram also corrects for the problem of Coordinated Omission (see below) which very few profiling tools take into account.

Coordinated Omission

What is Coordinated Omission? Broadly it's assuming that the number of requests is exactly equal to the number of responses. Since that description is very high level here's an example to clarify things.

Let's say you work at an Ice Cream Parlour that always has a two hour rush during lunchtime. The fastidious store manager takes measurements of how fast customers arrive and finds that customer come to your store once every minute during the rush. Soon a "Punctual Performer" bonus is announced. You'll receive this bonus if your average service time is less than a minute.

The next day rolls around and the lunch rush starts. You frantically serve customers averaging 30 seconds to complete an order. After 90 minutes you decide to take off the last 30 minutes and play video games only serving one customer. Do you qualify for the "Punctual Performer" bonus?

For the first 90 minutes you served 90 customers with a 30 second response time and while you were playing video games one customer got served with a 30 minutes response time. Calculating the average with ruby we get

wt_num = 90 # Number of Customers served during the first 90 minutes
wt_rt = 30 # Response Time during those first 90 minutes

vg_num = 1 # Customers served while you were playing video games
vg_rt = 30 * 60 # Respone Time while you were playing video games

average = ((wt_num * wt_rt) + (vg_num * vg_rt)) / (wt_num + vg_num).to_f
# average => 49.450549 seconds

We qualify with 10 seconds to spare! In fact you could play video games for up 40 minutes and still get the bonus. Clearly our boss would not see it that way, so where did our calculations go wrong?

The flawed assumption was if we stop working then customers will stop arriving. In our calculations we said that there was only one customer during the 30 minutes we were playing video games. But we know that customers arrive at our Ice Cream Parlour at the rate of one per minute so we should actually have 30 customers during that time instead of only one. The first customer will be waiting 30 minutes, the second 29 minutes, the third 28 minutes and so on. Calculating the new average with these corrected assumptions we get the actual average

wt_num = 90 # Number of Customers served during the first 90 minutes
wt_rt = 30 # Response Time during those first 90 minutes

vg_num = 30 # Customers arriving while you played video games
# Total wait time of customes while you were playing video games
vg_rt_total = (1..30).map {|i| i * 60 }.inject(:+)

average = ((wt_num * wt_rt) + vg_rt_total) / (wt_num + vg_num).to_f
# average => 255.0 seconds

The corrected average is 4 minutes and 15 seconds well past the one minute time needed for the bonus. By not recording all customers coming to our shop Coordinated Omission gives us the false impression that we're serving customers almost 5 times faster then we actually are. Not only does Coordinated Omission wildly underestimate your response time it can show performance improvements as making things worse. Lets say instead of working furiously serving people in 30 seconds for the first 90 minutes you served everyone at a more sedate pace of 55 seconds. People would think that your performance was 5.5 seconds slower!

Fixing Coordinated Omission with HdrHistogram

To fixes the problem of Coordinated Omission with HdrHistogram switch from record(value) to record_corrected(value, expected_interval). Expected interval is the expected arrival rate of requests (i.e. one customer arriving per minute in the above example). Here's the Ice Cream Parlour example using record_corrected

require "HDRHistogram"

hdr = HDRHistogram.new(1, 120, 3)
1..90.times {|_| hdr.record_corrected(30, 60) }

hdr.record_corrected(30 * 60, 60)

puts hdr.mean
# 255 Seconds

puts hdr.latency_stats
#  50.000%          30 seconds
#  75.000%          30 seconds
#  90.000%        1080 seconds
#  99.000%        1740 seconds
#  99.900%        1800 seconds
#  99.990%        1800 seconds
#  99.999%        1800 seconds
# 100.000%        1800 seconds

That's really all there is to it.

Conclusion

HdrHistorgram is a great tool to record values without accidentally introducing jitter to your recorded values. It also corrects for Coordinated Omission. To find out more about HdrHistogram and Coordinated Omission visit the github repo and watch this talk.