Using HdrHistogram with Ruby
TL:DR HdrHistogram is a fast and lightweight tool to record a histogram of your applications performance. It also corrects for the problem of Coordinated Omission. To install it run gem install HdrHistogram
What is HdrHistogram?
HdrHistogram is a library for recording values to a histogram with a configurable precision.
require 'HDRHistogram'
# Create a histogram to record values between 1 and 1,000 milliseconds
hdr = HDRHistogram.new(1, 1000, 3)
1..800.times {|_| hdr.record(5)} # 80% Fast requests
1..150.times {|_| hdr.record(30)} # 15% Moderate requests
1..50.times {|_| hdr.record(200)} # 5% Slow requests
percentile80th = hdr.percentile(80)
# percentile80th 5
percentile85th = hdr.percentile(85)
# percentile85th 30
percentile99th = hdr.percentile(99)
# percentile99th 200
In this example we've created a histogram that records values from 1 to 1000 milliseconds with a precision of 3 significant digits. Notice that we're recording milliseconds in whole numbers. This is due to HdrHistogram
being represented internally as whole numbers from 1 to an integer upper bound. To record floating point values we'll need to use the optional multiplier
parameter.
# Create a histogram to record values between 1 and 1,000 milliseconds
hdr = HDRHistogram.new(0.001, 1, 3, multiplier: 0.001)
1..800.times {|_| hdr.record(0.005)} # 80% Fast requests
1..150.times {|_| hdr.record(0.03)} # 15% Moderate requests
1..50.times {|_| hdr.record(0.2)} # 5% Slow requests
percentile80th = hdr.percentile(80)
# percentile80th 0.005
percentile85th = hdr.percentile(85)
# percentile85th 0.03
percentile99th = hdr.percentile(99)
# percentile99th 0.2
The multiplier
parameter converts recorded values by value * (1 / multiplier)
so that a floating point value can be properly recorded. multiplier
represents the smallest recordable value of the histogram. In the above example any value under 0.001
would not be recorded (.e.g. 0.0057
would be recorded as 0.005
)
Benefits of HdrHistogram
Why bother with HdrHistogram when we can simply use an array to record values and then compute the average?
timings = []
1..800.times {|_| timings.push(0.005)} # 80% Fast requests
1..150.times {|_| timings.push(0.03)} # 15% Moderate requests
1..50.times {|_| timings.push(0.2)} # 5% Slow requests
average = timings.inject(:+) / timings.length
# average 0.0185
When you record values this way you introduce jitter into your measurements due to allocations and potentially garbage collection. Each push
onto the array requires an allocation that not only adds time to the measurement but also increases memory pressure leading to a potential garbage collection.
HdrHistogram preallocates all the memory it will need to record values in a very compact way. This take away a source jitter that can potentially be very large (if a Garbage collection is triggered)
# Create a histogram to record values between 1 and 1,000 microseconds
hdr = HDRHistogram.new(0.001, 1, 3, multiplier: 0.001)
puts hdr.memsize
# => 16480 bytes before recording values
1..800.times {|_| hdr.record(0.005)} # 80% Fast requests
1..150.times {|_| hdr.record(0.03)} # 15% Moderate requests
1..50.times {|_| hdr.record(0.2)} # 5% Slow requests
hdr.memsize
# => 16480 bytes after recording values
HdrHistogram also corrects for the problem of Coordinated Omission (see below) which very few profiling tools take into account.
Coordinated Omission
What is Coordinated Omission? Broadly it's assuming that the number of requests is exactly equal to the number of responses. Since that description is very high level here's an example to clarify things.
Let's say you work at an Ice Cream Parlour that always has a two hour rush during lunchtime. The fastidious store manager takes measurements of how fast customers arrive and finds that customer come to your store once every minute during the rush. Soon a "Punctual Performer" bonus is announced. You'll receive this bonus if your average service time is less than a minute.
The next day rolls around and the lunch rush starts. You frantically serve customers averaging 30 seconds to complete an order. After 90 minutes you decide to take off the last 30 minutes and play video games only serving one customer. Do you qualify for the "Punctual Performer" bonus?
For the first 90 minutes you served 90 customers with a 30 second response time and while you were playing video games one customer got served with a 30 minutes response time. Calculating the average with ruby we get
wt_num = 90 # Number of Customers served during the first 90 minutes
wt_rt = 30 # Response Time during those first 90 minutes
vg_num = 1 # Customers served while you were playing video games
vg_rt = 30 * 60 # Respone Time while you were playing video games
average = ((wt_num * wt_rt) + (vg_num * vg_rt)) / (wt_num + vg_num).to_f
# average => 49.450549 seconds
We qualify with 10 seconds to spare! In fact you could play video games for up 40 minutes and still get the bonus. Clearly our boss would not see it that way, so where did our calculations go wrong?
The flawed assumption was if we stop working then customers will stop arriving. In our calculations we said that there was only one customer during the 30 minutes we were playing video games. But we know that customers arrive at our Ice Cream Parlour at the rate of one per minute so we should actually have 30 customers during that time instead of only one. The first customer will be waiting 30 minutes, the second 29 minutes, the third 28 minutes and so on. Calculating the new average with these corrected assumptions we get the actual average
wt_num = 90 # Number of Customers served during the first 90 minutes
wt_rt = 30 # Response Time during those first 90 minutes
vg_num = 30 # Customers arriving while you played video games
# Total wait time of customes while you were playing video games
vg_rt_total = (1..30).map {|i| i * 60 }.inject(:+)
average = ((wt_num * wt_rt) + vg_rt_total) / (wt_num + vg_num).to_f
# average => 255.0 seconds
The corrected average is 4 minutes and 15 seconds well past the one minute time needed for the bonus. By not recording all customers coming to our shop Coordinated Omission gives us the false impression that we're serving customers almost 5 times faster then we actually are. Not only does Coordinated Omission wildly underestimate your response time it can show performance improvements as making things worse. Lets say instead of working furiously serving people in 30 seconds for the first 90 minutes you served everyone at a more sedate pace of 55 seconds. People would think that your performance was 5.5 seconds slower!
Fixing Coordinated Omission with HdrHistogram
To fixes the problem of Coordinated Omission with HdrHistogram switch from record(value)
to record_corrected(value, expected_interval)
. Expected interval is the expected arrival rate of requests (i.e. one customer arriving per minute in the above example). Here's the Ice Cream Parlour example using record_corrected
require "HDRHistogram"
hdr = HDRHistogram.new(1, 120, 3)
1..90.times {|_| hdr.record_corrected(30, 60) }
hdr.record_corrected(30 * 60, 60)
puts hdr.mean
# 255 Seconds
puts hdr.latency_stats
# 50.000% 30 seconds
# 75.000% 30 seconds
# 90.000% 1080 seconds
# 99.000% 1740 seconds
# 99.900% 1800 seconds
# 99.990% 1800 seconds
# 99.999% 1800 seconds
# 100.000% 1800 seconds
That's really all there is to it.
Conclusion
HdrHistorgram is a great tool to record values without accidentally introducing jitter to your recorded values. It also corrects for Coordinated Omission. To find out more about HdrHistogram and Coordinated Omission visit the github repo and watch this talk.