Quick GC Histograms
Some times its hard to find a way of explaining to the technically less literate the power that the Unix shell hands you. The original ideas behing the combination wordprocessor and programmers work bench combined with the power of pipes make text processing easy.
So to today's problem - analysing GC logs for latency spikes. As we run the jvm with the following properties;
-XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCApplicationStoppedTime \ -XX:+PrintTenuringDistribution -Xloggc:/path/to/gc.logfile
The output looks like this;
2013-04-18T13:30:33.491+0000: 97108.543: [GC 97108.543: [ParNew
Desired survivor size 26836992 bytes, new threshold 3 (max 4)
- age 1: 12848568 bytes, 12848568 total
- age 2: 7033104 bytes, 19881672 total
- age 3: 7791880 bytes, 27673552 total
- age 4: 5567656 bytes, 33241208 total
: 459080K->45463K(471872K), 0.0147270 secs] 2571295K->2162244K(4968704K) \ icms_dc=0 , 0.0148850 secs] [Times: user=0.27 sys=0.00, real=0.02 secs]
Total time for which application threads were stopped: 0.0154030 seconds
Total time for which application threads were stopped: 0.0002950 seconds
Total time for which application threads were stopped: 0.0002120 seconds
Total time for which application threads were stopped: 0.0003010 seconds
Total time for which application threads were stopped: 0.0002510 seconds
(This system is not on the critical path by the way).
So, looking for an idea of spikes/deviation from normal or just a view of how the pauses are distributed and the length of our tail a histogram is a good start.
We want to see something like this;
Count Millis 1301 0.000 71 0.001
2 0.002
1 0.003
3 0.005
8 0.006
8 0.007
15 0.008
29 0.009
33 0.010
32 0.011
45 0.012
47 0.013
50 0.014
52 0.015
54 0.016
60 0.017
30 0.018
39 0.019
37 0.020
24 0.021
24 0.022
18 0.023
8 0.024
10 0.025
4 0.026
8 0.027
5 0.028
3 0.029
4 0.030
3 0.031
1 0.032
Which we get by using this;
grep 'stopped:' ems2-gc.log | cut -d':' -f 2 | cut -d' ' -f 2 | sort -n \ | awk '{ printf "%1.3f\n", $1 }' | (echo " Count Millis" ; uniq -c )
From left to right we're;
- pulling out the stopped lines using grep
- extracting the seconds using the two cut statements
- numerically sorting the output
- formatting the output to print out in the buckets we're interested in - in this case milliseconds using awk
- and lastly creating a histogram using uniq
There are much cleverer bash one liners out there, and each linux admin has their own favourite set of tools, so the above can be done in many different ways no doubt. There are also those who reach for perl/python for anything more complex than grepping the output of ps. I think they're missing out, as almost every day (and there have been many) there's a new flag or command to surprise and delight me with bash.