2012-05-02 22:09:36 by jdixon
This morning I was collecting some graphs for one of our weekly status meetings. Asked to find something that represented the state of our Graphite system, I naturally gravitated to my usual standbys, "Carbon_Performance" (top) and "Carbon_Inbound_Bandwidth" (bottom).
The SysAdmin in me loves these because they highlight resource utilization on the server. The former details disk I/O and CPU, while the latter tracks inbound bandwidth in bits and packets per second. Although the network graph seems utterly boring (inasmuch as we've all used these in one form or another, from vendor-supplied dashboards to Cacti installations), it is actually the more complicated of the two to configure.
At least as of Graphite 0.9.9, there is no magic function to transform counters into their "per-second" equivalents. Hence, we're forced to concoct a scale recipe that produces the desired result. In my case, I wanted to present bits per second (bps) as the default unit for this chart. I determined that taking the nonNegativeDerivative of our SNMP octets counter, multiplying it by eight (to convert to bits), and then dividing by 60 (seconds) would yield the correct number. A quick comparison against the network switch graphs (provided by the vendor, of course) confirmed my calculation: we were indeed receiving approximately 5Mbps of inbound traffic to the Graphite server.
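To make the recipe concrete, here is a minimal Python sketch of the same arithmetic. It is not Graphite's code: `non_negative_derivative` is a simplified stand-in for the Graphite function of the same name (it ignores counter-wrap handling), and the sample values are hypothetical. In Graphite itself the equivalent would presumably be `nonNegativeDerivative` wrapped in `scale` with a factor of 8/60.

```python
def non_negative_derivative(values):
    """Delta between consecutive samples; None where the counter went backwards."""
    out = [None]  # the first point has no previous sample to diff against
    for prev, cur in zip(values, values[1:]):
        out.append(cur - prev if cur >= prev else None)
    return out


def octets_to_bps(values, seconds_per_point=60):
    """Octet deltas -> bits (x8) -> per second (divide by the assumed interval)."""
    return [None if d is None else d * 8 / seconds_per_point
            for d in non_negative_derivative(values)]


# Hypothetical ifInOctets samples, one per minute, growing by 37.5 MB each time.
samples = [1_000_000_000, 1_037_500_000, 1_075_000_000, 1_112_500_000]
rates = octets_to_bps(samples)
print(rates)  # [None, 5000000.0, 5000000.0, 5000000.0]
```

Note that the divisor of 60 is baked in: the result is only right while the samples really are one minute apart.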
Of course, this was months ago, which all but guarantees I'd forgotten exactly how I'd come to this calculation (or that it was even necessary).
Jump forward to this morning, when I was scratching my head trying to figure out why the 1-week graph (lower right, above) showed traffic levels five times greater than the usual 1-day pattern. My first inclination was to check the aggregationMethod for the affected metrics.
    # whisper-info.py /data/snmp/graphite/IF-MIB\:\:ifInOctets/7.wsp | grep aggregation
    aggregationMethod: average
This looked normal (average is the default aggregation function), so my focus went to the data itself. Our retention policies for SNMP data are relatively sane; we like to keep 1-minute resolution for a day, 5-minute resolution for four weeks, and 15-minute resolution for a year.
    [snmp]
    priority = 200
    pattern = ^snmp\.
    retentions = 1m:1d,5m:28d,15m:1y
Using the Graphite web API, I reviewed the raw data for both archives. Everything looked normal.
I went back and looked at the graphs again. This time I attempted to isolate the periods when the graphs looked "good" and when the levels were out of whack. And then it hit me like a ton of bricks. As soon as my view extended beyond the 1-day archive (1m:1d), the numbers skewed by a factor of 5. And if I went beyond the 4-week archive (5m:28d), they jumped up by another factor of 3. The totals were off by exactly the ratio of secondsPerPoint from one archive to the next.
    # whisper-info.py /data/whisper/snmp/graphite/IF-MIB\:\:ifInOctets/7.wsp | \
    > grep -A2 Archive
    Archive 0
    retention: 86400
    secondsPerPoint: 60
    --
    Archive 1
    retention: 2419200
    secondsPerPoint: 300
    --
    Archive 2
    retention: 31536000
    secondsPerPoint: 900
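The secondsPerPoint values above are enough to reproduce the skew on paper. A small sketch (the dictionary and variable names are mine, purely for illustration) that divides each archive's interval by the 60 seconds assumed in the scale factor:

```python
# secondsPerPoint for each archive, taken from the whisper-info.py output.
ARCHIVES = {"1m:1d": 60, "5m:28d": 300, "15m:1y": 900}

# The interval hard-coded into the scale factor (divide by 60).
ASSUMED_SECONDS = 60

# How far off the graph is when a query is served from each archive.
inflation = {name: spp / ASSUMED_SECONDS for name, spp in ARCHIVES.items()}

for name, factor in inflation.items():
    print(f"{name}: values inflated {factor:g}x")
```

That gives 1x, 5x and 15x, which matches the factor of 5 past the first archive and the further factor of 3 past the second.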
Because I had hard-coded a time conversion into our scale factor, the result was only valid for the archive I initially verified it against. Any query that pulls from a different archive is off by the ratio of that archive's secondsPerPoint to the 60 seconds I assumed. What was designed as bits per second over one-minute intervals becomes the number of bits per five-minute interval divided by 60, which is five times the true rate.
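A rough simulation shows both the failure and one way around it: divide by the interval of whichever archive actually served the data, rather than by a constant. Everything here is hypothetical (steady 5 Mbps traffic, helper names of my own choosing), and the five-minute archive is modelled as a plain average of five one-minute counter values, in line with the `average` aggregationMethod.

```python
BYTES_PER_MINUTE = 37_500_000  # hypothetical steady traffic, equal to 5 Mbps


def deltas(values):
    """Differences between consecutive samples, skipping any negative ones."""
    return [b - a for a, b in zip(values, values[1:]) if b >= a]


def bps(values, step_seconds):
    """Convert counter samples to bits per second for a given sample interval."""
    return [d * 8 / step_seconds for d in deltas(values)]


# Raw counter as stored in the one-minute archive.
one_minute = [i * BYTES_PER_MINUTE for i in range(20)]

# Five-minute archive: each point is the average of five one-minute points.
five_minute = [sum(one_minute[i:i + 5]) / 5 for i in range(0, 20, 5)]

fixed_1m = bps(one_minute, 60)    # original recipe, correct archive
fixed_5m = bps(five_minute, 60)   # original recipe, wrong archive
aware_5m = bps(five_minute, 300)  # divisor follows the archive's interval

print(fixed_1m[0], fixed_5m[0], aware_5m[0])
```

The fixed divisor reports 25 Mbps from the five-minute archive, while the step-aware version stays at 5 Mbps. As far as I know, Graphite releases after 0.9.9 document `scaleToSeconds` and `perSecond` functions that account for the series interval in this way, so check the function list for your version before hand-rolling a scale factor.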
The moral of the story: never hard-code a time-unit conversion into your derivative, or it will bite you in the ass.