Unhelpful Graphite Tip #1 - Frequency of Events

2012-04-10 00:41:02 by jdixon

I'd like to begin sharing more of my knowledge as it pertains to using Graphite in production. Most of these upcoming posts are bound to be of the "check out this cool function" variety, but hopefully you can stitch them together into something useful. Before I proceed, I'd like to thank Chris Davis and the team at Orbitz who started this incredible software project and released it to the open-source community. Without your work I'd be stuck using something... less awesome.

Today's tip comes courtesy of a combined effort by me and Michael Leinartas (@mleinart). I've used this particular combination of functions before to calculate the number of "events" in a series during a particular timeframe. Unfortunately I failed to record this query anywhere (pro-tip: save your best Graphite functions in a document or gist, you'll be glad you did) although I had a vague idea of the functions needed. Michael was kind enough to remind me of the particular order for chaining the functions.

At $DAYJOB we use a large number of EC2 instances for various components. For the last few months we've been experiencing a high mortality rate with a particular type of instance, used in a particular component configuration. To support our research in tracking down possible kernel bugs we started submitting "annotation" metrics to the Graphite server. These are generally one-shot metrics that we render in Graphite using the drawAsInfinite() function. It lets us identify a particular moment in time by drawing a vertical line at the point on the X axis where the metric was recorded. This works very well for visualizing isolated events (server crashes, software deployments, etc.).

drawAsInfinite(color(custom.instances.*.killed,"white"))
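For anyone wondering how a one-shot metric like this might get submitted, here is a minimal sketch using Carbon's plaintext protocol (one "path value timestamp" line per datapoint). The hostname, the instance ID in the usage note, and the helper names are hypothetical, and port 2003 is only Carbon's usual plaintext default; this is not necessarily how we send ours.

```python
import socket
import time


def format_annotation(metric_path, timestamp=None, value=1):
    """Build one line of Carbon's plaintext protocol: '<path> <value> <timestamp>\n'."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{metric_path} {value} {int(timestamp)}\n"


def send_annotation(metric_path, host="graphite.example.com", port=2003, timestamp=None):
    """Send a single one-shot datapoint to Carbon (host and port are assumptions)."""
    line = format_annotation(metric_path, timestamp)
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))
    return line
```

Calling send_annotation("custom.instances.i-abc123.killed") when an instance dies would record a single datapoint that the wildcard in the target above picks up.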

But what if you want to aggregate these events over time, to gauge the frequency of the series as a whole? For this we can chain the group(), sumSeries() and summarize() functions. The group() function, as its name implies, collects dissimilar metrics into a single list of series. This is useful for passing on to other functions that operate on one list of series, such as sumSeries(). As you might expect, sumSeries() adds up the values across those series at each point in time, leaving a single combined series. Lastly, this is passed to summarize(), which aggregates the values into "buckets" of a specific time interval.

summarize(sumSeries(group(custom.instances.*.killed)), "1d")
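If it helps to see the arithmetic, here is a rough Python sketch of what the sumSeries() and summarize() steps do to the data. It is an illustration only, not Graphite's implementation: the function names, the dict-based series, and the sample events are all made up, and it assumes summarize()'s default behavior of summing each bucket.

```python
from collections import defaultdict

DAY = 86400  # seconds in the "1d" interval


def sum_series(series_list):
    """Add values across series at each timestamp, skipping missing (None) points."""
    totals = {}
    for series in series_list:
        for ts, value in series.items():
            if value is not None:
                totals[ts] = totals.get(ts, 0) + value
    return totals


def summarize(series, interval=DAY):
    """Sum datapoints into buckets keyed by the start of each interval."""
    buckets = defaultdict(int)
    for ts, value in series.items():
        buckets[ts - ts % interval] += value
    return dict(sorted(buckets.items()))


day_one = 15439 * DAY  # 2012-04-09 00:00:00 UTC

# Hypothetical one-shot "killed" events: {timestamp: value} per instance.
killed = [
    {day_one + 3600: 1},      # instance A, on day one
    {day_one + 7200: 1},      # instance B, also on day one
    {day_one + DAY + 60: 1},  # instance C, on day two
]

daily = summarize(sum_series(killed))
```

With the sample data, daily holds two buckets: a count of 2 for the first day and 1 for the second, which is the daily mortality figure the Graphite target plots.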

In case you're wondering, I set Line Mode to Staircase Line and enabled Draw Null as Zero. I also enabled Area Mode (All). These are merely aesthetic preferences, but they help me to discern the daily mortality rate over this period.

Comments

at 2012-04-10 10:11:22, Jeff Blaine wrote in to say...

Thanks Jason!

Clarity nitpick - you might want to change:

Lastly, this is passed to summarize(), which aggregates the values into "buckets" at 1-day intervals.

to

Lastly, this is passed to summarize(), which aggregates the values into "buckets" of a specified time period.

[ article then shows that you used 1d ]

at 2012-04-10 10:18:35, Jason Dixon wrote in to say...

I was referring to my usage of it, but I see your point. Updated for clarity.

at 2012-04-10 10:55:53, Kraig Amador wrote in to say...

Clever!

at 2012-06-25 13:49:04, Robert Krombholz wrote in to say...

Thanks for the great idea. One question came up while I was reading this: how do you define your storage schema for randomly occurring events without wasting too much storage on this kind of metric?

at 2012-06-25 17:52:44, Jason Dixon wrote in to say...

@Robert - If you don't want to lose the data, you really have no choice but to force it to *not* roll up to a lower resolution. See http://obfuscurity.com/2012/04/Unhelpful-Graphite-Tip-9 for more details.

at 2014-05-06 01:50:54, Andy Feller wrote in to say...

Hey Jason,

Here is a good article to look at about hitcount() and how it differs from summarize(): http://code.hootsuite.com/accurate-counting-with-graphite-and-statsd/ We've found that using both was beneficial: hitcount() based on the largest bucket size (à la retention), and summarize() for the actual time buckets we want.

Anyhow, would be interested in your feedback.

Andy
