Watching the Carbon Feed

2012-06-01 11:40:25 by jdixon

This is one of my favorite, and certainly most underappreciated, graphs. Its simplicity belies its usefulness. This single chart gives me a holistic view of our metrics feed, writes to Whisper files, and general system health. At a glance I can spot slow updates caused by a spike in Whisper file creations or by a backup, either of which shows up as a higher Points-per-Update (PPU) value. We use some of its targets with Nagios to monitor for metric feed issues. And it's always the first place I look whenever there's a whiff of Graphite problems.
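For the Nagios piece, here is a rough sketch of what such a check could look like. It is not our actual plugin: the host name and thresholds are placeholders I made up for illustration, and it only assumes the render API's JSON output (a list of series, each with `[value, timestamp]` datapoints) and the standard Nagios exit codes.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def latest_value(series):
    """Return the most recent non-null value from one render API JSON series."""
    for value, _timestamp in reversed(series["datapoints"]):
        if value is not None:
            return value
    return None


def evaluate(value, warn_below, crit_below):
    """Map a value to a Nagios status; a low value suggests a stalled feed."""
    if value is None:
        return UNKNOWN, "no datapoints returned"
    if value < crit_below:
        return CRITICAL, f"metricsReceived={value:.0f} (below {crit_below})"
    if value < warn_below:
        return WARNING, f"metricsReceived={value:.0f} (below {warn_below})"
    return OK, f"metricsReceived={value:.0f}"


def check(base_url, warn_below, crit_below):
    """Query Graphite for the summed metricsReceived and evaluate it."""
    query = urlencode({
        "target": "sumSeries(carbon.agents.*.metricsReceived)",
        "from": "-10min",
        "format": "json",
    })
    with urlopen(f"{base_url}/render?{query}", timeout=10) as resp:
        data = json.load(resp)
    value = latest_value(data[0]) if data else None
    return evaluate(value, warn_below, crit_below)


# Hypothetical usage (host and thresholds are placeholders):
#   status, message = check("http://graphite.example.com", 100000, 10000)
#   print(message); raise SystemExit(status)
```

The thresholds only make sense relative to your own normal feed volume, so pick them after watching the graph for a while.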

Here is a recent one-hour snapshot:

And its corresponding 24-hour view:

The targets are relatively straightforward. We use the group function so that we can easily sumSeries multiple carbon cache daemons at once. In our installation we actually have eight carbon cache processes (and four relays), so this saves a lot of typing. The Points-per-Update (PPU), CPU and Creates are all rendered on the secondYAxis to keep them at a reasonable scale.

alias(color(sumSeries(group(carbon.agents.*.updateOperations)),"blue"),"Updates")
alias(color(sumSeries(group(carbon.agents.*.metricsReceived)),"green"),"Metrics Received")
alias(color(sumSeries(group(carbon.agents.*.committedPoints)),"orange"),"Committed Points")
alias(secondYAxis(color(averageSeries(group(carbon.agents.*.cpuUsage)),"red")),"CPU (avg)")
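If you'd rather build the graph URL programmatically than paste targets into the composer, something like the following should work. Treat it as a sketch: the host name and image dimensions are made up, and it relies only on the render API accepting repeated `target` parameters plus a relative `from`.

```python
from urllib.parse import urlencode

# Hypothetical Graphite host; substitute your own.
GRAPHITE_URL = "http://graphite.example.com"

# The targets from this post, one string per series.
TARGETS = [
    'alias(color(sumSeries(group(carbon.agents.*.updateOperations)),"blue"),"Updates")',
    'alias(color(sumSeries(group(carbon.agents.*.metricsReceived)),"green"),"Metrics Received")',
    'alias(color(sumSeries(group(carbon.agents.*.committedPoints)),"orange"),"Committed Points")',
    'alias(secondYAxis(color(averageSeries(group(carbon.agents.*.cpuUsage)),"red")),"CPU (avg)")',
]


def render_url(targets, time_from="-1h", base=GRAPHITE_URL):
    """Build a render API URL with one URL-encoded target parameter per series."""
    params = [("target", t) for t in targets]
    # Width and height are arbitrary illustrative values.
    params += [("from", time_from), ("width", "800"), ("height", "400")]
    return base + "/render?" + urlencode(params)
```

Calling `render_url(TARGETS)` gives the one-hour view and `render_url(TARGETS, "-24h")` the 24-hour one.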

What's your favorite graph? Is it something you'd be willing to share? Feel free to tweet me a gist with your graph configs and I'll post them on my blog.


at 2013-06-18 13:51:42, Pierce Wetter wrote in to say...

Thanks for the graph. I added this:

alias(secondYAxis(color(scale(maxSeries(group(carbon.agents.*.metricsReceived)),0.001), "white")), "Max K MetricsR/Host")

It shows the maximum number of metrics received per agent in K metrics. I watch that so I know when we need to scale up the metrics hosts.

at 2014-01-16 07:25:15, Alex wrote in to say...

Can you explain the first graph?

Does it mean that you receive 500k metric events per second?

Do you have an article about your Graphite configuration? I mean, how many instances of carbon, cache, etc. do you have?

I'm trying to find out how fast Graphite is, but I can't find anything other than "very fast".

I did not find any performance tests of Graphite.

I understand that Graphite performance depends on my hardware, but I want to find something like:

we have the following hardware: blah blah

we can process the following number of metrics per second: blah blah

here is the performance trend: blah blah

I haven't seen anything like this.

at 2014-01-16 13:09:57, Jason Dixon wrote in to say...

@Alex - Scaling Graphite is generally an exercise in systems engineering. Learn where your bottlenecks are, adjust, measure, re-adjust, etc. I agree that it would help to have more documented use cases out there. is a good example of this process.

I've posted a couple examples of Carbon setups I've had in production.

at 2014-01-17 08:29:45, Alex wrote in to say...

Thank you so much for this information!

at 2014-01-17 12:17:47, Alex wrote in to say...

How do you handle re-synchronization of carbon instances?

Let me try to explain.

I read

and found the following text:

"In a sense carbon-relay provides replication functionality, though it would more accurately be called input duplication since it does not deal with synchronization issues."

Let's say we have two carbon instances and a carbon-relay that writes the data for metric1 to both of them. Then we lose one of the carbon instances. As a result, we end up with synchronization issues.

To fix this we must run some magic script, according to:

"There are administrative scripts that leave control of the re-synchronization process in the hands of the system administrator."

I tried to find this script under

but, as you can guess, unsuccessfully.

How does this script work?

What can we pass to it?

Can it be done automatically?
