Everybody Loves Graphite

2015-11-05 23:15:04 by jdixon

There was an article published recently - not here, and not to be linked or referenced here directly - proposing that "nobody loves Graphite" anymore. A linkbait title if I've ever heard one. Many folks linked this article to me, almost certainly expecting me to respond in an uproar. And yet, what I find myself really disappointed over is the obvious misrepresentation of fact (as it pertains to Graphite's technical limits) and an almost malicious disregard for the enormous community that uses it and contributes back to its ongoing development.

It's almost as if they're trying to sell you something. Nah, that couldn't be it.

I readily concede that Graphite was not designed for the transient nature of the sort of bleeding-edge containerized, clustering systems that are becoming popular in conference talks and on Hacker News (if not in actual use in production, but we'll forgive them this tiny oversight). Admittedly, it takes an expertly skilled engineer to craft a background job for the purpose of removing old cruft. It's not every SysAdmin that knows how to cron, after all.

The rumor being spread, according to this particular author, is that certain individuals are now using Grafana (gasp!) or "custom dashboards built in-house" (double gasp!) rather than Graphite's user interface. I guess this means that because folks are building and using tools designed to consume Graphite's API, that... nobody loves Graphite? Hmm.

As someone who's developed a variety of custom interfaces for the Graphite API, this message resonates with me. I mean, what is Graphite if not just a simple UI for composing graphs? I mean, wouldn't it be grand if it had its own storage engine (or better yet, pluggable storage backends)? Imagine a world where Graphite could export JSON, to be consumed and rendered using external libraries or third-party tools. I guess that's just a pipe dream.
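For anyone who hasn't seen the pipe dream in action, here is a minimal sketch of consuming Graphite's JSON output from Python. It assumes the render endpoint's `format=json` response shape (a list of objects, each with a `target` and a list of `[value, timestamp]` datapoints). The hostname, metric name, and helper function names are hypothetical, and the sample payload is made up for illustration.

```python
import json
from urllib.parse import urlencode


def render_url(base_url, target, from_="-1h"):
    """Build a render API URL that asks for JSON rather than a PNG graph."""
    query = urlencode({"target": target, "from": from_, "format": "json"})
    return f"{base_url.rstrip('/')}/render?{query}"


def parse_render_json(payload):
    """Convert the JSON text into {target: [(timestamp, value), ...]}.

    Graphite reports gaps as null values; those points are dropped here.
    """
    series = {}
    for entry in json.loads(payload):
        series[entry["target"]] = [
            (timestamp, value)
            for value, timestamp in entry["datapoints"]
            if value is not None
        ]
    return series


if __name__ == "__main__":
    # Hypothetical host and metric name, for illustration only.
    print(render_url("http://graphite.example.com", "servers.web01.load"))

    # Made-up payload in the shape described above.
    sample = (
        '[{"target": "servers.web01.load", '
        '"datapoints": [[0.5, 1446764400], [null, 1446764460]]}]'
    )
    print(parse_render_json(sample))
```

Fetching the URL with any HTTP client and handing the body to `parse_render_json` is all a custom dashboard needs to get started.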

Regardless, the article proceeds to make a number of claims about what Graphite assumes. Let's dissect these briefly.

Relatively few metrics

You can't prove a negative here, so I'm not sure where the author is getting their facts. Production Graphite deployments are known to support many, many millions of metrics. Graphite is designed to scale horizontally, and it does so quite well. I personally helped design and build out large-scale Graphite deployments at Heroku, GitHub, and Dyn.

Relatively long-lived metrics

I'm not quite sure when the ability to retain data over time became A Bad Thing ™, but Graphite (or Whisper, I should say) is perfectly happy to let you keep your data as short (minimum one-second resolution) or as long (it has to be a finite length of time) as you wish. If you no longer need that metric, simply delete the file (or set up a cron job to purge them for you).
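As a rough sketch of what such a purge job might look like, the script below walks a Whisper directory and removes `.wsp` files that have not been written to in a given number of days. It assumes a file's modification time reflects its last update. The storage path, the 30-day threshold, and the function names are my own placeholders, so adjust them for your install and run it in dry-run mode first.

```python
import os
import time


def find_stale_whisper_files(root, max_age_days=30, now=None):
    """Return sorted paths of .wsp files not modified within max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if name.endswith(".wsp") and os.path.getmtime(path) < cutoff:
                stale.append(path)
    return sorted(stale)


def purge(paths, dry_run=True):
    """Delete the given files unless dry_run is set; return how many were considered."""
    for path in paths:
        if not dry_run:
            os.remove(path)
    return len(paths)


if __name__ == "__main__":
    # Assumed storage location; change it to match your deployment.
    whisper_root = "/opt/graphite/storage/whisper"
    stale = find_stale_whisper_files(whisper_root, max_age_days=30)
    for path in stale:
        print(path)
    # Flip dry_run to False once the printed list looks right.
    purge(stale, dry_run=True)
```

Saved as a script and scheduled nightly from cron, this is about all the expertise the job requires.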

A metric's datapoints will exist continually

Didn't we just cover that one? Oh well, I'll excuse them for needing more than two reasons to tarnish Graphite.

The article does make some valid points about irregular data and cardinality. However, I would argue that as your data becomes highly irregular and its cardinality keeps growing, we're no longer talking about time-series data. We're almost never talking about application or host (or container) data. We're talking about the sort of user and event-driven data intended for analytical systems.

None of the systems mentioned in that article are tailored for that workload. They are designed for time-series data, which is successive, sequential, and "more regular than not". In fact, the very same StatsD that the author praises in the article is specifically designed to normalize irregular time-series data.
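To make "normalize" concrete, here is a simplified illustration of the idea: events arriving at arbitrary times are summed into fixed-width intervals, so whatever sits downstream sees one value per interval. This is not StatsD's actual implementation. The 10-second interval, the zero-filling of idle intervals, and the function name are assumptions made for the sketch.

```python
from collections import defaultdict


def bucket_counts(events, interval=10):
    """Sum (timestamp, count) events into fixed-width intervals.

    Returns a list of (interval_start, total) pairs covering every interval
    between the first and last event, with idle intervals reported as 0.
    """
    totals = defaultdict(int)
    for timestamp, count in events:
        # Align each event to the start of the interval it falls in.
        start = int(timestamp) // interval * interval
        totals[start] += count
    if not totals:
        return []
    first, last = min(totals), max(totals)
    return [
        (start, totals.get(start, 0))
        for start in range(first, last + interval, interval)
    ]


if __name__ == "__main__":
    # Three irregularly spaced events become three evenly spaced datapoints.
    events = [(100.2, 1), (103.9, 2), (127.5, 1)]
    print(bucket_counts(events))
```

The irregular input goes in, and a regular series with one point every ten seconds comes out, which is exactly the shape Whisper expects.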

When all is said and done, what bothers me the most is the article's central premise: that Graphite is not a useful system for the thousands of businesses, developers, and users out there. That simply is not true. While some folks play around with the newest toy databases, breaking them and losing data all the while, Graphite rolls on, serving up graphs and time-series data dutifully and reliably. It doesn't care whether you're using the built-in Composer, the raw data exports, or its JSON format (made all the more popular by the aforementioned Grafana UI).

Admittedly, Graphite isn't the shiniest new monitoring project out there. The great thing is, it doesn't need to be. For the overwhelming majority of businesses trying to keep their servers and applications running, Graphite runs quietly behind the scenes, accepting your metrics and providing visibility to your operations. It's the Volvo of time-series systems, and I'm grateful for that.

In the end, it doesn't matter to me if everybody loves Graphite, but I for one still do.


at 2015-11-06 07:27:04, Michael W Lucas wrote in to say...

I am HIGHLY glad that you're writing the Graphite book. Because Graphite is great, and it deserves a book.

If you didn't write the book, I would have to. And I already have enough books to write, thank you very much.

at 2015-11-09 15:19:57, Joshua Buss wrote in to say...

Nice post, Jason. I've been working with it since it was still a beta internally at Orbitz, took it with me to the next three companies I worked with, and they've all been 'improved' if I dare say it because of that.

Sure, I'm trying out the latest InfluxDB, and sure, I've experimented with sending our operational metrics to kairosdb.. but you know what? ...nothing is as straightforward and quick to use, so like you said, it's got its place.

at 2015-11-11 08:14:21, Ian McCarthy wrote in to say...

I have had/am having trouble scaling Graphite. My company takes in 5 million metrics a minute, so we eventually split across 5 clusters in two datacenters. Now on our custom dashboard solution we have large problems stemming from cross-cluster querying. Right now over half of our graphs return no data because the other clusters didn't return anything or, even more bafflingly, return that they don't have data they should have. Have you come across this before? And/or know a way around it?

at 2015-11-13 09:31:53, Jason Dixon wrote in to say...

@Ian - 80k metrics/sec really isn't that much, especially if you're splitting it across multiple nodes. I've been able to benchmark hundreds of thousands of metrics on a single i2.4xl EC2 node with SSD. I would suggest posting your issue to https://answers.launchpad.net/graphite/ so we can discuss more thoroughly.

at 2016-10-09 09:49:22, Cody wrote in to say...

Too bad Graphite doesn't run on Windows.
