Adding a Metrics Cache to Descartes

2012-11-08 00:00:26 by jdixon

Update / TL;DR: Thanks to Bernd Ahlers (@berndahlers) for clueing me in to the fact that you can call rufus-scheduler directly rather than indirectly through resque-scheduler. Because it uses EventMachine, there's no need to run separate worker processes or queue up the jobs. Consider me sold. The changes have already been committed.
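
For the curious, the in-process version boils down to something like this. This is a minimal sketch assuming the rufus-scheduler 2.x API; the interval and the MetricList.refresh! method are hypothetical stand-ins, not the exact code that was committed:

```ruby
require 'rufus-scheduler'

# Rufus::Scheduler.start_new uses an EventMachine periodic timer when EM is
# running, otherwise it falls back to a plain scheduler thread. Either way,
# no extra worker processes and no queue.
scheduler = Rufus::Scheduler.start_new

scheduler.every '10m' do
  MetricList.refresh!   # hypothetical: re-pull the Graphite key dump into memory
end
```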

If you still want to read the original post, continue on.


Today I merged in a refactor of the Descartes bits that deal with metrics -- specifically, the live Metrics tab and sparklines view. It has a profound effect on performance, but it can also have a surprising effect on your wallet if you're not paying attention.

So, a little background on how Descartes used to operate and why this change was necessary. Not too long ago I added a new Metrics page that displays sparklines for every metric in your Graphite server and lets you click on them to create a composite graph. Although the page is still rather immature, it's useful for basic visualization and graph creation. Personally, I think its major selling point right now is the sparklines I mentioned. This is one thing that you don't really get with native Graphite -- being able to quickly see activity patterns for any metric without going through the hassle of actually creating a graph. This is made that much more awesome by the presence of live filtering. Click on the Add to Graph button and you're presented with an additional input field that filters down the list of metric sparklines in real time as you type.

As cool as this is, it's not without its headaches. You see, the Graphite API doesn't provide any good search capabilities for metrics. More specifically, you have to give it the path that you're interested in, and it will return a list of leaf or branch nodes within that path. While this is satisfactory for Graphite's navigation tree, it's terrible for full-text searches. Fortunately, Aman Gupta (@tmm1) added support for a full metrics key dump at the /metrics/index.json API endpoint. However, because it has to do a full walk of the filesystem, it's rather slow and, on particularly busy servers with lots of unique metrics, can result in a very large dump.
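
If you're curious what that dump looks like for your own installation, a few lines of Ruby will pull it down. The GRAPHITE_URL environment variable here is just a stand-in for wherever your Graphite webapp lives:

```ruby
require 'net/http'
require 'json'
require 'uri'

# /metrics/index.json returns a flat JSON array of every metric key Graphite
# knows about, e.g. ["carbon.agents.foo.cpuUsage", "collectd.web01.load.load", ...]
uri  = URI.parse("#{ENV['GRAPHITE_URL']}/metrics/index.json")
keys = JSON.parse(Net::HTTP.get_response(uri).body)

puts "#{keys.size} metrics known to Graphite"
```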

Regardless, this is still useful data. The original implementation of the metrics search was overly simple. When a user opened the Metrics page, we sent an ajax request to the Graphite server, downloaded the json blob, parsed it and stored it in memory. As long as the user stayed on that page, the post-load experience was quite good, at the expense of a long initial load time and no awareness of any new metrics submitted since the page was loaded. But it became obvious that even moderately loaded Graphite installations would suffer significant load times waiting for the json dump. On our production server with 60k+ unique metrics, it takes between 12 and 14 seconds to download.

Naturally this suggests we use some sort of caching mechanism. While introducing a web cache like Varnish would certainly help, I knew that long-term I wanted to make this data available as a service[1]. I considered storing the object in Redis, but it bothered me that people might not have the foresight to host their Redis instance close to the web service, effectively putting us right back where we started (high transit latency). In the end, I decided that the best place to cache this was in the Sinatra app itself. We can download the blob at rackup, store it in a pseudo model object, and update it in the background with a scheduled job. Add a text search endpoint to the app and we're gold, Jerry. Gold.
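
In rough strokes, that looks something like the sketch below. To be clear, this is a simplified illustration rather than Descartes' actual code; the class name, route and matching logic are just stand-ins:

```ruby
require 'sinatra'
require 'net/http'
require 'json'
require 'uri'

# Pseudo model object holding the full Graphite key dump in memory.
class MetricList
  class << self
    # Pull the full key dump from Graphite and swap it into memory.
    def refresh!
      uri = URI.parse("#{ENV['GRAPHITE_URL']}/metrics/index.json")
      @metrics = JSON.parse(Net::HTTP.get_response(uri).body)
    end

    # Naive substring match against the cached keys.
    def search(pattern)
      (@metrics || []).select { |key| key.include?(pattern) }
    end
  end
end

MetricList.refresh!   # seed the cache at boot (e.g. from config.ru)

# Text search endpoint the frontend hits instead of downloading the whole dump.
get '/metrics/search' do
  content_type :json
  MetricList.search(params[:pattern].to_s).to_json
end
```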

To accommodate the scheduled jobs I settled on resque-scheduler. While it would've been simpler (and potentially less dyno-expensive) to use something like Heroku's Scheduler addon, I detest the notion of tightly coupling an OSS project to any specific commercial platform[2]. Thus, the decision was made to go with Resque and the resque-scheduler project. This results in two additional process types in the Procfile: a scheduler process for queueing up future jobs, and a worker process for pulling jobs off the queue and running the code that updates the metric list in memory.
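
Concretely, that means Procfile entries along these lines. These are illustrative only -- the exact commands depend on the rake tasks wired up in the Rakefile, so check the repo for the real entries:

```
web: bundle exec rackup -p $PORT
scheduler: bundle exec rake resque:scheduler
worker: QUEUE=* bundle exec rake resque:work
```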

Therefore, consider yourself advised that if you are running Descartes on Heroku and want to continue to do so (at least, with a working Metrics tab), you will need to pay for the additional dyno time. I hope this isn't an issue for many of you. Alternatively, you can run this quite easily on your own physical, cloud or VPS servers with the bundled management tooling (Bundler, Foreman, Rake, etc).

In the end, I'm really pleased with the results. The metrics list is seeded at runtime via rackup, and the ajax/search interaction works quite well. There are no perceptible load-time or interaction latencies, and I can easily adapt the search algorithm on the backend without requiring users to force-update their cached javascript libs. In fact, I've already had suggestions from other GitHubbers on how the search could be integrated with Hubot for a whole other level of awesomeness.


[1] I might yet be motivated to extract this capability as an external component. But probably not.

[2] I remain a huge fan of Heroku, and continue to target their platform for my personal development, but I also strive for platform independence for users at large. You should always be able to take Descartes (or any of my projects) and run them anywhere with a minimum of effort.

Comments

at 2012-11-08 10:21:59, Bernd Ahlers wrote in to say...

Why aren't you using some in-process scheduler like rufus-scheduler[1]? This would avoid additional processes. It either creates a scheduler thread or uses an EventMachine periodic timer.

[1] https://github.com/jmettraux/rufus-scheduler

at 2012-12-06 11:34:10, Iancu Stoica wrote in to say...

Hi jdixon - noobie here. While searching for alternatives to Cacti I found out about Graphite, and while reading up on Graphite I found your blog posts. My question is completely unrelated to the above post, but I couldn't find contact info, so I figured maybe you could throw me a hand. What I'm looking for is a way to get SNMP data from various devices and then import that data into Graphite. The way I understand it, the common approach is to set up something that listens for data, but how would I go about polling for the SNMP data? And then importing it into Graphite?

Thanks in advance.

at 2012-12-06 11:42:39, Jason Dixon wrote in to say...

@Iancu - check out https://github.com/obfuscurity/graphite-scripts/blob/master/bin/poll_snmp.pl

at 2012-12-07 03:02:30, Iancu Stoica wrote in to say...

Thanks! Will give this a try.
