2012-11-08 00:00:26 by jdixon
Update / TL;DR: Thanks to Bernd Ahlers (@berndahlers) for clueing me into the fact that you can call rufus-scheduler directly rather than indirectly through resque-scheduler. Because it uses EventMachine, there's no need to run separate worker processes or queue up the jobs. Consider me sold. The changes have already been committed.
If you still want to read the original post, continue on.
Today I merged in a refactor of the Descartes bits that deal with metrics. Specifically, the live Metrics tab and sparklines view. This will have a profound effect on performance, but can also have a surprising effect on your wallet if you're not paying attention.
So, a little background on how Descartes used to operate and why this change was necessary. Not too long ago I added a new Metrics page that displays sparklines for every metric in your Graphite server and lets you click on them to create a composite graph. Although the page is still rather immature, it's useful for basic visualization and graph creation. Personally I think its major selling point right now is in the sparklines I mentioned. This is one thing that you don't really get with native Graphite -- being able to quickly see activity patterns on any metrics without going through the hassle of actually creating a graph. This is made that much more awesome by the presence of live filtering. Click on the Add to Graph button and you're presented with an additional input field that, as you type a string, will filter down the list of metric sparklines you're viewing in realtime.
As cool as this is, it's not without its headaches. You see, the Graphite API doesn't provide any good search capabilities for metrics. More specifically, you have to give it the path that you're interested in, and it will return a list of leaf or branch nodes within that path. While this is satisfactory for Graphite's navigation tree, it's terrible for full-text searches. Fortunately, Aman Gupta (@tmm1) added support for a full metrics key dump at the /metrics/index.json API endpoint. However, because it has to do a full walk of the filesystem, it's rather slow and, on particularly busy servers with lots of unique metrics, can result in a very large dump.
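A sketch of what consuming that endpoint looks like from Ruby, using only the standard library. The Graphite URL is an assumption, and `search_metrics` is a hypothetical helper illustrating the kind of full-text filtering the dump makes possible:

```ruby
require 'net/http'
require 'json'
require 'uri'

# Pull the full key dump from Graphite. The endpoint returns a flat
# JSON array of every metric key, e.g. ["carbon.agents.foo.cpuUsage", ...].
# graphite_url (e.g. "http://graphite.example.com") is an assumption.
def fetch_metric_index(graphite_url)
  uri = URI.join(graphite_url, '/metrics/index.json')
  JSON.parse(Net::HTTP.get(uri))
end

# Case-insensitive substring search over the dumped keys -- the sort
# of full-text filtering Graphite's path-based API can't do directly.
def search_metrics(keys, query)
  keys.grep(/#{Regexp.escape(query)}/i)
end
```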
Regardless, this is still useful data. The original implementation of the metrics search was overly simple. When a user opened the Metrics page, we sent an ajax request to the Graphite server, downloaded the json blob, parsed it and stored it in memory. As long as the user stayed on that page, the post-load experience was quite good, at the expense of initial load time and not knowing whether any new metrics had been submitted since then. But it became obvious that even moderately loaded Graphite installations would result in significant load times waiting for the json dump. On our production server with 60k+ unique metrics, the dump takes 12 to 14 seconds to download.
Naturally this suggests we use some sort of caching mechanism. While introducing a web cache like Varnish would certainly help, I knew that long-term I wanted to make this data available as a service. I considered storing the object in Redis, but it bothered me that people might not have the foresight to host their Redis and web service nearby, effectively putting us right back where we started (high transit latency). In the end, I decided that the best place to cache this was in the Sinatra app itself. We can download the blob at rackup, store it in a pseudo model object, and update it in the background with a scheduled job. Add a text search endpoint to the app and we're gold, Jerry. Gold.
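In pseudo-model form, that in-process cache might look something like the sketch below. `Metric` here is a hypothetical class, not Descartes' actual implementation; it just shows the shape of the idea (load once at rackup, swap wholesale on refresh, serve text searches from memory):

```ruby
# A hedged sketch of the in-process metrics cache. The key list is
# loaded once at rackup and replaced atomically by the background
# refresh job; searches never touch Graphite.
class Metric
  @list = []
  @mutex = Mutex.new

  class << self
    # Replace the cached key list wholesale. Called at rackup and
    # again by each scheduled refresh.
    def load!(keys)
      @mutex.synchronize { @list = keys }
    end

    # Case-insensitive substring search, backing a text search
    # endpoint such as GET /metrics/search?q=cpu in the Sinatra app.
    def search(query)
      pattern = /#{Regexp.escape(query)}/i
      @mutex.synchronize { @list.grep(pattern) }
    end
  end
end
```

The mutex is there because the refresh job and web requests run in the same process; swapping the whole array keeps searches from ever seeing a half-updated list.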
To accommodate the scheduled jobs I settled on resque-scheduler. While it would've been simpler (and potentially less dyno-expensive) to use something like Heroku's Scheduler addon, I detest the notion of tightly coupling an OSS project to any specific commercial platform. Thus, the decision was made to go with Resque and the resque-scheduler project. This results in two additional process types in the Procfile: a scheduler process for queueing up future jobs, and a worker process for pulling jobs off the queue and running the code that updates the metric list in memory.
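Under that original setup, the Procfile would look roughly like this (the exact web line is an assumption; the rake tasks are the ones resque and resque-scheduler provide):

```
web: bundle exec rackup config.ru -p $PORT
scheduler: bundle exec rake resque:scheduler
worker: QUEUE=* bundle exec rake resque:work
```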
Therefore, consider yourself advised that if you are running Descartes on Heroku and want to continue to do so (at least, with a working Metrics tab), you will need to pay for the additional dyno time. I hope this isn't an issue for many of you. Conversely, you can run this quite easily on your own physical, cloud or VPS servers with the bundled management facilities (i.e. Bundler, Foreman, Rake, etc).
I might yet be motivated to extract this capability as an external component. But probably not.
I remain a huge fan of Heroku, and continue to target their platform for my personal development, but I also strive for platform independence for users at large. You should always be able to take Descartes (or any of my projects) and run them anywhere with a minimum of effort.