Monitoring Sucks. Do Something About It.

2011-07-07 23:45:30 by jdixon

For as long as I can remember, systems administrators have bitched about the state of monitoring. Now, depending on who you ask, you might get a half dozen (or more) answers as to what "monitoring" actually means. Monitoring is most commonly used as a casual catch-all term to describe one or more pieces of software that perform host and service monitoring and basic trending (graphs or charts). But in most cases, these complaints are targeted at software responsible for daily fault detection and notifications for IT shops and Web Operations. The usual whipping boy is Nagios, a popular open-source monitoring project that supports a universe of host and service checks, notifications, escalations and more.

Nagios has been the "lesser of all evils" for quite some time. Its cost (free), extensibility (high) and configuration flexibility have helped it achieve significant adoption levels across a variety of industries and range of business sizes, from small one-man web startups to Fortune 500 enterprises. It's been forked multiple times and is recognized by industry analysts as a force to be reckoned with. Regardless, those who use it do so with a fair amount of hostility. Ask around and you're likely to find more users who stay with Nagios because it's "good enough" than those who actually like it. So why doesn't Nagios have more competition in the open-source marketplace? Largely because writing an entire monitoring system from scratch is an enormous undertaking. OK, does that mean we should keep improving Nagios (or forking it... again)? Perhaps.

Or here's a better idea: stop trying to create an entire monitoring system. There's a good reason why large monitoring software projects don't do everything right for everyone. They do a lot of shit, sometimes poorly. And almost always in a disjointed manner, caused by years of feature creep and parity. Too many moving parts, layers upon layers of antiquated code (and bugs), and an inflexibility towards newer engineering methodologies (e.g. infrastructure as code, aka operational automation). Rather than continuing to feed a bloated system, we should strive to simplify and rebuild desired features as individual components.

Think of your monitoring architecture not as the sum of its parts, but rather as a collection of related services. It's not unlike any modern web architecture using a myriad of dedicated pieces, each focused on its own service dependencies. Popular web architectures include a customer-facing interface, a backend to process, crunch and route user requests, a storage engine, possibly a search engine, caching components, and almost certainly layers of redundancy. Now re-imagine monitoring as a collective of dedicated pieces, each with its own API and a standardized data format. There are collection agents to run checks and gather metrics; a storage engine to aggregate data and respond to queries; a messaging bus to queue and deliver information between components; a state engine to index thresholds, recognize alertable conditions and track alert states; a notification service to page on alerts, handle escalations and scheduling; and of course, a front-end to view dashboards and historical trends.
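
To make that a little more concrete, here's a rough sketch in Python of three of those pieces talking to each other: a collection agent, a messaging bus and a state engine. Everything in it is made up for illustration. The Metric fields, the StateEngine class, the run_pipeline function and the use of an in-process queue as the "bus" are all assumptions, not a reference to any existing project. The only point is that the components share nothing but a plain JSON message.

```python
import json
import queue
from dataclasses import dataclass, asdict


@dataclass
class Metric:
    """Hypothetical standardized data format shared by every component."""
    host: str
    name: str
    value: float
    timestamp: float


class StateEngine:
    """Hypothetical state engine: indexes thresholds and tracks alert states."""

    def __init__(self, thresholds):
        self.thresholds = thresholds  # metric name -> highest value still OK
        self.states = {}              # (host, metric name) -> "ok" or "alert"

    def evaluate(self, metric):
        """Return the new state if it changed, otherwise None."""
        limit = self.thresholds.get(metric.name)
        if limit is None:
            return None  # no threshold defined for this metric
        key = (metric.host, metric.name)
        new_state = "alert" if metric.value > limit else "ok"
        old_state = self.states.get(key, "ok")
        self.states[key] = new_state
        return new_state if new_state != old_state else None


def run_pipeline(samples, thresholds):
    """Push samples through agent -> bus -> state engine -> notifications."""
    bus = queue.Queue()  # stand-in for a real messaging bus

    # Collection agent: serialize each sample to JSON and publish it.
    for sample in samples:
        bus.put(json.dumps(asdict(sample)))

    # State engine consumer: decode, evaluate, emit only on state changes.
    engine = StateEngine(thresholds)
    notifications = []
    while not bus.empty():
        metric = Metric(**json.loads(bus.get()))
        change = engine.evaluate(metric)
        if change is not None:
            notifications.append((metric.host, metric.name, change))
    return notifications


if __name__ == "__main__":
    samples = [
        Metric("web1", "load", 0.5, 1.0),
        Metric("web1", "load", 9.0, 2.0),
        Metric("web1", "load", 9.5, 3.0),
        Metric("web1", "load", 0.4, 4.0),
    ]
    for note in run_pipeline(samples, {"load": 4.0}):
        print(note)
```

Because the only contract between the pieces is the JSON message, any one of them could be swapped for a different implementation (a real queue, a real pager) without touching the others.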

If you've read this far you might even be thinking "hey, I know something that can do alerting" (e.g. PagerDuty) or can store metrics (e.g. Graphite). That's fantastic, but it goes beyond simply trying to build the necessary components. One of the major hurdles with large software projects like Nagios is that they create an "all or nothing" competitive barrier to entry. Enterprising developers feel like their vision of a new monitoring tool will take too long to complete or won't be able to keep pace with future innovations. Hint: you don't need to. Take a chapter from the "Infrastructure as Code" movement and appreciate the benefits that come from interoperable pieces that are inexpensive, scalable and easily automated with a little code and a stable API.

We don't realize it yet, but what we really need is more competition in this area, particularly from the bottom-up. Gartner's Magic Quadrant is flush with billion-dollar behemoths and rising stars all hoping to cash in on our pain. Businesses continue to empower them by buying their "one-size-fits-all" monitoring suites or poorly conceived SaaS offerings. Open-source has the talent to contend here; in fact, all we have to do is look at log indexing and search projects like Logstash and Graylog2 to see that smaller, simpler projects can compete. And not just with the big boys; we should promote competition amongst ourselves. Create a reusable, interoperable suite of components that form a cohesive base for performing fault detection, notifications, trending and analysis. Start with a single service, upload it to GitHub and start sharing your work. Be happy when someone forks it and makes it better. Be ecstatic when someone comes out with a completely different implementation that is interface-compatible with yours.

How awesome would it be if, five years from now, monitoring software looked like EC2, Cloud Foundry or Heroku? I like to envision it as the Voltron of monitoring software. Except that I'll be fucking thrilled if your Green Lion replaces mine because it's faster, has fewer bugs and a better interface. Besides, I'll be too busy hanging out with Princess Allura in the Blue Lion.


at 2011-07-08 08:16:22, peter royal wrote in to say...

have you looked at circonus? ( .. its a step towards what you mention in the last paragraph.

at 2011-07-08 08:44:43, Jason Dixon wrote in to say...

I was the Product Manager for Circonus for the first year+ of its existence. So yeah, I'm pretty familiar with it, and it should come as no surprise that our work there continues to influence my view of monitoring. But my point throughout isn't that we need "a better BIG monitoring solution" but rather we need to break out the feature set into discrete components, iterate and improve upon them. Encourage competition between developers, with the goal of an interoperable (and ideally, pluggable) monitoring architecture.

at 2011-07-08 11:53:10, Mark Bainter wrote in to say...

I really do not understand the hate for Nagios. I grant that the host-oriented nature of it is now problematic given the shifts that have happened in our industry, but that's relatively recent and is not a common complaint.

I did use to think it sucked that Ethan kept such a death-grip on commit access, but then I started meeting the people who wanted to "fix" it. The problem was, none of them actually understood the monitoring problem. Their "fixes" tended towards making it broken in the same way as the monolithic lock-in solutions from vendors. Most people complaining about Nagios over the years have made me think of a 16yo driving a Ferrari and complaining that the car must suck because they can't control it.

The real glory of Nagios is that it *is* mostly a framework. It's a scheduling engine for running jobs and tracking the results. As you rightly noted, it's unbelievably extensible, allowing for some really incredible setups. Even

at 2011-07-08 14:01:33, Lennart wrote in to say...

Totally agree here. Btw, it's Graylog2 not Greylog2 :)

at 2011-07-08 14:10:02, Jason Dixon wrote in to say...

@Lennart - You're absolutely correct, my apologies. Funny thing in that Google already corrected me when I was searching for the URL, but I forgot to update my own text. :-P

at 2012-02-05 21:44:43, Geoff Flarity wrote in to say...

Join me at

at 2012-07-29 18:27:09, Nick Satterly wrote in to say...

Take a look at the Guardian's monitoring project on GitHub called "alerta". We developed it with many of the requirements you mention above in mind, i.e. distributed and de-coupled components, message bus for alert input and notification, a RESTful API, standard JSON-encoded alert format, mongodb as a status database, elasticsearch for archiving, awesome "at-a-glance" visualisation and more...

at 2014-02-25 22:54:16, Anton wrote in to say...

Hey Jason, you used to have a bad ass slide deck (on slideshare) on the Voltron of monitoring software. Can't find that anywhere on the Interwebz. Is that still publicly available anywhere?

at 2014-02-26 12:19:10, Jason Dixon wrote in to say...

@Anton - It's on SpeakerDeck:

at 2015-01-22 19:18:21, raj dutt wrote in to say...

This is really good stuff Jason, and still as true today as it was 3.5 years ago, imo.

This post, among others, along with your voltron presentation were big parts of the strategic thinking behind

See you at Monitorama in June!

