2011-07-07 23:45:30 by jdixon
For as long as I can remember, systems administrators have bitched about the state of monitoring. Now, depending on who you ask, you might get a half dozen (or more) answers as to what "monitoring" actually means. Monitoring is most commonly used as a casual catch-all term to describe one or more pieces of software that perform host and service monitoring and basic trending (graphs or charts). But in most cases, these complaints are targeted at software responsible for daily fault detection and notifications for IT shops and Web Operations. The usual whipping boy is Nagios, a popular open-source monitoring project that supports a universe of host and service checks, notifications, escalations and more.
Nagios has been the "lesser of all evils" for quite some time. Its cost (free), extensibility (high) and configuration flexibility have helped it achieve significant adoption levels across a variety of industries and range of business sizes, from small one-man web startups to Fortune 500 enterprises. It's been forked multiple times and is recognized by industry analysts as a force to be reckoned with. Regardless, those who use it do so with a fair amount of hostility. Ask around and you're likely to find more users who stay with Nagios because it's "good enough" than those who actually like it. So why doesn't Nagios have more competition in the open-source marketplace? Largely because writing an entire monitoring system from scratch is an enormous undertaking. OK, does that mean we should keep improving Nagios (or forking it... again)? Perhaps.
Or here's a better idea: stop trying to create an entire monitoring system. There's a good reason why large monitoring software projects don't do everything right for everyone. They do a lot of shit, sometimes poorly. And almost always, in a disjointed manner, caused by years of feature creep and parity. Too many moving parts, layers upon layers of antiquated code (and bugs), and an inflexibility towards newer engineering methodologies (e.g. infrastructure as code, aka operational automation). Rather than continuing to feed a bloated system, we should strive to simplify and rebuild desired features as individual components.
Think of your monitoring architecture not as the sum of its parts, but rather as a collection of related services. It's not unlike any modern web architecture using a myriad of dedicated pieces, each focused on its own service dependencies. Popular web architectures include a customer-facing interface, a backend to process, crunch and route user requests, a storage engine, possibly a search engine, caching components, and almost certainly layers of redundancy. Now re-imagine monitoring as a collective of dedicated pieces, each with its own API and a standardized data format. There are collection agents to run checks and gather metrics; a storage engine to aggregate data and respond to queries; a messaging bus to queue and deliver information between components; a state engine to index thresholds, recognize alertable conditions and track alert states; a notification service to page on alerts, handle escalations and scheduling; and of course, a front-end to view dashboards and historical trends.
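To make the idea a little more concrete, here is a rough Python sketch of three of those pieces (a messaging bus, a state engine and a notification hook) talking through one shared data format. Everything in it is hypothetical and invented for illustration: the `Metric` fields, the JSON encoding, the `Bus` and `StateEngine` classes and the `run_pipeline` helper are assumptions, not the interface of any existing tool. An in-process queue stands in for a real message broker.

```python
import json
import queue
from dataclasses import asdict, dataclass


@dataclass
class Metric:
    """Hypothetical shared data format that every component agrees on."""
    host: str
    name: str
    value: float
    timestamp: int


def encode(metric):
    """Serialize a metric to JSON so any component can consume it."""
    return json.dumps(asdict(metric), sort_keys=True)


def decode(payload):
    """Rebuild a Metric from its JSON form."""
    return Metric(**json.loads(payload))


class Bus:
    """Stand-in messaging bus; a real deployment would use a broker."""

    def __init__(self):
        self._queue = queue.Queue()

    def publish(self, payload):
        self._queue.put(payload)

    def drain(self):
        # Yield queued payloads in the order they were published.
        while not self._queue.empty():
            yield self._queue.get()


class StateEngine:
    """Indexes thresholds and tracks which (host, metric) pairs are alerting."""

    def __init__(self, thresholds):
        self.thresholds = thresholds
        self.alerting = set()

    def evaluate(self, metric):
        """Return "ALERT" or "RECOVERY" on a state change, otherwise None."""
        limit = self.thresholds.get(metric.name)
        if limit is None:
            return None
        key = (metric.host, metric.name)
        breached = metric.value > limit
        if breached and key not in self.alerting:
            self.alerting.add(key)
            return "ALERT"
        if not breached and key in self.alerting:
            self.alerting.discard(key)
            return "RECOVERY"
        return None


def run_pipeline(bus, engine, notify):
    """Pull metrics off the bus and notify only on state transitions."""
    for payload in bus.drain():
        metric = decode(payload)
        transition = engine.evaluate(metric)
        if transition:
            notify(f"{transition} {metric.host} {metric.name}={metric.value}")


# A collection agent would publish; here we do it by hand.
bus = Bus()
for value, ts in [(9.5, 100), (9.7, 160), (0.4, 220)]:
    bus.publish(encode(Metric("web01", "load", value, ts)))

sent = []  # a real notification service would page someone instead
run_pipeline(bus, StateEngine({"load": 5.0}), sent.append)
print(sent)
```

The point of the sketch is the seams, not the classes. Because the only thing the pieces share is the encoded metric, you could swap the queue for a real broker or the `sent.append` hook for a paging service without touching the state engine.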
If you've read this far, you might even be thinking "hey, I know something that can do alerting" (e.g. PagerDuty) or can store metrics (e.g. Graphite). That's fantastic, but it goes beyond simply trying to build the necessary components. One of the major hurdles with large software projects like Nagios is that they create an "all or nothing" competitive barrier to entry. Enterprising developers feel like their vision of a new monitoring tool will take too long to complete or won't be able to keep pace with future innovations. Hint: you don't need to. Take a page from the "Infrastructure as Code" movement and appreciate the benefits that come from interoperable pieces that are inexpensive, scalable and easily automated with a little code and a stable API.
We don't realize it yet, but what we really need is more competition in this area, particularly from the bottom up. Gartner's Magic Quadrant is flush with billion-dollar behemoths and rising stars all hoping to cash in on our pain. Businesses continue to empower them by buying their "one-size-fits-all" monitoring suites or poorly conceived SaaS offerings. Open-source has the talent to contend here; in fact, all we have to do is look at log indexing and search projects like Logstash and Graylog2 to see that smaller, simpler projects can compete. And not just with the big boys: we should promote competition amongst ourselves. Create a reusable, interoperable suite of components that form a cohesive base for performing fault detection, notifications, trending and analysis. Start with a single service, upload it to GitHub and start sharing your work. Be happy when someone forks it and makes it better. Be ecstatic when someone comes out with a completely different implementation that is interface-compatible with yours.
How awesome would it be if, five years from now, monitoring software looked like EC2, Cloud Foundry or Heroku? I like to envision it as the Voltron of monitoring software. Except that I'll be fucking thrilled if your Green Lion replaces mine because it's faster, has fewer bugs and a better interface. Besides, I'll be too busy hanging out with Princess Allura in the Blue Lion.