Why Big Monitoring Software Sucks

2012-06-20 17:21:47 by jdixon

There are a ton of open-source and commercial monitoring tools available, so why do we claim that monitoring sucks? Certainly there are some usable tools out there; without them our systems would be even more unpredictable and unreliable than they already are. So what makes one tool sticky where others get tried and tossed aside?

Systems Administrators (and Engineers) are a finicky bunch. We prefer to build complex systems from small, sharp instruments rather than fight with larger, malleable (read: monolithic) software. There's a reason why Pingdom and PagerDuty are enormously popular among technically agile businesses. Cost is only a small part of the equation; these customers understand (implicitly, if not explicitly) that combining these small, sharp tools into a series of logically connected functions (fault detection, notifications and historical trending) is much easier than breaking apart an Enterprise-Ready monitoring suite and coercing it to meet their unique needs.

I've worked for and helped build one of these larger monitoring suites. I understand what drives these vendors. They also see how painful this space is; they want to help alleviate our misery. Unfortunately, it's a very competitive industry; that is to say, there's a lot of money out there for anyone with a piece of software that doesn't completely suck. In a marathon race for customer dollars, it becomes a sprint for feature parity. They want to become that "one-stop solution" for the Event Correlation & Analysis market. Before long the product evolves into yet another me-too monitoring suite that does a lot of things adequately, but maybe only one thing (or nothing) with aplomb.

To all of these vendors, I plead - stop trying to be everything to everyone. You can't predict how we'll use your software. We can't even predict that ourselves. You might have a chest full of beautiful tools, but if they're not designed to meet our specific needs they're a waste of our time and money.

Take your hammer, for example. It works really well. It's reliable, it fits perfectly in our hand and it has a nice heft. We know how to use it and we use it only for the appropriate tasks. Your saw? Not so much. It's pretty but it gets dull really quickly and is terrible for this sort of project. We found a better saw from the vendor down the street. Together, these tools collaborate on some beautiful work because they have well-defined interfaces and meet their design specifications.

While you're busy competing on features and functionality, we're trying to circumvent your walled garden. We want to break our data free from your slightly-crappy tools (hopefully using your somewhat-less-crappy API), to use it in conjunction with another company's small, sharp alternative. We want to give you our money, but not for features, service or support that we'll never use. If you want to specialize in multiple tools, that's awesome. And by all means, reward your customers with discounts for using multiple products. But don't require us to use your services as a bundled offering. Build each piece of software with the notion that they will be used independent of each other.
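To make the "independent pieces, well-defined interfaces" idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical and invented for illustration (the `Event` shape, `threshold_check`, `Notifier`, `Trend` and `dispatch` are not any vendor's API); it only shows how fault detection, notifications and historical trending can share one small interface without knowing about each other.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Event:
    """Hypothetical shared interface: the only thing the tools agree on."""
    check: str    # name of the check that produced the event
    ok: bool      # True when the measured value is within its limit
    value: float  # the measured value, e.g. latency in milliseconds


# Any tool that consumes events is just a callable taking an Event.
Sink = Callable[[Event], None]


def threshold_check(name: str, value: float, limit: float) -> Event:
    """Fault detection: knows nothing about who consumes the result."""
    return Event(check=name, ok=value <= limit, value=value)


class Notifier:
    """Notifications: reacts only to failed checks."""

    def __init__(self) -> None:
        self.sent: List[str] = []

    def __call__(self, event: Event) -> None:
        if not event.ok:
            self.sent.append(f"ALERT {event.check}: {event.value}")


class Trend:
    """Historical trending: records every value, healthy or not."""

    def __init__(self) -> None:
        self.history: List[Tuple[str, float]] = []

    def __call__(self, event: Event) -> None:
        self.history.append((event.check, event.value))


def dispatch(event: Event, sinks: List[Sink]) -> None:
    """Hand one event to every sink; sinks never talk to each other."""
    for sink in sinks:
        sink(event)


if __name__ == "__main__":
    notifier, trend = Notifier(), Trend()
    for latency in (120.0, 350.0):
        dispatch(threshold_check("http_latency_ms", latency, 200.0),
                 [notifier, trend])
    print(notifier.sent)
    print(trend.history)
```

Swapping the notifier for another company's service, or dropping the trending piece entirely, means changing the list passed to `dispatch` and nothing else. That is the property a bundled suite tends to lose.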


at 2012-06-20 20:01:04, James Litton wrote in to say...

Appreciate the PagerDuty mention. Clearly we agree. I'd love to get in touch with you and discuss your ideas further.

at 2012-06-20 20:19:25, Peter Sankauskas wrote in to say...

Wow you really hit the nail on the head. This has been my frustration for years!

at 2012-06-20 20:27:39, Jason Dixon wrote in to say...

@James - Will you be around for Velocity?

at 2012-06-20 20:44:02, James Litton wrote in to say...

Yes, I will be. I'm not scheduled to go for the whole conference, though. I will be going to DevOps Days Mountain View on Thursday and Friday, as well. Drop me an email and we can schedule to meet up.

at 2012-06-20 20:52:33, Anonymous wrote in to say...

There are plenty of monitoring tools out there that are free. If you are paying for your tools you've probably paid too much.

at 2012-06-20 21:01:41, Jason Dixon wrote in to say...

@Anonymous - That's an unrealistic (or inexperienced) opinion. Let's take the two commercial examples from the article. Pingdom provides remote monitoring from a large number of globally distributed systems. You can't easily or cheaply replace that functionality with "free software". PagerDuty is billed as a "notification service" but their real value is in scheduling and escalations. Nagios is capable of escalations on its own, but afaik is *not* capable of the same sort of scheduling rotations.

at 2012-06-20 21:02:35, Jason Dixon wrote in to say...

@James - I'll be around for Velocity only, Mon-Wed.

at 2012-06-21 01:49:04, PIT wrote in to say...

Why not use free software like monitor.us?

at 2012-06-21 06:04:02, Jason Dixon wrote in to say...

@PIT - I'm not calling out all good (or bad) vendors here. That's not the point of the article.

at 2012-06-21 09:56:01, David Krider wrote in to say...

I wrote a sort of PagerDuty replacement for receiving Nagios and SNMPTT alerts, and alarms from our doors system. You can subscribe to keywords within alerts, and make time windows for when you want to receive messages. It's stubbed to allow pagers, email, IM, and whatever else you need for notifications. (Maybe Growl?) One nice feature is that it uses Twilio to allow you to respond to a texted Nagios alert and cause the service or host to go into service mode, or to just shut up. It's a Rails 2.x app, and it's hosted at GitHub. I know it's hacky, but there's little time to polish it up, and it's worked just great for a couple years for us now.

at 2012-06-21 10:22:11, Theo Schlossnagle wrote in to say...

The "we" you are referring to in your article is not representative of all companies/engineers. It is, of course, representative of many. Large companies are very tired of so many tools and dysfunction across organizational boundaries; growing companies are afraid of lock-in and roadmap shifts. They are both right and each could learn from the other.

I see a vendor's challenge: provide enough tooling to serve a large, multi-disciplinary, multi-process organization while maintaining high levels of interoperability through APIs and data services. That's what we focus on at Circonus. Sure, we add new features to satisfy our clients and to push the industry forward, but those features certainly don't power the sell and rarely motivate the buy.

Case in point: we support alerting and escalations within Circonus, but also support PagerDuty. Most people that use those features heavily prefer PagerDuty. That, in my opinion, is pure awesome.

To be fair, this is a damn hard challenge.

at 2012-06-21 11:25:11, Mark Tomlinson wrote in to say...

For a traditional, legacy, outdated and somewhat-crappy monitoring suite - integrations are a tricky product strategy. How much of your argument could be resolved by big-monitoring-suite vendors simply opening up their software to allow integrations?

My other response is this: who is in the buyer's position when it comes to making unilateral, dictatorial decisions about tooling? Is there a top-down regime that's hand-cuffing a creative generation of engineers?

at 2012-07-04 17:31:58, Jonathan Ginter wrote in to say...

James, couldn't agree more. Support for standards-based integration ought to be part of every software suite.

I think the genesis of the problem, though, is often the customers themselves. I don't mean to dump all responsibility at their doorstep, but they have more influence over this than they might realize. Customers often drive misguided behaviour by using kitchen-sink checklists instead of asking for integration standards. That drives vendors to build "good enough" features that clutter up the big suites. Or they end up in acquisitions in order to cover perceived gaps - a perception that is driven by customer checklists.

I think the industry would be more likely to enjoy solid integration-driven development if the customer base would start demanding it. Large vendors are coin-operated in the end.
