Giant Robots Are Cool and Shit, But Seriously...

2011-07-15 16:54:34 by jdixon

I'm pleased to see so many people interested in the #monitoringsucks movement/campaign/whatever. My last post seemed to resonate with a lot of you out there. I'm excited to hear discussions surrounding APIs, command-line monitors, monitoring frameworks, etc. But I think a major thrust of my article was missed. It's not just that Nagios can be a pain in the ass, or that we need a modular monitoring system. What I'm trying to emphasize is that monolithic monitoring systems are bad and not suited for the task at hand.

Some very smart systems people (and developers) are trying to solve this problem in the open-source arena. Unfortunately, while they're attempting to diagnose and cure the problems in contemporary monitoring systems, they continue to architect big honking inflexible software projects. When I refer to "the Voltron of monitoring systems" I'm not talking about an enormous fucking automaton of monitoring, alerting and trending components. I mean that each component should exist independently of the others, with a stable data format and communications API. Any single component should be easy to replace or deprecate. Authors should strive for competition because it makes the inclusive architecture that much stronger.

Realistically I see one of three things happening over the next 12-18 months:

  1. A community forms around a reasonable set of defined components and begins cranking out useful bits. Over time we have what resembles a useful ecosphere of monitoring tools and users.
  2. Motivated developers continue to solve the issues affecting monitoring software, but in their own walled garden projects. We benefit from a larger pool of projects to choose from, but they all continue to suffer from NIH syndrome.
  3. I'm disregarded as a nutcase. Nothing changes and we continue to use the same crappy ubiquitous software.

At this point I think the most likely outcome is a combination of numbers 1 and 2. It's hard for anyone to justify working on a disassociated component when the related components it needs to be useful might never be developed. On the other hand, if someone working on a monolithic project has the foresight to break up the bits into a true Service Oriented Architecture, then it would be feasible for external developers to fork individual units.


at 2011-07-15 19:07:08, Dan Ryan wrote in to say...

You nailed it. I currently use Nagios, Server Density and Munin in my monitoring stack, all of which do essentially the same thing. Why? Because each app has things I like. To get what I want out of them, I cobble them together in an overly complex and precarious manner. If we had an assortment of monitoring tools that followed the Unix philosophy, "do one thing and do it well", I would be one happy camper.

at 2011-07-15 20:48:53, Jason Dixon wrote in to say...

@Dan - Coincidentally, yours (overwatch) was one of the projects I saw that inspired me to write this follow-up. I like your approach but I'm concerned that it's still "all or nothing". Do you have a roadmap for the project?

at 2011-07-15 22:12:35, Dan Ryan wrote in to say...

@Jason - What I've got on GitHub is a working example, but it's still not where I want it to be. My goal with Overwatch is to have, not one overarching product, but a suite of monitoring tools that I can combine depending on the requirements of a particular infrastructure.

Each part is its own distinct application that can run completely independently of the others: the stats collection, the charts interface, the events system, etc. Everything, including clients, communicates using JSON over HTTP. Initially this design was so Overwatch could scale horizontally very easily, but I quickly realized that it also enabled new functionality to be added via custom modules or even external services. There are certainly faster ways to do it, but I chose JSON/HTTP because they are understood by almost everyone.
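
To make the JSON-over-HTTP idea concrete, here is a minimal sketch using only the Python standard library. It is not Overwatch's actual API: the `/metrics` path, the field names (`host`, `metric`, `value`) and the `{"accepted": 1}` reply are all assumptions made up for illustration.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

received = []  # documents accepted by the hypothetical collector


class CollectorHandler(BaseHTTPRequestHandler):
    """Accepts one JSON document per POST and keeps it in memory."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length))
        except ValueError:
            self.send_response(400)  # body was not valid JSON
            self.end_headers()
            return
        received.append(payload)
        body = json.dumps({"accepted": 1}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the example quiet


# Ignore any proxy settings in the environment; we only talk to localhost.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def post_json(url, document):
    """POST a JSON document and return the decoded JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with _opener.open(request, timeout=5) as response:
        return json.loads(response.read())


def demo():
    """Start a throwaway collector, send it one sample, return its reply."""
    server = HTTPServer(("127.0.0.1", 0), CollectorHandler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = "http://127.0.0.1:%d/metrics" % server.server_address[1]
        return post_json(url, {"host": "web01", "metric": "load1", "value": 0.42})
    finally:
        server.shutdown()
        server.server_close()
```

Either side can be replaced by anything else that speaks the same JSON document over HTTP, which is the property the design is after.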

I'm still working out the logistics of it all. I'll likely do up a mind map while I hack on it this weekend to better explain how the pieces will fit together.

at 2011-08-01 15:25:37, Brian Candler wrote in to say...

I think there needs to be an overarching system *architecture* for the components to communicate meaningfully.

For example, imagine I decide to write an SNMP poller capable of polling 10,000 devices per second. The actual polling functionality is pretty straightforward. The questions become:

1. How is this component to be configured when it starts up? It could read local config files or a local database (a pain to manage). It can sit there idly, waiting for someone to POST a config to it; or it can actively call home to ask what to do. Those latter cases need some sort of central management service.

2. How is this component to report its results? Well, it can just POST them to a configured URL. However, I would like to use the data in multiple places, and I would like the poller to continue to work even if management station(s) need to be rebooted or the network goes down for a while. This implies some sort of queuing and routing service.
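
The queuing and routing requirement in point 2 can be sketched as a small store-and-forward layer. This is only an illustration under assumptions: `Spool` and `Router` are hypothetical names, the buffer lives in memory (a real poller would probably want it on disk), and each consumer is assumed to be a callable that raises `OSError` when it cannot be reached.

```python
from collections import deque


class Spool:
    """Bounded buffer that holds results until one consumer accepts them."""

    def __init__(self, send, max_items=10000):
        self._send = send  # callable(result); raises OSError when unreachable
        self._pending = deque(maxlen=max_items)  # oldest results drop first when full

    def submit(self, result):
        self._pending.append(result)
        self.flush()

    def flush(self):
        """Deliver pending results in order, stopping at the first failure."""
        delivered = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except OSError:
                break  # consumer is down; keep the result for a later retry
            self._pending.popleft()
            delivered += 1
        return delivered

    def __len__(self):
        return len(self._pending)


class Router:
    """Fans every result out to all consumers, one spool per consumer."""

    def __init__(self, sinks, max_items=10000):
        self._spools = {name: Spool(send, max_items) for name, send in sinks.items()}

    def publish(self, result):
        for spool in self._spools.values():
            spool.submit(result)

    def flush(self):
        return sum(spool.flush() for spool in self._spools.values())

    def pending(self, name):
        return len(self._spools[name])
```

Because each consumer has its own spool, a rebooting management station only delays its own copy of the data; the poller keeps polling and the other consumers keep receiving.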

at 2011-08-01 15:27:02, Brian Candler wrote in to say...

3. For those 10,000 polls, do I send 10,000 POSTs? Probably not a good idea. So I need some agreed format for batching multiple results into a single JSON document.

4. There's even more fundamental stuff, such as "how do I uniquely identify a host"? I may have a host in customer site A, but a different host in customer site B. So either poller A needs to know which VRF it is in, or the consumer needs to know that when it reads data from poller A, it corresponds to site A.

By the time you've designed a SOA to that level, arguably you've basically built a new monolithic system, where the components are relegated to plugins!
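
Points 3 and 4 can be illustrated together with one possible batch format. Nothing here is an agreed standard: the envelope fields (`site`, `poller`, `count`, `results`), the `site/address` key and the batch size are assumptions chosen for the sketch, and the addresses in the usage note come from the 192.0.2.0/24 documentation range.

```python
import json
from itertools import islice


def host_key(site, address):
    """Qualify an address with its site so overlapping addresses stay distinct."""
    return "%s/%s" % (site, address)


def batches(results, size):
    """Yield successive lists of at most `size` results."""
    iterator = iter(results)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def encode_batch(site, poller, chunk):
    """Wrap many poll results in one JSON document tagged with their origin."""
    return json.dumps(
        {"site": site, "poller": poller, "count": len(chunk), "results": chunk}
    )


def decode_batch(document):
    """Consumer side: return (site-qualified host key, result) pairs."""
    batch = json.loads(document)
    return [(host_key(batch["site"], r["address"]), r) for r in batch["results"]]
```

With a batch size of 500, the 10,000 polls become 20 POSTs rather than 10,000. Tagging the envelope rather than each result means the poller only needs to know which site it is in, and the consumer derives `site-a/192.0.2.1` and `site-b/192.0.2.1` as different hosts.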

at 2011-08-16 10:02:20, Brian Candler wrote in to say...

One other point. I think the main complaints from people using Nagios+Cacti+Smokeping are: (1) having to configure the same hosts in multiple places; and (2) running multiple sets of probes against the same hosts. To fix (1) implies some sort of central configuration database, which all the tools use to configure themselves. To fix (2) additionally implies autonomous pollers, which stream the collection data in a form that all the upstream tools can digest (for graphing, threshold detection, alerting etc). Would you not consider (1) and (2) to form another monolithic monitoring system, albeit a very good and extensible one?
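
One way to picture fixes (1) and (2) is a single table of what each upstream tool wants, from which a poller derives a deduplicated probe plan. This is a rough sketch, not a description of any existing tool: `plan_probes`, the consumer names and the probe names are all hypothetical.

```python
def plan_probes(requirements):
    """Merge per-consumer needs into one plan with each probe listed once.

    requirements maps a consumer name to the (host, probe) pairs it wants.
    The result maps each unique (host, probe) pair to the sorted list of
    consumers that should receive its data.
    """
    plan = {}
    for consumer, wanted in requirements.items():
        for key in wanted:
            plan.setdefault(key, set()).add(consumer)
    return {key: sorted(consumers) for key, consumers in plan.items()}


# Hypothetical central configuration: hosts are declared once, per consumer need.
REQUIREMENTS = {
    "alerting": [("web01", "ping"), ("web01", "http")],
    "graphing": [("web01", "ping"), ("web01", "snmp_if")],
    "latency": [("web01", "ping")],
}

PLAN = plan_probes(REQUIREMENTS)
```

Here `web01` is pinged once and the result is routed to three consumers, instead of three tools each running their own ping against the same host.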
