Are We Ready to Kill Thresholds?

2013-06-26 09:12:54 by jdixon

I've been hearing a lot of chatter from various sources that adaptive fault detection is going to be The New Shit™ and that static thresholds are virtually useless because they lack context. While I agree that some of the more advanced techniques sound amazing (and make no mistake, I'm really excited about the possibilities here), it's foolish to think that thresholds as a measure of fault conditions are useless.

Toufic Boubez gave an engaging presentation recently at DevOpsDays Mountain View. He asserts that using static thresholds means that we make the assumption that we understand what's going on (read: our data patterns). I think he is absolutely right when it comes to monitoring characteristics of a system such as resource exhaustion or throughput. However, I would argue that measuring KPIs (which are infinitely more valuable in terms of diagnosing business health) often requires the sort of human insight and familiarity that isn't easily predicted with computer models. We can make certain assumptions about the cascading effect of broken systems on our business KPIs (e.g. widget sales), but not about the KPIs themselves (which are often subject to weak seasonality, periodicity, etc.).
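To make the contrast concrete, here's a toy sketch of my own (not from Boubez's talk, and all names are hypothetical): a fixed threshold only fires when a metric crosses one line, while even a crude adaptive check, a rolling mean plus k-sigma band, notices when a metric falls far outside its own recent pattern.

```python
import statistics
from collections import deque

def static_alert(value, threshold=100.0):
    """Fires only when the metric crosses a fixed line, regardless of context."""
    return value > threshold

class AdaptiveAlert:
    """Toy adaptive check: flag values more than k sigma from a rolling mean."""
    def __init__(self, window=60, k=3.0):
        self.history = deque(maxlen=window)
        self.k = k

    def check(self, value):
        anomalous = False
        if len(self.history) >= 10:  # need some history before judging
            mean = statistics.fmean(self.history)
            stdev = statistics.pstdev(self.history)
            if stdev > 0:
                anomalous = abs(value - mean) > self.k * stdev
            else:
                # perfectly flat history: any deviation is anomalous
                anomalous = value != mean
        self.history.append(value)
        return anomalous

# An overnight workload that hums along at 90 widgets/min never trips a
# static threshold of 100 -- even when it silently collapses to 5.
adaptive = AdaptiveAlert()
for v in [90] * 30:          # steady overnight baseline
    adaptive.check(v)
print(static_alert(5))       # False: the static check misses the outage
print(adaptive.check(5))     # True: 5 is far outside the learned baseline
```

This is exactly the kind of check that works well for system metrics with stable patterns, and exactly the kind that gets confused by a KPI with weak seasonality, where "far from the recent mean" may just be Tuesday.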

Baron Schwartz gave an excellent talk last year on his work with various algorithms in identifying abnormal behavior in working systems. I think it's particularly valuable that the designers of these systems convey (at least to some degree) how they derive the state of a system's health, lest we're asked to blindly trust their black box monitoring service.

If you look back on some of Baron's writings you'll see him mention that "metrics that measure the work being done are more important than others." Anyone who's seen a Linux server hit a load of 200 while performing productive work would certainly agree. But I would posit that while those metrics are important to managing your IT resources, they are not nearly as valuable as the KPIs that demonstrate that your business objectives are (or are not) being met. I'm not sure if I'm ready to trust those measurements to anyone but the stakeholders for those indicators.

I'm completely onboard with the notion that adaptive fault detection is the future of monitoring and notifications. However, until these systems (including Etsy's Kale project) have been battle-tested in production environments, I'm not comfortable enough to wholly displace the traditional monitoring-with-thresholds model. It's one thing to get paged at 3am due to a threshold that I configured and can adjust through learned experience. It's quite another to miss an alert because of a closed-source algorithm that I can't understand or fine-tune.

I think this could be the beginning of a new age of monitoring software. Am I being cautious or simply paranoid? I'd love your feedback in the comments below.


at 2013-06-26 09:21:55, Jon Cowie wrote in to say...

I actually agree with you on this one - we didn't write Kale to totally supplant manually curated alerts, more to give visibility into the stuff you're not specifically already looking at. We're only just now beginning to look at how to generate meaningful alerts from adaptive fault detection, and it's going to be an interesting and fun road figuring that out. It may be that we *never* totally get rid of manually configured alerts, and that adaptive fault detection is used to complement them. Or it may be that adaptive skynet takes over the world.

Watch this space!!

at 2013-06-26 11:08:02, Baron Schwartz wrote in to say...

Great points as usual. I will add that all of these metric-focused things (even the ones I think are exciting) are very myopic compared to the broader systems and services view you hinted at when you talked about KPIs.

at 2013-08-20 12:27:33, Rob Ottaway wrote in to say...

I'm pretty optimistic about how projects such as Kale can help me to better understand where and how human-produced metrics should apply. Skyline gives you a lot of hints, and Oculus lets you dig in and make notes. Together I feel like it's a great concept, and a really good tool for putting together more informed human-produced metrics. I'm not sure I'd wholly trust any algorithms (yet) to decide when to fire off a production-level alert.
