2012-09-27 14:13:16 by jdixon
Kicking off this year's Surge conference was a pair of BoF sessions. The #monitoringsucks one was packed, to the extent that a number of us had to steal chairs from the Chef BoF across the hall. I remembered to write down some of the highlights from the session. Note that I'm not quoting anyone directly and am summarizing each speaker to the best of my recollection. If you were at the event and remember things differently, please notify me in the comments section below.
The moderator was Chris Burroughs (@csby54), who asked everyone to briefly introduce themselves and mention something about monitoring that they hate. A pattern quickly emerged.
"This just in: Everyone hates Nagios. #surgecon" - Pete Cheslock (@petecheslock), September 27, 2012
Highlights from the discussion, in chronological order (as I remember them):
- Marcus Barczak (@ickymettle) mentioned that at Etsy they have so much data that it becomes difficult to identify interesting events in a sea of information.
- Alex Howells (@nixgeek) contrasted this with his interest in finding patterns based on the absence of data.
- I brought up the use of Holt-Winters for forecasting and dynamic alerting. Specifically, I was curious how accurate others find it, and whether Etsy (known to use Holt-Winters in production) has tweaked their version of the algorithm. Marcus Barczak confirmed that they have, but said they only use it on a very narrow set of metrics.
- Theo Schlossnagle (@postwait) pointed out that Holt-Winters is susceptible to outage data. Being able to null the data (not zero, but actually ignore the outage event) results in much better seasonal accuracy.
- Bryan Cantrill (@bcantrill) posited that scalars are less interesting than tracking latency. The same data presented as a histogram (heatmap) will reveal the latency while avoiding the tail.
- Theo Schlossnagle countered that the scalar is necessary to measure certain anomalies, e.g. "the number of sales stayed the same but revenue went down".
- I proposed that monitoring sucks but it's improving. The tools we have today are better than we had a year ago. Composability is vital to providing tools that are flexible enough for a wide variety of businesses.
- Theo Schlossnagle argued that the tools aren't necessarily better. He believes that what's helping is that we're having these discussions.
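To make the scalar-versus-histogram exchange a little more concrete, here is a minimal Python sketch. The sample latencies, the 100 ms bucket width, and the helper names (`mean`, `latency_histogram`) are all made up for illustration; nobody at the BoF presented this code. It only shows how a single average can hide a slow tail that a bucketed view makes obvious.

```python
from collections import Counter


def mean(values):
    """Plain arithmetic mean: the kind of single scalar a dashboard might show."""
    return sum(values) / len(values)


def latency_histogram(values_ms, bucket_ms=100):
    """Count samples per fixed-width bucket, keyed by the bucket's lower bound in ms."""
    counts = Counter((int(v) // bucket_ms) * bucket_ms for v in values_ms)
    return dict(sorted(counts.items()))


# Hypothetical samples: 95 fast requests (20 ms) and 5 very slow ones (2000 ms).
samples_ms = [20] * 95 + [2000] * 5

average = mean(samples_ms)
histogram = latency_histogram(samples_ms)

print("mean latency (ms):", average)
print("histogram (bucket start -> count):", histogram)
```

The mean lands somewhere that no real request actually experienced, while the histogram shows two distinct populations. Theo's counterpoint still holds: a scalar such as revenue per sale is what you would compare over time to catch the "same sales, less revenue" case.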
Chris Burroughs (@csby54) moderated the BoF and added his comments:
- Monitoring (gathering telemetry) vs alerting (waking people up). Really wish there was better terminology there. Hard not to tightly couple them.
- Are things a little less sucky now because we've gotten over step 1 (applications need to generate metrics) and are on to "how can we make sense of that"? Do we have composable parts for metrics generation yet?
- Surprisingly little conflict on passive vs active monitoring.
- Everyone really wants predictive analytics from their monitoring, but no one seems to have it yet. Very few hands for even using Holt-Winters.
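Since Holt-Winters came up more than once, here is a rough Python sketch of the textbook additive form, with one-step-ahead forecasts and a way to skip outage samples recorded as `None`, in the spirit of Theo's point about nulling rather than zeroing. This is not Etsy's tweaked version or any particular tool's implementation. The function name, the smoothing parameters, the naive first-season initialization, and the toy series are all assumptions for illustration.

```python
def holt_winters_additive(series, season_len, alpha=0.5, beta=0.1, gamma=0.1):
    """Return one-step-ahead additive Holt-Winters forecasts.

    Observations that are None are treated as missing: the forecast is still
    produced, but level, trend, and seasonal terms are not updated from them.
    """
    # Naive initialization from the first season (assumed to contain no gaps).
    first = series[:season_len]
    level = sum(first) / season_len
    trend = 0.0
    seasonal = [x - level for x in first]

    forecasts = []
    for t in range(season_len, len(series)):
        s = seasonal[t % season_len]
        forecasts.append(level + trend + s)

        observed = series[t]
        if observed is None:
            # Outage sample: roll the level forward by the trend and move on.
            level += trend
            continue

        prev_level = level
        level = alpha * (observed - s) + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        seasonal[t % season_len] = gamma * (observed - level) + (1 - gamma) * s
    return forecasts


# Hypothetical metric with a clean 4-sample season, repeated three times.
pattern = [10, 20, 30, 20]
clean = pattern * 3

zeroed = list(clean)
zeroed[6] = 0        # outage recorded as a zero
nulled = list(clean)
nulled[6] = None     # outage recorded as "no data"

zeroed_forecasts = holt_winters_additive(zeroed, season_len=4)
nulled_forecasts = holt_winters_additive(nulled, season_len=4)

print("zeroed:", [round(f, 2) for f in zeroed_forecasts])
print("nulled:", [round(f, 2) for f in nulled_forecasts])
```

On this toy series the nulled run keeps forecasting the repeating pattern, while the single zero drags the level and trend down and distorts the forecasts that follow it. A real deployment would also need a confidence band around the forecast before it could drive alerting, which this sketch leaves out.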