Are We Ready to Kill Thresholds?

2013-06-26 09:12:54 by jdixon

I've been hearing a lot of chatter from various sources that adaptive fault detection is going to be The New Shit ™ and that static thresholds are virtually useless because they lack context. While I agree that some of the more advanced techniques sound amazing (and make no mistake, I'm really excited about the possibilities here), it's foolish to think that thresholds as a measure of fault conditions are useless.

Read the rest of this story...

Contribute to Open Source Monitoring Projects

2013-04-03 16:40:34 by jdixon

You say you want to contribute to an Open Source project, but you're not sure where to start? Have an interest in monitoring, trending or logging software? Hop on over to the Monitorama Hackathon issues list and take a look around. In the weeks leading up to the event we seeded the repo with a bunch of tasks/feature requests/bug reports that are easily digestible over the course of a day or two. These are a great place to get started on a new project, or nail out some quick issues that have an immediate impact.

Read the rest of this story...

Thoughts from Monitorama 2013

2013-03-30 10:34:42 by jdixon

This is not your typical conference review. This is a braindump of my thoughts following the organization and execution of the 2013 Monitorama Conference and Hackathon in Boston (Cambridge), Massachusetts.

Read the rest of this story...

Monitorama Hackathon

2013-03-05 10:38:23 by jdixon

One of the overarching themes that drove me to organize Monitorama was the desire to bring together OSS developers in an effort to improve the current state of monitoring and trending software. I grew impatient with the lack of measurable progress that happens at the typical SysAdmin/WebOps/DevOps-style events, which tend to focus on automation and traditional operations fare. While I'm pleased that everyone is excited about our speaker lineup, the works we accomplish at this Hackathon will be the true barometer of our success. With this in mind I have some points to consider as you prepare for your attendance and participation at the event.

Read the rest of this story...

Assembling Uptime, Umpire and Graphite

2012-10-16 11:05:08 by jdixon

Just this morning I discovered the Uptime project over on GitHub. The author bills it as "A simple HTTP remote monitoring utility using Node.js and MongoDB". I'm already in love with this tool thanks to its composability and ease of use.

The documentation over at the Uptime project is quite good, so I won't bore you with the details. The basic gist is that you'll want to have a MongoDB server available (OS X users can just brew install mongodb) and Node.js (at least version 0.8). Clone the repo locally and then run node app.js to start the monitor (web UI) and analyzer (check engine).

Read the rest of this story...

Stray Bits from my DevOpsDays Roma Talk

2012-10-08 09:15:01 by jdixon

A few stray bits of information following my presentation at DevOpsDays Roma.

The talk has received a ton of positive feedback from everyone. The slides in particular have been getting a ton of redistribution on Twitter. I'm not sure if this is a sign that my deck is that much better than the actual talk, but whatever. I'm glad that people are finding it useful and/or informative.

Read the rest of this story...

Trip to Italy

2012-10-06 13:21:15 by jdixon

I've just concluded a week in Italy as part of my visit to speak at DevOpsDays Roma. Most people don't know this, but I was an Architecture student at Georgia Tech many years ago. As such, I was exposed to a lot of Greek and Roman history. This made a lasting impression on me; I've always dreamed of visiting Rome and it was a stroke of luck when I heard about the conference and was eventually accepted to speak.

Read the rest of this story...

#monitoringsucks BoF at Surge 2012

2012-09-27 14:13:16 by jdixon

Kicking off this year's Surge conference was a pair of BoF sessions. The #monitoringsucks one was packed, to the extent that a number of us had to steal chairs from the Chef BoF across the hall. I remembered to write down some of the highlights from the session. Note that I'm not quoting anyone directly and am summarizing each speaker to the best of my recollection. If you were at the event and remember things differently, please notify me in the comments section below.

Read the rest of this story...

Trending your PagerDuty Alerts in Graphite

2012-08-28 12:04:09 by jdixon

We've noticed an increase in alerts recently at $DAYJOB. So naturally we thought it would be helpful to begin tracking Nagios alerts in Graphite. Alas, this will only help us going forward, so I wondered how difficult it would be to retrieve historic data from PagerDuty and import it into Graphite. Turns out it isn't too hard, although we have to work around some of the limitations in PagerDuty's Incidents API.

Read the rest of this story...

My Personal Roadmap

2012-08-26 18:21:47 by jdixon

I've been a little busy lately and haven't found the time to post any new articles, Graphite-related or otherwise. For those who missed the announcement, I started working at GitHub in July. Initially I continued my work on Descartes; more recently my time has been split up among a few different projects, both inside and outside of work. Although I generally detest announcing plans before shipping them, I thought others might like to read about what I'm working on these days.

Read the rest of this story...

El Cheapo Network Graph

2012-07-08 12:45:58 by jdixon

Here's an embarrassingly simple script I threw together this morning to track network latency to a handful of remote websites/networks from my home internet. Yes, I understand that these numbers are highly influenced by my proximity to various CDN networks and bear no resemblance to how actual web browsing would perform concurrently. That isn't the point. This is merely to demonstrate a cheap and easy way to get more metrics into Graphite; and at the same time, providing me with some useful reference for when my home internet provider will inevitably have hiccups.

Read the rest of this story...

Graph Porn and Sharing

2012-07-01 13:43:14 by jdixon

Part of what I see myself doing (by writing blog posts, creating software like Tasseo, etc) is to try and help others learn better ways of communicating our operational knowledge through visualization tools and methodologies. While I've gotten a lot of positive feedback from my Graphite articles, what I haven't seen as much is a two-way sharing of the harvested data made possible through these experiences.

I think there are a couple possible reasons for this: first, we work with "propietary" data that our employers might not want divulged; second, we assume our data is immaterial and not worth sharing. For the former, I think this is a very similar argument that many of us had with employers during the push to open source software. There is much to be gained by sharing our raw data (perhaps without all of the proprietary metadata and labels that make it relevant to our business) and seeing those examples improved upon and returned by our peers.

Read the rest of this story...

Velocity 2012 Postmortem

2012-06-29 16:23:23 by jdixon

This week I traveled out west for my first Velocity conference as an attendee. I went out two years ago but I was so busy juggling exhibitor duties that I didn't get to enjoy any hallway networking or formal session. This year I went in with plans to catch as many sessions as possible, particularly those skewed towards monitoring, trending and operations workflow. As expected, I skipped quite a few talks but made up for it with a lot of quality time catching up with peers and reviewing new technologies (and philosophies) in the DevOps space.

Read the rest of this story...

Why Big Monitoring Software Sucks

2012-06-20 17:21:47 by jdixon

There are a ton of open-source and commercial monitoring tools available, so why do we claim that monitoring sucks? Certainly there are some usable tools out there; without them our systems would be even more unpredictable and unreliable than they already are. So what makes one tool sticky where others get tried and tossed aside?

Systems Administrators (and Engineers) are a finicky bunch. We prefer to build complex systems from small, sharp instruments rather than fight with larger, malleable (read: monolithic) software. There's a reason why Pingdom and Pager Duty are enormously popular among technically agile businesses. Cost is only a small part of the equation; these customers understand (implicitly, if not explicitly) that combining these small, sharp tools into a series of logically connected functions (fault detection, notifications and historical trending) is much easier than breaking apart an Enterprise-Ready monitoring suite and coercing it to meet their unique needs.

Read the rest of this story...

Polling Graphite with Nagios

2012-05-31 20:37:00 by jdixon

I'm a big proponent of using Graphite as the source of truth for monitoring systems where polling host and service checks have traditionally been the norm. Realistically, this will take a long and gradual shift in philosophy by the larger IT community. Until then, we can still use Nagios and Graphite in tandem for powering more insightful checks of our application metrics.

There are actually a few different "check_graphte" scripts out there. The first one I saw announced publicly was Pierre-Yves Ritschard's check-graphite project. Shortly afterwards I published my own check_graphite script. Pierre's version is smaller but doesn't appear to automatically invert the thresholds (e.g. if critical is lower than warning). Otherwise you should be fine using either module; the remaining differences are mostly isolated to implementation details and default values. Since this is my blog, I'm going to use my script for this example. ;-)

Read the rest of this story...

Giant Robots Are Cool and Shit, But Seriously...

2011-07-15 16:54:34 by jdixon

I'm pleased to see so many people interested in the #monitoringsucks movement/campaign/whatever. My last post seemed to resonate with a lot of you out there. I'm excited to hear discussions surrounding APIs, command-line monitors, monitoring frameworks, etc. But I think a major thrust of my article was missed. It's not just that Nagios can be a pain in the ass, or that we need a modular monitoring system. What I'm trying to emphasize is that monolithic monitoring systems are bad and not suited for the task at hand.

Some very smart systems people (and developers) are trying to solve this problem in the open-source arena. Unfortunately, while they're attempting to diagnose and cure the problems in contemporary monitoring systems, they continue to architect big honking inflexible software projects. When I refer to "the Voltron of monitoring systems" I'm not talking about an enormous fucking automaton of monitoring, alerting and trending components. I mean that each component should exist independently of the others, with a stable data format and communications API. Any single component should be easily replaceable and deprecated. Authors should strive for competition because it makes the inclusive architecture that much stronger.

Realistically I see one of three things happening over the next 12-18 months:

  1. A community forms around a reasonable set of defined components and begins cranking out useful bits. Over time we have what resembles a useful ecosphere of monitoring tools and users.
  2. Motivated developers continue to solve the issues affecting monitoring software, but in their own walled garden projects. We benefit from a larger pool of projects to choose from, but they all continue to suffer from NIH syndrome.
  3. I'm disregarded as a nutcase. Nothing changes and we continue to use the same crappy ubiquitous software.

At this point I think the most likely outcome is a combination of numbers 1 and 2. It's hard for anyone to justify working on a disassociated component when the related components it needs to be useful might never be developed. On the other hand, if someone working on a monolithic project has the foresight to break up the bits into a true Service Oriented Architecture, then it would be feasible for external developers to fork individual units.

Monitoring Sucks. Do Something About It.

2011-07-07 23:45:30 by jdixon

For as long as I can remember, systems administrators have bitched about the state of monitoring. Now, depending on who you ask, you might get a half dozen (or more) answers as to what "monitoring" actually means. Monitoring is most commonly used as a casual catch-all term to describe one or more pieces of software that perform host and service monitoring and basic trending (graphs or charts). But in most cases, these complaints are targeted at software responsible for daily fault detection and notifications for IT shops and Web Operations. The usual whipping boy is Nagios, a popular open-source monitoring project that supports a universe of host and service checks, notifications, escalations and more.

Nagios has been the "lesser of all evils" for quite some time. Its cost (free), extensibility (high) and configuration flexibility have helped it achieve significant adoption levels across a variety of industries and range of business sizes, from small one-man web startups to Fortune 500 enterprises. It's been forked multiple times and is recognized by industry analysts as a force to be reckoned with. Regardless, those who use it, do so with a fair amount of hostility. Ask around and you're likely to find more users who stay with Nagios because it's "good enough" than those who actually like it. So why doesn't Nagios have more competition in the open-source marketplace? Largely because writing an entire monitoring system from scratch is an enormous undertaking. Ok, does that mean we should keep improving Nagios (or forking it... again)? Perhaps.

Read the rest of this story...

Trending with Purpose

2011-03-18 13:52:44 by jdixon

I threw together a presentation on short notice this week for an internal tele-conference about Trending with Purpose. The end result was much better than I might have expected (even given my penchant for procrastinating). Although much of the content is specific to applications currently in use at $DAYJOB, I think there's something to take out of it even if you're not using these tools.

The content is intended for developers who might not (or know how to) use application profiling data to complement their operations' monitoring and trending efforts. Special props to the Orbitz.com developers for open-sourcing their Graphite graphing tool, as well as John Allspaw and the Etsy Engineering team for their work on StatsD, and for generally serving as innovators in the Web Operations industry.

Special note: These slides were thrown together in rapid fashion. Anyone who experiences violent reactions to Gill Sans Italic should not download this slideshow. You have been warned.

The slides are available here.

New Year's Resolutions

2010-01-01 22:28:02 by jdixon

I'm not sure how effective it is to post these here, but I'm hopeful that having them in cyberspace will help keep me motivated. I'm hereafter calling these goals rather than resolutions The latter, to me, implies something that you begin immediately. This cold-turkey approach virtually guarantees failure. The moment you trip up, the subconscious immediately considers them a lost cause and reverts to the old behavior. As goals, I think it sets a more optimistic tone and allows me to gradually adapt the preferred conduct.

Without further ado, my personal list of goals for this year (in no particular order)...

Read the rest of this story...

Business Metrics

2009-09-15 23:58:42 by jdixon

Somewhere between our first corrupt filesystem and an unlikely ascent to CTO, all Systems Administrators are taught to monitor their systems. We're trained to monitor the health of our computers and trend the usage for capacity planning and analytics. A Nagios is deployed; eventually complemented by Cacti; both of which are inevitably supplanted by Something Enterprise (TM). Services are checked, change is managed, and reports are reportified.

Have you asked yourself, what value does this offer my company? Perhaps you've correlated your database connection breakdown time with website load time. Or you noticed that the FULL backups on Sunday coincide with excessive packet loss on your Seattle firewalls. Besides buffing out some of the rough edges on your operational capabilities, how does this data work for you?

Read the rest of this story...