Just Enough Developed Infrastructure

Monitoring Wonderland Survey - Metrics - API - Gateways

Update 4/01/2012: added ways to add metrics via logs, java pickle graphite feeder

One tool to rule them all? Not.

If you are working within an enterprise, chances are that you have different metric systems in place: you might have some Cacti, Ganglia, Collectd, etc., due to historical reasons or different departments.

This reminded me of the situation while I was working in Identity Management: you might have an LDAP, Active Directory, local HR database etc. There would be plans and discussions of using one over the other, and gateways would need to be written. I learned a few lessons there:

  1. have as few sources/stores of information as possible
  2. DON'T try to chase the one tool to rule them all, aka don't use a tool for something it's not made for
  3. make it self-service for users and automate processes

1 to 1 gateways

Take the new Metrics hotness Graphite as an example: it has some nice graphing advantages over other tools. So people wonder: should I migrate my Ganglia, Collectd to Graphite? Graphite doesn't come with elaborate collection scripts for memory, disk, etc., so we have to rely on other tools like Cacti, Munin, Collectd and Ganglia to first collect the data.

So we start writing gateways to get data into Graphite:

But what happens if we also use Opentsdb for storing long-term data? We have to re-implement those gateways:

Issue 1: Effort duplication

Implementing the protocol in every tool just seems like a waste of energy. This sure isn't the first time this has happened: the same thing happened for the Collectd -> Ganglia plugin.

If you look at the data that is transmitted it is actually pretty much the same:

a metric name, value, timestamp, optionally hostname, some metadata tags

So we could easily envision a 'universal' format that would be used to translate from and to.

Ganglia  <-> Intermediate format <-> Graphite
Collectd <-> Intermediate format <-> Opentsdb

With this intermediate format, we would only have to write each end of the equation once.
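To make the idea concrete, here is a minimal sketch of what such an intermediate format could look like in Ruby. The `Metric` struct, its field names and the JSON wire encoding are all assumptions made up for illustration; no existing tool defines this format.

```ruby
require 'json'

# Hypothetical intermediate representation of a single metric.
# The fields mirror the list above: name, value, timestamp,
# optional hostname and some metadata tags.
Metric = Struct.new(:name, :value, :timestamp, :host, :tags) do
  # Serialize to one JSON document (assumed wire encoding).
  def to_wire
    JSON.generate(to_h)
  end

  # Rebuild a Metric from its JSON wire encoding.
  def self.from_wire(line)
    data = JSON.parse(line)
    new(data['name'], data['value'], data['timestamp'], data['host'], data['tags'] || {})
  end
end

metric = Metric.new('cpu.load', 0.42, 1_325_376_000, 'web01', { 'dc' => 'eu' })
copy = Metric.from_wire(metric.to_wire)
```

Each tool would then only need one reader and one writer for this structure, instead of one gateway per pair of tools.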

I started thinking of this like an ffmpeg for monitoring

Issue 2: Difficult to hook in additional listeners

Let's add another system that wants to listen in on the metrics, something like Esper, Nagios alerting, some data warehouse tools, etc. We could reuse the libraries from one end to the other, but we'll have to add more gateways and put these in place every time.

A better approach would be to use a message bus: every tool puts data on the bus, listens on it, and gets the data it needs. RI Pienaar has written about this approach extensively in his Series on Common Messaging Patterns. Also, John Bergmans has a great post on using AMQP and Websockets to get realtime graphics.

Some of the tools already have message queue integrations, but a common intermediate format seems to be missing.
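The bus pattern itself can be sketched in a few lines of Ruby. This is only an in-process stand-in for a real broker such as AMQP or ZeroMQ; the `MetricBus` class and the topic name are hypothetical and serve just to show how a new listener hooks in without a new gateway.

```ruby
# In-process stand-in for a real message bus (AMQP, ZeroMQ, ...).
class MetricBus
  def initialize
    @listeners = Hash.new { |hash, topic| hash[topic] = [] }
  end

  # Register a block that is called for every metric on the topic.
  def subscribe(topic, &handler)
    @listeners[topic] << handler
  end

  # Deliver the metric to every listener of the topic.
  def publish(topic, metric)
    @listeners[topic].each { |handler| handler.call(metric) }
  end
end

bus = MetricBus.new
graphed = []
alerts = []
bus.subscribe('metrics') { |m| graphed << m }                   # e.g. a Graphite feeder
bus.subscribe('metrics') { |m| alerts << m if m[:value] > 0.9 } # e.g. an alerting check
bus.publish('metrics', { name: 'cpu.load', value: 0.95 })
```

Adding a third consumer is one more `subscribe` call; the producers stay untouched.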

As a proof of concept I've created:

Building blocks

In this section I'll look for APIs (Ruby oriented) to get data in and out of the different metrics systems:

Graphite - IN

Sending metrics from ruby to Graphite:

These both implement the Simple Protocol, but for high performance we'd like to use the batching facility through the Pickle Format. I could not find a Pickle gem for Ruby, but this could work through the Ruby-Python gateway http://rubypython.rubyforge.org/.
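For reference, the Simple (plaintext) Protocol is just one line per metric: path, value and Unix timestamp, separated by spaces. A minimal sketch without any gem, assuming a carbon daemon on localhost and its usual plaintext port 2003 (both are assumptions, adjust to your setup):

```ruby
require 'socket'

# Format one metric as a line of Graphite's plaintext protocol:
# "<metric path> <value> <unix timestamp>\n"
def graphite_line(path, value, timestamp = Time.now.to_i)
  "#{path} #{value} #{timestamp.to_i}\n"
end

# Send a batch of [path, value, timestamp] triples over one TCP connection.
# Host and port are assumptions, not taken from any particular install.
def send_to_graphite(metrics, host = 'localhost', port = 2003)
  socket = TCPSocket.new(host, port)
  metrics.each do |path, value, ts|
    socket.write(graphite_line(path, value, ts || Time.now.to_i))
  end
ensure
  socket&.close
end
```

Calling `send_to_graphite([['servers.web01.load', 0.42, Time.now.to_i]])` would push a single data point. This does not cover the Pickle Format.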

Faster - a Java Netty based graphite relay takes the same approach https://github.com/markchadwick/graphite-relay

Another way to get your data into graphite is using Etsy's Logster https://github.com/etsy/logster

Mike Brittain explains its use very well in Take my logs... Please! - A Velocity Online Conference Session (Video, PDF)

Graphite - OUT

Getting all the data out of Graphite is impossible through the standard API. You can get the data behind a graph out as raw data, but that hardly counts.

The best option seems to be to listen in on the Graphite UDP receiver and duplicate the information onto a message bus.
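For what it's worth, the raw output of a graph is easy to parse. The sketch below assumes the layout `target,start,end,step|v1,v2,...` with `None` for missing points; the helper name `parse_graphite_raw` is made up for this example.

```ruby
# Parse one line of Graphite's raw render output, assumed to look like:
#   "target,start,end,step|v1,v2,v3"
# Missing datapoints are assumed to appear as "None".
def parse_graphite_raw(line)
  header, values = line.strip.split('|', 2)
  # The target itself may contain commas (function calls), so take the
  # last three header fields as start, end and step.
  *target_parts, from, to, step = header.split(',')
  {
    target: target_parts.join(','),
    from: Integer(from),
    to: Integer(to),
    step: Integer(step),
    values: values.to_s.split(',').map { |v| v == 'None' ? nil : Float(v) }
  }
end
```

This only returns what one graph query covers, so it remains a per-target export rather than a way to dump everything.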

An alternative might be to directly read from the Whisper storage, inspiration for that can be found in:

Opentsdb - IN

I could not find any Ruby gem that implements the Opentsdb protocol for sending data, but creating one should be trivial. Opentsdb just uses a plain TCP socket to get the data in.
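A rough sketch of such a sender, assuming the line-based `put <metric> <timestamp> <value> <tag=value ...>` command and a TSD listening on localhost port 4242 (host, port and helper names are assumptions for illustration):

```ruby
require 'socket'

# Build one "put" line. Opentsdb expects at least one tag per data point.
def opentsdb_put(metric, value, timestamp, tags)
  tag_string = tags.map { |key, val| "#{key}=#{val}" }.join(' ')
  "put #{metric} #{timestamp.to_i} #{value} #{tag_string}\n"
end

# Write a batch of prepared lines over a single TCP connection.
# Host and port are assumptions, adjust to your TSD.
def send_to_opentsdb(lines, host = 'localhost', port = 4242)
  socket = TCPSocket.new(host, port)
  lines.each { |line| socket.write(line) }
ensure
  socket&.close
end
```

Usage would be along the lines of `send_to_opentsdb([opentsdb_put('sys.cpu.user', 42.5, Time.now, 'host' => 'web01')])`.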

Opentsdb - OUT

Getting data out of Opentsdb suffers the same problem as Graphite: you can do queries on specific graph data

But you can't get all of it out, unless maybe you interface directly with the HBase/Java API. So again the best bet is to create a listener/proxy for the simple TCP protocol.

Ganglia - IN

Sending metrics to Ganglia is easy using the gmetric shell command. Early days code describing this can still be found at http://code.google.com/p/embeddedgmetric/
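From Ruby, the simplest route is to shell out to gmetric. The sketch below only builds the argument list, assuming the `--name`, `--value`, `--type` and `--units` options of the gmetric command; the helper name `gmetric_command` is hypothetical.

```ruby
# Build the argument list for a gmetric call.
# Passing an array to system() avoids shell quoting problems.
def gmetric_command(name, value, type: 'float', units: '')
  cmd = ['gmetric', '--name', name.to_s, '--value', value.to_s, '--type', type]
  cmd += ['--units', units] unless units.empty?
  cmd
end

# Example call, requires gmetric and a running gmond on the host:
# system(*gmetric_command('requests_per_sec', 12.5, units: 'req/s'))
```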

Igrigorik has a nice write-up on how to use the Gmetric Ruby gem to send metrics.

If you want to feed log files into Ganglia, Logtailer might be your thing https://bitbucket.org/maplebed/ganglia-logtailer

Ganglia - OUT

Vladimir describes the options while explaining how to get Ganglia data into Graphite.

Option 1 is to poll Gmond over TCP and get the XML of its current data:
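A minimal sketch of that polling approach, not the gmond-zmq code itself. It assumes gmond dumps its XML to any client that connects on port 8649 and that the document nests HOST and METRIC elements with NAME, VAL, UNITS and REPORTED attributes; the method names are made up for this example.

```ruby
require 'socket'
require 'rexml/document'

# Read the XML dump that gmond writes to a TCP client before closing.
# Host and port are assumptions, 8649 is the usual gmond default.
def fetch_gmond_xml(host = 'localhost', port = 8649)
  socket = TCPSocket.new(host, port)
  socket.read
ensure
  socket&.close
end

# Flatten the XML into one hash per metric.
def parse_gmond_xml(xml)
  doc = REXML::Document.new(xml)
  metrics = []
  doc.elements.each('//HOST') do |host|
    host.elements.each('METRIC') do |metric|
      metrics << {
        host: host.attributes['NAME'],
        name: metric.attributes['NAME'],
        value: metric.attributes['VAL'],
        units: metric.attributes['UNITS'],
        timestamp: host.attributes['REPORTED'].to_i
      }
    end
  end
  metrics
end
```

`parse_gmond_xml(fetch_gmond_xml)` would then return the current values of every host the gmond knows about.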

Option 2 is to listen in on the UDP protocol as an additional receiver.

I implemented both approaches in https://github.com/jedi4ever/gmond-zmq

Note: As a side effect I found that the metrics sent over UDP are actually more accurate than the values you get when you query the XML.

Collectd - IN

To send metrics to Collectd, you can use the Ruby gem from Astro that implements most of the UDP protocol.

Collectd - OUT

I give Collectd the prize for best output.

It currently implements different writers:

  • Network plugin
  • UnixSock plugin
  • Carbon plugin
  • CSV
  • RRDCacheD
  • RRDtool
  • Write HTTP plugin

And the ZeroMQ writer by deactivated - https://github.com/deactivated/collectd-write-zmq

The Binary Protocol http://collectd.org/wiki/index.php/Binary_protocol is pretty simple to listen in on.
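As an illustration, a packet is a sequence of parts, each starting with a 2-byte type and a 2-byte length (header included), both big-endian. The sketch below walks those parts and decodes only the string parts and the plain time and interval parts; the values part and everything else is returned raw. The method and constant names are hypothetical, and the type codes should be checked against the wiki page above.

```ruby
# Part type codes as I understand them from the protocol description.
STRING_PARTS = { 0x0000 => :host, 0x0002 => :plugin, 0x0003 => :plugin_instance,
                 0x0004 => :type, 0x0005 => :type_instance }.freeze
NUMERIC_PARTS = { 0x0001 => :time, 0x0007 => :interval }.freeze

# Split a received UDP payload into [name, value] pairs.
def parse_collectd_parts(packet)
  parts = []
  offset = 0
  while offset + 4 <= packet.bytesize
    type, length = packet.byteslice(offset, 4).unpack('nn')
    # Stop on a malformed or truncated part instead of looping forever.
    break if length < 4 || offset + length > packet.bytesize

    payload = packet.byteslice(offset + 4, length - 4)
    parts << if STRING_PARTS.key?(type)
               [STRING_PARTS[type], payload.unpack1('Z*')]   # null-terminated string
             elsif NUMERIC_PARTS.key?(type)
               [NUMERIC_PARTS[type], payload.unpack1('Q>')]  # 64-bit big-endian integer
             else
               [type, payload]                               # left undecoded
             end
    offset += length
  end
  parts
end
```

Feeding it the data read from a `UDPSocket` bound to the port of the Network plugin would give you the host, plugin and type names to build a metric name from.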

Munin

If you happen to use Munin, here's some inspiration, but I haven't researched it much

Circonus

If you happen to use Circonus, here's some inspiration, but I haven't researched it much

RRD interaction from ruby

For those who want to read and write directly from RRD's in ruby, please have fun:

Alert on metrics:

With all the tools in and out, and a unified intermediate format, it will be trivial to rewrite the traditional alert check tools to listen on the bus for values. This means your Nagios, your ticket system, your pager system, etc. can all listen to the same source.
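Such a check boils down to comparing an incoming value against thresholds and mapping the result to the usual Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL). A small sketch, where the module name and the thresholds are made up for illustration:

```ruby
# Nagios plugin exit codes.
module NagiosStatus
  OK = 0
  WARNING = 1
  CRITICAL = 2
end

# Map a metric value taken from the bus to a Nagios status.
def check_threshold(value, warn:, crit:)
  return NagiosStatus::CRITICAL if value >= crit
  return NagiosStatus::WARNING if value >= warn

  NagiosStatus::OK
end

status = check_threshold(0.97, warn: 0.8, crit: 0.95)
```

The same function could feed a pager or a ticket system, since the decision no longer depends on where the value was collected.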

Graphite

Opentsdb

Ganglia

New Relic

https://github.com/kogent/check_newrelic

Conclusion

It should be feasible to create an intermediate format and reuse some of these libraries to implement both IN and OUT functionality. Why not create a Fog for monitoring information? Like implements metric receive, send,

Next stop Nagios, because it deserves a blog post of its own ...