Just Enough Developed Infrastructure

Devops Areas - Codifying devops practices

(2012-05-12)

While working on the Devops Cookbook with my fellow authors Gene Kim, John Willis and Mike Orzen, we are gathering a lot of "devops" practices. For some time we struggled with structuring them in the book. I figured we were missing a mental model to relate the practices/stories to.

This blogpost is a first stab at providing a structure to codify devops practices. The wording and descriptions are very much a work in progress, but I found them important enough to share and get your feedback.

Devops in the right perspective

As you probably know by now, there are many definitions of devops. One thing that occasionally pops up is that people want to change the name to extend it to other groups within the IT area: star-ops, dev-qa-ops, sec-ops, ... I think the people involved in the first devops thinking had, from the beginning, the idea to expand the thought process beyond just dev and ops (but a name like bus-qa-sec-net-ops wouldn't be that catchy :).

I've started referring to:

  • devops : collaboration and optimization across the whole organisation. Even beyond IT (HR, Finance...) and company borders (Suppliers)
  • devops 'lite' : when people zoom in on 'just' dev and ops collaboration.

As Damon Edwards rightly pointed out, devops is not about a technology; devops is about a business problem. The Theory of Constraints tells us to optimize the whole and not the individual 'silos'. For me that whole is the business-to-customer problem or, in lean speak, the whole value chain. Bottlenecks and improvements could happen anywhere and still have a local impact on the dev and ops part of the company.

So even if your problem exists in dev or ops, or somewhere in between, the optimization might need to be done in another part of the company. As a result, describing prescriptive steps to solve the 'devops' problem (if there is such a problem) is impossible. The problems you're facing within your company could be vastly different, and the solutions to your problem might have different effects/needs.

If we can't be prescriptive, we can at least gather the practices people have used to overcome similar situations. I've always encouraged people to share their stories so other people could learn from them (one of the core reasons devopsdays exists). This helps in capturing practices; I'll leave open whether they are good or best practices.

Currently a lot of the stories/practices zoom in on areas like deployment, dev and ops collaboration, metrics, etc. (Devops Lite). This is a natural consequence of having dev and ops in the term's name, and of the background of the people currently discussing the approaches. I hope that in the future this discussion expands to other company silos too: for instance, synergizing HR and Devops (Spike Morelli) or relating our metrics to financial reporting.

Another thing to be aware of is that a system/company is continuously in flux: whenever something in the system changes, it can have an impact. So you can't take for granted that problems and bottlenecks will not re-emerge after some time. It needs continuous attention. That will be easier as you get closer to a steady state, but still, devops, like security, is a journey, not an end state.

Beyond just dev and ops

Let's zoom in on some of the practices that are commonly discussed: the direct field between 'dev' and 'ops'.

In most cases, 'dev' actually means 'project' and 'ops' represents 'production'. Within projects we have methodologies like Scrum, Kanban, ... and within operations ITIL, Visible Ops, ... Both sides have been extending their methodology over the years: from the dev perspective this has led to 'Continuous Delivery', and from the ops side ITIL was extended with Application Lifecycle Management (ALM). They both worked hard on optimizing their own part of the company and less on integration with other parts. Those methodologies had a hard time solving a bottleneck that lay outside their 'authority'. I think this is where devops kicks in: it seeks active collaboration between different silos so we can start seeing the complete system and optimize where needed, not just in individual silos.

Devops Areas

In my mental model of devops there are four 'key' areas:

  • Area 1 : Extend delivery to production (Think Jez Humble) : this is where dev and ops collaborate to improve anything on delivering the project to production
  • Area 2 : Extend Operation to project (Think John Allspaw) : all information from production is radiated back to the project
  • Area 3 : Embed Project(Dev) into Operations : when the project takes co-ownership of everything that happens in production
  • Area 4 : Embed Production(Ops) into Project : when operations are involved from the beginning of the project

In each of these areas there will be a bi-directional interaction between dev and ops, resulting in knowledge exchange and feedback.

Depending on where your most pressing 'current' bottleneck manifests itself, you may want to address things in different areas. There is no need to address things in area 1 first and then area 2. Think of them as pressure points that you can press on, but which require balanced pressure.

Area 1 and Area 2 tend to be heavier on the tools side, but are not strictly tools focused. Area 3 and Area 4 are more related to people and cultural changes, as their 'reach' is further down the chain.

When visualized in a table this gives you:

As you can see:

  • the DEV and OPS part keep having their own internal processes specific to their job
  • the two processes are becoming aligned and the areas extend both DEV and OPS to production and projects
  • it's almost like a double loop with area1 and area2 as the first loop and area3 and area4 as the second loop

Note 1: these areas definitely need 'catchier' names to make them easier to remember.

Note 2: Ben Rockwood's post on "The Three Aspects of Devops" already lists three aspects, but I think the areas make it more specific.

Area Layers

In each of these areas, we can interact at the traditional 'layers' tools, process, people:

So whenever I hear a story, I try to relate its practice to one of the areas described above and to the layer it's addressing. Practices can have an impact at different layers, so I see areas and layers as 'tags' to quickly label stories. Another benefit is that whenever you look at an area, you can ask yourself what practices could improve each of these layers. To have maximum impact, it's clear that the approach needs to work on all three layers.
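
As a rough illustration of the tagging idea, a story can be modelled as a small record that carries one area and one or more layers. This is only a sketch: the names Practice, AREAS, LAYERS and untouched_layers are hypothetical, and the two example stories are made up.

```python
from dataclasses import dataclass, field

# Hypothetical labels, taken from the four areas and three layers in this post.
AREAS = {
    1: "Extend delivery to production",
    2: "Extend operations to project",
    3: "Embed project (dev) into operations",
    4: "Embed production (ops) into project",
}
LAYERS = ("tools", "process", "people")


@dataclass
class Practice:
    """A story or practice tagged with one area and one or more layers."""
    title: str
    area: int
    layers: set = field(default_factory=set)

    def __post_init__(self):
        # Reject tags that are not part of the model.
        if self.area not in AREAS:
            raise ValueError(f"unknown area: {self.area}")
        unknown = set(self.layers) - set(LAYERS)
        if unknown:
            raise ValueError(f"unknown layers: {sorted(unknown)}")


def untouched_layers(practices, area):
    """Return the layers that no practice in the given area addresses yet."""
    covered = set()
    for practice in practices:
        if practice.area == area:
            covered |= set(practice.layers)
    return [layer for layer in LAYERS if layer not in covered]


# Made-up stories, only to show the shape of the data.
stories = [
    Practice("Automated deployment pipeline", area=1, layers={"tools", "process"}),
    Practice("Production metrics on team dashboards", area=2, layers={"tools"}),
]
print(untouched_layers(stories, area=1))  # ['people']
```

Asking which layers are still untouched in an area is one way to turn the tags into the "what can we improve here" question.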

The ultimate devops tools would support the whole of people and process in all of these areas, not just in Area 1 (deployment) or Area 2 (monitoring/metrics). Therefore a devops toolchain, with different tools interacting in each of the areas, makes more sense. Also, the tool by itself doesn't make it a devops tool: configuration management systems like chef and puppet are great, but when applied in Ops only they don't help our problem much. Of course Ops gets infrastructure agility, but it isn't until the tool is applied to delivery (for instance to create test and development environments) that it becomes 'devops'. This shows that the mindset of the person applying the tool makes it a devops tool, not the tool by itself.

Area Maturity Levels

Now that we have the areas and layers identified, we want to track progress as we start solving our problems and are improving things.

Adrian Cockcroft suggested using CMMI levels for devops:

CMMI levels allow you to quantify the 'maturity' of your process. That addresses only one layer (although an equally important one). In a nutshell CMMI describes the different levels as:

  1. Initial : Unpredictable and poorly controlled process and reactive nature
  2. Managed : Focused on project and still reactive nature
  3. Defined : Focused on organization and proactive
  4. Quantitatively Managed : Measured and controlled approach
  5. Optimizing : Focus on Improvement

All these levels could be applied to dev, ops or devops combined. They give you an idea of what level your process is at while you are optimizing in an area.

An alternative way of expressing maturity levels is used by the Continuous Integration Maturity Model. It places a set of practices, based on industry consensus, at levels of maturity:

  1. Intro : using source control ...
  2. Novice : builds triggered by commit ...
  3. Intermediate : Automated deployment to testing ..
  4. Advanced : Automated Functional testing ...
  5. Insane : Continuous Deployment to Production ...

Instead of focusing on the process only, it could be applied to a set of tools, process or people practices. What people consider the most advanced would get the highest maturity level.

Practices, Patterns and principles

A practice could be anything from an anecdotal item to a systemic approach. Similar practices can be grouped into patterns to elevate them to another level. Similar to the Software Design Patterns we can start grouping devops practices in devops patterns.

Practices and patterns will rely on principles, and it's these underlying principles that will guide you on when and how to apply the pattern or practice. These principles can be 'borrowed' from other fields like Lean, Systems Theory, Human Psychology, etc. The principles are what the agile manifesto is about, for example.

Slowly we will turn practices -> patterns -> principles.

Note: I'm wondering whether new principles will emerge from devops itself, or whether it will be a matter of applying existing principles from a new perspective.

A few practical examples:

Below are a few example 'practices' codified in a standard template. The practices/patterns/principles are not yet very well described. The point is more that this can serve as a template to codify practices.

Area Indicators

The idea is to list metrics/indicators that can be tracked. The numbers as such might not be too relevant, but the rate of change would be. This is similar to tracking the velocity of story points or the mean time to recovery.

Note: I'm scared of presenting these as metrics to track, therefore I call them indicators to soften that.

Examples would be :

  • Tools Layer : Deploys/Day
  • Process Layer : Number of Change Requests/Day
  • People Layer : People Involved per deploy
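
To make the "rate of change matters more than the number" point concrete, here is a minimal sketch that turns a series of indicator samples into relative changes. The function name rate_of_change and the sample values are invented for illustration.

```python
def rate_of_change(samples):
    """Return the relative change between consecutive samples.

    samples: list of numbers, e.g. deploys per day averaged per week.
    A previous value of 0 yields None, since relative change is undefined there.
    """
    changes = []
    for previous, current in zip(samples, samples[1:]):
        if previous == 0:
            changes.append(None)
        else:
            changes.append((current - previous) / previous)
    return changes


# Made-up weekly averages of deploys/day, purely for illustration.
deploys_per_day = [2, 3, 3, 6]
print(rate_of_change(deploys_per_day))  # [0.5, 0.0, 1.0]
```

A team going from 2 to 3 deploys a day and a team going from 20 to 30 show the same 50% change, which is the comparison the indicator is meant to support.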

This is not yet fleshed out enough; I'm guessing it will be based on the research I did for my Velocity 2011 presentation (Devops Metrics).

Devops Scorecard

To present progress during your 'devops' journey you can put all these things in a nice matrix, to get an overview on where you are at optimizing at the different layers and areas.

Obviously this only makes sense if you don't lie to yourself, your boss, your customers.
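
A scorecard like this could be as simple as a matrix of maturity levels per area and layer. The sketch below assumes levels 1 to 5 as in the CMMI list; the function name render_scorecard and the scores are invented for illustration.

```python
AREAS = ("Area 1", "Area 2", "Area 3", "Area 4")
LAYERS = ("tools", "process", "people")


def render_scorecard(scores):
    """Render a plain-text matrix of maturity levels per area and layer.

    scores: dict mapping (area, layer) to a level; missing cells show '-'.
    """
    header = f"{'':8}" + "".join(f"{layer:>9}" for layer in LAYERS)
    rows = [header]
    for area in AREAS:
        cells = "".join(
            f"{scores.get((area, layer), '-'):>9}" for layer in LAYERS
        )
        rows.append(f"{area:8}" + cells)
    return "\n".join(rows)


# Invented self-assessment, only to show the shape of the matrix.
scores = {
    ("Area 1", "tools"): 3,
    ("Area 1", "process"): 2,
    ("Area 2", "tools"): 2,
}
print(render_scorecard(scores))
```

Cells that stay at '-' are the combinations nobody has assessed yet, which is useful information in itself.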

Project Teams, Product Teams and NOOPS

Jez Humble often talks about project teams evolving into product teams: larger silos will split up, not by skill, but by the product functionality they are delivering. Splitting teams like that has the potential danger of creating new silos. It's obvious these product teams need to collaborate again. You should treat other product teams as external dependencies, just like other silos. The areas of interaction will be very similar.

You can also see the term NOOPS as working with product teams outside your company, like when you rely on SAAS for certain functions. It's important to integrate in each of the areas not only on the tools layer, but also on the people and process layers, something that is often forgotten. Automation and abstraction allow you to go faster, but when things fail, or even when changes occur, synchronisation needs to happen.

CAMS and areas

The CAMS acronym (Culture, Automation, Measurement, Sharing) could be loosely mapped onto the areas structure:

  • Automation seems to map to Area1: the delivery process
  • Measurement seems to map to Area2: the feedback process
  • Culture to Area3 : embedded devs in Production
  • Sharing to Area4: embedded ops in Projects

Of course automation, measurement, culture and sharing can happen in any of the areas, but some of the areas seem to have a stronger focus on each of these parts.

Conclusion

Devops areas, layers and maturity levels give us a framework to capture new practices and stories, and it can be used to identify areas of improvement related to the devops field. I'd love feedback on this. If anyone wants to help, I'd like to bring up a website where people can enter their stories in this structure and make them easily available for anyone to learn from. I don't have too many CPU cycles left currently, but I'm happy to get this going :)

P.S. @littleidea: I do want to avoid the FSOP Cycle


Conference time - Summer of 2012

(2012-05-11)

It's the time of year that all conferences are gearing up. Here's a list of conferences I'm speaking at, attending, or wish I was attending.

  • ChefConf 12 - May 15-17 : the place to be if you're doing anything with chef these days

  • GOTOCon Copenhagen - May 21-23 (me speaking) : fun conference and very well organized, although a bit too static for my taste.

  • Devopsdays Tokyo - May 26 : Tokyo was always on my list, but I can't go, bummer. Botchagalupe is winning :)

  • Atlassian Summit - May 30 - June 1 (me speaking) : really proud to be opening the devops track at my current employer. First time my employer has an explicit interest in devops. Go-go atlassian!

  • Kanban for Devops, Belgium - June 18-19 : I initially announced that I would be there, and I was very keen on doing so. Work got in the way, so I can't make it. But if you can, you should! I'm sure @dominica will get your WIP (that is Work in Progress :)

  • Velocity - June 25-27 : the uber conference on anything on web and performance

  • Devopsdays MountainView - June 28-29 : this year at Google, looking forward to so much fun!

  • Webperfdays - June 28 : interesting unconference happening on performance. Happening at the same time as Devopsdays at Google.

  • Puppetconf - September 27-28 : if you're into puppet, or config mgmt in general, a cool place to be. Hope I can make it this year

  • Velocity Europe - October 2-4 : since the success last year, Velocity Europe strikes again: Web Performance isn't a US only concern!

  • Devopsdays Italy - October 6-7 : Rome, sweet rome - sun and devops - the perfect mix

  • AppSec USA 2012 - October 23-24 : not 100% sure on this one, but rumors go on a devops track in a security conference - sounds like fun to me.

Busy times .... but .... Fun times!


Monitoring URLs by the thousands in Nagios

(2012-05-06)

10K websites x 5 URL's to monitor

For our Atlassian Hosted Platform, we have about 10K websites we need to monitor. Those sites are monitored from a remote location to measure response time and availability. Each site has about 5 sub URLs on average to check, resulting in 50K URL checks.

Currently we employ Nagios with check_http, which requires roughly 14 Amazon Large Instances. While the nagios servers are not fully loaded, this sizing makes sure that all checks complete within a 5 minute check cycle.

In a recent spike we investigated if we could do any optimizations to:

  • use fewer server resources, not only to reduce costs but also to avoid managing multiple nagios servers, which don't dynamically rebalance checks across hosts.
  • have all checks complete within a smaller window (say 1 minute), as this would shorten our MTTD (mean time to detect)
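
To put numbers on those two goals, a quick back-of-the-envelope calculation helps. This sketch only uses the figures quoted above (10K sites, 5 URLs each, 14 servers, 5 minute and 1 minute windows); the even spread across servers is an assumption, and the function name is made up.

```python
def required_checks_per_second(sites, urls_per_site, window_seconds):
    """Sustained check rate needed to finish every URL check within the window."""
    return sites * urls_per_site / window_seconds


five_minutes = required_checks_per_second(10_000, 5, 300)
one_minute = required_checks_per_second(10_000, 5, 60)

print(10_000 * 5)                   # 50000 checks per cycle
print(round(five_minutes, 1))       # 166.7 checks/s for a 5 minute cycle
print(round(one_minute, 1))         # 833.3 checks/s for a 1 minute cycle
# Assuming the load is spread evenly over the 14 servers:
print(round(five_minutes / 14, 1))  # 11.9 checks/s per server
```

Going from a 5 minute to a 1 minute window means five times the sustained check rate, which is why per-check overhead becomes the thing to look at.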

While looking at this, we wanted the technology to be reusable with our future idea of a fully scalable and distributed monitoring in mind (think Flapjack or the new kid on the block Sensu). But for now, we wanted to focus on the checks only.

In this first blogpost of the series we look at the integration and options within Nagios. In a second blogpost we will provide proof of concept code for running an external process (ruby based) that executes the checks and reports back to nagios. Even though Nagios isn't the most fun to work with, a lot of solutions that try to replace it focus on replacing the checks section. But Nagios gives you more: reporting, escalation, dependency management. I'm not saying there aren't solutions out there, but we consider that to be for another phase.

Check HTTP

The canonical way in Nagios to run an HTTP check is to execute check_http.

For instance, to have it check whether confluence is working on https://somehost.atlassian.net/wiki, we would provide the options:

  • -H (virtual hostname), -I (ipaddress) , -p (port)
  • -u (path of url) , -S (ssl) , -f follow (follow redirects)
  • -t (timeout)

    $ /usr/lib64/nagios/plugins/check_http -H somehost.atlassian.net -p 443 -u /wiki -f follow -S -v -t 2
    HTTP OK: HTTP/1.1 200 OK - 546 bytes in 0.734 second response time |time=0.734058s;;;0.000000 size=546B;;;0
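
The part after the | in that output is performance data. Here is a minimal sketch of pulling it apart, assuming the usual label=value[unit];warn;crit;min;max layout and ignoring quoted labels that contain spaces. The function name parse_plugin_output is made up for this illustration.

```python
import re

# A number followed by an optional unit of measure (s, B, %, ...).
_VALUE = re.compile(r"^(-?\d+(?:\.\d+)?)([A-Za-z%]*)$")


def parse_plugin_output(line):
    """Split one line of plugin output into (status_text, perfdata dict).

    perfdata maps each label to a (value, unit) tuple; the thresholds after
    the first ';' are ignored in this sketch.
    """
    text, _, perf = line.partition("|")
    metrics = {}
    for item in perf.split():
        label, _, rest = item.partition("=")
        raw_value = rest.split(";")[0]
        match = _VALUE.match(raw_value)
        if match:
            metrics[label] = (float(match.group(1)), match.group(2))
    return text.strip(), metrics


line = ("HTTP OK: HTTP/1.1 200 OK - 546 bytes in 0.734 second response time "
        "|time=0.734058s;;;0.000000 size=546B;;;0")
status, perf = parse_plugin_output(line)
print(status)
print(perf)  # {'time': (0.734058, 's'), 'size': (546.0, 'B')}
```

Any replacement for check_http has to produce output in this shape if the response times are to keep flowing into Nagios.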

Some observations:

  1. For each configured check Nagios will fork twice and exec check_http. Avoiding this would improve performance, as fork is considered expensive.
  2. If we have many URLs on the same host, we can't leverage connection reuse, making it less efficient.
  3. For status checking, we can configure it to use -j HEAD if our check doesn't rely on the content of the page (saving on transfer time and reducing check time).
  4. Redirects: not an issue of Nagios, but we currently have quite a few redirects coming from the login-page logic. Reducing those would again improve check time.
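
Observations 2 and 3 can be sketched together: several HEAD requests sent over one connection to the same host. This is a minimal sketch using Python's standard http.client, not the ruby proof of concept announced above. The mapping from HTTP status to Nagios state is a simplified assumption, redirects are not followed, and the function names are made up.

```python
import http.client

# Nagios plugin exit codes: 0 = OK, 1 = WARNING, 2 = CRITICAL
OK, WARNING, CRITICAL = 0, 1, 2


def status_to_state(http_status):
    """Map an HTTP status code to a Nagios state (a simplified policy)."""
    if 200 <= http_status < 400:
        return OK
    if 400 <= http_status < 500:
        return WARNING
    return CRITICAL


def check_paths(host, paths, timeout=2):
    """Send a HEAD request for each path over a single HTTPS connection."""
    conn = http.client.HTTPSConnection(host, 443, timeout=timeout)
    results = {}
    try:
        for path in paths:
            conn.request("HEAD", path)
            response = conn.getresponse()
            response.read()  # finish the response so the connection can be reused
            results[path] = status_to_state(response.status)
    finally:
        conn.close()
    return results


# Example call (performs real network I/O, so it is left commented out):
# check_paths("somehost.atlassian.net", ["/wiki", "/"])
```

The TLS handshake and TCP setup happen once per host instead of once per URL, which is exactly the saving a fork-per-check plugin cannot offer.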

We can reduce part of the forks by using the use_large_installation_tweaks=1 setting. The benefits and caveats are explained in the docs

Check scheduling

Nagios itself tries to be smart to schedule the checks. It tries to spread the number of service checks within the check interval you configure. More information can be found in older Nagios documentation .

Configuration options that influence the scheduling are:

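  • normal_check_interval : how long to wait between regular checks of the same service
  • retry_check_interval : how long to wait before retrying a check that returned a non-OK state
  • check_period : the time period during which active checks may run
  • inter_check_delay_method : how the initial checks are spread out over time (values listed below)
  • service_interleave_factor : how service checks are interleaved, so that checks for the same remote host don't all run back to back
  • max_concurrent_checks : the maximum number of service checks that may run in parallel
  • service_reaper_frequency : how often Nagios collects the results of finished service checks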

The default for inter_check_delay_method is smart (s). If we want to execute the checks as fast as possible, the other values are worth considering:

  • n = Don't use any delay - schedule all service checks to run immediately (i.e. at the same time!)
  • d = Use a "dumb" delay of 1 second between service checks
  • s = Use a "smart" delay calculation to spread service checks out evenly (default)
  • x.xx = Use a user-supplied inter-check delay of x.xx seconds
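
To get a feel for what the "smart" setting does at this scale, the Nagios scheduling documentation describes the delay as the average check interval divided by the total number of services, and the interleave factor as the number of services divided by the number of hosts, rounded up. A small sketch of that arithmetic, assuming for simplicity that all 50K checks sit on a single Nagios server (the function names are made up):

```python
import math


def smart_inter_check_delay(average_check_interval_seconds, total_services):
    """Delay between consecutive service checks under the 'smart' method."""
    return average_check_interval_seconds / total_services


def smart_interleave_factor(total_services, total_hosts):
    """Interleave factor under the 'smart' method (services per host, rounded up)."""
    return math.ceil(total_services / total_hosts)


print(smart_inter_check_delay(300, 50_000))     # 0.006 seconds between checks
print(smart_interleave_factor(50_000, 10_000))  # 5
```

At 6 milliseconds between checks the spreading is mostly theoretical: a check that takes 0.7 seconds will overlap with a hundred others anyway.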

Distributing checks

When one host can't cut it anymore, we have to scale eventually. Here are some solutions that live completely in the Nagios world:

Our future solution would take a similar approach to dispatching the check commands and gathering the results back over a queue, but we'd like it to be less dependent on Nagios and possible to integrate with other monitoring solutions (think Unix toolchain philosophy). A great example of the idea can be seen in the Velocityconf presentation Asynchronous Real-time Monitoring with Mcollective.

Submitting check results back to Nagios

So with distribution we just split our problem into smaller problems again. Let's focus once more on the problem of a single host running checks: after all, the more checks we can run on one host, the less we have to distribute.

Nagios Passive Checks easily allow you to decouple the checks from your main nagios loop and submit the check results later. NSCA (Nagios Service Check Acceptor) is the most used solution for this.
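
For reference, a passive result is just a line of text. The sketch below formats one result both as a Nagios external command (PROCESS_SERVICE_CHECK_RESULT) and as a tab-separated line of the kind send_nsca reads on standard input. The service name "Confluence" and the function names are made up, and where the lines are written (command file path, NSCA server) depends on your installation.

```python
import time


def external_command(host, service, return_code, output, now=None):
    """Format a passive service result as a Nagios external command line."""
    timestamp = int(time.time() if now is None else now)
    return (f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{host};{service};{return_code};{output}\n")


def send_nsca_line(host, service, return_code, output):
    """Format a passive service result as tab-separated send_nsca input."""
    return f"{host}\t{service}\t{return_code}\t{output}\n"


print(external_command("somehost.atlassian.net", "Confluence", 0,
                       "HTTP OK", now=1336262400), end="")
print(send_nsca_line("somehost.atlassian.net", "Confluence", 0, "HTTP OK"), end="")
```

Because the format is this simple, whatever runs the checks can be swapped out without Nagios noticing, as long as the results arrive in this shape.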

NSCA does have a few limitations:

Opsview writes:

  • Only the first 511 bytes of plugin out was returned to the master, limiting the usefulness of the information you could display
  • Only the 1st line of data was returned, meaning you had to cramp output together
  • NSCA communication used fixed size packets which were inefficient
  • While results were sent, Nagios would wait for completion, introducing a bottleneck
  • If there was a communication problem with the master, results were dropped

This led them to using NRD (Nagios Result Distributor).

Ryan writes:

"What no one tells you when you are deploy NCSA is that it send service checks in series while nagios performs service checks in parallel"

This led him to write a high-performance NSCA replacement, which feeds the results directly into the livestatus pipe instead of going over the NSCA protocol baked into nagios. On a similar note, Jelle Smet has created NSCAweb, to easily submit passive host and service checks to Nagios via external commands.

We would leverage the Send NSCA Ruby Gem.

Why is this relevant to our solution? Without employing some of these optimizations, our bottleneck would shift from running the checks to accepting the check results.
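
A rough calculation shows how quickly serial submission becomes the limit. The 10 ms per submitted result below is a purely hypothetical figure chosen for illustration, not a measurement; the function name is made up as well.

```python
import math


def submission_seconds(results, ms_per_result, parallel_senders=1):
    """Wall-clock seconds to submit all results with evenly loaded senders."""
    per_sender = math.ceil(results / parallel_senders)
    return per_sender * ms_per_result / 1000


# 50K results at a hypothetical 10 ms each:
print(submission_seconds(50_000, 10))      # 500.0 seconds, longer than a 300 s cycle
print(submission_seconds(50_000, 10, 10))  # 50.0 seconds with 10 parallel senders
```

With those assumed numbers a single serial sender cannot even keep up with the current 5 minute cycle, let alone a 1 minute one.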

Another solution could be to run an NRPE server, and we could probably leverage some ruby logic from Metis - a ruby NRPE server.

Conclusion

Even after the following optimizations:

  • using HEAD vs GET
  • large installation tweaks
  • tuning the inter_check_delay_method
  • parallel NSCA submissions vs serial submissions

we can still optimize with:

  • avoiding the fork by running all checks from the same process
  • reusing the http connection across multiple requests for the same host (potentially even doing http pipelining)
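
The shape of such a single-process checker can be sketched as follows: group the URLs by host so each group can share a connection, then run the groups concurrently. This sketch uses Python threads as a stand-in for the event-driven approach; it is not the ruby/eventmachine proof of concept. The check itself is passed in by the caller, and all names are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit


def group_by_host(urls):
    """Group URL paths by (scheme, host) so each group can share one connection."""
    groups = defaultdict(list)
    for url in urls:
        parts = urlsplit(url)
        groups[(parts.scheme, parts.netloc)].append(parts.path or "/")
    return dict(groups)


def run_checks(urls, check_host, max_workers=50):
    """Run check_host(scheme, host, paths) for every host inside one process.

    check_host is supplied by the caller and returns one result per host;
    worker threads replace the fork/exec that Nagios does per check.
    """
    groups = group_by_host(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            key: pool.submit(check_host, key[0], key[1], paths)
            for key, paths in groups.items()
        }
        return {key: future.result() for key, future in futures.items()}


# Dry run with a fake check that only counts the paths per host.
urls = [
    "https://a.example/wiki",
    "https://a.example/",
    "https://b.example/wiki",
]
print(run_checks(urls, lambda scheme, host, paths: len(paths)))
```

Keeping the check function injectable makes it easy to compare different http client libraries behind the same scheduler.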

In the next blogpost we will show the results of proof of concept code involving ruby/eventmachine/jruby and various httpclient libraries.


Devops a Wicked problem

(2012-01-08)

One of the strong pillars of devops (if not the strongest) is collaboration/communication. For my talk about Devops Metrics at Velocity 2011 I researched how to prove that collaboration is a good thing: when discussing devops with people, it sometimes comes down to belief, either that it makes sense to collaborate more or that all this collaboration is overkill. I think around that time I came across Design Thinking and read how it evolved from one person doing the design, to listening to user requirements, to participatory design. In the book Design Thinking: Understanding How Designers Think and Work, Nigel Cross writes that design used to be a collaborative thing (like guilds trying to push their craft forward).

Symmetry of Ignorance

One of the concepts introduced was the symmetry of ignorance (PDF):

Complex design problems require more knowledge than any one single person can possess, and the knowledge relevant to a problem is often distributed and controversial. Rather than being a limiting factor, “symmetry of ignorance” can provide the foundation for social creativity. Bringing different points of view together and trying to create a shared understanding among all stakeholders can lead to new insights, new ideas, and new artifacts. Social creativity can be supported by new media that allow owners of problems to contribute to framing and solving these problems. These new media need to be designed from a meta-design perspective by creating environments in which stakeholders can act as designers and be more than consumers.

This sounds like systems thinking, and it reminded me of the knowledge divide within the devops problem space. When you spend time with each group/silo individually, they tend to think of themselves as superior to the other group: "ha, those devs, they don't know anything about the systems; ha, those ops, they don't know anything about coding". So it seems more like a symmetry of arrogance. That symmetry of arrogance reminded me of the saying "We judge others by their behavior, we judge ourselves by our intentions". We might think we know more or can do better, but that is often not visible in our actions.

This got me intrigued, and I wanted to explore the subject more for the next Cutter Summit 2012.

Wicked Problem

Part of design thinking and this symmetry of ignorance is related to the concept of wicked problems.

Rittel and Webber's (1973) formulation of wicked problems specifies ten characteristics:

  1. There is no definitive formulation of a wicked problem (defining wicked problems is itself a wicked problem).
  2. Wicked problems have no stopping rule.
  3. Solutions to wicked problems are not true-or-false, but better or worse.
  4. There is no immediate and no ultimate test of a solution to a wicked problem.
  5. Every solution to a wicked problem is a "one-shot operation"; because there is no opportunity to learn by trial and error, every attempt counts significantly.
  6. Wicked problems do not have an enumerable (or an exhaustively describable) set of potential solutions, nor is there a well-described set of permissible operations that may be incorporated into the plan.
  7. Every wicked problem is essentially unique.
  8. Every wicked problem can be considered to be a symptom of another problem.
  9. The existence of a discrepancy representing a wicked problem can be explained in numerous ways. The choice of explanation determines the nature of the problem's resolution.
  10. The planner has no right to be wrong (planners are liable for the consequences of the actions they generate).

I'll let you judge if you think devops (or even monitoring sucks :) is a wicked problem

More readings to explore:

Cynefin

The whole discussion on what is a wicked problem or not reminded me of a talk by Dave Snowden. He helped create the Cynefin model.

The Cynefin framework has five domains. The first four domains are:

  1. Simple, in which the relationship between cause and effect is obvious to all, the approach is to Sense - Categorise - Respond and we can apply best practice.
  2. Complicated, in which the relationship between cause and effect requires analysis or some other form of investigation and/or the application of expert knowledge, the approach is to Sense - Analyze - Respond and we can apply good practice.
  3. Complex, in which the relationship between cause and effect can only be perceived in retrospect, but not in advance, the approach is to Probe - Sense - Respond and we can sense emergent practice.
  4. Chaotic, in which there is no relationship between cause and effect at systems level, the approach is to Act - Sense - Respond and we can discover novel practice.
  5. Disorder

Note that this is a sense-making framework, not an ordering framework: you can't always place your problems exactly in one of the spaces, but it gets you thinking about which solutions to apply to which problems. And it fits in nicely with other frameworks, as explained in A Tour of Adoption and Transformation models.

So devops, in my opinion, falls into the complex problem space.

A great video explaining it was recorded at the ALE 2011:

He explains many things, but here are a few things that resonated with me:

  • why in some problem spaces there is no best practice but only good practice
  • we have to create fail-safe environments
  • providing a solution to the problems in complex problems can be done by probing
  • the human factor makes the difference / we are not machines (automation)
  • the solution is often easy once you have solved it, but you need to go through the process of discovery.

That last point reminded me of the Debt Metaphor by Ward Cunningham. @littleidea explained that Ward was using a different concept of Technical Debt than most people use: he explains technical debt as the difference between the implementation and the ideal implementation in hindsight. Not because of bad implementation or deliberate shortcuts, but because of new insights gathered during the discovery/problem solving process.

More research can be found at:

The fact that problems don't always stay/match one of the locations on the diagram is greatly visualized by adding dimensions to the diagram (a thing that got lost in the initial publication)

To tackle complex problems he suggests using three principles of complexity based management:

  1. Use fine grained objects: avoid "chunking"
  2. Distributed Cognition: the wisdom but not the foolishness of crowds
  3. Disintermediation: connecting decision makers with raw data

This could result in the Resilient Organisation

Resilience engineering

Because in complex systems it's hard to predict the exact behavior, Dave Snowden also talks about going From Robustness to Resilience. It almost sounded like the difference between MTBF and MTTR, as John Allspaw explains in Outages Post-Mortems and Human Error 101.

I had come across those articles before, but never looked at them in the light of the Snowden perspective. So, more to explore.
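
The MTBF versus MTTR trade-off can be made concrete with the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR). The numbers below are hypothetical and only serve to compare a system that rarely fails but recovers slowly with one that fails more often but recovers fast.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: the fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical "robust" system: fails rarely, but takes long to recover.
robust = availability(mtbf_hours=1000, mttr_hours=10)
# Hypothetical "resilient" system: fails 10x more often, recovers 20x faster.
resilient = availability(mtbf_hours=100, mttr_hours=0.5)

print(round(robust, 4))     # 0.9901
print(round(resilient, 4))  # 0.995
```

In this made-up comparison the system that fails more often still ends up more available, because recovery time dominates.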

Silos and Resilience

The final document I'd like to highlight is about Reducing the impact of Organisational Silos on Resilience.

Stone quotes five questions suggested by Angela Drummond (a practitioner in the area of silo breaking and organisational change) to help executives identify and overcome silos.

  • “does your organisation structure promote collaboration, or do silos exist?
  • “do you have collaboration in your culture and as part of your value system?
  • “do you have the IT infrastructure for effective collaboration?
  • “do you believe in collaboration? Do you model that belief?
  • “do you have a reward system for collaboration?

Quoting from the article:

Resilience cannot be achieved in isolation of other units and organisations. In summary, there is a need to recognise:

  • the characteristics of silo formation, particularly in the creation of new organisational structures or as part of change management processes
  • a convergence of interests, taking account of the fact that “we are all in this together”. Efforts are needed to achieve seamless internal relationships at the intraorganisational level and a commitment to work with others to advance community resilience (perhaps with a judicious contribution from government) at the broader societal level
  • the case for collaboration. Gains are often possible by pooling ideas and resources (the total is greater than the sum of the parts)
  • the value of harnessing grass-root capability including through continuous knowledge-building and sharing learnings in a trusted environment
  • that cost-effectiveness calculations don’t easily take account of broad organisational or social needs and that the analysis may need supplementation if wide objectives are to be met

Leadership is the key to bringing these elements together. Leadership is needed to reduce and mitigate risks before crises occur.

It was fascinating to read that collaboration and resilience go hand in hand. Breaking the silos is really a must there, and requires collaboration. The inter-company silos also fit in nicely with The Agile Executive - A new Context for Agile presentation on how we come to rely on external services in a SAAS model, which will be another silo to tackle.

Final note

This is all research in progress, but it's exciting to see a lot of different concepts fit in nicely. I apologize that this isn't yet a complete polished train of thought, but it might be useful to explore more on the subject.


Monitoring Wonderland Survey - Visualization

(2012-01-04)

A picture tells more than a ...

Now that you've collected all the metrics you wanted, or even more, it's time to make them useful by visualizing them. Every self-respecting metrics tool provides a visualization of the data collected. Older tools tended to revolve around creating RRD graphs from the data. Newer applications are leveraging javascript or flash frameworks to have the data updated in realtime and rendered by the browser. People are exploring new ways of visualizing large amounts of data efficiently. Good examples are Visualizing Device Utilization by Brendan Gregg, or Multi User - Realtime heatmap using Nodejs.

Several interesting books have been written about visualization:

Dashboard written for specific metric tools

Graphite

Graphs are Graphite's killer feature, but there's always room for improvement:

Grockets - Realtime streaming graphite data via socket.io and node.js

Opentsdb

Graphs in Opentsdb are based on Gnuplot

Ganglia

Collectd

Nagios

Nagios also has a way to visualize metrics in its UI

Overall integration

With all these different systems creating graphs, the nice folks from Etsy have provided a way to navigate the different systems easily via their dashboard - https://github.com/etsy/dashboard

I also like the idea of Embeddable Graphs as http://explainum.com implements it

Development frameworks for visualization

Generic data visualization

There are many javascript graphing libraries. Depending on how you need to visualize things, they provide you with different options. The first list is more of a generic graphing library list.

Time related libraries

To plot things many people now use:

For timeseries/timelines these libraries are useful:

And why not have Javascript generate/read some RRD graphs :

Annotations of events in timeseries:

On your graphs you often want events annotated. This could range from plotting new puppet runs and tracking your releases to everything else that you do in the process of managing your servers. This is what John Allspaw calls Ops-Metametrics.

These events are usually marked as vertical lines.

Dependencies graphs

One thing I was wondering about is that, with all the metrics we store in these tools, we still store the relationships between them in our heads. I searched for tools that would link metrics or describe a dependency graph between them for navigation.

We could use Depgraph - a Ruby library to create dependencies, based on Graphviz - to draw a dependency tree, but we obviously first have to define it. Something similar to the Nagios dependency model (without the strict host/service relationship of course).

Conclusion

With all the libraries to get data in and out and the power of javascript graphing libraries we should be able to create awesome visualizations of our metrics. This inspired me and @lusis to start thinking about creating a book on Metrics/Monitoring graphing patterns. Who knows ...


Monitoring Wonderland Survey - Moving up the stack Application and User metrics

(2012-01-04) - Comments

While all the previously described metric systems have easy protocols, they tend to stay in Sysadmin/Operations land. But you should not stop there. There is a lot more to track than CPU, Memory and Disk metrics. This blogpost is about metrics up the stack: at the Application Middleware, the Application and the User Usage level.

To the cloud

Application Metrics

Maybe grumpy sysadmins have scared the developers and business to the cloud. It seems that the space of Application metrics, whether it's Ruby, Java or PHP, is being ruled today by New Relic. In a blogpost New Relic describes serving about 20 Billion Metrics A day.

It allows for easy instrumentation of ruby apps, but they also have support for PHP, Java, .NET, and Python.

Part of their secret of success is the ease with which developers can get metrics from their application by adding a few files and a token.

Several other cloud monitoring vendors are stepping into the arena, and I really hope to see them grow the space and give some competition:

Some other complementary services, popular amongst developers are:

Check this blogpost on Monitoring Reporting Signal, Pingdom, Proby, Graphite, Monit , Pagerduty, Airbrake to see how they make a powerful team.

User tracking Metrics - Cloud

Clicks, Page view etc ...

Besides the application metrics, there is one other major player in web metrics. Google Analytics

I found several tools to get data out of it using the Google Analytics API

With Google Analytics there is always a delay in getting your data.

If you want to have realtime statistics/metrics, check out Gaug.es http://get.gaug.es :

A/B Testing

I haven't really gotten into this, but getting metrics out of A/B testing is well worth exploring.

Page render time

Another important thing to track is the page render time. This is well explained in the Real User Monitoring - Chapter 10 - Complete Web Monitoring - O'Reilly Media

Again New Relic provides RUM: Real User Monitoring. See How we provide real user monitoring: A quick technical review for more technical info.

Who needs a cloud anyway

Putting your metrics into the cloud can be very convenient, but it has downsides:

  • most tools don't have a way to redirect/replicate the metrics they collect internally
  • that makes it hard to correlate them with your internal metrics
  • it's easy to get metrics in, but hard to get the full/raw data out again
  • it depends on the internet, duh, and sometimes this fails :)
  • privacy concerns or the sheer volume of metrics can make it impossible to put them out in the cloud

Application Metrics - Non - Cloud

In his epic Metrics Anywhere, Codahale explains the importance of instrumenting your code with metrics. This looks very promising as this is really driven from the developers world:

Java

Or you can always use JMX to monitor and collect metrics from your application.

And with JMX-trans http://code.google.com/p/jmxtrans you can feed JMX information into Graphite, Ganglia, Cacti/Rrdtool, etc.

Other

Etsy style: StatsD

To collect various metrics, Etsy has created StatsD https://github.com/etsy/statsd, a network daemon for aggregating statistics (counters and timers), rolling them up, then sending them to Graphite.

Clients have been written in many languages: PHP, Java, Ruby, etc.
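To give a feel for how little a client has to do, here is a minimal sketch of sending a counter and a timer to StatsD from Ruby. It assumes the daemon listens on its default UDP port 8125 on localhost; the helper names (statsd_counter, statsd_timer, statsd_send) and the metric names are made up for illustration and do not come from any particular client gem.

```ruby
require 'socket'

# Format a counter increment: "<name>:<delta>|c"
def statsd_counter(name, delta = 1)
  "#{name}:#{delta}|c"
end

# Format a timing in milliseconds: "<name>:<millis>|ms"
def statsd_timer(name, millis)
  "#{name}:#{millis}|ms"
end

# Fire-and-forget: one UDP datagram per metric, no reply expected.
# Host and port are assumptions (8125 is the StatsD default).
def statsd_send(payload, host: '127.0.0.1', port: 8125)
  sock = UDPSocket.new
  sock.send(payload, 0, host, port)
ensure
  sock&.close
end

# Usage (needs a running StatsD to be of any use):
#   statsd_send(statsd_counter('app.logins'))
#   statsd_send(statsd_timer('app.render', 320))
```

Because the transport is UDP, instrumenting code this way adds almost no overhead to the application and never blocks when the daemon is down, at the price of silently losing metrics.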

Other companies have been raving about the benefits of StatsD and for example Shopify has completely integrated it in their environment

It's incredible to see the power and simplicity of this; I've created a simple Proof of Concept to extract the statsd metrics on ZeroMQ in this experimental fork

MetricsD https://github.com/tritonrc/metricsd tries to marry both Etsy's StatsD and the Coda Hale / Yammer's Metrics Library for the JVM and puts the data into Graphite. It should be drop-in compatible with Etsy's statsd, although it adds explicit support for meters (with the m type) and gauges (with the g type) and introduces the h (histogram) type as an alias for timers (ms).

User tracking - Non Cloud

Clicks, Page view etc ...

Here are some Open Source Web Analytics libraries. These are merely links; I haven't investigated them enough yet, so this is work in progress.

Another tool worth mentioning for tracking end users is HummingBird - http://hummingbirdstats.com/ . It is NodeJS based and allows for realtime web traffic visualization. To send metrics it has a very simple UDP protocol.

A/B Testing

At Arrrrcamp I saw a great presentation on A/B Testing by Andrew Nesbitt (@teabass). Do watch the video to get inspired!

He pointed out several A/B testing frameworks:

And presented his own A/B Testing framework: Split - http://github.com/andrew/split

It would be interesting to integrate this further into traditional Monitoring/Metrics tools: view metrics per new version, per enabled flag, etc. In a nutshell: food for thought.

Page render time

For checking the page render time, I could not really find Open Source alternatives.

There is a page by Steve Souders about Episodes http://stevesouders.com/episodes/paper.php. Or you can track your Apache logs with Mod Log I/O.

Conclusion

It's exciting to see the crossover between development, operations and business. Up until now only New Relic has a very well integrated suite for all metrics. I hope the internal solutions catch up.

Now that we have all that data, it's time to talk about dashboards and visualization. On to the next blogpost.

If you are using other tools, have ideas, feel free to add them in the comments.


Monitoring Wonderland Survey - Nagios the Mighty Beast

(2012-01-03) - Comments

Controlling the tool everybody hates, but still uses

This blog post mainly contains my findings on getting data in and out of Nagios. That data can be status information, performance information and notifications. At the end there are some pointers on ruby integration with Pingdom and Jira.

The idea is similar to my previous blogposting Monitoring Wonderland Survey - Metrics - API - Gateways: I want to share/open up this data for others to consume, preferably on a bus like system and using events instead of polling.

Nagios - IN

Writing Checks in Ruby

If you want to get data into Nagios, you have to write a check. These are some options for doing this in ruby:
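Whichever library you pick, the contract with Nagios stays the same: a check prints one line of output (optionally followed by `|` and performance data) and exits with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). A minimal sketch of that contract in plain Ruby, where the helper names and the thresholds are purely illustrative:

```ruby
# Exit codes Nagios expects from a check.
EXIT_CODES = { ok: 0, warning: 1, critical: 2, unknown: 3 }.freeze

# Map a measured value onto a status (higher is worse in this sketch).
def check_status(value, warn:, crit:)
  return :critical if value >= crit
  return :warning if value >= warn
  :ok
end

# Returns the output line and the exit code for one measurement.
# The part after the pipe is perfdata: label=value;warn;crit
def run_check(label, value, warn:, crit:)
  status = check_status(value, warn: warn, crit: crit)
  line = "#{status.to_s.upcase} - #{label} is #{value} | #{label}=#{value};#{warn};#{crit}"
  [line, EXIT_CODES.fetch(status)]
end

# In a real check script you would end with something like:
#   line, code = run_check('load1', measured_load, warn: 4.0, crit: 8.0)
#   puts line
#   exit code
```

Keeping the measurement, the status decision and the exit separate makes the check easy to unit test, which is what the projects linking testing and monitoring build on.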

Projects that link testing and monitoring:

Transporting check results

Nagios has many ways to collect the results of these checks:

You can test NRPE with the standalone NRPE runner

And maybe schedule the Nagios NRPE checks with Rundeck

If you don't like the spawning of separate ruby processes for each check, you can leverage Metis: https://github.com/krobertson/metis

Transport over a bus system

Instead of using the traditional provided interfaces, people are starting to send the check information over a bus for further handling:

Look ma, no Nagios Server needed

Some people have taken an alternative approach, reusing the check libraries in their own framework.

Nagios - OUT

Reading Status

As there is no official API to extract status information from Nagios, people have been implementing various ways of getting to the data:

Scraping the UI

Well if we really have to ...

Parsing status.dat file

All status information from Nagios is stored in the status.dat file, so several people have started writing parsers for it and exposing it as an API.

Nagios-Dashboard parses the nagios status.dat file & sends the current status to clients via an HTML5 WebSocket. The dashboard monitors the status.dat file for changes, any modifications trigger client updates (push). Nagios-Dashboard queries a Chef server or Opscode platform organization for additional host information.
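The status.dat format itself is simple: named blocks such as `hoststatus { ... }` containing one `key=value` pair per line. A minimal parser sketch, not taken from any of the projects above; the sample data is made up and far shorter than a real file:

```ruby
# Parse Nagios status.dat text into an array of hashes,
# one per block, with the block name stored under 'type'.
def parse_status_dat(text)
  blocks = []
  current = nil
  text.each_line do |raw|
    line = raw.strip
    next if line.empty? || line.start_with?('#')
    if current.nil? && (m = line.match(/\A(\w+)\s*\{\z/))
      current = { 'type' => m[1] }
    elsif line == '}'
      blocks << current if current
      current = nil
    elsif current
      # Split on the first '=' only: plugin output may contain '='.
      key, value = line.split('=', 2)
      current[key] = value.to_s
    end
  end
  blocks
end

# Made-up sample in the shape of a real status.dat.
SAMPLE_STATUS_DAT = <<~DAT
  # comment lines are skipped
  hoststatus {
    host_name=web1
    current_state=0
    plugin_output=PING OK - Packet loss = 0%
    }

  servicestatus {
    host_name=web1
    service_description=HTTP
    current_state=2
    }
DAT

critical = parse_status_dat(SAMPLE_STATUS_DAT)
           .select { |b| b['type'] == 'servicestatus' && b['current_state'] == '2' }
puts critical.map { |b| "#{b['host_name']}/#{b['service_description']}" }
```

Re-parsing the whole file on every change is still polling in disguise, which is why the options further down are more attractive for large installations.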

Parsing the log files

Using Checkmklivestatus

A better option to get ad hoc status is to query Nagios via CheckMK_Livestatus http://mathias-kettner.de/checkmk_livestatus.html. It is a Nagios Event Broker (NEB) module that hooks directly into the Nagios Core, giving it direct access to all structures and commands. NEBs are very powerful; for more information look at the Nagios book - event broker section.
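Livestatus speaks a small text protocol over a unix socket: a `GET <table>` line, optional header lines such as `Columns:` and `Filter:`, and an empty line to end the query. A sketch of a query from Ruby, assuming the JSON output format; the helper names are illustrative and the socket path depends entirely on how the broker module was configured:

```ruby
require 'json'
require 'socket'

# Build a Livestatus query: one header per line, terminated by a blank line.
def livestatus_query(table, columns:, filters: [])
  lines = ["GET #{table}", "Columns: #{columns.join(' ')}"]
  filters.each { |f| lines << "Filter: #{f}" }
  lines << 'OutputFormat: json'
  lines.join("\n") + "\n\n"
end

# Send the query over the unix socket and parse the JSON rows.
# socket_path is an assumption, e.g. wherever your nagios.cfg points the module.
def livestatus_fetch(socket_path, query)
  UNIXSocket.open(socket_path) do |sock|
    sock.write(query)
    sock.close_write # signal that the query is complete
    JSON.parse(sock.read)
  end
end

# Usage (needs a running Nagios with Livestatus loaded):
#   q = livestatus_query('services',
#                        columns: %w[host_name description state],
#                        filters: ['state = 2'])
#   livestatus_fetch('/var/lib/nagios/rw/live', q).each { |row| p row }
```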

Tools that use this API :

Querying the database/NDO

An alternative NEB handler is NDO Utils, NDO2DB. It stores all the information in a database. Another option is NDO2FS, which writes the NDO data as JSON on a filesystem.

Hooking into performancehandler

RI Pienaar shows us how to hook into a process-service-perfdata handler and log that information to a file:

The advantage is that we get the information evented instead of having to poll for status information. In other words, it is ready to be put on a message bus for others to read.
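Whatever the handler writes out, the performance data itself follows the plugin convention `label=value[UOM];warn;crit;min;max`, with items separated by spaces. A simplified parsing sketch that turns such a string into hashes ready to be published; it is not RI Pienaar's code, the method name is made up, and it only handles plain numeric thresholds, not threshold ranges:

```ruby
# One perfdata item: optional quoted label, numeric value, optional unit,
# then up to four ';'-separated fields (warn, crit, min, max).
PERF_ITEM = /(?:'([^']+)'|([^\s=']+))=(-?[\d.]+)([^\s;]*)((?:;[^\s;]*)*)/

def parse_perfdata(text)
  text.scan(PERF_ITEM).map do |quoted, bare, value, uom, rest|
    # rest looks like ";100;500;0", so the first split element is empty.
    warn, crit, min, max = rest.split(';').drop(1).map do |field|
      field.empty? ? nil : field.to_f
    end
    { label: quoted || bare, value: value.to_f, uom: uom,
      warn: warn, crit: crit, min: min, max: max }
  end
end

parse_perfdata("rta=0.52ms;100;500;0 'packet loss'=0%;20;60").each do |item|
  puts "#{item[:label]}: #{item[:value]}#{item[:uom]}"
end
```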

Listening in to events with NEB/Message queue

In order to get the events as fast as possible, I looked into using a NEB to put information on a message queue directly.

I found the following sample code:

Marius Sturm had Nagios-ZMQ https://github.com/mariussturm/nagios-zmq, which puts the events directly on the queue. I extended it to read not only the check results or performance data, but also the notifications.

It seems Icinga is taking a similar approach with Icinga - ZMQ - icingamq, in order to enable High performance Large Scale Monitoring.

An interesting difference is that it will also expose the CheckMK Livestatus API directly over ZeroMQ.

Adding Hosts dynamically

A bit of a side track, but one of the things a lot of people struggle with is dynamically adding hosts/servers to Nagios without restarting it. The following are links that kind of try to solve this problem, but none solves it completely. It seems most people solve this by some interaction with a Configuration Management system and a system inventory.

To read and write the configs, people have written various parsers:

The reload problem doesn't look like an easy one to solve: one could create a NEB that manipulates the in-memory host/service structures, but it would also need to persist that on disk. If anyone has a good solution, please let us know!

Notification handling

There are a lot more problems with Nagios, but people still use its notification and acknowledgement system. Some interesting things I found:

Pingdom

If Pingdom is your game, here are some APIs to send information to Pingdom and read the status:

I could not find a way to make this evented, so we'll have to create one.

Jira Notification

I found 4 libraries to interact with Jira - from ruby:

Conclusion:

  • We can get a long way to automate getting data in and out of Nagios
  • Exposing the API through the Livestatus works really well
  • Using the NEB Nagios-ZMQ will allow us to get the information in an evented way
  • Adding hosts dynamically still seems to be an issue

By listening in on the events over a queue, we could create a self-service interface for Nagios events similar to Tattle, which does the same for Graphite:

Next blogpost we'll move up the stack a bit and start investigating options for application and enduser usage metrics.


Monitoring Wonderland Survey - Metrics - API - Gateways

(2012-01-03) - Comments

Update 4/01/2012: added ways to add metrics via logs, java pickle graphite feeder

One tool to rule them all? Not.

If you are working within an enterprise, chances are that you have different metric systems in place: you might have some Cacti, Ganglia, Collectd, etc. due to historical reasons, different departments, and so on.

This reminded me of the situation while I was working in Identity Management: you might have an LDAP, Active Directory, local HR database etc. There would be plans and discussions of using one over the other, and gateways would need to be written. I learned a few lessons there:

  1. have as few sources/stores of information as possible
  2. DON'T try to chase the one tool to rule them all, aka don't use a tool for something it's not made for
  3. make it self-servicing to user and automate processes

1 to 1 gateways

Take the new Metrics hotness Graphite as an example: it has some nice graphing advantages over other tools. So people wonder, should I migrate my Ganglia or Collectd to Graphite? Graphite doesn't come with elaborate collection scripts for memory/disk/etc., so we have to rely on other tools like Cacti, Munin, Collectd and Ganglia to first collect the data.

So we start writing gateways to get data into Graphite:

But what happens if we also use Opentsdb for storing long term data ? We have to re-implement those gateways:

Issue 1 : Effort duplication

This just seems like a waste of energy, implementing the protocol in every tool. This sure isn't the first time this has happened: the same thing happened for the Collectd -> Ganglia Plugin.

If you look at the data that is transmitted it is actually pretty much the same:

a metric name, value, timestamp, optionally hostname, some metadata tags

So we could easily envision a 'universal' format that would be used to translate from and to.

Ganglia  <-> Intermediate format <-> Graphite
Collectd <-> Intermediate format <-> Opentsdb

With this intermediate format, we would only have to write each end of the equation once.
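To make the idea concrete, here is a sketch of what such an intermediate format could look like in Ruby. The Metric struct and its method names are hypothetical; only the two output line formats (Graphite's plaintext line and Opentsdb's `put` command) come from the tools themselves.

```ruby
# Hypothetical intermediate format: the fields every tool seems to share.
Metric = Struct.new(:name, :value, :timestamp, :host, :tags) do
  # Graphite plaintext line: "<path> <value> <timestamp>\n".
  # Dots in the hostname are replaced, since Graphite treats '.' as a
  # path separator.
  def to_graphite
    path = [host && host.tr('.', '_'), name].compact.join('.')
    "#{path} #{value} #{timestamp}\n"
  end

  # Opentsdb line: "put <metric> <timestamp> <value> <tag=value ...>\n".
  # The host becomes just another tag.
  def to_opentsdb
    all_tags = (tags || {}).dup
    all_tags['host'] = host if host
    tag_str = all_tags.map { |k, v| "#{k}=#{v}" }.join(' ')
    "put #{name} #{timestamp} #{value} #{tag_str}\n"
  end
end

m = Metric.new('cpu.load', 0.42, 1325548800, 'web1', { 'dc' => 'eu' })
print m.to_graphite
print m.to_opentsdb
```

Each new source only needs a reader that produces Metric objects, and each new destination only needs one more `to_...` writer, instead of one gateway per pair of tools.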

I started thinking of this like an ffmpeg for monitoring

Issue 2: Difficult to hook in additional listeners

Let's add another system that wants to listen in on the metrics, something like Esper, Nagios alerting, some Data warehouse tools, etc. We could reuse the libraries from one end to the other, but we'll have to add more gateways and put these in place every time.

A better approach would be to use a message bus: every tool puts its data on a bus and listens for the data it needs. RI Pienaar has written about this approach extensively in his Series on Common Messaging Patterns. Also, John Bergmans has a great post on using AMQP and Websockets to get realtime graphics.

Some of the tools already have Message queue integrations, but there seems to be a common intermediate format missing

As a proof of concept I've created :

Building blocks

In this section I'll look for API's (ruby oriented) to get data in and out of the different metrics systems:

Graphite - IN

Sending metrics from ruby to Graphite:

These both implement the Simple Protocol, but for high performance we'd like to use the batching facility through the Pickle Format. I could not find a Pickle gem for ruby, but this could work through the Ruby-Python gateway http://rubypython.rubyforge.org/.
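For reference, the Simple Protocol is nothing more than one `<path> <value> <timestamp>` line per metric written to Carbon over TCP. A minimal sketch without any gem, assuming Carbon's default plaintext port 2003 on localhost; the method names are made up. It sends several lines in one write, which is a poor man's form of batching:

```ruby
require 'socket'

# Turn a { path => value } hash into plaintext protocol lines,
# all sharing one timestamp (seconds since the epoch).
def graphite_lines(metrics, now = Time.now.to_i)
  metrics.map { |path, value| "#{path} #{value} #{now}\n" }.join
end

# Write all lines over a single TCP connection.
# Host and port are assumptions (2003 is Carbon's default plaintext port).
def graphite_send(metrics, host: 'localhost', port: 2003)
  TCPSocket.open(host, port) { |sock| sock.write(graphite_lines(metrics)) }
end

# Usage (needs a running carbon-cache):
#   graphite_send('web1.load.one' => 0.5, 'web1.users' => 3)
```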

Faster - a Java Netty based graphite relay takes the same approach https://github.com/markchadwick/graphite-relay

Another way to get your data into graphite is using Etsy's Logster https://github.com/etsy/logster

Mike Brittain explains its use very well in Take my logs... Please! - A Velocity Online Conference Session (Video, PDF)

Graphite - OUT

Getting all the data out of Graphite is impossible through the standard API. You can get a graph out as raw data, but that hardly counts.

The best option seems to be to listen in to the graphite - udp receiver and duplicate the information onto a message bus.

An alternative might be to directly read from the Whisper storage, inspiration for that can be found in:

Opentsdb - IN

I could not find any ruby gem that implements the Opentsdb protocol for sending data, but creating one should be trivial. Opentsdb just uses a plain TCP socket to get the data in.
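As a rough illustration of how trivial, the sketch below formats Opentsdb `put` lines and writes them to the TSD over TCP. It assumes the default TSD port 4242 on localhost; tsdb_put and tsdb_send are hypothetical names, not an existing gem.

```ruby
require 'socket'

# Format one data point: "put <metric> <timestamp> <value> <tag=value ...>".
# Opentsdb requires at least one tag per data point.
def tsdb_put(metric, value, tags, timestamp = Time.now.to_i)
  raise ArgumentError, 'Opentsdb needs at least one tag' if tags.empty?
  tag_str = tags.map { |k, v| "#{k}=#{v}" }.join(' ')
  "put #{metric} #{timestamp} #{value} #{tag_str}\n"
end

# Send pre-formatted lines over one TCP connection.
# Host and port are assumptions (4242 is the default TSD port).
def tsdb_send(lines, host: 'localhost', port: 4242)
  TCPSocket.open(host, port) { |sock| sock.write(lines.join) }
end

# Usage (needs a running TSD):
#   tsdb_send([tsdb_put('sys.cpu.user', 42.5, 'host' => 'web1', 'cpu' => 0)])
```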

Opentsdb - OUT

Getting data out of Opentsdb suffers the same problem as Graphite: you can do queries on specific graph data

But you can't get all of it out, except maybe by interfacing directly with the HBase/Java API. So again the best bet is to create a listener/proxy for the simple TCP protocol.

Ganglia - IN

Sending metrics to Ganglia is easy using the gmetric shell command. Early days code describing this can still be found at http://code.google.com/p/embeddedgmetric/
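If shelling out is acceptable, wrapping gmetric from Ruby is only a few lines. The sketch below builds the argument list using gmetric's `--name`, `--value`, `--type` and `--units` options; the method name and the example metric are made up, and it assumes the gmetric binary is on the PATH with a working gmond configuration.

```ruby
# Build the gmetric argument list. Passing an array to system() avoids
# shell quoting problems with odd metric names or units.
def gmetric_command(name, value, type: 'float', units: '')
  args = ['gmetric', '--name', name, '--value', value.to_s, '--type', type]
  args += ['--units', units] unless units.empty?
  args
end

# Usage (needs gmetric installed):
#   system(*gmetric_command('queue_depth', 12, type: 'uint32', units: 'jobs'))
```

Spawning a process per metric gets expensive quickly, which is where a native gem that speaks the protocol directly pays off.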

Igrigorik has written up nicely on how to use the Gmetric Ruby gem to send metrics

If you want to feed in log files into ganglia Logtailer might be your thing https://bitbucket.org/maplebed/ganglia-logtailer

Ganglia - OUT

Vladimir describes the options while he explains how to get Ganglia data into Graphite.

Option 1 is to poll Gmond over TCP and get the XML of its current data:

Option 2 is to listen in on the UDP protocol as an additional receiver.

I implemented both approaches in https://github.com/jedi4ever/gmond-zmq

Note: As a side effect I found that the metrics sent over UDP are actually more accurate than the values you get when you query the XML.
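Option 1 can be sketched in a few lines: Gmond dumps its full state as XML to anyone connecting to its TCP port and then closes the connection. This is not the gmond-zmq code; it assumes the default port 8649, uses REXML for parsing, and the sample document is heavily trimmed compared to real Gmond output.

```ruby
require 'rexml/document'
require 'socket'

# Read the complete XML dump; Gmond closes the socket when done.
# Host and port are assumptions (8649 is the usual default).
def fetch_gmond_xml(host = 'localhost', port = 8649)
  TCPSocket.open(host, port) { |sock| sock.read }
end

# Flatten the CLUSTER/HOST/METRIC tree into one hash per metric.
def parse_gmond_xml(xml)
  doc = REXML::Document.new(xml)
  metrics = []
  doc.elements.each('//HOST') do |host|
    host.elements.each('METRIC') do |metric|
      metrics << { host: host.attributes['NAME'],
                   name: metric.attributes['NAME'],
                   value: metric.attributes['VAL'] }
    end
  end
  metrics
end

# Trimmed, made-up sample in the shape of Gmond's output.
SAMPLE_GMOND_XML = <<~XML
  <GANGLIA_XML VERSION="3.1.7" SOURCE="gmond">
    <CLUSTER NAME="web" LOCALTIME="1325548800">
      <HOST NAME="web1" IP="10.0.0.1" REPORTED="1325548790">
        <METRIC NAME="load_one" VAL="0.42" TYPE="float" UNITS=""/>
        <METRIC NAME="mem_free" VAL="812344" TYPE="float" UNITS="KB"/>
      </HOST>
    </CLUSTER>
  </GANGLIA_XML>
XML

parse_gmond_xml(SAMPLE_GMOND_XML).each do |m|
  puts "#{m[:host]} #{m[:name]} #{m[:value]}"
end
# Against a live Gmond: parse_gmond_xml(fetch_gmond_xml('gmond-host'))
```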

Collectd - IN

To send metrics to Collectd, you can use the ruby gem from Astro that implements most of the UDP protocol.

Collectd - OUT

I give Collectd the prize for best output.

It currently implements different writers:

  • Network plugin
  • UnixSock plugin
  • Carbon plugin
  • CSV
  • RRDCacheD
  • RRDtool
  • Write HTTP plugin

And the ZeroMQ writer by deactivated - https://github.com/deactivated/collectd-write-zmq

The Binary Protocol http://collectd.org/wiki/index.php/Binary_protocol is pretty simple to listen into.
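To show how simple: a packet is a sequence of parts, each starting with a 16-bit type and a 16-bit length (both big-endian, the length including the 4 header bytes), followed by the payload. String parts are null-terminated and the time and interval parts are 64-bit integers. The sketch below splits a packet into parts but leaves the values part (type 0x0006) undecoded; the method name is made up and the hand-built test packet is not real collectd output.

```ruby
# Part type codes from the collectd binary protocol.
PART_NAMES = {
  0x0000 => :host, 0x0001 => :time, 0x0002 => :plugin,
  0x0003 => :plugin_instance, 0x0004 => :type,
  0x0005 => :type_instance, 0x0006 => :values, 0x0007 => :interval
}.freeze
STRING_PARTS = %i[host plugin plugin_instance type type_instance].freeze
NUMERIC_PARTS = %i[time interval].freeze

# Split one UDP packet into [name, value] pairs.
def collectd_parts(packet)
  parts = []
  offset = 0
  while offset + 4 <= packet.bytesize
    type, length = packet.byteslice(offset, 4).unpack('nn')
    break if length < 4 || offset + length > packet.bytesize # malformed
    body = packet.byteslice(offset + 4, length - 4)
    name = PART_NAMES.fetch(type, type)
    value =
      if STRING_PARTS.include?(name) then body.unpack1('Z*')
      elsif NUMERIC_PARTS.include?(name) then body.unpack1('Q>')
      else body # values and unknown parts are left as raw bytes
      end
    parts << [name, value]
    offset += length
  end
  parts
end

# Hand-built packet: a host part followed by a time part.
host = "web1\0".b
packet = [0x0000, 4 + host.bytesize].pack('nn') + host +
         [0x0001, 12].pack('nn') + [1325548800].pack('Q>')
p collectd_parts(packet)
```

A listener would call this on every datagram received on the multicast or unicast address configured in the Network plugin and republish the parts on the bus.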

Munin

If you happen to use Munin, here's some inspiration, but I haven't researched it much

Circonus

If you happen to use Circonus, here's some inspiration, but I haven't researched it much

RRD interaction from ruby

For those who want to read and write directly from RRD's in ruby, please have fun:

Alert on metrics:

With all the tools in and out, and a unified intermediate format, it will be trivial to rewrite the traditional alert check tools to listen on the bus for values. This means your Nagios, your ticket system, your pager system, etc. can all listen to the same source.

Graphite

Opentsdb

Ganglia

New Relic

https://github.com/kogent/check_newrelic

Conclusion

It should be feasible to create an intermediate format and reuse some of these libraries to implement both IN and OUT functionality. Why not create a Fog for monitoring information? Like implements metric receive, send,

Next stop Nagios, because it deserves a blogpost of its own ...


Monitoring Wonderland Survey - Introduction

(2012-01-03) - Comments

Introduction

While Automation is great to get you going and doing things faster and reproducibly, Monitoring/Metrics are probably more valuable for learning and getting feedback from what's really going on. Matthias Meyer describes it as the virtues of monitoring. Nothing new, if you have been listening to John Allspaw on Metrics Driven Engineering (pdf), essentially putting the science back in IT as Adam Fletcher noted at the Boston devopsdays openspace session on What does a sysadmin look like in 10 years.

Eager to help

Over the years I've done my fair share of monitoring setups, but in the last years I was more focused on Automation. I would automate the hell out of any monitoring system the customer had. But after a while, this felt like standing on the sideline too much for me. This feeling got amplified by the Monitoring Sucks initiative of John Vincent: an initiative to improve the field where we can. The initiative has already spawned some very good blogposts, and one of the first, monitoring sucks watch your language, where they try to create a common vocabulary, reminded me a lot of the early 'what is' devops postings. So after Jason Dixon said, Monitoring Sucks, Do something about It, I decided to widen my focus again from automation to monitoring. And I found a great partner in Atlassian.

I'm certainly not the first person to do this, but I'm eager to help in the space. People like RI Pienaar have done some amazing ground work thinking about Monitoring Frameworks and making them Composable Architectures. One of the exciting areas I'd like to focus on is trying to make monitoring/metrics as easy as 'monitoring up' for developers, and bringing the traditionally operational tools into development land so developers can better understand their application. We learned from configuration management that having common tools and a common language greatly helps overcome the devops divide.

Before jumping into the space, we decided to research the existing space extensively, with its problems and solutions. This blogpost series is a summary of these findings and will therefore contain a lot of links.

Non technical reading

This series of blogposts is tools focused, not monitoring approach oriented, more on that in later posts, but for now I'll refer you to :

Note:

  • You will find that some tools were researched more thoroughly than others; that's because the research was done from the perspective of Atlassian's current and future metrics/monitoring environment.
  • You will also notice a slant towards ruby libraries; that's mainly because I feel most productive in it and I'm thinking of integration with chef/puppet/fog/vagrant etc.
  • The main focus will be on Open Source solutions where available, and on commercial ones wherever there is a gap.

Meet the players

For people new to the field, I'd like to give a quick overview of the current players in the field, together with their official links and, where possible, links to available books:

A good current overview can be found in the presentations Trending with Purpose by Jason Dixon and Getting more signal from your noise by Joshua Barratt (PDF). I especially liked his approach of looking at these tools from the Collect - Transport - Process - Store - Present perspectives.

Metrics

In the 'old' days, people first focused on the collect and transport problem. The standard for timeseries Storage was RRD Round Robin Database, and people would choose their metrics tools based on the collection scripts that were available. (Similar to how people choose cloud or config management it seems)

As the number of servers started to grow, people wanted a scalable way of collecting, aggregating and transporting the data.

Even with the help of RRD cache, the storage of all these metrics was becoming the new bottleneck, so alternatives had to be found. So Graphite introduced Whisper and Opentsdb decided to build on top of Hadoop. And as the volume of data was increasing, it was begging for a self-service way of visualizing the data.

Alerting, notification, availability

All these metric tools kind of ignore alerting, notification and acknowledgement, and rely on the real monitoring systems for that. So you need to complement them with some warning system like the following:

Note that most of them fall short on scaling, flexibility and graphical overview.

Beyond servers , to applications , to business

Now that we have gotten better at monitoring and metrics of servers, we are seeing better integration with application and business metrics:

The next blogposts will contain more meat of tools surrounding, enhancing, bypassing these 'traditional players'. Stay tuned...


Markdown to Confluence Convertor

(2011-12-15) - Comments

Recently in Confluence 4.0 the Wiki Markup Editor was removed for various engineering reasons. I like to type my text in wiki style, and most of all using Markdown.

This code is a quick hack for converting markdown to the Atlassian Confluence markup language, which you can still insert via the menu.

It's not a 100% full conversion, but I find it rather usable already. I will continue to improve it where possible.

The gem is based on Kramdown

Installation:

Via gem

$ gem install markdown2confluence

From github:

$ gem install bundler
$ git clone git://github.com/jedi4ever/markdown2confluence.git
$ cd markdown2confluence
$ bundle install vendor

Usage:

If using Gem:

$ markdown2confluence <inputfile>

If using bundler:

$ bundle exec bin/markdown2confluence <inputfile>

Extending/Improving it:

There is really only one class to edit:

  • see lib/markdown2confluence/convertor/confluence.rb. Feel free to enhance or improve tag handling.
