availability: September 2013
For our Atlassian Hosted Platform, we have about 10K websites we need to monitor. Those sites are monitored from a remote location to measure responsetime and availability. Each server would have about 5 sub URLs on average to check, resulting in 50K URL checks.
Currently we employ Nagios with check_http and require roughly about 14 Amazon Large Instances. While the nagios servers are not fully overloaded, we make sure that all checks would complete within a 5 minutes check cycle.
In a recent spike we investigated if we could do any optimizations to:
While looking at this, we wanted the technology to be reusable with our future idea of a fully scalable and distributed monitoring in mind (think Flapjack or the new kid on the block Sensu). But for now, we wanted to focus on the checks only.
In the first blogpost of the series we look at the integration and options within Nagios. In a second blogpost we will provide proof of concept code for running an external process (ruby based) to execute and report back to nagios. Even though Nagios isn't the most fun to work with, a lot of solutions that try to replace it, focus on replacing the checks section. But Nagios gives you more the reporting, escalation, dependency management. I'm not saying there aren't solutions out there, but we consider that to be for another phase.
The canonical way in Nagios to run a check is to execute Check_http.
F.i. to have it execute a check if confluence is working on https://somehost.atlassian.net/wiki , we would provide the options:
-t (timeout)
$ /usr/lib64/nagios/plugins/check_http -H somehost.atlassian.net -p 443 -u /wiki -f follow -S -v -t 2 HTTP OK: HTTP/1.1 200 OK - 546 bytes in 0.734 second response time |time=0.734058s;;;0.000000 size=546B;;;0
Some observations:
We can reduce part of the forks by using the use_large_installation_tweaks=1 setting. The benefits and caveats are explained in the docs
Nagios itself tries to be smart to schedule the checks. It tries to spread the number of service checks within the check interval you configure. More information can be found in older Nagios documentation .
Configuration options that influence the scheduling are:
Default for the inter_check_delay_method is to use smart, if we want to execute the checks as fast as possible
When one host can't cut it anymore, we have to scale eventually. Here are some solutions that live completely in the Nagios world:
Our future solution would have a similar approach to dispatching the checks command and gathering the results back over queue, but we'd like it to be less dependent on the Nagios solution and be possible to be integrated with other monitoring solutions (Think Unix Toolchain philosophy) A great example idea can be seen in the Velocityconf presentation Asynchronous Real-time Monitoring with Mcollective
So with distribution we just split our problem again in smaller problems. So let's focus again on the single host running checks problem, after all, the more checks we can run on 1 host, the less we have to distribute.
Nagios Passive Checks easily allow you to uncouple the checks from your main nagios loop and submit the check results later. NSCA (Nagios Service Check Acceptor) is the most used solution for this.
NSCA does have a few limitations:
This lead them to using NRD (Nagios Result Distributor)
"What no one tells you when you are deploy NCSA is that it send service checks in series while nagios performs service checks in parallel"
This lead him to writing A highperformance NSCA replacement involving feeding the result direct into the livestatus pipe instead of over the NSCA protocol baked into nagios On a similar note Jelle Smet has created NSCAWEb Easily submit passive host and service checks to Nagios via external commands
We would leverage the Send NSCA Ruby Gem
Why is this relevant to our solution? Without employing some of these optimizations, our bottleneck would shift from running the checks to accepting the check results.
Another solution could be run an NRPE server , and we could probably leverage some ruby logic from Metis - a ruby NRPE server
Even after the following optimizations:
we can still optimize with:
In the next blogpost we will show the results of proof of concept code involving ruby/eventmachine/jruby and various httpclient libraries.