Monitoring URLs by the thousands in Nagios

10K websites x 5 URLs to monitor

For our Atlassian Hosted Platform, we have about 10K websites we need to monitor. Those sites are monitored from a remote location to measure response time and availability. Each site has about 5 sub-URLs to check on average, resulting in 50K URL checks.

Currently we employ Nagios with check_http, which requires roughly 14 Amazon Large instances. While the Nagios servers are not fully loaded, we make sure that all checks complete within a 5-minute check cycle.

In a recent spike we investigated whether we could optimize this setup in order to:

use fewer server resources, not only to reduce costs but also to avoid managing multiple Nagios servers, which don’t dynamically rebalance checks across hosts.
have all checks complete within a smaller window (say 1 minute), as this would reduce our MTTD (mean time to detect).
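
To put these two goals in numbers, here is a back-of-the-envelope calculation in Ruby that uses only the figures quoted above (50K checks, 14 instances, a 5-minute cycle and a 1-minute target). The helper name checks_per_second is made up for this sketch, and it assumes the checks are spread perfectly evenly, which a real scheduler will only approximate.

```ruby
# Back-of-the-envelope throughput, assuming checks are spread perfectly evenly.
# checks_per_second is a hypothetical helper, not part of Nagios.
def checks_per_second(total_checks, window_seconds, hosts = 1)
  total_checks.to_f / window_seconds / hosts
end

total_checks = 10_000 * 5 # 10K sites x 5 URLs on average

current  = checks_per_second(total_checks, 5 * 60)     # whole fleet, 5-minute cycle
per_host = checks_per_second(total_checks, 5 * 60, 14) # per Nagios instance today
target   = checks_per_second(total_checks, 60)         # whole fleet, 1-minute cycle

puts format('current: %.1f checks/s (%.1f per instance)', current, per_host)
puts format('target:  %.1f checks/s', target)
```

That works out to roughly 167 checks per second today (about 12 per instance), against roughly 833 checks per second for a 1-minute window.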

While looking at this, we wanted the technology to be reusable, with our future idea of a fully scalable, distributed monitoring system in mind (think Flapjack or the new kid on the block, Sensu). But for now, we wanted to focus on the checks only.

In this first blog post of the series we look at the integration and options within Nagios. In a second blog post we will provide proof-of-concept code for running an external (Ruby-based) process that executes the checks and reports back to Nagios. Even though Nagios isn’t the most fun to work with, a lot of the solutions that try to replace it focus on replacing the checks part. But Nagios gives you more than that: reporting, escalation and dependency management. I’m not saying there aren’t solutions out there, but we consider that to be for another phase.

Check HTTP

The canonical way to run an HTTP check in Nagios is to execute check_http.

For instance, to check whether Confluence is working on https://somehost.atlassian.net/wiki, we would provide the options:

-H (virtual hostname), -I (IP address), -p (port)
-u (URL path), -S (SSL), -f follow (follow redirects)
-t (timeout in seconds), -v (verbose output)

$ /usr/lib64/nagios/plugins/check_http -H somehost.atlassian.net -p 443 -u /wiki -f follow -S -v -t 2
HTTP OK: HTTP/1.1 200 OK - 546 bytes in 0.734 second response time |time=0.734058s;;;0.000000 size=546B;;;0

Some observations:

For each configured check, Nagios forks twice and then execs check_http. Avoiding this would improve performance, as fork is relatively expensive.
If we have many URLs on the same host, we can’t leverage connection reuse, which makes the checks less efficient.
For status checking, we can make it send a HEAD request with -j HEAD if our check doesn’t rely on the content of the page (saving transfer time and reducing check time).
Redirects: not an issue with Nagios itself, but we currently have quite a few redirects coming from the login-page logic; reducing those would again improve check time.
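
The connection-reuse and HEAD observations can be illustrated with a minimal Ruby sketch based on the standard Net::HTTP library. This is not our proof of concept, just an illustration: the method names (nagios_state, check_paths, check_host), the mapping from status codes to states, and the fixed port 443 are assumptions made for this example, and unlike -f follow it does not follow redirects.

```ruby
require 'net/http'

# Map an HTTP status code to a Nagios-style state.
# Assumption for this sketch: 2xx/3xx is OK, 4xx is WARNING, anything else CRITICAL.
def nagios_state(code)
  case code.to_i
  when 200..399 then :ok
  when 400..499 then :warning
  else :critical
  end
end

# Send one HEAD request per path over an already opened connection,
# so the TCP/SSL handshake is paid once per host instead of once per URL.
def check_paths(http, paths)
  paths.map do |path|
    started  = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    response = http.head(path)
    elapsed  = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    { path: path, code: response.code, state: nagios_state(response.code), seconds: elapsed }
  end
end

# Open a single HTTPS connection to the host and run all checks over it.
def check_host(host, paths, timeout: 2)
  Net::HTTP.start(host, 443, use_ssl: true,
                             open_timeout: timeout, read_timeout: timeout) do |http|
    check_paths(http, paths)
  end
end
```

A call such as check_host('somehost.atlassian.net', ['/wiki', '/']) would then return one result hash per path, all fetched over the same connection.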

We can avoid part of the forking by setting use_large_installation_tweaks=1. The benefits and caveats are explained in the docs.

Check scheduling

Nagios itself tries to be smart about scheduling the checks. It tries to spread the service checks evenly across the check interval you configure. More information can be found in the older Nagios documentation.

Configuration options that influence the scheduling are:

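normal_check_interval: how long to wait between regular executions of the same check
retry_check_interval: how long to wait before retrying a check that failed
check_period: the time period during which the check may be executed
inter_check_delay_method: the method used to space out the initial service checks (values listed below)
service_interleave_factor: how service checks are interleaved across hosts, so that checks for the same remote host are not all run back to back
max_concurrent_checks: the maximum number of service checks that may run in parallel
service_reaper_frequency: how often (in seconds) Nagios processes the results of service checks that have finished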

The default for inter_check_delay_method is smart (s). If we want to execute the checks as fast as possible, we can pick one of the other values:

n = Don’t use any delay - schedule all service checks to run immediately (i.e. at the same time!)
d = Use a “dumb” delay of 1 second between service checks
s = Use a “smart” delay calculation to spread service checks out evenly (default)
x.xx = Use a user-supplied inter-check delay of x.xx seconds
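
To get a feel for what the smart setting means at our scale, the sketch below uses the formula given in the Nagios scheduling documentation: the average check interval divided by the total number of services. The method name smart_inter_check_delay is made up for this example, and the real scheduler applies further limits that are ignored here.

```ruby
# Approximate the "smart" inter-check delay: the average check interval
# divided by the number of services (formula from the Nagios scheduling docs).
# smart_inter_check_delay is a hypothetical name used only in this sketch.
def smart_inter_check_delay(average_interval_seconds, total_services)
  average_interval_seconds.to_f / total_services
end

five_minutes = smart_inter_check_delay(5 * 60, 50_000)
one_minute   = smart_inter_check_delay(60, 50_000)

puts format('5-minute cycle: %.4f s between check starts', five_minutes)
puts format('1-minute cycle: %.4f s between check starts', one_minute)
```

With 50K services on a single scheduler, that is a new check starting about every 6 ms for a 5-minute cycle, and about every 1.2 ms for a 1-minute cycle.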

Distributing checks

When one host can’t cut it anymore, we have to scale eventually. Here are some solutions that live completely in the Nagios world:

Our future solution would take a similar approach of dispatching the check commands and gathering the results back over a queue, but we’d like it to be less dependent on Nagios and possible to integrate with other monitoring solutions (think Unix toolchain philosophy). A great example of the idea can be seen in the Velocityconf presentation Asynchronous Real-time Monitoring with MCollective.

Submitting check results back to Nagios

So with distribution we just split our problem into smaller problems again. Let’s focus once more on the problem of a single host running checks: after all, the more checks we can run on one host, the less we have to distribute.

Nagios passive checks make it easy to decouple the checks from the main Nagios loop and submit the check results later. NSCA (Nagios Service Check Acceptor) is the most used solution for this.
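
As an illustration of what a passive result looks like once it reaches Nagios, the sketch below formats a PROCESS_SERVICE_CHECK_RESULT external command, the form in which passive service results are written to the Nagios external command file (the path set by command_file in nagios.cfg). The method names and the service description "wiki" are assumptions for this example. It writes to any IO object, so where the line is sent is left open.

```ruby
# Translate a symbolic state into the numeric Nagios return code.
def nagios_return_code(state)
  { ok: 0, warning: 1, critical: 2, unknown: 3 }.fetch(state)
end

# Build one external command line for a passive service check result.
# Only the first line of the plugin output is kept, to stay on a single line.
def passive_result_line(host, service, state, output, at: Time.now)
  first_line = output.to_s.lines.first.to_s.chomp
  "[#{at.to_i}] PROCESS_SERVICE_CHECK_RESULT;#{host};#{service};" \
    "#{nagios_return_code(state)};#{first_line}\n"
end

# Write the result to any IO-like object, for example the opened command file.
def submit_passive_result(io, host, service, state, output, at: Time.now)
  io.write(passive_result_line(host, service, state, output, at: at))
end

puts passive_result_line('somehost.atlassian.net', 'wiki', :ok,
                         'HTTP OK: HTTP/1.1 200 OK - 546 bytes in 0.734 second response time')
```

Keeping the formatting separate from the writing makes it easy to swap the transport later, whether that is the command file, NSCA or something else.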

NSCA does have a few limitations:

Opsview writes:

Only the first 511 bytes of plugin output was returned to the master, limiting the usefulness of the information you could display
Only the 1st line of data was returned, meaning you had to cramp output together
NSCA communication used fixed size packets which were inefficient
While results were sent, Nagios would wait for completion, introducing a bottleneck
If there was a communication problem with the master, results were dropped

This led them to use NRD (Nagios Result Distributor).

Ryan writes:

“What no one tells you when you are deploy NCSA is that it send service checks in series while nagios performs service checks in parallel”

This led him to write A highperformance NSCA replacement, which feeds the results directly into the livestatus pipe instead of sending them over the NSCA protocol baked into Nagios. On a similar note, Jelle Smet has created NSCAweb (Easily submit passive host and service checks to Nagios via external commands).

We would leverage the Send NSCA Ruby gem.

Why is this relevant to our solution? Without employing some of these optimizations, our bottleneck would shift from running the checks to accepting the check results.

Another solution could be to run an NRPE server, and we could probably leverage some Ruby logic from Metis, a Ruby NRPE server.

Conclusion

Even after the following optimizations:

using HEAD instead of GET
large installation tweaks
tuning the inter_check_delay_method
parallel instead of serial NSCA submissions

we can still optimize with:

avoiding the fork by running all checks from the same process
reusing the HTTP connection across multiple requests to the same host (potentially even using HTTP pipelining)
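
To make the first of these points concrete, here is a minimal sketch of running many checks from a single process with a fixed pool of threads, instead of one fork/exec per check. It is only an illustration using plain Ruby threads, not the EventMachine-based approach we intend to test. The names run_checks and checker are hypothetical, and the block passed in stands for whatever actually performs the HTTP check.

```ruby
# Run many checks from one process with a fixed pool of worker threads.
# The block (checker) is any callable that takes a URL and returns a result.
def run_checks(urls, concurrency: 50, &checker)
  jobs = Queue.new
  urls.each { |url| jobs << url }
  jobs.close # once drained, pop returns nil and the workers stop

  results = Queue.new
  workers = Array.new([concurrency, 1].max) do
    Thread.new do
      while (url = jobs.pop)
        outcome = begin
          checker.call(url)
        rescue StandardError => e
          e # keep the failure as the result instead of killing the worker
        end
        results << [url, outcome]
      end
    end
  end
  workers.each(&:join)

  # Collect everything into a { url => outcome } hash.
  Array.new(results.size) { results.pop }.to_h
end
```

A call like run_checks(urls, concurrency: 100) { |url| some_http_check(url) } would return a hash of URL to result, where some_http_check is a placeholder for the real check and failed checks show up as exception objects.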

In the next blog post we will show the results of proof-of-concept code involving Ruby/EventMachine/JRuby and various HTTP client libraries.