Rundeck And Nagios Nrpe Checks

I’ve been playing with Rundeck recently. For those that haven’t seen it yet it’s an application for running commands across a cluster of machines and recording the results. It has both a command line client and a very rich web interface which boths allows you to trigger commands and shows the results.

I’ve played with a few different jobs so far, including triggering Puppet runs across machines triggered by a Jenkins plugin. I’ve also been looking at running all my monitoring tasks at the click of a button (or again as part of a smoke test triggered by Jenkins) and I thought that might make a nice simple example.

My checks are written as Nagios plugins, and run periodically by Nagios. I also trigger them manually, using Dean’s NRPE runner script.

Rundeck showing a job output

The above shows a successful run across a few machines I use for experimenting with tools. Hopefully you can see the summary of the run on each of the four machines, each ran five NRPE checks and all passed. On failure we’d see the results as well as different symbols and colous. We can easily save the output to a file if we need to, rerun or duplicate the job (maybe to have it run against a different group of machines) or we can export the job definition file to load into another instance.

The same job can also be run on the command line (which makes use the of Rundeck API)

./run -j "Run NRPE checks" -p PRGMR

This example shows running a specific pre-defined job, but it’s also equally possible to fire of adhoc commands to some or all of the machines rundeck knows about.

One thing in particular that I prefer about this approach to say using Capistrano or Fabric for remote execution tasks is that you have a centralised authentication and logging capability. It would be easy enough to encapsulate the jobs into cap or fabric tasks (and manage that in source control) which means you’re not stuck if Rundeck isn’t available.

On Her Majesty's Digital Service

This blog post is mainly an excuse to use the pun in the title. It’s also an opportunity to tell folks that don’t already know I’ll be starting a new job on Monday working for the UK Government. I’m going to be work for the Government Digital Service, a new department tasked with a pretty wide range of sorting the Government out online.

The opportunity is huge. And when it came around I couldn’t turn it down. I’m going to be working with a bunch of people I’ve known and respected for a while, as well as other equally smart people. That means I’m going to be back in London again as well.

Hopefully I’ll be able to talk lots about what we’re up to. The groundwork for that has already been laid by the team who have been blogging furiously about topics of interest.

Talking Configuration Management, Vagrant And Chef At Lrug

I stepped in at the last minute to do a talk at the last London Ruby User Group. From the feedback afterwards folks seemed to enyoy it and I certainly had fun. Thanks to everyone who came along.

As well as the slides the nice Skills Matter folks have already uploaded the videos from the night.

Vim With Ruby Support Using Homebrew

I’ve spend a bit of time this weekend cleaning, tidying and upgrading software on my mac. While doing that I got round to compiling my own Vim. I’d been meaning to do this for a while, I prefer using Vim in a terminal to using MacVim, and I like having access to things like Command-T which requires Ruby support which the inbuild version lacks.

Vim isn’t in Homebrew, because Homebrew’s policy is to not provide duplicates of already installed software. Enter Homebrew Alt which provides formulas for anything not allowed by the homebrew policy. As luck would have it a Vim Formula already exists. And installing from it couldn’t be easier.

brew install

As it turns out this failed the first time I ran it because I had an rvm installed Ruby on my path. I reset this to the system version and everything compiled fine.

rvm use system

Note also that it’s really quite simple to use a different revision or different flags when compiling. Just download that file, modify it, serve it locally (say with a python one line web server) and point brew install at it. Next step, running off head for all the latest and greatest Vim features.

Jenkins Build Pipeline Example

The idea of a build pipeline for web application deployment appears to have picked up lots of interest from the excellent Continuous Delivery book. Inspired by that, some nice folks have build an excellent plugin for Jenkins unsurprisingly called the Build Pipeline Plugin. Here’s a quick example of how I’m using it for one of my projects*.

Build pipeline example in Jenkins

The pipeline is really just a visualisation of up and downstream builds in Jenkins given a starting point, plus the ability to setup manual steps rather than just the default build after ones. That means the steps are completely up to you and your project. In this case I’m using:

  1. Build - downloads the latest code and any dependencies. You could also create a system package here if you like. If successful triggers…
  2. Staging deploy - In this case I’m using capistrano, but it could easily have been rsync, fabric or triggering a chef or puppet run. If successful triggers…
  3. Staging test - This is a simple automated test suite that checks that the site on staging is correct. The tests are bundled with the code, so are pulled down as part of the build step. If the tests pass…
  4. Staging approval - This is one of the clever parts of the plugin. This jenkins job actually does nothing except log it’s successful activation. It’s only run when I press the Trigger button on the pipeline view. This acts as a nice manual gate for a once over check on staging.
  5. Production deploy - using the same artifact as deployed to staging this job triggers the deploy to the production site again via capistrano

I’m triggering builds on each commit too via a webhook. But I can also kick off a build by clicking the button the pipeline view if I need to.

Pipeline example showing build in progress

Note that I’m only allowing the last build to be deployed given only that one can be checked on staging. Again this is configuration specific to my usage, the plugins lets you operate a number of different ways. There are a number of tweaks I want to make to this, mainly around experimenting with parameterized builds to pass useful information downstream and even allow parrallel execution. For the moment I have the Block build when upstream project is building flag checked on the deploy.

 * Yes, this is a one page site. With a 5 step build process in Jenkins including a small suite of functional tests and a staging environment. This is what we call overengineering.

Varnish At Refresh Cambridge

I did a quick lightning talk at the Refresh Cambridge meetup last night, a very quick introduction to Varnish. Given 10 minutes all I really wanted to do was get people to go away and take a look at it. Lots of folks in the room hadn’t come across it before so I think the talk was hopefully well pitched.

Several people asked about slighly dynamic pages and I only got chance to mention support for ESI (Edge Side Includes) at the end. Conversation afterwards turned to various parts of the modern web stack and I had a pretty good time being opinionated. Hopefully more of the same next month.

Django Performance Patterns 1: Measuring Performance


As Django has matured it’s being used for bigger and bigger projects. At the same time it’s also being used by more and more people building relatively simple applications quickly. Everyone’s application is different, but I’d wager the vast majority of these have a range of common performance problems. Performance is often something only larger teams get to spend time really getting to grips with. This is sometimes because smaller projects can’t afford the time, or more often probably that things are thought to be fast enough anyway.

One advantage of using a framework is the sharing of common solutions to common problems that happens as a community forms. In what is hopefully going to be a bit of a series I’m going to cover some simple things everyone can do to improve application performance. The patterns are generally applicable, but I’m going to focus on Django examples.

I’m going to be pretty opinionated about the stack I’m using when necessary. I’m not looking to compare different web servers or databases or python versions. And I’d rather give concrete examples than generalise. If you’re using a different stack that’s fine, somethings will just work and others will need you to know how to configure the software you’ve already chosen. I’m also going to focus on a very small and simple to understand application. Most of these techniques scale up just fine, but I feel people don’t often use them on smaller projects because they thing you can only use them on larger ones. Or that you won’t see much impact on a smaller project. Both of these don’t ring true in my opinion and I’ll hopefully show why.

Measuring Performance

In this first part of the series lets take a quick detour to frame everything else. Lets talk about ways we can measure performance so we can see if the changes we’re making have the desired impact. If you’re not measuring performance already then start here.

We’ll start out looking at a few tools which are useful when taking a view by view approach to analysing performance. These generally ignore the impact of load on the system but because of this are generally easier to understand and read.

Django Debug Toolbar

Most Django developers will hopefully already be using the excellent Debug Toolbar It has a number of features relevant to our current quest but the most interesting is the query count. Less queries is nearly always better. That’s a whopping generalisation, but looking for unnecessary queries or duplicated queries or particularly slow running queries is a great way of making your application faster. The ORM makes it pretty easy to end up with a querysplosion if you’re not paying attention.

It’s very simple to install:

pip install django-debug-toolbar

The query section shows you the number of queries, the individual queries themselves and the time taken. It’s designed to be run in debug mode, so the actual query times will likely be lower in production, but the query that’s taking ages in development will probably still be slow when you go live.

Django debug toolbar


YSlow is a browser extension for Firefox and Chrome that gives information and recommendations about a number of mainly HTTP, CSS or javascript issues individual pages might have. It will give you a score as well as suggestions for improvement:

YSlow showing a score of 96

Also useful is the break down of the number of HTTP requests, and the affect of a primed cache on page loading.

YSlow showing http request breakdown

Profiling Middleware

Sometimes you want to know the very low level calls that go into making a page render. For this you’ll want to look at profiling tools. The Django wiki has a useful page on profiling which is good if dispiriting reading.

Django Snippets has several profiling middleware, one of which is packaged up in the excellent Django Snippetscream. We can install that like so:

pip install django-snippetscream

Just include the middleware in your debug environment:

MIDDLEWARE_CLASSES = MIDDLEWARE_CLASSES + ('snippetscream.ProfileMiddleware',)

You can then append ?prof to any of your URLs and instead of seeing the view you’ll see something like the following:

Profiling view

Look at where your code spends it’s time and whether you have repeated calls to the same methods and functions. Sometimes getting down to this level of detail is the easiest way of finding the bottleneck in your application.

Nginx Logging

Here’s the first time I’m being opinionated about the stack, by choosing Nginx as my frontend server. I’ll talk a little about why this is a good idea later, but for the moment lets concentrate on why this is useful for measuring performance.

Log files are wonderful things, and Nginx has quite a powerful syntax for adding extra information to log files. Note the last line in the definition below.

log_format timed_combined '$remote_addr - $remote_user [$time_local]  '
      '"$request" $status $body_bytes_sent '
      '"$http_referer" "$http_user_agent" '
      '$request_time $upstream_response_time $gzip_ratio';

We are adding the entire request time, the time taken by the upstream server (in my case gunicorn) to respond and also the gzip ratio. This is really handy if you’re optimising an application already in production. By collecting this data here it’s easy to then analyse the logs and pull out things like slow urls or assets not being gzipped effectively.

Django Timelog

Very similar to the above nginx logging, but implemented as a django 1.3 application (so it can be used in development as well) is one of my projects, django-timelog. As well as logging the time taken for each request, django-timelog provides a management command to analyse the resulting log file. It produces output which can show in aggregate the average response time of either views or individual URLs.

| view                     | method | status | count | minimum | maximum | mean  | stdev |
| boxes.viewsBoxDetailView | GET    | 200    | 9430  | 0.14    | 0.28    | 0.21  | 0.070 |
| boxes.viewsBoxListView   | GET    | 200    | 66010 | 0.17    | 0.28    | 0.232 | 0.045 |
| django.views.staticserve | GET    | 200    | 61295 | 0.00    | 0.02    | 0.007 | 0.006 |

It’s packaged so installation should be straightforward.

pip install django-timelog

Again this can be used in a production environment, or it can be used locally while developing. You can also use load testing tools as described in a moment to generate traffic which is then logged.

Load Testing

I’m mainly looking for a tool here which can easily generate HTTP traffic in volume, sending a decent number of concurrent requests against your application and returning some useful results. I mainly turn to ab (Apache bench) because it’s available everywhere and it’s very simple to use.

For example lets hit a site with 100 requests, with requests being sent in batches of 5.

ab -c 5 -n 100

This will print something like the following. For our purposes we’re mainly interested in the requests per second value and the mean request time.

Concurrency Level:      5
Time taken for tests:   1.981 seconds
Complete requests:      100
Failed requests:        0
Write errors:           0
Total transferred:      328300 bytes
HTML transferred:       297400 bytes
Requests per second:    50.47 [#/sec] (mean)
Time per request:       99.064 [ms] (mean)
Time per request:       19.813 [ms] (mean, across all concurrent requests)
Transfer rate:          161.82 [Kbytes/sec] received

Load testing is a pretty large topic. For instance even with the above simple example how do we know if 100 requests is enough? (it’s not.) Or whether a concurrency of 5 is useful? Often what you’re interested in is where your application starts to saturate or where it starts to error. But even without getting bogged down in the details a simple test like this can show changes have had a positive or negative effect. I’ll show examples of this as we investigate optimisation techniques.

If you’re working on a larger project hopefully you’ll have the time to investigate other approaches too. I’m quite a fan of using production logs to replay requests for instance, and of using Funkload for running scenarios under load. I’ll hopefully write more about those later. I’ve heard good things about Tsung as well, HTTPerf is excellent and JMeter has many fans. I’m using ab for examples because it’s point and shoot and you probably already have it installed without knowing.

Hopefully that’s a useful list of tools to get a baseline of where you’re at with performance. The rest of the articles in this series will show approaches to improve performance, and come back to one or more of these tools to confirm we’re heading in the right direction.

New Ganglia Web Interface Improvements

So I’m a huge Ganglia fan. It’s my go-to tool for standard low level metrics and for more ad-hoc higher level stuff as well. So I’m very happy to see the newly released version of the Ganglia Web interface.

Here’s the old look and feel:

Old Ganglia UI

And here’s the new version:

New Ganglia UI

The first thing that stands out to me is the wide range of new options. We have a navigation bar with items like search, views, agregate graphs, events and mobile. And we have dynamic date ranges, and filtering and sorting options for metrics.

  • Search is what you’d expect. A search as you type facility for hosts and metrics, great if you have a large cluster and want to find something quickly.
  • Views are interesting. Basically a way of putting together a specific set of metrics on one screen and linking to them, rather than always having to go via the node or grid overview.
  • You could always create agregate graphs in Ganglia, but you had to delve into writing some PHP to get what you wanted. Not you can do certain types of graph on the fly, simply by specifying what you want.
  • I’m cheating slightly here and using the very latest code rather than the 2.0.0 release. But it’s worth it for the Events feature. Events gives you the ability to record events like a deploy or a daily backup onto your graphs, immediately making some types of diagnosis easier. More about how I’m using this when I finish automating it.
  • Gnaglia always worked OK on a mobile device mainly because all you really wanted was the graphs. The new mobile UI however just makes navigating much easier. It’s not yet doing device detection and automatically redirecting my iphone to that view which would be nice but it’s definately a good addition.
  • Combined with views, automatic rotation shows a series of graphs, one every 30 seconds. Particularly useful for big screen dashboards, in fact I’ve built this exact feature before more than once before.

I’ve concentrated here on the new visual features, but the new UI also contains CSV and JSON output for all metrics as well as a much easier JSON based approach to defining custom metrics. The amount of stuff in this release is huge. Massive thanks to the folks behind all the new features, in particular Vladimir Vuksan.

Ganglia is such a staple that I’m used to just using whatever is in the latest distro provided packages. When it comes to the web UI however I’m going to make an exception from now on. Not only can you run the new UI in parallel with the old, deployment is simply copying a directory into place (It’s easy to forget just how simple PHP deployment is sometimes). If you’re already using Ganglia spend a few minutes installing the new UI. If you’ve rejected Ganglia previously because it didn’t have one feature or another then now is the time to look again.

Logging Django Performance

I’ve been doing some basic performance profiling work with Ruby on Rails recently and one tool I found very useful was the request log analyzer. It’s a relatively simple command line application that you can point at the Rails application log files and it outputs lots of information in agregate. So information about request duration averages or about SQL queries run. When working on a recent Django project I wanted a tool to do the same thing and ended up writing timelog.

I did a bit of research to see if I could find something similar. Here are a few other interesting tools that didn’t quite do what I wanted:

  • Django Slow Log - This logs things like memory and load averages
  • Django Dump Slow - Similar but designed to just log slow requests rather than everything, also needs a Redis backend
  • Zamboni Middleware - This is very similar to what I wanted, but it’s not a separate module and I didn’t find anything to analyse the results

Timelog is very similar to the middleware in Zamboni, the only real difference being I’m using the new logging support in Django 1.3. More interesting is the bundled management command which will output something like:

| view                     | method | status | count | minimum | maximum | mean  | stdev           |
| boxes.viewsBoxDetailView | GET    | 200    | 9430  | 0.14    | 0.28    | 0.21  | 0.0700037118541 |
| boxes.viewsBoxListView   | GET    | 200    | 66010 | 0.17    | 0.28    | 0.232 | 0.0455415351076 |
| django.views.staticserve | GET    | 200    | 61295 | 0.00    | 0.02    | 0.007 | 0.0060574669888 |

At the moment I’ve not done any real benchmarking or optimisation of the script, but it will chew through a 300,000 line (20MB) log file in under 20s on my aging macbook.

The code for Timelog is on github at and I’ve uploaded to PyPi as well at You can install it with the usual tools like pip:

pip install django-timelog

Once installed you need to do a little configuration to get things working. First add the middleware to your MIDDLEWARE_CLASSES in your settings file.


Next add timelog to your INSTALLED_APPS list. This is purely for the management command discovery.


Then configure the logger you want to use. This really depends on what you want to do, the django 1.3 logging setup is pretty powerful. Here’s how I’ve got logging setup as an example:

TIMELOG_LOG = '/path/to/logs/timelog.log'

  'version': 1,
  'formatters': {
    'plain': {
      'format': '%(asctime)s %(message)s'},
  'handlers': {
    'timelog': {
      'level': 'DEBUG',
      'class': 'logging.handlers.RotatingFileHandler',
      'filename': TIMELOG_LOG,
      'maxBytes': 1024 * 1024 * 5,  # 5 MB
      'backupCount': 5,
      'formatter': 'plain',
  'loggers': {
    'timelog.middleware': {
      'handlers': ['timelog'],
      'level': 'DEBUG',
      'propogate': False,

In order for the analyser script to work correctly you’ll need to use the plain formatter and to register a handler for the timelog.middleware logger.

With that all configured try hitting your application a few times. You should see a log file created at the location specified in TIMELOG_LOG. Fill that up with a few lines and then run the analyze_timelog management command:

python analyze_timelog

This should output something similar to the above table but with your timings and views. The management command currently allows you to point the analyzer at a different file say from a backup, or a file you’ve pulled down from production but want to run the command locally. I’ll likely add some filters as time permits for things like not showing all views or showing only views between a given date range.

The above table showing the view function is good for big picture trends, but if you want to dig down into a particular path then you can pass an argument to not reverse the path:

python analyze_timelog --noreverse

This should give you something more like:

| view                             | method | status | count | minimum | maximum | mean  | stdev            |
| /assets/css/main.css             | GET    | 200    | 61295 | 0.00    | 0.02    | 0.007 | 0.0060574669888  |
| / bob                            | GET    | 404    | 4715  | 0.01    | 0.01    | 0.01  | 0.0              |
| /                                | GET    | 200    | 66010 | 0.17    | 0.28    | 0.232 | 0.0455415351076  |
| /__debug__/m/css/toolbar.min.css | GET    | 304    | 4715  | 0.00    | 0.00    | 0.0   | 0.0              |
| /14/                             | GET    | 200    | 9430  | 0.14    | 0.28    | 0.21  | 0.0700037118541  |

Debugging HTTP Headers with RedBot

I’ve been using the site as a bit of a playground recently and I’ve been meaning to blog about some of the overengineering I’ve been doing. Here’s a smaller starter.

Getting the headers returned by your web server correct is both easy to do and easy to forget about. Unless you go actively looking for headers with curl or similar you’ll probably miss them, and even then will you spot an incorrect header by eye? RedBot is an excellent online tool that not only shows you the headers but makes recommendations about what might be missing, invalid or things you haven’t considered.

For instance the RedBot results for look like this:

Debugging information from Redbot

Or at least they do now after a few tweaks to my nginx configuration. In particular I’ve added

add_header Vary Accept-Encoding;

I’d assumed that enabling gzip compression with gzip_vary would have set this automatically and never checked. The way RedBot provides a checklist of recommendations is very handy.

As well as checking the page itself you can also check all the assets associated with a page by adding a query string argument. For instance here is the assets view for RedBot also provides a JSON encoded version of the result which might be useful in a CI system. If you’d rather run your own instance of the software you can, the code can be found on github at It currently doesn’t work with HTTPS resources but that’s about the only thing I’d like to see added.