Problems Installing Hadoop 0.20 and Dumbo 0.21 on Ubuntu

The Hadoop wiki has a great introduction to installing Hadoop, which I wanted to do so I could have a play with Dumbo. The Dumbo docs also have a good getting started section which includes a few patches that need to be applied.

Dumbo can be considered to be a convenient Python API for writing MapReduce programs.

Unfortunately it’s not quite that simple, at least on Ubuntu Jaunty. Hadoop now uses Java6, but if you just follow the instructions on the wikis you’ll hit a problem when you run “ant package”, namely that a third party application (Apache Forrest) requires Java 1.5. Once you fix that, the build script will complain again that you need to install Forrest. Here’s what I did to get everything working:

pre. sudo apt-get install ant sun-java5-jdk

pre. su - hadoop
wget
tar xzf apache-forrest-0.8.tar.gz
cd /usr/local/hadoop
patch -p0 < /path/to/HADOOP-1722.patch
patch -p0 < /path/to/HADOOP-5450.patch
patch -p0 < /path/to/MAPREDUCE-764.patch
ant package -Djava5.home=/usr/lib/jvm/java-1.5.0-sun -Dforrest.home=/home/hadoop/apache-forrest-0.8/

With all that out of the way you should be able to run the simple examples found on the rather excellent dumbotics blog. If you’re using the Cloudera distribution, or once Hadoop 0.21 is released, these problems will disappear, but in the meantime hopefully this saves someone else a bit of head scratching.
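For a sense of what those examples look like, here is a minimal word count in the style of the Dumbo tutorial. Treat it as a sketch rather than a tested job: the run_locally helper is my own addition for trying the two functions without Hadoop, and dumbo.run(mapper, reducer) is only reached when Dumbo is actually installed.

```python
def mapper(key, value):
    # value is one line of input text; emit (word, 1) for every word
    for word in value.split():
        yield word, 1


def reducer(key, values):
    # values iterates over every count emitted for this word
    yield key, sum(values)


def run_locally(lines):
    """Hypothetical stand-in for the MapReduce flow: map, group by key, reduce."""
    grouped = {}
    for number, line in enumerate(lines):
        for word, count in mapper(number, line):
            grouped.setdefault(word, []).append(count)
    results = {}
    for word in sorted(grouped):
        for key, total in reducer(word, grouped[word]):
            results[key] = total
    return results


if __name__ == "__main__":
    try:
        import dumbo
    except ImportError:
        # No Dumbo available: fall back to the local dry run
        print(run_locally(["the quick brown fox", "the lazy dog"]))
    else:
        dumbo.run(mapper, reducer)
```

Keeping the mapper and reducer as plain generator functions means they can be checked on a laptop before going anywhere near a cluster.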

Learnings from September

I keep meaning to get around to writing about why I think the future of web developers is operations, but in lieu of a proper post here’s a list of things I’ve spent my work life getting to know this month:

  • Puppet - It’s brilliant. Define (with a Ruby DSL of course) what software and services you want running on all your machines, install a daemon on each of them, and hey presto central configuration management.
  • VMWare vSphere - Puppet makes more sense the more boxes you have. With vSphere I can have as many boxes as I want (nearly). Command line scripts and a genuinely nice Windows GUI for setting up virtual machines make for a pleasant setup, especially running on some meaty hardware.
  • Nagios - With lots of boxes comes lots of responsibility (or something). Nagios might look a bit ugly, and bug me with its needless frames-based admin, but I can see what people see in it. Which frankly is the ability to monitor everything everywhere for any change whatsoever.
  • Solr - I’m now also pretty well versed in using Solr. I’ve used it in the past, but always behind a Ruby or Python library. Now I know my way around all the XML-based configuration innards. Heck, I’m even running a nightly release from a couple of days ago in a production environment because I wanted a cool new feature. A special mention to the Solr community on the mailing list, Twitter and IRC for being great when I had questions.
  • Solaris - I nearly forgot, I spent more time than I care to remember working out how to use Open Solaris (conclusion: OK, but not Debian) and eventually Solaris 10 (conclusion: hope I don’t have to do that again). My installation notes read like some hideous hack but everything works fine in production and it’s scarily repeatable so I’ll live with it for now.

I do wonder if it’s just me that’s drawn to knowing how everything in the full web stack works. But personally I can’t just write code if I don’t understand how to deploy it or what it’s running on. Front end types know this all too well. Being a master of CSS, HTML and Javascript simply isn’t enough. You need to understand the browser to get anything done. I’m not sure it’s the same for all backend inclined folk; how many PHP programmers really understand Apache and a few other useful bits of web tech?

No database test runner added to test extensions

Thanks to Brad I’ve just released a new version of Django Test Extensions (also on GitHub) with support for running tests without the overhead of setting up and tearing down the database. Django still has a few places where it assumes you’ll have a database somewhere in your project, and the default test runner is one of them.

Automating web site deployment at Barcamp Brighton

On the first day at Barcamp Brighton this year I did a brief talk about getting started with automating deployment. I kept it nice and simple and didn’t focus on any specific technology or tool - just the general principles and pitfalls of doing anything manually. You can see the slides on Slideshare.

As part of the presentation I even did a live demo and promised I’d upload the code I used. The following is an incredibly simple fabric file that has most of the basic set of tasks. Fabric is a Python tool similar to Capistrano in Ruby. I don’t really care whether you’re using make, ant, rake, capistrano or just plain shell scripts. Getting from not automating things to automating deployments is the important part - and it’s easier than you think.

The other part of the code example was a very basic apache virtualhost so just in case anyone needed that as well here it is:

<VirtualHost *:80>
    ServerName sample.local
    DocumentRoot /srv/sample/releases/current
    <Directory /srv/sample/releases/current>
        Order deny,allow
        Allow from all
    </Directory>
    ErrorLog /var/log/apache2/sample/error.log
    LogLevel warn
    CustomLog /var/log/apache2/sample/access.log combined
</VirtualHost>

Another chance to DJUGL

DJUGL is back, the monthly Django meetup in London. I think the last few times have been as much about useful Python stuff as just using Django, and this time it’s officially a bit more broad ranging. If you’re in or around London on the 24th September then come along.

You can get more information on Twitter or by following Rob. But expect a few short talks, some interesting conversations and maybe some beer with other like minded developers.

I’m going to be talking about automating deployment of Python web applications. If you follow me on Twitter you’ll have heard me rambling a little about some of what I’ve been up to, and some of the posts here give an insight into what I’ve been working on. But the short version is that several friends mentioned how difficult it could be to get a working Django application from a local machine to a production web server. And I thought I’d better get down in script form my experiences of Django, WSGI applications and web server setup to make things easier.

I think this situation is partially caused by the success of the Django development server, and partially by people coming from a PHP background. In my PHP days I think I always wanted to know how Apache did its thing, so long ago I jumped into anything and everything in httpd.conf from loading modules to virtual hosts. But not everyone does the same, and PHP does make simple deployments easy enough that you might get away without doing so. Rails went through the same problems and seems to be coming out the other side. I’m hoping that Django and Python are soon to be in the same position, where basic deployment is just a given.

Django and WSGI deployment on Solaris

Now I’m generally an Ubuntu guy, but I’ve just had the need to set up some boxes running Solaris for Django and a handful of WSGI applications. I know my way around Ubuntu pretty well. I know all the packages I need to install and in what order. Hell, I even have all that scripted so I can just run a command and it works by magic. I’ll script the following steps too when I get round to it, but here, in one list, are the installation instructions for Apache, mod_wsgi, MySQL, MySQLdb, setuptools and memcached that worked for me on the latest version of Open Solaris (2009.06 at the time of writing).

First up I needed to install Apache and start the service running.

pfexec pkg install SUNWapch22
svcadm enable http:apache22

You should be able to test that’s running by hitting localhost on a browser running on the same box. Now for MySQL.

pfexec pkg install SUNWmysql5
svcadm enable mysql:version_50

This installs the mysql binary into /usr/mysql/5.0/bin/mysql on the system I’m working on. As I want to talk to the MySQL database server using Python I need to install MySQLdb.

pfexec pkg install SUNWmysql-python
ln -s /usr/mysql/5.0/lib/mysql/ /usr/lib/

This installs the library files into /usr/mysql/5.0/lib and Python doesn’t know where to find them. The above command links them into the more standard /usr/lib folder where Python will pick them up nicely.

I tend to use mod_wsgi for serving Python apps behind Apache, however a mod_wsgi package isn’t part of the default package list. It is however available in the pending list so first you need to add that list of packages.

pfexec pkg set-authority -O pending
pfexec pkg refresh
pfexec pkg install mod-wsgi

This installs the module but you then need to tell Apache to load it. Add the following line to /etc/apache2/2.2/conf.d/modules-32.load or /etc/apache2/2.2/conf.d/modules-64.load depending on your architecture.

LoadModule wsgi_module    libexec/

To get Apache to load that module you need to restart it like so:

svcadm restart http:apache22

I use Pip for installing Python code, but tend to install setuptools to make installing Pip easier. I don’t know if an up to date Pip package exists.

pfexec pkg install python-setuptools

This should leave you with easy_install on your path so installing Pip, then virtualenv should be a breeze.

As an added bonus I also installed memcached for some snappy caching.

pfexec pkg install SUNWmemcached

This won’t start up by default and needs a little configuration. The first command will launch you into a prompt where you can type the rest of the commands.

svc:> select memcached
svc:/application/database/memcached> setprop memcached/options=("-u" "nobody")
svc:/application/database/memcached> quit

Once you’ve done that you should be able to start memcached on the standard port.

svcadm refresh memcached
svcadm enable memcached

Et voila. The internet helped massively on my quest to track down this information. Not all of the following links turned out to work for me but all of them led me in the right direction. Thanks everyone.

I’m not a Solaris admin. I’m not really a sysadmin at all, I just end up pretending to be one of late. If any experienced Solaris people with knowledge of these tools are reading this, I’d be grateful for any hints and tips. Hopefully this saves a few people from the head scratching I’ve been doing for the last few days.

Your Own PyPi server

So one of the problems with using pip or easy_install as part of an automated deployment process is they rely on an internet connection. More than that, they rely on PyPi being up as it’s a centralised system, unlike all the apt package mirrors.

The best solution seems to be to host your own PyPi compliant server. Not only can you load all the third party modules you use onto it, but you could also upload any internal applications or libraries that you like. By running this on your local network you ensure you’re not dependent on PyPi or an internet connection.

At the moment I’m playing with Chishop, which is a Django application for maintaining a PyPi compatible server. Another alternative if that doesn’t work out is EggBasket.

To install from your own PyPi server you can specify the location of your Chishop instance with the -i flag.

pre. easy_install -i http://localhost:8000/ PACKAGE_NAME

This will fall back to the PyPi server if it doesn’t find the relevant package. If you want to stop that behaviour and make sure you have a local package then you can limit the hosts with the -H flag like so.

pre. easy_install -H localhost:8000 -i http://localhost:8000/ PACKAGE_NAME

I’m not yet sure how to do this with pip, if someone wants to enlighten me in the comments then I’d be most grateful.

Fabric, Django, Git, Apache, mod_wsgi, virtualenv and pip deployment

I’ve been playing with automating Django deployments again, this time using Fabric. I found a number of examples on the web but none of them quite fit the bill for me. I don’t like serving directly from a repository, I like to have either a package or tar I can use to say “that is what went to the server”. I also like having a quick rollback command as well as being able to deploy a particular version of the code when the need arises. I also wanted to go from a clean Ubuntu install (plus SSH) to a running Django application in one command from the local development machine. The Apache side of things is nicely documented in this Gist which made a good starting point.

I’m still missing a few things in this setup mind, and at the moment you still have to set up your local machine yourself. I’m probably going to create a paster template and another fabfile to do that I think. The instructions are a little rough as well at the moment and I’ve left the database out of it as everyone has their own preference.

This particular fabric file makes setting up and deploying a Django application much easier, but it does make a few assumptions. Namely that you’re using Git, Apache and mod_wsgi and you’re using Debian or Ubuntu. Also you should have Django installed on your local machine and SSH installed on both the local machine and any servers you want to deploy to.

Note that I’ve used the name project_name throughout this example. Replace this with whatever your project is called.

First step is to create your project locally:

pre. mkdir project_name
cd project_name
django-admin.py startproject project_name

Now add a requirements file so pip knows to install Django. You’ll probably add other required modules in here later. Create a file called requirements.txt and save it at the top level with the following contents:

pre. Django

Then save this file in the top level directory which should give you:

pre. project_name
    requirements.txt
    project_name

You’ll need a WSGI file called project_name.wsgi, where project_name is the name you gave to your Django project. It will probably look like the following, depending on your specific paths and the location of your settings module:

pre. import os
import sys
# put the Django project on sys.path
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), "../")))
os.environ["DJANGO_SETTINGS_MODULE"] = "project_name.settings"
from django.core.handlers.wsgi import WSGIHandler
application = WSGIHandler()

Last but not least you’ll want a virtualhost file for Apache which looks something like the following. Save this as project_name in the inner directory. You’ll want to change /path/to/project_name/ to the location on the remote server you intend to deploy to.

pre. <VirtualHost *:80>
    WSGIDaemonProcess project_name-production user=project_name group=project_name threads=10 python-path=/path/to/project_name/lib/python2.6/site-packages
    WSGIProcessGroup project_name-production
    WSGIScriptAlias / /path/to/project_name/releases/current/project_name/project_name.wsgi
    <Directory /path/to/project_name/releases/current/project_name>
        Order deny,allow
        Allow from all
    </Directory>
    ErrorLog /var/log/apache2/error.log
    LogLevel warn
    CustomLog /var/log/apache2/access.log combined
</VirtualHost>

Now create a file called .gitignore, containing the following. This prevents the compiled python code being included in the repository and the archive we use for deployment.

pre. *.pyc

You should now be ready to initialise a git repository in the top level project_name directory.

pre. git init
git add .gitignore requirements.txt project_name
git commit -m "Initial commit"

All of that should leave you with

pre. project_name
    .git
    .gitignore
    requirements.txt
    project_name
        project_name
        project_name.wsgi

In reality you might prefer to keep your wsgi files and virtual host files elsewhere. The fabfile has a variable (config.virtualhost_path) for this case. You’ll also want to set the hosts that you intend to deploy to (config.hosts) as well as the user (config.user).

The first task we’re interested in is called setup. It installs all the required software on the remote machine, then deploys your code and restarts the webserver.

pre. fab local setup

After you’ve made a few changes and committed them to the master Git branch you can run the following to deploy the changes.

pre. fab local deploy

If something is wrong then you can rollback to the previous version.

pre. fab local rollback

Note that this only allows you to roll back to the release immediately before the latest one. If you want to pick an arbitrary release then you can use the following, where 20090727170527 is a timestamp for an existing release.

pre. fab local deploy_version:20090727170527

If you want to ensure your tests run before you make a deployment then you can do the following.

pre. fab local test deploy

The actual fabfile looks like this. I’ve uploaded a Gist of it, along with the docs, so if you want to improve it please clone it.

pre. # globals
config.project_name = 'project_name'

# environments
def local():
    "Use the local virtual server"
    config.hosts = ['']
    config.path = '/path/to/project_name'
    config.user = 'garethr'
    config.virtualhost_path = "/"

# tasks
def test():
    "Run the test suite and bail out if it fails"
    local("cd $(project_name); python manage.py test", fail="abort")

def setup():
    """
    Setup a fresh virtualenv as well as a few useful directories, then run
    a full deployment
    """
    require('hosts', provided_by=[local])
    require('path')
    sudo('aptitude install -y python-setuptools')
    sudo('easy_install pip')
    sudo('pip install virtualenv')
    sudo('aptitude install -y apache2')
    sudo('aptitude install -y libapache2-mod-wsgi')
    # we want rid of the default apache config
    sudo('cd /etc/apache2/sites-available/; a2dissite default;')
    run('mkdir -p $(path); cd $(path); virtualenv .;')
    run('cd $(path); mkdir releases; mkdir shared; mkdir packages;', fail='ignore')
    deploy()

def deploy():
    """
    Deploy the latest version of the site to the servers, install any
    required third party modules, install the virtual host and
    then restart the webserver
    """
    require('hosts', provided_by=[local])
    require('path')
    import time
    config.release = time.strftime('%Y%m%d%H%M%S')
    upload_tar_from_git()
    install_requirements()
    install_site()
    symlink_current_release()
    migrate()
    restart_webserver()

def deploy_version(version):
    "Specify a specific version to be made live"
    require('hosts', provided_by=[local])
    require('path')
    config.version = version
    run('cd $(path); rm releases/previous; mv releases/current releases/previous;')
    run('cd $(path); ln -s $(version) releases/current')
    restart_webserver()

def rollback():
    """
    Limited rollback capability. Simply loads the previously current
    version of the code. Rolling back again will swap between the two.
    """
    require('hosts', provided_by=[local])
    require('path')
    run('cd $(path); mv releases/current releases/_previous;')
    run('cd $(path); mv releases/previous releases/current;')
    run('cd $(path); mv releases/_previous releases/previous;')
    restart_webserver()

# Helpers. These are called by other functions rather than directly
def upload_tar_from_git():
    "Create an archive from the current Git master branch and upload it"
    require('release', provided_by=[deploy, setup])
    local('git archive --format=tar master | gzip > $(release).tar.gz')
    run('mkdir $(path)/releases/$(release)')
    put('$(release).tar.gz', '$(path)/packages/')
    run('cd $(path)/releases/$(release) && tar zxf ../../packages/$(release).tar.gz')
    local('rm $(release).tar.gz')

def install_site():
    "Add the virtualhost file to apache"
    require('release', provided_by=[deploy, setup])
    sudo('cd $(path)/releases/$(release); cp $(project_name)$(virtualhost_path)$(project_name) /etc/apache2/sites-available/')
    sudo('cd /etc/apache2/sites-available/; a2ensite $(project_name)')

def install_requirements():
    "Install the required packages from the requirements file using pip"
    require('release', provided_by=[deploy, setup])
    run('cd $(path); pip install -E . -r ./releases/$(release)/requirements.txt')

def symlink_current_release():
    "Symlink our current release"
    require('release', provided_by=[deploy, setup])
    run('cd $(path); rm releases/previous; mv releases/current releases/previous;', fail='ignore')
    run('cd $(path); ln -s $(release) releases/current')

def migrate():
    "Update the database"
    require('project_name')
    run('cd $(path)/releases/current/$(project_name); ../../../bin/python manage.py syncdb --noinput')

def restart_webserver():
    "Restart the web server"
    sudo('/etc/init.d/apache2 restart')
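The release switching is the part of this that’s easiest to get wrong, so here is a rough local sketch of the same current/previous symlink shuffle in plain Python. It isn’t part of the fabfile: the function names activate and rollback, the temporary directory and the second timestamp are all made up for illustration, and it assumes a POSIX filesystem with symlink support.

```python
import os
import tempfile


def activate(releases_dir, release):
    """Point 'current' at release, remembering the old target as 'previous'."""
    current = os.path.join(releases_dir, "current")
    previous = os.path.join(releases_dir, "previous")
    if os.path.lexists(previous):
        os.remove(previous)
    if os.path.lexists(current):
        os.rename(current, previous)
    # relative link, like `ln -s $(release) releases/current`
    os.symlink(release, current)


def rollback(releases_dir):
    """Swap 'current' and 'previous'; calling it twice undoes the rollback."""
    current = os.path.join(releases_dir, "current")
    previous = os.path.join(releases_dir, "previous")
    spare = os.path.join(releases_dir, "_previous")
    os.rename(current, spare)
    os.rename(previous, current)
    os.rename(spare, previous)


if __name__ == "__main__":
    releases = tempfile.mkdtemp()
    # the second timestamp is invented for the demo
    for name in ("20090727170527", "20090728091500"):
        os.mkdir(os.path.join(releases, name))
        activate(releases, name)
    rollback(releases)
    print(os.readlink(os.path.join(releases, "current")))
```

Because only the links move, the release directories themselves are never touched, which is what makes the rollback cheap.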

What's new in Django 1.1

With the release candidate for Django 1.1 out the door I decided to have a quick look at what’s new. This isn’t a complete list, rather the bits I found most interesting.

Conditional Views

Django now has much better support for conditional view processing using the standard ETag and Last-Modified HTTP headers. This means you can now easily short-circuit view processing by testing less-expensive conditions. For many views this can lead to a serious improvement in speed and reduction in bandwidth.

This is a nice set of decorators for dealing with ETags and Last-Modified headers. Again very simple to use and set up, and a simple way of squeezing a little more performance out of your application.
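To show the idea behind those decorators, here is a framework-free sketch of the ETag check. It is not Django’s implementation: etag_for and conditional_response are hypothetical names of mine, and real If-None-Match handling also has to cope with lists of tags and the * wildcard. In Django itself you would reach for condition, etag or last_modified from django.views.decorators.http instead.

```python
import hashlib


def etag_for(article_id, updated_at):
    """Cheap validator built from values that change whenever the article does."""
    raw = "%s:%s" % (article_id, updated_at)
    return '"%s"' % hashlib.sha256(raw.encode("utf-8")).hexdigest()


def conditional_response(request_headers, etag, render):
    """Return (status, headers, body), skipping render() when the tag matches.

    render is a callable that does the expensive work of building the body.
    """
    if request_headers.get("If-None-Match") == etag:
        # the client already has this version, so send no body at all
        return 304, {"ETag": etag}, b""
    return 200, {"ETag": etag}, render()


if __name__ == "__main__":
    tag = etag_for(42, "2009-07-27 17:05:27")
    first = conditional_response({}, tag, lambda: b"<h1>Hello</h1>")
    second = conditional_response({"If-None-Match": tag}, tag, lambda: b"<h1>Hello</h1>")
    print(first[0], second[0])
```

The saving comes from computing the tag from something cheap, such as a modification timestamp, rather than from the rendered page.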

Admin Actions

The basic workflow of Django’s admin is, in a nutshell, “select an object, then change it.” This works well for a majority of use cases. However, if you need to make the same change to many objects at once, this workflow can be quite tedious. In these cases, Django’s admin lets you write and register “actions” – simple functions that get called with a list of objects selected on the change list page.

Anything that makes the admin a little more powerful and a little more flexible is a good idea in my book. Admin actions allow you to run code over multiple objects at once: simply select them with a checkbox then select an action to run. This is worth it for the delete action alone, but you can write your own actions simply enough as well (for instance for approving a batch of comments, or archiving a set of articles).
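As a rough idea of how small an action can be, here is the comment approval example as a sketch. The is_approved field and the CommentAdmin class are hypothetical, and FakeQuerySet only exists so the function can be tried without a Django project; a real action receives an actual queryset.

```python
def approve_comments(modeladmin, request, queryset):
    """Mark every selected comment as approved in one go."""
    queryset.update(is_approved=True)

# label shown in the admin's action dropdown
approve_comments.short_description = "Approve selected comments"

# In admin.py this would be listed on a ModelAdmin subclass, for example:
#     class CommentAdmin(admin.ModelAdmin):
#         actions = [approve_comments]


class FakeQuerySet(list):
    """Stand-in offering the one method the action uses (assumption for the demo)."""

    def update(self, **fields):
        for obj in self:
            obj.update(fields)
        return len(self)


if __name__ == "__main__":
    comments = FakeQuerySet([
        {"id": 1, "is_approved": False},
        {"id": 2, "is_approved": False},
    ])
    approve_comments(None, None, comments)
    print(comments)
```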

Editable Admin List Items

You can now make fields editable on the admin list views via the new list_editable admin option. These fields will show up as form widgets on the list pages, and can be edited and saved in bulk.

Another time saving admin addition, this time for making some fields editable from the change list rather than the object view. For quick changes, especially to boolean fields, I think this again is a nice addition.

Unmanaged Models

You can now control whether or not Django creates database tables for a model using the managed model option. This defaults to True, meaning that Django will create the appropriate database tables in syncdb and remove them as part of the reset command. That is, Django manages the database table’s lifecycle. If you set this to False, however, no database table creation or deletion will be automatically performed for this model. This is useful if the model represents an existing table or a database view that has been created by some other means.

I particularly like this addition. One of the issues I had with Django was some of the built in assumptions, in particular that you’d be using a SQL database backend. Using unmanaged models looks like a great approach to using an alternative database like couchdb, tokyotyrant or mongodb or representing a webservice interface in your application.

I’m sure I’ll have missed a few other interesting changes or additions. Anyone else have a favourite?

Asteroid - simple app for running scripts and recording the results

Asteroid is a simple web interface for running scripts and recording the results. It’s like a much simpler and more general purpose version of something like Cruise Control. You can get the code on Github.

Asteroid Dashboard

I built it to solve two main problems:

  • It’s sometimes useful to have a historical record of a script’s execution, in particular whether it passed or failed and what the output was. Just running a command line script probably doesn’t give you that. It’s also useful to have a more graphical interface for those members of the team who don’t use the command line.
  • When working in a team you often want to run scripts against shared infrastructure, for instance deploying a testing release or running a test suite. Seeing what is running at present helps with that.

So it should be useful for running deployments, running test suites, running backups, etc. It currently doesn’t have scheduling or similar built in, but as everything is triggered by hitting a URL it would be simple enough to use cron for something like that. It should also be useful whatever language you write your scripts in; rake, ant, shell scripts, etc. At the end of the day it just executes a command at the console.


Asteroid uses the Django Python framework under the hood.

You’ll also need a database. The default in the shipped settings is to use sqlite but this should work with any database supported by Django.

You’ll also need a decent web browser. I’ve gone and used HTML5 as an experiment and with this being a developer tool I’m hoping to stick with it. It would be easy enough to convert the templates if this is a problem however.

The application has an optional message queue backend which can be enabled in the settings file. This is used to improve the responsiveness of the application as well as allow commands to be executed on a remote machine, rather than on the box Asteroid is running on.

Other AMQP compliant message queues should work but it’s currently only tested with Rabbit.

If you are intending to do any development on Asteroid, or just want to look more closely at the code, I’d recommend installing

Usage Instructions

You should be able to just download asteroid and run it from wherever you put it, once you setup the database.

cd asteroid/configs/common
python manage.py syncdb
python manage.py runserver

This should bring the local web server up on port 8000 so visit http://localhost:8000 and see.

If you’re using the message queue backend you’ll need to run the listener script in order to get your commands executed. At the moment that means modifying a constant in the listener script to point at a running message queue instance at asteroid/bin/

cd asteroid/bin

Once you’re up and running you should be able to add commands via the admin interface at http://localhost:8000/admin/. The username and password should be those you added when creating the database via the syncdb command above.

The development configs include a few additional applications (mentioned above) which I use for testing and debugging. You can run the test suite like so:

cd asteroid/configs/development
python manage.py test --coverage


This is an early release that just about works for me. I can already see a number of areas I’d like to clean up a little or extend. For instance:

  • Other deployment options, including a WSGI file and a spawning startup script.
  • Use a database migration system to make upgrades easier.
  • Make the message queue listener script more robust.
  • Make the command entry more robust, it sometimes takes a bit of fiddling with to get something to run correctly.
  • Formalise running scripts on remote machines, including support for running on multiple machines.
  • Paging for long lists of commands or runs.


I’m pretty happy with how it’s shaping up so far. Under the hood it works by having the web app put a JSON document on the message queue. The JSON contains the command to be run and a callback URL. The script listening to the message queue picks up the message, runs the command, and posts a JSON document back to the webhook url. It keeps the web interface snappy, as well as meaning it can show which commands are currently in progress at any given time. It also has the side benefit of meaning you can execute commands on a remote machine, as the listener doesn’t care where it’s running.
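In outline the listener side looks something like the sketch below. This is not the actual Asteroid code: the JSON field names (command, callback, passed, output), the function names and the webhook URL are assumptions for illustration, and the queue and the HTTP POST are left out so it runs on its own.

```python
import json
import subprocess


def build_message(command, callback_url):
    """What the web app might put on the queue (field names are assumed)."""
    return json.dumps({"command": command, "callback": callback_url})


def handle_message(raw):
    """Run the command from a queued message and build the reply document.

    Returns (callback_url, reply_json); a real listener would POST the
    reply to the callback URL rather than return it.
    """
    message = json.loads(raw)
    process = subprocess.Popen(
        message["command"],
        shell=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
    )
    output, _ = process.communicate()
    reply = {
        "command": message["command"],
        "passed": process.returncode == 0,
        "output": output.decode("utf-8", "replace"),
    }
    return message["callback"], json.dumps(reply)


if __name__ == "__main__":
    # hypothetical webhook URL, only here to show the round trip
    raw = build_message("echo hello", "http://localhost:8000/hypothetical-webhook/")
    url, reply = handle_message(raw)
    print(url)
    print(reply)
```

Nothing in handle_message cares which machine it runs on, which is why the same approach lets commands execute remotely.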

As noted above I have a few ideas of where I want to take it, but I’m going to try using it for a bit and see how that goes. If anyone else finds it useful then do let me know.