Problems Installing Hadoop 0.20 and Dumbo 0.21 on Ubuntu

The Hadoop wiki has a great introduction to installing this piece of software, which I wanted to do to have a play with Dumbo. The Dumbo docs also have a good getting started section which includes a few patches than need to be applied.

Dumbo can be considered to be a convenient Python API for writing MapReduce programs

Unfortunately it’s not quite that simple, at least on Ubuntu Jaunty. Hadoop now uses Java6, but if you just follow the instructions on the wikis you’ll hit a problem when you run “ant package”, namely that a third party application (Apache Forrest) requires Java 1.5. Once you fix that, the build script will complain again that you need to install Forrest. Here’s what I did to get everything working:

pre. sudo apt-get install ant sun-java5-jdk

pre. su - hadoop wget http://mirrors.dedipower.com/ftp.apache.org/forrest/apache-forrest-0.8.tar.gz tar xzf apache-forrest-0.8.tar.gz cd /usr/local/hadoop patch -p0 < /path/to/HADOOP-1722.patch patch -p0 < /path/to/HADOOP-5450.patch patch -p0 < /path/to/MAPREDUCE-764.patch ant package -Djava5.home=/usr/lib/jvm/java-1.5.0-sun -Dforrest.home=/home/hadoop/apache-forrest-0.8/

With all that out of the way you should be able to run the simple examples found on the rather excellent dumbotics blog. If you’re using the Cloudera distribution, or when the Hadoop 0.21 gets a release, these problems will disappear but in the meantime hopefully this saves someone else a bit of head scratching.