Processing large files with sed and awk

I found myself using a couple of powerful but underused command line applications this week and felt like sharing.

My problem involved a large text file with over three million lines and a script that processed the file line by line, in this case running a SQL query against a remote database.

My script didn’t try to process everything in one go; instead it took off large chunks and processed them in turn, then stopped and printed out the number of lines processed. This was mainly so I could keep an eye on it and make sure it wasn’t having a detrimental effect on other systems. But once I’d run the script once (and processed the first quarter of a million records or so) I wanted to run it again, without the first batch of lines. For this I used sed. The following command creates a new file with the contents of the original file, minus the first 254263 lines.

sed '1,254263d' original.txt > new.txt

I could then run my script with the input from new.txt and not have to reprocess the deleted lines. My next problem came when the network connection between the box running the script and the database dropped out. The script printed out the contents of the last line successfully processed, so what I wanted was a new file with all the contents of the old file after that line. The following awk command does just that: it sets a flag when it matches the last processed line, skips it, and then prints every subsequent line. This assumes the last line processed was f251f9ee0b39beb5b5c4675ed4802113.

awk '/^f251f9ee0b39beb5b5c4675ed4802113/{f=1;next}f' original.txt > new.txt

Now I could have made the script that did the work more complicated and ensured it dealt with these cases. But that would have involved much more code, and the original script was only a handful of lines of throwaway code. For one-off jobs like this a quick dive into the command line seemed more prudent.

Speaking at DIBI

I’ll be heading back up to Newcastle in April to give a talk at what’s shaping up to be a good-looking conference to kick off the year. DIBI is trying to please everyone, with both frontend and backend focused streams.

Created for both sides of the web coin, DIBI brings together designers and developers for an unusual two-track web conference. World renowned speakers leading in their fields of work will talk about all things web. Taking place in Newcastle upon Tyne (it’s oop north) at The Sage Gateshead on the 28th April 2010, we’re bringing both sides of the web world together with some awesome speakers.

I’m not a big fan of making a point of dividing frontend and backend work. You nearly always end up with JavaScript-dominated horribleness (because we only had a frontend person available) or a so-called content management system that means all sites have to look the same except for the colour palette. So I’m hoping lots of crossover happens and interesting conversations abound.

Oh, and if you’re wondering what I’ll be speaking about, it’s probably going to be something about all the cool tools you could and should be using when building or looking after web applications. I’ll probably be doing my best to convince people to look outside the comfort of the LAMP or C#/MSSQL stacks and realise that the future for lots of web developers might just be more devops.

Dreque

I’ve just found Dreque from Samuel Stauffer on GitHub. It’s yet another take on the whole messaging thing, which is definitely seeing a lot of activity at the back end of this year. It uses Redis on the backend and looks really rather nice:

Submitting jobs:

<code>from dreque import Dreque

# jobs are plain functions
def some_job(argument):
    pass

# connect to Redis and enqueue the job on the "queue" queue
dreque = Dreque("127.0.0.1")
dreque.enqueue("queue", some_job, argument="foo")</code>

Worker:

<code>from dreque import DrequeWorker

# process jobs from the "queue" queue via the same Redis instance
worker = DrequeWorker(["queue"], "127.0.0.1")
worker.work()</code>

DJUGL December

As mentioned at the last event, I’ve taken over organising the Django User Group London from Rob. Tickets are now available for the next event, which is going to be on the 3rd of December at The Guardian offices in Kings Cross.

You can sign up on eventwax.

Erlang Screencasts

I’ve been trying to learn Erlang for a while. What I actually mean is that it’s been on my list of things to learn for months, along with all sorts of other incredibly interesting bits and pieces. I spend a little bit of time at home, but the majority of my learning time is now spent commuting to London and back most days. Sometimes I’m even going all the way to Swindon, which gives me even longer to not learn Erlang.

The main problem with learning something new on the train is space. Reading a book (or my new Kindle) or just using my laptop is fine. Trying to do both at once is nearly impossible (I’ve tried). So I’ve decided to give another approach a try, namely screencasts.

I’ve only done the first Erlang in Practice episode so far but I was hugely impressed with the content and general presentation. $5 doesn’t seem bad at all either. The episode was half an hour long, but took me a little longer, probably closer to 45 minutes, as I was playing along at home and typing in the code examples as I went. I also got sidetracked messing with my vim configuration at the same time, but hey. That length makes them perfect for my hour-long commute. The full series is 8 episodes long and with luck I’ll be able to work through them this week.

So, good job Kevin Smith and Pragmatic for a nice, accessible start to Erlang. All I need to do now is find something interesting to hack on in Erlang.

Django Committers

I’ve been lurking on the django-developers mailing list for the last couple of weeks, and that provided an excuse to play with the new Twitter Lists feature. So here’s a list of djangocommitters on Twitter. If I’ve missed someone do let me know. There is a chance you won’t be able to see this if you’re not on the beta yet, sorry!

Problems Installing Hadoop 0.20 and Dumbo 0.21 on Ubuntu

The Hadoop wiki has a great introduction to installing this piece of software, which I wanted to do to have a play with Dumbo. The Dumbo docs also have a good getting started section, which includes a few patches that need to be applied.

Dumbo can be considered to be a convenient Python API for writing MapReduce programs

Unfortunately it’s not quite that simple, at least on Ubuntu Jaunty. Hadoop now uses Java 6, but if you just follow the instructions on the wikis you’ll hit a problem when you run “ant package”: a third party application (Apache Forrest) requires Java 1.5. Once you fix that, the build script will complain again that you need to install Forrest. Here’s what I did to get everything working:

<code>sudo apt-get install ant sun-java5-jdk</code>

<code>su - hadoop
wget http://mirrors.dedipower.com/ftp.apache.org/forrest/apache-forrest-0.8.tar.gz
tar xzf apache-forrest-0.8.tar.gz
cd /usr/local/hadoop
patch -p0 < /path/to/HADOOP-1722.patch
patch -p0 < /path/to/HADOOP-5450.patch
patch -p0 < /path/to/MAPREDUCE-764.patch
ant package -Djava5.home=/usr/lib/jvm/java-1.5.0-sun -Dforrest.home=/home/hadoop/apache-forrest-0.8/</code>

With all that out of the way you should be able to run the simple examples found on the rather excellent dumbotics blog. If you’re using the Cloudera distribution, or when Hadoop 0.21 gets a release, these problems will disappear, but in the meantime hopefully this saves someone else a bit of head scratching.
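If you want to check the install works end to end, the canonical wordcount job from the Dumbo getting started material is a nice test. The file name and input path here are just examples:

<code>def mapper(key, value):
    # value is a line of input text; emit a count of 1 for every word
    for word in value.split():
        yield word, 1

def reducer(key, values):
    # sum up the counts emitted for each word
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer)</code>

Save that as wordcount.py and run it locally with dumbo start wordcount.py -input some.txt -output counts, or add -hadoop /usr/local/hadoop to run it on the cluster you just built.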

Learnings from September

I keep meaning to get around to writing about why I think the future of web developers is operations, but in lieu of a proper post here’s a list of things I’ve been spending my work life getting to know this month:

  • Puppet - It’s brilliant. Define (with a Ruby DSL, of course) what software and services you want running on all your machines, install a daemon on each of them, and hey presto, central configuration management.
  • VMware vSphere - Puppet makes more sense the more boxes you have, and with vSphere I can have as many boxes as I want (nearly). Command line scripts and an actually very nice Windows GUI for setting up virtual machines make it all pretty pleasant, especially running on some meaty hardware.
  • Nagios - With lots of boxes comes lots of responsibility (or something). Nagios might look a bit ugly, and bug me with its needless frames-based admin, but I can see what people see in it. Which, frankly, is the ability to monitor everything everywhere for any change whatsoever.
  • Solr - I’m now also pretty well versed in using Solr. I’ve used it in the past, but always behind a Ruby or Python library. Now I know my way around all the XML-based configuration innards. Heck, I’m even running a nightly release from a couple of days ago in a production environment because I wanted a cool new feature. A special mention to the Solr community on the mailing list, Twitter and IRC for being great when I had questions.
  • Solaris - I nearly forgot: I spent more time than I care to remember working out how to use OpenSolaris (conclusion: OK, but not Debian) and eventually Solaris 10 (conclusion: I hope I don’t have to do that again). My installation notes read like some hideous hack, but everything works fine in production and it’s scarily repeatable, so I’ll live with it for now.

I do wonder if it’s just me that’s drawn to knowing how everything in the full web stack works, but personally I can’t just write code if I don’t understand how to deploy it or what it’s running on. Frontend types know this all too well: being a master of CSS, HTML and JavaScript simply isn’t enough, you need to understand the browser to get anything done. I’m not sure it’s the same for all backend-inclined folk; how many PHP programmers really understand Apache and a few other useful bits of web tech?

No database test runner added to test extensions

Thanks to Brad I’ve just released a new version of Django Test Extensions (also on GitHub) with support for running tests without the overhead of setting up and tearing down the database. Django still has a few places where it assumes you’ll have a database somewhere in your project - and the default test runner is one of them.
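Using it is just a case of pointing Django at the alternative runner in your settings file. The dotted path below is from memory, so double check it against the README:

<code># settings.py
# Swap the default runner (django.test.simple.run_tests) for the
# no-database version. The exact dotted path is an assumption; check
# the Django Test Extensions README for the real name.
TEST_RUNNER = 'test_extensions.testrunners.nodatabase.run_tests'</code>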

Automating web site deployment at Barcamp Brighton

On the first day at Barcamp Brighton this year I did a brief talk about getting started with automating deployment. I kept it nice and simple and didn’t focus on any specific technology or tool, just the general principles and pitfalls of doing anything manually. You can see the slides on Slideshare.

As part of the presentation I even did a live demo and promised I’d upload the code I used. The following is an incredibly simple fabric file that covers the basic set of tasks. Fabric is a Python tool similar to Capistrano in Ruby. I don’t really care whether you’re using make, ant, rake, capistrano or just plain shell scripts; getting from not automating things to automating deployments is the important part - and it’s easier than you think.
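A minimal sketch of that file, assuming a timestamped releases directory under /srv/sample (matching the virtualhost below) and a server called sample.local, looks something like this:

<code>from fabric.api import env, local, put, run
import time

env.hosts = ['sample.local']  # assumed host name for the demo

def deploy():
    # each deploy gets its own timestamped directory
    release = time.strftime('%Y%m%d%H%M%S')
    # package up the working directory and ship it to the server
    local('tar czf /tmp/sample.tar.gz .')
    put('/tmp/sample.tar.gz', '/tmp/sample.tar.gz')
    run('mkdir -p /srv/sample/releases/%s' % release)
    run('tar xzf /tmp/sample.tar.gz -C /srv/sample/releases/%s' % release)
    # flip the 'current' symlink the virtualhost points at
    run('ln -sfn /srv/sample/releases/%s /srv/sample/releases/current' % release)</code>

Running fab deploy then gives you a fresh release and a quick switchover, and rolling back is just a case of pointing the symlink at an older release.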

The other part of the code example was a very basic Apache virtualhost, so just in case anyone needed that as well, here it is:

<VirtualHost *:80>
    ServerName sample.local
    DocumentRoot /srv/sample/releases/current
    <Directory /srv/sample/releases/current>
        Order deny,allow
        Allow from all
    </Directory>
    ErrorLog /var/log/apache2/sample/error.log
    LogLevel warn
    CustomLog /var/log/apache2/sample/access.log combined
</VirtualHost>