Archiving Twitter data with Python

Alex from Twitter just got round to adding the ability to export your entire archive of tweets via the API. A few people on the mailing list had been asking for this for a while, so it’s good to see it released.

I couldn’t resist knocking together a very quick and simple Python script to go off and get all your tweets, presented here for anyone else to play around with. Note that “simple”, “fast” and “works on my machine” were the watchwords here. Don’t expect fancy parameters, much error handling or artificial intelligence.

<% syntax_colorize :python, type=:coderay do %>
import urllib2

username = ''
password = ''
format = 'json'            # json or xml
filename = 'archive.json'  # filename of the archive
tweets = 164               # number of tweets
pages = (int(float(tweets) / float(80))) + 1  # pages needed at 80 tweets a page

# Twitter's API uses HTTP Basic authentication
auth = urllib2.HTTPPasswordMgrWithDefaultRealm()
auth.add_password(None, 'http://twitter.com/account/', username, password)
authHandler = urllib2.HTTPBasicAuthHandler(auth)
opener = urllib2.build_opener(authHandler)
urllib2.install_opener(opener)

i = 1
response = ''
print 'Downloading tweets. Note that this may take some time'
while i <= pages:
    request = urllib2.Request('http://twitter.com/statuses/user_timeline/'
        + username + '.' + format + '?page=' + str(i))
    response = response + urllib2.urlopen(request).read()
    i = i + 1

handle = open(filename, 'w')
handle.write(response)
handle.close()

print 'Archived ' + str(tweets) + ' of ' + username + '\'s tweets to ' + filename
<% end %>
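
One quirk to watch when you read the file back: the script concatenates one JSON array per page, so the archive isn’t a single valid JSON document. Here’s a minimal sketch of unpacking it, assuming Python 2.6’s json module (older installs can substitute simplejson):

<% syntax_colorize :python, type=:coderay do %>
import json  # Python 2.6+; use simplejson on older versions

# The archive is several JSON arrays back to back (one per page),
# so decode them one at a time rather than with a single json.loads()
decoder = json.JSONDecoder()
raw = open('archive.json').read()
statuses = []
pos = 0
while pos < len(raw):
    page, pos = decoder.raw_decode(raw, pos)
    statuses.extend(page)

print 'Loaded ' + str(len(statuses)) + ' tweets'
print statuses[0]['text']  # each status has 'text', 'created_at' and so on
<% end %>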

The main thing to note here, though, is the ideal of data portability. Let’s just say I wanted to move to Jaiku or Pownce instead of Twitter, but didn’t want to lose all that data. If I can knock up an export script in half an hour, then all that’s missing is an import API call at the other end and a little more scripting, something like the sketch below.
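
Neither Jaiku nor Pownce publishes a bulk import endpoint that I know of, so treat this as a purely hypothetical sketch of that “little more scripting”, with a made-up IMPORT_URL standing in for whatever the target service actually offers:

<% syntax_colorize :python, type=:coderay do %>
import json
import urllib
import urllib2

# Entirely hypothetical endpoint: substitute whatever import or
# post-status API the target service actually provides
IMPORT_URL = 'http://example.com/api/statuses/import'

# Unpack the concatenated pages as before
decoder = json.JSONDecoder()
raw = open('archive.json').read()
pos = 0
while pos < len(raw):
    page, pos = decoder.raw_decode(raw, pos)
    for status in page:
        data = urllib.urlencode({'status': status['text'].encode('utf-8')})
        urllib2.urlopen(IMPORT_URL, data)  # passing data makes this a POST
<% end %>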

As it is, I’m more interested in seeing what I can do with my data now that I can get at it. Brian just suggested a quick visualisation tool (which already eats JSON) and I’d already been thinking about language analysis and maybe having a look at APML some more.
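
As a taste of the kind of language analysis I have in mind, here’s a minimal sketch (same unpacking trick as above) that counts crude word frequencies across the whole archive:

<% syntax_colorize :python, type=:coderay do %>
import json
from collections import defaultdict

# Unpack the concatenated pages
decoder = json.JSONDecoder()
raw = open('archive.json').read()
statuses = []
pos = 0
while pos < len(raw):
    page, pos = decoder.raw_decode(raw, pos)
    statuses.extend(page)

# Crude word frequencies: no stemming, no stopword filtering
counts = defaultdict(int)
for status in statuses:
    for word in status['text'].lower().split():
        counts[word] += 1

for word, count in sorted(counts.items(), key=lambda pair: pair[1], reverse=True)[:20]:
    print word, count
<% end %>

Open data is simply more fun.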