Vagrant At The Guardian

As recent blog posts on here make clear, I’m a big fan of Vagrant. And when Michael asked if I’d fancy talking to some of his colleagues at The Guardian about how I use it I really couldn’t say no.

I gave a short talk, running through the following slides, and running a few demos showing creating, destroying and provisioning new machines.

More interesting, I thought, were the questions and conversations that followed. We talked a little about how Vagrant might fit into a continuous integration setup. Another aspect some of the systems folk took to was how flexible the network configuration was, and whether they might be able to use this to more effectively test firewall configurations well before the final push into a production environment. It’s not something I’ve been doing but it sounds feasible and useful in some organisations. If anyone is doing interesting things with Vagrant’s network config I’d be interested to know.

Devops Isn't A Methodology

I was reading Devops is a poorly executed scam and just couldn’t resist a reply. Not because of the entertaining title, but because I both agree and disagree quite strongly with parts of the post. Read it first if you haven’t already. And yes I know I’m feeding the internet.

I’m going to pick parts of the post out and then comment. Hopefully I’m not quoting these in any way out of context.

“It’s got the potential to make a handful of people a lot of money in the same way that Agile did, but nobody is really executing on.”

People are pretty aware of this fact I think, but watch what happens when people post on the mailing lists or turn up at community events with a purely marketing hat on. They just get no traction and even damage their product brand amongst the early adopters. The fact the term is starting to get used in job adverts and marketing materials isn’t really being driven by the people talking about what devops might or might not be. I think the main reason for this is that most of the people I talk to in person or online are actually pretty happy with their jobs and generally work inside companies rather than as independent consultants. They have often reached an age where they want to improve within a given field but would like a wider network than their current colleagues to discuss things with.

“How do you implement Devops?”

I don’t think you do. The comparisons with Agile are interesting from a community point of view but Scrum is a methodology. To me at least devops isn’t; it’s just a banner or tag under which interesting conversations are happening. The argument that “You should be doing this anyway. Not earth shattering.” is a good thing. You’d be surprised by how many people don’t do all the things they should be doing, especially in small and young companies. And one of the reasons for that is no one bothered writing a list of these things down anywhere and then discussing them. I’m not saying this huge list exists or even whether it should, but the discussion is happening.

“The underlying problem, however, is that dev and ops have different goals”

This is spot on. I think this maybe does get missed in talks that focus more on tools, but not in the wider discussion happening about business improvements. Devops quite literally brings those two things together. You’ll always have individual goals, but where you have separate operations and development teams they should have the same fundamental goals.

“Developers develop in the same environment production runs in. If you deploy to Linux, you develop on Linux. No more of this coding on your Macbook Pro and deploying to Ubuntu: that is why you can’t have nice things.”

Yes, yes and yes again. I’m definitely from the developer side of the tracks and I’m constantly telling people this, and it’s definitely something I don’t see enough people doing. What I’d love is for all the operations people to state this to their development team and, most importantly, to help them set that up. Just saying “work like me or I won’t let you near the production machines” is just being obstructive. Educating and helping with tooling helps build those bridges and trust. And with trust comes the access the developers want. And fewer stupid bugs and fewer deployment issues caused by differing package dependencies.

“None of this amounts to a methodology, as the Devops people would have you believe.”

Still unsure which Devops people are saying it’s a bona fide methodology. I see the word used sometimes, but generally in passing and not, I think, meant the way you mean it here. And I don’t think I’ve heard people speak about it in person. “Scrum methodology” returns more than 113,000 results in Google. “Devops methodology” returns about 150, some of which 404 and half of which are aggregators pointing at the other half.

“The Devops movement smells of a scam in the making”

Some company or other is definitely going to be scammed into paying over the odds for a consultant because they used the word Devops in the sales pitch. That will have next to nothing to do with what I’d see as the Devops movement and everything to do with human nature (and sales people).

A Continuous Deployment Example Setup

One of the reasons behind getting around to building Vagrantbox.es recently was that I was giving a talk to a group of startups on The Difference Engine programme and I wanted to have an example project to demonstrate various things. I wanted to demonstrate everything from sensible version control habits, configuration management and basic orchestration to, most importantly, a solid deployment process. I’ve decided to write up what I’m doing for deployment because I think it’s pretty nice, and for all the talk about Continuous Deployment I haven’t seen many examples of code and configuration to make it happen.

Most of what I’ll cover is pretty easy to map to whatever technologies you’re using. For this project I’d gone for Git, Django, Gunicorn, Nginx, Fabric, MySQL and Jenkins, and I’m deploying to Ubuntu running on Brightbox Cloud. Apart from the Jenkins instance in the middle you could follow the instructions and swap things out easily.

Jenkins

First up, let’s install Jenkins. I set up a separate cloud instance just to run the Continuous Integration server. I find this approach easier to manage, but you could always run this locally if you prefer. The Jenkins folk provide very up to date packages for Debian so I chose to use those.

Plugins

Jenkins provides a huge number of optional plugins which enable various additional features. Plugins are installed via the web interface at /pluginManager. I’ve installed:

  • Git
  • Cobertura
  • Violations

Only the Git plugin is really required for what I’m doing with deployment. Cobertura and Violations are code quality metrics tools that I use to record output from pylint and code coverage for my test suite.

The Source

My finished project was already on GitHub in a private repository. I’m using a requirements.txt file to record Python dependencies so I can use pip to install them automatically, and I’m using Virtualenv to sandbox this installation. I’m also using South to manage my database schema changes. I won’t go into that here as it’s pretty Python specific; Rails, for instance, has Active Record migrations, RVM and Bundler, which do pretty much the same jobs. PHP has PEAR, and some of the frameworks offer a migration tool.

I then created two projects in Jenkins:

Jenkins dashboard

Project 1: Vagrantboxes

This is the main build of my master branch in Git. As well as setting up the Git repo as shown below I’ve set a polling schedule of */5 * * * * (that’s every 5 minutes) and also set Trigger builds remotely so I can have a task in my fabfile which triggers a build immediately.

Git config for Jenkins
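The fabfile task that triggers a build is nothing clever. A minimal sketch, assuming a hypothetical Jenkins hostname and the authentication token set against the job (both made up here), might look like this:

from fabric.api import local

JENKINS_URL = "http://ci.example.com"     # hypothetical Jenkins host
BUILD_TOKEN = "not-the-real-token"        # the token from "Trigger builds remotely"

def ci():
    """Ask Jenkins to build the Vagrantboxes project immediately."""
    local("curl -s %s/job/Vagrantboxes/build?token=%s" % (JENKINS_URL, BUILD_TOKEN))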

I then have two build steps, both of which execute shell commands. The first installs any new requirements via pip:

bash -l -c "source bin/activate; pip install -r requirements.txt"

The second runs my test suite and generates the XML output required to show the test results in Jenkins:

bash -l -c "source bin/activate; cd vagrantboxes/configs/common; python manage.py jenkins boxes"

I’m using the rather handy Django Jenkins application for this.
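The settings.py wiring for that is only a few lines. This is a rough sketch rather than my exact configuration, and the task names come from the django-jenkins README of the time, so check your version for the current list:

INSTALLED_APPS += ('django_jenkins',)

PROJECT_APPS = ('boxes',)                  # or just pass the app name on the command line as above

JENKINS_TASKS = (
    'django_jenkins.tasks.run_pylint',     # pylint output for the Violations plugin
    'django_jenkins.tasks.with_coverage',  # coverage XML for the Cobertura plugin
    'django_jenkins.tasks.django_tests',   # JUnit-style test results for Jenkins
)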

So far so good. This gives us a project that, when we push some changes to GitHub, will pull those changes down to the CI server and run our test suite, giving us feedback as to whether the tests pass or fail.

Now for the trick: in Post-build Actions tick Build other projects and specify the name of another project that we’ll set up next. Mine is called Vagrantboxes-deploy.

Post build action in Jenkins

Project 2: Vagrantboxes-deploy

This project is triggered only when the previous project runs successfully. And all it’s going to do is run the deployment script on the project we just built. The setup for this project is very simple: it has one build step which just executes the following:

bash -l -c "cd /var/lib/jenkins/jobs/Vagrantboxes/workspace; source bin/activate; fab appserver deploy"

The specifics of the Fabric script here aren’t that important, but I’m doing something not too dissimilar to what I described here.
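For the curious, the rough shape of the deploy tasks is below. The hostname, paths and restart command are illustrative rather than lifted from the real fabfile:

from fabric.api import env, run, cd

def appserver():
    """Point Fabric at the production application server."""
    env.hosts = ['app1.example.com']        # hypothetical host

def deploy():
    """Update the code, dependencies and database, then restart Gunicorn."""
    with cd('/srv/vagrantboxes'):           # hypothetical checkout on the server
        run('git pull origin master')
        run('bin/pip install -r requirements.txt')
        run('bin/python manage.py migrate')        # South schema migrations
        run('sudo /etc/init.d/gunicorn restart')   # or however Gunicorn is supervised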

The reason I’ve set up a separate project for these is so I can, if I choose, trigger a deployment separately from the full build, and also so I can very easily disable deployments while leaving the main build running.

Conclusions

With this setup, whenever I push code to master it triggers a build. If the test suite passes it runs the deployment script and pushes out the code to the live web servers. This suits me and this project, but you might find it easier to start by pushing all successful builds out to a staging environment, and maybe then moving on to having a new project which is only triggered manually for deploying to production.

project view in Jenkins

This setup has other advantages too. The Jenkins dashboard becomes a handy tool for recording deployment events. You can easily set up emails or IM messages or Campfire posts to alert other team members whenever a deployment happens. And it really, really makes sure your deployment scripts work without hand-holding.

This is a simple project that I’m working on on my own, but in a team environment you’d likely have a more complex branching strategy and more Jenkins projects. You might also introduce some gateways for manual testing, but the starting point is the same. Jenkins makes archiving successful build artifacts relatively easy as well; this setup has a few possible race conditions that you can fix by deploying artifacts built from successful builds. Jenkins also supports building from different branches and having different branches trigger different projects, all handy if you want to grow this kind of setup.

Site For Vagrant Base Boxes

A brief conversation with Matt Keating on Twitter finally pushed me over the edge and I’ve built a site I’d been meaning to do for a while.

I’m a huge Vagrant fan, but one thing that often comes up is where to find base boxes. My newly launched site Vagrantbox.es provides just that. At the moment that just means user-submitted boxes being checked and then posted. I’ll likely add comments and ratings and the like if things become popular, but that’s for later.

vagrantbox.es homepage

So, if you know of or host a useful box please let me know. I’ll try to keep up with any submissions.

Devops - More Than Marketing - Talk By James Turnbull

I’ve just found my notes from James Turnbull’s talk at FOSDEM. I found the talk excellent, and I’m already part of the choir. But much of the audience I’d guess have only come across the devops term in passing, or worse had it pushed at them as part of marketing materials. Hopefully I captured the main points:

So what is devops all about?

  • Cooperation (between development and operations teams)
  • Buzzword bingo?
  • Pop culture movement?
  • Discussion
  • It’s early days
  • No one has all the answers
  • Nothing is fixed in stone
  • It’s all about outreach

It’s about

  • Simplicity - Repeatable, Reusable, Easy to communicate
  • Relationships - Engage early, engage often, “Toss it over the fence”, Talk to people
  • Process - Test everything, Automate everything, Redundancy and expectation of failure, Transparent and open to everyone

Tools

  • Not just ops tools - Config management, Deployment and orchestration, Monitoring, Security, Testing
  • Use for entire lifecycle dev -> test -> ops
  • Not just dev tools - Version control, Agile, Application architecture
  • Testing methodology - Low level vs functional
  • Documentation - “The only time the network diagram is up to date is after the post mortem”

Continuous improvement

  • Nothing stands still - Customers, Products, Technology, Your team
  • Strike often, strike hard, be aggressive

It’s a culture change

  • This is Hard
  • People hate change
  • People hate people who introduce change
  • Fear of change is irrational - Listen, Concrete examples
  • Make developers responsible for uptime - Pagers

FUD

  • “We’ve always done this”
  • “That can’t work here”
  • “This is all about one group or another”
  • “You’re an elitist bunch of Europeans”

Dangers

  • Marketing speak
  • Lip service
  • Disenchantment
  • Disenfranchisement

Takeaway

  • “Not about a person, or a team. About changing how your operations team works”
  • Automate away small boring repetitive tasks to make time for interesting activities
  • Embed ops people into dev teams
  • Drag devs to ops standups
  • Build shared appreciation
  • Metrics conversations are really powerful

Configuration Management For Development Environments

I had the pleasure of speaking at FOSDEM last weekend to a packed Configuration and Systems Management devroom.

My presentation covered some of the same ground as recent blog posts, namely why you should be using virtualisation and config management tools to manage your local development environment.

People even said nice things about it:

@garethr basically has this subject completely covered. He’s even advocating the correct editor. excellent #fosdem talk

All in all another good event. I have notes about some of the other talks I went along to that I’ll try to write up soon.

Using Checkinstall With Virtualenv For Python Deployments

Michael Brunton-Spall wrote last week about some frustrations with packaging and deploying Python web applications. Although his experience was with Python, the problems he describes are the same for Ruby and PHP and a whole host of languages. The following example uses Python, but works equally well for anything else.

Michael has three simple rules for his servers:

  1. they cannot access the internet
  2. they cannot access internal services that are for development
  3. they cannot have compilers / utilities on them

I won’t go into all the reasons for doing this (you can read the blog post linked to above) but these are pretty sensible security precautions.

My approach to this problem would be to use your friendly system packages, using a handy tool called Checkinstall to create a deb or rpm. I’m going to use the Eventlet library as an example. This is available in PyPI and one of its dependencies (Greenlet) provides a C extension. The same approach would work for an entire Python web application too. I’m as ever using the apt package management tool, but this should work with yum as well.

The first step is to build the package on a build machine. This should be a machine or virtual machine running the same operating system as your production web servers. You might build these packages manually or as part of a continuous integration system. On this machine we’ll need the compilers and development tools:

sudo apt-get install build-essential python-dev python-setuptools checkinstall
sudo easy_install virtualenv

We’ll also create a virtualenv into which we’ll be installing our packages:

sudo virtualenv --no-site-packages /usr/local/environment
source /usr/local/environment/bin/activate

Now, instead of just calling easy_install to install the package, we prefix it with checkinstall.

sudo checkinstall /usr/local/environment/bin/easy_install eventlet

This will prompt for various metadata about the package you want to create, including the name and version of the package. If you’re using this method in the real world you’ll want to decide on a versioning and naming scheme for your packages to avoid clashes with system-provided packages. You can also set many of these options from the command line rather than having to fill them in manually each time.
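As a sketch of the non-interactive form (check checkinstall --help on your version for the exact flags), the earlier command might become something like:

sudo checkinstall --pkgname=eventlet-gareth --pkgversion=20110129 --default \
    /usr/local/environment/bin/easy_install eventlet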

Once everything has been filled in successfully this should run through, installing eventlet and greenlets and eventually creating a deb or rpm package depending on what platform you’re running on. You should see something like:

Done. The new package has been installed and saved to

 /home/vagrant/eventlet-gareth_20110129-1_i386.deb

 You can remove it from your system anytime using: 

      dpkg -r eventlet-gareth

Now let’s grab that package and take it to one of our front end web servers via a controlled deployment process. That front end web server needs the virtualenv created, but nothing else. So:

sudo apt-get install python-virtualenv
sudo virtualenv --no-site-packages /usr/local/environment

(Now you might be thinking that installing the python-virtualenv package in this way breaks rule 1 above. And you’d be right in most cases, but I’m guessing Michael’s systems team have a local package repo for authorised packages, or alternatively you could download the package to the build machine and push it to the production environment.)

Now install the package we created earlier.

sudo dpkg -i eventlet-gareth_20110129-1_i386.deb

That should throw all the required files into the virtualenv environment we created. No compilers. No calls to internal or external systems. Just move some precompiled binaries and text files to predefined places on disk.

I used a PyPi package as an example. Checkinstall could have been pointed at a custom build file written especially for your own application, one that moves files and folders to where they are needed. Say something that looks like this:

#!/bin/sh
cp /home/stage/myapplication /var/www/apps/

Then, running checkinstall against that (or a more complex build script using Capistrano or Ant or Fabric), you can create a package containing your application code and install it into the specified place.
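So, assuming the script above is saved as install.sh and made executable, something along these lines would do it (the package name and version are up to you):

sudo checkinstall --pkgname=myapplication --pkgversion=1.0 ./install.sh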

Why Developers Should Care About System Packages

First a bit of background. I’m a software developer (lately in Ruby and a tiny bit of Java, previously in Python, C# and PHP; yes I got around a bit), but have spent enough time looking after production hardware (mainly debian, solaris and recently a bit of RHEL) to have a feel for sysadmin work. I even have friends who are systems administrators. I mainly use a shiny apple laptop for my development work, but I actually execute all the code on Linux virtual machines. The aim of this post is to bridge a divide, not start a flame war about specific tools.

I’m writing this partly to address a tweet I made that in hindsight needed more than 140 characters. Actually a number of my recent tweets have been on the same theme so I should be more helpful. What I’m seeing recently is an increase in the ways I’m being asked to install software and for me at least that’s annoying.

  1. Several projects will ask you to do something like curl http://bit.ly/installsh | sh which downloads a shell script and executes it.
  2. Some will insist I have git installed
  3. A new framework might come with its own package manager

I’m a polyglot programmer (so I shouldn’t care about #3) that uses git for everything (scratch #2) and who writes little bash scripts to make my life easier (exactly like #1). So I understand exactly how and why these solutions appear fine. And for certain circumstances they are, in particular for local development on a machine owned and maintained by one person. But on a production machine and even on my clean and tidy virtual machines none of these cut it for me in most cases.

Most developers I know have only a passing awareness of packaging, so I’m going to have an aside to introduce some cool tricks. I think this is one place where sysadmins go wrong: they assume developers understand their job and know the various tools intimately.

System Package Tips

I’m going to show examples using the debian tools so these apply to debian and ubuntu distros. RPM and the Yum tool have similar commands too, I just happen to know debs better.
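For anyone on an RPM-based distro, the rough equivalents of the commands below are worth knowing too:

rpm -qa                  # list all installed packages
rpm -ql lynx             # list the files installed by a package
rpm -qf /bin/netstat     # find which package owns a file
yum check-update         # list packages with pending upgrades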

List all installed packages

This one is a bit obvious; it’s probably going to be available in anyone’s home-grown package management system. But if you’re installing software by hand using git or a shell script then you can’t even ask the machine what is installed.

dpkg -l

List files from package

I love this one. Have you ever installed a package and wondered where the config files are? You can sort of guess based on your understanding of the OS file system layout, but this command is handy.

dpkg -L lynx
/.
/usr
/usr/share
/usr/share/doc
/usr/share/doc/lynx
/usr/share/doc/lynx/copyright
/usr/share/doc/lynx/changelog.gz
/usr/share/doc/lynx/changelog.Debian.gz

Where did that file come from?

Have a file on disk that you’re not sure where it came from? Ask the system package manager. The more everything is installed from packages the more useful this becomes.

dpkg -S /bin/netstat

Unmet dependencies

At the heart of a good package system is the ability to map dependencies and to have unmet dependencies installed as needed. Having tools to query that tree is useful in various places.

apt-cache unmet

Will give you output a little like the following:

Package libdataobjects-sqlite3-ruby1.9.1 version 0.10.1.1-1 has an unmet dep:
 Depends: libdataobjects-ruby1.9

What needs upgrading?

The apticron tool can alert you to packages that are now out of date. It’s easy to set it up to email you each day for each host and tell you about packages that need upgrading. Remember that the reason one of these has an update available might be a documented security bug, which makes knowing about it quickly even more important.
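Getting it going is little more than installing the package and setting the recipient address (the config file path is from memory, so double check it on your distro):

sudo apt-get install apticron
# then edit /etc/apticron/apticron.conf and set, for example:
# EMAIL="sysadmin@example.com"

The daily report looks something like this: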

apticron report [Fri, 19 Jan 2007 18:42:01 -0800]
========================================================================

apticron has detected that some packages need upgrading on: 

    faustus.example.com
    [ 1.2.3.4 ]

The following packages are currently pending an upgrade:

    xfree86-common 4.3.0.dfsg.1-14sarge3
    libice6 4.3.0.dfsg.1-14sarge3
    libsm6 4.3.0.dfsg.1-14sarge3
    xlibs-data 4.3.0.dfsg.1-14sarge3
    libx11-6 4.3.0.dfsg.1-14sarge3
    libxext6 4.3.0.dfsg.1-14sarge3
    libxpm4 4.3.0.dfsg.1-14sarge3

I’m really not an expert on using debs but even I find these tools useful, and you don’t get the same capabilities when you use anything else.

Good and bad examples

Still here? Good. I’m going to pick on a few pieces of software to give examples of what I mean. I actively use all of this software and think it’s brilliant, earth-shattering stuff. I’m not dissing the software, so any fanboys reading can kindly not attack me, please; I’m one of you.

RabbitMQ (Erlang)

The nice folk building the RabbitMQ message queue provide downloads of the source code as well as various system packages. Knowing that some people will want to use the latest and greatest version of the application, they also host the latest Debian packages in their own package repo, with details on their site.

Chef (Ruby)

The Chef configuration management system also provides multiple methods to install their software. For people already using, happy and familiar with it they provide everything as a ruby gem. If you prefer system packages they have those too. They also provide their own deb repo for people to grab the latest software.

Cloudera Hadoop (Java)

Before I found the Cloudera Hadoop packages I remember having great fun manually applying patches to get everything working. Cloudera do exactly the same as the above two developers, namely host their own debs.

RVM

RVM is a fantastic way of managing multiple ruby versions and multiple isolated sets of gems. But it’s also probably the first place I saw the install from remote shell script approach.

bash < <( curl http://rvm.beginrescueend.com/releases/rvm-install-head )

I like to do the same things on my development machine as I do in production, and the main problem I have with RVM is that it’s so useful I want it everywhere. I’d prefer it if the system-wide install had some sort of option to install the rubies from packages rather than compiling everything on the machine (which means you need a full set of compile tools installed everywhere), or if we could automate the creation of those packages using RVM.

Solr

You’ll probably find packages for the Solr search server in recent distros. It’s hugely popular, predominantly because it’s a fantastic piece of software. But every time I have a look at the system packages I can’t quite get them to work, or they are out of date. I now know my way around Solr setup relatively well and just end up creating my own packages, and I’ve spoken to other folk who have done the same. The Solr documentation recommends downloading a zip file to get started and I can’t see any mention of the packages. My guess is the packages aren’t maintained as part of the core development, which is a quick way to get them out of sync with current progress.

Enough beating up on my fellow developers

System packages aren’t blameless. I think the culture often seen in Debian of splitting the developer from the package maintainer is part of the problem. This manifests in various ways, all negative:

  • Out of date packages. The biggest complaint from developers about system packages is nearly always that they are out of date. Maintainers should more readily release packaging scripts (ideally back to the project) so people can easily roll their own.
  • The documentation around packaging is either fantastic or terrible, depending on what you want to do and who you are. It turns out making your own packages (using something like checkinstall) is actually quite easy.
  • The official debian docs I think focus on the role of package maintainer, rather than trying to push that downstream to the developers. That doesn’t make them bad, it just means we need documentation aimed at a developer just getting started with packaging their software.
  • Developers hosting their own package repository and asking people to point at that is also quite easy. The projects I praised above all do it nicely. But simple attractive documentation is hard to come by.

What to do

First up, let’s talk more about the distribution and installation of software. And let’s do that in the spirit of making things better for everyone involved. The ongoing spat between Ruby and Debian people is just counterproductive. This would be a good article if it didn’t lead with:

This system (apt-get) is out-dated and leads to major headaches. Avoid it for Ruby-related packages. We do Ruby, we know what’s best. Trust us.

We need better documentation aimed at developers. I’m going to try and write some brief tutorials soon (otherwise I’d feel like this rant was just me complaining), but I’m not an expert. I’ll happily help promote or collate good material as well. Maybe it already exists and I just can’t find it?

I’m a git user and a big GitHub fan, but one of the features of Launchpad I really like is the Personal Package Archive. This lets you upload source code and have it automatically built into a package. This is specific to Ubuntu, but that’s understandable given Launchpad is also operated by Canonical. What I’d like is the same feature in GitHub, but one that allowed building debs and RPMs for different architectures. Alternatively a webhook-based third party that could do the same would be awesome (anyone fancy building one? I might pitch in). The only real advantage of it being GitHub would be that it would make packages immediately cool, which hopefully you all now realise they are.

My Default Recipes For Vagrant Virtual Machines

I’ve written about Vagrant previously, and the more I use it the more it impresses me and the more it changes how I work. For those who haven’t yet used Vagrant, the brief summary is: it’s a way of managing, creating and destroying headless VirtualBox virtual machines. So when I’m sat at my computer and I want a new 32-bit virtual machine based on Maverick I just type:

vagrant init maverick32
vagrant up

It has some other magic tricks as well, like automatically setting up NFS shares between the host and guest and allowing you to specify ports to forward in the configuration file. You access the machine via ssh, either using the handy vagrant ssh command or by using vagrant ssh-config to dump the relevant configuration to place in ~/.ssh/config.
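The latter is as simple as appending the output to your SSH config, for example:

vagrant ssh-config >> ~/.ssh/config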

I’ve been using virtualisation for a few years, initially purely for testing and experimentation, and then eventually for all my development. I’d have a few VMware images, I’d use snapshots and occasionally roll back, but I very rarely created new virtual machines. It was quite a manual process. With Vagrant that’s changing. Every time I start investigating a new tool or new technology or work on a pet project I create a new virtual machine. That way I know exactly what I’m dealing with, and with Vagrant the cost of doing that is the 30 seconds spent waiting for the new machine to boot.

Or rather it would be if I didn’t then have to install and configure the same few things on every machine. Pretty much whatever I might be doing I found myself installing the same things, namely zsh, vim, git and utils like ack, wget, curl and lynx. This is exactly what the provisioning support in vagrant is for, so I set out to use chef to do this for me.

I decided to use a remote tar file for the recipes. I’m not really bothered about managing a chef server just for my personal virtual machines, but I did want to have a canonical source of the cookbooks that wasn’t local to just one of my machines. Plus this means anyone else who shares my opinions about what you want on a new virtual machine can use them too.

My Vagrantfile now looks like this:

Vagrant::Config.run do |config|
  config.vm.box = "maverick32"
  config.vm.provisioner = :chef_solo
  config.chef.recipe_url = "http://cloud.github.com/downloads/garethr/chef-repo/cookbooks.tar.gz"
  config.chef.add_recipe "garethr"
  config.chef.cookbooks_path = [:vm, "cookbooks"]
  config.chef.json.merge!({ :garethr => {
      :ohmyzsh => "https://github.com/garethr/oh-my-zsh.git",
      :dotvim => "https://github.com/garethr/dotvim.git"
    }})
end

You can see the cookbook on GitHub at github.com/garethr/chef-repo. By default it uses the official oh-my-zsh repo and the vim configuration from jtimberman. My own versions are very minor personal preference modifications of those. The Vagrantfile example above shows how you can override the defaults and use your own configs instead if you choose.

One question I was asked about this approach was why I didn’t just create a basebox with all these things installed by default; this would reduce the time taken on first boot as software wouldn’t have to be installed each time. However, it would also mean maintaining the baseboxes myself, and as I use different Linux distributions and versions this would be a headache. While doing this and working with Vagrant I’ve been thinking about the ecosystem around the tool, and I’m planning on writing up my thoughts on that subject over the next week or so.

Solr Libraries and Good API Design

I’m a huge Solr fan. Once you understand what it does (it’s a search engine, which means more than you think) and how it works, you spot lots of thorny problems that map to its features really well. In my experience it’s also very fast and very stable once installed and set up. Oh, and the community support is great as well.

When I talk to some folks about Solr all they can think about is full text search. The main reason for this, I think, is a number of poor libraries. I’ve come across lots of Python or Ruby libraries that simply say you don’t have to know anything about Solr, just install this code and you get full text search! This works about as well as using the default MySQL or Apache configs: nowhere near as well as if you get your hands dirty even a little. Some of the Ruby gems even ship the Solr jar file in the gem. Now you don’t even need to know Solr exists. You take a generic configuration and run it using a rake task, behind which is some unknown Java application server. Good luck debugging that when it goes wrong; that’s one hell of a leaky abstraction.

In better news I’ve now found two excellent Solr libraries, ones that start with the assumption that you know what you’re doing, or are happy to learn about the tools you’re using. All you really want from a library is a good API that maps to how you write in that language.

Delsolr (Ruby)

The delsolr API is beautiful. It seamlessly merges the worlds of Ruby and Solr in a way that’s easy to write and easy to guess. It’s also clever: the design accepts that new features might be added to Solr before the library is updated, or that the library might not support every use case or option. In these cases you can still pass information through to Solr directly.

Solr’s interface is based around URLs, so any library is really just giving you an interface for creating those URLs. Writing the following in Ruby:

rsp = solr.query('standard',
                 :query   => '*:*',
                 :filters => {:status => 'Active'},
                 :facets  => [{:field => 'project'}])

Results in the following URL:

/select?q=*:*&wt=ruby&facet=true&facet.field=project&fq=status:Active

If you already know Solr and how to construct URLs for searches by hand you’ll immediately get the Ruby code. You can probably even guess how to pass other params like sort or order.

Another nice touch is that you can use either hashes or Lucene search syntax for each attribute. So:

:filters => {:status => 'Active'}

Is the same as:

:filters => 'status:Active'

Sunburnt (Python)

Sunburnt is a Python Solr interface from the nice folks at Timetric. I’ve not had a chance to use this library in anger, as it was released after I’d done quite a bit of python-solr work in an old job, but I’d definitely use it now. The API looks like:

rsp = solr.query('*:*').filter(status='Active').facet_by('project').execute()

It’s based around chaining so again you can probably guess how to make further queries from even this simple example.

Both Sunburnt and Delsolr also support adding documents to the index.

Uses

Once you understand facets and the usefulness of filter queries you see lots of places where Solr is useful apart from text search. Lots of ecommerce operations use faceted search interfaces; I’m sure everyone has spent time clicking through nested hierarchies and watching the numbers (showing the number of products) next to the links decrease. You can build these interfaces using SQL, but it’s incredibly expensive and gets out of hand quickly. Caching only helps a bit due to the number of permutations in all but the smallest stores or simplest products. It’s a similar problem with tagging: it’s pretty easy to kill your database.

But it’s not just things with the word search in them that you can map Solr to. Two good examples are Timetric (from whom the Sunburnt library comes) and the Guardian Content API. Both of these present lots of read data straight from Solr, with great success and fewer database-killing performance issues. Solr can really be seen as a simple place to denormalise your data, one advantage being that it keeps your database schema clean.

Learning More

Solr could do with better documentation for beginners. The wiki is an excellent reference once you know how to write schema and configuration files, but I think the getting started section sacrifices introducing configuration in favour of getting people searching quicker. The example schema and solrconfig files that ship with Solr are also amazingly useful references (officially the best commented XML I’ve ever seen) but also intimidating to beginners. The Drupal community appear to be writing some good docs that fill this gap though; here are a few links that I’d recommend: