InfraKit Hello World

Docker shipped InfraKit a few days ago at LinuxCon and, while at the Docker Distributed Systems Summit, I wanted to see if I could get a hello world example up and running. The documentation is lacking at the moment, especially around how to tie the different components like instances and flavors together.

The following example isn't going to do anything particularly useful, but it's hopefully simple enough to help anyone else trying to get started. I'm assuming you've checked out and built the binaries as described in the README.

First create a directory. We're going to be using InfraKit to manage local files in that directory as part of the demo.

mkdir test

Now create an InfraKit configuration file. We're going to use the file instance plugin to manage files in our directory. This means everything works on the local machine, rather than trying to launch real infrastructure in AWS or similar. InfraKit also requires a flavor plugin. I'm using vanilla here just to meet that requirement; it's not going to actually do anything in this demo. It might be useful to write a noop flavor plugin or similar.

$ cat garethr.json
{
    "ID": "garethr",
    "Properties": {
        "Instance" : {
            "Plugin": "instance-file",
            "Properties": {
            }
        },
        "Flavor" : {
            "Plugin": "flavor-vanilla",
            "Properties": {
                "Size": 1
            }
        }
    }
}
InfraKit is based on running separate plugins. Each plugin runs as a separate process and provides a filesystem socket in /run/infrakit/plugins. First start up the file plugin:

$ ./infrakit/file --dir=./test
INFO[0000] Starting plugin
INFO[0000] Listening on: unix:///run/infrakit/plugins/instance-file.sock
INFO[0000] listener protocol= unix addr= /run/infrakit/plugins/instance-file.sock err= <nil>

Next, in a separate terminal run the vanilla plugin:

$ ./infrakit/vanilla
INFO[0000] Starting plugin
INFO[0000] Listening on: unix:///run/infrakit/plugins/flavor-vanilla.sock
INFO[0000] listener protocol= unix addr= /run/infrakit/plugins/flavor-vanilla.sock err= <nil>

And finally run the group plugin. I'm passing --log=5 to enable more verbose output so it's easier to see what's going on with the group.

$ ./infrakit/group --log=5
INFO[0000] Starting discovery
DEBU[0000] Opening: /run/infrakit/plugins
DEBU[0000] Discovered plugin at unix:///run/infrakit/plugins/instance-file.sock
INFO[0000] Starting plugin
INFO[0000] Starting
INFO[0000] Listening on: unix:///run/infrakit/plugins/group.sock
INFO[0000] listener protocol= unix addr= /run/infrakit/plugins/group.sock err= <nil>

With that all set up we can create a group based on our configuration file from above.

$ ./infrakit/cli group --name group watch garethr.json
watching garethr

Have a look in the test directory. You should see a single file has been created.

$ ls test

Let's delete that file and see what happens:

rm test/* 
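If you want to watch it come back, you can keep an eye on the directory from a separate terminal (assuming the common watch utility is installed):

$ watch -n 1 ls test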

Hopefully InfraKit will spot that the instance (a file in this case) no longer exists and recreate it. You should see something like the following in the logs:

INFO[0612] Created instance instance-1475833820 with tags map[infrakit.config_sha:B2MsacXz8V_ztsjAzu3tu3zivlw=]

This is obviously a contrived example, but hopefully it provides a good hello world for anyone trying to run InfraKit at its current early stage.

Everyone is Not a Software Company

The Everyone is a Software Company meme has been around for a number of years, but it feels increasingly hard to get away from recently. That prompted this post.

But what do we mean by Software Company?

To be a software company you're going to need to employ software engineers and other technical professionals. Applying that logic to a large number of companies at once, and looking at how existing software companies are set up, we find a few large problems.

Google as an example

In my talk at Velocity, entitled The Two Sides of Google Infrastructure for Everyone Else, I argued both for and against the idea of wholesale adoption of Google-like software and development/operations practices. Even though they derive the lion's share of their revenue from advertising, it's easy to argue that Google are a software company. But what does that look like? What makes Google a software company?

From the Google Annual Report 2015

61,814 full-time employees: 23,336 in research and development, 19,082 in sales and marketing, 10,944 in operations, and 8,452 in general and administrative functions

So, roughly half of Google (23,336 in research and development plus 10,944 in operations, out of 61,814 employees, or about 55%) is involved in building or running software. Glassdoor says salaries for engineers at Google average about $126,000-$162,000.

The US Bureau of Labor Statistics says that in 2014 the number of computer programming jobs in the US was 1,114,000, with median pay in 2015 of $100,690 a year. The total number of jobs in the US is about 143 million, with average wages of $44,569.20 according to the Social Security Administration. In other words, programmers make up less than 1% of US jobs, and earn more than double the average wage.

The Google Annual Report also states:

Competition for qualified personnel in our industry is intense, particularly for software engineers, computer scientists, and other technical staff

So, quick summary:

  • Software engineers are expensive relative to other employees
  • Demand for the best engineers means even higher wages
  • Proportionally there aren't many software developers
  • There isn't a large surplus of unemployed software engineers

Now the data above is mainly from US sources, although the Google data is from an international company with offices around the world. My experience suggests the situation is similar in Europe. Looking into data for India and China would be super interesting, I'd wager.


One obvious problem is short-term supply and demand. Everyone wants experienced software folks for their transformation effort. But the more organisations buy into the everyone is a software company story, the greater the demand for a finite supply of people. For most that means you'll be able to find fewer of the people you want because of competition, and afford fewer of them because all that competition pushes up salaries.

I've seen that firsthand while working for the UK Government. People occasionally complained that Government was hampering commercial organisations' growth by employing lots of developers and operations people in London.

You're also immediately in competition for software professionals with existing software companies. Given the high salaries, most of those employers already have developer-friendly working environments and established hiring practices suited to luring developers to work for them. Matching that is hard for large companies without an existing, empowered developer organisation. I saw a lot of that at the Government as well.

But the real macro problems are much more interesting. Even if you think 50% is a high mark for the ratio of software folk to others, you probably agree you need a lot more than you have today. And the developers needed for everyone to be a software company simply don't exist. Nor, I'd argue, is education producing enough skilled people in the near term to fill that gap tomorrow. So, what happens?

  • Does everyone sort-of become a software company but not quite?
  • Do most organisations struggle to hire and maintain a software team and see the endeavour fail?
  • Do increasing numbers of developers end up working for a small number of larger and larger software companies?
  • Does outsourcing bounce back, adapt and demonstrate innovation and transformation qualities to go along with the scale?
  • Are countries like India or China able to produce software engineers at sufficient scale to allow their companies to act on everyone becoming a software company?
  • Do we see clear winners and losers, i.e. companies which become software companies and accelerate away from those that don't?

Personally I think that to take advantage of the idea behind the meme we're going to need order-of-magnitude more efficient approaches to software delivery. What that looks like is the most interesting question of all.


The above is not a detailed analysis, and undoubtedly has a few holes. It also doesn't overly question the advantage of being a software company, or really question what we actually mean by everyone. But I think the central point holds: everyone is NOT a software company, nor will everyone be a software company any time soon, unless we come up with a fundamentally better approach to service delivery.

Operations is more than just Systems Administration

I think one of the patterns of the last few years has been the democratization of systems administration, especially for web applications. Whether that's Heroku or Docker, or Chef or Puppet, more and more traditional developers are doing work that would have been somebody else's problem only a few years ago. But running in parallel to that thread is another, less positive, trend: conflating operations with just systems administration. The story seems to go that now we know Ansible (or some other tool) we just need developers to run the show.

In this post I'm going to try to introduce some of the other operational disciplines, especially for developers who may have come to operations via the resurgence in infrastructure tooling of the past few years.

Note that this post has a slight bias towards more normal organisations. That is to say, if you're in a five-person software startup you probably don't have operational problems to worry too much about yet. I'm also not playing down the practice of systems administration; most experienced sysadmins I know are quite rounded operations pros too.

Service Management

If you've worked in operations, or in many large organisations, you'll have come across the term Service Management. This tends to be linked to various service management frameworks, like ITIL or MOF (Microsoft Operations Framework). The framework will describe, often in great detail, activities and processes for things like incident response, configuration management, change management, capacity planning and more.

While I was at The Government I wrote what I think is a reasonable introduction to Service Management, albeit from a specific point of view. This was based on my experience of trying, and likely sometimes failing, to encourage teams to think about how the products they were working on would be run. Each of the topics touched on in the overview is worthy of its own stack of books, but I'll repeat the ITIL service list here as, whatever you might think of the framework or a specific implementation, I've found it a useful starting point for conversations, in particular for stressing the breadth of topics under service management.

Service Strategy

  • IT service management
  • Service portfolio management
  • Financial management for IT services
  • Demand management
  • Business relationship management

Service Design

  • Design coordination
  • Service catalogue management
  • Service level management
  • Availability management
  • Capacity management
  • IT service continuity management
  • Information security management system
  • Supplier management

Service Transition

  • Transition planning and support
  • Change management
  • Service asset and configuration management
  • Release and deployment management
  • Service validation and testing
  • Change evaluation
  • Knowledge management

Service Operation

  • Event management
  • Incident management
  • Request fulfillment
  • Problem management
  • Identity management

Continual Service Improvement

For each of the above points, whether you are using ITIL or not, it's useful to have a conversation. Some of these areas do provide ample opportunity for automation and for using tooling to minimise the effort required. But much of this is about designing how you are going to operate a service throughout its lifetime.

Operations user stories

One of the other things I published while at The Government was a set of user stories for a web operations team. These grew out of work on launching GOV.UK and have had input from various past colleagues. In hindsight I'd probably do some things differently; the stories assume a certain context which isn't explicitly spelled out, for instance. But they have a couple of things going for them: they demonstrate how traditional operations activities can be planned out as part of a more developer-friendly planning approach, and they are public and have been tested by more than a single team.

Not everything is a programming problem

The main point, I think, is that not everything can be turned into a programming problem to solve. Automation has its place, and many manual processes and practices can benefit from automation. But the wide range of activities involved in running a non-trivial and often non-ideal system in production tends to mean making trade-offs and prioritization decisions frequently. This is where softer skills, like arguing for funding or additional head count, or building a business case for further work, come into play. Operations management is much more than systems administration.

Further reading

This is little more than a plea for people to think more about operations, separate from the more technical aspects of systems administration. If you're interested in learning more, however, I'd recommend some good reading material:

  • Visible Ops Handbook - still an excellent and pragmatic introduction to many of the topics noted above.
  • Designing Delivery - a bang up-to-date tome covering a range of service design topics.
  • Basic Service Management - a 50 page starter book covering the fundamentals of service management as generally discussed in more detail elsewhere. A great starting point.

Provisioning droplets with Puppet

I love DigitalOcean for quickly spinning up machines. I also like managing my infrastructure using Puppet. Enter the garethr-digitalocean module. This currently provides a single Puppet type: droplet.

Let's show a quick example of that by launching two droplets, called test-digitalocean and test-digitalocean-1.

droplet { ['test-digitalocean', 'test-digitalocean-1']:
  ensure => present,
  region => 'lon1',
  size   => '512mb',
  image  => 14169855,
}
With the above manifest saved as droplets.pp we can run it with:

$ puppet apply --test droplets.pp

This will ensure those two droplets exist in that region, and have that size. If they don't exist it will launch droplets using the specified image. This means we can run the same command again, and rather than create more instances it will simply report that we already have those droplets.
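Removal works the same way. Here's a minimal sketch, assuming the type follows the usual Puppet convention of accepting absent for ensure:

droplet { ['test-digitalocean', 'test-digitalocean-1']:
  ensure => absent,
}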

Querying resources

Puppet also comes with puppet resource, a handy way of querying the state of a given resource or type. Running the following will list all of your droplets, whether you created them using Puppet or not.

$ puppet resource droplet
droplet { 'test-digitalocean':
  ensure              => 'present',
  backups             => 'false',
  image               => '14169855',
  image_slug          => 'ubuntu-15-10-x64',
  ipv6                => 'true',
  price_monthly       => '10.0',
  private_address     => '',
  private_networking  => 'true',
  public_address      => '',
  public_address_ipv6 => '2A03:B0C0:0001:00D0:0000:0000:0090:B001',
  region              => 'lon1',
  size                => '1gb',
}

Mutating resources

The type also supports mutating droplets, for instance changing the size of a droplet if you change the model in Puppet. The API client doesn't support all possible changes, but you can disable backups, enable IPv6 and switch on private networking as needed. Here's a quick sample of the output showing this in action.

Info: Loading facts
Notice: Compiled catalog for gareths-macbook.local in environment production in 0.43 seconds
Info: Applying configuration version '1449225401'
Info: Checking if droplet test-digitalocean exists
Info: Powering off droplet test-digitalocean
Info: Resizing droplet test-digitalocean
Info: Powering up droplet test-digitalocean
Notice: /Stage[main]/Main/Droplet[test-digitalocean]/size: size changed '1gb' to '512mb'
Error: Disabling IPv6 for test-digitalocean is not supported
Error: /Stage[main]/Main/Droplet[test-digitalocean]/ipv6: change from true to false failed: Disabling IPv6 for test-digitalocean is not supported
Error: Disabling private networking for test-digitalocean is not supported
Error: /Stage[main]/Main/Droplet[test-digitalocean]/private_networking: change from true to false failed: Disabling private networking for test-digitalocean is not supported
Info: Checking if droplet test-digitalocean-1 exists
Info: Created new droplet called test-digitalocean-1
Notice: /Stage[main]/Main/Droplet[test-digitalocean-1]/ensure: created
Info: Class[Main]: Unscheduling all events on Class[Main]
Notice: Applied catalog in 60.61 seconds
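The manifest that drove the run above isn't shown, but from the log it would have looked something like this. This is a reconstruction rather than the exact original, with the region and image values carried over from the earlier example:

droplet { 'test-digitalocean':
  ensure             => present,
  region             => 'lon1',
  size               => '512mb',
  image              => 14169855,
  ipv6               => false,
  private_networking => false,
}

droplet { 'test-digitalocean-1':
  ensure => present,
  region => 'lon1',
  size   => '512mb',
  image  => 14169855,
}

Note how the resize succeeds while the two unsupported changes are reported as errors, without stopping the rest of the run.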

But why?

Describing your infrastructure at this level in code has several advantages:

  • Having a shared model of your infrastructure in code allows for a discussion around that model
  • You can be confident in the model because of the idempotent nature of running the code
  • The use of code for this model allows for activities like code review, change control based on pull requests, unit testing, user-created abstractions and more
  • The use of Puppet means you can use it as above as a command line interface, or run it periodically to enforce and report on the state of your infrastructure
  • Puppet ecosystem tools like PuppetDB, Puppet Board or Puppet Enterprise mean you can store data over time for later analysis

The module also acts as a reasonable example of a simple Puppet type and provider. If you're interested in extending Puppet for your own services this is hopefully a good place to start understanding the API.
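To give a feel for what that involves, here's a heavily simplified sketch of a type and provider pair. The real module is more involved, and the method bodies here are placeholders rather than working API calls:

# lib/puppet/type/droplet.rb
Puppet::Type.newtype(:droplet) do
  ensurable

  newparam(:name, :namevar => true) do
    desc 'The name of the droplet'
  end

  newproperty(:region) do
    desc 'The region to launch the droplet in'
  end

  newproperty(:size) do
    desc 'The size of the droplet'
  end
end

# lib/puppet/provider/droplet/api.rb
Puppet::Type.type(:droplet).provide(:api) do
  def exists?
    # ask the DigitalOcean API whether a droplet with this name exists
  end

  def create
    # call the API to launch a droplet with the requested properties
  end

  def destroy
    # call the API to delete the droplet
  end
end

The type declares the interface users see in manifests; the provider implements it against a specific backend. exists?, create and destroy are the minimum an ensurable provider needs.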

Some Security Implications of Unikernels

I was attending the first GOTO London conference last week, in particular the Rugged Track. One of the topics of conversation that came up was unikernels, and their potential for improving the state of software security. Unikernels are pretty new outside research groups; I'm just lucky enough to live and work in Cambridge, where some of that research is happening. The security advantages of unikernels are one of the things that attracted me in the first place. I thought it might be interesting to jot a few of those down for other people interested in security and the future of infrastructure.

As with my last post, it's worth having a basic understanding of unikernels. I'd recommend reading Unikernels - the rise of the virtual library operating system.


Hypervisor Isolation

Every unikernel gets its isolation guarantees from a hypervisor. Not only are these guarantees reasonably well understood, they tend to make use of hardware features too. It's interesting to note that recent container runtime work is heading in this direction as well, with projects like Clear Containers from Intel, Bonneville from VMware and the new stage1 in rkt.

No User Space

With a typical server OS we have kernel space and user space. Part of the idea here is to ensure the underlying machine doesn't crash, whatever horrible things people do in user space. But this means you can do horrible things. The unikernel model is similar to the Erlang philosophy of let it crash: you only have kernel space, and your entire application resides in it. Most things out of the ordinary are going to crash the kernel, which makes the sort of exploratory testing useful in exploit development much harder.

Really Immutable Infrastructure

People often talk about immutable infrastructure. I'd wager there is more talk than reality, however. When pushed, people are often not using read-only file systems, and retain the capability to log in to machines to make ad-hoc changes. What they mean by immutable is that they only change machines at deploy time. This ignores both the fact that they have the technical capability to change them at any time, and that an attacker could change them outside that deployment cycle. With unikernel systems there is often just the compiled kernel; you can't simply change files on disk. The defaults force an immutable way of working.

Clean Slate TLS

As a typical developer or operator you've probably learned more than you wanted to know about the OpenSSL source code. It's not well understood, not likely to be so anytime soon, and has had some pretty spectacular bugs like Heartbleed. The Core Infrastructure Initiative is laudable and will improve things, but it's still a problematic codebase. Functional programming is often regarded as an easier way of writing understandable code, and types are a good thing, especially when it comes to security systems. So a pure OCaml TLS implementation, as used by MirageOS, makes sense on lots of levels. Yes, this is quite an undertaking, but the Bitcoin Piñata tests show promise.

Formal Proofs

Knowing whether an application really does exactly what you want it to do (and no more) is a hard problem to solve. Unit tests and other forms of automated testing help, but are still reliant on people to both write and design the tests. A formal proof system can provide much stronger guarantees of correctness; it's an approach used in some cases for mission-critical components of Amazon's AWS. MirageOS is implemented in OCaml. One of the most popular OCaml programs is Coq, which just so happens to be a formal proof management system. I've not seen many examples yet of this approach, probably due to the effort involved, but the capability is there for building formally specified unikernels. I'd wager a similar thing is possible with Haskell and HalVM. Making that easier for typical developers could open up much more secure development practices for certain use cases.
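For a taste of what machine-checked proof looks like (nothing unikernel-specific, just the flavour of the tool), here's a classic toy lemma in Coq:

(* Prove that adding zero on the right leaves a natural number unchanged.
   Coq checks every step; Qed only succeeds if the proof is complete. *)
Lemma add_zero : forall n : nat, n + 0 = n.
Proof.
  induction n as [| n IH].
  - reflexivity.                     (* base case: 0 + 0 computes to 0 *)
  - simpl. rewrite IH. reflexivity.  (* inductive step uses the hypothesis *)
Qed.

Scaling that discipline from toy lemmas up to protocol implementations is exactly the effort referred to above.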