Working for a software vendor

One of the reasons I moved to Puppet two and a bit years ago was that I was interested in the software industry. In particular I was interested in being on the vendor side for a while. My background is mainly as a service provider, software as a service, in-house developer/ops type person. This has definitely been an interesting experience, but I’ve not tried too much to explain why, until now.

First, what do we mean by vendor?

a person or company offering something for sale, especially a trader in the street.

So in the context of a software vendor we specifically mean:

a person or company offering software for sale

Note that we’re selling the software, not access to some service provided by software (i.e. SaaS). SaaS and other as-a-service models are a growing part of the industry, but the business model, development cycle, company structure and other aspects are quite different in my experience, though lots of hybrid models exist too.

Economics and scale

One of the interesting aspects of the software vendor world is the economics, the revenues, and the fact lots of companies are public. This in turn means a large amount of VC money goes into trying to create another large software vendor, because the potential payout is huge.

Take a sample set of companies from the last 10 years or so that are still private: Docker, Puppet, Chef, MongoDB, Elastic, CoreOS, Mesosphere, Weave, Cloudera, etc. Somewhat biased towards my own interests I’ll admit.

Now take a sample of large, public, software vendors: Oracle, Microsoft, CA, SAP, Sage, BMC, VMware. Not counting companies like Intel, Cisco, IBM, Dell (no longer public), and HP with huge software portfolios.

Let’s pick on Sage, a UK software company selling accounting software. As of 2014 Sage had 1169 people in software development R&D roles and they made $1.6billion from software and related services in 2015. That’s probably about the (order of magnitude) number of people employed in R&D roles in the above private software companies. The revenue is (and I’m guessing here) a bit higher at Sage than those companies combined too. SAP is an order of magnitude larger, both in terms of people (18908 in 2014) and revenues ($18billion also in 2014). Oracle revenues were $38billion as another data point.

So all the cool (or not so cool) companies from the past 10 years or so are a rounding error to the size of the industry. But you wouldn’t know that from reading Hacker News or other parts of the internet. This disconnect is a constant source of interest to me as I spend time with Puppet customers and with the wider infrastructure community at conferences and the like.

A world of difference

My gut feeling is that most people working as software developers, designers, product managers, etc. don’t work for software vendors. Apart from maybe in localised areas like Silicon Valley. But because of the mentioned money and scale (and PR spend) of the big players a great deal of press interest centers around vendors. Docker is probably the best current example of this but it’s more general than one company. This makes what happens in software-vendor-startup-land more visible to everyone else than, say, IT reality in large financial companies.

At the heart of a good software company is a product being built and maintained by a team of engineers, designers, managers, etc. In many ways this is similar to lots of people’s experience of building software (whether at work or at home as part of one open source project or another). But the support surrounding this tends to vary greatly from other areas. A dedicated marketing and product marketing team, dedicated sales staff, a professional services function, training, documentation, public relations personnel are all required to turn the software into revenue. And importantly these teams have to work closely together, and be actively involved with the development of the product.

This is very different from an in-house development position, but it’s also quite different from most SaaS operations. SaaS tends (generalising here) to be based around large numbers of individual users with monthly recurring revenues of 10s or 100s of US dollars. Software vendors selling to large enterprises tend to be looking at single large deals of 10s of thousands to many millions of dollars. This tends to mean large differences in total number of customers, revenue per customer, time needed to close a deal, requirement for staff local to a customer, etc. All of that makes for a very different operation and feedback cycle.

Some interesting observations

Software has a much longer shelf-life in the real world than people typically think on the internet. Take the datacenter automation market. This IDC report for example pegs the market at $2.3billion in 2015. VMware takes the lion’s share with roughly 30%, followed by BMC with 10%. For reference Puppet has 3.2% and Chef 1.2%. Obviously this is just one report, and it’s now a year old, but it’s an interesting data point. And compare that to what you might expect if you just follow the software rather than the market. Even in 2015 some people would have been saying “surely everything is Docker and Kubernetes now?”. The reality is closer to it being all shell scripts and BladeLogic for the majority of IT shops.
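To put rough dollar figures on those shares, here is a small Python sketch. It only multiplies the percentages quoted above by the $2.3billion market size, so the results are back-of-the-envelope estimates derived from the numbers in this post, not values taken from the IDC report itself. VMware’s share is treated as exactly 30% for the sake of the sum.

```python
# Rough revenue implied by the market shares quoted above.
MARKET_SIZE_USD = 2.3e9  # IDC estimate for 2015, as quoted in the post

# Percentages as quoted; VMware's "roughly 30%" is assumed to be exactly 30.
shares_percent = {"VMware": 30.0, "BMC": 10.0, "Puppet": 3.2, "Chef": 1.2}


def implied_revenue(share_percent, market_size=MARKET_SIZE_USD):
    """Return the revenue implied by a percentage share of the market."""
    return market_size * share_percent / 100


implied = {name: implied_revenue(pct) for name, pct in shares_percent.items()}

for name, revenue in implied.items():
    print(f"{name}: ~${revenue / 1e6:.0f} million")

# Share held by the four vendors named here; the remainder is everyone else.
named_total = sum(shares_percent.values())
print(f"Named vendors: {named_total:.1f}% of the market")
```

Even the two best-known newer tools together account for under 5% of the market on these numbers, which is the point of the comparison.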

For the most part, innovators (and some early adopters) don’t buy software, instead they build or co-opt it. Take Netflix, Uber, Amazon, Google, Facebook or similar. All are well-known for building much of their core software and infrastructure and using open source solutions for much of the rest. And it’s not just software, all of the above also have large internal investments in bespoke hardware as well. So who buys software from software vendors? Taking the Rogers’ Innovation Adoption Curve it’s the early majority, the late majority and laggards. That’s ~85% of the market. Most of the noise on the internet about software is from innovators and early adopters, or people who want to be in those groups. But most of the software sold is to people with very different wants and needs. This chasm explains much of the frustration experienced with software, and the difficulty of building software for often very different types of users at the same time.
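The ~85% figure comes from adding up the later segments of the curve. A quick sketch, using the segment sizes commonly quoted for Rogers’ model (2.5%, 13.5%, 34%, 34% and 16%; these come from the standard model rather than from anything measured for this post):

```python
# Commonly quoted segment sizes for Rogers' adoption curve, in percent.
segments = {
    "innovators": 2.5,
    "early adopters": 13.5,
    "early majority": 34.0,
    "late majority": 34.0,
    "laggards": 16.0,
}

# The groups that tend to build or co-opt software themselves.
builders = segments["innovators"] + segments["early adopters"]

# The groups that tend to buy from vendors.
buyers = sum(segments[name] for name in ("early majority", "late majority", "laggards"))

print(f"Build or co-opt: {builders}%")
print(f"Buy from vendors: {buyers}%")
```

So the buying groups come to 84% of the curve, which is where the rounded ~85% comes from.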

Much of the writing about continuous delivery and continuous deployment assumes you’re releasing a web site or at least a central, single, service. At the very least this is most people’s experience and context. But shipping software that people install and run themselves tends to make software deployment a pull rather than a push. A vendor can release a new version, but how to make the customer upgrade? Technically this could be reasonably straightforward (Chrome auto-updates for example) but for expensive, often critical, systems in sometimes regulated or otherwise controlled or low trust environments, this turns out to be trickier and more about people than just technology. This is an entire topic on its own so I’ll leave it there for now.

Continuous integration for packaged software (true for some, but not most, projects outside software vendors) tends to hit a permutation explosion quite quickly. Take server software because that’s what I’m most familiar with. You’ll definitely support the latest version of RHEL, plus probably a few older versions, and maybe CentOS and some of the other variants (Oracle Linux, Scientific Linux) as well. Ubuntu LTS releases probably make the list, as might Debian stable. You’ll also likely want to test on at least Windows Server 2016 and 2012. You may need to keep going and support BSD, AIX, HP-UX, SUSE, etc. Puppet has an unreasonably long list of supported and tested platforms for instance. Throw in other variations or configurations or architectures and you have a serious CI environment. Compare this to a more typical case of a deployment pipeline to a single known operating system and version on a server you control.
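To show how quickly the permutations multiply, here is a small Python sketch. The platform versions, architectures and product versions are hypothetical values picked for illustration; this is not Puppet’s actual support matrix, and a real one would prune combinations that don’t exist.

```python
from itertools import product

# Hypothetical support matrix, for illustration only.
platforms = {
    "RHEL": ["6", "7"],
    "CentOS": ["6", "7"],
    "Ubuntu LTS": ["14.04", "16.04"],
    "Debian": ["8"],
    "Windows Server": ["2012", "2016"],
}
architectures = ["x86_64", "i386"]
product_versions = ["current", "previous"]

# Flatten to (operating system, version) pairs.
os_targets = [
    (name, version)
    for name, versions in platforms.items()
    for version in versions
]

# Every OS target is tested on every architecture for every supported release.
matrix = list(product(os_targets, architectures, product_versions))

print(f"{len(os_targets)} OS targets -> {len(matrix)} CI jobs")
```

Nine operating system targets already become 36 jobs with just two extra dimensions, and each new platform or configuration option multiplies rather than adds.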

Open source

One of the notable things about the lists above of older (public) and newer (currently private) software companies is that all of the newer ones are based around an open source software product or products. We’ve had companies based around open source for a long time, but very few make it to the public markets (where we get data to see if they actually work as companies). A recent exception is Hortonworks (HDP) which opened at $26.38 in December 2014 but is down to $8.31 as of this writing, with revenues around $40million a quarter. Red Hat (RHT) did $2billion in 2016 (which remember is 5% of Oracle’s revenues, but still a large amount).
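For anyone who wants to check the arithmetic, a short Python sketch using only the figures quoted in this post. The Oracle number is the $38billion mentioned earlier, which may not be for the same year as the Red Hat figure, so treat the ratio as approximate.

```python
# Figures as quoted in the post.
hdp_open, hdp_now = 26.38, 8.31             # Hortonworks share price, USD
red_hat_revenue, oracle_revenue = 2e9, 38e9  # annual revenue, USD

# Relative change in the Hortonworks share price since it opened.
hdp_change = (hdp_now - hdp_open) / hdp_open

# Red Hat revenue as a fraction of Oracle's.
red_hat_vs_oracle = red_hat_revenue / oracle_revenue

print(f"Hortonworks share price change: {hdp_change:.1%}")
print(f"Red Hat revenue as a share of Oracle's: {red_hat_vs_oracle:.1%}")
```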

So undoubtedly open source has had a large effect on the software industry as a whole. But the impact on the public markets to date is minimal in terms of new companies. It will be super interesting to see if in 5 years’ time the list of public software companies based on open source software is larger than it is today.


I mainly wrote this post so I had something to reference when I talk to people about the software industry, and in particular what it’s like working for a software vendor. Speculating about or second guessing one vendor or another is an internet sport (none more so than for those that work at other vendors) but from the outside I think it’s worth an appreciation of some of the differences and a bit of empathy for the decisions made. And if the above makes you think this all sounds rather interesting then you’d be right.

The coming of the Kubernetes distributions

Very few people today start using Linux by downloading the Linux kernel and starting from scratch. Most people start with a Linux distribution; for instance Debian, Ubuntu or CentOS. These distributions provide some opinions, some central infrastructure, a brand, strong versioning for the entire ecosystem and a bunch of other things. I posit that we’ll see the same pattern emerge with Kubernetes.

What even is Kubernetes?

I’ve seen Kubernetes described as all of the following:

  • An operating system for your datacenter
  • The distributed systems toolkit
  • The Linux kernel for distributed systems

I think all of these descriptions point to the developers’ intent that Kubernetes is something to build upon, rather than a simple out-of-the-box experience. It’s predominantly about building agreement on the primitives/APIs of distributed systems.

A name for a thing

I’ve not seen much discussion of this in general yet, I think because it’s early days and many of the people looking at Kubernetes today are either developers or early adopter types. These people have been “downloading the kernel and starting from scratch”, even until recently most likely running from source downloaded directly from GitHub. If the Kubernetes ecosystem is to grow then that’s not how more mainstream IT will adopt Kubernetes.

The reason for discussing this now is that I think a name is useful. That way we can talk about Kubernetes (singular, the software) separate from distributions of Kubernetes (many of them, from different vendors and communities). I’d be happy to see a different name, but I think distribution probably fits best.

Any evidence?

Absolutely. A range of software vendors are providing what I’m calling Kubernetes distributions. Here is a sample; I’m sure there are, and will be, more. I’m also sure over time some will disappear or maintain only a niche audience.

  • OpenShift from Red Hat
  • Tectonic from CoreOS
  • Kismatic from Apprenda
  • Rancher
  • Canonical Distribution of Kubernetes
  • GKE from Google
  • Azure Container Service from Microsoft
  • Photon Platform from VMware
  • Navops from Univa

Note that Canonical are already using the term distribution in the name. I’ve seen it used in passing in CoreOS, OpenShift and Apprenda press materials too.

What can we expect from Kubernetes distributions?

Running with the analogy that Kubernetes is “an operating system for your datacenter” and that we’ll have a range of competing Kubernetes distributions, what else can we expect over the next few years?

Package repositories (aka. app stores)

One of the things provided by the traditional Linux distributions has been a central package repository. Most of the packages you’re installing from apt or yum are coming from that curated set of available packages. Not to mention community efforts like EPEL. We already have two package concepts within the Kubernetes ecosystem - container images (often from Docker Hub today, or from internal repositories) and Charts, part of the Helm package management tool (now a CNCF project).

In the short term expect the shared public Charts repository and Docker Hub to dominate. But over time different vendors will launch their own repositories. Partly this will be about building a trusted ecosystem, partly about limiting permutations for support and testing, and partly about control. The prize here is to be “the enterprise app store” and every vendor in this space is going to at least try to own that as part of their platform.

Kubernetes standards and compliance

In an environment with many distributors of core software, it’s common for people to emphasise portability. As vendors extend their distribution (to provide higher level, but potentially proprietary features) this can become muddier. Some level of certification is often the answer. See CloudFoundry or OpenStack for recent examples. Kubernetes is already part of the CNCF, part of the Linux Foundation. I’d expect to see the words standards and certification eventually float around, but my guess is not in the short term.

A fight over who is the most open

Much of the container conversation recently has centered around a weaponisation of open. I think as the different distributions try and take the community with them at the same time as trying to scale sales this will continue. This will be an irritation and is probably best avoided.

Pressure for AWS to offer Kubernetes as a service

I would presume AWS has a very good idea of how many people are actually using Kubernetes on its platform. I think as that grows, and as other vendors’ efforts mature, they will come under pressure to offer the Kubernetes API as a service. I’m still split on whether that will actually happen but that’s a longer blog post about economics.

Differentiating features

Ultimately vendors will try and differentiate themselves in this new market. To begin with the majority of business will be targeting the container-curious and mainly talking up the benefits of containers and Kubernetes. But some potential customers are going to insist on comparing Kubernetes distributions and winning there is going to be about clear differentiation. Do you want to be the budget offering or the provider with the unique selling point?

Interesting questions

An observation at the moment is that all the current Kubernetes distributions I’m aware of are vendor-owned. Whether open source or not, they are driven by a single vendor (CoreOS, Red Hat, Apprenda, etc.). It will be interesting to see whether, in the current climate, a genuinely free and open source Kubernetes distribution emerges, similar to the role Debian plays in the Linux distribution world.

Unikernels and The End of the General Purpose Operating System

The previous post went into why I think the days of the general purpose operating system (for servers) are numbered. But one interesting area I didn’t comment on (but did talk about in the talk of the same name) was Unikernels.

It’s all about cost

One of the topics I didn’t really touch on in discussing the end of the general purpose operating system was cost. Historically, maintaining a general purpose operating system has been a costly endeavour, something only the largest companies or communities could sustain by themselves. Think Red Hat, Oracle, Microsoft, Sun, IBM, Debian, etc. The result of that is the assumption when building software that you should target one or more of a small number of operating systems. In doing so you’re ceding some ground, and likely some revenue, to another vendor. You’re also stuck with any underlying limitations of that OS as well as its release cadence. And invariably you’re also stuck with the multiplying support cost of supporting your software on multiple versions of that OS over time.

I would posit that up until relatively recently the cost of that support burden was hugely outweighed by the cost of maintaining an actual operating system. But that’s now changing, as I outlined in the previous post. Now a small or medium sized software company (be it CoreOS, Rancher, Docker, Pivotal, etc.) can build and maintain its own operating system as well. This is very much about the rising level of abstraction - all of the above leverage the huge efforts that go into the Linux kernel and into other projects like systemd (CoreOS) or Alpine (Docker’s Moby) for instance.

Enter Unikernels

But where do Unikernels fit into this narrative? I’d argue that they represent the fulfilment of this democratization. If building and maintaining a traditional OS is only possible for the largest of companies, and building and maintaining a more special-purpose OS (say for running containers, or a storage device) is cost-effective for medium sized software companies, then Unikernels will allow anyone to build their own single-purpose operating systems.

There are technical arguments for (and against) Unikernels as an approach, and most discussion focuses on those. I think the economic side is worth some consideration too. And not just the typical development and support costs: the ability to own the end-to-end unit of software has lots of benefits, and Unikernels may make those benefits available to everyone, including small organisations and individuals.

The End of the General Purpose Operating System

An interesting chat on Twitter today reminded me that not everyone is probably aware that we’re seeing a concerted attempt to dislodge the general purpose operating system from our servers.

I gave a talk about some of this nearly two years ago and I thought a blog post looking at what I got right, what I got wrong and what’s actually happening would be of interest to folks. The talk was written only a few months after I joined Puppet. With a bunch more time working for a software vendor there are some bits I missed in my original discussion.

What do you mean by general purpose and by end?

First up, a bit of clarification. By general purpose OS I’m referring to what most people use for server workloads today - be it RHEL or variants like CentOS or Fedora, or Debian and derivatives like Ubuntu. We’ll include Arch, the various BSD and OpenSolaris flavours and Windows too. By end I don’t literally mean they go away or stop being useful. My hypothesis is that, slowly to begin with then more quickly, they cease to be the default we reach for when launching new services.

The hypervisor of containers

The first part of the talk included a discussion of what I’d referred to as the hypervisor of containers, what today would more likely be referred to as a CaaS, or containers as a service. I even speculated that VMware would have to ship something in this space (see vSphere Integrated Containers and the work on Photon OS) and that counting out OpenShift would be premature (OpenShift 3 shipped predominantly as a Kubernetes distribution). I’ll come back to why this is a threat to your beloved Debian servers shortly.

The race to PID1

If you’ve run Docker, you’ll likely have wrestled with the question of where does the role of the host process supervisor (probably systemd) start and the container process supervisor (the Docker engine) end? Do you have to interact directly with both of them?

Now imagine if all of the software on your servers was run in containers. Why do I need two process supervisors now with 100% overlap? The obvious answer is you don’t, which is why the fight between Docker and systemd is inevitable. Note that this isn’t specific to Docker either. In-scope for cri-o is Container process lifecycle management.

Containers as the unit of software

Hidden behind my hypothesis, which mainly went unsaid, was that containers are becoming the unit of software. By which I mean the software we build or buy will increasingly be distributed as containers and run as containers. The container will carry with it enough metadata for the runtime to determine what resources are required to run it.

The number of simplifying assumptions that come from this shared contract should not be underestimated. At least at the host level you’re likely to need lots of near-identical hosts, all simply advertising their capabilities to the container scheduler.
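As a rough illustration of that contract, here is a toy first-fit scheduler in Python. The `Container` and `Host` types and their fields are hypothetical, invented for this sketch; real schedulers such as the one in Kubernetes weigh far more than CPU and memory.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Container:
    """Metadata a container carries about what it needs (hypothetical fields)."""
    name: str
    cpu: float      # cores requested
    memory_mb: int  # memory requested


@dataclass
class Host:
    """Free capacity a host advertises to the scheduler (hypothetical fields)."""
    name: str
    cpu: float
    memory_mb: int
    running: List[str] = field(default_factory=list)

    def fits(self, c: Container) -> bool:
        return c.cpu <= self.cpu and c.memory_mb <= self.memory_mb

    def place(self, c: Container) -> None:
        # Reserve the requested resources on this host.
        self.cpu -= c.cpu
        self.memory_mb -= c.memory_mb
        self.running.append(c.name)


def schedule(c: Container, hosts: List[Host]) -> Optional[str]:
    """Place the container on the first host with enough free capacity."""
    for host in hosts:
        if host.fits(c):
            host.place(c)
            return host.name
    return None  # nothing can run it


hosts = [Host("host-1", cpu=2, memory_mb=2048), Host("host-2", cpu=2, memory_mb=2048)]
workload = [Container("web", 1.5, 1024), Container("db", 1, 1536), Container("batch", 4, 512)]
placements = {c.name: schedule(c, hosts) for c in workload}
print(placements)
```

Because the hosts are interchangeable and only differ in free capacity, the scheduler needs no knowledge of how each host was built, which is the simplification being described.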

Operating system as implementation detail

What we’re witnessing in the market is the development of vertically integrated stacks.

  • Docker for Mac/Windows/AWS/Azure ships with its own operating system, an Alpine Linux derivative nicknamed Moby, which is not intended for direct management by end users.
  • Tectonic from CoreOS is a Kubernetes distribution which runs atop a cluster of managed CoreOS hosts. Most of the operating system is managed with frequent atomic rolling updates.
  • OpenShift Enterprise from Red Hat is another Kubernetes derivative, this time running atop Atomic Host.
  • Pivotal CloudFoundry ships with the IaaS, host OS, kernel, file system, container OS all tested together.

In all of these cases the operating system is an implementation detail of the higher level software. It’s not intended to be directly managed, or at least managed to the same degree as the general purpose OS you’re running today.

This is how the end comes for the majority of your general purpose operating system running servers. The machines running containers will be running something more single purpose, and more and more of the software you’re running will be running in containers.

The reason why you’ll do this, rather than compose everything yourself, is compatibility. Whether it’s kernel versions, file system drivers, operating system variants or a hundred other variations that make your OS build different from mine, building and testing software that runs everywhere is a Sisyphean task. There is also the commercial angle at play here, and the advantage of being able to support a single validated product for everyone.


There are lots of implications to this move, and it’s going to be interesting to see how it plays out with both early adopters and enterprise customers alike.

  • What does this mean for corporate operating system policies?
  • How do standard agent-based monitoring systems work in a world of closed vertical stacks?
  • Will we see this pattern for other types of service in the AWS Marketplace, where instances launched are inaccessible but automatically updating?
  • How does such fast moving software work in environments with rigid change control processes or audit requirements?
  • Many large organisations will end up running more than one of these types of system, how best to manage such heterogeneous environments?
  • Will we see push back from some parties? In particular the open source community who may see this mainly serving the needs of vendors?
  • Does the end of the general purpose OS lead to greater specialism amongst systems administrators?

I’d love to chat about any of this with other folks who have given it some thought. It’s interesting watching grand changes play out across the industry and picking up on patterns that are likely obvious in hindsight. And if you like this sort of thing let me know and I’ll try and find time for more speculation.

InfraKit Hello World

Docker just shipped InfraKit a few days ago at LinuxCon and, while at the Docker Distributed Systems Summit, I wanted to see if I could get a hello world example up and running. The documentation is lacking at the moment, especially around how to tie the different components like instances and flavors together.

The following example isn’t going to do anything particularly useful, but it’s hopefully simple enough to help anyone else trying to get started. I’m assuming you’ve checked out and built the binaries as described in the README.

First create a directory. We’re going to be using InfraKit to manage local files in that directory as part of the demo.

mkdir test

Now create an InfraKit configuration file. We’re going to use the file instance plugin to manage files in our directory. This means everything works on the local machine, rather than trying to launch real infrastructure in AWS or similar. InfraKit also requires a flavor plugin. I’m using vanilla here just to meet the requirement for a flavor plugin, but it’s not going to actually do anything in this demo. It might be useful to write a noop flavor plugin or similar.

cat garethr.json
{
    "ID": "garethr",
    "Properties": {
        "Instance" : {
            "Plugin": "instance-file",
            "Properties": {
            }
        },
        "Flavor" : {
            "Plugin": "flavor-vanilla",
            "Properties": {
                "Size": 1
            }
        }
    }
}

InfraKit is based on running separate plugins. Each plugin runs as a separate process and provides a filesystem socket in /run/infrakit/plugins. First start up the file plugin:

$ ./infrakit/file --dir=./test
INFO[0000] Starting plugin
INFO[0000] Listening on: unix:///run/infrakit/plugins/instance-file.sock
INFO[0000] listener protocol= unix addr= /run/infrakit/plugins/instance-file.sock err= <nil>

Next, in a separate terminal run the vanilla plugin:

$ ./infrakit/vanilla
INFO[0000] Starting plugin
INFO[0000] Listening on: unix:///run/infrakit/plugins/flavor-vanilla.sock
INFO[0000] listener protocol= unix addr= /run/infrakit/plugins/flavor-vanilla.sock err= <nil>

And finally run the group plugin. I’m passing --log=5 to enable more verbose output so it’s easier to see what’s going on with the group.

$ ./infrakit/group --log=5
INFO[0000] Starting discovery
DEBU[0000] Opening: /run/infrakit/plugins
DEBU[0000] Discovered plugin at unix:///run/infrakit/plugins/instance-file.sock
INFO[0000] Starting plugin
INFO[0000] Starting
INFO[0000] Listening on: unix:///run/infrakit/plugins/group.sock
INFO[0000] listener protocol= unix addr= /run/infrakit/plugins/group.sock err= <nil>

With that all set up we can create a group based on our configuration file from above.

$ ./infrakit/cli group --name group watch garethr.json
watching garethr

Have a look in the test directory. You should see a single file has been created.

$ ls test

Let’s delete that file and see what happens:

rm test/*

Hopefully InfraKit will spot the instance (a file in this case) no longer exists and recreate it. You should see something like the following in the logs:

INFO[0612] Created instance instance-1475833820 with tags map[infrakit.config_sha:B2MsacXz8V_ztsjAzu3tu3zivlw=]
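
The behaviour here is a reconciliation loop: compare the desired number of instances with what actually exists and create whatever is missing. The following Python sketch mimics that idea with files in a temporary directory. It is not InfraKit’s code or API; the `reconcile` function and the file naming are made up for illustration, and unlike a real group plugin it never scales down.

```python
import os
import tempfile
import time


def reconcile(directory, desired_size):
    """Create files until the directory holds desired_size 'instances'."""
    created = []
    existing = os.listdir(directory)
    # range() is empty when we already have enough, so nothing is created.
    for i in range(desired_size - len(existing)):
        # Hypothetical naming scheme, loosely modelled on the log line above.
        name = f"instance-{int(time.time())}-{i}"
        with open(os.path.join(directory, name), "w") as handle:
            handle.write("managed by the reconcile sketch\n")
        created.append(name)
    return created


workdir = tempfile.mkdtemp()
first_pass = reconcile(workdir, desired_size=1)   # nothing exists yet, so one file is created
second_pass = reconcile(workdir, desired_size=1)  # already converged, so nothing happens

# Simulate `rm test/*`, then reconcile again.
for name in os.listdir(workdir):
    os.remove(os.path.join(workdir, name))
third_pass = reconcile(workdir, desired_size=1)

print(first_pass, second_pass, third_pass)
```

Run in a loop on a timer, a function like this gives the same self-healing effect seen in the demo: delete the file and the next pass puts it back.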

This is obviously a less-than-useful example but hopefully provides a good hello world example for anyone trying to run InfraKit in its current early stage.

Everyone is Not a Software Company

The Everyone is a Software Company meme has been around for a number of years, but it feels increasingly hard to get away from recently. That prompted this post.

But what do we mean by Software Company?

To be a software company you’re going to need to employ software engineers and other professionals. Applying that logic to a large number of companies at once, and looking at how existing software companies are set up, we find a few large problems.

Google as an example

In my talk at Velocity, entitled The Two Sides of Google Infrastructure for Everyone Else, I argued both for and against the idea of wholesale adoption of Google-like software and development/operations practices. Even though they derive the lion’s share of revenue from advertising it’s easy to argue that Google are a software company. But what does that look like? What makes Google a software company?

From the Google Annual Report 2015

61,814 full-time employees: 23,336 in research and development, 19,082 in sales and marketing, 10,944 in operations, and 8,452 in general and administrative functions

So, roughly 50% of Google is involved in building or running software. Glassdoor says salaries for engineers at Google average about $126,000-$162,000.

The US Bureau of Labor Statistics says that in 2014 the number of computer programming jobs in the US was 1,114,000, with median pay in 2015 of $100,690 a year. The total number of jobs in the US is about 143 million, with the average wages at $44,569.20 according to the Social Security Administration.
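
The proportions behind those numbers are easy to check. A small Python sketch using only the figures quoted above; it assumes that “building or running software” means research and development plus operations, which is my reading rather than a category from the report, and it compares a median with an average, so the pay ratio is indicative only.

```python
# Headcount figures quoted from the Google Annual Report 2015.
google = {
    "research and development": 23_336,
    "sales and marketing": 19_082,
    "operations": 10_944,
    "general and administrative": 8_452,
}
total = sum(google.values())

# Assumption: "building or running software" = R&D plus operations.
building_or_running = google["research and development"] + google["operations"]
google_share = building_or_running / total

# US labour figures as quoted above.
programmer_share = 1_114_000 / 143_000_000  # programming jobs / all jobs
pay_ratio = 100_690 / 44_569.20             # programmer median pay / average wage

print(f"Google: {google_share:.1%} in R&D or operations")
print(f"US: {programmer_share:.2%} of jobs are programming jobs")
print(f"Programmer median pay is {pay_ratio:.1f}x the average wage")
```

Roughly half of Google versus under 1% of US jobs is the gap the rest of this post is about.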

The Google Annual Report also states:

Competition for qualified personnel in our industry is intense, particularly for software engineers, computer scientists, and other technical staff

So, quick summary:

  • Software engineers are expensive relative to other employees
  • Demand for the best engineers means even higher wages
  • Proportionally there aren’t many software developers
  • There isn’t a large surplus of unemployed software engineers

Now the data above is mainly from US sources, although the Google data is from an international company with offices around the world. My experience says this is likely similar in Europe. Looking into data for India and China would be super interesting I’d wager.


One obvious problem is short-term supply and demand. Everyone wants experienced software folks for their transformation effort. But the more organisations that buy into the everyone is a software company story, the greater the demand for a finite supply of people. For most that means you’ll be able to find fewer people than you want because of competition, and afford even fewer because all that competition pushes up salaries.

I’ve seen that firsthand while working for the UK Government. People occasionally complained that Government was hampering commercial organisations’ growth by employing lots of developers and operations people in London.

You’re also immediately in competition for software professionals with existing software companies. Given the high salaries, most of those employers already have developer friendly working environments and established hiring practices suited to luring developers to work for them. This sort of special case is hard for large companies without an existing empowered developer organisation. I saw a lot of that at the Government as well.

But the real macro problems are much more interesting. Even if you think 50% is a high mark for the ratio of software folk to others, you probably agree you need a lot more than you have today. And those developers just don’t exist today to allow everyone to be a software company. Nor, I would argue, is education in the near term producing enough skilled people to fill that gap tomorrow. So, what happens?

  • Does everyone sort-of become a software company but not quite?
  • Do most organisations struggle to hire and maintain a software team and see the endeavour fail?
  • Do increasing numbers of developers end up working for a small number of larger and larger software companies?
  • Does outsourcing bounce back, adapt and demonstrate innovation and transformation qualities to go along with the scale?
  • Are countries like India or China able to produce enough software engineers at scale to allow their companies to act on everyone becoming a software company?
  • Do we see clear winners and losers, i.e. companies which become software companies and accelerate away from those that don’t?

Personally I think to take advantage of the idea behind the meme we’re going to need approaches to software delivery that are an order of magnitude more efficient. What that looks like is the most interesting question of all.


The above is not a detailed analysis, and undoubtedly has a few holes. It also doesn’t overly question the advantage of being a software company, or really question what we actually mean by everyone. But I think the central point holds: Everyone is NOT a software company, nor will everyone be a software company any time soon, unless we come up with a fundamentally better approach to service delivery.

Operations is more than just Systems Administration

I think one of the patterns of the last few years has been the democratization of systems administration, especially for web applications. Whether that’s Heroku or Docker, or Chef or Puppet, more and more traditional developers are doing work that would have been somebody else’s problem only a few years ago. But running in parallel to that thread is another less positive trend, that of conflating operations with just systems administration. The story seems to go that now we know Ansible (or some other tool) we just need developers to run the show.

In this post I’m going to try and introduce some of the other operational disciplines, especially for developers who maybe have come to operations via the above resurgence in infrastructure tooling over the past few years.

Note that this post has a slight bias towards more normal organisations. That is to say, if you’re in a 5 person software startup you probably don’t have operational problems to worry too much about yet. I’m also not playing down the practice of systems administration; most experienced sysadmins I know are also quite rounded operations pros.

Service Management

If you’ve worked in operations, or in many large organisations, you’ll have come across the term Service Management. This tends to be linked to various service management frameworks, like ITIL or MOF (Microsoft Operations Framework). The framework will describe, often in great detail, activities and processes for things like incident response, configuration management, change management, capacity planning and more.

While I was at The Government I wrote what I think is a reasonable introduction to Service Management, albeit from a specific point of view. This was based on my experience of trying, and likely sometimes failing, to encourage teams to think about how the products they were working on would be run. Each of the topics touched on in the overview is worthy of its own stack of books, but I will repeat the ITIL service list here as (whatever you might think of the framework or a specific implementation) I’ve found it a useful starting point for conversations - in particular stressing the breadth of topics under service management.

Service Strategy

  • IT service management
  • Service portfolio management
  • Financial management for IT services
  • Demand management
  • Business relationship management

Service Design

  • Design coordination
  • Service Catalogue management
  • Service level management
  • Availability management
  • Capacity Management
  • IT service continuity management
  • Information security management system
  • Supplier management

Service Transition

  • Transition planning and support
  • Change management
  • Service asset and configuration management
  • Release and deployment management
  • Service validation and testing
  • Change evaluation
  • Knowledge management

Service Operation

  • Event management
  • Incident management
  • Request fulfillment
  • Problem management
  • Identity management
  • Continual Service Improvement

For each of the above points, whether you are using ITIL or not, it’s useful to have a conversation. Some of these areas do provide ample opportunity for automation and for using tooling to minimise the effort required. But much of this is about designing how you are going to operate a service throughout its lifetime.

Operations user stories

One of the other things I published while at The Government was a set of user stories for a web operations team. These grew out of work on launching GOV.UK and have had input from various past colleagues. In hindsight I’d probably do some things here differently; the stories assume a certain context which isn’t explicitly spelled out, for instance. But they have a couple of things going for them: they demonstrate how traditional operations activities can be planned out as part of a more developer-friendly planning approach, and they are public and have been tested by more than a single team.

Not everything is a programming problem

The main point, I think, is that not everything can be turned into a programming problem to solve. Automation has its place, and many manual processes and practices can benefit from automation. But the wide range of activities involved in running a non-trivial and often non-ideal system in production tends to mean making trade-offs and prioritization decisions frequently. This is where softer skills like arguing for funding or additional head count, or building a business case for further work, come into play. Operations management is much more than systems administration.

Further reading

This is little more than a plea for people to think more about operations, separate to the more technical aspects of systems administration. If you’re interested in learning more, however, I would recommend some good reading material:

  • Visible Ops Handbook - still an excellent and pragmatic introduction to many of the topics noted above.
  • Designing Delivery - a bang up-to-date tome covering a range of service design topics.
  • Basic Service Management - a 50 page starter book covering the fundamentals of service management as generally discussed in more detail elsewhere. A great starting point.

Provisioning droplets with Puppet

I love DigitalOcean for quickly spinning up machines. I also like managing my infrastructure using Puppet. Enter the garethr-digitalocean module. This currently provides a single Puppet type: droplet.

Let’s show a quick example of that, by launching two droplets, called test-digitalocean and test-digitalocean-1.

droplet { ['test-digitalocean', 'test-digitalocean-1']:
  ensure => present,
  region => 'lon1',
  size   => '512mb',
  image  => 14169855,
}

With the above manifest saved as droplets.pp we can run it with:

$ puppet apply --test droplets.pp

This will ensure those two droplets exist in that region, and have that size. If they don’t exist it will launch droplets using the specified image. This means we can run the same command again, and rather than create more instances it will simply report that those droplets already exist.

Querying resources

Puppet also comes with puppet resource, a handy way of querying the state of a given resource or type. Running the following will list all of your droplets, whether you created them using Puppet or not.

$ puppet resource droplet
droplet { 'test-digitalocean':
  ensure              => 'present',
  backups             => 'false',
  image               => '14169855',
  image_slug          => 'ubuntu-15-10-x64',
  ipv6                => 'true',
  price_monthly       => '10.0',
  private_address     => '',
  private_networking  => 'true',
  public_address      => '',
  public_address_ipv6 => '2A03:B0C0:0001:00D0:0000:0000:0090:B001',
  region              => 'lon1',
  size                => '1gb',
}

Mutating resources

The type also supports mutating droplets, for instance changing the size of a droplet if you change the model in Puppet. The API client doesn’t support all possible changes, but you can disable backups, enable IPv6 and switch on private networking as needed. Here’s a quick sample of the output showing this in action.

Info: Loading facts
Notice: Compiled catalog for gareths-macbook.local in environment production in 0.43 seconds
Info: Applying configuration version '1449225401'
Info: Checking if droplet test-digitalocean exists
Info: Powering off droplet test-digitalocean
Info: Resizing droplet test-digitalocean
Info: Powering up droplet test-digitalocean
Notice: /Stage[main]/Main/Droplet[test-digitalocean]/size: size changed '1gb' to '512mb'
Error: Disabling IPv6 for test-digitalocean is not supported
Error: /Stage[main]/Main/Droplet[test-digitalocean]/ipv6: change from true to false failed: Disabling IPv6 for test-digitalocean is not supported
Error: Disabling private networking for test-digitalocean is not supported
Error: /Stage[main]/Main/Droplet[test-digitalocean]/private_networking: change from true to false failed: Disabling private networking for test-digitalocean is not supported
Info: Checking if droplet test-digitalocean-1 exists
Info: Created new droplet called test-digitalocean-1
Notice: /Stage[main]/Main/Droplet[test-digitalocean-1]/ensure: created
Info: Class[Main]: Unscheduling all events on Class[Main]
Notice: Applied catalog in 60.61 seconds
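
For reference, a manifest along the following lines would request the changes seen in that output. This is a reconstruction from the log rather than the exact file used: it reuses the earlier example and adds the ipv6 and private_networking properties (names taken from the puppet resource listing), and it assumes the droplet type accepts false for both.

```puppet
# Sketch only: reconstructed from the log output, not the original manifest.
droplet { ['test-digitalocean', 'test-digitalocean-1']:
  ensure             => present,
  region             => 'lon1',
  size               => '512mb',   # the existing droplet was '1gb', so it is resized
  image              => 14169855,
  ipv6               => false,     # rejected for the existing droplet: disabling is not supported
  private_networking => false,     # likewise rejected for the existing droplet
}
```

The resize succeeds while the two unsupported changes are reported as errors, which is why the run shows a mix of Notice and Error lines for the same resource.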

But why?

Describing your infrastructure at this level in code has several advantages:

  • Having a shared model of your infrastructure in code allows for a discussion around that model
  • You can be confident in the model because of the idempotent nature of running the code
  • The use of code for this model allows for activities like code review, change control based on pull requests, unit testing, user created abstractions and more
  • The use of Puppet means you can use it as above as a command line interface, or run it at regular intervals to enforce and report on the state of your infrastructure
  • Puppet ecosystem tools like PuppetDB, Puppet Board or Puppet Enterprise mean you can store data over time for later analysis
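
As a rough illustration of running the model on a schedule, one option is Puppet’s built-in cron resource type. The sketch below is an assumption rather than something the module provides: the resource title, both file paths and the 30 minute interval are hypothetical, and any API credentials the module needs would also have to be available to the scheduled job.

```puppet
# Hypothetical schedule: re-apply the droplet manifest every 30 minutes.
# Both paths are placeholders; adjust them to match your installation.
cron { 'enforce-droplets':
  ensure  => present,
  command => '/opt/puppetlabs/bin/puppet apply /etc/puppetlabs/code/droplets.pp',
  user    => 'root',
  minute  => '*/30',
}
```

Because the droplet resources are idempotent, repeated runs should only make changes when the real infrastructure has drifted from the model.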

The module also acts as a reasonable example of a simple Puppet type and provider. If you’re interested in extending Puppet for your own services this is hopefully a good place to start understanding the API.

Some Security Implications of Unikernels

I was attending the first GOTO London conference last week, in particular the Rugged Track. One of the topics of conversation that came up was unikernels, and their potential for improving the state of software security. Unikernels are pretty new outside research groups; I’m just lucky enough to live and work in Cambridge where some of that research is happening. The security advantages of unikernels are one of the things that attracted me in the first place. I thought it might be interesting to jot a few of those down for other people interested in security and the future of infrastructure.

As with my last post, it’s worth having a basic understanding of unikernels. I’d recommend reading Unikernels - the rise of the virtual library operating system.


Every unikernel is provided with the isolation guarantees of a hypervisor. Not only are these guarantees reasonably well understood, they tend to make use of hardware features too. It’s interesting to note that recent container runtime work is heading in this direction too, with projects like Clear Containers from Intel, Bonneville from VMware and the new stage1 in rkt.

No User Space

With a typical server OS we have kernel space and user space. Part of the idea here is to ensure the underlying machine doesn’t crash, whatever horrible things people do in user space. But this means you can do horrible things. The unikernel model is similar to the Erlang philosophy of let it crash. You only have kernel space, and your entire application resides in it. Most things out of the ordinary are going to crash the kernel. This makes the sort of exploratory testing useful in exploit development harder.

Really Immutable Infrastructure

People often talk about immutable infrastructure. I’d wager there is more talk than reality however. When you push, people are often not using read-only file systems and retain the capability to login to machines to make ad-hoc changes. What they mean by immutable is that they only change machines at deploy time. This ignores both the fact they have the technical capability to change them anytime, and that an attacker could change them outside that deployment cycle. With unikernel systems there is often just the compiled kernel, you can’t just change files on disk. The defaults force an immutable way of working.

Clean Slate TLS

As a typical developer or operator you’ve probably learned more than you wanted to know about the OpenSSL source code. It’s not well understood and not likely to be so anytime soon and has some pretty spectacular bugs like Heartbleed. The Core Infrastructure Initiative is laudable and will improve things but it’s still a problematic codebase. Functional programming is often regarded as an easier way of writing understandable code. Types are a good thing, especially when it comes to security systems. So a pure OCaml TLS implementation as used by MirageOS makes sense on lots of levels. Yes this is quite an undertaking, but the bitcoin pinata tests show promise.

Formal Proofs

Knowing whether an application really does exactly what you want it to do (and no more) is a hard problem to solve. Unit tests and other forms of automated testing help, but are still reliant on people to both write and design the tests. A formal proof system can provide much stronger guarantees of correctness; it’s an approach used in some cases for mission-critical components of Amazon’s AWS. MirageOS is implemented in OCaml. One of the most popular OCaml programs is Coq, which just so happens to be a formal proof management system. I’ve not seen many examples yet of this approach, probably due to the effort involved, but the capability is there for building formally specified unikernels. I’d wager a similar thing is possible with Haskell and HalVM. Making that easier to do for typical developers could open up much more secure development practices for certain use cases.

A Discussion of The Operational Challenges With Unikernels

What are Unikernels

Most of this post assumes a basic understanding of what unikernels are so I’d recommend reading Unikernels – the rise of the virtual library operating system before moving on.

Why are Unikernels interesting

As a starting point: complexity. Managing infrastructure, and the software that runs on it, is too complicated. You can impose organisational rules to control this complexity (we only deploy on Debian, we only run JVM applications, the only allowed database is MySQL) but that limits you in other ways too, and in reality is nearly always broken somewhere in any non-trivial environment (this appliance uses Ubuntu, this software is only certified on Windows, PostgreSQL doesn’t run on the JVM). So you turn to software to manage that complexity; Puppet or Chef do a great job of allowing configuration complexity to be managed in code (where you can test it) and Docker allows for bundles of complexity to be isolated from other bundles of complexity. But there are still an awful lot of moving parts.

Another reason is the growing realisation that security is important. Securing systems on the internet is hard. Even though the basics are broadly understood they are often not implemented, and the people attempting to compromise systems are smart, well paid and highly incentivised (basically like you). It’s generally easier to break something than to build it. Part of this is a numbers game – to run a reasonably sized system you might need to run 50 different services, and install 200 packages on every host. An attacker has to compromise just one of those to win.

A further reason, if one were needed, is the proliferation of many small internet connected devices, aka the Internet of Things. Part of this relates to the above points about security concerns, but some of it is simply a matter of managing that many single purpose, low power devices. The overhead of a typical general purpose operating system and application runtime just doesn’t fit this model.

Enter unikernels. Unikernels actually remove unneeded complexity. You’re running a hypervisor and the unikernel and that’s it. The unikernel contains only those libraries that you have specifically required. That drastically reduces the surface area for attack as well as meaning you’re running less software, hopefully enough less that your power needs are reduced too. By specifically requiring individual libraries you’re also making complexity visible. Rather than using a general purpose operating system with its 100s of packages and millions of lines of code you are at least choosing what to include.

Operational challenges

While I think some part of the future looks like unikernels, there are some large operational challenges to overcome before they break out of very specific niches or research projects. Note that there are architectural and software development challenges as well, I just happen to think they’re easier to deal with.

Development environment

There are a few properties of a development environment that I think are essential to modern development, development/production parity being one of the most important. Tools like Vagrant, a move towards infrastructure as code, and more recently Docker have made great strides here in the past several years. The different unikernel implementations are generally based on lesser known software stacks (Haskell, OCaml, Erlang, etc) so some of this is familiarity. But what does development/production parity mean for a unikernel based system? We’re not just talking about the individual unikernel here either – how do I deploy unikernels? How do I compose several unikernels together to build an application? What does a continuous integration or deployment pipeline look like? In my view the unikernel movement should focus some efforts here. Not only will this make it easier for people to get started, but having strong opinions early will allow the nascent community to solve the problem together, rather than everyone solving it just-in-time for themselves.

Managing the hypervisor

I’d argue today most developers don’t spend much time directly working with hypervisors. Either you’re running on an in-house VMware, KVM or Xen install with some (hopefully self-service, automated) provisioning mechanism in place or you’re using a public cloud like AWS, Azure, etc. The current generation of unikernel systems mainly target Xen. I think in the short term at least this means getting to know the hypervisor. Xen is solid software, but I don’t see a great deal of automation around it – say well maintained Puppet modules, API clients or a Terraform provider. In the long term we’ll hopefully have higher level interfaces, but in the short term efforts here would lower the barrier to entry considerably.

Double down on AWS

Given the above, and given the ubiquity of EC2 (which is based on Xen), it might be wise to build up first-class tools around using EC2 as a target environment for unikernel deployments. EC2 supports custom kernels, but these require a number of convoluted steps that could be automated away (note that I’m talking about more than just a shell script here). Also what are the best practices around autoscaling groups and unikernels? Or VPC networks and unikernels?

The network

With the explosion in containers and microservices it’s becoming clearer (if it wasn’t already) how important the network is. By removing the operating system we remove things like host firewalls and the new breed of overlay networks. At the same time if we are to tap the dynamic potential of unikernels we’ll need a similarly dynamic and automatable network. Maybe this becomes more of an application concern, with services communicating via other services which act as firewalls and intelligent proxies, but that still leaves the underlying network to be managed.


However much testing you do beforehand you’ll still likely end up with problems in production, and as you scale up you’ll hit issues that you simply can’t recreate outside the live environment. This is where good debugging capabilities come in. While general purpose operating systems might be complex they are well known, and tools like ps, top, free, ping, telnet, netcat, dtrace, etc. are commonly used by anyone debugging systems. Note that in many cases you’re debugging a combination of systems; is the performance issue an application problem, a network problem, a storage problem or some interesting combination of several factors?

By removing the general purpose operating system, unikernel based environments remove most of the current debugging tools at the same time. Part of this is good application development hygiene (logs, metrics and status endpoints for instance), but what about the more interactive debugging practices? What does debugging a system based on unikernels look like?


The word may be overloaded but the need to arrange and manage a number of components that make up a larger system is a real need. This might be something like Docker’s Compose file or Brooklyn’s Blueprints, or it could be something more akin to the APIs from Cloud Foundry, Kubernetes or Mesos. Testing some of these models with unikernel based systems will be an interesting test of how coupled to containers the existing models are. The lack of legacy again opens up the potential to come up with a truly modern alternative here too.


Unless you’re in an environment where security is your number 1 concern, the current state of unikernels probably means choosing to adopt them now is a little bleeding edge. But I think that will change over time as the various projects mature and address some of the issues described above. In the meantime I’d love to see more discussion of some of the operational challenges. I think talking about the needs of operators at this early stage should make the resulting ecosystems more robust when it comes to future production deployments.