Predictions for the direction of serverless platforms

While at JeffConf I had a particularly interesting conversation with Anne, James and Guy. Inspired by that, and the fact James is currently writing a blog post a day, I thought I’d have a go at writing up some of my thoughts.

I want to focus on a couple of things relevant to the evolution of Serverless as a platform and the resulting commercial ecosystem, namely the importance of Open Service Broker and a bet on OpenWhisk.

The above is a pretty high-level picture of how I think Serverless platforms are playing out. I’ve skipped a few things from the diagram that I’ll come back to, and I’m not trying to detail all options, just the majority by usage, but the main points are probably:


OpenWhisk is one of a number of run-your-own-serverless platforms vying for attention. My guess is it’s the one that will see actual market adoption over time.

One of the reasons for that is that Red Hat recently announced they are supporting OpenWhisk. Specifically they are working to run OpenWhisk on top of Kubernetes which I see as being something that will tap into the growing number of people and companies adopting Kubernetes in one form or another. Kubernetes provides a powerful platform to build higher-level user interfaces on, and I see it ending up as the default here.

There are a few things that could make this prediction much messier. One of those is that OpenWhisk is currently in the incubator for the Apache Foundation. CNCF has been building up a product portfolio across tools in this space, and has a Serverless working group. It would be a shame if this somehow ends up with a perceived need to bless something else. The other might be, say, an acquisition of Serverless Framework by Docker Inc, or a new entrant completely.

Open Service Broker

Without supporting services to bind to, Serverless computing doesn’t look as interesting. Although you’ll see some standalone Lambda usage, it’s much more common to see it combined with API Gateway, DynamoDB, S3, Kinesis, etc. The catch is that these are AWS services. Azure (with Azure Functions) and Google Cloud (with Cloud Functions) are busy building similar ecosystems. This makes the barrier to entry for a pure technology play (like OpenWhisk or something else) incredibly high.

AWS Services work together because they are built by the same organisation, and integrated together via the shared AWS fabric. How can you build that sort of level of agreement between a large number of totally unconnected third parties? What about exposing internal services (for instance your large Oracle on-premise cluster) to your serverless functions? Enter Open Service Broker.

Open Service Broker has been around since the end of last year. It defines an API for provisioning, deprovisioning and binding services for use by other platforms, for instance Cloud Foundry or Kubernetes. Here’s a quick example broker. I think we’ll eventually see Open Service Broker powering higher-level user interfaces in other platforms, ones which aim to compete directly with, for instance, the ease of binding an API Gateway to a Lambda function in AWS. Especially if you’re a service provider I’d steal a march on the competition and start looking at this now. This is also a potential avenue for the other cloud providers to expand on their claims of openness. I’d wager on us seeing Azure services available over Open Service Broker, for instance.

A few extra observations

  • Anyone but Amazon will be real - on both the buyer side (a little) and the vendor side (a lot) you’ll see a number of unlikely collaborations between competitors at the technology level.
  • Open Source is a fundamental part of any AWS competition. AWS hired some great open source community focused people recently, so expect to see them try to head that threat off if it gets serious.
  • The fact Amazon provides the entire stack AND runs a chunk of the infrastructure for the other options helps explain why people (again, especially other vendors) fear AWS.
  • Both Pivotal and Red Hat will be in the mix when it comes to hosted serverless platforms, with CloudFoundry and OpenShift (powered by OpenWhisk) respectively.
  • The stack in the middle of some OS, some provisioning, Kubernetes and OpenWhisk will have lots of variants, including from a range of vendors. Eventually some might focus on serverless, but for others it will be simply another user interface atop a lower-level infrastructure play.
  • If you’re thinking “But Cloud Foundry isn’t serverless?” then you’re not paying attention to things like the recent release of Spring Functions.
  • In theory Azure and Google Cloud could end up playing in the same place as AWS. But with the lead AWS Lambda has in the community that is going to take something more than just building a parallel ecosystem. I could totally see Azure embracing OpenWhisk as an option in Azure Functions, in the way Azure Container Service is happy to provide multiple options.

Overall I’ll be interested to see how this plays out, and whether my guesses above turn out to be anywhere near right. There will be a lot of intermediary state, with companies large and small, existing and new, shipping software. And like anything else it will take time to stabilise, whether to something like the above or something else entirely. I feel like the alternative to the above is simply near total domination by 1 or maybe 2 of the main cloud providers, which would be less interesting to guess about and would have made for a shorter blog post.

Schemas for Kubernetes types

I’ve been playing around building a few Kubernetes developer tools at the moment and a few of those led me to the question; how do I validate this Kubernetes resource definition? This simple question led me through a bunch of GitHub issues without resolution, conversations with folks who wanted something similar, the OpenAPI specification and finally to what I hope is a nice resolution.

If you’re just after the schemas and don’t care for the details just head on over to the following GitHub repositories.

OpenShift gets a separate repository as it has an independent version scheme and adds a number of additional types into the mix.

But why?

It’s worth asking the question why before delving too far into the how. Let’s go back to the problem: I have a bunch of Kubernetes resource definitions, let’s say in YAML, and I want to know whether they are valid.

Now you might be thinking: can’t I just run them through kubectl? That raises a few issues which I don’t care for in a developer tool:

  • It requires kubectl to be installed and configured
  • It requires kubectl to be pointed at a working Kubernetes cluster

Here are a few knock-on effects of the above issues:

  • Running what should be a simple validate step in a CI system now requires a Kubernetes cluster
  • Why do I have to shell out to something and parse its output to validate a document, for instance if I’m doing it in the context of a unit test?
  • If I want to validate the definition against multiple versions of Kubernetes I need multiple Kubernetes clusters

Hopefully at this point it’s clear why the above doesn’t work. I don’t want to have to run a boat load of Kubernetes infrastructure to validate the structure of a text file. Why can’t I just have a schema in a standard format with widespread library support?

From OpenAPI to JSON Schema

Under-the-hood Kubernetes is all about types. Pods, ReplicationControllers, Deployments, etc. It’s these primitives that give Kubernetes its power and shape. These are described in the Kubernetes source code and are used to generate an OpenAPI description of the Kubernetes HTTP API. I’ve been spelunking here before with some work on generating Puppet types from this same specification.

The latest version of OpenAPI in fact already contains the type information we seek, encoded in a superset of JSON Schema in the definitions key. This is used by the various tools which generate clients from that definition. For instance the official Python client doesn’t know about these types directly; it all comes from the OpenAPI description of the API. But how do we use those definitions separately for our own nefarious validation purposes? Here’s a quick sample of what we see in the 50,000 line-long OpenAPI definition file:

definitions: {
  io.k8s.api.admissionregistration.v1alpha1.AdmissionHookClientConfig: {
    description: "AdmissionHookClientConfig contains the information to make a TLS connection with the webhook",
    required: [
      "service",
      "caBundle"
    ],
    properties: {
      caBundle: {
        description: "CABundle is a PEM encoded CA bundle which will be used to validate webhook's server certificate. Required",
        type: "string",
        format: "byte"
      },
      service: {
        description: "Service is a reference to the service for this webhook. If there is only one port open for the service, that port will be used. If there are multiple ports open, port 443 will be used if it is open, otherwise it is an error. Required",
        $ref: "#/definitions/io.k8s.api.admissionregistration.v1alpha1.ServiceReference"
      }
    }
  }
}
The discussion around wanting JSON Schemas for Kubernetes types has cropped up in a few places before; there are some useful comments on this issue, for instance. I didn’t find a comprehensive solution however, so set out on a journey to build one.


The tooling I’ve built for this purpose is called openapi2jsonschema. It’s not Kubernetes specific and should work with other OpenAPI specified APIs too, although as yet I’ve only done a little testing of that. Usage of openapi2jsonschema is fairly straightforward: just point it at the URL for an OpenAPI definition and watch it generate a whole bunch of files.


openapi2jsonschema can generate different flavours of output, useful for slightly different purposes. You probably only need to care about this if you’re generating your own schemas or you want to work completely offline.

  • default - URL referenced based on the specified GitHub repository
  • standalone - de-referenced schemas, more useful as standalone documents
  • local - relative references, useful to avoid the network dependency

The build script for the Kubernetes schemas is a simple way of seeing this in practice.
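The difference between flavours comes down to how each schema’s `$ref` pointers are written. Illustratively (these are not the exact published URLs), the default output references sibling schemas remotely:

```json
{"$ref": "https://example.com/schemas/v1.6.1/podspec.json"}
```

while the local flavour uses a relative path, so validation works without a network connection:

```json
{"$ref": "podspec.json"}
```

The standalone flavour goes further and inlines the referenced definitions entirely.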

Published Schemas

Using the above tooling I’m publishing Schemas for Kubernetes, and for OpenShift, which can be used directly from GitHub.

As an example of what these look like, here are the links to the latest deployment schemas for 1.6.1:

A simple example

There are lots of use cases for these schemas, although they are primarily useful as a low-level part of other developer workflow tools. But at a most basic level you can validate a Kubernetes config file.

Here is a quick example using the Python jsonschema client and an invalid deployment file:

$ jsonschema -F "{error.message}" -i hello-nginx.json
u'template' is a required property
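The same check works in-process too. Here’s a sketch using the jsonschema Python library, with a hand-rolled minimal schema standing in for the published Deployment schema (the real schemas are far more complete than this stand-in):

```python
from jsonschema import ValidationError, validate

# A tiny stand-in schema: a Deployment's spec must include a template
deployment_schema = {
    "type": "object",
    "required": ["apiVersion", "kind", "spec"],
    "properties": {
        "spec": {
            "type": "object",
            "required": ["template"],
        },
    },
}

# An invalid deployment: spec is missing the required template key
invalid_deployment = {
    "apiVersion": "apps/v1beta1",
    "kind": "Deployment",
    "spec": {"replicas": 2},
}

try:
    validate(invalid_deployment, deployment_schema)
except ValidationError as error:
    print(error.message)  # 'template' is a required property
```

Because this is just a library call on plain data, it drops straight into a unit test or CI step with no cluster in sight.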

What to do with all those schema?

As noted these schemas have lots of potential uses for development tools. Here are a few ideas, some of which I’ve already been hacking on:

  • Demonstrating using with the more common YAML serialisation
  • Testing tools to show your Kubernetes configuration files are valid, and against which versions of Kubernetes
  • Migration tools to check your config files are still valid against master or beta releases
  • Integration with code editors, for instance via something like Schema Store
  • Validation of Kubernetes configs generated by higher-level tools, like Helm, Ksonnet or Puppet
  • Visual tools for crafting Kubernetes configurations
  • Tools to show changes between Kubernetes versions

If you do use these schemas for anything please let me know, and I’ll try and keep them updated with releases of Kubernetes and OpenShift. I plan on polishing the openapi2jsonschema tool when I get some time, and I’d love to know if anyone uses that with other OpenAPI compatible APIs. And if all you want to do is validate your Kubernetes configuration and don’t care too much about what’s happening under the hood then stick around for the next blog post.

Replacing cron jobs with Lambda and Apex

Everyone has little scripts that want running on some schedule. I’ve seen entire organisations basically running on cron jobs. But for all the simplicity of cron it has a few issues:

  • It’s a per-host solution, in a world where hosts might be short-lived or unavailable for some other reason
  • It requires a fully configured and secured machine to run on, which comes with direct and indirect costs

There are a variety of distributed cron solutions around, but each adds complexity for what might be a throw-away script. In my view this is the perfect use case for trying out AWS Lambda, or other Serverless platforms. With a Serverless platform the above two issues go away from the point-of-view of the user; they are below the level of abstraction of the provided service. Let’s see a quick example of doing this.

Apex and Lambda

There are a number of frameworks and tools for helping deploy Serverless functions for different platforms. I’m going to use Apex because I’ve found it provides just enough of a user interface without getting in the way of writing a function.

Apex supports a wide range of different languages, and has lots of examples which makes getting started relatively easy. Installation is straightforward too.

A sample function

The function isn’t really relevant to this post, but I’ll include one for completeness. You can write functions in officially supported languages (like JavaScript or Python) or pick one of the languages supported via a shim in Apex. I’ve been writing Serverless functions in Go and Clojure recently, but I prefer Clojure so let’s use that for now.

(ns net.morethanseven.hello
    (:gen-class :implements [com.amazonaws.services.lambda.runtime.RequestStreamHandler])
    (:require [clojure.java.io :as io]
              [clojure.string :as str])
    (:import (com.amazonaws.services.lambda.runtime Context)))

(defn -handleRequest
  [this input-stream output-stream context]
  (let [handle (io/writer output-stream)]
    (.write handle (str "hello" "world"))
    (.flush handle)))

This would be saved at functions/hello/src/net/morethanseven/hello.clj, and the Apex project.json file should point at the function above:

{
  "runtime": "clojure",
  "handler": "net.morethanseven.hello::handleRequest"
}

You would also need a small configuration file at functions/hello/project.clj:

(defproject net.morethanseven "0.1.0-SNAPSHOT"
  :description "Hello World."
  :dependencies [[com.amazonaws/aws-lambda-java-core "1.1.0"]
                 [com.amazonaws/aws-lambda-java-events "1.1.0" :exclusions [com.amazonaws/aws-lambda-java-core]]
                 [org.clojure/clojure "1.8.0"]]
  :aot :all)

The above is really just showing an example of how little code a function might contain; the specifics are relevant only if you’re interested in Clojure. But imagine the same sort of thing for your language of choice.


The interesting part (hopefully) of this blog post is the observation that using AWS Lambda doesn’t mean you don’t need any infrastructure or configuration. The good news is that, for the periodic job/cron usecase this infrastructure is fairly standard between jobs.

Apex has useful integration with Terraform to help manage any required infrastructure too. We can run the following two commands to provision and then manage our infrastructure.

apex infra init
apex infra deploy

Doing so requires us to write a little Terraform code. First we need some variables; we’ll include this in infrastructure/

variable "aws_region" {
  description = "AWS Region Lambda function is deployed to"
}

variable "apex_environment" {
  description = "Apex configured environment. Auto provided by 'apex infra'"
}

variable "apex_function_role" {
  description = "Provisioned Lambda Role ARN via Apex. Auto provided by 'apex infra'"
}

variable "apex_function_hub" {
  description = "Provisioned function 'hub' ARN information. Auto provided by 'apex infra'"
}

variable "apex_function_hub_name" {
  description = "Provisioned function 'hub' name information. Auto provided by 'apex infra'"
}

And then we need to describe the resources for our cron job in infrastructure/

resource "aws_cloudwatch_event_rule" "every_five_minutes" {
    name = "every-five-minutes"
    description = "Fires every five minutes"
    schedule_expression = "rate(5 minutes)"
}

resource "aws_cloudwatch_event_target" "check_hub_every_five_minutes" {
    rule = "${aws_cloudwatch_event_rule.every_five_minutes.name}"
    target_id = "${var.apex_function_hub_name}"
    arn = "${var.apex_function_hub}"
}

resource "aws_lambda_permission" "allow_cloudwatch_to_call_hub" {
    statement_id = "AllowExecutionFromCloudWatch"
    action = "lambda:InvokeFunction"
    function_name = "${var.apex_function_hub_name}"
    principal = "events.amazonaws.com"
    source_arn = "${aws_cloudwatch_event_rule.every_five_minutes.arn}"
}

Here we’re running the job every 5 minutes, but it should be relatively easy to see how you can change that frequency. See the Terraform and AWS Lambda documentation for all the possible options.

On complexity, user interface and abstractions

The above is undoubtedly powerful, and nicely solves the described problems with using cron. However, I feel it’s not all plain sailing with Serverless as a cron replacement.

Let’s talk about complexity. If I can make the assumptions that:

  • The underlying host I was going to run my cron job on is well managed, potentially by another team in my organisation
  • Hosts don’t suffer downtime, or if they do it’s occasional and they are brought back up quickly

Then I can just use cron. And the interface to cron looks something more like:

*/5 * * * * /home/garethr/

I still had to write my function (the Clojure code above) but I collapsed the configuration of three distinct AWS resources and the use of a new tool (Terraform) into a one-line crontab entry. You might have provisioned that cron job using Puppet or Chef which adds a few lines and a new tool, which sits somewhere between hand editing and the above example.

This is really a question of user interface design and abstractions. On one hand Serverless provides a nice high-level shared abstraction for developers (the function). On another Serverless requires a great deal of (virtual) infrastructure, which at the moment tends not to be abstracted from the end-user. In the simple case above I had to care about aws_cloudwatch_event_targets, aws_cloudwatch_event_rules and aws_lambda_permissions. The use of those non-abstract resources all couples my simple cron example to a specific platform (AWS) when the simple function could likely run on any Serverless platform that supports the JVM.

Serverless, not Infrastructureless

I’m purposefully nit-picking with the above example. Serverless does provide a more straightforward cron experience, mainly because it’s self-contained. But the user interface even for simple examples is still very raw in many cases. Importantly, in removing the concept of servers, we don’t appear to have removed the need to configure infrastructure, which I think is what many people think of when they hope to be rid of servers in the first place.

Conference speaking as a software vendor

While reviewing 100s of proposals for upcoming conferences (Velocity EU and PuppetConf) I tweeted the following, which seemed to interest a few folks.

I should write a blog post on “conference talk submissions for vendors/consultants”. Rule one: own your bias

This reply in particular pushed me over the edge into actually writing this post:

Would love to hear advice, been told more than once a talk was rejected since I work for a vendor, even though my talk was not a pitch.

Writing something up felt like it might be useful to a few folks, so here goes. This is but one person’s opinion, but I at least have some relevant experience:

  • I’ve worked for a software vendor for about 3 years
  • Before that I worked for the UK Government, and before that various SaaS companies, in-house software teams and very briefly myself
  • I’ve helped out on the programme committee or similar for a variety of events going back a bunch of years; Velocity, PuppetConf, Lisa, Devopsdays, QCon
  • I’ve been lucky enough to speak at a wide range of events in lots of different places

Some of the following advice is pretty general; if you’re interested in speaking at software conferences (at least those I’ve seen or spoken at) then hopefully much of this is relevant to you. But I also want to focus on people working for a software vendor. I think that’s particularly relevant as more and more traditional vendors court the open source community and software developers in particular, and roles like developer evangelist see lots of people moving from more practitioner roles to work vendor-side.

Learn to write a good proposal

My main experience with speaking at conferences, or helping curate content, is via a call-for-proposals. The conference will have some sort of theme or topic, set a date, and see what content is submitted. There are a variety of oft-made points here about sponsoring to get a talk slot, or submitting early, or approaching the organisers, but in my experience the best bet is to write a really good proposal around a good idea. That’s obviously easier said than done, but it’s not as hard as you may think.

  • Introduce the problem you’re talking about, but don’t make this the whole proposal.
  • State in the proposal what you’re going to talk about. Be specific. Don’t say “an open source tool” when you mean a specific tool which has a name. Don’t hide the details of the talk because you intend to reveal them when presenting.
  • Include explicitly what the audience will do or be able to do differently after seeing the talk. Go so far as to say “after this talk attendee will…”
  • Novelty really helps, as long as the topic is interesting and relevant. Given 100 proposals about deploying your web application with Docker, the one that says “and we do this for the International Space Station” is getting picked.
  • If you know this is going to be a popular topic for other proposals then you also need to convince the CFP committee why you’re the best person to present it.
  • Liberally use bullet points. This might be a personal thing but if I’m reading 200+ proposals making it quick to read helps get your point across.

Unless you write a good proposal around an interesting idea you probably won’t be accepted, whether you’re a software vendor or not. Try to get feedback on your proposals, especially from folks who have previously spoken at the event you’re hoping to present at.

Appreciate you’re selling something

If you work for a software vendor your job is to sell software. Not everyone likes that idea, but I think at least for the purposes of this blog post it stands true. It doesn’t matter if you’re an evangelist reporting into Marketing, or a software developer in Engineering or anything else - you stand to gain directly or indirectly from people purchasing your product. From the point of view of a conference you are thoroughly compromised when it comes to talking about your product directly. This can be disheartening as you likely work at the vendor and want to talk at the conference because you’re genuinely interested, even evangelical, about the product. You need to own that bias - there really is no way around it.

If you work for a vendor, and the CFP committee even thinks the proposal is about that product, you’ll probably not be given the benefit of the doubt.

And the answer is…

Probably the worst examples of vendor talks that do sometimes get accepted go something like this:

  • Introduce a problem people can relate to
  • Show some (probably open source) options for solving, focusing on the large integration cost of tying them together
  • Conclude with the answer being the vendors all-in-one product

I think for expo events or for sponsored slots this approach is totally valid, but for a CFP it bugs me:

  • The proposal probably didn’t explain that the answer was the commercial product sold by the vendor
  • The talk is unlikely to cover like-for-like competition; for instance it’s probably not going to spend much time referring to direct commercial competitors
  • As noted above, the presenter is thoroughly biased, which doesn’t make for a great conference talk

Please don’t do this. It’s an easy trap to fall into, mainly because you hopefully genuinely believe in the product you’re talking about and the problem you’re solving. If you really want talks like this, try to encourage your customers to give them - a real-world story version of this is much more interesting.

Talk about the expertise rather than the product

It’s not all doom for those working for software vendors and wanting to talk at conferences. The reality is that while you’re working on a discrete product you’re also likely spending a lot of time thinking about a specific domain. I spent a bunch of years using Puppet as a product, but I’ve spent more time while working at Puppet thinking about the nature of configuration, and about wildly heterogeneous systems. When I worked for the Government I interacted with a handful of departments and projects. At Puppet I’ve spoken with 100s of customers, and read trip reports on meetings with others. Working for a vendor gives you a different view of what’s going on, especially if you talk to people from other departments.

In my experience, some of the best talks from those working for software vendors can be quite meta, theoretical or future facing. You have the dual benefit of working in a singular domain (so you can go deep) and hopefully having access to lots of customers (so you can go broad).

Talks as a product design tool

As someone working for a vendor, a big part of my job is designing and building new features or new products. I’ve regularly found giving talks (including the time taken to think about the topic and put together the demos and slides) to be a great design tool. Assuming you’re already doing research like this, even in the background, pitching a talk on the subject has a few advantages:

  • It starts you writing for a public audience early in the design process
  • The talk being accepted, and the feedback from the talk, provide early input into the design process
  • The talk (or published slides) can encourage people thinking about similar things to get in touch

You can see this if you flick through talks I’ve given over the past few years. For instance What’s inside that container? and more recently Security and the self-contained unit of software provide some of the conceptual underpinnings for Lumogon. And The Dockerfile explosion - and the need for higher-level tools talk I gave at DockerCon led to the work on Puppet Image Build.

These talks all stand alone as (hopefully) useful and interesting presentations, but also serve a parallel internal purpose which importantly doesn’t exert the same bias on the content.

Some good examples

The above is hopefully useful theory, but I appreciate some people prefer examples. The following include a bunch of talks I’ve given at various conferences, with a bit of a rationale. I’ve also picked out a few examples of talks by other folks I respect who work at software vendors and generally give excellent talks.

A great topic for most vendors to talk about at suitable conferences is how they approach building software. I spoke about In praise of slow (Continuous Delivery) at Pipeline conference recently, about how Puppet (and other vendors) approach techniques like feature flags, continuous delivery and versioning but for packaged software. That had novelty, as well as being relevant to anyone involved with an open source project.

Probably my favourite talk I’ve given in the last year, The Two Sides to Google Infrastructure for Everyone Else looks at SRE, and infrastructure from two different vantage points. This talk came directly from spending time with the container folks on one hand, and far more traditional IT customers on the other, and wondering if they meet in the middle.

Charity Majors is the CEO at Honeycomb and likes databases a lot more than I do. The talk Maslow’s Hierarchy of Database Needs is just solid technical content from an expert on the subject. Closer to the Honeycomb product is this talk Observability and the Glorious Future, but even this avoids falling into the trap described above and stays focused on generally applicable areas to consider.

Jason Hand from VictorOps has given a number of talks about ChatOps, including this one entitled Infrastructure as Conversation. Note that some of the examples use VictorOps, but the specific tool isn’t the point of the talk. The abstract on the last slide is also a neat idea.

Bridget Kromhout works for Pivotal, the folks behind Cloud Foundry amongst other things. But she invariably delivers some of the best big picture operations talks around. Take a couple of recent examples I Volunteer as Tribute - the Future of Oncall and Ops in the Time of Serverless Containerized Webscale. Neither talk is about a specific product, instead both mix big picture transformation topics with hard-earned ops experience.

As a final point, all of those examples have interesting titles, which comes back to the first point above. Make content that people really want to listen to first, and if you’re a vendor own your biases.

Kubernetes configuration without the YAML

Tomorrow at KubeCon in Berlin I’m running a birds-of-a-feather session to talk about Kubernetes configuration. Specifically we’ll be talking about whether Kubernetes configuration benefits from a domain specific language. If you’re at the conference and this sounds interesting come along.

The problem

The fact the programme committee accepted the session proposal is hopefully a good indication that at least some other people in the community think this is an interesting topic to discuss. I’ve also had a number of conversations in person and on the internet about similar areas.

There are a number of other traces of concern with using YAML as the main user interface to Kubernetes configuration. Take this comment from Brian Grant of Google on the Kubernetes Config SIG mailing list, for instance:

We’ve had a few complaints that YAML is ugly, error prone, hard to read, etc. Are there any other alternatives we might want to support?

And this one from Joe Beda, one of the creators of Kubernetes:

I want to go on record: the amount of yaml required to do anything in k8s is a tragedy. Something we need to solve. (Subtweeting HN comment)

This quote from the Borg, Omega and Kubernetes paper in ACM Queue, Volume 14, issue 1 nicely sums up my feelings:

The language to represent the data should be a simple, data-only format such as JSON or YAML, and programmatic modification of this data should be done in a real programming language

This quote also points at the problem I see at the moment. The configuration and the management of that configuration are separate but related concerns. Far too many people couple these together, ultimately moving all of the management complexity onto people. That’s a missed opportunity in my view. The Kubernetes API is my favourite thing about the project; I’ve waxed lyrical about it allowing for different higher-level user interfaces for different users to all interact on the same base platform. But treating what is basically the wire format as a user interface is just needlessly messy.

But what advantages do we get using a programming language to modify the data? For me it comes down to:

  • Avoiding repetition
  • Combining external inputs
  • Building tools to enforce correctness (linting, unit testing, etc.)
  • The ability to introduce abstractions

It’s the last part I find most compelling. Building things to allow others to interact with a smaller domain specific abstraction is one way of scaling expertise. The infrastructure as code space I’ve been involved in has lots of stories to tell around different good (and bad) ways of mixing data with code, but the idea that data on its own is enough, without higher-level abstractions, doesn’t hold up in my experience.
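As a sketch of what an abstraction buys you, here is a toy Python function (the function name, defaults and label scheme are all invented for illustration) that generates similar Deployment definitions from a couple of inputs instead of copy-pasting YAML:

```python
import json

def deployment(name, image, replicas=2, port=80):
    """A tiny abstraction: build a Kubernetes Deployment from a few inputs."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1beta1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [
                    {"name": name, "image": image,
                     "ports": [{"containerPort": port}]}
                ]},
            },
        },
    }

# Generate several similar Deployments without repeating ourselves
configs = [deployment(n, "nginx:1.13") for n in ("web", "api", "admin")]
print(json.dumps(configs[0], indent=2))
```

Once the configuration is just data produced by code, linting, unit testing and combining external inputs are all ordinary programming tasks.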

What can I use instead?

Luckily various people have built tools in this space. I'm not sure there could or should be a single answer to the question (whether there should be a default is a harder question to answer) but the following definitely all show what's possible.

Obviously I wrote one of these so I'm biased, but different tools work for different people and in different contexts. For example Skuber looks nice, but I mainly don't like Scala. And I've been using Jsonnet for Packer templates recently with success, so I'm very interested in kubecfg, which provides a nice Kubernetes wrapper to that tool.

Ultimately this is still a developing space, but compared to a year ago it is definitely moving. I hope the default for managing Kubernetes configuration slowly but surely switches away from hand-rolling data. Towards what, only time and people reporting what works for them will tell.

Republishing service manual content

One of the many things I did some work on while at GDS back in 2013 was the Government Service Design Manual. This was intended to be a central resource for teams across (and outside) Government about how to go about building, designing and running modern internet-era services. It was a good snapshot of opinions from the people that made up GDS on a wide range of different topics. Especially for the people who spent time in the field with other departments, having an official viewpoint published publicly was hugely helpful.

Recently the Service Manual got a bit of a relaunch, but unfortunately this involved deleting much of the content about operations and running a service. Even more unfortunately, the service manual is now described as something that:

exists to help people across government build services that meet the Digital Service Standard and prepare for service assessments.

So in basic terms it’s refocusing on helping people pass the exam rather than being all about learning. Which is a shame. Compare that with the original intent:

Build services so good that people prefer to use them

However, all that content is not lost. Luckily the content is archived on GitHub and was published under the terms of the Open Government Licence, which allows anyone to "copy, publish, distribute and transmit the Information". So I'm choosing to republish a few of the pieces I wrote and found useful when talking to and assisting other Government departments. These represent an interesting snapshot from a few years ago, but I think they mainly stand the test of time, even if I'd change a few things if I wrote them today.

What is Devops?

This post was originally written as part of the Government Service Design Manual while I was working for the UK Cabinet Office. Since my original version in 2013 it has been improved upon by several others. I'm republishing it here under the terms of the Open Government Licence.

Devops is a cultural and professional movement in response to the mistakes commonly made by large organisations. Often organisations will have very separate units for:

  • development
  • quality assurance
  • operations
  • business

In extreme cases these units may be:

  • based in different locations
  • working for different organisations
  • under completely different management structures

Communication costs between these units, and their individual incentives, lead to slow delivery and a mountain of interconnected processes.

This is what Devops aims to correct. It is not a methodology or framework, but a set of principles and a willingness to break down silos. Specifically Devops is all about:

Culture

Devops needs a change in attitude so shared ownership and collaboration are the common working practices in building and managing a service. This culture change is especially important for established organisations.

Automation

Many business processes are ready to be automated. Automation removes manual, error-prone tasks – allowing people to concentrate on the quality of the service. Common areas that benefit from automation are:

  • release management (releasing software)
  • provisioning
  • configuration management
  • systems integration
  • monitoring
  • orchestration (the arrangement and maintenance of complex computer systems)
  • testing

Measurement

Data can be incredibly powerful for implementing change, especially when it's used to get people from different groups involved in the quality of the end-to-end service delivery. Collecting information from different teams and being able to compare it across former silos can drive change on its own.

Sharing

People from different backgrounds (ie development and operations) often have different, but overlapping skill sets. Sharing between groups will spread an understanding of the different areas behind a successful service, so encourage it. Resolving issues will then be more about working together and not negotiating contracts.

Why Devops

The quality of your service will be compromised if teams can’t work together, specifically:

  • those who build and test software
  • those that run it in production

The root cause is often functional silos; when one group owns a specific area (say quality) it’s easy for other areas to assume that it’s no longer their concern.

This attitude is toxic, especially in areas such as:

  • quality
  • release management
  • performance

High quality digital services need to be able to adapt quickly to user needs, and this can only happen with close collaboration between different groups.

Make sure the groups in your team:

  • have a shared sense of ownership of the service
  • have a shared sense of the problem
  • develop a culture of making measurable improvements to how things work

Good habits

Devops isn't a project management methodology, but these good habits are worth adopting in your organisation. While not unique to Devops, they help with breaking down silos when used with the above principles:

  • cross-functional teams – make sure your teams are made up of people from different functions (this helps with the team owning the end-to-end quality of service and makes it easier to break down silos)
  • widely shared metrics – it's important for everyone to know what 'good' looks like, so share high- and low-level metrics as widely as possible as it builds understanding
  • automating repetitive tasks – use software development to automate tasks across the service as it:
    • encourages a better understanding of the whole service
    • frees up smart people from doing repetitive manual tasks
  • post-mortems – issues will happen so it’s critical that everyone across different teams learns from them; running post-mortems (an analysis session after an event) with people from different groups is a great way of spreading knowledge
  • regular releases – the capacity for releasing software is often limited in siloed organisations, because the responsibilities of the different parts of the release are often spread out across teams – getting to a point where you can release regularly (even many times a day) requires extreme collaboration and clever automation

Warning signs

Like agile, the term Devops is often used for marketing or promotional purposes. This leads to a few common usages, which aren’t necessarily in keeping with what’s been said here. Watch out for:

  • Devops tools (nearly always marketing)
  • a Devops team (in many cases this is just a new silo of skills and knowledge)
  • Devops as a job title (you wouldn’t call someone “an agile”)

Further reading

Agile and IT service management

This post was originally written as part of the Government Service Design Manual while I was working for the UK Cabinet Office. Since my original version in 2013 it has been improved upon by several others. I'm republishing it here under the terms of the Open Government Licence.

The Digital by Default standard says that organisations should (emphasis on operate added):

Put in place a sustainable multidisciplinary team that can design, build and operate the service, led by a suitably skilled and senior service manager with decision-making responsibility.

This implies a change to how many organisations have traditionally run services, often with a team or organisation building a service separate from the one running it. This change however does not mean ignoring existing good practice when it comes to service management.

Agile and service management

The principles of IT service management (ITSM) and those of agile do not necessarily conflict. Issues can arise, however, when organisations implement rigid processes without considering wider service delivery matters, or design and build services without thinking about how they will be operated.

The agile manifesto makes the case for:

  • Individuals and interactions over processes and tools
  • Working software over comprehensive documentation
  • Customer collaboration over contract negotiation
  • Responding to change over following a plan

It is all too easy to position service management as opposed to agile, since traditional service management practices can be viewed as focusing on processes, tools, documentation, planning and contract negotiation – the items on the right-hand side of the points above.

However, the agile manifesto goes on to say:

That is, while there is value in the items on the right, we value the items on the left more.

To build and run a successful service you will need to work on suitable processes and manage third party relationships. Using existing service management frameworks (especially as a starting point) is one approach to this problem.

ITIL

ITIL (the Information Technology Infrastructure Library) is one such framework. ITIL does a particularly good job of facilitating a shared language. For instance its definition of a service is:

A service is a means of delivering value to customers by facilitating outcomes customers want to achieve.

The current version of ITIL provides 5 volumes and 26 processes describing in detail various aspects of service management:

Service Strategy

  • IT service management
  • Service portfolio management
  • Financial management for IT services
  • Demand management
  • Business relationship management

Service Design

  • Design coordination
  • Service catalogue management
  • Service level management
  • Availability management
  • Capacity management
  • IT service continuity management
  • Information security management
  • Supplier management

Service Transition

  • Transition planning and support
  • Change management
  • Service asset and configuration management
  • Release and deployment management
  • Service validation and testing
  • Change evaluation
  • Knowledge management

Service Operation

  • Event management
  • Incident management
  • Request fulfilment
  • Problem management
  • Access management

Continual Service Improvement


ITIL also describes four functions that should cooperate to form an effective service management capability.

  • Service operations
  • Technical management
  • Application management
  • Operations management

The importance of implementation

The above processes and functions make for an excellent high level list of topics to discuss when establishing an operating model for your service, whether or not you adopt the formal methods. In many cases if you have well understood, well established and well documented processes in place for all of the above you should be in a good position to run your service.

When looking to combine much of the rest of the guidance on the service manual with ITIL or other service management frameworks it is important to challenge existing implementations. This is less about the actual implementation and more about the problems that implementation was originally designed to solve.

An example – service transition

As an example ITIL talks a great deal about Service Transition – getting working functionality into the hands of the users of the service. This is a key topic for the Digital Service Standard too, which says that teams should:

Make sure that you have the capacity and technical flexibility to update and improve the service on a very frequent basis.

GOV.UK for instance made more than 100 production releases during its first two weeks after launch.

This high rate of change tends to challenge existing processes designed for a slower rate of change. If you are releasing your service every month or every 6 months then a manual process (like a weekly or monthly in-person change approval board or CAB) may be the most suitable approach. If you’re releasing many times a day then the approach to how change is evaluated, tested and managed tends towards being more automated. This moves effort from occasional but manual activities to upfront design and automation work. More work is put in to assure the processes rather than putting all the effort into assuring a specific transition.

Service management frameworks tend to acknowledge this, for instance ITIL has a concept of a standard change (something commonly done, with known risks and hence pre-approved), but a specific implementation in a given organisation might not.

Other frameworks exist

It is important to note that other service management frameworks and standards exist, including some of a similar size and scope to ITIL.

Many organisations also use smaller processes and integrate them together. The needs of your service and organisation will determine what works best for you.

Problematic concepts

Some traditional language tends to cause confusion when discussing service management alongside agile. It's generally best to avoid the following terms, although given their widespread usage this isn't always possible. It is worth being aware of the problems these concepts raise.

Projects

Projects tend to imply a start and an end. The main goal of project work is to complete it, to reach the end. Especially for software development the project can too often be viewed as done when the software is released. What happens after that is another problem entirely – and often someone else’s problem.

However when building services the main goal is to meet user needs. These needs may change over time, and are only met by software that is running in production and available to those users.

This doesn’t mean not breaking work down into easily understandable parts, but stories, sprints and epics are much more suited to agile service delivery.

Business as usual

The concept of business as usual also clashes with a model of continuous service improvement. It immediately brings to mind before and after states, often with the assumption that change is both much slower and more constrained during business as usual. In reality, until you put your service in the hands of real users as part of an alpha or beta you won’t have all the information needed to build the correct service. And even once you pass the live standard you will be expected to:

continuously update and improve the service on the basis of user feedback, performance data, changes to best practice and service demand

Further reading

User stories for web operations teams

This post was originally written as part of the Government Service Design Manual while I was working for the UK Cabinet Office in 2013. I'm republishing it here under the terms of the Open Government Licence.

This document outlines the typical scope of infrastructure and web operations (sometimes erroneously referred to as hosting) work on a large service redesign project.

The sample list of user stories provided is not intended to be a complete list of all areas of interest, nor are you likely to need all of them for every service. The idea is for this list to be a good starting place from which you can write additional stories, delete ones you do not require and split stories into smaller ones. Importantly, you also need to provide your own acceptance criteria specific to the needs of your service.

Remember these stories are a placeholder for a conversation. For some contexts, that conversation will be ‘this does not apply to my service’ – that is fine. But there will almost certainly be other stories not listed here which do apply.

The problem

An issue we have observed on a number of projects is a lack of understanding early on in a project about the work required to run a large online service. Often this is placed under hosting and is investigated too late in the process.

Intended audience

The hosting of a complex and sensitive software application requires a team of people with specialist skills to design, setup and operate. Because this work is generally not user facing and can be highly technical it is sometimes easy to leave until later – with potentially dire consequences for launching safely and on time.

Service managers

Does your team have people who deeply understand this topic? If you are not an expert then it is important to involve people permanently in the team who are. They can explain the technical trade offs and decisions which may affect your service.

Delivery managers

As well as understanding the potentially large scope of work, many of the areas discussed here have lead times associated with third parties. The earlier stories related to these topics are brought into project backlogs, the sooner estimates can be made and deadlines understood.


The following stories are intended to provide a starting point for any project, rather than be a complete set. Individual projects would be expected to take and modify stories as needed and importantly to apply their own acceptance criteria specific to their requirements.

The majority of these stories are from the point of view of developers, web operations engineers and the responsible service manager. Although not ideal, for this particular technical topic this works reasonably well. Feel free to change the focus when using them in your backlog.


Development process
As a developer working on the service
So that we can ensure a high level of quality
And so we can maximise the integrity of the source code
I want a well documented and understood development process

Out-of-hours support
As the service manager responsible for the service
So that we can ensure a suitable level of availability and integrity
I want to understand the requirement for Out-of-hours support

Disaster recovery
As the service manager responsible for the service
So that in the event of a disaster everyone doesn’t panic and make things up
I want a clear disaster recovery plan in place to deal with different types of catastrophic event

Release process
As the service manager responsible for the service
So that the service can be changed on a very frequent basis
And so that changes do not cause problems for users
I want a well documented and understood release process

Security response
As the service manager responsible for the service
So that security incidents are handled with extra care
And so that the service meets its wider Government obligation to GovCert
I want a well documented and understood security incident process

Helpdesk
As the service manager responsible for the service
So that communication with users is done in a joined up way
I want a central helpdesk function to deal with events, incidents and requests

Request Management
As the service manager responsible for the service
So that questions from users can be dealt with efficiently
I want a clear information request management policy

Event Management
As the service manager responsible for the service
So that likely events that could affect the running of the service can be dealt with smoothly
I want a clear event management policy

Incident Management
As the service manager responsible for the service
So that problems that arise with that service can be dealt with efficiently
I want a clear incident management policy

Operations manual
As the service manager responsible for the service
So that information about the running of the service is not kept in individuals’ heads
And so information is readily available to people running the service
I want a single place to store content for a service operations manual

Shared services

Source code hosting
As a developer working on the service
So we have somewhere to securely store our source code
I want access to a central source code hosting service or repository

Continuous Integration
As a developer working on the service
So we can ensure a high level of quality in the code
And so we can minimise the time needed for regression testing
I want a Continuous Integration environment which automatically runs tests against every commit

External DNS
As a web operations engineer
So that visitors to the service don’t need to remember an IP address that will change
I want a process and supplier relationship to manage external DNS addresses


Sensitivity of source code
As a developer working on the service
So that I understand the controls that need to be in place
And so that I know with whom and how I may share it
I want a clear policy around the sensitivity of source code

Third party code
As a developer working on the service
I want a clear policy around use of third party source code libraries
So that I do not introduce unknown security problems

Change evaluation
As the service manager responsible for the service
So that I can release changes to production quickly
And so that we can meet our obligation to the Digital by Default Service Standard
I want a documented process for evaluating and deciding on a change to the production service

Access control
As the service manager responsible for the service
So that the confidentiality, integrity and availability of the service isn’t compromised
And so that suitable technical controls can be put in place to enforce it
I want a clear policy on who has access to what on the production system

Separation of duties
As the service manager responsible for the service
So that we can ensure the service has enough people in the right roles
I want to understand any required separation of duties (whether driven by legislation or security concerns)

As the service manager responsible for the service
So that security clearances can be arranged early in the project to avoid access restrictions later on
I want to know what level of clearances are required for different roles (including third parties)

Releasing open source
As a developer working on the service
So that I do not introduce unknown security problems
And so that we can meet our obligation to the Digital by Default Service Standard
I want a clear policy around releasing code as open source


Government networks
As a technical architect
So that the right suppliers are contracted
And so that long lead times are factored into the project plan early
I want to know whether the service requires access to a Government network like the PSN or GSI

Multiple infrastructure providers
As the service manager for this service
So that I understand the intended availability constraints
I want to know whether multiple suppliers of Infrastructure are required

Capacity planning
As a web operations engineer
So that we can estimate the number and size of infrastructure components (instances, firewalls, load balancers etc.)
And so that resource based costs can be estimated
I want to carry out some capacity planning activities

Network architecture
As a technical architect
So that I can build out a production environment to an agreed specification
I want a network architecture design


Web servers
As a web operations engineer working on the service
So that we can serve HTTP requests
And so we can proxy requests to application servers
I want to install and configure a web server

Database servers
As a web operations engineer working on the service
So that data can be stored in a manner befitting its structure
And so the stored data can be queried as quickly as required
I want to install and configure a suitable database server

As a web operations engineer working on the service
So that data can still be read even during a failure of a single database server
I want to configure some failover or other redundancy mechanism for the database

As a web operations engineer working on the service
So that data can still be written even during a failure of a single database server
I want to configure some failover or other redundancy mechanism for the database

Load balancers
As a web operations engineer working on the service
So that web requests can still be served even with the failure of one or more web servers
I want to install and/or configure a load balancer

Internal DNS
As a web operations engineer working on the service
So that we can easily address our services and instances
I want to install and/or configure a mechanism to manage internal DNS

Database backups
As the service manager for the service
So that we can recover from a large failure of our database infrastructure
I want regular automated backups to be taken of the data stored in the database

As the service manager for the service
So that we can recover from a large failure of a single supplier's infrastructure
I want regular automated backups to be stored off site

HTTP cache
As a web operations engineer working on the service
So that the service remains fast when serving identical content
And so load is minimised on the application servers
I want to install an HTTP cache

Email gateway
As a developer working on the service
So that the service can send email to administrators or end users
I want to setup and configure a suitable email gateway

Application servers
As a developer working on the service
So that the code I write can be run on server instances
I want to install and configure a suitable application server

Internal package repository
As a web operations engineer working on the service
So that we can use software not available in our operating system repositories
And so that we can use its security, dependency management and versioning features
I want to install and configure an internal package repository

Artifact repository
As a developer working on the service
So that we can share and version individual code components that need it
I want to install and configure an artifact repository

Message queue
As a developer working on the service
So that I can easily and efficiently process work asynchronously
I want to install and configure a suitable message queue or work queue system

Search server
As a developer working on the service
So that I can quickly and efficiently search through large amounts of data
I want to install and configure a suitable search engine

Object cache
As a developer working on the service
So that I can minimise the number of queries to the database
And so that I can keep the service fast and responsive to users
I want to install and configure an object caching system


Metric collection service
As a web operations engineer working on the service
So that we can collect large numbers of time series metrics from the running service
I want to install and configure a metric collection system

Application running monitoring checks
As a web operations engineer working on the service
So that we can run checks against metrics from the metrics system
And so that we can run active checks based on arbitrary code
I want to install and configure a monitoring system

Smoke tests
As a developer working on the service
So that I know that I haven’t broken anything when deploying my application
I want a series of smoke tests to be run after all deployments
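
As one illustration of what such smoke tests might look like (the URLs here are hypothetical, and a real service would check its own key journeys), a minimal sketch:

```python
import urllib.request

# Hypothetical endpoints for illustration only.
SMOKE_URLS = [
    "https://service.example/healthcheck",
    "https://service.example/",
]

def smoke_test(urls, fetch=None):
    """Return a list of (url, ok) pairs. `fetch` takes a URL and
    returns an HTTP status code; it is injectable so the check
    itself can be unit tested without a network."""
    if fetch is None:
        fetch = lambda url: urllib.request.urlopen(url, timeout=5).status
    results = []
    for url in urls:
        try:
            ok = fetch(url) == 200
        except Exception:
            # A connection error or timeout counts as a failure.
            ok = False
        results.append((url, ok))
    return results
```

Run from the deployment tooling immediately after a release, a failing check would mark the deploy as broken (and typically trigger a rollback) before users notice.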

Application metrics
As a developer working on the service
So that I can gain visibility of how my application is running in production
And so we can find and fix problems with it quickly
I want a simple way of instrumenting my application to feed metrics to the metrics system
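
One common way to satisfy this story (an assumption on my part; the manual doesn't prescribe a tool) is to emit metrics over the StatsD plain-text protocol, which is simple enough to sketch directly:

```python
import socket

def statsd_line(name, value, metric_type="c"):
    """Format a metric in the StatsD plain-text protocol,
    e.g. counters ("c") and timers ("ms")."""
    return f"{name}:{value}|{metric_type}"

def send_metric(name, value, metric_type="c",
                host="127.0.0.1", port=8125):
    # StatsD listens on UDP, so sends are fire-and-forget and add
    # almost no latency to the instrumented code path.
    line = statsd_line(name, value, metric_type)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(line.encode("ascii"), (host, port))
    sock.close()
```

In practice you'd use an existing client library, but the protocol's simplicity is exactly why instrumenting an application can be a one-line affair.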

System metrics
As a web operations engineer working on the service
So that we can identify and fix problems with the system, ideally before they occur
I want to set up collection of low-level system metrics like load, disk, network IO, etc.

Security monitoring
As a web operations engineer working on the service
So that we notice quickly and are alerted to any incidents with a security flavour
I want to configure suitable security monitoring tools

As a web operations engineer or developer supporting the service
So that I know about any issues as they happen
I want to set up suitable notifications from the monitoring system

Transactional monitoring
As a developer working on a transactional service
So that we can block fraudulent or otherwise suspect transactions
I want to install and configure a transactional monitoring system with suitable rules

External monitoring
As the service manager for the service
So that we still have monitoring even in the event of a failure of the monitoring system
And so that the service is monitored from outside our local network
I want an external monitoring capability with basic checks to monitor service uptime

Monitoring data feed from infrastructure provider
As a web operations engineer working on the service
So that I am aware of problems in the hypervisor, physical or network infrastructure
I want a feed of monitoring data from the Infrastructure supplier


Log collection
As a web operations engineer working on the service
So that I can easily see everything that is happening in specific applications
I want to collect all the logs from applications running on the same host in one place

Log aggregation
As a web operations engineer working on the service
So that I don’t have to go to an individual machine to view its logs
I want all logs from all machines to be aggregated together

Log storage
As a web operations engineer working on the service
So that logs can be kept for a suitable period of time
I want to provision enough storage for log archiving

Log viewing
As a web operations engineer working on the service
So that I can see what is happening across the infrastructure
I want a mechanism for viewing and searching logs in as near real time as possible

As a developer working on the service
So that I can extract information from logs to aid with improving the service
I want a mechanism to run queries across the aggregated logs

Configuration management

Configuration management client
As a web operations engineer working on the service
So that changes to server configuration can be made safely and quickly
I want to install software to manage server configuration

Configuration management database
As a web operations engineer working on the service
So that configuration changes are tracked over time
And so that the current state is available to query
I want to install software to manage a configuration management database

Configuration management server
As a web operations engineer working on the service
So that all nodes do not have all configuration information
I want to install software to allow centralised management of Configuration management code


Configuration management code deployment mechanism
As a web operations engineer working on the service
So that configuration changes can be made safely and in an auditable manner
I want a deployment process and tooling for configuration management code

Application deployment mechanism
As a developer working on the service
So that changes to applications can be made available to users
And so that changes are made in a safe and auditable manner
I want a deployment process and tooling for application code

Release tracking
As the service manager for the service
So that we have an auditable log of what was changed, when and by whom
I want an up-to-date list of releases to be maintained

System packages
As a web operations engineer working on the service
So that we don’t have to compile customised applications from source before using them
And so we can take advantage of dependency and version management capabilities of the OS
I want a process and tooling for creating our own system packages

Orchestration
As a web operations engineer working on the service
So that I can run commands across multiple instances quickly
I want tooling in place which allows some orchestration based on the current instances

Database migrations
As a web operations engineer working on the service
So that I can have confidence that database migration scripts will work when applied to production
I want database migrations to be deployed through the same sequence of environments as code changes
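As an illustrative sketch of the tooling this story implies (the migration names and SQL below are hypothetical, not from any real service), a minimal runner records which migrations have already been applied, so the same ordered sequence can be replayed in each environment:

```python
import sqlite3

# Hypothetical ordered migrations; in practice these would live as files in
# version control alongside the application code.
MIGRATIONS = [
    ("001_create_users", "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)"),
    ("002_add_email", "ALTER TABLE users ADD COLUMN email TEXT"),
]

def apply_migrations(conn):
    """Apply any migrations not yet recorded, in order, and return the applied list."""
    conn.execute("CREATE TABLE IF NOT EXISTS schema_migrations (name TEXT PRIMARY KEY)")
    applied = {row[0] for row in conn.execute("SELECT name FROM schema_migrations")}
    for name, sql in MIGRATIONS:
        if name not in applied:
            conn.execute(sql)
            conn.execute("INSERT INTO schema_migrations (name) VALUES (?)", (name,))
    conn.commit()
    return [row[0] for row in conn.execute("SELECT name FROM schema_migrations ORDER BY name")]
```

Because applying is idempotent, running the same runner against preview, staging and production in turn gives some confidence the scripts will behave the same when they reach production.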

Management of secrets
As a web operations engineer working on the service
So that I can ensure confidential communication between particular parts of the system
I want a process or tool for managing secrets such as keys and passwords

Access control

End user devices
As the service manager responsible for the service
So that management access to the infrastructure can be locked down to prevent unauthorised access
I want to know what kind of protection the management end user devices require

User directory
As a web operations engineer
So that we do not have to maintain multiple lists of privileged users
And so that users can be added and removed once in a central fashion
I want to install and configure something to provide a single user directory

Key based authentication
As a web operations engineer
So that we are not vulnerable to password based login attempts to individual servers
I want to set up public key based authentication

Single sign-on
As a web operations engineer
So that any third party web interfaces we use can be accessed via a single login
I want to install and configure a single sign-on systems

Network/VPN configuration
As a web operations engineer
So that management functions can not be accessed via the public internet
And so that we reduce the surface area for attack
I want to restrict management access to a VPN and/or non-public restricted network


Other environments
As the service manager for the service
So that I can see the very latest working version of the service at any time
And so I can share that with people in and outside the team
I want a preview environment to be provisioned which is similar to production

Staging environment
As a web operations engineer working on the service
So that we have a clean environment in which to test production deployments
And so that we have a secure environment to test with production-like data
I want to provision a staging environment which mimics production as closely as possible

Production environment
As a web operations engineer working on the service
So that the service can launch to the public
I want to provision a production environment

Base image(s)
As a web operations engineer working on the service
So that all server instances start out with sensible security settings
I want to create a base image running the chosen operating system with hardened configuration

Public network interfaces
As a web operations engineer working on the service
So that the application only receives wanted traffic from the internet
And so that we don’t accidentally expose sensitive or insecure components of the system
I want to configure and test the public network interfaces for the system

Private network configuration
As a web operations engineer working on the service
So that individual internal components can only talk with known parts of the system
And so we limit the extent of any security breach
I want to configure and test the private network interfaces for the system

Network codes of connection
As a web operations engineer working on the service
Given I need to communicate with a system only available on a Government network
So that the two systems can talk with each other
I want to meet the code of connection requirements and configure access to the network

Management network
As a web operations engineer working on the service
So that network traffic used to manage the infrastructure is separate from public traffic
And so we can monitor irregularities in network traffic separately
I want to configure a separate management network

Platform load balancers
As a web operations engineer working on the service
So that we can reduce the number of single points of failure
And so that we can scale out to deal with a large amount of traffic
I want to provision load balancers to distribute traffic between multiple instances
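To illustrate the behaviour being asked for (a sketch, not how any particular platform load balancer works internally), a round-robin balancer distributes requests across instances and skips any marked unhealthy:

```python
from itertools import cycle

class RoundRobinBalancer:
    """Cycle requests across backends, skipping any marked as down."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._ring = cycle(self.backends)

    def mark_down(self, backend):
        """Remove a backend from rotation (e.g. after a failed health check)."""
        self.healthy.discard(backend)

    def pick(self):
        """Return the next healthy backend, raising if none remain."""
        for _ in range(len(self.backends)):
            backend = next(self._ring)
            if backend in self.healthy:
                return backend
        raise RuntimeError("no healthy backends available")
```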

Platform firewalls
As a web operations engineer working on the service
So that unwanted traffic can be filtered before it enters our virtual infrastructure
I want to configure the external facing IaaS firewalls to only allow certain traffic

Dynamic environments
As a web operations engineer working on the service
So that we are not constrained by a fixed number of environments
And so we can easily run full stack tests or experiments
I want to be able to easily provision an environment running the full service

Elastic scaling
As a web operations engineer working on the service
So that the service can automatically deal with unexpected increases in traffic
I want to configure tooling to automatically scale the number of instances based on load
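At the heart of most autoscaling tooling is a simple proportional calculation like the following sketch (the target utilisation and instance bounds are made-up numbers, and real schedulers add cooldowns and smoothing):

```python
import math

def desired_instances(current, avg_cpu, target_cpu=60.0, min_n=2, max_n=20):
    """Scale the instance count proportionally so average CPU moves toward the target,
    clamped to sensible minimum and maximum fleet sizes."""
    if avg_cpu <= 0:
        return min_n
    wanted = math.ceil(current * avg_cpu / target_cpu)
    return max(min_n, min(max_n, wanted))
```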

Security controls

Operating system hardening
As a web operations engineer
So that we are making use of built-in operating system security controls
I want to automate a default set of hardening rules for our chosen operating system

Malware detection
As a web operations engineer
So that instances which may be compromised can be dealt with quickly
I want to automate the detection of potential malware

Intrusion detection
As a web operations engineer
So that instances which are being attacked or probed can defend themselves
I want to configure an intrusion detection and prevention system

Virus scanning
As a web operations engineer
So we can be sure that files in the system don’t have viruses
I want to install virus scanning for files passing a network boundary

Host firewalls
As a web operations engineer
So that the surface area for attack is limited
And so that services which should only be available locally aren’t exposed on the internet
I want to install and configure a local firewall

On instance event auditing
As a web operations engineer
So that I know when things like logins or other sensitive events happen on instances
I want to set up some auditing of events
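One piece of this is extracting sensitive events from logs already on the instance. As a hedged example (real auth log formats vary by distribution and sshd configuration, and the pattern below is only illustrative), successful SSH logins might be pulled out like so:

```python
import re

# Pattern for OpenSSH "Accepted" lines in a syslog-style auth log; adjust for
# the actual log format in use.
LOGIN_RE = re.compile(r"sshd\[\d+\]: Accepted (\w+) for (\S+) from (\S+)")

def extract_logins(lines):
    """Return (method, user, source_ip) tuples for successful SSH logins."""
    return [m.groups() for line in lines if (m := LOGIN_RE.search(line))]
```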

Rate/connection limiting
As a web operations engineer
So that large spikes in traffic from a single source don’t overwhelm the application
I want to configure some level of rate and connection limiting for web requests
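In practice this is usually configured in the web server or load balancer, but the underlying idea is typically a token bucket; a minimal sketch (parameter values are illustrative):

```python
import time

class TokenBucket:
    """Per-source token bucket: allow() returns False once the burst is spent,
    with tokens refilling over time at a fixed rate."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = burst       # maximum burst size
        self.tokens = float(burst)
        self.clock = clock          # injectable for testing
        self.last = clock()

    def allow(self):
        """Consume one token if available; return whether the request is allowed."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```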

Secure storage of key material
As a web operations engineer
So that any highly sensitive cryptographic keys are not lost, resulting in a compromise
I want to have a mechanism in place to securely store key material

Third party DDoS protection
As a web operations engineer
So that the site does not go down under a denial of service attack
I want to purchase and/or configure a level of DDoS protection


Performance testing
As the service manager responsible for the service
So that we know the service will be fast and responsive under realistic traffic
I want to be able to run a comprehensive performance test suite against the service

As a developer working on the service
So that we know changes to the code do not negatively affect performance
I want the performance test suite to run as part of the continuous integration system

Load testing
As the service manager responsible for the service
So that we know the service will still be working under larger amounts of traffic than are expected
I want to be able to run a comprehensive load test suite against the service

Application penetration testing
As the service manager responsible for the service
So that the service does not get compromised due to a vulnerability
And so we meet our accreditation obligations
I want to run a suitable number of penetration tests against the applications under development

As the service manager responsible for the service
So that the service does not get compromised due to a vulnerability
And so we meet our accreditation obligations
I want to run a suitable number of penetration tests against third party installed applications used as part of the service

Infrastructure penetration testing
As the service manager responsible for the service
So that the service does not get compromised due to a vulnerability
And so we meet our accreditation obligations
I want to run a suitable number of penetration tests against the infrastructure configuration

Operating system

Operating system selection
As a web operations engineer working on the service
So that we have a clear path to receiving security updates
And so we can more easily find support for our systems
I want to select and install a suitable default operating system for the service

File systems
As a web operations engineer working on the service
So that we get the best possible performance and reliability from the disk
I want to select a suitable file system and partition layout

Resource isolation
As a web operations engineer working on the service
So that noisy applications cannot affect other applications on the instance
I want to be able to isolate running applications from each other in terms of memory and CPU

Read-only file systems
As a web operations engineer working on the service
So that I can protect against files being changed due to compromises in the application
I want to be able to configure a read-only file system if appropriate

Working for a software vendor

One of the reasons I moved to Puppet two and a bit years ago was that I was interested in the software industry. In particular I was interested in being on the vendor side for a while. My background is mainly as a service provider, software as a service, in-house developer/ops type person. This has definitely been an interesting experience, but I’ve not really tried to explain why until now.

First, what do we mean by vendor?

a person or company offering something for sale, especially a trader in the street.

So in the context of a software vendor we specifically mean:

a person or company offering software for sale

Note that we’re selling the software, not access to some service provided by software (i.e. SaaS). SaaS and other as-a-service models are a growing part of the industry, but the business model, development cycle, company structure and other aspects are quite different in my experience, though lots of hybrid models exist too.

Economics and scale

One of the interesting aspects of the software vendor world is the economics, the revenues, and the fact lots of companies are public. This in turn means a large amount of VC money goes into trying to create another large software vendor, because the potential payout is huge.

Take a sample set of companies from the last 10 years or so that are still private: Docker, Puppet, Chef, MongoDB, Elastic, CoreOS, Mesosphere, Weave, Cloudera, etc. Somewhat biased towards my own interests I’ll admit.

Now take a sample of large, public, software vendors: Oracle, Microsoft, CA, SAP, Sage, BMC, VMware. Not counting companies like Intel, Cisco, IBM, Dell (no longer public), and HP with huge software portfolios.

Let’s pick on Sage, a UK software company selling accounting software. As of 2014 Sage had 1169 people in software development R&D roles and they made $1.6billion from software and related services in 2015. That’s probably about the (order of magnitude) number of people employed in R&D roles in the above private software companies. The revenue is (and I’m guessing here) a bit higher at Sage than those companies combined too. SAP is an order of magnitude larger, both in terms of people (18908 in 2014) and revenues ($18billion also in 2014). Oracle revenues were $38billion as another data point.

So all the cool (or not so cool) companies from the past 10 years or so are a rounding error compared to the size of the industry. But you wouldn’t know that from reading Hacker News or other parts of the internet. This disconnect is a constant source of interest to me as I spend time with Puppet customers and with the wider infrastructure community at conferences and the like.

A world of difference

My gut feeling is that most people working as software developers, designers, product managers, etc. don’t work for software vendors, apart from maybe in localised areas like Silicon Valley. But because of the money, scale and PR spend of the big players, a great deal of press interest centres on vendors. Docker is probably the best current example of this but it’s more general than one company. This makes what happens in software-vendor-startup-land more visible to everyone else than, say, IT reality in large financial companies.

At the heart of a good software company is a product being built and maintained by a team of engineers, designers, managers, etc. In many ways this is similar to many people’s experience of building software (whether at work or at home as part of one open source project or another). But the support surrounding this tends to vary greatly from other areas. A dedicated marketing and product marketing team, dedicated sales staff, a professional services function, training, documentation and public relations personnel are all required to turn the software into revenue. And importantly these teams have to work closely together, and be actively involved with the development of the product.

This is very different from an in-house development position, but it’s also quite different from most SaaS operations. SaaS tends (generalising here) to be based around large numbers of individual users with monthly recurring revenues of 10s or 100s of US dollars. Software vendors selling to large enterprises tend to be looking at single large deals of 10s of thousands to many millions of dollars. This tends to mean large differences in total number of customers, revenue per customer, time needed to close a deal, requirement for staff local to a customer, etc. All of that makes for a very different operation and feedback cycle.

Some interesting observations

Software has a much longer shelf-life in the real world than people typically think on the internet. Take the datacenter automation market. This IDC report for example pegs the market at $2.3billion in 2015. VMware takes the lion’s share with roughly 30%, with BMC at 10%. For reference Puppet has 3.2% and Chef 1.2%. Obviously this is just one report, and it’s now a year old, but it’s an interesting data point. And compare that to what you might expect if you just follow the software rather than the market. Even in 2015 some people would have been saying “surely everything is Docker and Kubernetes now?”. The reality is closer to it being all shell scripts and BladeLogic for the majority of IT shops.

For the most part, innovators (and some early adopters) don’t buy software; instead they build or co-opt it. Take Netflix, Uber, Amazon, Google, Facebook or similar. All are well-known for building much of their core software and infrastructure and using open source solutions for much of the rest. And it’s not just software: all of the above also have large internal investments in bespoke hardware as well. So who buys software from software vendors? Taking Rogers’ Innovation Adoption Curve it’s the early majority, the late majority and the laggards. That’s ~85% of the market. Most of the noise on the internet about software is from innovators and early adopters, or people who want to be in those groups. But most of the software sold is to people with very different wants and needs. This chasm explains much of the frustration experienced with software, and the difficulty of building software for often very different types of users at the same time.

Much of the writing about continuous delivery and continuous deployment assumes you’re releasing a web site, or at least a central, single service. At the very least this is most people’s experience and context. But shipping software that people install and run themselves tends to make software deployment a pull rather than a push. A vendor can release a new version, but how do you make the customer upgrade? Technically this could be reasonably straightforward (Chrome auto-updates for example) but for expensive, often critical, systems in sometimes regulated or otherwise controlled or low-trust environments, this turns out to be trickier and more about people than just technology. This is an entire topic on its own so I’ll leave it there for now.

Continuous integration for packaged software (true for some, but not most, projects outside software vendors) tends to hit a permutation explosion quite quickly. Take server software, because that’s what I’m most familiar with. You’ll definitely support the latest version of RHEL, plus probably a few older versions, and maybe CentOS and some of the other variants (Oracle Linux, Scientific Linux) as well. Ubuntu LTS releases probably make the list, as might Debian stable. You’ll also likely want to test on at least Windows Server 2016 and 2012. You may need to keep going and support BSD, AIX, HP-UX, SUSE, etc. Puppet has an unreasonably long list of supported and tested platforms for instance. Throw in other variations of configuration or architecture and you have a serious CI environment. Compare this to the more typical case of a deployment pipeline to a single known operating system and version on a server you control.
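To make the explosion concrete, here’s a sketch using an illustrative platform matrix (these are not any particular vendor’s actual supported platforms):

```python
from itertools import product

# Illustrative platform matrix; the real list for a given product will differ.
operating_systems = ["rhel-7", "rhel-8", "centos-7", "ubuntu-16.04",
                     "ubuntu-18.04", "debian-9", "windows-2012r2", "windows-2016"]
architectures = ["x86_64", "i386"]
runtime_versions = ["2.1", "2.4"]  # e.g. whatever runtime the product embeds

# Every combination is potentially a separate CI job.
matrix = list(product(operating_systems, architectures, runtime_versions))
print(len(matrix))  # 8 * 2 * 2 = 32 jobs, before excluding invalid combinations
```

Each new axis multiplies the job count, which is why vendor CI farms dwarf the single-platform pipelines most in-house teams run.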

Open source

One of the notable things about the lists above of older (public) and newer (currently private) software companies is that all of the newer ones are based around an open source software product or products. We’ve had companies based around open source for a long time, but very few make it to the public markets (where we get data to see if they actually work as companies). A recent exception is Hortonworks (HDP) which opened at $26.38 in December 2014 but is down to $8.31 as of this writing, with revenues around $40million a quarter. Red Hat (RHT) did $2billion in 2016 (which remember is 5% of Oracle’s revenues, but still a large amount).

So undoubtedly open source has had a large effect on the software industry as a whole. But the impact on the public markets to date is minimal in terms of new companies. It will be super interesting to see if, in 5 years’ time, the list of public software companies based on open source software is larger than it is today.


I mainly wrote this post so I had something to reference when I talk to people about the software industry, and in particular what it’s like working for a software vendor. Speculating about or second-guessing one vendor or another is an internet sport (none more so than for those who work at other vendors), but from the outside I think it’s worth appreciating some of the differences, and having a bit of empathy for the decisions made. And if the above makes you think this all sounds rather interesting, then you’d be right.