Jan 1, 2017 · 21 minute read
This post was originally written
as part of the Government Service Design Manual while I was working for the UK
Cabinet Office in 2013. I’m republishing it here under the terms of the Open
Government licence.
This document outlines the typical scope of infrastructure and web operations
(sometimes erroneously referred to as hosting) work on a large service
redesign project.
The sample list of user stories provided is not intended to be a complete list of
all areas of interest, nor are you likely to need to do all of this for every service.
The idea is for this list to be a good starting place from which you can write
additional stories, delete ones you do not require and split stories into smaller
ones. Importantly you also need to provide your own acceptance criteria
specific to the needs of your service.
Remember these stories are a placeholder for a conversation.
For some contexts, that conversation will be ‘this does not apply to my
service’ – that is fine. But there will almost certainly be other stories not
listed here which do apply.
The problem
An issue we have observed on a number of projects is a lack of understanding
early on in a project about the work required to run a large online
service. Often this is placed under hosting and is investigated too late
in the process.
Intended audience
The hosting of a complex and sensitive software application requires a team
of people with specialist skills to design, set up and operate. Because this
work is generally not user facing and can be highly technical it is sometimes
easy to leave until later – with potentially dire consequences for launching
safely and on time.
Service managers
Does your team have people who deeply understand this topic? If you are not an
expert then it is important to involve people permanently in the team
who are. They can explain the technical trade-offs and decisions which may
affect your service.
Delivery managers
As well as understanding the potentially large scope of work, many of the areas
discussed here have lead times associated with third parties. The earlier
stories related to these topics are brought into project backlogs, the sooner
estimates can be made and deadlines understood.
Stories
The following stories are intended to provide a starting point for any project,
rather than be a complete set. Individual projects would be expected to take
and modify stories as needed and importantly to apply their own acceptance
criteria specific to their requirements.
The majority of these stories are from the point of view of developers, web
operations engineers and the responsible service manager. Although not ideal,
for this particular technical topic this works reasonably well. Feel free
to change the focus when using them in your backlog.
Process
Development process
As a developer working on the service
So that we can ensure a high level of quality
And so we can maximise the integrity of the source code
I want a well documented and understood development process
Out-of-hours support
As the service manager responsible for the service
So that we can ensure a suitable level of availability and integrity
I want to understand the requirement for out-of-hours support
Disaster recovery
As the service manager responsible for the service
So that in the event of a disaster everyone doesn’t panic and make things up
I want a clear disaster recovery plan in place to deal with different types of catastrophic event
Release process
As the service manager responsible for the service
So that the service can be changed on a very frequent basis
And so that changes do not cause problems for users
I want a well documented and understood release process
Security response
As the service manager responsible for the service
So that security incidents are handled with extra care
And so that the service meets its wider Government obligation to GovCert
I want a well documented and understood security incident process
Helpdesk
As the service manager responsible for the service
So that communication with users is done in a joined up way
I want a central helpdesk function to deal with events, incidents and requests
Request Management
As the service manager responsible for the service
So that questions from users can be dealt with efficiently
I want a clear information request management policy
Event Management
As the service manager responsible for the service
So that likely events that could affect the running of the service can be dealt with smoothly
I want a clear event management policy
Incident Management
As the service manager responsible for the service
So that problems that arise with that service can be dealt with efficiently
I want a clear incident management policy
Operations manual
As the service manager responsible for the service
So that information about the running of the service is not kept in individuals’ heads
And so information is readily available to people running the service
I want a single place to store content for a service operations manual
Shared service
Source code hosting
As a developer working on the service
So we have somewhere to securely store our source code
I want access to a central source code hosting service or repository
Continuous Integration
As a developer working on the service
So we can ensure a high level of quality in the code
And so we can minimise the time needed for regression testing
I want a Continuous Integration environment which automatically runs tests against every commit
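Stories like this one hide a concrete acceptance criterion: every commit triggers the test suite and the build goes red on failure. Reduced to its essentials, a CI check is just running a command and inspecting its exit code. A minimal sketch in Python; the test command here is a stand-in, not a prescribed tool:

```python
import subprocess
import sys

def run_check(cmd):
    """Run one CI check (e.g. the test suite) for a commit.

    Returns (passed, combined_output)."""
    result = subprocess.run(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    return result.returncode == 0, result.stdout

# A trivially passing check, standing in for a real test-runner command.
passed, output = run_check([sys.executable, "-c", "assert 1 + 1 == 2"])
```

A real CI server adds the triggering (on every commit), the reporting, and the history, but this is the contract at the centre of it.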
External DNS
As a web operations engineer
So that visitors to the service don’t need to remember an IP address that will change
I want a process and supplier relationship to manage external DNS addresses
Policy
Sensitivity of source code
As a developer working on the service
So that I understand the controls that need to be in place
And so that I know with whom and how I may share it
I want a clear policy around the sensitivity of source code
Third party code
As a developer working on the service
I want a clear policy around use of third party source code libraries
So that I do not introduce unknown security problems
Change evaluation
As the service manager responsible for the service
So that I can release changes to production quickly
And so that we can meet our obligation to the Digital by Default Service Standard
I want a documented process for evaluating and deciding on a change to the production service
Access control
As the service manager responsible for the service
So that the confidentiality, integrity and availability of the service isn’t compromised
And so that suitable technical controls can be put in place to enforce it
I want a clear policy on who has access to what on the production system
Separation of duties
As the service manager responsible for the service
So that we can ensure the service has enough people in the right roles
I want to understand any required separation of duties (whether driven by legislation or security concerns)
Clearances
As the service manager responsible for the service
So that security clearances can be arranged early in the project to avoid access restrictions later on
I want to know what level of clearances are required for different roles (including third parties)
Releasing open source
As a developer working on the service
So that I do not introduce unknown security problems
And so that we can meet our obligation to the Digital by Default Service Standard
I want a clear policy around releasing code as open source
Design
Government networks
As a technical architect
So that the right suppliers are contracted
And so that long lead times are factored into the project plan early
I want to know whether the service requires access to a Government network like the PSN or GSI
Multiple infrastructure providers
As the service manager for this service
So that I understand the intended availability constraints
I want to know whether multiple suppliers of Infrastructure are required
Capacity planning
As a web operations engineer
So that we can estimate the number and size of infrastructure components (instances, firewalls, load balancers etc.)
And so that resource based costs can be estimated
I want to carry out some capacity planning activities
Network architecture
As a technical architect
So that I can build out a production environment to an agreed specification
I want a network architecture design
Components
Web servers
As a web operations engineer working on the service
So that we can serve HTTP requests
And so we can proxy requests to application servers
I want to install and configure a web server
Databases
As a web operations engineer working on the service
So that data can be stored in a manner befitting its structure
And so the stored data can be queried as quickly as required
I want to install and configure a suitable database server
As a web operations engineer working on the service
So that data can still be read even during a failure of a single database server
I want to configure some failover or other redundancy mechanism for the database
As a web operations engineer working on the service
So that data can still be written even during a failure of a single database server
I want to configure some failover or other redundancy mechanism for the database
Load balancers
As a web operations engineer working on the service
So that web requests can still be served even with the failure of one or more web servers
I want to install and/or configure a load balancer
Internal DNS
As a web operations engineer working on the service
So that we can easily address our services and instances
I want to install and/or configure a mechanism to manage internal DNS
Database backups
As the service manager for the service
So that we can recover from a large failure of our database infrastructure
I want regular automated backups to be taken of the data stored in the database
As the service manager for the service
So that we can recover from a large failure of a single supplier’s infrastructure
I want regular automated backups to be stored off site
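A supporting acceptance criterion for automated backups is a retention policy: how many copies to keep, and which to prune. A sketch of the pruning logic, assuming a hypothetical date-stamped naming scheme:

```python
from datetime import date

def backups_to_delete(names, keep=7):
    """Given backup file names like 'db-YYYY-MM-DD.dump' (an assumed
    naming convention), return the ones to delete, keeping only the
    most recent `keep` backups."""
    def backup_date(name):
        # Extract the YYYY-MM-DD portion of the assumed naming scheme.
        stem = name.removeprefix("db-").removesuffix(".dump")
        return date.fromisoformat(stem)

    # Newest first; everything past the first `keep` entries goes.
    ordered = sorted(names, key=backup_date, reverse=True)
    return ordered[keep:]
```

The same logic works for off-site copies, though retention there is often longer and driven by recovery-point requirements rather than disk space.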
HTTP cache
As a web operations engineer working on the service
So that the service remains fast when serving identical content
And so load is minimised on the application servers
I want to install an HTTP cache
Email gateway
As a developer working on the service
So that the service can send email to administrators or end users
I want to setup and configure a suitable email gateway
Application servers
As a developer working on the service
So that the code I write can be run on server instances
I want to install and configure a suitable application server
Internal package repository
As a web operations engineer working on the service
So that we can use software not available in our operating system repositories
And so that we can use the security, dependency management and versioning features
I want to install and configure an internal package repository
Artifact repository
As a developer working on the service
So that we can share and version individual code components that need it
I want to install and configure an artifact repository
Message queue
As a developer working on the service
So that I can easily and efficiently process work asynchronously
I want to install and configure a suitable message queue or work queue system
Search server
As a developer working on the service
So that I can quickly and efficiently search through large amounts of data
I want to install and configure a suitable search engine
Object cache
As a developer working on the service
So that I can minimise the number of queries to the database
And so that I can keep the service fast and responsive to users
I want to install and configure an object caching system
Monitoring
Metric collection service
As a web operations engineer working on the service
So that we can collect large numbers of time series metrics from the running service
I want to install and configure a metric collection system
Application running monitoring checks
As a web operations engineer working on the service
So that we can run checks against metrics from the metrics system
And so that we can run active checks based on arbitrary code
I want to install and configure a monitoring system
Smoke tests
As a developer working on the service
So that I know that I haven’t broken anything when deploying my application
I want a series of smoke tests to be run after all deployments
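A smoke test suite can be little more than a table of paths and the status codes they should return. A sketch, with the fetch function injectable so the checks themselves stay testable; the paths are hypothetical:

```python
from urllib.request import urlopen
from urllib.error import HTTPError

def http_status(url):
    """Fetch a URL and return its HTTP status code."""
    try:
        with urlopen(url, timeout=10) as response:
            return response.status
    except HTTPError as err:
        return err.code

def run_smoke_tests(base_url, checks, fetch=http_status):
    """checks is a list of (path, expected_status) pairs.

    Returns a list of (path, expected, actual) tuples for any
    failures; an empty list means the deployment looks healthy."""
    failures = []
    for path, expected in checks:
        actual = fetch(base_url + path)
        if actual != expected:
            failures.append((path, expected, actual))
    return failures
```

Wired into the deployment tooling, a non-empty failure list should fail (and ideally roll back) the deploy.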
Application metrics
As a developer working on the service
So that I can gain visibility of how my application is running in production
And so we can find and fix problems with it quickly
I want a simple way of instrumenting my application to feed metrics to the metrics system
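One lightweight instrumentation approach is the statsd wire format over UDP: fire-and-forget packets, so a slow or absent metrics server can never slow the application down. A sketch; the metric names are illustrative:

```python
import socket

def statsd_packet(name, value, metric_type):
    """Build a statsd-format metric line, e.g. b'requests:1|c'.

    Common types: 'c' counter, 'ms' timer, 'g' gauge."""
    return f"{name}:{value}|{metric_type}".encode("ascii")

def send_metric(name, value, metric_type, host="localhost", port=8125):
    """Emit one metric over UDP (the conventional statsd port is 8125)."""
    # UDP is fire-and-forget: a down metrics server never blocks the app.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(statsd_packet(name, value, metric_type), (host, port))
    finally:
        sock.close()
```

In practice you would wrap this in counters and timing decorators at the points in the application you care about.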
System metrics
As a web operations engineer working on the service
So that we can identify and fix problems with the system, ideally before they occur
I want to set up collection of low-level system metrics like load, disk, network I/O, etc.
Security monitoring
As a web operations engineer working on the service
So that we notice quickly and are alerted to any incidents with a security flavour
I want to configure suitable security monitoring tools
Notifications
As a web operations engineer or developer supporting the service
So that I know about any issues as they happen
I want to set up suitable notifications from the monitoring system
Transactional monitoring
As a developer working on a transactional service
So that we can block fraudulent or otherwise suspect transactions
I want to install and configure a transactional monitoring system with suitable rules
External monitoring
As the service manager for the service
So that we still have basic monitoring in the event of a failure of the monitoring system
And so that the service is monitored from outside our local network
I want an external monitoring capability with basic checks to monitor service uptime
Monitoring data feed from infrastructure provider
As a web operations engineer working on the service
So that I am aware of problems in the hypervisor, physical or network infrastructure
I want a feed of monitoring data from the Infrastructure supplier
Logging
Log collection
As a web operations engineer working on the service
So that I can easily see everything that is happening in specific applications
I want to collect all the logs from applications running on the same host in one place
Log aggregation
As a web operations engineer working on the service
So that I don’t have to go to an individual machine to view its logs
I want all logs from all machines to be aggregated together
Log storage
As a web operations engineer working on the service
So that logs can be kept for a suitable period of time
I want to provision enough storage for log archiving
Log viewing
As a web operations engineer working on the service
So that I can see what is happening across the infrastructure
I want a mechanism for viewing and searching logs in as near real time as possible
As a developer working on the service
So that I can extract information from logs to aid with improving the service
I want a mechanism to run queries across the aggregated logs
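If applications emit one JSON object per log line, ad-hoc queries over the aggregated stream reduce to simple filters. A sketch; the field names are assumptions:

```python
import json

def query_logs(lines, **criteria):
    """Yield parsed log records (one JSON object per line) matching
    all of the given field=value criteria."""
    for line in lines:
        try:
            record = json.loads(line)
        except ValueError:
            continue  # tolerate the odd corrupt line in an aggregated stream
        if all(record.get(field) == value for field, value in criteria.items()):
            yield record
```

Dedicated log search tools add indexing and near real time tailing, but structured (rather than free-text) log lines are what make both this sketch and those tools effective.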
Configuration management
Configuration management client
As a web operations engineer working on the service
So that changes to server configuration can be made safely and quickly
I want to install configuration management client software
Configuration management database
As a web operations engineer working on the service
So that configuration changes are tracked over time
And so that the current state is available to query
I want to install software to manage a configuration management database
Configuration management server
As a web operations engineer working on the service
So that individual nodes do not hold all of the configuration information
I want to install software to allow centralised management of Configuration management code
Deployment
Configuration management code deployment mechanism
As a web operations engineer working on the service
So that configuration changes can be made safely and in an auditable manner
I want a deployment process and tooling for configuration management code
Application deployment mechanism
As a developer working on the service
So that changes to applications can be made available to users
And so that changes are made in a safe and auditable manner
I want a deployment process and tooling for application code
Release tracking
As the service manager for the service
So that we have an auditable log of what was changed, when, and by whom
I want an up-to-date list of releases to be maintained
Packaging
As a web operations engineer working on the service
So that we don’t have to compile customised applications from source before using them
And so we can take advantage of dependency and version management capabilities of the OS
I want a process and tooling for creating our own system packages
Orchestration
As a web operations engineer working on the service
So that I can run commands across multiple instances quickly
I want tooling in place which allows some orchestration based on the current instances
Database migrations
As a web operations engineer working on the service
So that I can have confidence that database migration scripts will work when applied to production
I want database migrations to be deployed through the same sequence of environments as code changes
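The property this story is after is that every environment applies migrations in exactly the same order. A sketch of the bookkeeping, with hypothetical migration names:

```python
def pending_migrations(available, applied):
    """Return the migrations still to run, in order.

    `available` is the ordered list shipped with the code;
    `applied` is the set already recorded in the target database.
    Raise if the database knows a migration the code does not --
    that means environments have diverged."""
    unknown = set(applied) - set(available)
    if unknown:
        raise ValueError(f"database has unknown migrations: {sorted(unknown)}")
    return [m for m in available if m not in applied]
```

Running this same computation in preview, staging and production is what gives confidence that a migration script will behave in production as it did everywhere else.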
Management of secrets
As a web operations engineer working on the service
So that I can ensure confidential communication between particular parts of the system
I want a process or tool for managing secrets such as keys and passwords
Access control
End user devices
As the service manager responsible for the service
So that management access to the infrastructure can be locked down to prevent unauthorised access
I want to know what kind of protection the management end user devices require
User directory
As a web operations engineer
So that we do not have to maintain multiple lists of privileged users
And so that users can be added and removed once in a central fashion
I want to install and configure something to provide a single user directory
Key based authentication
As a web operations engineer
So that we are not vulnerable to password based login attempts to individual servers
I want to set up public key based authentication
Single sign-on
As a web operations engineer
So that any third party web interfaces we use can be accessed via a single login
I want to install and configure a single sign-on system
Network/VPN configuration
As a web operations engineer
So that management functions can not be accessed via the public internet
And so that we reduce the surface area for attack
I want to restrict management access to a VPN and/or non-public restricted network
Provisioning
Other environments
As the service manager for the service
So that I can see the very latest working version of the service at any time
And so I can share that with people in and outside the team
I want a preview environment to be provisioned which is similar to production
As a web operations engineer working on the service
So that we have a clean environment in which to test production deployments
And so that we have a secure environment to test with production-like data
I want to provision a staging environment which mimics production as closely as possible
Production environment
As a web operations engineer working on the service
So that the service can launch to the public
I want to provision a production environment
Base image(s)
As a web operations engineer working on the service
So that all server instances start out with sensible security settings
I want to create a base image running the chosen operating system with hardened configuration
Public network interfaces
As a web operations engineer working on the service
So that the application only receives wanted traffic from the internet
And so that we don’t accidentally expose sensitive or insecure components of the system
I want to configure and test the public network interfaces for the system
Private network configuration
As a web operations engineer working on the service
So that individual internal components can only talk with known parts of the system
And so we limit the extent of any security breach
I want to configure and test the private network interfaces for the system
Network codes of connection
As a web operations engineer working on the service
Given I need to communicate with a system only available on a Government network
So that the two systems can talk with each other
I want to meet the code of connection requirements and configure access to the network
Management network
As a web operations engineer working on the service
So that network traffic used to manage the infrastructure is separate from public traffic
And so we can monitor irregularities in network traffic separately
I want to configure a separate management network
Platform load balancers
As a web operations engineer working on the service
So that we can reduce the number of single points of failure
And so that we can scale out to deal with a large amount of traffic
I want to provision load balancers to distribute traffic between multiple instances
Platform firewalls
As a web operations engineer working on the service
So that unwanted traffic can be filtered before it enters our virtual infrastructure
I want to configure the external facing IaaS firewalls to only allow certain traffic
Dynamic environments
As a web operations engineer working on the service
So that we are not constrained by a fixed number of environments
And so we can easily run full stack tests or experiments
I want to be able to easily provision an environment running the full service
Elastic scaling
As a web operations engineer working on the service
So that the service can automatically deal with unexpected increases in traffic
I want to configure tooling to automatically scale the number of instances based on load
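A scaling policy usually reduces to: measure utilisation across the fleet, compute a desired instance count, and clamp it to sane bounds. A sketch; the target and bounds are illustrative:

```python
import math

def desired_instances(current, avg_utilisation, target=0.6, floor=2, ceiling=20):
    """Compute how many instances we want, given average utilisation
    (0.0-1.0) across the current fleet, aiming for `target` utilisation.

    Clamped to [floor, ceiling] so we never scale to zero and never
    run away under a traffic spike or a bad metric."""
    if avg_utilisation <= 0:
        return floor
    wanted = math.ceil(current * avg_utilisation / target)
    return max(floor, min(ceiling, wanted))
```

IaaS auto-scaling features implement some variant of this; understanding the policy matters more than the tooling, because a bad target or missing ceiling turns a traffic spike into a cost incident.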
Security controls
Operating system hardening
As a web operations engineer
So that we are making use of built-in operating system security controls
I want to automate a default set of hardening rules for our chosen operating system
Malware detection
As a web operations engineer
So that instances which may be compromised can be dealt with quickly
I want to automate the detection of potential malware
Intrusion detection
As a web operations engineer
So that instances which are being attacked or probed can defend themselves
I want to configure an intrusion detection and prevention system
Virus scanning
As a web operations engineer
So we can be sure that files in the system don’t have viruses
I want to install virus scanning for files passing a network boundary
Host firewalls
As a web operations engineer
So that the surface area for attack is limited
And so that services which should only be available locally aren’t exposed on the internet
I want to install and configure a local firewall
On instance event auditing
As a web operations engineer
So that I know when things like logins or other sensitive events happen on instances
I want to set up some auditing of events
Rate/connection limiting
As a web operations engineer
So that large spikes in traffic from a single source don’t overwhelm the application
I want to configure some level of rate and connection limiting for web requests
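A common implementation of this story is a token bucket per source: tokens accrue at a steady rate up to a burst cap, and requests beyond that are rejected. A sketch with an injectable clock; the limits are illustrative:

```python
import time

class TokenBucket:
    """Allow `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.last = clock()

    def allow(self):
        """Return True if this request is within the limit."""
        now = self.clock()
        # Refill tokens for the time elapsed since the last request.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Web servers and load balancers usually provide this built in; the sketch is the behaviour you are configuring, not a suggestion to write your own.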
Secure storage of key material
As a web operations engineer
So that any highly sensitive cryptographic keys are not lost, resulting in a compromise
I want to have a mechanism in place to securely store key material
Third party DDoS protection
As a web operations engineer
So that the site does not go down under a denial of service attack
I want to purchase and/or configure a level of DDoS protection
Testing
Performance testing
As the service manager responsible for the service
So that we know the service will be fast and responsive under realistic traffic
I want to be able to run a comprehensive performance test suite against the service
As a developer working on the service
So that we know changes to the code do not negatively affect performance
I want the performance test suite to run as part of the continuous integration system
Load testing
As the service manager responsible for the service
So that we know the service will still be working under larger amounts of traffic than are expected
I want to be able to run a comprehensive load test suite against the service
Application penetration testing
As the service manager responsible for the service
So that the service does not get compromised due to a vulnerability
And so we meet our accreditation obligations
I want to run a suitable number of penetration tests against the applications under development
As the service manager responsible for the service
So that the service does not get compromised due to a vulnerability
And so we meet our accreditation obligations
I want to run a suitable number of penetration tests against third party installed applications used as part of the service
Infrastructure penetration testing
As the service manager responsible for the service
So that the service does not get compromised due to a vulnerability
And so we meet our accreditation obligations
I want to run a suitable number of penetration tests against the infrastructure configuration
Operating system
Operating system selection
As a web operations engineer working on the service
So that we have a clear path to receiving security updates
And so we can more easily find support for our systems
I want to select and install a suitable default operating system for the service
File systems
As a web operations engineer working on the service
So that we get the best possible performance and reliability from the disk
I want to select a suitable file system and partition layout
Resource isolation
As a web operations engineer working on the service
So that noisy applications cannot affect other applications on the instance
I want to be able to isolate running applications from each other in terms of memory and CPU
Read-only file systems
As a web operations engineer working on the service
So that I can protect against files being changed due to compromises in the application
I want to be able to configure a read-only file system if appropriate.
Dec 31, 2016 · 8 minute read
One of the reasons I moved to Puppet two and a bit years ago was because
I was interested in the software industry. In particular I was
interested in being on the vendor side for a while. My background is
mainly as a service provider, software as a service, in-house
developer/ops type person. This has definitely been an interesting
experience, but I’ve not tried too much to explain why, until now.
First, what do we mean by vendor?
a person or company offering something for sale, especially a trader
in the street.
So in the context of a software vendor we specifically mean:
a person or company offering software for sale
Note that we’re selling the software, not access to some service
provided by software (i.e. SaaS). SaaS and other as-a-service models are a
growing part of the industry, but the business model, development cycle,
company structure and other aspects are quite different in my experience,
though lots of hybrid models exist too.
Economics and scale
One of the interesting aspects of the software vendor world is the
economics, the revenues, and the fact lots of companies are public. This
in turn means a large amount of VC money goes into trying to create
another large software vendor, because the potential payout is huge.
Take a sample set of companies from the last 10 years or so that are still
private: Docker, Puppet, Chef, MongoDB, Elastic, CoreOS, Mesosphere,
Weave, Cloudera, etc. Somewhat biased towards my own interests I’ll
admit.
Now take a sample of large, public, software vendors: Oracle, Microsoft,
CA, SAP, Sage, BMC, VMware. Not counting companies like Intel,
Cisco, IBM, Dell (no longer public), and HP with huge software portfolios.
Let’s pick on Sage, a UK software company selling accounting software.
As of 2014 Sage had 1169
people in software development R&D roles and they made $1.6billion from
software and related services in 2015. That’s probably about the (order
of magnitude) number of people employed in R&D roles in the above private
software companies. The revenue is (and I’m guessing here) a bit higher
at Sage than those companies combined too. SAP is an order of magnitude larger,
both in terms of people (18908 in 2014) and revenues ($18billion
also in 2014). Oracle revenues were $38billion as another data point.
So all the cool (or not so cool) companies from the past 10 years or so
are a rounding error relative to the size of the industry. But you wouldn’t know
that from reading Hacker News or other parts of the internet. This
disconnect is a constant source of interest to me as I spend time with
Puppet customers and with the wider infrastructure community at
conferences and the like.
A world of difference
My gut feeling is that most people working as software developers,
designers, product managers, etc. don’t work for software vendors, apart
from maybe in localised areas like Silicon Valley. But because of the
aforementioned money and scale (and PR spend) of the big players a
great deal of press interest centers around vendors. Docker is probably the
best current example of this but it’s more general than one company.
This makes what happens in software-vendor-startup-land more visible to
everyone else than, say, IT reality in large financial companies.
At the heart of a good software company is a product being
built and maintained by a team of engineers, designers, managers, etc.
In many ways this is similar to lots of people’s experience of building
software (whether at work or at home as part of one open source project
or another). But the support surrounding this tends to vary greatly from
other areas. A dedicated marketing and product marketing team, dedicated
sales staff, a professional services function, training, documentation,
public relations personnel are all required to turn the software into
revenue. And importantly these teams have to work closely together, and
be actively involved with the development of the product.
This is very different from an in-house development position, but it’s
also quite different from most SaaS operations. SaaS tends (generalising
here) to be based around large numbers of individual users with monthly
recurring revenues of 10s or 100s of US dollars. Software vendors
selling to large enterprises tend to be looking at single large deals
of 10s of thousands to many millions of dollars. This tends to mean
large differences in total number of customers, revenue per
customer, time needed to close a deal, requirement for staff local to a
customer, etc. All of that makes for a very different operation and
feedback cycle.
Some interesting observations
Software has a much longer shelf-life in the real world than people
typically think on the internet. Take the datacenter automation
market. This IDC
report
for example pegs the market at $2.3billion in 2015. VMware takes the
lion’s share with roughly 30%, with BMC at 10%. For reference Puppet
has 3.2% and Chef 1.2%. Obviously this is just one report, and it’s now
a year old, but it’s an interesting data point. And compare that to what
you might expect if you just follow the software rather than the market.
Even in 2015 some people would have been saying “surely everything is
Docker and Kubernetes now?”. The reality is closer to it being all shell
scripts and BladeLogic for the majority of IT shops.
For the most part, innovators (and some early adopters) don’t buy software;
instead they build or co-opt it. Take Netflix, Uber, Amazon, Google,
Facebook or similar. All are well-known for building much of their core
software and infrastructure and using open source solutions for much of
the rest. And it’s not just software, all of the above also have large
internal investments in bespoke hardware as well. So who buys software
from software vendors? Taking Rogers’ Innovation Adoption
Curve
it’s the early majority, the late majority and laggards. That’s
~85% of the market. Most of the noise on the internet about software is
from innovators and early adopters, or people who want to be in those
groups. But most of the software sold is to people with very different
wants and needs. This chasm explains much of the frustration experienced
with software, and the difficulty of building software for often very
different types of users at the same time.
Much of the writing about continuous delivery and continuous deployment
assumes you’re releasing a web site or at least a central, single,
service. At the very least this is most people's experience and context.
But shipping software that people install and run themselves tends to
make software deployment a pull rather than a push. A vendor can release
a new version, but how to make the customer upgrade? Technically this
could be reasonably straightforward (Chrome auto-updates for example)
but for expensive, often critical, systems in sometimes regulated or
otherwise controlled or low trust environments, this turns out to be
trickier and more about people than just technology. This is an entire
topic on its own so I'll leave it there for now.
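The pull model is easy to sketch: the customer's installation periodically checks a published version and decides locally whether to act. A minimal illustration in Python (the version numbers and logic here are hypothetical, not any vendor's actual updater):

```python
# Sketch of pull-based upgrades: the vendor publishes a version, but the
# customer's installation decides if and when to act on it.

def parse_version(version):
    """Turn a dotted version string like '2.3.1' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))

def needs_upgrade(installed, latest):
    """Pull model: the client compares what it runs against what the
    vendor has published; the vendor cannot force the change."""
    return parse_version(latest) > parse_version(installed)

print(needs_upgrade("2.1.0", "2.3.1"))  # → True
print(needs_upgrade("2.3.1", "2.3.1"))  # → False
```

Even when the check itself is this simple, whether the operator acts on it is a people and process question, which is the point above.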
Continuous integration for packaged software (true for some, but not most,
projects outside software vendors) tends to hit a permutation explosion quite
quickly. Take server software because that’s what I’m most familiar
with. You’ll definitely support the latest version of RHEL, plus
probably a few older versions, and maybe Centos and some of the other
variants (Oracle Linux, Scientific Linux) as well. Ubuntu LTS releases
probably make the list, as might Debian stable. You'll also likely want
to test on at least Windows Server 2016 and 2012. You may need to
keep going and support BSD, AIX, HP-UX, SUSE, etc. Puppet has an
unreasonably long list of supported and tested
platforms for instance.
Throw in other variations or configurations or architectures and you
have a serious CI environment. Compare this to a more typical case of a
deployment pipeline to a single known operating system and version on
a server you control.
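The scale of that explosion is easy to underestimate; multiplying out even a modest platform list gives a large matrix. A quick sketch in Python, using illustrative platform lists rather than any vendor's real support matrix:

```python
from itertools import product

# Illustrative platform lists, not any vendor's actual support matrix.
operating_systems = [
    "RHEL 6", "RHEL 7", "CentOS 6", "CentOS 7",
    "Ubuntu 14.04", "Ubuntu 16.04", "Debian 8",
    "Windows Server 2012", "Windows Server 2016",
]
architectures = ["x86_64", "i386"]
product_versions = ["4.8", "4.9", "4.10"]

# Every commit potentially needs a build and test run per combination.
matrix = list(product(operating_systems, architectures, product_versions))
print(len(matrix))  # 9 * 2 * 3 = 54 jobs, versus 1 for a single known target
```

Add configurations, upgrade paths and older product versions and the matrix grows multiplicatively again.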
Open source
One of the notable things about the lists above of older (public) and
newer (currently private) software companies is that all of the newer
ones are based around an open source software product or products. We’ve
had companies based around open source for a long time, but very few make
it to the public markets (where we get data to see if they actually work
as companies). A recent exception is Hortonworks (HDP) which
opened at $26.38 in December 2014 but is down to $8.31 as of this
writing, with revenues around $40 million a quarter.
Red Hat (RHT) did $2 billion in
2016
(which remember is 5% of Oracle's revenues, but still a large amount).
So undoubtedly open source has had a large effect on the software
industry as a whole. But the impact on the public markets to date is
minimal in terms of new companies. It will be super interesting to see
if in 5 years time the list of public software companies based on open
source software is larger than it is today.
Conclusions
I mainly wrote this post so I had something to reference when I talk to
people about the software industry, and in particular what it’s like
working for a software vendor. Speculating about or second-guessing one
vendor or another is an internet sport (none more so than for those who
work at other vendors), but from the outside I think it's worth an
appreciation of some of the differences, and a bit of empathy for the
decisions made. And if the above makes you think this all sounds rather
interesting then you’d be right.
Nov 23, 2016 · 5 minute read
Very few people today start using Linux by downloading the Linux kernel
and starting from scratch. Most people start with a Linux distribution;
for instance Debian, Ubuntu or CentOS. These distributions provide some
opinions, some central infrastructure, a brand, strong versioning for
the entire ecosystem and a bunch of other things. I posit that we’ll see
the same pattern emerge with Kubernetes.
What even is Kubernetes?
I’ve seen Kubernetes described as all of the following:
- An operating system for your datacenter
- The distributed systems toolkit
- The Linux kernel for distributed systems
I think all of these descriptions point to the developers' intent that
Kubernetes is something to build upon, rather than a simple out-of-the-box
experience. It’s predominantly about building agreement on the
primitives/APIs of distributed systems.
A name for a thing
I’ve not seen much discussion of this in general yet, I think because
it’s early days and many of the people looking at Kubernetes today are
either developers or early adopter types. These people have been
“downloading the kernel and starting from scratch”, even until recently
most likely running from source downloaded directly from GitHub. If the
Kubernetes ecosystem is to grow then that’s not how more mainstream IT
will adopt Kubernetes.
The reason for discussing this now is that I think a name is useful.
That way we can talk about Kubernetes (singular, the software) separately
from distributions of Kubernetes (many of them, from different vendors
and communities). I’d be happy to see a different name, but I think
distribution probably fits best.
Any evidence?
Absolutely. A range of software vendors are providing what I’m calling
Kubernetes distributions. Here is a sample; I'm sure there are and will
be more. I'm also sure over time some will disappear or maintain only a
niche audience.
- OpenShift from Red Hat
- Tectonic from CoreOS
- Kismatic from Apprenda
- Rancher
- Canonical Distribution of Kubernetes
- GKE from Google
- Azure Container Service from Microsoft
- Photon Platform from VMware
- Navops from Univa
Note that Canonical are already using the term distribution in the
name. I’ve seen it used in passing in CoreOS, OpenShift and Apprenda
press materials too.
What can we expect from Kubernetes distributions?
Running with the analogy that Kubernetes is “an operating system for
your datacenter” and that we’ll have a range of competing Kubernetes
distributions, what else can we expect over the next few years?
Package repositories (aka. app stores)
One of the things provided by the traditional Linux distributions has
been a central package repository. Most of the packages you’re
installing from apt
or yum
are coming from that curated set of
available packages. Not to mention community efforts like EPEL. We
already have two package concepts within the Kubernetes ecosystem -
container images (often from Docker Hub today, or from internal
repositories) and Charts, part of the Helm package management tool
(now a CNCF project).
In the short term expect the shared public Charts repository and Docker
Hub to dominate. But over time different vendors will launch their own
repositories. Partly this will be about building a trusted ecosystem,
partly about limiting permutations for support and testing, and partly
about control. The prize here is to be “the enterprise app store”, and
every vendor in this space is going to at least try to own that as part
of their platform.
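For a sense of what a package looks like in this ecosystem, a Helm chart is mostly metadata plus templates. A minimal, hypothetical Chart.yaml (the name and version are made up, and the field set reflects early Helm conventions):

```yaml
# Chart.yaml for a hypothetical chart; templates/ and values.yaml
# would sit alongside this file in the chart directory.
name: example-app
version: 0.1.0
description: An illustrative application chart
```

The chart, like a Linux package, is what a vendor-run repository would curate, sign and test.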
Kubernetes standards and compliance
In an environment with many distributors of core software, it’s
common for people to emphasise portability. As vendors extend their
distribution (to provide higher level, but potentially proprietary
features) this can become muddier. Some level of certification is
often the answer. See CloudFoundry or OpenStack for recent examples.
Kubernetes is already part of the CNCF, part of the Linux Foundation.
I’d expect to see the works standards and certification eventually
float around, but my guess is not in the short term.
A fight over who is the most open
Much of the container conversation recently has centered around a
weaponisation of open. I think as the different distributions try to
take the community with them, while at the same time trying to scale
sales, this will continue. This will be an irritation and is probably
best avoided.
Pressure for AWS to offer Kubernetes as a service
I would presume AWS has a very good idea of how many people are actually
using Kubernetes on its platform. I think as that grows, and as other
vendors' efforts mature, they will come under pressure to offer the
Kubernetes API as a service. I’m still split on whether that will
actually happen but that’s a longer blog post about economics.
Differentiating features
Ultimately vendors will try and differentiate themselves in this new
market. To begin with the majority of business will be targeting the
container-curious and mainly talking up the benefits of containers and
Kubernetes. But some potential customers are going to insist on
comparing Kubernetes distributions, and winning there is going to be about
clear differentiation. Do you want to be the budget offering or the
provider with the unique selling point?
Interesting questions
An observation at the moment is that all the current Kubernetes
distributions I’m aware of are vendor-owned. Whether Open Source or not,
they are driven by a single vendor (CoreOS, Red Hat, Apprenda, etc.)
It’s interesting to see whether, in the current climate, we see a
genuinely free and open source Kubernetes distribution emerge, similar
to the role Debian plays in the Linux distribution world.
Nov 12, 2016 · 2 minute read
The previous
post
went into why I think the days of the general purpose operating system
(for servers) are numbered. But one interesting area I didn’t comment
on (but did talk about in the talk of the same name) was Unikernels.
It’s all about cost
One of the topics I didn't really touch on in discussing the end of the
general purpose operating system was cost. Historically,
maintaining a general purpose operating system has been a costly
endeavour, something only the largest companies or communities could
sustain by themselves. Think Red Hat, Oracle, Microsoft, Sun, IBM,
Debian, etc. The result of that is the assumption when building software
that you should target one or more of a small number of operating systems.
In doing so you’re ceding some ground, and likely some revenue, to another
vendor. You’re also stuck with any underlying limitations of that OS as
well as its release cadence. And invariably you’re also stuck with the
multiplying support cost of supporting your software on multiple versions of
that OS over time.
I would posit that until relatively recently the cost of that support
burden was hugely outweighed by the cost of maintaining an actual operating
system. But that's now changing, as I outlined in the previous post. Now a
small or medium sized software company (be it CoreOS, Rancher, Docker,
Pivotal, etc.) can build and maintain its own operating system as well.
This is very much about the rising level of abstraction - all of the
above leverage the huge efforts that go into the Linux kernel and into
other projects like systemd (CoreOS) or Alpine (Docker’s Moby) for
instance.
Enter Unikernels
But where do Unikernels fit into this narrative? I’d argue that they
represent the fulfilment of this democratization. If building and
maintaining a traditional OS is only possible for the largest of
companies, and building and maintaining a more special-purpose OS (say
for running containers, or a storage device) is cost-effective for medium
sized software companies, then Unikernels will allow anyone to build their
own single-purpose operating systems.
There are technical arguments for (and against) Unikernels as an
approach, but most discussions focus only on the technical. I think the economic side is
worth some consideration too. And not just the typical development and
support costs, but the ability to own the end-to-end unit of software
has lots of benefits, and Unikernels may make those benefits available
to everyone, including small organisations and individuals.
Nov 5, 2016 · 5 minute read
An interesting chat on Twitter today reminded me that not everyone is
probably aware that we’re seeing a concerted attempt to dislodge the
general purpose operating system from our servers.
I gave a talk about some of this nearly two years
ago
and I thought a blog post looking at what I got right, what I got wrong
and what’s actually happening would be of interest to folks. The talk
was written only a few months after I joined Puppet. With a bunch
more time working for a software vendor there are some bits I missed in
my original discussion.
What do you mean by general purpose and by end?
First up, a bit of clarification. By general purpose OS I’m referring
to what most people use for server workloads today - be it RHEL or variants
like CentOS or Fedora, or Debian and derivatives like Ubuntu. We’ll
include Arch, the various BSD and opensolaris flavours and Windows too.
By end I don’t literally mean they go away or stop being useful. My
hypothosis is that, slowly to begin with then more quickly, they cease
to be the default we reach for when launching new services.
The hypervisor of containers
The first part of the talk included a discussion of what I’d referred to
as the hypervisor of containers, what today would more likely be
referred to as a CaaS, or containers as a service. I even speculated
that VMware would have to ship something in this space (see vSphere Integrated
Containers and the work on Photon OS) and that counting out OpenShift
would be premature (OpenShift 3 shipped predominantly as a Kubernetes
distribution). I’ll come back to why this is a threat to your beloved
Debian servers shortly.
The race to PID1
For anyone who has run Docker, you'll likely have wrestled with the
question of where the role of the host process supervisor (probably systemd)
starts and the container process supervisor (the Docker engine) ends. Do
you have to interact directly with both of them?
Now imagine if all of the software on your servers was run in containers.
Why do I need two process supervisors now with 100% overlap? The obvious
answer is you don’t, which is why the fight between Docker and systemd
is inevitable. Note that this isn’t specific to Docker either. In-scope
for cri-o is Container
process lifecycle management.
Containers as the unit of software
Hidden behind my hypothesis, and mainly unsaid, was
that containers are becoming the unit of software. By which I mean
the software we build or buy will increasingly be distributed as
containers and run as containers. The container will carry with it
enough metadata for the runtime to determine what resources are
required to run it.
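In Kubernetes terms that metadata looks something like the resources section of a pod spec. A minimal, hypothetical example (the names and numbers are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app
spec:
  containers:
  - name: example-app
    image: example/app:1.0
    resources:
      requests:          # what the scheduler uses to place the pod
        memory: "128Mi"
        cpu: "250m"
      limits:            # what the runtime enforces
        memory: "256Mi"
        cpu: "500m"
```

The host needs to know nothing about the application beyond this contract, which is what makes near-identical hosts viable.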
The number of simplifying assumptions that come from this shared contract should not
be underestimated. At least at the host level you're likely to need lots
of near-identical hosts, all simply advertising their capabilities to
the container scheduler.
Operating system as implementation detail
What we’re witnessing in the market is the development of vertically integrated
stacks.
- Docker for Mac/Windows/AWS/Azure ships with its own operating
system, an Alpine Linux derivative nicknamed Moby, which is not intended for direct management by end users.
- Tectonic from CoreOS is a Kubernetes distribution which runs atop a
cluster of managed CoreOS hosts. Most of the operating system is
managed with frequent atomic rolling updates.
- OpenShift Enterprise from Red Hat is another Kubernetes derivative,
this time running atop Atomic host.
- Pivotal CloudFoundry ships with the IaaS, host OS, kernel, file
system and container OS all tested
together.
In all of these cases the operating system is an implementation detail
of the higher level software. It’s not intended to be directly managed,
or at least managed to the same degree as the general purpose OS you’re
running today.
This is how the end comes for the majority of your servers running a
general purpose operating system. The machines running containers will
be running something more single purpose, and more and more of the
software you’re running will be running in containers.
The reason why you’ll do this, rather than compose everything yourself, is
compatability. Whether it’s kernel versions, file system drivers,
operating system variants or a hundred variations that make your OS
build different from mine. Building and testing software that runs
everywhere is a sisyphean task. Their is also the commercial angle at
play here, and the advantage of being able to support a single validated
product to everyone.
Implications
There are lots of implications to this move, and it’s going to be
interesting to see how it plays out with both early adopters and
enterprise customers alike.
- What does this mean for corporate operating system policies?
- How do standard agent-based monitoring systems work in a world of
closed vertical stacks?
- Will we see this pattern for other types of service in the AWS Marketplace,
where instances launched are inaccessible but automatically updated?
- How does such fast moving software work in environments with rigid
change control processes or audit requirements?
- Many large organisations will end up running more than one of these
types of system; how best to manage such heterogeneous environments?
- Will we see push back from some parties? In particular the open source
community who may see this mainly serving the needs of vendors?
- Does the end of the general purpose OS lead to greater specialism
amongst systems administrators?
I’d love to chat about any of this with other folks who have given it
some thought. It’s interesting watching grand changes play out across
the industry and picking up on patterns that are likely obvious in
hindsight. And if you like this sort of thing let me know and I’ll try
and find time for more speculation.
Oct 7, 2016 · 3 minute read
Docker just shipped InfraKit a few days ago at LinuxCon and, while at the Docker Distributed Systems Summit, I wanted to see if I could get a hello world example up and running. The documentation is lacking at the moment, especially around how to tie the different components like instances and flavors together.
The following example isn’t going to do anything particularly useful, but it’s hopefully simple enough to help anyone else trying to get started. I’m assuming you’ve checked out and built the binaries as described in the README.
First create a directory. We’re going to be using InfraKit to manage local files in that directory as part of the demo.
mkdir test
Now create an InfraKit configuration file. We’re going to use the file
instance plugin to manage files in our directory. This means everything works on the local machine, rather than trying to launch real infrastructure in AWS or similar. InfraKit also requires a flavor
plugin. I’m using vanilla
here just to meet the requirement for a flavor plugin, but it’s not going to actually do anything in this demo. It might be useful to write a noop flavor plugin or similar.
cat garethr.json
{
"ID": "garethr",
"Properties": {
"Instance" : {
"Plugin": "instance-file",
"Properties": {
}
},
"Flavor" : {
"Plugin": "flavor-vanilla",
"Properties": {
"Size": 1
}
}
}
}
InfraKit is based on running separate plugins. Each plugin runs as a separate process and provides a filesystem socket in /run/infrakit/plugins. First start up the file plugin:
$ ./infrakit/file --dir=./test
INFO[0000] Starting plugin
INFO[0000] Listening on: unix:///run/infrakit/plugins/instance-file.sock
INFO[0000] listener protocol= unix addr= /run/infrakit/plugins/instance-file.sock err= <nil>
Next, in a separate terminal run the vanilla plugin:
$ ./infrakit/vanilla
INFO[0000] Starting plugin
INFO[0000] Listening on: unix:///run/infrakit/plugins/flavor-vanilla.sock
INFO[0000] listener protocol= unix addr= /run/infrakit/plugins/flavor-vanilla.sock err= <nil>
And finally run the group plugin. I'm passing --log=5
to enable more verbose output so it's easier to see what's going on with the group.
$ ./infrakit/group --log=5
INFO[0000] Starting discovery
DEBU[0000] Opening: /run/infrakit/plugins
DEBU[0000] Discovered plugin at unix:///run/infrakit/plugins/instance-file.sock
INFO[0000] Starting plugin
INFO[0000] Starting
INFO[0000] Listening on: unix:///run/infrakit/plugins/group.sock
INFO[0000] listener protocol= unix addr= /run/infrakit/plugins/group.sock err= <nil>
With that all set up, we can create a group based on our configuration file from above.
$ ./infrakit/cli group --name group watch garethr.json
watching garethr
Have a look in the test directory. You should see a single file has been created.
$ ls test
instance-1475833380
Let’s delete that file and see what happens:
rm test/*
Hopefully InfraKit will spot the instance (a file in this case) no longer exists and recreate it. You should see something like the following in the logs:
INFO[0612] Created instance instance-1475833820 with tags map[infrakit.config_sha:B2MsacXz8V_ztsjAzu3tu3zivlw= infrakit.group:garethr]
This is obviously a trivial example, but hopefully it provides a good hello world for anyone trying to run InfraKit in its current early stage.
Jul 5, 2016 · 4 minute read
The Everyone is a Software Company meme has been around for a
number
of
years,
but it feels increasingly hard to get away from recently. That prompted
this post.
But what do we mean by Software Company?
To be a software company you're going to need to employ software
engineers and other professionals. Applying that logic to a large
number of companies at once, and looking at how existing
software companies are setup, we find a few large problems.
Google as an example
In my talk at Velocity, entitled The Two Sides of Google Infrastructure
for Everyone
Else
I argued both for and against the idea of wholesale adoption of
Google-like software and development/operations practices.
Even though they derive the lion's share of revenue from advertising it's
easy to argue that Google are a software company. But what does that look like?
What makes Google a software company?
From the Google Annual Report
2015
61,814 full-time employees: 23,336 in research and development,
19,082 in sales and marketing, 10,944 in operations, and 8,452
in general and administrative functions
So, roughly 50% of Google is involved in building or running software.
Glassdoor
says salaries for engineers at Google average about $126,000-$162,000.
The US Bureau of Labor Statistics says
that in 2014 the number of computer programming jobs in the US
was 1,114,000, with median pay in 2015 of $100,690 a year. The
total number of jobs in the US is about 143 million, with the
average wages at $44,569.20 according to the Social Security
Administration.
The Google Annual Report also states:
Competition for qualified personnel in our industry is intense,
particularly for software engineers, computer
scientists, and other technical staff
So, quick summary:
- Software engineers are expensive relative to other employees
- Demand for the best engineers means even higher wages
- Proportionally there aren’t many software developers
- There isn’t a large surplus of unemployed software engineers
Now the data above is mainly from US sources, although the Google data
is from an international company with offices around the world. My
experience says this is likely similar in Europe. Looking into data for
India and China would be super interesting I’d wager.
Problems
One obvious problem is short-term supply and demand. Everyone wants
experienced software folks for their transformation effort. But the more
organisations that buy into the everyone is a software company story
the greater the demand for a finite supply of people. For most
that means you'll be able to find fewer of the people you want because of
competition, and afford fewer people because all that competition
pushes up salaries.
I’ve seen that firsthand while working for the UK Government. People
occasionally complained that Government was hampering commercial
organisations' growth by employing lots of developers and operations
people in London.
You’re also immediately in competition for software professionals with
existing software companies. Given the high salaries, most of
those employers already have developer friendly working environments and
established hiring practices suited to luring developers to work for
them. This sort of special case is hard for large companies without an
existing empowered developer organisation. I saw a lot of that at the
Government as well.
But the real macro problems are much more interesting. Even if you think
50% is a high mark for the ratio of software folk to others, you probably
agree you need a lot more than you have today. And those developers just
don’t exist today to allow everyone to be a software company. Nor
would I argue is education in the near term producing enough skilled
people to fill that gap tomorrow. So, what happens?
- Does everyone sort-of become a software company but not quite?
- Do most organisations struggle to hire and maintain a software team
and see the endeavour fail?
- Do increasing numbers of developers end up working for a small number
of larger and larger software companies?
- Does outsourcing bounce back, adapt, and demonstrate innovation and
transformation qualities to go along with the scale?
- Are countries like India or China able to produce enough software engineers
at scale to allow their companies to act on everyone becoming a
software company?
- Do we see clear winners and losers, i.e. companies which become software
companies and accelerate away from those that don't?
Personally I think to take advantage of the idea behind the meme we're
going to need order-of-magnitude more efficient approaches to software
delivery. What that looks like is the most interesting question of all.
Caveats
The above is not a detailed analysis, and undoubtedly has a few holes. It
also doesn’t overly question the advantage of being a software
company, or really question what we actually mean by everyone. But I
think the central point holds: Everyone is NOT a software company, nor
will everyone be a software company any time soon, unless we come up
with a fundamentally better approach to service delivery.
Dec 27, 2015 · 4 minute read
I think one of the patterns of the last few years has been the
democratization of systems administration, especially for web
applications. Whether that’s Heroku or Docker, or Chef or Puppet, more
and more traditional developers are doing work that would have been
somebody else’s problem only a few years ago. But running in parallel
to that thread is another less positive trend, that of conflating
operations with just systems administration. The story seems to go that
now we know Ansible (or some other tool) we just need developers to run
the show.
In this post I’m going to try and introduce some of the other
operational disciplines, especially for developers who maybe have come
to operations via the above resurgence in infrastructure tooling over
the past few years.
Note that this post has a slight bias towards more normal
organisations. That is to say if you’re in a 5 person software startup
you probably don’t have operational problems to worry too much about
yet. I’m also not playing down the practice of systems administration,
most experienced sysadmins I know are also quite rounded operations pros
as well.
Service Management
If you’ve worked in operations, or in many large organisations you’ll have
come across the term Service Management. This tends to be linked to
various service management frameworks; like ITIL or MOF (Microsoft
Operations Framework). The framework will describe, often in great
detail, activities and processes for things like incident response,
configuration management, change management, capacity planning and more.
While I was at The Government I wrote what I
think is a reasonable introduction to Service
Management
albeit from a specific point-of-view. This was based on my experience of
trying, and likely sometimes failing, to encourage teams to think about
how the products they were working on would be run. Each of the topics
touched on in the overview is worthy of its own stack of books, but I
will repeat the ITIL service list here as (whatever you might think of
the framework or a specific implementation) I’d found it a useful starting
point for conversations - in particular stressing the breadth of
topics under service management.
Service Strategy
- IT service management
- Service portfolio management
- Financial management for IT services
- Demand management
- Business relationship management
Service Design
- Design coordination
- Service Catalogue management
- Service level management
- Availability management
- Capacity Management
- IT service continuity management
- Information security management system
- Supplier management
Service Transition
- Transition planning and support
- Change management
- Service asset and configuration management
- Release and deployment management
- Service validation and testing
- Change evaluation
- Knowledge management
Service Operation
- Event management
- Incident management
- Request fulfillment
- Problem management
- Identity management
Continual Service Improvement
For each of the above points, whether you are using ITIL or not, it’s
useful to have a conversation. Some of these areas do provide ample
opportunity for automation and for using tooling to minimise the effort
required. But much of this is about designing how you are going to
operate a service throughout its lifetime.
Operations user stories
One of the other things I published while at The Government was a set of
user stories for a web operations
team.
These grew out of work on launching GOV.UK and have had input from
various past colleagues. In hindsight I'd probably do some things
here differently; the stories assume a certain context which isn't explicitly
spelled out, for instance. But they have a couple of things going for them in that
they demonstrate how traditional operations activities can be planned out as part
of a more developer-friendly planning approach, and also they are public and
have been tested by more than a single team.
Not everything is a programming problem
The main point I think is that not everything can be turned into a
programming problem to solve. Automation has its place, and many manual
processes and practices can benefit from automation. But the wide range
of activities involved in running a non-trivial and often non-ideal
system in production tend to mean making trade-offs and prioritization
decisions frequently. This is where softer skills like arguing for
funding or additional head count, or building a business case for
further work, come into play. Operations management is much more than
systems administration.
Further reading
This is little more than a plea for people to think more about
operations, separate to the more technical aspects of systems
administration. If you’re interested in learning more however I would
recommend some good reading material:
- Visible Ops
Handbook -
still an excellent and pragmatic introduction to many of the topics
noted above.
- Designing Delivery -
a bang up-to-date tome covering a range of service design topics.
- Basic Service
Management -
a 50 page starter book covering the fundamentals of service
management as generally discussed in more detail elsewhere. A great
starting point.
Dec 4, 2015 · 3 minute read
I love DigitalOcean for quickly spinning
up machines. I also like managing my infrastructure using Puppet. Enter the
garethr-digitalocean module.
This currently provides a single Puppet type; droplet
.
Let's show a quick example of that, by launching two droplets, called
test-digitalocean and test-digitalocean-1.
droplet { ['test-digitalocean', 'test-digitalocean-1']:
ensure => present,
region => 'lon1',
size => '512mb',
image => 14169855,
}
With the above manifest saved as droplets.pp
we can run it with:
$ puppet apply --test droplets.pp
This will ensure those two droplets exist in that region, and have that
size. If they don’t exist it will launch droplets using the specified image.
This means we can run the same command again, and rather than create
more instances it will simply report that we already have those
droplets.
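That idempotent behaviour can be sketched in plain Ruby. The names below are hypothetical and the real module talks to the DigitalOcean API, but the decision it makes each run is essentially this:

```ruby
# Hypothetical sketch of idempotent 'ensure => present' handling.
# 'existing' stands in for the list of droplets the API reports.

def ensure_present(wanted, existing)
  wanted.map do |name|
    if existing.include?(name)
      "#{name}: already exists, nothing to do"
    else
      existing << name
      "#{name}: created"
    end
  end
end

existing = ['test-digitalocean']
first  = ensure_present(['test-digitalocean', 'test-digitalocean-1'], existing)
second = ensure_present(['test-digitalocean', 'test-digitalocean-1'], existing)
# The second run makes no changes: both droplets already exist.
```

Running the same desired state twice converges to the same result, which is what makes it safe to apply the manifest repeatedly.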
Querying resources
Puppet also comes with puppet resource, a handy way of querying the
state of a given resource or type. Running the following will list all
of your droplets, whether you created them using Puppet or not.
$ puppet resource droplet
droplet { 'test-digitalocean':
ensure => 'present',
backups => 'false',
image => '14169855',
image_slug => 'ubuntu-15-10-x64',
ipv6 => 'true',
price_monthly => '10.0',
private_address => '10.131.98.186',
private_networking => 'true',
public_address => '178.62.25.100',
public_address_ipv6 => '2A03:B0C0:0001:00D0:0000:0000:0090:B001',
region => 'lon1',
size => '1gb',
}
Mutating resources
The type also supports mutating droplets, for instance changing the
size of a droplet if you change the model in Puppet. The API client
doesn’t support all possible changes, but you can disable backups, enable
IPv6 and switch on private networking as needed. Here’s a quick sample
of the output showing this in action.
Info: Loading facts
Notice: Compiled catalog for gareths-macbook.local in environment production in 0.43 seconds
Info: Applying configuration version '1449225401'
Info: Checking if droplet test-digitalocean exists
Info: Powering off droplet test-digitalocean
Info: Resizing droplet test-digitalocean
Info: Powering up droplet test-digitalocean
Notice: /Stage[main]/Main/Droplet[test-digitalocean]/size: size changed '1gb' to '512mb'
Error: Disabling IPv6 for test-digitalocean is not supported
Error: /Stage[main]/Main/Droplet[test-digitalocean]/ipv6: change from true to false failed: Disabling IPv6 for test-digitalocean is not supported
Error: Disabling private networking for test-digitalocean is not supported
Error: /Stage[main]/Main/Droplet[test-digitalocean]/private_networking: change from true to false failed: Disabling private networking for test-digitalocean is not supported
Info: Checking if droplet test-digitalocean-1 exists
Info: Created new droplet called test-digitalocean-1
Notice: /Stage[main]/Main/Droplet[test-digitalocean-1]/ensure: created
Info: Class[Main]: Unscheduling all events on Class[Main]
Notice: Applied catalog in 60.61 seconds
But why?
Describing your infrastructure at this level in code has several advantages:
- Having a shared model of your infrastructure in code allows for a discussion
around that model
- You can be confident in the model because of the idempotent nature of running
the code
- The use of code for this model allows for activities like code review, change
control based on pull requests, unit testing, user-created abstractions and more
- The use of Puppet means you can use it as above as a command line interface, or
run it periodically to enforce and report on the state of your infrastructure
- Puppet ecosystem tools like PuppetDB, Puppet Board or Puppet Enterprise mean you can
store data over time for later analysis
The module also acts as a reasonable example of a simple Puppet type and provider.
If you’re interested in extending Puppet for your own services this is hopefully a
good place to start understanding the API.
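To give a feel for the shape involved, here is a stripped-down, plain-Ruby analogue of the type/provider split. The class names are hypothetical and this deliberately avoids the real Puppet DSL (Puppet::Type.newtype and friends); it only illustrates the division of responsibilities:

```ruby
# Plain-Ruby analogue of Puppet's type/provider split (hypothetical,
# not the actual Puppet API). The resource declares desired state;
# the provider knows how to read and change the real system.

class DropletResource
  attr_reader :name, :state

  def initialize(name, state = :present)
    @name = name
    @state = state
  end
end

class FakeDropletProvider
  def initialize(api_state)
    @api = api_state # stands in for the DigitalOcean API client
  end

  def exists?(resource)
    @api.include?(resource.name)
  end

  def create(resource)
    @api << resource.name
  end

  def destroy(resource)
    @api.delete(resource.name)
  end

  # The sync step run for each resource: compare desired state
  # with observed state and act only on the difference.
  def sync(resource)
    case resource.state
    when :present then create(resource) unless exists?(resource)
    when :absent  then destroy(resource) if exists?(resource)
    end
  end
end

api = []
provider = FakeDropletProvider.new(api)
provider.sync(DropletResource.new('test-digitalocean'))
```

The real provider implements the same exists?/create/destroy contract, with Puppet supplying the catalog plumbing around it.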
Sep 20, 2015 · 4 minute read
I was attending the first GOTO London conference last week, in particular the Rugged Track. One of the topics of conversation that came up was unikernels, and their potential for improving the state of software security. Unikernels are pretty new outside research groups, I’m just lucky enough to live and work in Cambridge where some of that research is happening. The security advantages of unikernels are one of the things that attracted me in the first place. I thought it might be interesting to jot a few of those down for other people interested in security and the future of infrastructure.
As with my last post, it’s worth having a basic understanding of unikernels. I’d recommend reading Unikernels - the rise of the virtual library operating system.
Hypervisor
Every unikernel gets its isolation guarantees from a hypervisor. Not only are these guarantees reasonably well understood, they tend to make use of hardware features too. It’s interesting to note that recent container runtime work is heading in this direction too, with projects like Clear Containers from Intel, Bonneville from VMware and the new stage1 in rkt.
No User Space
With a typical server OS we have kernel space and user space. Part of the idea here is to ensure the underlying machine doesn’t crash, whatever horrible things people do in user space. But this means you can do horrible things. The unikernel model is similar to the Erlang philosophy of let it crash. You only have kernel space; your entire application resides in it. Most things out of the ordinary are going to crash the kernel. This makes the sort of exploratory testing useful in exploit development much harder.
Really Immutable Infrastructure
People often talk about immutable infrastructure. I’d wager there is more talk than reality however. When pressed, people will often admit they are not using read-only file systems and that they retain the capability to log in to machines to make ad-hoc changes. What they mean by immutable is that they only change machines at deploy time. This ignores both the fact they have the technical capability to change them anytime, and that an attacker could change them outside that deployment cycle. With unikernel systems there is often just the compiled kernel; you can’t just change files on disk. The defaults force an immutable way of working.
Clean Slate TLS
As a typical developer or operator you’ve probably learned more than you wanted to know about the OpenSSL source code. It’s not a well-understood codebase, is unlikely to become one anytime soon, and has had some pretty spectacular bugs like Heartbleed. The Core Infrastructure Initiative is laudable and will improve things but it’s still a problematic codebase. Functional programming is often regarded as an easier way of writing understandable code. Types are a good thing, especially when it comes to security systems. So a pure OCaml TLS implementation as used by MirageOS makes sense on lots of levels. Yes this is quite an undertaking, but the bitcoin pinata tests show promise.
Knowing whether an application really does exactly what you want it to do (and no more) is a hard problem to solve. Unit tests and other forms of automated testing help, but are still reliant on people to both write and design the tests. A formal proof system can provide much stronger guarantees of correctness; it’s an approach used in some cases for mission-critical components of Amazon’s AWS. MirageOS is implemented in OCaml. One of the best-known programs written in OCaml is Coq, which just so happens to be a formal proof management system. I’ve not seen many examples yet of this approach, probably due to the effort involved, but the capability is there for building formally specified unikernels. I’d wager a similar thing is possible with Haskell and HalVM. Making that easier to do for typical developers could open up much more secure development practices for certain use cases.