A real world story of the technical and organisational challenges of a start-up becoming a public cloud provider using open source technologies. Where we went wrong, where we went right and where we’ve still got to go.
Intros
Dariush Marsh-Mossadeghi
Sean Handley
on from the subs bench, a little rough round the edges
This is our journey, sometimes it feels a bit like this
What we do/who we are
lots of talk and hype about moving your applications to the cloud focussed on
someone has to build actual, real, nuts and bolts infrastructure
well funded startup, not a big telco
Vanilla IaaS, nothing that smells like vendor lock-in
What we’re here to talk about
not so much about OpenStack and Ceph
more about all the stuff that sits around it
What’s it going to do?
OpenStack has lots of moving parts
A sysadmin can stand up OpenStack…
But we need to be confident we can make it scale
And we need to add
logging
billing
monitoring & alerting
but avoid the ‘embrace and extend’ strategies of some of the bigger players
What kind of organisation did we set out to build?
So we got some VC funding, no small feat in its own right
Let’s get on with it…
We’re going to do it right! Doesn’t everyone start out that way?
We’ve all worked in siloed, low trust, command and control bureaucracies, not good on many levels
also committed to building not just a successful company, but also one that is:
engaged with and supportive of the wider community
collaborates with its partners and customers
standing on the shoulders of open source giants
So we want
Not too much process
Short feedback loops
An iterative approach, not just in development but also in the non-technical aspects of the business
People
err… we need some people. A few good men (and women) should get us started, eh?
can’t afford recruitment fees
Started trawling LinkedIn and exercising our black books
How do we compete with organisations with much deeper pockets than ours?
keep coming back to the same 8-10 names and they all seem to be working for people much bigger than us
We need T shaped people, generalists with a speciality
(next slide)
People
But more than that, we need M shaped people, generalists with not just one, but 2 or 3 specialisms
but also Interns
fresh and optimistic, not just cheap labour, not cynical after 20 years in tech
this is about growing the skills in an emerging sector
Recurring themes of the past year
When I look back over the last year I can’t ignore recurring conversations
The tension/balance between
build it and they will come
being market led
being opportunity led
How to model it commercially
Customers with varying needs, coming from different places
Balance between operational and capital expense is quite specific in each case
resource pools vs instances
Making best use of available capital
debt is incurred in many ways, sometimes technical, sometimes fiscal
our first iteration of the platform was built on high quality 2nd hand hardware
Roadmaps
What services we can provide now vs later
expectation management
Now onto some of the technical choices we’ve made…
This is an actual photo of our first production platform, not a stock photo!!
Hardware
well, it’s x86_64
PXE & IPMI are our two basic criteria
A slight digression, you have to build good relationships with your suppliers, bring them along on your journey
don’t underestimate how time consuming this can be
Networks
definitely emerging… watch this space
handover to Sean
What are the technical challenges we face as a small engineering team?
Large number of machines to maintain
Compute, storage, controller nodes, database nodes, etc
Runs to the hundreds very quickly
We need to be able to control this herd effectively
We also need to be able to see what it’s up to
This leads neatly to 3 main goals:
Goal 1: Automate all the things
Nobody should have to do anything manually
All config should be write-once, apply ad infinitum
(new slide) Building and provisioning should be highly automated
Plug in a new server and go
Inevitably, some tasks need human interaction
Reduce friction as much as possible
Goal 2: Keep it Transparent
Use tools that call you back with notifications.
(next slide) Never chase system status. In a scalable system, status chases you.
Use tooling that makes it not only easy to share information with colleagues, but inevitable. Sharing should be the default behaviour.
This way changes are less likely to cause surprise later on...
Goal 3: Stay out of technical debt
Try to never get into technical debt if you can
Inevitably, you will compromise somewhere:
Keep track of where you think your compromises are
Re-assess these technical debts every week and pay them down when you can
This needs to be fed upwards to management, so they’re aware of how important it is
This feeds into using iterative, collaborative ways of working
Onwards to tooling...
Here are some of the tools we’re currently using
I’m sure you’ve used some of these, or at least heard of them
The hardest thing to do well, we’ve found, is monitoring the platform, so let’s start by looking at that.
Logging, Monitoring and Alerting
So… how do you keep track of what your machines are doing?
How do you stop nasty surprises before they happen?
Often, failures happen partially for a long time before they explode fully
Central part of this: logging
We use logstash to make sense of logs and get them in a consistent format, which is nice
We also use Elasticsearch and Kibana to easily search through log data
To keep things simple we used default email alerting to begin with
Seemed reasonable at the time
Null mailer (horror) story
During development, we found that when a serious problem happens, it tends to affect many systems at once.
e.g. a network issue.
Every time logstash matched a log file message, it would email, leading to many duplicates.
Using null mailer, we naively forwarded these alerts to the devops team
We use Gmail to host our e-mail, and this throttles messages when too many come through at once
To clear the throttling, we need a quiet period, unfortunately…
If null mailer fails to send, it creates its own logging failure messages - so we had a perfect storm
(next slide) Must be a better way...
Riemann and Hipchat
We’re beginning to use Riemann to improve the signal-to-noise ratio of our alerts
(next slide) Uses Clojure rules to intelligently handle a stream of log messages
Does nice things, like rolling up duplicates
We can then produce alert messages cleanly into our comms system, Hipchat - more on Hipchat later
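To make "rolling up duplicates" concrete, here is a minimal sketch of the idea in Python. It is not Riemann's API (Riemann rules are written in Clojure); the function name `roll_up`, the 60 second window and the event shape are all assumptions made for illustration.

```python
def roll_up(events, window=60.0):
    """Collapse repeated messages into one alert per time window.

    `events` is an iterable of (timestamp_seconds, message) pairs, assumed
    to be sorted by timestamp. A message that repeats within `window`
    seconds of the alert's first sighting only bumps that alert's count.
    """
    latest = {}   # message -> most recent alert opened for that message
    alerts = []   # alerts in the order they were first raised
    for timestamp, message in events:
        alert = latest.get(message)
        if alert is not None and timestamp - alert["first_seen"] <= window:
            alert["count"] += 1          # duplicate: roll it up
        else:
            alert = {"message": message, "first_seen": timestamp, "count": 1}
            latest[message] = alert      # new alert, or the window expired
            alerts.append(alert)
    return alerts


# Hypothetical burst of log matches during a network problem.
events = [
    (0.0, "link down"),
    (1.0, "link down"),
    (2.0, "disk full"),
    (30.0, "link down"),
    (100.0, "link down"),
]
alerts = roll_up(events, window=60.0)
```

Five matched log lines become three alerts here, and the first one carries a count of 3 rather than triggering three separate e-mails.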
Next, Puppet
Getting a new node online - as automated as possible
Foreman + Puppet
Foreman is for node role classification
Discovers new nodes
Ensures a consistent BIOS, boots a shared image, installs OS
When a node is built, it hands over to Puppet for provisioning
Puppet
At the heart of what we do
We’ve written more Puppet config than anything else
Does the provisioning via configuration manifests that are stored in git
Profiles and roles with puppet (LAMP example)
Puppet modules have been very useful in solving common problems
We often have to fork and modify these and push changes upstream
Git and GitHub
All our code goes into git
Makes it very, very easy to see changes
GitHub adds a great layer of visibility on top of this
Branching and pull requests (next slide)
Hiera and Security
Separation of data from configuration logic with Hiera and puppet variables
Issues with sensitive data in Hiera
Encryption on the fly
We use a git hook on the client that encrypts the Hiera data
Once pushed to GitHub, the data is in an encrypted state
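The separation of data from configuration logic can be sketched as a first-match lookup through a hierarchy, most specific level first. This is an illustration in Python of the general idea, not Hiera's implementation; the level names, keys and values below are all hypothetical.

```python
def lookup(key, hierarchy, data):
    """Return the value for `key` from the first level that defines it.

    `hierarchy` lists level names from most to least specific; `data`
    maps each level name to a dict of settings.
    """
    for level in hierarchy:
        values = data.get(level, {})
        if key in values:
            return values[key]
    raise KeyError(key)


# Hypothetical data: node-specific values override role values,
# which override the common defaults.
DATA = {
    "nodes/db01": {"ntp_server": "ntp.internal.example"},
    "roles/database": {"max_connections": 500},
    "common": {"ntp_server": "pool.ntp.org", "max_connections": 100},
}
HIERARCHY = ["nodes/db01", "roles/database", "common"]

ntp_server = lookup("ntp_server", HIERARCHY, DATA)
max_connections = lookup("max_connections", HIERARCHY, DATA)
```

The configuration logic only ever asks for a key; which value it gets depends on where the node sits in the hierarchy. That is also why the data files are the ones that end up holding secrets and need encrypting.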
JIRA and Confluence
We use JIRA to manage and prioritise our backlog of tasks
and plan units of work
We use Confluence to host our internal company documentation
A little shared documentation goes a long way
These are paid, SaaS solutions.
The most important thing is that you can manage a backlog of work and documentation, somehow
Hipchat
This is basically our mission control
More than just a chat program
Easily pluggable via web hooks to receive notifications from other systems
(next slide) Easy at a glance to see what the platform is doing, and what your colleagues are working on
Also makes remote working easier
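As a rough illustration of what "pluggable via web hooks" means in practice, the sketch below builds the kind of HTTP POST another system might send to a chat room. It uses only the Python standard library; the URL, the JSON field names and the function `build_notification` are assumptions for illustration and are not the Hipchat API.

```python
import json
import urllib.request


def build_notification(url, room, message, level="info"):
    """Build (but do not send) a JSON webhook POST request."""
    payload = json.dumps(
        {"room": room, "message": message, "level": level}
    ).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical endpoint and message.
request = build_notification(
    "https://chat.example.invalid/hooks/platform",
    room="platform",
    message="puppet run failed on compute-12",
    level="error",
)
# urllib.request.urlopen(request) would deliver it; it is left out here
# so the sketch runs without network access.
```

Any tool that can make an HTTP request can report into the room this way, which is what makes sharing the default rather than an extra step.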
So with this toolset, we can massively automate the lifecycle and management of our systems
And make it highly visible… back to Dariush
In closing…
Talked a lot about challenges
With a careful choice of open source tools and a small number of good people, you can stand on the shoulders of giants and deliver a relatively mature public cloud platform
There are many moving parts, you have to keep it as simple as possible
Hope it’s been useful
Q&A