Jesper Terkelsen presents what has happened around infrastructure, developer productivity, security, site reliability engineering, and scale at Tradeshift over the last four years.
2. | Confidential
DevOps at Tradeshift
A short history of how things evolved over the last few years (2015-2019)
Jesper Terkelsen, VP Global Platform Operations
jnt@tradeshift.com
4. | Confidential
Tradeshift is the global business commerce company
Delivering an end-to-end supply-chain buying experience in the cloud, everywhere in the world. Covering five continents and more than 190 countries.
(World map legend: office locations; supplier density)
5. | Confidential
Tradeshift by the numbers
● Real network of 1,500,000 companies connected on the platform
● $500B in transaction volume
● 500+ Enterprise P2P customers, all migrating to the next-gen cloud platform
● 30M SKUs on the Tradeshift Marketplace and procurement solution
● 1,000 employees, 42 nationalities
● Global presence: US, Europe, APAC
7. | Confidential
What is DevOps in Tradeshift?
• Delivering global, always-on SaaS to Enterprise customers that are used to on-prem or managed hosting
• With a rapidly expanding feature footprint
• In a rapidly growing organization
• To a rapidly growing user base
• With rapidly growing usage
• While enabling 3rd parties to build on top of the platform
• … and not grow cost at the same rate ...
9. | Confidential
2015 and earlier
• Organization growth was mostly in engineering, not in operations.
• Lots of focus on product features; scale was less of a concern
• About 70 engineers
• Small DevOps team
• 3 FTE in Copenhagen
• 1 FTE in San Francisco
• Partially automated infrastructure
10. | Confidential
Challenges in late 2015
Engineering Growth
• Lots of new engineering teams want:
• a new microservice in production today
• a new version of a component out today
• Demo environments were a scarce resource (the answer used to be "we will work for 4 months on this release")

Other
• We needed to get a lot of new certifications to be able to operate in the market.
• Enable engineering to operate independently from operations.
11. | Confidential
Challenges in late 2015
• We expected to grow engineering headcount by 200%
• To around 300 people
• We were about to expand into China.

Forecast was amazing
• We expected to grow the number of instances by 2-3 orders of magnitude:
• Clustering for availability: 2-3x
• Introducing new services: 3-5x
• Hosting in more data centers: 4-5x
• Scaling up storage: 5-10x
• We did not plan to grow operations headcount by the same amount
12. | Confidential
We wanted to
• Migrate all services to Docker (about 15 at the time).
• Introduce service discovery and clustering.
• Introduce zero-downtime deployments.
• Migrate all hosts to VPCs.
• Rewrite all of the automation code.
• Introduce tests for all infrastructure automation code.
• Upgrade OS versions.
• Improve monitoring.
• All while the system was running - since it's a cloud service.
13. | Confidential
First, let's agree on some values
We strove towards the following values:
● Never disrupt service.
● Speed matters.
● Self-service and horizontal ownership over “throw it over the wall”.
● Homogeneous operations over heterogeneous / many unique solutions.
● Code as documentation.
● Testability over one-off scripts.
● Everything is peer-reviewed.
● Clear ownership.
This made our design choices easier to argue about
14. | Confidential
1. Foundation
We started by building the foundation - base infrastructure for automation.
• Own internal certificate authority (CA)
• End-to-end encryption
• LDAP for authentication
• Puppet servers
• Private code registries
• Terraform - infrastructure templating
• And a current Ubuntu base role
15. | Confidential
2. Design a global network
• Build an internal CIDR IP allocation scheme for all possible future data centers (one possible scheme is sketched below)
• Support for:
• Site-to-site network VPNs
• Public encrypted channels based on TLS with mutual authentication
• Use both private and public subnets
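The deck does not include the allocation scheme itself; below is a minimal sketch of what a deterministic plan could look like with Python's ipaddress module. The 10.0.0.0/8 super-block, the region list, and the /16-per-data-center split are illustrative assumptions, not Tradeshift's actual plan.

```python
# Hypothetical sketch of a deterministic CIDR allocation scheme.
# The super-block, region names and block sizes are illustrative only.
import ipaddress

SUPERNET = ipaddress.ip_network("10.0.0.0/8")

# Reserve one /16 per (possible future) data center, in a fixed order so
# allocations never move once assigned.
DATA_CENTERS = ["eu-west-1", "us-east-1", "cn-north-1", "ap-southeast-1"]

def dc_block(dc: str) -> ipaddress.IPv4Network:
    """Return the /16 reserved for a data center."""
    index = DATA_CENTERS.index(dc)
    return list(SUPERNET.subnets(new_prefix=16))[index]

def subnets(dc: str):
    """Split a data-center block into private and public /20 subnets."""
    block = dc_block(dc)
    parts = list(block.subnets(new_prefix=20))
    return {"private": parts[:8], "public": parts[8:]}

if __name__ == "__main__":
    for dc in DATA_CENTERS:
        plan = subnets(dc)
        print(dc, dc_block(dc), "first private:", plan["private"][0],
              "first public:", plan["public"][0])
```

Because the order of DATA_CENTERS never changes, every region keeps the same block forever, which is what makes site-to-site VPN routing and firewall rules predictable.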
16. | Confidential
3. Migrate databases
• SQL and NoSQL databases
• Use AWS ClassicLink for network connectivity
• For SQL databases we had to upgrade the version first
• Then did live streaming replication to a new slave, and then promoted that as master (a pre-promotion catch-up check is sketched below)
• 10 PostgreSQL databases, where the largest one was about 1 TB
• For NoSQL databases we migrated one node at a time
• 300 TB Elasticsearch cluster
• 1.5 PB Riak cluster
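The cutover relied on PostgreSQL streaming replication; purely as an illustration, here is the kind of catch-up check you could run against the new replica before stopping writes and promoting it. The DSN and lag threshold are hypothetical, and the promotion itself (e.g. pg_ctl promote) is not shown.

```python
# Illustrative sketch only: wait until a streaming replica has (nearly)
# caught up before promoting it. Host names and the lag threshold are
# hypothetical; the actual promotion step is not shown.
import time
import psycopg2

REPLICA_DSN = "host=replica.example.internal dbname=postgres user=monitor"
MAX_LAG_SECONDS = 5

def replication_lag_seconds(conn):
    """Seconds since the last transaction was replayed on the standby."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())"
        )
        (lag,) = cur.fetchone()
        return float(lag or 0.0)

def wait_for_catch_up():
    conn = psycopg2.connect(REPLICA_DSN)
    try:
        while True:
            lag = replication_lag_seconds(conn)
            print(f"replica lag: {lag:.1f}s")
            if lag <= MAX_LAG_SECONDS:
                print("replica is caught up - safe to stop writes and promote")
                return
            time.sleep(2)
    finally:
        conn.close()

if __name__ == "__main__":
    wait_for_catch_up()
```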
17. | Confidential
4. Migrate services
• We added service discovery and routing
• Better leader election
• Put everything in Docker
• Services were migrated:
• Clusterable services were migrated during uptime (one possible cutover gate is sketched below)
• Non-clustered services were migrated during a maintenance window in the weekend
• Load balancers were migrated
• We had to announce IP changes to customers, because AWS does not run the same subnets for Classic and VPC
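How cutovers were gated is not described in the deck; as one possible approach, here is a hypothetical sketch that waits for a new node's health endpoint to answer OK several times in a row before traffic is moved to it. The URL, retry counts and timeouts are made up.

```python
# Hypothetical sketch: only move traffic to a newly migrated node after its
# health endpoint has answered OK a few times in a row. The endpoint URL and
# thresholds are illustrative, not Tradeshift's actual values.
import time
import requests

HEALTH_URL = "http://new-node.internal:8080/health"
REQUIRED_SUCCESSES = 5

def wait_until_healthy(url: str, required: int = REQUIRED_SUCCESSES) -> bool:
    streak = 0
    for _ in range(60):                      # give up after roughly 2 minutes
        try:
            resp = requests.get(url, timeout=2)
            streak = streak + 1 if resp.status_code == 200 else 0
        except requests.RequestException:
            streak = 0
        if streak >= required:
            return True
        time.sleep(2)
    return False

if __name__ == "__main__":
    if wait_until_healthy(HEALTH_URL):
        print("new node healthy - safe to add it to the load balancer")
    else:
        print("new node never became healthy - keep traffic on the old node")
```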
18. | Confidential
5. Replace build pipeline
• Use the same infrastructure automation as for production
• QA methodology in Tradeshift relies heavily on automated tests - and every team has to own their own tests - no "throwing over the wall"
• Runs our 700+ UI end-to-end flows
• Introduce consumer-driven tests (a minimal example is sketched below)
• Push more tests upstream
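The deck does not show what the consumer-driven tests look like; as a minimal, hypothetical illustration in plain pytest style, a consumer-published contract can be replayed against a provider endpoint in the pipeline. The /v1/documents endpoint and the fields below are invented for the example.

```python
# Minimal illustration of a consumer-driven contract check, written as a
# plain pytest test. The contract content and the /v1/documents endpoint are
# invented for the example; Tradeshift's real setup is not shown here.
import requests

# The "contract" the consumer team publishes: which fields it depends on.
DOCUMENT_CONTRACT = {
    "required_fields": {"id", "state", "issued_at"},
    "allowed_states": {"DRAFT", "SENT", "ACCEPTED"},
}

PROVIDER_URL = "http://localhost:8080/v1/documents/sample"

def test_document_response_honours_consumer_contract():
    resp = requests.get(PROVIDER_URL, timeout=5)
    assert resp.status_code == 200

    body = resp.json()
    missing = DOCUMENT_CONTRACT["required_fields"] - set(body)
    assert not missing, f"provider dropped fields the consumer needs: {missing}"
    assert body["state"] in DOCUMENT_CONTRACT["allowed_states"]
```

The point of "pushing tests upstream" is that the provider's pipeline fails on this test before a breaking change ever reaches the consumer's environment.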
19. | Confidential
6. Replace demo sandboxes
• Make sandboxes and demo stacks - on demand
• Tooling for sizing, scope, clustering, public/private
• No more fighting over who can use an environment
• They only run as long as teams need them: lifetimes range from hours to months (a cleanup sketch follows below)
• Automate data creation
• Useful for demo storyboards
• Useful for performance tests
• Useful for automated tests
• Promotes data generation
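The deck does not show how stacks are torn down when teams are done with them; as a hedged sketch, assuming (hypothetically) that every sandbox instance carries "stack" and ISO-8601 "expires-at" tags, a reaper could look like this with boto3.

```python
# Illustrative sketch of a TTL reaper for on-demand demo stacks, assuming
# (hypothetically) that every instance is tagged with an ISO-8601
# "expires-at" timestamp. This is not Tradeshift's actual tooling.
from datetime import datetime, timezone
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

def expired_instance_ids(now=None):
    now = now or datetime.now(timezone.utc)
    expired = []
    paginator = ec2.get_paginator("describe_instances")
    for page in paginator.paginate(
        Filters=[{"Name": "tag-key", "Values": ["expires-at"]},
                 {"Name": "instance-state-name", "Values": ["running"]}]
    ):
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
                expires_at = datetime.fromisoformat(tags["expires-at"])
                if expires_at.tzinfo is None:          # treat naive stamps as UTC
                    expires_at = expires_at.replace(tzinfo=timezone.utc)
                if expires_at <= now:
                    expired.append(instance["InstanceId"])
    return expired

if __name__ == "__main__":
    doomed = expired_instance_ids()
    if doomed:
        ec2.terminate_instances(InstanceIds=doomed)
        print(f"terminated {len(doomed)} expired sandbox instances")
```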
20. | Confidential
7. Optimize cost a bit
• Roughly 20 environments are created/destroyed daily
• Spot instances are used for temp stacks as well as build and data-processing slaves (a minimal spot request is sketched below)
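As a minimal, hypothetical sketch of requesting spot capacity for a temporary build slave with boto3; the AMI ID, instance type, subnet, and maximum price are placeholders, not real values.

```python
# Minimal, illustrative spot request for a temporary build slave. The AMI id,
# instance type, price and subnet are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

response = ec2.request_spot_instances(
    InstanceCount=1,
    SpotPrice="0.10",                       # max price in USD/hour
    Type="one-time",                        # no need to re-launch build slaves
    LaunchSpecification={
        "ImageId": "ami-0123456789abcdef0",
        "InstanceType": "c5.xlarge",
        "SubnetId": "subnet-0123456789abcdef0",
    },
)

for request in response["SpotInstanceRequests"]:
    print("spot request:", request["SpotInstanceRequestId"], request["State"])
```

One-time requests fit this use case because a build or data-processing slave that gets reclaimed can simply be re-requested by the next job.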
21. | Confidential
8. Container adoption
• "Puppet and Terraform are fine, but I need my new Docker image deployed in hours not days, since I only spent a few hours scaffolding it up and coding the microservice" - ML team member
• In 2018 we rolled out Kubernetes in all test environments and production; we currently run 30% of our services as flexible containers
• This can be challenging for our values:
• Homogeneous systems? Managed services mixed with K8s
• PCI compliance?
• Very good for infra as code with Helm (a small cluster-inspection sketch follows below)
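The Helm charts themselves are not shown in the deck; as a rough, hypothetical illustration of treating cluster state as something you can inspect programmatically, this sketch uses the official Kubernetes Python client to report rollout state for deployments in an assumed "services" namespace.

```python
# Rough illustration using the official Kubernetes Python client: report
# which deployments in a hypothetical "services" namespace are fully rolled
# out. Not Tradeshift's actual tooling, which is Helm-based.
from kubernetes import client, config

def deployment_status(namespace: str = "services"):
    config.load_kube_config()               # or load_incluster_config() in-cluster
    apps = client.AppsV1Api()
    for dep in apps.list_namespaced_deployment(namespace).items:
        desired = dep.spec.replicas or 0
        ready = dep.status.ready_replicas or 0
        state = "OK" if ready >= desired else "ROLLING"
        print(f"{dep.metadata.name:40s} {ready}/{desired} {state}")

if __name__ == "__main__":
    deployment_status()
```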
23. | Confidential
Why do we automate?
Infrastructure as code
• DevOps work should be like development work
• "Test-driven operations"
• Treat infrastructure code as regular code
• Write tests for the Puppet code and configuration (a Python-flavoured example follows below)
• Have code reviews within the team
Benefits
● Recover faster from incidents
● Fewer people can manage more servers (5 people, 5000+ servers)
● Less human error
● More transparency into what is on the servers
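The team's own tests are written in Ruby against the Puppet code (see the numbers on the next slide); purely to keep one language across the examples in this write-up, here is the same "test-driven operations" idea sketched with Python's testinfra. The package, service, and file names are invented.

```python
# Sketch of "test-driven operations" using testinfra (Python), as a stand-in
# for the actual Ruby tests against the Puppet code. The package, service
# and file names below are invented examples.
#
# Run with the testinfra pytest plugin, e.g.:
#   pytest --hosts=ssh://some-host test_baseline.py

def test_ntp_is_installed_and_running(host):
    ntp = host.package("ntp")
    assert ntp.is_installed
    assert host.service("ntp").is_running
    assert host.service("ntp").is_enabled

def test_ssh_disallows_password_logins(host):
    sshd = host.file("/etc/ssh/sshd_config")
    assert sshd.exists
    assert sshd.contains("PasswordAuthentication no")
```

Whether written in Ruby or Python, the value is the same: the expected state of a server is asserted in version control, so changes to the automation code can be reviewed and regression-tested like any other code.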
24. | Confidential
Ability to scale fast
● We grew the number of versioned services from ~15 to ~120
● We increased our rate of deployment to prod
a. From once a week to 15 times a day
b. This distributes risk and shrinks the potential blast radius
● We now have 6000 unit tests for our Puppet code, written in Ruby, which is about 60% code coverage
a. This allows us to change the Puppet code faster and with far more confidence
● The number of virtual machines is now above 5000 across all environments, and varies a lot from day to day
● Containers shortened the commit-to-production delay (for new services) even further
25. | Confidential
Self-service engineering
• All engineering teams in Tradeshift can write automation code and test it on our AWS test account
• Introducing a new service in prod is only a code-review exercise from operations
• Releasing new versions in production is fully automated
• The productivity teams maintain the tools
• Engineers are granted access to logs and error-collection tools, which allows teams to always show metrics near their desks
26. | Confidential
Automated Security
“We don't really need an army of people patching servers, or following
human compliance processes, if we can automate the whole thing”
We can then focus human time on improving actual security
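The slide does not show the automation itself; as one small, generic building block, here is a sketch that reports pending Ubuntu security updates so patch compliance can be monitored instead of chased by hand. The command and exit-code convention are generic, not Tradeshift-specific.

```python
# Small illustration of automated patch compliance on Ubuntu: count pending
# security updates via apt and fail loudly if any exist. Generic example,
# not Tradeshift's actual compliance pipeline.
import subprocess
import sys

def pending_security_updates():
    result = subprocess.run(
        ["apt", "list", "--upgradable"],
        capture_output=True, text=True, check=True,
    )
    # Lines for the security pocket look like "pkg/bionic-security ...".
    return [line for line in result.stdout.splitlines() if "-security" in line]

if __name__ == "__main__":
    pending = pending_security_updates()
    if pending:
        print(f"{len(pending)} security updates pending:")
        for line in pending:
            print("  " + line)
        sys.exit(1)                         # let monitoring/compliance pick this up
    print("no pending security updates")
```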
27. | Confidential
AWS China rollout
We provisioned
• 1 full production environment
• 5125 new lines of Terraform template code
• 1097 new lines of Hiera configuration
• In 120 pull requests

Using
• A team of 5 people
• In 7 working days
• 40k lines of existing Puppet code
• A similar amount of existing Terraform code
36. | Confidential
Current tech scale
● ~5000 VMs on AWS (daily average, across all envs)
● ~15 releases/day
● ~120 services, 45 running in K8s (1-4 added per month)
● 2-3M daily business transactions
● 58 developer teams
● Hosting in US, Europe, China
37. | Confidential
Operations/Productivity Teams 2019
(Org chart: teams grouped under Developer Productivity, Site Reliability Engineering, and Platform Infrastructure - among them Toolchain, Stacks, SRE, SRE - Århus, Compute, Compute China, Containers, Storage, Data, App Frameworks, Backend, and Dev Support.)
38. | Confidential
The future
• Even larger engineering organization
• We are currently 350 in engineering - 58 developer teams
• We expect to grow engineering 100%-150% in 2019 and more in 2020
• Even more automation
• Immutable infrastructure
• In-product testing: canary deploys, red-green deploys, improved A/B and feature testing
• Even more frequent deployments
• We do roughly 10-15 deploys a day today; we want this to grow 10x
• More security certifications
• 20x more microservices
We are looking to consumer global-scale SaaS companies for inspiration (Google, Uber, LinkedIn, Facebook, Twitter, etc.) - for processes as well as technology.