1. © 2020 Nokia1
Migrating a build farm
from on-prem to AWS
Claes Buckwalter
04-02-2020
14:50–15:40 in B.2.009 at
CfgMgmtCamp Ghent 2020
https://cfp.cfgmgmtcamp.be/2020/talk/EGJCKT/
Public
2. © 2020 Nokia2 Public
About me
• A Swede living in Antwerp, Belgium
• MSc degree in Media Technology from Linköping University, Sweden
• Identify as a software developer
• Have always had an interest in developer tooling, release engineering and automation
• Have attended every CfgMgmtCamp organized in Ghent since the start
• In my spare time I enjoy rock climbing and bouldering
• At Nokia I manage a globally distributed team that is responsible for the centralized build
infrastructure of Nokia Software, Nokia's software division
3. © 2020 Nokia3 Public
Our team
• We are 18 people
• We are a mix of software developers and systems administrators
• We are globally distributed with 2-3 team members at each site
• Chengdu, China
• Bangalore, India
• Kfar-Saba, Israel
• Tampere, Finland
• Bydgoszcz & Wroclaw, Poland
• Antwerp, Belgium
• Austin, TX
4. © 2020 Nokia4 Public
What our team does
• We run and support a collection of services branded Central CI
• Self-service Jenkins for CI pipelines that run on a Kubernetes cluster
• Self-service GitLab for version control and code review
• Artifactory for storing dependencies and produced release artifacts
• Our customers are Nokia Software's R&D and Services teams
5. © 2020 Nokia5
Public
Nokia Software Central CI (diagram)
Central CI services:
• Jenkins: CI pipelines
• GitLab: Version control
• Artifactory: Artifact storage
• Build cluster: Kubernetes-based
Related services:
• Gerrit
• JIRA
• Security scanning services
• Static code analysis services
• Test infra: PaaS, OpenStack, VMware, bare metal
6. © 2020 Nokia6 Public
Some Central CI stats
• 7500 active users
• 35k builds per day
• 7 Jenkins Masters
• Build cluster size
• 1200 vCPU
• 2.6 TB memory
• 72 TB scratch storage
• 150 TB of release artifacts
7. © 2020 Nokia7 Public
Motivation for
the migration
8. © 2020 Nokia8 Public
Before the migration
Our services were running on OpenStack, on two HPE C9000 blade
enclosures with a 3PAR storage array, in a Nokia datacenter
9. © 2020 Nokia9 Public
Limitations of our on-prem data center
• Each hardware rack was a separate OpenStack instance
• We had to do yearly migrations to new hardware racks because the OpenStack
distribution we were required to use did not support upgrades
• Long lead times to add new compute and storage
• Our compute needs fluctuate during the day and week
• The service-level objectives (SLOs) of the infrastructure were undefined
• We were an annoying snowflake customer in the data center
• We don't care about hardware; we just want APIs
• We were frustrated because we did not feel empowered
• Our customers were frustrated because we were slow to react to new capacity needs
10. © 2020 Nokia10 Public
The stars aligned
• There was a willingness from management to use more public cloud
• Nokia had recently integrated AWS with its corporate network
12. © 2020 Nokia12 Public
What we had to migrate
• Customer-facing services
• A fleet of 5 Jenkins Masters (grew to 7)
• A Kubernetes build cluster
• A large Artifactory instance
• Back-office services
• ELK
• Zabbix
• Prometheus
• Grafana
13. © 2020 Nokia13 Public
With as few interruptions as
possible for our customers
14. © 2020 Nokia14 Public
Migration timeline
• January — started a pilot on AWS
• March — decision made to migrate to AWS
• April — knowledge ramp-up and experiments
• May — started building our infrastructure on AWS
• June — started migrating R&D teams to Jenkins Masters on AWS
• August — finished migrating 99% of R&D teams
• September — migrated Artifactory
• October — migration "done"
15. © 2020 Nokia15
Public
High-level architecture on AWS (diagram)
• VPC with a public subnet (Nokia intranet, /27) and a private subnet (/22)
• Public subnet: load balancer, NAT Instance
• Private subnet: Jenkins Masters, Artifactory, build pods
• Incoming requests from Nokia enter through the load balancer
• Outgoing requests to Nokia leave through the NAT Instance
16. © 2020 Nokia16 Public
How we migrated
• Mostly a lift-and-shift
• We refactored our infrastructure provisioning to use Terraform instead of Ansible
• Provisioned hosts are still configured using Ansible
• We used AWS services when it made sense
• Relational Database Service (RDS)
• Elastic Kubernetes Services (EKS)
• S3 for Artifactory storage backend
• Application Load Balancer (ALB)
• Network Load Balancer (NLB)
• NAT Instances
• Elasticsearch
17. © 2020 Nokia17 Public
Constraints we had to deal with
• A centralized IT team manages all AWS accounts for the company and defines the rules
• Our AWS account's VPC is not accessible from the public Internet
• Services that are exposed on the public Internet cannot be used
• Elastic IPs cannot be used (see above)
• Access to the public Internet must go via a proxy on our corporate network
• Our AWS users are federated users managed in our corporate directory
• All users have the same access level
• AWS multi-factor authentication (MFA) cannot be used
• Our public subnet is limited to /26 (64 addresses)
• Split across two availability zones (AZs), so two /27 subnets (32 addresses each)
• NAT Gateways between private and public subnets cannot be used; must use NAT Instances
• Route 53 cannot be used (for now)
• The AWS Direct Connect link to our corporate network is 10 Gbps
18. © 2020 Nokia18 Public
Design decisions we made
• Use three separate AWS accounts: Staging, Production, Backup
• Run our infrastructure in a single availability zone (AZ) within a single region
• Not all our services are easy to run in high-availability mode
• Data transfer costs between AZs
• Complexity tradeoff
• Currently not a business requirement to use multiple AZs or regions
• Manage all AWS resources using Terraform
• All changes are code reviewed
• All changes are tested in Staging first
• Standardize on a few instance types
• Purchase Reserved Instances (RIs) for standard instance types to save costs
20. © 2020 Nokia20 Public
Choose your own adventure (pick a number)
Problems we ran into:
1. Outgoing network transfer speeds slowed down periodically
2. Automation can be dangerous
3. It is possible to overload the EKS control plane
4. You are using more IP addresses than you think
5. Even AWS has lead times for resource creation
6. The importance of Kubernetes pod resource requests
7. Dynamically auto-scaling a Kubernetes cluster is non-trivial
21. © 2020 Nokia21
1. Outgoing network transfer speeds
slowed down periodically
Public
22. © 2020 Nokia22 Public
1. Outgoing network transfer speeds slowed down periodically
• A few times a week our customers would report that data transfer speeds were slow
• At first we suspected the corporate network, the service the users were accessing, or the
Direct Connect link
• After a few days we realized that our NAT Instances were the bottleneck
23. © 2020 Nokia23 Public
1. Outgoing network transfer speeds slowed down periodically
• Our NAT Instances were instance type c5.large
• After reading the documentation for c5 instance types more carefully, we found:
Instance sizes that use the ENA and are documented with network performance of "Up to 10 Gbps" or "Up to 25
Gbps" use a network I/O credit mechanism to allocate network bandwidth to instances based on average
bandwidth utilization. These instances accrue credits when their network bandwidth is below their baseline limits, and
can use these credits when they perform network data transfers.
• Fix: resize our NAT Instances to c5.9xlarge
• Lessons learned:
• Understand the SLOs of the services you are using
• Look at the telemetry you are collecting
• Do your homework before you escalate and waste other people's time
https://aws.amazon.com/ec2/instance-types/c5/
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/compute-optimized-instances.html#compute-network-performance
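The credit mechanism quoted above can be sketched as follows. This is an illustrative model only: the 0.5 Gbps baseline is an assumption for the example (AWS does not publish exact per-size baselines), and the real accounting is more sophisticated than this per-sample loop.

```python
def simulate_credits(samples_gbps, baseline_gbps, burst_gbps, start_credits=0.0):
    """Return achieved throughput per sample under a credit scheme:
    below the baseline, unused bandwidth accrues as credits; above it,
    credits are spent to burst. With credits exhausted, throughput is
    clamped back toward the baseline -- the periodic slowdown we saw."""
    credits = start_credits
    achieved = []
    for wanted in samples_gbps:
        if wanted <= baseline_gbps:
            credits += baseline_gbps - wanted        # accrue while quiet
            achieved.append(wanted)
        else:
            need = min(wanted, burst_gbps) - baseline_gbps
            spend = min(need, credits)               # burst as far as credits allow
            credits -= spend
            achieved.append(baseline_gbps + spend)
    return achieved

# A c5.large-style "up to 10 Gbps" instance with an assumed 0.5 Gbps baseline:
# two quiet samples accrue credits, then sustained 10 Gbps demand collapses
# back to the baseline once the credits are gone.
print(simulate_credits([0.25, 0.25, 10.0, 10.0], 0.5, 10.0))
# [0.25, 0.25, 1.0, 0.5]
```

This is why the slowdowns were periodic rather than constant: quiet periods refilled the credits, so bursts worked for a while before collapsing.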
25. © 2020 Nokia25 Public
2. Automation can be dangerous
• During our Artifactory migration, we copied our 70 TB of Artifactory
data (files) to the root of an S3 bucket
• It turned out that Artifactory's S3 integration required that all objects
("files") be stored under a common prefix ("folder") in the S3 bucket
• We wrote a script to copy (S3 has no "move" operation) all objects to the
new common prefix "artifactory"
• We then wrote a script to delete the old files using prefix matching for
"0" to "f"
26. © 2020 Nokia26 Public
2. Automation can be dangerous
• The prefix "a" matched "artifactory" and deleted all our files
• Fix: transfer 70 TB of files again
• Lessons learned:
• Review all code, including one-off migration scripts
• We need to protect our S3 buckets and objects
• S3 MFA Delete is not supported for federated users; use IAM policies as a poor man's MFA
• Turn on Object Versioning in S3
• Replicate all important objects in the S3 bucket to a limited-access AWS account for data backups
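The bug reduces to plain string matching, which is exactly how S3 prefix listing works (keys are flat strings; there are no real folders). A minimal reproduction, with hypothetical keys:

```python
keys = [
    "a1b2c3d4",                 # old layout: objects at the bucket root
    "f9e8d7c6",
    "artifactory/a1b2c3d4",     # new layout: objects under the common prefix
    "artifactory/f9e8d7c6",
]

def keys_matching(keys, prefix):
    """S3 ListObjects-style prefix match: plain startswith, no path semantics."""
    return [k for k in keys if k.startswith(prefix)]

# Deleting prefixes "0".."f": the prefix "a" also matches every key under
# "artifactory/", not just old root-level keys starting with "a".
assert keys_matching(keys, "a") == [
    "a1b2c3d4", "artifactory/a1b2c3d4", "artifactory/f9e8d7c6"
]

def safe_delete_list(keys, prefix, keep_prefix="artifactory/"):
    """The guard the one-off script was missing: never touch the new layout."""
    return [k for k in keys if k.startswith(prefix)
            and not k.startswith(keep_prefix)]

assert safe_delete_list(keys, "a") == ["a1b2c3d4"]
```

The `keep_prefix` guard is our own after-the-fact illustration, not the script we actually ran; the point is that any prefix-based delete needs an explicit exclusion for data that must survive.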
28. © 2020 Nokia28
3. It is possible to overload the EKS control plane
Time (UTC) Event
14:40 We make a manual configuration change to our Jenkins Masters. We do not test the change in our staging environment first.
We do not know this yet, but the manual change caused Jenkins to write bad default values to all Kubernetes podTemplates.
All pods created from this point on have a misconfigured persistent volume claim (PVC) for the Jenkins workspace.
15:00 We notice that pods created by Jenkins Masters are not starting. The EKS control plane is accepting the pods but failing to
schedule them on an EKS worker node. The Jenkins Masters keep trying to create new pods and the EKS control plane keeps
trying to schedule them.
15:23 We identify a potential bug in the Jenkins Kubernetes Plugin that causes this behavior, confirm it in our staging environment,
and find a workaround.
15:53 We apply the workaround to production Jenkins Masters. Pods still do not start.
16:33 We create an AWS support case (Business support plan)
17:05 We get a call from AWS Support and investigate together
19:22 AWS Support escalates to the Service Team
19:23 We stop all but one Jenkins Master. Pods still do not start.
19:53 The Service Team observes that 5k pods are pending and begins scaling out the EKS control plane
20:00 Pods start scheduling again
20:42 Our services are fully recovered. Time to recovery: 6h02m
Public
29. © 2020 Nokia29 Public
3. It is possible to overload the EKS control plane
• Lessons learned:
• Always test in Staging first
• Always roll out changes gradually
• Monitor the EKS control plane so that you can detect when it is overloaded
• https://docs.datadoghq.com/integrations/kube_apiserver_metrics/
• EKS does not appear to auto-scale your EKS control plane; you may need to ask AWS Support to do
this if you think it is needed. Is this documented somewhere?
• Our Terraform code is not designed to allow adding a new EKS instance while keeping the existing
one
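The leading indicator of this outage was pods stuck in Pending. A sketch of the kind of check that would have caught it early; pods are modeled as plain dicts here, whereas in reality they would come from the Kubernetes API (e.g. `kubectl get pods -A --field-selector=status.phase=Pending`), and the thresholds are illustrative:

```python
from datetime import datetime, timedelta, timezone

def pending_too_long(pods, max_age=timedelta(minutes=5), now=None):
    """Return pods that have sat in Pending longer than max_age."""
    now = now or datetime.now(timezone.utc)
    return [
        p for p in pods
        if p["phase"] == "Pending" and now - p["created"] > max_age
    ]

now = datetime(2020, 2, 4, 15, 0, tzinfo=timezone.utc)
pods = [
    {"name": "build-1", "phase": "Running", "created": now - timedelta(minutes=30)},
    {"name": "build-2", "phase": "Pending", "created": now - timedelta(minutes=10)},
    {"name": "build-3", "phase": "Pending", "created": now - timedelta(seconds=30)},
]
stuck = pending_too_long(pods, now=now)
if len(stuck) > 100:  # threshold is a tuning knob; 5k pending took our cluster down
    print("ALERT: possible control-plane overload")
```

Freshly created pods are expected to be Pending briefly, which is why the check filters on age rather than phase alone.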
31. © 2020 Nokia31 Public
4. You are using more IP addresses than you think
• Elastic Load Balancing load balancers can fail to be created if there are not enough free IP addresses in the subnet
• It is recommended to have at least 8 free IP addresses
• https://aws.amazon.com/premiumsupport/knowledge-center/subnet-insufficient-ips/
• Each EKS pod gets its own IP address from an Elastic Network Interface (ENI) on the worker node, so every pod consumes an address from the subnet the node is attached to
• EKS maintains a "warm pool" of IP addresses for each worker node so that they can
quickly be assigned to new pods
• https://docs.aws.amazon.com/eks/latest/userguide/pod-networking.html
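A back-of-the-envelope for how many subnet addresses a single worker node can claim: each ENI carries one primary IP plus secondary IPs handed out to pods, and the VPC CNI pre-allocates a warm pool up to those limits. The per-instance-type ENI and IP counts come from the EC2 documentation; c5.4xlarge is used here as an example, so check the numbers for your own types.

```python
def max_pod_ips(enis, ipv4_per_eni):
    """Worst-case subnet IPs a node can consume for pods: every ENI fully
    populated with secondary IPs, excluding each ENI's primary address."""
    return enis * (ipv4_per_eni - 1)

# c5.4xlarge: 8 ENIs x 30 IPv4 addresses per ENI
print(max_pod_ips(8, 30))  # 232 pod IPs per node, worst case
```

A handful of such nodes can exhaust a small subnet even when far fewer pods are actually running, because the warm pool counts against the subnet too.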
32. © 2020 Nokia32 Public
4. You are using more IP addresses than you think
• Lessons learned:
• Carefully plan the size of your VPC's public and private subnets
• Expanding a subnet is not possible if the range is already in use
• Our public subnets have only /27 (32 addresses) so we run as little as possible there
• Because of the continuous growth in the number of concurrent builds (=pods) our Jenkins Masters
run, we are going to have to resize our private subnets within a few months
• Current private subnets are /22 (1024 addresses)
• We have a /21 range (2048 addresses) free, but we cannot easily expand the existing subnets
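The subnet arithmetic behind these numbers, using Python's stdlib. AWS reserves 5 addresses in every subnet (network address, VPC router, DNS, one reserved for future use, and broadcast), so the usable count is always the total minus 5:

```python
import ipaddress

def usable_aws_ips(cidr):
    """Usable addresses in an AWS subnet of the given CIDR block."""
    net = ipaddress.ip_network(cidr)
    return net.num_addresses - 5  # AWS reserves 5 addresses per subnet

print(usable_aws_ips("10.0.0.0/27"))  # 27 of 32 addresses usable
print(usable_aws_ips("10.0.0.0/22"))  # 1019 of 1024
print(usable_aws_ips("10.0.0.0/21"))  # 2043 of 2048
```

The example network 10.0.0.0 is arbitrary; only the prefix length matters for the count.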
34. © 2020 Nokia34 Public
5. Even AWS has lead times for resource creation
• AWS accounts have default limits for each region
• We have had to raise our limits for
• Number of running instances
• Number of Security Groups per Interface
• Rules per VPC Security Group
• Lessons learned:
• Check your account's default limits: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-resource-limits.html
• Plan ahead if you are going to need a lot more resources than you are currently using
• It can take 1-4 hours for a limit increase request to be approved
• It can take up to 30 minutes for an approved increase to become active
• When creating large instances (e.g. c5.18xlarge) you may have to wait hours for capacity to
become available
36. © 2020 Nokia36 Public
6. The importance of Kubernetes pod resource requests
• We initially allowed Jenkins to create build pods that did not have resource requests for CPU and memory
37. © 2020 Nokia37 Public
6. The importance of Kubernetes pod resource requests
• If a pod does not have resource requests, the Kubernetes scheduler has to guess which worker node the pod will fit on
• This results in randomly failing pods
• You will get failed Jenkins builds with ChannelClosedException stacktraces in the console output
• Lessons learned:
• To minimize the likelihood of out-of-memory-killed and severely CPU-throttled pods, all pods must have resource requests for CPU and memory
• It is a tradeoff between stability and maximizing resource utilization
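A sketch of the container resources we now require on every build pod, built as the dict a podTemplate would serialize to. The request and limit values are illustrative, not the ones our teams actually use:

```python
def build_container_spec(name, image, cpu_request="1", mem_request="2Gi",
                         cpu_limit="2", mem_limit="4Gi"):
    """Container spec fragment with explicit requests, so the scheduler can
    bin-pack instead of guessing, plus limits to cap noisy builds."""
    return {
        "name": name,
        "image": image,
        "resources": {
            "requests": {"cpu": cpu_request, "memory": mem_request},
            "limits": {"cpu": cpu_limit, "memory": mem_limit},
        },
    }

spec = build_container_spec("maven", "maven:3-jdk-11")
```

The request is what the scheduler reserves on a node; the limit is where the kernel starts throttling or killing. Setting requests below limits trades some stability for utilization, which is the tradeoff mentioned above.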
38. © 2020 Nokia38 Public
6. The importance of Kubernetes pod resource requests
39. © 2020 Nokia39
7. Dynamically auto-scaling a
Kubernetes cluster is non-trivial
Public
40. © 2020 Nokia40 Public
7. Dynamically auto-scaling a Kubernetes cluster is non-trivial
• Customers complain if their builds do not schedule in a timely fashion
• We do not want to pay for a larger Kubernetes cluster than we need
• We want the Kubernetes cluster size to scale dynamically based on build demand
• Some builds sit idle for hours while tests execute in a remote lab
• Some builds are triggered by a timer
41. © 2020 Nokia41 Public
7. Dynamically auto-scaling a Kubernetes cluster is non-trivial
42. © 2020 Nokia42 Public
7. Dynamically auto-scaling a Kubernetes cluster is non-trivial
• Lessons learned:
• Cluster size is a tradeoff between cost and happy customers
• It is difficult to pick the correct worker node to cordon
• Customers do not care if their timer-triggered builds are queued
• Long-running builds that are idle are very expensive
• Customers need an incentive to optimize their builds
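A toy version of the scale-out/scale-in decision, showing why picking the node to cordon is the hard part: only a node with no running builds is safe to drain, and a single long-idle build keeps its node pinned. All data here is illustrative, not our actual autoscaler:

```python
def scale_decision(pending_pods, nodes):
    """Decide what to do with the cluster.
    nodes: list of {'name': str, 'running_builds': int}."""
    if pending_pods > 0:
        return ("scale_out", None)          # customers are waiting: add capacity
    drainable = [n["name"] for n in nodes if n["running_builds"] == 0]
    if drainable:
        return ("cordon_and_drain", drainable[0])
    return ("no_op", None)  # every node still hosts a build, even an idle one

nodes = [{"name": "node-a", "running_builds": 3},
         {"name": "node-b", "running_builds": 0}]
print(scale_decision(0, nodes))   # ('cordon_and_drain', 'node-b')
print(scale_decision(12, nodes))  # ('scale_out', None)
```

Real cluster autoscalers add the dimensions this sketch omits, such as distinguishing timer-triggered builds (which can queue) from interactive ones, and the cost of builds that idle for hours, which is exactly why we call the problem non-trivial.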
44. © 2020 Nokia44 Public
Conclusions
• Our team is a lot happier now
• Our customers are a lot happier now
• Key lessons learned:
• Understand how the services you use work and what their service-level objectives (SLOs) are
• Always do code reviews
• Take measures to protect your data
• Monitor the AWS services you use
• Plan your network subnet design
• Plan ahead when provisioning a large amount of resources
• Kubernetes pod resource requests are critical for stability
• You must constantly monitor and optimize cost
46. © 2020 Nokia46 Public
Copyright and confidentiality
The contents of this document are proprietary and
confidential property of Nokia. This document is
provided subject to confidentiality obligations of the
applicable agreement(s).
This document is intended for use of Nokia’s
customers and collaborators only for the purpose
for which this document is submitted by Nokia. No
part of this document may be reproduced or made
available to the public or to any third party in any
form or means without the prior written permission
of Nokia. This document is to be used by properly
trained professional personnel. Any use of the
contents in this document is limited strictly to the
use(s) specifically created in the applicable
agreement(s) under which the document is
submitted. The user of this document may
voluntarily provide suggestions, comments or other
feedback to Nokia in respect of the contents of this
document ("Feedback").
Such Feedback may be used in Nokia products and
related specifications or other documentation.
Accordingly, if the user of this document gives Nokia
Feedback on the contents of this document, Nokia
may freely use, disclose, reproduce, license,
distribute and otherwise commercialize the
feedback in any Nokia product, technology, service,
specification or other documentation.
Nokia operates a policy of ongoing development.
Nokia reserves the right to make changes and
improvements to any of the products and/or
services described in this document or withdraw this
document at any time without prior notice.
The contents of this document are provided "as is".
Except as required by applicable law, no warranties
of any kind, either express or implied, including, but
not limited to, the implied warranties of
merchantability and fitness for a particular purpose,
are made in relation to the accuracy, reliability or
contents of this document. NOKIA SHALL NOT BE
RESPONSIBLE IN ANY EVENT FOR ERRORS IN THIS
DOCUMENT or for any loss of data or income or any
special, incidental, consequential, indirect or direct
damages howsoever caused, that might arise from
the use of this document or any contents of this
document.
This document and the product(s) it describes
are protected by copyright according to the
applicable laws.
Nokia is a registered trademark of Nokia
Corporation. Other product and company names
mentioned herein may be trademarks or trade
names of their respective owners.