This document discusses how DigitalGlobe validates its software delivery pipeline with a canary microservice that is deployed through the entire pipeline to automatically test that the pipeline is functioning properly. It also provides statistics on the availability of DigitalGlobe's continuous delivery pipeline and the success rate of its canary deployments, which help the team monitor and improve the pipeline over time.
8. Impactful Situational Analysis
Top image shows two slave labor fishing boats tied to Silver Sea 2, a roughly 2,300-ton refrigerated cargo ship, with its cargo hold open to receive the slave-caught seafood. Bottom image shows the analysis of the same photo.
http://eplore.digitalglobe.com/see-freedom
Combating Human Trafficking & Slavery
10. The Good
• Many mature processes and tools already exist
• Talented engineers & developers
• Engineers are allowed to pick the best solution or tool for the job
• Executive management support
• WV-4 launch
11. The Bad
• Multi-geographical development locations
• Over 70 Agile teams
• Separate release streams
• Complex Mission Control Systems
• Over 300 applications
• Disparate environments make it hard to test
12. The Ugly
• Monolithic systems, manually maintained
• Multi-module builds with cross-dependencies
• Long release cycles
• Long deploy outages
• Siloed teams – knowledge gaps
13. Why DevOps?
• Customer demand for quicker enhancements and fixes
• Reduce cost by changing architecture to microservices
• Easier to add new functionality (low impact)
• Standardize the platform
• Better release automation (XL Release)
14. Pipeline as a Service
• The pipeline infrastructure is built and maintained with IaC
• Support hybrid cloud infrastructure
  • AWS + Cloud Foundry + OpenStack
• Have a pipeline for the pipeline
• Provide self-service onboarding – enable developers
The Pipeline should be Fast, Secure, Reliable & Available!
18. Why XL Release?
XL Release
• Orchestration layer
• Hides the complexity
• Release templates are flexible
• Release overview
• Good reporting
Jenkins
• Works well for DIY build automation
• Difficult to manage jobs & config
• Difficult to navigate folders and jobs
• Lots of plugins to manage
19. Pipeline Tech Stack
[Diagram: pipeline stages Dev, Build, Integrate, Test, Release, Deploy, Operate, with layers for Operations and Dev / Test dashboards, Release Orchestration, and Infrastructure.]
20. XL Release - Orchestration Layer
• Delivers customer-facing applications to production
• “FedEx – We deliver!”
• Multiple customers with unique needs
• Workflow for our IT processes
• Refreshing pipeline infrastructure: “Get Well – Stay Well”
• Get the workflow right, then automate it
21. How do we know it’s working?
• ELK Stack dashboards
  • Require constant monitoring & alerting
• User support via phone, email, chat, tickets
  • Also requires monitoring & alerting
• Canary microservice
  • Automatically runs and alerts on failures
22. Let your Canary Sing!
• Microservice that touches the entire tech stack
• Canary release validates:
  • Pipeline release template (workflow)
  • Tool-to-tool communications
  • Operational platform
• Production instance triggers a new release, restarting the workflow
24. Canary Enhancements
• Additional programming language support
• Better integration with issue tracking & notification systems
• More trend analysis
• Support new tools and platforms
• Negative testing
26. Pipeline Availability Report
[Chart: % Successful (pipeline availability) per month, Dec '16 to Apr '17. Values as labeled: 98.990%, 96.629%, 99.983%, 99.933%, 99.167%.]
• CI/CD Pipeline Availability – April 2017
  • Degradation
    • Unplanned: none
  • Outage
    • Unplanned: ~6 hrs – Artifactory crash. Artifactory stopped at midnight due to disk space issues. Customer impact was ~20 min (the first job was at 6 am); counting only the customer impact would make availability 99.954%.
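The 99.954% figure follows from treating availability as uptime divided by the total time in the month. A minimal sketch, assuming a 30-day April measured in minutes and this simple formula (the deck does not spell out its exact calculation, although the same arithmetic also gives the 99.167% on the chart when the full 6-hour outage is counted); `availability_pct` is a hypothetical helper name:

```python
# Availability = 100 * (1 - downtime / length of the reporting period).
# Assumption: April 2017 is treated as a 30-day period, measured in minutes.

MINUTES_PER_DAY = 24 * 60


def availability_pct(downtime_minutes: float, period_days: int) -> float:
    """Return availability for the period as a percentage."""
    period_minutes = period_days * MINUTES_PER_DAY
    if not 0 <= downtime_minutes <= period_minutes:
        raise ValueError("downtime must be between 0 and the period length")
    return 100.0 * (1.0 - downtime_minutes / period_minutes)


if __name__ == "__main__":
    # Full ~6-hour Artifactory outage (midnight to 6 am).
    print(f"{availability_pct(6 * 60, period_days=30):.3f}%")  # 99.167%
    # Only the ~20 minutes that actually affected customer jobs.
    print(f"{availability_pct(20, period_days=30):.3f}%")      # 99.954%
```

The difference between the two numbers is purely which downtime is counted: time the service was down, or time a customer job was affected.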
27. Canary Availability Reports
• Canary testing
  • We lost a number of canaries during the Artifactory disk issue, which caused a race condition in the resubmission of new canaries
• Manual processes
  • % of time spent waiting for somebody to push a button (Prod Gate), relative to the total time for a release to reach production
[Chart: % of successful Canaries per month, Nov '16 to Apr '17. Values as labeled: 76%, 79%, 85%, 89%, 97%, 96%.]
[Chart: % of time releases wait at manual gates, Nov '16 to Apr '17. Values as labeled: 94.8%, 94.9%, 96.6%, 95.3%, 93.2%, 91.8%.]
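Both metrics on this slide are simple ratios. A minimal sketch, assuming hypothetical `CanaryRun` and `Release` records (the deck does not say where this data lives or how XL Release aggregates it): canary success is successful runs over all runs in the month, and the manual-gate figure is time spent waiting at gates such as the Prod Gate divided by total time to reach production.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CanaryRun:
    """Outcome of one canary release (hypothetical record)."""
    succeeded: bool


@dataclass
class Release:
    """Timing of one release that reached production (hypothetical record)."""
    total_hours: float      # time from start of the release to production
    gate_wait_hours: float  # time spent waiting for someone to approve a manual gate


def canary_success_pct(runs: List[CanaryRun]) -> float:
    """Percentage of the month's canary runs that succeeded."""
    if not runs:
        raise ValueError("no canary runs in this period")
    return 100.0 * sum(run.succeeded for run in runs) / len(runs)


def manual_gate_wait_pct(releases: List[Release]) -> float:
    """Waiting time at manual gates as a percentage of total time to production."""
    total = sum(r.total_hours for r in releases)
    if total <= 0:
        raise ValueError("no release time recorded in this period")
    return 100.0 * sum(r.gate_wait_hours for r in releases) / total


if __name__ == "__main__":
    # Hypothetical month: 24 good canaries and 1 lost to an infrastructure issue.
    runs = [CanaryRun(True)] * 24 + [CanaryRun(False)]
    # Hypothetical releases: most of each release's time is spent at the Prod Gate.
    releases = [Release(total_hours=72.0, gate_wait_hours=66.0),
                Release(total_hours=48.0, gate_wait_hours=45.0)]
    print(canary_success_pct(runs))        # 96.0
    print(manual_gate_wait_pct(releases))  # 92.5
```

Summing hours across releases before dividing (rather than averaging per-release percentages) weights long releases more heavily; that is a choice made for this sketch, not something the deck specifies.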
28. Pipeline Volume in XL Release
• Average Release Duration
  • How long does a single release take to get through the pipeline?
• Automation Percentage
  • Percentage of automated tasks in completed releases during the selected time period.
• Releases per month
  • Number of releases completed per month.
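XL Release computes these values in its own reports; the sketch below only illustrates the arithmetic behind the three definitions and is not XL Release's API. It assumes a hypothetical `CompletedRelease` record holding start and completion times plus counts of automated and manual tasks. With 1 manual task in every 20, automation comes out at 95%, matching the red line described in the notes.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List


@dataclass
class CompletedRelease:
    """Hypothetical summary of one release that finished in the period."""
    started: datetime
    completed: datetime
    automated_tasks: int
    manual_tasks: int


def average_duration_hours(releases: List[CompletedRelease]) -> float:
    """Average Release Duration: mean time to get through the pipeline."""
    if not releases:
        raise ValueError("no completed releases")
    hours = [(r.completed - r.started).total_seconds() / 3600 for r in releases]
    return sum(hours) / len(hours)


def automation_pct(releases: List[CompletedRelease]) -> float:
    """Automation Percentage: automated tasks over all tasks in completed releases."""
    automated = sum(r.automated_tasks for r in releases)
    total = automated + sum(r.manual_tasks for r in releases)
    if total == 0:
        raise ValueError("no tasks recorded")
    return 100.0 * automated / total


def releases_per_month(releases: List[CompletedRelease]) -> Dict[str, int]:
    """Releases per month, keyed by the month each release completed."""
    return dict(Counter(r.completed.strftime("%Y-%m") for r in releases))


if __name__ == "__main__":
    # Hypothetical sample: 1 manual task in every 20.
    sample = [
        CompletedRelease(datetime(2017, 4, 3, 9), datetime(2017, 4, 6, 9), 19, 1),
        CompletedRelease(datetime(2017, 4, 10, 8), datetime(2017, 4, 12, 8), 38, 2),
    ]
    print(average_duration_hours(sample))  # 60.0
    print(automation_pct(sample))          # 95.0
    print(releases_per_month(sample))      # {'2017-04': 2}
```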
29. Future: Service Maturity Dashboard
Threat Level (Probability vs. Impact)
Probability \ Impact | Low | Medium | High
High | Medium | High | Critical
Medium | Low | Medium | High
Low | Low | Low | Medium
Service | Mission Control Operators | Control of Satellites | Product Ordering | Product Production | Bare Metal Service | Interdependencies | P800 Dependent Infrastructure | Feature Toggles | HA | Risk Score
Service 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0
Service 2 | 2 | 3 | 1 | 2 | 3 | 2 | 3 | 0 | 5 | 3.5
Service 3 | 4 | 5 | 3 | 4 | 5 | 4 | 5 | 5 | 0 | 5.8
Service 4 | 6 | 7 | 5 | 6 | 7 | 6 | 7 | 0 | 5 | 8.2
Service 5 | 8 | 9 | 7 | 8 | 9 | 8 | 9 | 0 | 0 | 9.7
Column groups: Probability, Impact, Mitigation
The Pipeline will gather statistics to drive a Release Score
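The threat-level matrix is a lookup from probability and impact. A minimal sketch of that lookup, using the matrix as laid out on the slide (the Medium row and column labels are inferred from the 3x3 layout), plus a hypothetical auto-deploy rule: the notes say release scoring may allow fully automated deployments depending on maturity and risk, but give no threshold, so `AUTO_DEPLOY_LEVELS` is an assumption. The Risk Score formula is not shown in the deck, so it is not modeled here.

```python
from enum import Enum


class Level(Enum):
    LOW = "Low"
    MEDIUM = "Medium"
    HIGH = "High"


# Threat level by probability (outer key) and impact (inner key), as laid out on the slide.
THREAT_LEVEL = {
    Level.HIGH:   {Level.LOW: "Medium", Level.MEDIUM: "High",   Level.HIGH: "Critical"},
    Level.MEDIUM: {Level.LOW: "Low",    Level.MEDIUM: "Medium", Level.HIGH: "High"},
    Level.LOW:    {Level.LOW: "Low",    Level.MEDIUM: "Low",    Level.HIGH: "Medium"},
}

# Hypothetical gate policy: only Low-threat releases skip the manual Prod Gate.
AUTO_DEPLOY_LEVELS = {"Low"}


def threat_level(probability: Level, impact: Level) -> str:
    """Look up the threat level for a probability/impact pair."""
    return THREAT_LEVEL[probability][impact]


def can_auto_deploy(probability: Level, impact: Level) -> bool:
    """True if the release may go to production without a manual gate."""
    return threat_level(probability, impact) in AUTO_DEPLOY_LEVELS


if __name__ == "__main__":
    print(threat_level(Level.HIGH, Level.HIGH))      # Critical
    print(can_auto_deploy(Level.MEDIUM, Level.LOW))  # True
    print(can_auto_deploy(Level.LOW, Level.HIGH))    # False
```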
30. Final Notes
Do – Pipeline as a Service
• Iterative development process
• IaC
• MVP
• Pipeline for the Pipeline
• Self-service onboarding
Don’t
• Abandon DevOps principles
• Over-design – “KISS”
• Manual tasks & gates
Image of Narita Airport, Japan (30 cm imagery).
Editor's Notes
We build, launch, and fly the world’s most sophisticated commercial satellite constellation.
Together, WorldView-1, GeoEye-1, WorldView-2, WorldView-3 and WorldView-4 are capable of collecting well over one billion square kilometers of quality imagery per year and offering intraday revisits around the globe.
Our advanced accuracy technology ensures content is as closely aligned as possible to a known lat/long coordinate on the surface of the earth.
WorldView-4 joins WorldView-3 as the only commercial satellites that collect the world’s highest-resolution satellite imagery, at 30 cm.
We also provide imagery analysis, in this case supporting humanitarian efforts to stop human trafficking, slave labor, and illegal fishing. The images were captured over Papua New Guinea from 380 miles up, at 17,000 mph. Analysts spotted a ship matching a suspected slave ship, with open cargo holds, apparently offloading fish.
With the launch of our WorldView-4 satellite, we have the opportunity to update our software and technical stack to meet future demands. When starting a new initiative like bringing DevOps to DigitalGlobe, we need to understand all the facts. I like to know three things:
The Good: everything that works well and that we can build on. Our current production operations environment is sustainable. We have a talented staff, and engineers are typically allowed to pick the best tool for the job. Finally, we have a satellite launch that we can use to pay for new development.
The Bad: what makes it difficult. We have over 1,300 employees worldwide, with developers in the US, Costa Rica, Italy, and India. There are over 70 Agile teams, each working at its own cadence, and separate release streams for each business unit. We have very complex mission control systems and hundreds of applications. Also, disparate environments make it hard to do true testing early on.
The Ugly: monolithic systems and multi-module builds. Release cycles take 3-6 months to get enhancements into production. The long production deploy process takes 12 hours and involves 15 people, making deploys expensive and difficult to schedule with customers. Finally, separate teams doing code promotion and integration created knowledge gaps and made deployments difficult.
Why did we choose DevOps? Customers were demanding faster turnaround on fixes and enhancements to the system. Implementing a microservices architecture helps reduce the cost of making changes to our software: smaller change sets and a shorter, more iterative release cycle. A DevOps methodology would make it easier to add new satellites and products without affecting current functionality. Using Infrastructure as Code would allow us to standardize our platform and environments. With XL Release, we can target 100% release automation.
We came up with the idea of a CI/CD pipeline as a service, with the following requirements. The pipeline infrastructure will be built and maintained with IaC (Infrastructure as Code). It will be cloud agnostic and support deployment to a hybrid cloud infrastructure, including AWS, Cloud Foundry, and OpenStack. For ease of development, we will have a pipeline for the pipeline. To enable developers and reduce manual processes, we will provide self-service onboarding of microservices and dependent libraries. This also saves developers from spending lots of time maintaining development environments.
DigitalGlobe operationally spans internal and external cloud infrastructure. With the goal of moving our computing to AWS, DG and AWS pioneered large data transfer using AWS Snowmobile, a 45-ft semi-trailer truck that carries 100 PB of data in one shot to any AWS data center.
The pipeline for the pipeline is a way for us to use our own DevOps process to maintain the CI/CD pipeline. Enhancements and fixes are made to the dev pipeline, where they are validated and then promoted to the production pipeline. We use an XL Release template to orchestrate this, just as we would for a microservice in the pipeline.
To allow developers to self-serve with the pipeline, we have created XL Release onboarding templates. A developer enters a few fields of identifying information into the template; running that template then creates the webhooks, HashiCorp Vault secrets, and release triggers needed to automatically start a release in the pipeline when GitHub is updated.
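A minimal sketch of what an onboarding template could assemble from a developer's few fields, assuming GitHub's REST endpoint for creating a repository webhook (`POST /repos/{owner}/{repo}/hooks`) and Vault's KV version 2 write path (`POST /v1/<mount>/data/<path>`). The form field names, the `secret` mount, the `pipeline/<repo>` path, and the trigger URL are hypothetical, and the XL Release release-trigger step is left out because its API is not described here. The functions only build the requests; they do not send them, and authentication headers are omitted.

```python
import secrets
from typing import Dict, List


def github_webhook_request(owner: str, repo: str, target_url: str, webhook_secret: str) -> Dict:
    """Build (but do not send) GitHub's 'create a repository webhook' call."""
    return {
        "method": "POST",
        "url": f"https://api.github.com/repos/{owner}/{repo}/hooks",
        "body": {
            "name": "web",
            "active": True,
            "events": ["push"],  # start a release whenever code is pushed
            "config": {"url": target_url, "content_type": "json", "secret": webhook_secret},
        },
    }


def vault_kv2_write_request(vault_addr: str, mount: str, path: str, data: Dict[str, str]) -> Dict:
    """Build (but do not send) a Vault KV v2 write to <mount>/data/<path>."""
    return {
        "method": "POST",
        "url": f"{vault_addr}/v1/{mount}/data/{path}",
        "body": {"data": data},
    }


def onboarding_requests(form: Dict[str, str]) -> List[Dict]:
    """Expand the developer's form fields into setup calls (auth headers omitted)."""
    webhook_secret = secrets.token_hex(20)  # shared by GitHub and the pipeline
    return [
        # Hypothetical mount and path layout for pipeline secrets.
        vault_kv2_write_request(form["vault_addr"], "secret", f"pipeline/{form['repo']}",
                                {"github_webhook_secret": webhook_secret}),
        # Hypothetical trigger URL: wherever the pipeline listens for pushes.
        github_webhook_request(form["owner"], form["repo"], form["trigger_url"], webhook_secret),
    ]


if __name__ == "__main__":
    form = {  # hypothetical values a developer would type into the template
        "owner": "example-org",
        "repo": "canary-service",
        "trigger_url": "https://xlr.example.com/hypothetical-trigger",
        "vault_addr": "https://vault.example.com:8200",
    }
    for req in onboarding_requests(form):
        print(req["method"], req["url"])
```

Generating the webhook secret once and writing it to both Vault and GitHub keeps the two sides in sync without the developer ever handling the secret.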
Jenkins can manage your CI/CD pipeline, but there are some things Jenkins does well and some it doesn’t.
Jenkins works well as a Do It Yourself build automation tool.
Jobs and configuration can get complex and difficult to manage.
Navigation of folders and jobs can also be difficult for all but engineers.
It requires lots of plugins to configure and maintain to get good release reporting.
XL Release, as an orchestration layer, allows you to use Jenkins, XL Deploy, or other tools to complete releases.
It hides the complexity of multiple Jenkins servers, folders, jobs, parameters, etc.
Release templates and the releases they create are flexible and customizable, even when a release is in progress.
The XL Release interface is simple to understand and navigate, so it is easy to get the status of everything in progress.
Good reporting on the status of releases in progress, as well as historical data.
This is our DevOps CI/CD pipeline technology stack. *Call out Infrastructure. *Call out some of the technology used. XL Release is our orchestration layer, performing release automation for microservices, libraries, and IT infrastructure processes. At the top of the stack we have monitoring and reporting dashboards; we use XL Release to display release metrics and KPIs.
We use release orchestration with XL Release to deliver customer-facing solutions to production. On the pipeline team, we like to quote the old FedEx mantra, “We deliver!” Meaning, if you commit your code to GitHub, we deliver it to production. Through release templates and plugins, XL Release gives us the flexibility to support multiple agile teams writing code in many software languages and development tools. In addition, we use XL Release templates as a workflow for our IT processes, such as server upgrades, supporting our “Get Well – Stay Well” security initiative. Even if the tasks in a template are manual at first, we get the process into the template, then iterate and automate as we go forward.
During development and operation of the CI/CD pipeline, how do we know it’s working? We have ELK Stack dashboards and user support, both of which require constant monitoring and alerting. We soon concluded that we needed to implement the concept of a canary release, as in “a canary in a coal mine.” Just as a miner would know there was a problem if the canary stopped singing, we would know if any of our pipeline infrastructure was down or unresponsive.
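The canary idea can be sketched as a walk through the pipeline stages that alerts on the first stage that fails or errors, and starts the next canary release only after a clean pass to production. A minimal sketch, assuming each stage is exposed as a hypothetical zero-argument health check returning True or False, and that the loop stops after a failure until someone fixes the pipeline (the deck does not say what happens after a failed canary). The broken-stage case in the usage doubles as the negative test listed under Canary Enhancements: the canary should fail when it should.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# A stage is a name plus a health check; these checks are hypothetical stand-ins
# for "build the canary", "deploy it with XL Deploy", "query it in production", etc.
Stage = Tuple[str, Callable[[], bool]]


@dataclass
class CanaryResult:
    passed: bool
    failed_stage: Optional[str] = None
    completed_stages: List[str] = field(default_factory=list)


def run_canary(stages: List[Stage], alert: Callable[[str], None]) -> CanaryResult:
    """Push one canary release through every stage, alerting on the first failure."""
    completed: List[str] = []
    for name, check in stages:
        try:
            ok = check()
        except Exception:
            ok = False  # a tool that errors out counts as a canary that stopped singing
        if not ok:
            alert(f"Canary failed at stage '{name}'")
            return CanaryResult(passed=False, failed_stage=name, completed_stages=completed)
        completed.append(name)
    return CanaryResult(passed=True, completed_stages=completed)


def canary_loop(stages: List[Stage], alert: Callable[[str], None], cycles: int) -> List[CanaryResult]:
    """Reaching production triggers the next canary release, up to `cycles` runs.

    Stopping after the first failure is an assumption made for this sketch.
    """
    results: List[CanaryResult] = []
    for _ in range(cycles):
        result = run_canary(stages, alert)
        results.append(result)
        if not result.passed:
            break
    return results


if __name__ == "__main__":
    alerts: List[str] = []
    healthy = [(name, lambda: True) for name in ("build", "integrate", "test", "release", "deploy")]
    print(canary_loop(healthy, alerts.append, cycles=3)[-1].passed)  # True
    # Negative test: a broken release stage must make the canary fail.
    broken = healthy[:3] + [("release", lambda: False)] + healthy[4:]
    print(run_canary(broken, alerts.append).failed_stage)  # release
    print(alerts)  # ["Canary failed at stage 'release'"]
```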
New programming language support: test new languages and versions before developers need them. Better integration with issue tracking & notification systems: Elastic Watcher, PagerDuty, etc. Enhanced release reporting: identify and report trends. Add integrations and support for new tools and platforms. Negative testing: does the canary fail when it should?
We support 5 software languages.
The average timeline for a production deploy is 3 days.
We do approximately 500 production deploys per month.
XL Release gives us the ability to show that the pipeline is “Fast, Secure, Reliable & Available.” Here we show the availability of the pipeline based on reported outages.
The top graph shows the success rate of the canaries.
The bottom graph shows how long releases wait at a manual gate (which we hope to get rid of).
This shows average release time and releases to production each month.
The red line shows that 1 out of 20 tasks is manual; the other 95% run with no manual wait.
I also wanted to give you a glimpse of what we are working on now, the Service Maturity Dashboard. This is a microservice that integrates with XL Release to perform release scoring. Release scoring can be used to allow fully automated (100%) deployments to production, depending on the maturity and risk of the release in the pipeline.
Do’s: Use an agile, iterative development process and Infrastructure as Code. Create a minimum viable product (MVP), demo it, and expand from there.
Create a pipeline for the pipeline so that you can isolate end-users from your development churn.
Create your pipeline with self-service onboarding for developers. This will give you a more consistent experience for end users and free you up to do more important things.
Don’ts
Don’t abandon your DevOps and Agile principles when the pressure is on.
Don’t overdesign it. Keep It Simple: start with simple release templates (workflows) and expand from there.
Don’t use any more manual tasks and gates than you need. The goal is to have everything flow without manual intervention: someone commits a good change to source control, and that results in a release to production.