The Rocky Cloud Road
Gert Drapers (#DataDude)
Principal Software Design Engineer
Copyright: Clouds, Trail Ridge Road, Rocky Mountain National Park (Miriam_Berlin, Oct 2009)
Disclaimer
What follows is a simplified view of some complex trends
Like any simplification, it is both correct and incorrect
It will give you a framework to work from
Driven by TCO, OPEX and CAPEX…
The Drive to the Cloud…
Utility Based Computing…
Are your Engineering Systems & Practices Ready?
Virtuous COGS cycle
Drive down hardware cost → Design for autonomy and availability → Rationalize IT pro activities → (and back to driving down hardware cost)
The Funny Thing That Happened on the Way to the Search Engine…
• Those guys built on some really big, expensive Alpha boxes.
But… search is embarrassingly parallel, so why not throw lots of cheap hardware at it?
• But then you have a serious ops problem. To fix that, you have to:
• Design software that self-assembles into large farms
… and fails fast on failure
… and re-executes / rebalances work as systems come and go
… and monitors itself effectively, so it can pull systems that don’t work
… and partitions & replicates storage so it can ride through failures
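Those “… and” requirements reduce to a very small worker pattern. Below is a minimal PowerShell sketch, assuming hypothetical dispatcher helpers (Get-NextWorkItem, Invoke-WorkItem, Complete-WorkItem) rather than any real API: on failure the process simply dies, and the farm re-issues and rebalances the work.

```powershell
# Minimal fail-fast worker sketch. Get-NextWorkItem, Invoke-WorkItem and
# Complete-WorkItem are hypothetical stand-ins for the farm's dispatcher.
function Start-Worker {
    while ($true) {
        $item = Get-NextWorkItem                 # pull work from a shared queue
        if ($null -eq $item) { Start-Sleep -Seconds 5; continue }
        try {
            Invoke-WorkItem -Item $item
            Complete-WorkItem -Item $item        # acknowledge only after success
        }
        catch {
            Write-Error "Work item failed: $_"   # surface the failure...
            exit 1                               # ...then fail fast; the queue re-issues the item
        }
    }
}
```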
“Paper Plate” Computing
• Self-assembling “paper plate” designs that presume no repair
• You don’t fix when broken, instead you dispose
• You add more when you are short on capacity
• You put them away when you do not need them now
• You dispose when you no longer need them
Improved System Autonomy
See: Above the Clouds: A Berkeley View of Cloud Computing
http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-28.pdf
The Basics
“The characteristics of a software system that
we consider non-negotiable.”
•A few key points as preface:
• Design for “simplicity”
• Design for “good enough”
• Understand the true minimum shipping point
• Long term plans will often be wrong
On Premise vs. Cloud – Basics Eye Chart
On Premise (top of the eye chart down):
Reliability, Security, API quality, Application Compatibility, Performance, Operations, Availability, Scalability
Cloud (top of the eye chart down):
Availability, Scalability, Operations, Performance, Security, Reliability, API quality, Application Compatibility
Reality check
• Some things we know don’t carry forward
• A lot of what we know is still useful
• There are tools to make all of this easier
Availability
“The ability to provide continuous service, despite partial transient failures”
• Focus on overall application availability, not one resource
• Scale horizontally across regions for durability
• Replace instead of repair; start replacement instances, don’t save dying ones
• Design to eliminate the need for maintenance windows
Source: Architecting for the Cloud: Best Practices
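As a rough illustration of “replace instead of repair”, the loop below retires unhealthy instances only after a replacement has been started. Get-FarmInstance, Test-InstanceHealth, Start-Instance and Remove-Instance are hypothetical placeholders for whatever fabric API the service uses.

```powershell
# Replace-not-repair sketch: bring the replacement up first, then dispose
# of the dying node. All four commands are illustrative placeholders.
foreach ($instance in Get-FarmInstance) {
    if (-not (Test-InstanceHealth -Instance $instance)) {
        Start-Instance -Role $instance.Role      # start the replacement first
        Remove-Instance -Instance $instance      # then dispose, don't nurse it
    }
}
```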
Scalability
• Characteristics of a Truly Scalable Service
• Increasing resources results in a proportional increase in performance
• A scalable service is capable of handling heterogeneity
• A scalable service is operationally efficient
• A scalable service is resilient
• A scalable service becomes more cost effective when it grows
A scalable architecture is critical to take advantage of a scalable infrastructure
Source: Architecting for the Cloud: Best Practices
Reliability
“The characteristics that ensure that the system
behaves deterministically”
• Meta
• Recovery-oriented computing
• Concrete
• General: standard reliability analysis remains relevant
• Deployment: never repair: restart, reboot, reinstall, replace
• Design: invariant checks, hang and timeout detection, failfast, strict
exception contracts
• Design: single “rude” shutdown path, boot-time recovery, self-verification
• Design: failure modeling, negative case testing
Source: Architecting for the Cloud: Best Practices
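Hang and timeout detection with fail-fast semantics can be sketched with nothing but built-in PowerShell job cmdlets; the 30-second deadline below is an assumed example, not a value from the deck.

```powershell
# Run an operation as a background job; treat exceeding the deadline as a
# failure to surface immediately, never something to retry silently.
function Invoke-WithTimeout {
    param([scriptblock]$Operation, [int]$TimeoutSec = 30)
    $job = Start-Job -ScriptBlock $Operation
    if (-not (Wait-Job -Job $job -Timeout $TimeoutSec)) {
        Stop-Job -Job $job
        Remove-Job -Job $job -Force
        throw "Operation exceeded ${TimeoutSec}s deadline"   # failfast
    }
    Receive-Job -Job $job -Wait -AutoRemoveJob
}
```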
Operations
“The characteristics that allow the system to be easily
deployed, configured and diagnosed”
• Meta
• Build self-assembling systems, with no individualized configuration
• Design software that self-monitors and self-heals
• Practice efficient offline diagnostics
• Concrete
• Deployment: automated provisioning, role discovery and configuration
• Design: universal configuration file for all nodes
• Design: instrument code to generate tracing, usage and health information
• Deployment: gather, aggregate, understand, use telemetry data
• Test: zero-repro engineering
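A minimal sketch of the “universal configuration file” idea, assuming an illustrative file path and JSON layout: every node reads the same file and derives its role from its own identity, so no machine carries individualized configuration.

```powershell
# Role discovery from one shared config file (path and schema assumed,
# e.g. {"roles":[{"name":"web","nodes":["NODE1","NODE2"]}]}).
$config = Get-Content -Raw -Path '\\deploy\config\service.json' | ConvertFrom-Json
$me     = $env:COMPUTERNAME
$role   = ($config.roles | Where-Object { $_.nodes -contains $me }).name
Write-Output "Node $me self-configured as role '$role'"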
Engineering Processes
“The rules we create to build software systems
that embody our basics”
Live the Dream
Service Isolation
•Public Service Contract
• Versioned
• Loosely coupled, no type sharing
•Services do not share persisted state with other services
•Services are:
• Developed independently
• Deployed independently
Branching Structure
• $/base/main
• Base branch for all service branches
• A new service branch always starts by branching from /base/main/*
• Base contains only common tools, code, scripts and externals
• $/common/main
• Branch for shared binaries, which are distributed as NuGet packages via the internal NuGet gallery
• $/<svc>/*
• Every service resides in its own source branch, to promote service isolation
• Each service can be deployed individually
• A service consists minimally of two branches
• $/<svc>/main
– Working branch; main must always be in a building and deployable state
– Used to deploy to the non-prod environment
• $/<svc>/prod
– Reflects the state deployed to the production environment
• Additional branches are allowed, but must always parent from /<svc>/main and may not be used to deploy to prod
(Diagram: branch hierarchy with $/base/main as the root; $/common/main merges to $/common/prod, and each service pairs $/svc<n>/main with $/svc<n>/prod)
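For illustration only, creating such a main/prod branch pair in TFVC could look like the commands below; 'svcX' is a placeholder service name.

```powershell
# Hypothetical example: carve a new service branch from base, then create
# its prod branch. tf branch pends the change; tf checkin commits it.
tf branch '$/base/main' '$/svcX/main'
tf checkin /comment:'Create svcX service branch from base'
tf branch '$/svcX/main' '$/svcX/prod'
tf checkin /comment:'Create svcX prod branch'
```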
Builds
• No daily builds
• All services live in their own branch and deploy at their own cadence, so there is no place for daily builds
• Only on-demand builds, triggered by check-in or queue requests
• GC (Gated Check-in) builds
• Code flows into the branch via a gated check-in system
• A mandatory code review policy applies to all code that flows into or changes within the branch
• GC builds are NOT retained and are NOT allowed to be used for deployments, only for validation (service overrides, non-prod PPE validation, etc.)
• GS (Golden Share) builds
• Code flows into these branches via a “merge” from the parent branch
• Running the GC test suites is optional
• GS builds are intended to be deployed
• GS builds are automatically retained, based on deployment history
• The last N-x deployed builds are automatically retained for rollback purposes
• Builds that were never deployed between the current build and N-1 are automatically removed, as are builds older than N-x
• Optional automatic deployment from a GS build to the non-prod-ppe and prod-ppe environments to ease the …
Environments
• non-prod
• Core integration environment, but with an SLA!
• prod
• Production environment
• PPE (Pre-Production Environment) used for:
• Deployment validation of the services and watchdogs
• Synthetic functional validation of the services and watchdogs
• Mandatory rollback testing
• Each environment (non-prod and prod) has its own PPE environment to perform these tasks in isolation
• General deployment flow:
1. GC build → ppe.non.prod (if successful, go to #2)
2. GS build → non.prod (if successful, go to #3)
3. GS PROD build → ppe.prod (if successful, go to #4)
4. GS PROD build → prod
• Hot Fixing
• Hotfixes can be created in the Prod branch and ported back to Main
• This is why there is a GS and a GC build for each branch: it enables running the gated check-in suites in every environment
Sharing binaries using the Internal NuGet Gallery
• Consuming projects bind to an explicit version of a package
• The NuGet package expresses its dependencies, which are automatically included
• At build time, referenced packages and their dependencies are automatically downloaded
• Advantages:
• Explicit versioning; fewer breakages due to dependency changes
• Implicit dependency management; reduced breakage due to missing dependencies
• Developers and build systems use the same versions and dependencies
• Package references are managed per project
• The build system only needs to download once
• Use of the internal NuGet gallery improves sharing due to increased discoverability
• No need to check in binaries, which keeps the source tree clean and slim!
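As a sketch of what explicit version binding looks like at restore time, the nuget.exe call below pins one package version from an internal feed; 'CompX' and the feed URL are placeholders, not the deck's actual gallery.

```powershell
# Pin an explicit package version and restore it at build time.
# Package id and feed URL are illustrative placeholders.
nuget.exe install CompX -Version 1.2.3 `
    -Source 'https://nuget.corp.example/api/v2' `
    -OutputDirectory packages
```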
The Engineering Flow – Shared Binaries
(Diagram: a gated check-in to $/common/main triggers a build whose output lands on the GC deployment drop share, e.g. $/common/main/compX; a merge of common/main => common/prod triggers a second build whose output lands on the GS deployment drop share, e.g. $/common/prod/compX, and is automatically published to the NuGet Gallery)
The Engineering Flow – Services
(Diagram: a gated check-in to $/<svc>/main triggers a build; the deployment trigger branch plus the Deployment Manifest drive an automated deployment from the deployment drop share, via Machine Functions, onto nodes #1..#M of Scale Units <1..N> in the non-prod environment. A merge of svc/main => svc/prod repeats the same flow into the prod environment; both builds pull shared packages from the NuGet Gallery)
Deployments
•DevOps model:
• All engineers can deploy all services
• Forces sharing of knowledge and skills
• Required to support on-call model
•Published Deployment Guidelines
• Checklist of steps for deployment and validation of each service
• Automated KPIs for monitoring the health of the service
• Documents service dependencies, both upstream and downstream
Service Validation
•Monitoring
• Real-time and historical analysis
•Alerting
• Must be actionable
•Validation
• Everybody can run them!
Testing using PowerShell
•Everybody should be able to run tests
•Re-usable atoms
•Composition of atoms
•Target all environments
•Outside-In testing vs. Inside-In Testing
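A toy example of the atom/composition idea: small re-usable atoms that anyone can run, composed into an outside-in smoke test. The function names and endpoints are illustrative, not the deck's actual suite.

```powershell
function Test-Endpoint {                 # atom: one HTTP probe
    param([string]$Url)
    (Invoke-WebRequest -Uri $Url -UseBasicParsing).StatusCode -eq 200
}
function Test-ServiceHealth {            # composition: outside-in smoke test
    param([string]$BaseUrl)
    (Test-Endpoint "$BaseUrl/ping") -and (Test-Endpoint "$BaseUrl/version")
}
# Anyone can point the same composition at any environment:
Test-ServiceHealth -BaseUrl 'https://svc-a.nonprod.example.com'
```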
Point Developer / Pager Duty
•Rotation based (4 weeks, 4 people)
• Separate interrupt-driven from schedule-driven work
• Provides focus
•Pager Duty
• Automatic escalation
• Complete management chain is involved in incidents
•RCA (Root Cause Analysis)
• You must be pedantic about RCAs and action them!
Availability is King
Versioning & Deployment Ordering
•The service must support running multiple
versions side-by-side!
• Required during deployment, service overrides, A-B testing,…
•Deploy stateful services before stateless services
• Service must be able to support schema versions N, N-1 and
N+1
Data Layer
•Evolves to a document/resource centric model
• Schema owned by middle tier services
• Chunky, cacheable, partitionable
•Schema changes:
• Owned by service layer
• By default: fault-in model; a document is updated to the new version when written, and optionally a write is triggered by reading an older version. This amortizes the cost of a schema update over time.
• Optionally trigger update using a crawler process
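A hedged sketch of the fault-in model: a document found at an older schema version is upgraded, and optionally persisted, on read. Upgrade-Document and Save-Document are hypothetical per-service helpers, and the version number is made up.

```powershell
$CurrentVersion = 3                              # assumed current schema version
function Read-Document {
    param($doc)
    if ($doc.schemaVersion -lt $CurrentVersion) {
        $doc = Upgrade-Document -Document $doc   # convert to schema version N
        Save-Document -Document $doc             # optional write triggered by the read
    }
    $doc
}
```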
Best Practices
•Design for Failure
•Loose Coupling
•Implement Elasticity
•Think Asynchronous and Parallel
Design for Failure
• Avoid single points of failure
• Assume everything fails, and design backwards
• Goal: Applications should continue to function even if the underlying
physical hardware fails or is removed or replaced.
• Best practices
• Use multiple regions
• Use Virtual IP addresses (VIP)
• Use Load Balancers
• Real-time monitoring
• Leverage Auto Scaling groups
• Practice failures/recovery
Always Assume Each Call is your Last Call!
Loose Coupling
•Independent components
•Design everything as a Black Box
•De-coupling for Hybrid models
•Load-balance clusters
The looser the coupling, the higher the scale factor
Implement Elasticity
•Use designs that are resilient to reboot and re-launch
•Enable dynamic configuration
•Self-discovery and join: an instance discovers its own role
Horizontal Scaling is the Only Option
Think Asynchronous and Parallel
• Only make non-blocking, async cross-service calls!
• Use load balancing to distribute load across multiple servers
• Decompose tasks into their simplest form
• Multi-threading and concurrent requests to cloud services
• Leverage parallel MapReduce tasks when appropriate and possible
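For instance, fanning calls out with background jobs instead of making them serially might look like this; the target URLs are illustrative.

```powershell
# Fan-out sketch: each request runs as a concurrent background job,
# then results are gathered once all jobs complete.
$targets = 'https://svc1.example/api/status', 'https://svc2.example/api/status'
$jobs = foreach ($url in $targets) {
    Start-Job -ArgumentList $url -ScriptBlock {
        param($u)
        Invoke-WebRequest -Uri $u -UseBasicParsing   # runs concurrently
    }
}
$results = $jobs | Wait-Job | Receive-Job            # gather when done
```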
Conclusion
•http://en.wikipedia.org/wiki/KISS_principle
• List of software development philosophies
• Minimalism (computing)
• Reduced instruction set computing
• Worse is better (Less is more)
• Don't repeat yourself (DRY)
• You aren't gonna need it (YAGNI)
• Rule of Least Power
Live by the KISS Principle!
Source: http://chromblog.thermoscientific.com/blog/bid/85450/GC-MS-MS-Software-Applies-the-KISS-Principle
Resources
• Cloud Design Patterns: Prescriptive Architecture Guidance for Cloud Applications
• http://msdn.microsoft.com/en-us/library/dn568099.aspx
• Private Cloud Principles, Concepts, and Patterns
• http://social.technet.microsoft.com/wiki/contents/articles/4346.private-cloud-principles-concepts-and-patterns.aspx
• Cloud Services Foundation Reference Architecture – Principles, Concepts, and Patterns
• http://blogs.technet.com/b/cloudsolutions/archive/2013/08/15/cloud-services-foundation-reference-architecture-principles-concepts-and-patterns.aspx
Let us know what you think of this session! Fill in the evaluation via www.techdaysapp.nl for a chance to win one of the 20 prizes*. Winners will be announced via Twitter (#TechDaysNL). Use the personal code on your badge.
* No correspondence will be entered into about the results; prizes are examples.
Editor's Notes
• #4: Economics: technology changes are transforming operations efficiency, supporting workload amortization for large hosting companies. Changing relationships: purchasing patterns are changing, friction is no longer tolerated, and vendors are responsible for much more of the software lifecycle. Cadence: execution cadence greatly increased due to delivery mechanisms.
• #34: http://chromblog.thermoscientific.com/blog/bid/85450/GC-MS-MS-Software-Applies-the-KISS-Principle and http://www.anwarbosbool.com/2012/06/kiss-principle/