The Rocky Cloud Road
Moving to the cloud isn’t easy; transforming your engineering team to adapt to the cloud and services lifestyle is therefore crucial. It all starts with creating a common understanding of the engineering and development principles that are important in the cloud, which are different from those for building regular applications. This session will take you on a road trip based on the presenter’s experience developing and, more importantly, operating Azure Active Directory, SQL Server Azure and most recently the Xbox Live services supporting Xbox One.

Published in Engineering, Technology
  • Speaker notes: Economics: technology changes are transforming operations efficiency; supporting workload amortization for large hosting companies. Changing relationships: purchasing patterns are changing, friction is no longer tolerated; vendors are responsible for much more of the software lifecycle. Cadence: execution cadence greatly increased due to delivery mechanisms.


  • 1. The Rocky Cloud Road Gert Drapers (#DataDude) Principal Software Design Engineer Copyright: Clouds, Trail Ridge Road, Rocky Mountains National Park (Miriam_Berlin, Oct 2009)
  • 2. Disclaimer What follows is a simplified view of some complex trends. Like any simplification, it is both correct and incorrect. It will give you a framework to work from
  • 3. Driven by TCO, OPEX and CAPEX… The Drive to the Cloud… Utility Based Computing… Are your Engineering Systems & Practices Ready?
  • 4. Virtuous COGS cycle Drive Down Hardware Cost Design for Autonomy, Availability Rationalize IT Pro Activities
  • 5. The Funny Thing That Happened on the Way to the Search Engine… • Those guys built on some really big expensive Alpha boxes. But… search is embarrassingly parallel, so why not throw lots of cheap hardware at it? • But then you have a serious ops problem. To fix that, you have to: • Design software that self assembles into large farms … and fails fast on failure … and re-executes / rebalances work as systems come and go … and monitors itself effectively, so it can pull systems that don’t work … and partitions & replicates storage so it can ride through failures
  • 6. “Paper Plate” Computing • Self-assembling “paper plate” designs that presume no repair • You don’t fix when broken, instead you dispose • You add more when you are short on capacity • You put them away when you do not need them now • You dispose when you no longer need them Improved System Autonomy See: Above the Clouds: A Berkeley View of Cloud Computing
  • 7. The Basics “The characteristics of a software system that we consider non-negotiable.” •A few key points as preface: • Design for “simplicity” • Design for “good enough” • Understand the true minimum shipping point • Long term plans will often be wrong
  • 8. On Premise vs. Cloud – Basics Eye Chart • On Premise: Reliability, Security, API quality, Application Compatibility, Performance, Operations, Availability, Scalability • Cloud: Availability, Scalability, Operations, Performance, Security, Reliability, API quality, Application Compatibility
  • 9. Reality check • Some things we know don’t carry forward • A lot of what we know is still useful • There are tools to make all of this easier
  • 10. Availability “The ability to provide continuous service, despite partial transient failures” • Focus on overall application availability, not one resource • Scale horizontally across regions for durability • Replace instead of repair; start replacement instances, don’t save dying ones • Design for eliminating the need for maintenance windows Source: Architecting for the Cloud: Best Practices
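The “replace instead of repair” guidance above can be sketched in a few lines. This is a minimal Python illustration, not from the deck; `ReplicaPool` and the instance `factory` callable are hypothetical names standing in for real fleet management.

```python
import random

class ReplicaPool:
    """Route requests across replicas; replace failed instances, never repair them."""

    def __init__(self, factory, size=3):
        self.factory = factory                      # launches a fresh replacement instance
        self.replicas = [factory() for _ in range(size)]

    def call(self, request):
        # Try replicas in random order; when one fails, dispose of it and
        # start a replacement rather than trying to save the dying instance.
        for replica in random.sample(self.replicas, len(self.replicas)):
            try:
                return replica(request)
            except ConnectionError:
                self.replicas.remove(replica)
                self.replicas.append(self.factory())  # replace, don't repair
        raise RuntimeError("all replicas failed; replacements are coming up")
```

Note that overall application availability is the goal: a single dead replica never surfaces as a caller-visible error, only a total pool outage does.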
  • 11. Scalability •Characteristics of Truly Scalable Service • Increasing resources results in a proportional increase in performance • A scalable service is capable of handling heterogeneity • A scalable service is operationally efficient • A scalable service is resilient • A scalable service becomes more cost effective when it grows A scalable architecture is critical to take advantage of a scalable infrastructure Source: Architecting for the Cloud: Best Practices
  • 12. Reliability “The characteristics that ensure that the system behaves deterministically” • Meta • Recovery-oriented computing • Concrete • General: standard reliability analysis remains relevant • Deployment: never repair: restart, reboot, reinstall, replace • Design: invariant checks, hang and timeout detection, failfast, strict exception contracts • Design: single “rude” shutdown path, boot-time recovery, self-verification • Design: failure modeling, negative case testing Source: Architecting for the Cloud: Best Practices
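Two of the design bullets above, invariant checks with failfast and hang/timeout detection, can be sketched as follows. This is an illustrative Python sketch with hypothetical names; a real node would terminate the process so boot-time recovery restarts it cleanly.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def failfast(invariant, message):
    """Invariant check: stop immediately rather than keep running in a corrupt state."""
    if not invariant:
        # A production node would exit here so it can be restarted cleanly.
        raise SystemExit(f"FAILFAST: {message}")

def call_with_deadline(fn, timeout_s):
    """Hang and timeout detection: any call that misses its deadline is a failure."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(fn).result(timeout=timeout_s)
    except FutureTimeout:
        raise TimeoutError(f"call exceeded {timeout_s}s deadline")
    finally:
        pool.shutdown(wait=False)   # abandon the hung worker instead of waiting on it
```

Treating a hang exactly like a failure keeps the system deterministic: either the call answers within its deadline or the caller takes the failure path.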
  • 13. Operations “The characteristics that allow the system to be easily deployed, configured and diagnosed” • Meta • Build self-assembling systems, with no individualized configuration • Design software that self-monitors and self-heals • Practice efficient offline diagnostics • Concrete • Deployment: automated provisioning, role discovery and configuration • Design: universal configuration file for all nodes • Design: instrument code to generate tracing, usage and health information • Deployment: gather, aggregate, understand, use telemetry data • Test: zero-repro engineering
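The “instrument code to generate tracing, usage and health information” bullet can be sketched as a decorator. A minimal Python illustration; the `TELEMETRY` counter is a hypothetical stand-in for the gather/aggregate pipeline the slide describes.

```python
import functools
import time
from collections import Counter

TELEMETRY = Counter()   # stand-in for the telemetry aggregation pipeline

def instrumented(fn):
    """Emit usage, latency and health counters around every call."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        TELEMETRY[f"{fn.__name__}.calls"] += 1
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        except Exception:
            TELEMETRY[f"{fn.__name__}.errors"] += 1   # health signal
            raise
        finally:
            TELEMETRY[f"{fn.__name__}.ms"] += int((time.perf_counter() - start) * 1000)
    return wrapper
```

Because every code path is counted, the aggregated data supports the “zero-repro engineering” goal: diagnosis happens offline from telemetry rather than by reproducing the failure.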
  • 14. Engineering Processes “The rules we create to build software systems that embody our basics” Live the Dream 
  • 15. Service Isolation •Public Service Contract • Versioned • Loosely coupled, no type sharing •Different services do not share persisted state with other services •Services are: • Developed independently • Deployed independently
  • 16. Branching Structure • $/base/main • Base branch for all service branches • A new service branch always starts by branching from /base/main/* • Base only contains common tools, code, scripts and externals • $/common/main • Branches shared binaries, which are shared as NuGet packages via the internal NuGet gallery • $/<svc>/* • Every service resides in its own source branch, to promote service isolation • Each service can be deployed individually • A service branch consists of at least two branches • $/<svc>/main – Working branch; the requirement is that main is always in a building and deployable state – Used to deploy to the non-prod environment • $/<svc>/prod – Reflects the state deployed to the production environment • Additional branches are allowed, but should always parent from /<svc>/main and are not allowed to be used to deploy to prod [branch diagram: $/base/main, $/common/main, $/common/prod, $/svc1/main, $/svc1/prod, $/svc2/prod, $/svc3/main]
  • 17. Builds • No daily builds • All services are in their own branch and deployed at their own cadence, so there is no place for daily builds • Only on-demand builds, triggered by check-in or queue requests • GC (Gated Check-in) builds • Code flows into the branch via a gated check-in system • There is a mandatory code review policy for all code that flows into or changes within the branch • GC builds are NOT retained and are NOT allowed to be used for deployments, only for validation (service overrides, non-prod PPE validation etc.) • GS (Golden Share) builds • Code flows into these branches using “merge” from the parent branch • Running the GC test suites is optional • GS builds are intended to be deployed • GS builds are automatically retained, based on deployment history • N-x builds which have been deployed are automatically retained for rollback purposes • Builds which have not been deployed between current and N-1 are automatically removed, as are builds older than N-x • Optional automatic deployment from GS build to non-prod-ppe and prod-ppe environments to ease the …
  • 18. Environments • non-prod • Core integration environment, however with an SLA! • prod • Production environment • PPE (Pre-Production Environment) used for: • Deployment validation of the services and watchdogs • Synthetic functional validation of the services and watchdogs • Mandatory rollback testing • Each environment (non-prod and prod) has PPE environments to perform these tasks in isolation • General deployment flow: • GC build → (if successful go to #2) • GS build → (if successful go to #3) • GS PROD build → (if successful go to #4) • GS PROD build → prod • Hot Fixing • Hotfixes can be created in the Prod branch and ported back to Main • This is why there is a GS and GC build of each branch, to enable running the gated check-in suites in every environment
  • 19. Sharing binaries using the Internal NuGet Gallery • Consuming projects bind to an explicit version of a package • The NuGet package expresses its dependencies, which automatically get included • At build time, referenced packages and their dependencies are automatically downloaded • Advantages: • Explicit versioning; fewer breakages due to dependency changes • Implicit dependency management; reduced breakage due to missing dependencies • Developers and build systems use the same versions and dependencies • Package references are managed per project • The build system only needs to download once • Use of an internal NuGet gallery improves sharing due to increased discoverability • No need to check in binaries, which keeps the source tree clean and slim!
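The explicit version binding described above is what a classic NuGet `packages.config` expresses; a sketch with hypothetical package ids follows.

```xml
<?xml version="1.0" encoding="utf-8"?>
<packages>
  <!-- Each consuming project pins an explicit version; the package's own
       declared dependencies are resolved automatically at build time. -->
  <package id="Contoso.Svc.Contracts" version="2.1.0" targetFramework="net45" />
  <package id="Contoso.Common.Telemetry" version="1.4.2" targetFramework="net45" />
</packages>
```

Pinning exact versions per project is what makes developer builds and lab builds byte-for-byte consistent.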
  • 20. The Engineering Flow – Shared Binaries [flow diagram: sources in $/common/main go through a gated check-in build to a deployment drop share; a merge from common/main => common/prod triggers a GS build of $/common/prod, whose components are automatically published to the internal NuGet Gallery]
  • 21. The Engineering Flow – Services [flow diagram: check-ins to the $/<svc>/main deployment trigger branch go through a gated check-in build; the deployment manifest and drop share drive automated deployment via machine functions to the non-prod environment (service scale units 1..N, nodes #1..#M); a merge from svc/main => svc/prod produces a GS build that deploys the same way to the prod environment; both builds consume packages from the NuGet Gallery]
  • 22. Deployments •DevOps model: • All engineers can deploy all services • Forces sharing of knowledge and skills • Required to support on-call model •Published Deployment Guidelines • Check list of steps for deployment and validation of each service • Automated KPIs for monitoring health of service • Documents service dependencies, both up and down stream
  • 23. Service Validation • Monitoring • Real-time and historical analysis • Alerting • Must be actionable • Validation • Everybody can run them!
  • 24. Testing using PowerShell • Everybody should be able to run tests • Re-usable atoms • Composition of atoms • Target all environments • Outside-In testing vs. Inside-In testing
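The atom/composition idea above is language-neutral; the deck prescribes PowerShell, but the same shape in Python (hypothetical atom names, illustrative only) looks like this:

```python
def endpoint_alive(env):
    """Atom: the smallest reusable validation step."""
    return env.get("status") == "up"

def latency_within(budget_ms):
    """Atom factory: parameterized, so the same atom can target any environment."""
    return lambda env: env.get("latency_ms", float("inf")) <= budget_ms

def compose(*atoms):
    """Composition: build larger outside-in validations out of atoms."""
    return lambda env: all(atom(env) for atom in atoms)
```

Because a composed suite only takes an environment description as input, the same validation runs unchanged against non-prod, PPE, and prod, which is what lets everybody run it.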
  • 25. Point Developer / Pager Duty • Rotation based (4 weeks, 4 people) • Separates interrupt-driven from schedule-driven work • Provides focus • Pager Duty • Automatic escalation • The complete management chain is involved in incidents • RCA (Root Cause Analysis) • You must be pedantic about RCAs and action them! Availability is King
  • 26. Versioning & Deployment Ordering •The service must support running multiple versions side-by-side! • Required during deployment, service overrides, A-B testing,… •Deploy stateful services before stateless services • Service must be able to support schema versions N, N-1 and N+1
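Supporting schema versions N, N-1 and N+1 side by side can be sketched as version dispatch on each record. A minimal Python illustration; the record shapes and reader names are hypothetical.

```python
def read_v1(doc):
    return {"full_name": doc["name"]}

def read_v2(doc):
    return {"full_name": f"{doc['first']} {doc['last']}"}

# Versions N-1 and N served side by side; supporting N+1 is one more entry.
READERS = {1: read_v1, 2: read_v2}

def read_record(doc):
    """Dispatch on the schema version embedded in each record."""
    version = doc.get("v", 1)
    if version not in READERS:
        raise ValueError(f"unsupported schema version {version}")
    return READERS[version](doc)
```

During a rolling deployment both the old and new service binaries see a mix of record versions, so every release must ship readers for its neighbors.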
  • 27. Data Layer • Evolves to a document/resource centric model • Schema owned by middle tier services • Chunky, cacheable, partitionable • Schema changes: • Owned by service layer • By default: fault-in model; you update to the new version when a record is written, and optionally a write is triggered by reading an older version. This amortizes the cost of the schema update over time. • Optionally trigger the update using a crawler process
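The fault-in model above can be sketched as an upgrade step on the write (and optionally read) path. An illustrative Python sketch; the v1-to-v2 name split is a hypothetical schema change.

```python
def upgrade(doc):
    """Fault-in migration: bring a record to the current schema only when it is
    touched, amortizing the cost of the change across normal traffic."""
    if doc.get("v", 1) < 2:
        # Hypothetical v1 -> v2 change: split a single name field.
        first, _, last = doc.pop("name").partition(" ")
        doc.update({"first": first, "last": last, "v": 2})
    return doc

def write(store, key, doc):
    store[key] = upgrade(doc)      # every write lands in the new schema

def read(store, key):
    doc = upgrade(store[key])      # optionally, reads trigger the upgrade too...
    store[key] = doc               # ...and persist the upgraded record
    return doc
```

A background crawler, as the slide notes, can sweep cold records that traffic never touches.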
  • 28. Best Practices •Design for Failure •Loose Coupling •Implement Elasticity •Think Asynchronous and Parallel
  • 29. Design for Failure • Avoid single points of failure • Assume everything fails, and design backwards • Goal: Applications should continue to function even if the underlying physical hardware fails or is removed or replaced. • Best practices • Use multiple regions • Use Virtual IP addresses (VIP) • Use Load Balancers • Real-time monitoring • Leverage Auto Scaling groups • Practice failures/recovery Always Assume Each Call is your Last Call!
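“Use multiple regions” plus “assume everything fails, and design backwards” reduces to a failover walk over the region list. A minimal Python sketch; the region names and endpoint callables are hypothetical.

```python
def call_with_failover(regions, request):
    """No region is a single point of failure: walk the list until one answers."""
    errors = []
    for name, endpoint in regions:
        try:
            return endpoint(request)
        except ConnectionError as exc:
            errors.append((name, str(exc)))   # feed real-time monitoring
    raise RuntimeError(f"all regions failed: {errors}")
```

In practice the load balancer or VIP does this walk for you, but the application still has to treat every individual call as one that may be its last.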
  • 30. Loose Coupling • Independent components • Design everything as a Black Box • De-coupling for Hybrid models • Load-balance clusters The less coupling, the higher the scale factor
  • 31. Implement Elasticity • Use designs that are resilient to reboot and re-launch • Enable dynamic configuration • Self discovery and join: an instance discovers its own role Horizontal Scaling is the Only Option
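“Self discovery and join” combines naturally with the universal configuration file from slide 13: every node ships the same config and works out its own role at boot. An illustrative Python sketch with hypothetical role names and addresses.

```python
# One universal configuration shipped identically to every node.
UNIVERSAL_CONFIG = {
    "web":   {"hosts": ["10.0.0.1", "10.0.0.2"], "port": 80},
    "queue": {"hosts": ["10.0.1.1"], "port": 5672},
}

def discover_role(my_ip, config=UNIVERSAL_CONFIG):
    """A freshly (re)launched instance discovers its own role from shared
    config, so it survives reboot and re-launch with no per-node setup."""
    for role, spec in config.items():
        if my_ip in spec["hosts"]:
            return role
    return "spare"   # unassigned capacity, ready to join when scaled out
```

Because no node carries individualized state, scaling out is just launching more instances that run the same discovery on boot.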
  • 32. Think Asynchronous and Parallel • Only make non-blocking async cross-service calls! • Use load balancing to distribute load across multiple servers • Decompose tasks into their simplest form • Multi-threading and concurrent requests to cloud services • Leverage parallel MapReduce tasks when appropriate and possible
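The decompose-then-parallelize bullets above can be sketched as a fan-out over a worker pool. A minimal Python illustration; `fetch_shard` is a hypothetical stand-in for a real cross-service call.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_shard(shard_id):
    """Stand-in for a single small cross-service call (hypothetical work)."""
    return shard_id * shard_id

def fan_out(shard_ids, workers=4):
    """Decompose the task into its simplest pieces and run them concurrently
    instead of issuing one big blocking call."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_shard, shard_ids))
```

`pool.map` preserves input order while the calls overlap in flight, so the caller gets the decomposed results back as if they were computed sequentially.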
  • 33. Conclusion • List of software development philosophies • Minimalism (computing) • Reduced instruction set computing • Worse is better (Less is more) • Don’t repeat yourself (DRY) • You aren’t gonna need it (YAGNI) • Rule of Least Power Live by the KISS Principle!
  • 34. Resources • Cloud Design Patterns: Prescriptive Architecture Guidance for Cloud Applications • Private Cloud Principles, Concepts, and Patterns (…concepts-and-patterns.aspx) • Cloud Services Foundation Reference Architecture – Principles, Concepts, and Patterns (…reference-architecture-principles-concepts-and-patterns.aspx)
  • 35. Let us know how you feel about this session! Fill in the evaluation and possibly win one of the 20 prizes*. Winners will be announced via Twitter (#TechDaysNL). Use the personal code on your badge. * All results are final; prizes are examples