Building a highly scalable and available cloud application


Published in: Technology

  1. SQLCAT - Shared technical learnings
  2. Agenda
     • Fit for Purpose – what makes a good cloud app?
     • Shifting Perspective – Designing for Cloud
     • Lessons Learned
     • Summary / Q&A
  3. Setting the Stage
     • Customer Advisory Team (CAT) works on big mean projects
       • Including a lot of big mean Azure projects
     • Collating guidance and learnings from the last year of engagement
     • This discussion is a peek at some of what we’ve learned about Azure applications at serious scale
       • Take a deep breath. Not all Azure applications are this involved
  4. Fit for Purpose – What makes a good cloud app?
  5. Fit for Purpose – What makes a good cloud app?
     [Diagram labels: DISPERSED USERS & DATA; ELASTIC DEMAND; SCALE OUT]
  6. Shifting Perspective – Designing for Cloud
     • Scale-out not scale-up
     • Everything has a limit – compose for scale
     • Design for failure
     • Design for continuity
     • Optimize for density
  7. Scale-out not scale-up
     • Traditional 3-tier application
     • Make “everything stateless”
     • Where is the state?
     [Diagram: Load Balancer; Web Servers; App Servers]
  8. Scale-out not scale-up
     • Traditional 3-tier application
     • Make “everything stateless”
     • Where is the state?
     • Oh, right.. in the scale-up database
     [Diagram: Load Balancer; Web Servers; App Servers; Database]
  9. Scale-out not scale-up
     • Challenge: architect applications to use partitioned data store
       • Connection management
       • Data partitioning & affinity
       • Scatter / gather queries
       • Resource management
     [Diagram: Azure Load Balancer; DB1, DB2, DB3]
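
The data partitioning and scatter / gather bullets on this slide can be illustrated with a minimal Python sketch. It is not taken from the deck: the three in-memory dictionaries are hypothetical stand-ins for DB1, DB2 and DB3, and the names `shard_for`, `put`, `get` and `scatter_gather` are made up for illustration. A stable hash of the key picks the partition, and a query that cannot be routed by key is sent to every partition and the results merged.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for DB1, DB2 and DB3.
SHARDS = [dict() for _ in range(3)]

def shard_for(key: str) -> int:
    """Map a partition key to a shard index using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % len(SHARDS)

def put(key: str, value) -> None:
    """Write to the single shard that owns the key (data affinity)."""
    SHARDS[shard_for(key)][key] = value

def get(key: str):
    """Read from the single shard that owns the key."""
    return SHARDS[shard_for(key)].get(key)

def scatter_gather(predicate):
    """Run the same filter on every shard in parallel, then merge the results."""
    def query(shard):
        return [(k, v) for k, v in shard.items() if predicate(v)]
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        parts = list(pool.map(query, SHARDS))
    return sorted(row for part in parts for row in part)
```

Keyed reads and writes touch one partition; only the scatter / gather path pays the cost of fanning out to all of them, which is one reason to choose a partition key that matches the most common queries.
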
  10. Everything has a limit – compose for scale
      • Ship as much as you want
      • Provided it will fit into the standard “scale units”
      • Want to ship more – use more containers.
  11. Design for failure
      • Traditional approach: harden the database
      • Cloud approach: expect failures, design for them, work around them
  12. Optimize for Density
      • Density is cost of goods
      • Chunky not chatty
      • Framework and library efficiency
  13. Handling Transient and Enduring Failures
      • Given enough scale, time and pressure all components or services will fail
        • Your application will experience 1..N failures
      • Transient failures: temporary service interruptions
        • Dropped connections, failed queries
      • Enduring failures: require intervention
        • Incorrect configuration, long-running service unavailability
  14. Handling Transient and Enduring Failures
      • Use fault-handling frameworks that recognize transient errors
      • Appropriate retry and backoff policies
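
A retry and backoff policy of the kind this slide calls for might look like the Python sketch below. It is an illustration under stated assumptions, not the framework used in the talk: `TransientError` is a hypothetical exception class standing in for whatever errors a real fault-handling framework classifies as transient, and the delay parameters are arbitrary defaults. Only transient errors are retried, the delay ceiling doubles on each attempt up to a cap, and the actual wait is randomized so that many clients do not retry in lockstep.

```python
import random
import time

class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""

def retry(operation, max_attempts=5, base_delay=0.1, max_delay=5.0, sleep=time.sleep):
    """Call operation(), retrying transient failures with capped exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # retries exhausted: treat as an enduring failure
            # Delay ceiling doubles each attempt: base, 2*base, 4*base, ...
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Random wait in [0, ceiling] spreads out competing clients.
            sleep(random.uniform(0, ceiling))
```

Any other exception type propagates immediately, which keeps enduring failures such as incorrect configuration from being retried pointlessly.
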
  15. Handling Transient and Enduring Failures
      [Chart: Web Request Response Latency, in seconds; series: Avg Latency, Response latency]
  16. Scale it out, stitch it together
      • Partitioning strategies:
        • Horizontal
        • Vertical
        • Hybrid
      • CSV vs. Global data model
  17. Telemetry is Life
      • You don’t know what you didn’t capture
      • Split the streams: high-volume (low-fidelity) and high-value
      • Know you’re “down” before your users are!
        • Be able to figure out why afterwards
  18. Handling transient failures
      • Logging transient failures
      • Logging all external API calls with timing
      • Logging full exception (not .ToString())
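
The logging points on this slide translate to any platform. The Python sketch below is an illustrative analogue rather than the code shown in the talk: `timed_call` and the logger name `telemetry` are hypothetical. It wraps an external call, records how long it took, and on failure uses `log.exception`, which attaches the full traceback instead of only the exception's string form.

```python
import logging
import time
from contextlib import contextmanager

log = logging.getLogger("telemetry")  # hypothetical logger name

@contextmanager
def timed_call(name):
    """Log the outcome and duration of one external API call."""
    start = time.perf_counter()
    try:
        yield
    except Exception:
        elapsed_ms = (time.perf_counter() - start) * 1000
        # log.exception records the full traceback, not just str(exc).
        log.exception("call=%s status=failed elapsed_ms=%.1f", name, elapsed_ms)
        raise
    else:
        elapsed_ms = (time.perf_counter() - start) * 1000
        log.info("call=%s status=ok elapsed_ms=%.1f", name, elapsed_ms)
```

A call site would then read `with timed_call("storage.put"): ...`, so every external dependency is timed the same way and failures keep their stack traces.
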
  19. Telemetry is Life
      [Diagram: telemetry pipeline]
      • Data Sources: IIS logs; Application logs; Performance counters
      • Per-Application Server
        • High value data: Filter; Aggregate; Publish
        • High volume data: Batch; Partition; Archive
      • High value data consumer: Generate alerts; Display dashboard; Operational intelligence
      • High volume data consumer: Data mining / analysis; Historical trends; Root Cause Analysis
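
The split between a high-value stream and a high-volume stream could be sketched as follows. This is a toy Python model under loud assumptions: the `TelemetryRouter` class, the rule that "error" and "critical" events are the high-value ones, and the in-memory lists standing in for an alerting channel and an archive are all hypothetical. Every event goes to the batched high-volume path; only selected events are also forwarded immediately.

```python
from collections import deque

# Assumption: which levels count as "high value" is a per-application choice.
HIGH_VALUE_LEVELS = {"error", "critical"}

class TelemetryRouter:
    """Toy router: immediate high-value path, batched high-volume path."""

    def __init__(self, batch_size=100):
        self.batch_size = batch_size
        self.alerts = deque()        # high-value: available to alerting at once
        self.archived_batches = []   # stand-in for an archive store
        self._buffer = []            # high-volume: accumulated until a batch is full

    def emit(self, event: dict) -> None:
        if event.get("level") in HIGH_VALUE_LEVELS:
            self.alerts.append(event)
        self._buffer.append(event)
        if len(self._buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Archive whatever is buffered as one batch."""
        if self._buffer:
            self.archived_batches.append(self._buffer)
            self._buffer = []
```

Keeping the two paths separate means dashboards and alerts are not delayed by batching, while the full event history is still available later for root cause analysis.
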
  20. Managing Connections
      • Instances * DBs * Pool Size
      • Each hosted service has 1 IP
      • Each DB cluster has 1 IP
      • How big is a routing table entry for IPv4? Hint: 64k
      [Diagram: Azure Load Balancer; DB1, DB2, DB3]
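
The product on this slide grows quickly, which a short calculation makes concrete. The Python sketch below rests on two assumptions that are not spelled out in the deck: that the "64k" hint refers to the 16-bit port space (65,536 values) available between one source IP and one destination endpoint, and that all databases sit behind a single cluster IP. The instance count and pool size in the usage note are made-up illustration values.

```python
PORT_SPACE = 2 ** 16  # 65,536: number of values in a 16-bit port field

def total_connections(instances: int, databases: int, pool_size: int) -> int:
    """Worst-case open connections: every instance pools to every database."""
    return instances * databases * pool_size

def max_pool_size(instances: int, databases: int, budget: int = PORT_SPACE) -> int:
    """Largest per-database pool size that keeps the total within the budget."""
    return budget // (instances * databases)
```

With a hypothetical 100 instances, 400 databases and a pool size of 10, `total_connections(100, 400, 10)` is 400,000, well past the assumed 65,536 ceiling, and `max_pool_size(100, 400)` is 1. Under these assumptions, connection management has to be designed rather than left to default pooling.
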
  21. Optimize work: batch & align
      • Challenge:
        • Optimize insert of activity and user data into a scale-out data tier (400+ databases)
        • Transient failure – retries
        • Enduring failure – failover to alternate store
        • Optimize for partition alignment
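
Batching with partition alignment can be sketched in a few lines of Python. This is an illustration, not the implementation described in the talk: `align_batches`, the `partition_of` callback and the `max_batch` limit are hypothetical names and values. Records are grouped by the partition that owns them, so each batch can be sent to exactly one database in a single round trip.

```python
from collections import defaultdict

def align_batches(records, partition_of, max_batch=500):
    """Group records by owning partition and yield (partition, batch) pairs.

    partition_of is a caller-supplied function mapping a record to its
    partition; no batch mixes partitions or exceeds max_batch rows.
    """
    by_partition = defaultdict(list)
    for record in records:
        by_partition[partition_of(record)].append(record)
    for partition, rows in by_partition.items():
        for start in range(0, len(rows), max_batch):
            yield partition, rows[start:start + max_batch]
```

Each yielded batch is a natural unit for the retry and failover handling listed on the slide, since a failure affects one partition's rows and nothing else.
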
  22. Impact of Interface
      • Be careful about paying for features you don’t use
      • Look at optimized frameworks / libraries for key aspects
        • Balance features vs. performance – CoGS can add up quickly
  23. Summary / Q&A
      Mark Simms, masimms@microsoft.com, Twitter: @mabsimms
      • Architecture is key
      • Failure is the norm; expect it, design for it
      • Scale through partitioning and composition
      • Scale exposes the seams of your implementation
      • CAT preparing to publish hands-on guidance with reusable patterns