SQLCAT - Shared technical learnings
Agenda
•   Fit for Purpose – what makes a good cloud app?
•   Shifting Perspective – Designing for Cloud
•   Lessons Learned
•   Summary / Q&A
Setting the Stage
• Customer Advisory Team (CAT) works on big mean projects.
 • Including a lot of big mean Azure projects
• Collating guidance and learnings from the last year of
  engagement
• This discussion is a peek at some of what we’ve learned about
  Azure applications at serious scale
 • Take a deep breath. Not all Azure applications are this involved :)
Fit for Purpose – What makes a good cloud app?
• Dispersed users & data
• Elastic demand
• Scale out
Shifting Perspective – Designing for Cloud
•   Scale-out not scale-up
•   Everything has a limit – compose for scale
•   Design for failure
•   Design for continuity
•   Optimize for density
Scale-out not scale-up
• Traditional 3-tier application
• Make “everything stateless”
• Where is the state?
[Diagram: Load Balancer, Web Servers, App Servers]
Scale-out not scale-up
•   Traditional 3-tier application
•   Make “everything stateless”
•   Where is the state?
•   Oh, right.. in the scale-up database
[Diagram: Load Balancer, Web Servers, App Servers, Database]
Scale-out not scale-up
• Challenge: architect applications to use a partitioned data store
 •   Connection management
 •   Data partitioning & affinity
 •   Scatter / gather queries
 •   Resource management
[Diagram: Azure Load Balancer in front of partitioned databases DB1, DB2, DB3]
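A scatter / gather query sends the same request to every partition and merges the partial results. The sketch below is illustrative only: the in-memory `SHARDS` dictionary stands in for DB1, DB2 and DB3, and the names `query_shard` and `scatter_gather` are hypothetical, not from the talk.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for DB1..DB3; a real system would hold
# one connection pool per partition instead.
SHARDS = {
    "DB1": [{"user": "ann", "score": 7}, {"user": "bob", "score": 3}],
    "DB2": [{"user": "cid", "score": 9}],
    "DB3": [{"user": "dee", "score": 5}, {"user": "eve", "score": 8}],
}


def query_shard(name, min_score):
    """Run the same filter against one partition (the scatter half)."""
    return [row for row in SHARDS[name] if row["score"] >= min_score]


def scatter_gather(min_score):
    """Fan the query out to every shard in parallel, then merge and sort."""
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partials = list(pool.map(lambda name: query_shard(name, min_score), SHARDS))
    merged = [row for part in partials for row in part]  # the gather half
    return sorted(merged, key=lambda row: row["score"], reverse=True)
```

The overall latency of such a query is bounded by the slowest partition, which is one reason data affinity (keeping a request on a single partition) is usually preferred where the data model allows it.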
Everything has a limit – compose for scale
• Ship as much as you want
• Provided it will fit into the
  standard “scale units”
• Want to ship more – use more
  containers.
Design for failure
• Traditional approach: harden
  the database
• Cloud approach: expect
  failures, design for them, work
  around them
Optimize for Density
• Density is cost of goods
• Chunky not chatty
• Framework and library
 efficiency
Handling Transient and Enduring Failures
• Given enough scale, time and pressure, all components or services will fail
  • Your application will experience 1..N failures
• Transient failures: temporary service interruptions
 • Dropped connections, failed queries
• Enduring failures: require intervention
 • Incorrect configuration, long-running service unavailability
Handling Transient and Enduring Failures
• Use fault-handling
  frameworks that recognize
  transient errors
• Appropriate retry and
  backoff policies
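A minimal sketch of a retry policy with exponential backoff and jitter. The talk does not name a specific framework, so everything here is an assumption: `TransientError` is a hypothetical marker exception, and the delays and attempt counts are placeholder values.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (dropped connection, timeout)."""


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1,
                       max_delay=5.0, sleep=time.sleep):
    """Call operation(); on TransientError wait and retry, doubling the delay."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # retries exhausted: treat as an enduring failure
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter spreads out retries so many clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
```

Only `TransientError` is retried; any other exception propagates immediately, which keeps enduring failures such as incorrect configuration from being hidden behind retries.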
Handling Transient and Enduring Failures
[Chart: Web Request Response Latency. Y-axis: seconds (0 to 450). X-axis: samples 1 to 29. Series: Avg Latency, Response latency.]
Scale it out, stitch it together
• Partitioning strategies:
  • Horizontal
  • Vertical
  • Hybrid
• CSV vs. Global data model
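Horizontal partitioning needs a stable rule that maps a key to a partition. The following is a sketch of one common choice (hash the key, take it modulo the shard count); the shard names and the `shard_for` helper are hypothetical and not taken from the talk.

```python
import hashlib

SHARD_NAMES = ["DB1", "DB2", "DB3"]  # hypothetical partition names


def shard_for(key, shards=SHARD_NAMES):
    """Map a partition key to a shard using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]
```

A cryptographic hash is used here only because it is stable across processes and machines, unlike Python's built-in `hash()` for strings. Note that plain modulo placement moves most keys when the shard count changes, so systems that expect to add partitions often use a lookup table or consistent hashing instead.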
Telemetry is Life
• You don’t know what you didn’t capture
• Split the streams: high-volume (low-fidelity) and high-value
• Know you’re “down” before your users are!
  • Be able to figure out why afterwards
Handling transient failures
• Logging transient failures
• Logging all external API calls with timing
• Logging full exception (not .ToString())
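The slide refers to .NET (`.ToString()`); as a language-neutral illustration, here is a Python sketch of the same three habits. The `timed_call` helper and the `telemetry` logger name are hypothetical. It wraps an external call, logs its duration, and on failure logs the full exception with traceback rather than only its message.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("telemetry")  # hypothetical logger name


@contextmanager
def timed_call(name):
    """Log the outcome and duration of the external call made inside the block."""
    start = time.perf_counter()
    try:
        yield
    except Exception:
        elapsed_ms = (time.perf_counter() - start) * 1000
        # logger.exception attaches the full traceback, not just str(exc).
        logger.exception("call=%s status=failed elapsed_ms=%.1f", name, elapsed_ms)
        raise
    else:
        elapsed_ms = (time.perf_counter() - start) * 1000
        logger.info("call=%s status=ok elapsed_ms=%.1f", name, elapsed_ms)
```

Usage would look like `with timed_call("orders-db.insert"): run_insert()`, where `run_insert` is whatever external call the application makes. The exception is re-raised after logging so retry logic further up the stack still sees it.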
Telemetry is Life
• Per-application server data sources
 • IIS logs
 • Application logs
 • Performance counters
• High value data: filter, aggregate, publish
 • Consumer: generate alerts, display dashboard, operational intelligence
• High volume data: batch, partition, archive
 • Consumer: data mining / analysis, historical trends, root cause analysis
Managing Connections
•   Instances * DBs * Pool Size
•   Each hosted service has 1 IP
•   Each DB cluster has 1 IP
•   How big is a routing table entry for IPv4?
    • Hint: 64k
[Diagram: Azure Load Balancer in front of DB1, DB2, DB3]
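One reading of the hint: TCP port numbers are 16 bits wide, so a single source IP can hold at most about 64k concurrent connections to one destination IP and port. The arithmetic below is a sketch under that reading; the deployment numbers are hypothetical, and it assumes all databases sit behind a single cluster IP.

```python
MAX_PORTS = 2 ** 16  # TCP port numbers are 16-bit, so 65,536 values


def total_connections(instances, databases, pool_size):
    """Worst-case open connections if every instance fills one pool per database."""
    return instances * databases * pool_size


# Hypothetical deployment: 20 instances, 400 databases, pool size 10.
worst_case = total_connections(20, 400, 10)
print(worst_case, worst_case > MAX_PORTS)  # 80000 True
```

Under these assumptions the worst case exceeds the available port space, which is why pool size per database has to shrink as the instance and database counts grow.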
Optimize work: batch & align
• Challenge:
 • Optimize insert of activity and user data into a scale-out data tier (400+
   databases)
 • Transient failure – retries
 • Enduring failure – failover to alternate store
 • Optimize for partition alignment
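A sketch of partition-aligned batching: rows are grouped by their target partition first, then cut into batches, so each round trip touches exactly one database. The placement rule (`user_id` modulo partition count) and the helper names are assumptions for illustration, not the scheme used in the project described.

```python
from collections import defaultdict


def partition_of(user_id, partition_count):
    """Hypothetical placement rule: integer user id modulo partition count."""
    return user_id % partition_count


def batch_by_partition(rows, partition_count, batch_size):
    """Yield (partition, batch) pairs where each batch targets one partition."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[partition_of(row["user_id"], partition_count)].append(row)
    for partition, items in sorted(grouped.items()):
        for start in range(0, len(items), batch_size):
            yield partition, items[start:start + batch_size]
```

Each yielded batch can then be inserted in a single call, wrapped in the retry policy for transient failures and redirected to an alternate store if the partition stays unavailable.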
Impact of Interface
• Be careful about paying for features you don’t use
• Look at optimized frameworks / libraries for key aspects
 • Balance features vs. performance – CoGS can add up quickly
Summary / Q&A
•   Architecture is key
•   Failure is the norm; expect it, design for it
•   Scale through partitioning and composition
•   Scale exposes the seams of your implementation
•   CAT preparing to publish hands-on guidance with reusable
    patterns
Mark Simms
masimms@microsoft.com
Twitter: @mabsimms

Building a highly scalable and available cloud application
