-
1.
Scaling Systems: Architectures that Grow
Fundamental patterns for scaling you can implement incrementally
-
2.
Who Am I?
• Kendall Miller
• One of the Founders of Gibraltar Software
– Small Independent Software Vendor Founded in 2008
– Developers of VistaDB and Gibraltar
– Engineers, not Sales People
• Enterprise Systems Architect & Developer since 1995
• BSE in Computer Engineering, University of Illinois
Urbana-Champaign (UIUC)
• Twitter: @KendallMiller
-
3.
Fair Warning
-
4.
What is Scale?
Scaling is the ability to cope
and perform under an
increasing workload.
-
5.
What is Scale?
Scaling to a load = staying available
while sustaining that load
-
6.
What is Scale?
Being available is really
about a request being
completed in a period of
time.
-
7.
What’s your Target?
[Bar chart, linear scale (0 to 7.00E+07): average daily traffic in visitors/day for Microsoft.com, Twitter.com, Amazon.com, Target.com, Slashdot.org, DevExpress.com, Hanselman.com, and Gibraltar Software]
-
8.
What’s your Target?
[Bar chart, logarithmic scale (1.00E+00 to 1.00E+08): average daily traffic in visitors/day for Microsoft.com, Twitter.com, Amazon.com, Target.com, Slashdot.org, DevExpress.com, Hanselman.com, and Gibraltar Software]
-
9.
What’s your Target?
25,000 Visitors/Day = 125,000 Pages/Day
11 High Traffic Hours/Day = 12,000 Pages/Hour
12,000 Pages/Hour = 3.3 Pages/Second
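The arithmetic on this slide can be reproduced with a short sketch. The figure of 5 pages per visitor is an assumption inferred from 25,000 visitors becoming 125,000 pages; the function and constant names are illustrative only. The slide rounds the hourly figure up to 12,000 before dividing, so the unrounded result comes out slightly lower than 3.3.

```python
# Back-of-the-envelope load target, following the slide's three steps.
PAGES_PER_VISITOR = 5     # assumption: implied by 25,000 visitors -> 125,000 pages
HIGH_TRAFFIC_HOURS = 11   # from the slide: high-traffic hours per day
SECONDS_PER_HOUR = 3600


def pages_per_second(visitors_per_day,
                     pages_per_visitor=PAGES_PER_VISITOR,
                     busy_hours=HIGH_TRAFFIC_HOURS):
    """Convert daily visitors into an average page rate during busy hours."""
    pages_per_day = visitors_per_day * pages_per_visitor
    pages_per_hour = pages_per_day / busy_hours
    return pages_per_hour / SECONDS_PER_HOUR


rate = pages_per_second(25_000)
print(f"{rate:.2f} pages/second")  # a little over 3 before the slide's rounding
```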
-
10.
Specific Architectures
• Gossip
• Map Reduce
• Tree of Responsibility
• Stream Processing
• Scalable Storage
• Publish/Subscribe
• Distributed Queues
• Load Balancers + Shared Nothing Units
• Load Balancers + Stateless Nodes + Scalable Storage
• Content Addressable Networks
• General Peer to Peer
-
11.
ACD/C
• Async – Do the work whenever
• Caching – Don’t do any work you don’t have
to
• Distribution – Get as many people to do the
work as you can
• Consistency – We all agree on these key things
-
12.
Async
• Decouple operations so you do the minimum
amount of work in performance critical paths
• Queue work that can be completed later to
smooth out load
• Speculative Execution
• Scheduled Requests (Nightly processes)
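A minimal sketch of the "queue work that can be completed later" idea, using only the Python standard library. The request handler does the minimum (enqueue and return) while a background worker drains the queue. The names `handle_request`, `worker`, and `processed` are hypothetical, and a production system would normally use a durable queue rather than an in-process one.

```python
import queue
import threading

work_queue = queue.Queue()   # in-process stand-in for a real work queue
processed = []               # results produced off the critical path


def handle_request(order_id):
    """Critical path: enqueue the slow work and answer right away."""
    work_queue.put(order_id)
    return "accepted"


def worker():
    """Background path: drain the queue until a None sentinel arrives."""
    while True:
        order_id = work_queue.get()
        try:
            if order_id is None:
                return
            processed.append(f"processed order {order_id}")
        finally:
            work_queue.task_done()


thread = threading.Thread(target=worker, daemon=True)
thread.start()

responses = [handle_request(i) for i in range(3)]  # callers never wait on the work
work_queue.put(None)                               # tell the worker to stop
work_queue.join()                                  # wait for queued work to finish
thread.join()
```

The caller gets "accepted" before the work is done, which is the trade: failures surface later and have to be reported some other way.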
-
13.
Caching
• Save results of earlier work nearby where they
are handy to use again later
• Apply in front of anything that’s time
consuming
• Easiest to apply from the left to the right
• Simple strategies can be really effective (EF
Dump all on update)
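One way to read the "dump all on update" strategy is a cache that keeps every loaded result until anything changes, then throws the whole cache away. The sketch below assumes that reading; `DumpAllCache`, `invalidate_all`, and the dictionary standing in for the database are all hypothetical names for illustration.

```python
class DumpAllCache:
    """Cache in front of an expensive loader; discards everything on update."""

    def __init__(self, load):
        self._load = load      # the expensive, authoritative lookup
        self._entries = {}
        self.loads = 0         # how many times the loader was actually called

    def get(self, key):
        if key not in self._entries:
            self._entries[key] = self._load(key)
            self.loads += 1
        return self._entries[key]

    def invalidate_all(self):
        """Call after any write: simple, and never serves stale entries."""
        self._entries.clear()


database = {"widget": 10}                    # stand-in for real storage
cache = DumpAllCache(lambda key: database[key])

first = cache.get("widget")     # miss: loads from the database
second = cache.get("widget")    # hit: no database call
database["widget"] = 12         # an update happens...
cache.invalidate_all()          # ...so dump the whole cache
third = cache.get("widget")     # miss again, sees the new value
```

This wastes some cached work on every write, but it avoids tracking which entries a given update affects.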
-
14.
Why Caching?
• Loading the world is impractical
• Apps ask a lot of repeating questions.
– Stateless applications even more so
• Answers don’t change often
• Authoritative information is expensive
-
15.
Distribution
• Distribute requests across multiple systems
• Classic web “Scale Out” approach
• The less state held, the easier to distribute
work.
– Distributed database = hard
– Distributed static content server = easy
• Request routing for distribution can serve
other availability purposes
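As a sketch of distributing requests across multiple systems, the simplest policy is round-robin over a fixed list of stateless servers. The server names and the `RoundRobinBalancer` class are hypothetical; a real load balancer would also track server health and weights.

```python
import itertools


class RoundRobinBalancer:
    """Hands each incoming request to the next server in a fixed rotation."""

    def __init__(self, servers):
        if not servers:
            raise ValueError("at least one server is required")
        self._cycle = itertools.cycle(list(servers))

    def route(self, request):
        # The request content is ignored: this only works because the
        # servers hold no per-client state.
        return next(self._cycle)


balancer = RoundRobinBalancer(["web-1", "web-2", "web-3"])
assignments = [balancer.route(f"GET /page/{i}") for i in range(6)]
```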
-
16.
Consistency
• The degree to which all parties observe the
same state of the system at the same time
• Scaling inevitably requires compromise
– Absolute consistency forces one source of the truth and
requires extensive locking to ensure parties agree
– The real world doesn’t require the consistency we
tend to demand of our systems
-
17.
Consistency Challenges
• Singleton Data Structures (Order numbers..)
• State held between the endpoints of a process
• Consistent results of queries across
partitioned datasets
-
18.
Typical Application
[Diagram: Client (Web Browser), Server (Web Server), and Storage (Database), annotated with points of contention]
• Session State
• SSL Session
• Log Contention
• Memory Allocation/GC
• Network Sockets
• Request Queue
• Transaction Isolation
• Reader/Writer Locks
• Singleton Data Structures
-
19.
Caching
[Diagram: Client (Web Browser), Server (Web Server), and Storage (Database), with the labels 100%, 50%, 10%, and 1% running from left to right across the tiers]
-
20.
Distribution
• Session State and Identity need to be factored out
• Partition (Sticky Session) first, then stateless nodes
[Diagram: multiple Clients (Web Browser) connecting to multiple Servers (Web Server), which share one Storage (Database)]
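The "partition (sticky session) first" step can be sketched as a router that always sends the same session to the same server, so session state can stay on that server for now. Hashing the session ID is one assumed way to do this; the function name and server names are hypothetical.

```python
import hashlib


def sticky_server(session_id, servers):
    """Map a session ID to one server, the same one on every call."""
    if not servers:
        raise ValueError("at least one server is required")
    # Use a stable hash; Python's built-in hash() of str varies per process.
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]


servers = ["web-1", "web-2"]
first_hit = sticky_server("session-abc", servers)
second_hit = sticky_server("session-abc", servers)
```

Changing the server list remaps most sessions under this simple modulo scheme, which is one reason to move on to stateless nodes once session state and identity are factored out.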
-
21.
Partitioned Storage Zones
[Diagram: Clients (Web Browser) routed to two zones, each with its own group of Servers (Web Server) and its own Storage (Database)]
-
22.
Partitioned Storage Intra-Zone
[Diagram: Clients (Web Browser) connecting to Servers (Web Server), with storage inside the zone split into partitions labeled Orders, Customer B, Products, and Inventory]
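A rough sketch of intra-zone partitioning: the middle tier picks a storage partition from an easy-to-determine characteristic of the request. Here orders are split by customer while products and inventory each get their own store. The store names, the customer keys, and the `storage_for` function are all hypothetical; the slide only names the partitions.

```python
# Hypothetical partition map; the slide's diagram names the partitions only.
ORDER_STORES_BY_CUSTOMER = {
    "customer-a": "orders-db-1",
    "customer-b": "orders-db-2",
}
OTHER_STORES = {
    "products": "products-db",
    "inventory": "inventory-db",
}


def storage_for(entity, customer=None):
    """Choose the storage partition for a request in the middle tier."""
    if entity == "orders":
        if customer not in ORDER_STORES_BY_CUSTOMER:
            raise KeyError(f"no order partition for customer {customer!r}")
        return ORDER_STORES_BY_CUSTOMER[customer]
    if entity not in OTHER_STORES:
        raise KeyError(f"no partition for entity {entity!r}")
    return OTHER_STORES[entity]


order_store = storage_for("orders", customer="customer-b")
product_store = storage_for("products")
```

A report that spans several of these stores has to query each one and merge the results, which is where the consistency strategy comes in.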
-
23.
Asynchronous Processing
[Diagram: Servers (Web Server), an Order Queue, an Order Processing Server, and storage partitions labeled Orders, Products, and Inventory]
-
24.
Fallacies of Distributed Computing
• The network is reliable
• Latency is zero
• Bandwidth is infinite
• The network is secure
• Topology doesn’t change
• There is one administrator
• Transport cost is zero
• The network is homogeneous
-
25.
Fresh Problems: Partial Failures
[Diagram: multiple Clients (Web Browser) connecting to multiple Servers (Web Server), which share one Storage (Database)]
-
26.
Fresh Problems: Partial Failures
1. Break system into individual failure zones
2. Monitor each instance of each zone for
problems
3. Route around bad instances
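The three steps above can be sketched as a health check that marks instances good or bad, plus a router that only picks healthy instances in the requested zone. The `Instance` class, the `probe` callback, and the instance names are hypothetical; a real probe would make a network call with a timeout.

```python
class Instance:
    """One instance of a failure zone, for example a single web server."""

    def __init__(self, name, zone):
        self.name = name
        self.zone = zone
        self.healthy = True


def check_health(instances, probe):
    """Step 2: monitor every instance; a failing probe marks it unhealthy."""
    for instance in instances:
        try:
            instance.healthy = bool(probe(instance))
        except Exception:
            instance.healthy = False


def pick_instance(instances, zone):
    """Step 3: route around bad instances within the zone."""
    for instance in instances:
        if instance.zone == zone and instance.healthy:
            return instance
    raise RuntimeError(f"no healthy instance in zone {zone!r}")


instances = [Instance("web-1", "web"), Instance("web-2", "web")]
check_health(instances, probe=lambda inst: inst.name != "web-1")  # web-1 is down
chosen = pick_instance(instances, "web")
```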
-
27.
Without monitoring, redundancy is worthless
-
28.
Fresh Problems: Upgrades
Server
Client (Web
Server
(Web Server)
(Web Storage
Client (Database)
Browser) Server)
(Web
Client
Browser)
(Web
Client Server
Browser)
(Web (Web
Browser) Server
Server) Storage
(Web
Server) (Database)
-
29.
Fresh Problems: Upgrades
1. Break system into individual upgrade zones
2. Upgrade each zone: drain and stop, upgrade, verify.
3. Cut traffic over to updated zones
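A sketch of the zone-by-zone upgrade loop, assuming each zone can be taken out of rotation independently. The `upgrade` and `verify` callbacks and the zone names are hypothetical placeholders for whatever deployment tooling is in use; a zone that fails verification is left out of rotation and the rollout stops.

```python
def rolling_upgrade(zones, upgrade, verify):
    """Upgrade one zone at a time; return (zones taking traffic, event log)."""
    live = list(zones)   # zones currently receiving traffic
    log = []
    for zone in zones:
        live.remove(zone)               # drain and stop: no new traffic here
        log.append(("drain", zone))
        upgrade(zone)
        log.append(("upgrade", zone))
        if not verify(zone):
            log.append(("failed", zone))
            return live, log            # stop; remaining zones keep serving
        live.append(zone)               # cut traffic over to the updated zone
        log.append(("cut over", zone))
    return live, log


versions = {"zone-a": 1, "zone-b": 1}


def upgrade_zone(zone):
    versions[zone] = 2


live_zones, events = rolling_upgrade(
    ["zone-a", "zone-b"],
    upgrade=upgrade_zone,
    verify=lambda zone: versions[zone] == 2,
)
```

While one zone is draining, the others carry its load, so each zone needs enough headroom to cover for at least one absent peer.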
-
30.
Design for Software Update
From the Start
• Don’t forget Data Schemas
-
31.
Bringing Home the Bacon
Testing
Testing
Testing
-
32.
Critical Lessons Learned
• ACD/C
• Clear Consistency
Strategy
• Build in monitoring and
management
-
33.
Additional Information:
Websites
– www.GibraltarSoftware.com
– www.eSymmetrix.com
Follow Up
– Kendall.Miller@eSymmetrix.com
– Twitter: kendallmiller
What level of scaling are we talking about? Scaling is the ability to cope and perform under an increasing workload.
This is VISITORS per DAY:
• Microsoft.com: 60M
• Twitter.com: 35M
• Amazon.com: 15M
• Target.com: 2M
• DevExpress.com & Telerik.com: 25K
• Hanselman.com: 12K
• Gibraltar Software: 1K
THIS IS NOT ABOUT ASYNC FOR FASTER PERCEIVED PERFORMANCE
• Improve response under load
• Do only the work you have to
• Up to 95% of the work on the typical site can be pulled from cache
• Add reverse proxy (Load Balancer)
• Add additional middle tier servers
• Session state and identity need to be factored out
• Partition (“Sticky session”) first, then true load balancing with no state in center
• Break down traffic by easy to determine characteristic: Customer, product category, etc.
• Add storage regions that are self-consistent
• Can vary exact mix of what data is in each container and how you partition
• Typically some parts may be shared, like Identity
• Cross-zone aggregation is slow
• Cross-zone coherency strategy
• Middle tier routes storage requests based on easy to determine characteristic
• Consistency strategy complexity (reports may reflect delayed data, different parties may not see the same view of the world)
• Separate long running, dangerous, or serialized tasks from general work
• Workflow consistency strategy required
• Complications with deployment and versioning
• Deferred failure scenarios