The Art of Scalability - Managing growth

The ability to grow (and shrink) according to the needs and the available resources is an essential part of designing applications. In this talk we'll cover the fundamental elements of scalability, including aspects involving people, processes and technology. With sound and proven principles and some advice on how to shape your organisation, set the right processes and design your application, this session is a must-see for developers and technical leads alike.

Published in: Technology, Business

  1. The Art of Scalability: Managing Growth. Lorenzo Alberton, Amsterdam, 11th June 2010
  2. Scalability. Scalability is a desirable property of a system, a network, a business or a process, which indicates its ability to handle growing amounts of work (http://en.wikipedia.org/wiki/Scalability).
  3. Scalable ≠ Fast. A service is said to be scalable if, when we increase the resources in a system, it results in increased performance in a manner proportional to resources added (http://www.julianbrowne.com/article/viewer/scalability). Increasing performance in general means serving more units of work, but it can also be to handle larger units of work, such as when data sets grow (http://highscalability.com/amazon-architecture).
  4. Scalability Is About... People, Processes, Technology
  5. People: Staffing, Roles, Leadership, Management
  8. Roles And Responsibilities. Role clarity: overlapping responsibilities and areas missing responsibilities lead to wasted effort, value-destroying conflicts and failed scale initiatives. Key scale-related responsibilities: set measurable goals; staff the team with the appropriate skills; define and implement a scalable architecture; test, monitor, develop future demand projections; define future changes based on the analysis.
  13. Leadership. Inspire people; set the right vision and goals; create the right culture; create the right tools (together: accelerator for growth). Vision = where we are going; mission = general direction on how to get there; goals = milestones along the path. Goals should be SMART: Specific, Measurable, Achievable (but Aggressive), Realistic, Time-bound. People (Chip & Dan Heath, “Switch: How To Change Things When Change Is Hard”): direct the rider, motivate the elephant, shape the path.
  15. Management. Project management: goals, projects, tasks, individuals, measurement, communication, resolution. People management: hiring, firing, growth.
  16. Organisational Structure And Team Size. Too small: micromanaging managers, overworked team members. Too big: poor communication, low morale, low productivity.
  18. Team Structure: functional vs. matrix. [Diagram: a CTO and PMs; functional groups of Designers, Developers and Testers; in the matrix, Proj 1 to Proj 4 each have a PM, a Designer, a Developer and a Tester.]
  19. Building Processes For Scale
  24. Why Are Processes Critical? Augment management of teams and employees; standardise actions in repetitive tasks; reduce mundane decisions to focus on grander ideas; allow the team to react quickly to crisis; determine system capacity and scalability needs. Challenge: right amount, right process, right time.
  28. Determining Headroom For Apps. [Chart: Capacity vs. Current Load.] Why? Planning annual budget, hiring plan, prioritisation.
  33. Headroom Process: 1. Identify major components; 2. Identify responsible team; 3. Determine usage and capacity (e.g. 315 queries/sec, 20MB/min); 4. Determine growth rate. Headroom formula (M. L. Abbott, M. T. Fisher, “The Art Of Scalability”, Addison Wesley):
      $$\text{Headroom} = (\text{ideal usage percentage} \times \text{max capacity}) - \text{current usage} - \sum_{t=1}^{12} \big(\text{growth}(t) - \text{optimisation projects}(t)\big)$$
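The headroom formula on slide 33 can be sketched in Python. This is only an illustration under stated assumptions: the function name, the per-month lists, the 1,000 queries/sec maximum capacity, the 50% ideal usage and the growth figures are hypothetical; only the 315 queries/sec current usage comes from the slide.

```python
def headroom(ideal_usage_pct, max_capacity, current_usage, growth, optimisations):
    """Headroom over the planning horizon, in the same unit as max_capacity.

    growth and optimisations hold one value per period t (12 months on the
    slide): expected extra load, and load won back by optimisation projects.
    """
    if len(growth) != len(optimisations):
        raise ValueError("growth and optimisations must cover the same periods")
    net_growth = sum(g - o for g, o in zip(growth, optimisations))
    return ideal_usage_pct * max_capacity - current_usage - net_growth


# Hypothetical figures, except current usage (315 queries/sec, from the slide).
remaining = headroom(
    ideal_usage_pct=0.5,
    max_capacity=1000,
    current_usage=315,
    growth=[10] * 12,
    optimisations=[2] * 12,
)
print(remaining)  # 0.5 * 1000 - 315 - (120 - 24) = 89.0
```

A negative result would mean the component is expected to pass its ideal usage level within the 12 periods.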
  38. Joint Architecture Design + Review Board. Engineering, Architecture and Operations; Architecture Review Board. Meeting: state goal, review alternative designs, Q&A session, deliberation, vote, conclusion. (M. L. Abbott, M. T. Fisher, “The Art Of Scalability”, Addison Wesley)
  41. Controlling Change in Production Environment. Change Management Process: proposal, approval, scheduling, logging, review. Change Identification Process: date & time of the change, system undergoing the change, expected results, contact information, rollback procedure.
  42. Determining Risk #1: Gut Feeling (http://dilbert.com/strips/comic/2008-05-08/)
  44. Determining Risk #2: Traffic Lights. A light per feature (Feature 1, Feature 2, Feature 3); the lights combined = Overall Release.
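  45. Determining Risk #3: FMEA (Failure Mode and Effect Analysis)

      | Feature     | Failure Mode                 | Effect                        | Likelihood of Failure | Severity If Failure Occurs | Ability to Detect | Total Risk Score | Remediation Actions | Revised Risk Score |
      |-------------|------------------------------|-------------------------------|-----------------------|----------------------------|-------------------|------------------|---------------------|--------------------|
      | Sign Up     | User data not saved          | User not registered           | 3                     | 3                          | 3                 | 27               | do this; do that    | 3                  |
      | Sign Up     | Users given wrong privileges | Users can access other’s data | 1                     | 9                          | 3                 | 27               | do sth              | 9                  |
      | Credit Card | CC number not encrypted      | CC theft risk                 | 1                     | 9                          | 1                 | 9                | N/A                 | 9                  |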
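The total risk scores on slide 45 are consistent with multiplying the three ratings (3 × 3 × 3 = 27, 1 × 9 × 3 = 27, 1 × 9 × 1 = 9). A minimal Python sketch, assuming that product rule; the class and field names are illustrative and not taken from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    feature: str
    failure_mode: str
    likelihood: int      # likelihood of failure
    severity: int        # severity if failure occurs
    detectability: int   # ability to detect

    @property
    def risk_score(self) -> int:
        # Assumed rule: the total is the product of the three ratings.
        return self.likelihood * self.severity * self.detectability


# The three rows shown on the slide.
rows = [
    FailureMode("Sign Up", "User data not saved", 3, 3, 3),
    FailureMode("Sign Up", "Users given wrong privileges", 1, 9, 3),
    FailureMode("Credit Card", "CC number not encrypted", 1, 9, 1),
]

scores = [row.risk_score for row in rows]
print(scores)  # [27, 27, 9]
```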
  46. Managing Risk. Rules and risk level: New Feature Release < 150 pts; Bug Fix Release < 50 pts; Peak-usage-time release < 10 pts; Off-peak release < 200 pts. (Numbers are just indicative figures.)
  47. Managing Risk (Human Factor). Rules and risk tolerance level: 6-hour period < 150 pts; 12-hour period < 250 pts; 24-hour period < 350 pts; 72-hour period < 500 pts. (Numbers are just indicative figures.)
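The rules on slides 46 and 47 can be read as point budgets. The sketch below uses the indicative limits from the slides. It assumes that a release's risk is the sum of its feature scores and that the per-period tolerance applies to the sum of all changes made within that period. Neither assumption is stated on the slides, and the function names are hypothetical.

```python
# Indicative limits from the slides (points).
RELEASE_LIMITS = {"new_feature": 150, "bug_fix": 50, "peak_time": 10, "off_peak": 200}
PERIOD_LIMITS = {6: 150, 12: 250, 24: 350, 72: 500}  # hours -> points


def release_within_limit(kind, feature_scores):
    """True if the summed feature scores stay under the limit for this release kind."""
    return sum(feature_scores) < RELEASE_LIMITS[kind]


def periods_exceeded(changes, now_hour):
    """Return the periods (in hours) whose tolerance is reached.

    changes is a list of (hour_applied, risk_score) tuples; hours are plain
    numbers on a shared clock to keep the example small.
    """
    exceeded = []
    for hours, limit in sorted(PERIOD_LIMITS.items()):
        total = sum(score for applied, score in changes
                    if 0 <= now_hour - applied <= hours)
        if total >= limit:
            exceeded.append(hours)
    return exceeded


print(release_within_limit("new_feature", [27, 27, 9]))  # True: 63 < 150
print(release_within_limit("bug_fix", [27, 27, 9]))      # False: 63 >= 50
print(periods_exceeded([(0, 100), (4, 63)], now_hour=5))  # [6]: 163 >= 150
```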
  49. Managing Incidents And Problems. Detect, Report, Investigate, Escalate, Resolve approach (M. L. Abbott, M. T. Fisher, “The Art Of Scalability”, Addison Wesley). Restore services in a timely and cost-effective manner; contain chaos: each person has a place; determine root cause and correct problems; review issues regularly. Post-mortem process: cross-functional brainstorming meeting.
  56. Performance (Load) Testing: 1. Establish success criteria (e.g. 1.5k users/sec, RT < 150ms); 2. Establish the test environment (TEST ≅ LIVE); 3. Define the tests, for different things (Pareto rule: 20% - 80%); 4. Identify what needs to be monitored and what data needs to be collected (CPU, memory, TTL, RT, services); 5. Run, analyse, report to engineers (e.g. CPU: 90%, RT: 180ms, 2K SimUsers/sec); 6. Repeat tests and analysis (rinse and repeat).
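Steps 1 and 5 on slide 56 amount to comparing measured results with the success criteria set up front. A small Python sketch using the slide's example figures; the dictionary keys and the function name are assumptions made for this illustration.

```python
def evaluate(criteria, measured):
    """Compare one load-test run against the success criteria."""
    return {
        "throughput": measured["users_per_sec"] >= criteria["min_users_per_sec"],
        "response_time": measured["rt_ms"] < criteria["max_rt_ms"],
    }


criteria = {"min_users_per_sec": 1500, "max_rt_ms": 150}  # 1.5k users/sec, RT < 150ms
measured = {"users_per_sec": 2000, "rt_ms": 180}          # 2K SimUsers/sec, RT: 180ms

report = evaluate(criteria, measured)
print(report)  # {'throughput': True, 'response_time': False}
```

With the slide's numbers the run meets the throughput target but misses the response-time target, which is the kind of finding step 5 reports back to engineers.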
  60. Stress Testing. Tools: JMeter, LoadRunner, The Grinder, Avalanche (http://www.opensourcetesting.org/performance.php)
  61. Barrier Conditions: architecture review board; code reviews; manual and automated QA processes; performance testing; Dev, Test, Stage and Live environments; production monitoring and measurement.
  62. Technology: Architecting scalable solutions
  65. Designing For Any Technology. Specific products (Dell, WatchGuard, Cisco CSS 11501, HP ProLiant DL, HP Media Cache Server Appliance) map to generic roles: Firewall, Load Balancer, Application Servers, DB Server, Media / Cache.
  75. Architectural Principles: N + 1 design; design for rollback; design to be disabled; design to be monitored; design for multiple live sites; use mature technology; asynchronous design; stateless systems; buy when non core.
  76. Focus On Core Competencies: Build vs. Buy
  78. Asynchronous Design
  82. Stateless Systems. State is often useful, but has a significant cost (replication between data centres, synchronous calls...). Options: Avoidance (no sessions / sticky sessions); Decentralisation (data in the cookie / cookie with hash); Centralisation (store cookies in the db or in memcached).
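The "cookie with hash" option on slide 82 keeps state on the client and lets any server check that it was not modified. One common way to do this is an HMAC over the cookie value. The sketch below uses that technique as an assumption, since the slide does not say which hash scheme is meant; the key and function names are hypothetical.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical; shared by all app servers


def sign(value: str) -> str:
    """Append an HMAC-SHA256 tag so that tampering can be detected."""
    tag = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{value}.{tag}"


def verify(cookie: str):
    """Return the original value if the tag matches, otherwise None."""
    value, sep, tag = cookie.rpartition(".")
    if not sep:
        return None
    expected = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    # Constant-time comparison on bytes.
    if hmac.compare_digest(tag.encode("utf-8"), expected.encode("utf-8")):
        return value
    return None


cookie = sign("user_id=42")
print(verify(cookie))  # user_id=42

forged = "user_id=43." + cookie.rpartition(".")[2]
print(verify(forged))  # None
```

The value is signed, not encrypted, so anything secret should stay out of the cookie.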
  87. Creating Fault Isolative Structures. Increase availability; limit impact of failures; easier debugging. First: functions causing repetitive problems; natural layout or topology of the site.
  91. Scale Directions (M. L. Abbott, M. T. Fisher, “The Art Of Scalability”, Addison Wesley). x: cloning of entities or data, unbiased distribution of work; y: separation of work by activity or data; z: separation of work by person for whom the work is done.
  95. Splitting Applications For Scale. x (mirroring): + scale transactions, - scale data. y (split by service): + fault isolation, + scale function data, - scale customer data. z (split by need / location / value): + fault isolation, + scale customer data, - scale function data.
  99. Splitting Databases For Scale. x (data cloning: replication / clustering): + easy to implement, + scale transaction volume, - scale data size and growth. y (split by service / resource / data affinity): + fault isolation, + reduce query time, - more difficult, - data migration. z (split by modulus / hash-based lookups): + balanced demand, + fault isolation, + scale data and trans., - more costly.
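The z-axis split on slide 99 ("split by modulus / hash-based lookups") picks a database shard from the key itself. A minimal Python sketch: the shard names and function names are hypothetical, and SHA-256 is chosen here only because it gives the same result on every server, unlike Python's built-in `hash()` for strings.

```python
import hashlib

SHARDS = ["db0", "db1", "db2", "db3"]  # hypothetical shard names


def shard_by_modulus(numeric_id: int, shard_count: int) -> int:
    """Modulus split for numeric keys such as an auto-increment user id."""
    return numeric_id % shard_count


def shard_by_hash(key: str, shard_count: int) -> int:
    """Hash-based lookup for non-numeric keys such as an e-mail address."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


print(SHARDS[shard_by_modulus(10, len(SHARDS))])  # db2, since 10 % 4 == 2
print(SHARDS[shard_by_hash("alice@example.com", len(SHARDS))])
```

Changing the shard count changes the mapping for most keys, so adding shards later needs a data migration plan.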
  103. Caching For Performance & Scale. Object caches: usually serialized (marshalling / unmarshalling); get() / set() / replace(); APC, Memcached. Application caches: proxy caches, reverse proxy caches; HTTP headers; ISP/Uni proxies; Squid, Varnish, mod_cache. CDNs: multiple locations / backbones; CNAME entries; Akamai, Coral, Limelight...
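The object-cache column on slide 103 (serialized values behind get() / set() / replace()) can be illustrated with an in-memory stand-in. This is not the APC or Memcached client API: `ObjectCache`, `load_profile` and the `fetch_from_db` callback are hypothetical names, and the read-through pattern shown is one common way to use such a cache.

```python
import pickle


class ObjectCache:
    """In-memory stand-in for an object cache that stores serialized values."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        raw = self._store.get(key)
        return None if raw is None else pickle.loads(raw)  # unmarshalling

    def set(self, key, value):
        self._store[key] = pickle.dumps(value)  # marshalling

    def replace(self, key, value):
        """Overwrite only if the key already exists."""
        if key not in self._store:
            return False
        self.set(key, value)
        return True


def load_profile(cache, user_id, fetch_from_db):
    """Try the cache first; on a miss, load from the database and cache it."""
    key = f"profile:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    value = fetch_from_db(user_id)
    cache.set(key, value)
    return value


db_calls = []


def fake_db(user_id):
    db_calls.append(user_id)
    return {"id": user_id, "name": "example"}


cache = ObjectCache()
load_profile(cache, 7, fake_db)
load_profile(cache, 7, fake_db)
print(db_calls)  # [7]: the second call was served from the cache
```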
  104. Solving Other Issues ...and challenges
  108. Too Much Data. The more storage, the more storage management: storage costs, people and software, power and space, processing power, backup time and costs. Evaluate data retention policy; consider multi-tiered storage; distribute work (MapReduce).
  109. Clouds And Grids: cheap, on-demand storage and compute capacity. Clouds: cost (pay for what you use); speed (procurement, provisioning, deployment); flexibility (change / reconfigure environment); security, portability, control; limitations of virtualisation; performance. Grids: high computation rates; shared infrastructure (with proper scheduling); unused capacity (SETI@H); not shared simultaneously; monolithic applications; complexity (debugging, OS).
  113. Monitoring: 1. Is there a problem? User experience / business metrics monitors. 2. Where is the problem? System monitors (threshold - variance). 3. What is the problem? Application monitors. Keep signal vs. noise ratio high.
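The "threshold - variance" system monitors on slide 113 suggest two kinds of check: a fixed limit, and a deviation from recent behaviour. A hedged Python sketch; the function names, the 85% CPU limit, the three-sigma rule and the sample values are assumptions for illustration only.

```python
import statistics


def over_threshold(value, limit):
    """Threshold monitor: alert when a metric passes a fixed limit."""
    return value > limit


def deviates(history, value, max_sigmas=3.0):
    """Variance monitor: alert when a value is far from the recent mean."""
    mean = statistics.mean(history)
    sigma = statistics.pstdev(history)
    return abs(value - mean) > max_sigmas * sigma


recent_rt_ms = [100, 102, 98, 101, 99]  # hypothetical response times

print(over_threshold(90, 85))       # True: CPU at 90% is above the 85% limit
print(deviates(recent_rt_ms, 150))  # True: far outside the recent range
print(deviates(recent_rt_ms, 101))  # False: within normal variation
```

Tuning the limit and `max_sigmas` is one way to keep the signal vs. noise ratio high, since loose settings hide problems and tight ones flood the team with alerts.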
  114. Questions?
  115. Links & sources: http://www.slideshare.net/postwait/scalable-internet-architecture; http://highscalability.com/blog/2009/4/2/art-of-scalability-1-scalability-principles.html; http://agile.dzone.com/news/approaches-organizational; M. L. Abbott, M. T. Fisher, “The Art Of Scalability”, Addison Wesley; http://theartofscalability.com/
  117. Image Credits: http://www.sxc.hu/photo/1217386; http://michaelscomments.files.wordpress.com/2009/10/onion-centurion.jpg; http://www.travelsd.com/_images/gallery/hires/000189.jpg; http://www.socketmanufacturers.com/miniature-circuit-breaker/DZ47-63-3P-Miniature-Circuit-Breaker.jpg; http://blogs.microsoft.co.il/blogs/shair/archive/2008/06/19/load-testing-features-of-visual-studio-team-system.aspx; http://www.alibaba.com/member/de100430205.html/viewimg/photo/103590047/Boxing_Ring_Competition_AIBA_Ring.jpg.html; http://brandonsmarathon.com/wp-content/uploads/2009/08/Olympics+Day+3+Swimming+43rPmSVmwHql.jpg; http://en.wikipedia.org/wiki/File:Synchronized_swimming_-_Russian_team.jpg; http://www.flickr.com/photos/bugeaters/3025911233/; http://www.flickr.com/photos/cote/2763677698/; http://www.iconfinder.com
  118. Thank you! Contact details: Lorenzo Alberton lorenzo@ibuildings.com http://www.alberton.info/talks http://joind.in/talk/view/1539
