Building Scalable Systems: an asynchronous approach

2,821 views
2,719 views

Published on

A brief tour through asynchronous systems.

Published in: Technology
0 Comments
4 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
2,821
On SlideShare
0
From Embeds
0
Number of Embeds
34
Actions
Shares
0
Downloads
95
Comments
0
Likes
4
Embeds 0
No embeds

No notes for slide

Building Scalable Systems: an asynchronous approach

  1. 1. Building Scalable Systems an Asynchronous Approach / pleasure and painMonday, June 20, 2011
  2. 2. Who am I? @postwait on twitter Author of “Scalable Internet Architectures” Pearson, ISBN: 067232699X Contributor to “Web Operations” O’Reilly, ISBN: Founder of OmniTI, Message Systems, Fontdeck, & Circonus I like to tackle problems that are “always on” and “always growing.” I am an Engineer A practitioner of academic computing. IEEE member and Senior ACM member. On the Editorial Board of ACM’s Queue magazine. 2Monday, June 20, 2011
  3. 3. Some rants about systems. / cutting through the crapMonday, June 20, 2011
  4. 4. BIG data • BIG data: doesn’t exist • but it sure is a good way to market things. • If you measure things in petabytes as we have for the last several years you have data, not BIG data.Monday, June 20, 2011
  5. 5. Data stores • “new” NoSQL systems can scale better than an RDBMS • yes • They scale better than sharded RDBMS • no • I need NoSQL systems for “BIG” data. • noMonday, June 20, 2011
  6. 6. Why NoSQL • The choice to use a NoSQL system is driven by: • a more suitable data model • we desire to shift our CAP theorem constraintsMonday, June 20, 2011
  7. 7. The cloud • The cloud is not magic. • The cloud enables engineers to rapidly deploy an architecture to run an application. • There is nothing wrong with this.Monday, June 20, 2011
  8. 8. The cloud: gotcha 1 • Provisioning services “the old way” took time due to: • installing systems • install packages • determining and enforcing data security • determining and enforcing SLA monitoring • determining and enforcing DR: RPO and RTO • documenting escalation and remediation effortsMonday, June 20, 2011
  9. 9. The cloud: gotcha 2 • Quality of service in multi-tenancy environments is a very, very hard problem. • In fact, no one has solved it.Monday, June 20, 2011
  10. 10. The cloud: gotcha 3 • Platform as a service provides: • MySQL? PostgreSQL? Redis? Node.js? • .. patched. • .. patched. • .. patched. • .. patched.Monday, June 20, 2011
  11. 11. The cloud: gotcha 4 • Scaling isn’t magic pixie dust. • “I can run more instances of my app.” • horribly false sense of security. • You must have a scalable architectureMonday, June 20, 2011
  12. 12. Design & Implementation Techniques / some say architecture != implementationMonday, June 20, 2011
  13. 13. Architecture vs. ImplementationMonday, June 20, 2011
  14. 14. Architecture vs. Implementation Architecture is without specification of the vendor, make, and model of components.Monday, June 20, 2011
  15. 15. Architecture vs. Implementation Architecture is without specification of the vendor, make, and model of components. Implementation is the adaptation of an architecture to embrace available technologies.Monday, June 20, 2011
  16. 16. Architecture vs. Implementation Architecture is without specification of the vendor, make, and model of components. Implementation is the adaptation of an architecture to embrace available technologies. They are intrinsically tied. Insisting on separation is a metaphysical argument (with no winners)Monday, June 20, 2011
  17. 17. Respect Engineering Math Engineering math: 19 + 89 = 110 “Precise” Math: 19 + 89 = 10.8Monday, June 20, 2011
  18. 18. Respect Engineering Math Engineering math: 19 + 89 = 110 “Precise” Math: 19 + 89 = 10.8 Ok. Ok. I must have, I must have put a decimal point in the wrong place or something. Shit. I always do that. I always mess up some mundane detail. - Michael Bolton in Office SpaceMonday, June 20, 2011
  19. 19. Ensure the gods aren’t angry.Monday, June 20, 2011
  20. 20. Ensure the gods aren’t angry. Bob: We need to grow our cluster of web servers.Monday, June 20, 2011
  21. 21. Ensure the gods aren’t angry. Bob: We need to grow our cluster of web servers. Alice: How many requests per second do they do, how many do you have and what is their current resource utilization?Monday, June 20, 2011
  22. 22. Ensure the gods aren’t angry. Bob: We need to grow our cluster of web servers. Alice: How many requests per second do they do, how many do you have and what is their current resource utilization? Bob: About 200 req/second, 8 servers and they have no headroom.Monday, June 20, 2011
  23. 23. Ensure the gods aren’t angry. Bob: We need to grow our cluster of web servers. Alice: How many requests per second do they do, how many do you have and what is their current resource utilization? Bob: About 200 req/second, 8 servers and they have no headroom. Alice: How many req/second do you need?Monday, June 20, 2011
  24. 24. Ensure the gods aren’t angry. Bob: We need to grow our cluster of web servers. Alice: How many requests per second do they do, how many do you have and what is their current resource utilization? Bob: About 200 req/second, 8 servers and they have no headroom. Alice: How many req/second do you need? Bob: 800 req/second would be good.Monday, June 20, 2011
  25. 25. Ensure the gods aren’t angry. Bob: We need to grow our cluster of web servers. Alice: How many requests per second do they do, how many do you have and what is their current resource utilization? Bob: About 200 req/second, 8 servers and they have no headroom. Alice: How many req/second do you need? Bob: 800 req/second would be good. Alice: Um, Bob, 200 x 8 = 1600... you have 50% headroom on your goal.Monday, June 20, 2011
  26. 26. Ensure the gods aren’t angry. Bob: We need to grow our cluster of web servers. Alice: How many requests per second do they do, how many do you have and what is their current resource utilization? Bob: About 200 req/second, 8 servers and they have no headroom. Alice: How many req/second do you need? Bob: 800 req/second would be good. Alice: Um, Bob, 200 x 8 = 1600... you have 50% headroom on your goal. Bob: No... 200 / 8 = 25 req/second per server.Monday, June 20, 2011
  27. 27. Ensure the gods aren’t angry. Bob: We need to grow our cluster of web servers. Alice: How many requests per second do they do, how many do you have and what is their current resource utilization? Bob: About 200 req/second, 8 servers and they have no headroom. Alice: How many req/second do you need? Bob: 800 req/second would be good. Alice: Um, Bob, 200 x 8 = 1600... you have 50% headroom on your goal. Bob: No... 200 / 8 = 25 req/second per server. Alice: Bob... the gods are angry.Monday, June 20, 2011
  28. 28. Why you’ve pissed off the gods.Monday, June 20, 2011
  29. 29. Why you’ve pissed off the gods. Most web apps are CPU bound (as I/O happens on a different layer)Monday, June 20, 2011
  30. 30. Why you’ve pissed off the gods. Most web apps are CPU bound (as I/O happens on a different layer) Typical box today: 8 cores are 2.8GHz or 22.4 BILLION instructions per second.Monday, June 20, 2011
  31. 31. Why you’ve pissed off the gods. Most web apps are CPU bound (as I/O happens on a different layer) Typical box today: 8 cores are 2.8GHz or 22.4 BILLION instructions per second. 22x109 instr/s / 25 req/s = 880 MILLION instructions per request.Monday, June 20, 2011
  32. 32. Why you’ve pissed off the gods. Most web apps are CPU bound (as I/O happens on a different layer) Typical box today: 8 cores are 2.8GHz or 22.4 BILLION instructions per second. 22x109 instr/s / 25 req/s = 880 MILLION instructions per request. This same effort (per-request) provided me with approximately 15 minutes enjoying “Might & Magic 2” on my Apple IIe - you’ve certainly pissed me off.Monday, June 20, 2011
  33. 33. Why you’ve pissed off the gods. Most web apps are CPU bound (as I/O happens on a different layer) Typical box today: 8 cores are 2.8GHz or 22.4 BILLION instructions per second. 22x109 instr/s / 25 req/s = 880 MILLION instructions per request. This same effort (per-request) provided me with approximately 15 minutes enjoying “Might & Magic 2” on my Apple IIe - you’ve certainly pissed me off. No wonder the gods are angry.Monday, June 20, 2011
  34. 34. Develop a modelMonday, June 20, 2011
  35. 35. Develop a model Queue theoretic models are for “other people.”Monday, June 20, 2011
  36. 36. Develop a model Queue theoretic models are for “other people.” Sorta, not really.Monday, June 20, 2011
  37. 37. Develop a model Queue theoretic models are for “other people.” Sorta, not really. Problems:Monday, June 20, 2011
  38. 38. Develop a model Queue theoretic models are for “other people.” Sorta, not really. Problems: very hard to develop a complete and accurate modelMonday, June 20, 2011
  39. 39. Develop a model Queue theoretic models are for “other people.” Sorta, not really. Problems: very hard to develop a complete and accurate model Benefits:Monday, June 20, 2011
  40. 40. Develop a model Queue theoretic models are for “other people.” Sorta, not really. Problems: very hard to develop a complete and accurate model Benefits: provides insight on architecture capacitance dependenciesMonday, June 20, 2011
  41. 41. Develop a model Queue theoretic models are for “other people.” Sorta, not really. Problems: very hard to develop a complete and accurate model Benefits: provides insight on architecture capacitance dependencies relatively easy to understandMonday, June 20, 2011
  42. 42. Develop a model Queue theoretic models are for “other people.” Sorta, not really. Problems: very hard to develop a complete and accurate model Benefits: provides insight on architecture capacitance dependencies relatively easy to understand illustrates opportunities to further isolate workMonday, June 20, 2011
  43. 43. Rationalize your modelMonday, June 20, 2011
  44. 44. Rationalize your model Draw your model outMonday, June 20, 2011
  45. 45. Rationalize your model Draw your model out Take measurements and walk through the model to rationalize it i.e. prove it to be empirically correctMonday, June 20, 2011
  46. 46. Rationalize your model Draw your model out Take measurements and walk through the model to rationalize it i.e. prove it to be empirically correct You should be able to map actions to consequences:Monday, June 20, 2011
  47. 47. Rationalize your model Draw your model out Take measurements and walk through the model to rationalize it i.e. prove it to be empirically correct You should be able to map actions to consequences: A user signs up ➙ 4 synchronous DB inserts (1 synch IOPS + 4 asynch writes) 1 AMQP durable, persistent message 1 asynch DB read ➙ 1/10 IOPS writing new Lucene indexesMonday, June 20, 2011
  48. 48. Rationalize your model Draw your model out Take measurements and walk through the model to rationalize it i.e. prove it to be empirically correct You should be able to map actions to consequences: A user signs up ➙ 4 synchronous DB inserts (1 synch IOPS + 4 asynch writes) 1 AMQP durable, persistent message 1 asynch DB read ➙ 1/10 IOPS writing new Lucene indexes In a dev environment, simulate traffic and rationalize your modelMonday, June 20, 2011
  49. 49. Rationalize your model Draw your model out Take measurements and walk through the model to rationalize it i.e. prove it to be empirically correct You should be able to map actions to consequences: A user signs up ➙ 4 synchronous DB inserts (1 synch IOPS + 4 asynch writes) 1 AMQP durable, persistent message 1 asynch DB read ➙ 1/10 IOPS writing new Lucene indexes In a dev environment, simulate traffic and rationalize your model I call this a “data flow causality map”Monday, June 20, 2011
  50. 50. Complexity will eat your lunchMonday, June 20, 2011
  51. 51. Complexity will eat your lunch there will always be empirical variance from your modelMonday, June 20, 2011
  52. 52. Complexity will eat your lunch there will always be empirical variance from your model explaining the phantoms leads to enlightenmentMonday, June 20, 2011
  53. 53. Complexity will eat your lunch there will always be empirical variance from your model explaining the phantoms leads to enlightenment service decoupling in complex systems gives:Monday, June 20, 2011
  54. 54. Complexity will eat your lunch there will always be empirical variance from your model explaining the phantoms leads to enlightenment service decoupling in complex systems gives: simplified modeling and capacity planningMonday, June 20, 2011
  55. 55. Complexity will eat your lunch there will always be empirical variance from your model explaining the phantoms leads to enlightenment service decoupling in complex systems gives: simplified modeling and capacity planning slight inefficienciesMonday, June 20, 2011
  56. 56. Complexity will eat your lunch there will always be empirical variance from your model explaining the phantoms leads to enlightenment service decoupling in complex systems gives: simplified modeling and capacity planning slight inefficiencies promotes lower contentionMonday, June 20, 2011
  57. 57. Complexity will eat your lunch there will always be empirical variance from your model explaining the phantoms leads to enlightenment service decoupling in complex systems gives: simplified modeling and capacity planning slight inefficiencies promotes lower contention requires design of systems with less coherency requirementsMonday, June 20, 2011
  58. 58. Complexity will eat your lunch there will always be empirical variance from your model explaining the phantoms leads to enlightenment service decoupling in complex systems gives: simplified modeling and capacity planning slight inefficiencies promotes lower contention requires design of systems with less coherency requirements each isolated service is simpler and saferMonday, June 20, 2011
  59. 59. Complexity will eat your lunch there will always be empirical variance from your model explaining the phantoms leads to enlightenment service decoupling in complex systems gives: simplified modeling and capacity planning slight inefficiencies promotes lower contention requires design of systems with less coherency requirements each isolated service is simpler and safer SCALES.Monday, June 20, 2011
  60. 60. Asynchronous Systems / it’s likely you have no idea what you’re doingMonday, June 20, 2011
  61. 61. Asychronous • of or requiring a form of computer control timing protocol in which a specific operation begins upon receipt of an indication (signal) that the preceding operation has been completed. • ...or “I’ll act when you tell me you are done” • ...or a protocol wherein the initiation of a task and the report of its completion are separate operations.Monday, June 20, 2011
  62. 62. Protocols • Standards: • AMQP (impl: ActiveMQ, RabbitMQ, OpenAMQ, etc.) • Others: • ZeroMQ • GearmanMonday, June 20, 2011
  63. 63. Guarantees • Queueing protocols can be misleading. • Are you sure you did what you think you did? • Let’s use a publish as an example.Monday, June 20, 2011
  64. 64. Publication • Imagine a Queue: • You assume that by calling “publish” that your message is placed on the queue and will eventually be consumed (assuming a consumer). • Most systems are ‘more’ asynchronous than that.Monday, June 20, 2011
  65. 65. Publication what you think happens User Space Kernel Network Stack Network Stack Queue publish write call message frame read S A F read write E error message frame return call publish write message frame read B O O M read message frame write error returnMonday, June 20, 2011
  66. 66. Publication what really happens User Space Kernel Network Stack Network Stack Queue call publish write return message frame read S A F call publish write E return message frame read B O O M write read message frame error returnMonday, June 20, 2011
  67. 67. Why? • Why do queueing protocols use “silence for success?” • Simple: performance • no need for a roundtrip before the next message • success is common, failure rareMonday, June 20, 2011
  68. 68. Why? • AMQP is not alone in this... • 0MQ as well.Monday, June 20, 2011
  69. 69. Now what? • In each component you must decide if you need: • synchronous system w/ synchronous protocol • asynchronous system w/ synchronous protocol • asynchronous system w/ asynchronous protocolMonday, June 20, 2011
  70. 70. Service decisions • Knowing you can lose messages is... okay? • it can be • there are plenty of uses for unreliable communications • however... generally, it is much easier to build services that have end- to-end guarantees.Monday, June 20, 2011
  71. 71. Non-asynchronous: synchronous User Space Kernel Network Stack Network Stack Database publish write call message frame read S A F read write E error message frame return call publish write message frame read B O O M read message frame write error returnMonday, June 20, 2011
  72. 72. Asynchronous to the purpose • Why is a Queue “asynchronous” • and a Database “synchronous” • I lied... “asynchronous” is “to the purpose.” • If the ultimate, final goal is: storage in a DB • and you return the result only after a commit • then you are synchronousMonday, June 20, 2011
  73. 73. Simple example: image thumbnailing • A user uploads an email to a web site • you need to produce 7 different transformations • (size, color, etc.) • Asynchronous system: • synchronous upload protocol: • user upload -> thank you we have it • asynchronous processing • file -> 7 mutationsMonday, June 20, 2011
  74. 74. A (more) complete example. / foursquare-like, untappd.com-like serviceMonday, June 20, 2011
  75. 75. Better example: rewards calculation • A user performs an action on your site • and you need to reward them based on: • social network, history, value • you want to show them their reward “immediately.” • Step 1: engineer for failure.Monday, June 20, 2011
  76. 76. Rewards calculation: step 1 • the inability to calculate the reward shall not prevent the action. (think: beer checkin on untappd) • I want the reward calculation immediately. • I need the checkin to be recorded.Monday, June 20, 2011
  77. 77. Rewards calculation: step 2 • Decouple the rewards calculation: 1. receive user request 2. store(C) 3. queue the checkin(C) on QC 4. wait up to 500ms (reading rewards R from QR) 5. return R witnessed.Monday, June 20, 2011
  78. 78. Rewards calculation: step 3 • Decouple the rewards calculation: 1. dequeue checkin: C from QC 2. calculate rewards(C) -> R 3. store(R) 4. queue(R) on QRMonday, June 20, 2011
  79. 79. Rewards calculation: win • You win big. • If the rewards calculation system is • too slow, or • goes offline • checkins still proceed and • responses are served within 500ms • You have decoupled the service availability requirements of the checkin system from the rewards system: happier users.Monday, June 20, 2011
  80. 80. Final random thoughts / think outside of the boxMonday, June 20, 2011
  81. 81. Things to look at: free your mind • Node.js • Javascript? Seriously? • Yes. • Forces you to think asynchronously • Forces you to share nothing • Forces you to build stateless systems • These systems scaleMonday, June 20, 2011
  82. 82. unsafe: when to use • “silence is success” messaging is almost always useful when new, more temporally relevant data is bound to arrive. • game location data • performance data • status data • the casual observerMonday, June 20, 2011
  83. 83. be mindful • Always monitor: • message rates • queue depths • queue counts • connection concurrencyMonday, June 20, 2011
  84. 84. Thank you. • Thank you • Merci beaucoup.Monday, June 20, 2011

×