Where to Put My Data

Presented at QCon San Francisco 2010.

  1. 1. WHERE TO PUT DATA – or – What are we going to do with all this stuff?
  2. 2. About The Speaker: Application Developer/Architect – 21 years; Web Developer – 15 years; Web Operations – 7 years
  3. 3. Relational Shard Replicate Graph Document Key-value Cache Distributed ACID BASE Cluster Schema Schemaless Neo4J Redis Riak Cassandra Voldemort NetStorage S3 CloudFront MongoDB CouchDB Memcached Coherence GridGain BigMemory HDFS HBase eXist Xindice BDB BDB
  4. 4. Scalability Consistency Relational Not Only SQL Flexible Rigid
  5. 5. App Web Media Web API Gateway CMS App layer Cache farm Search farm Indexer Search Admin Jobbers Caching Proxy/LB Caching Proxy/LB CDN Egress ISP
  6. 6. Freshness Response Time Consistency Resilience Total Cost Sustainability Agility
  7. 7. BACK IN THE 90’S
  8. 8. BACK IN THE 90’S
  9. 9. BACK IN THE 90’S
  10. 10. OS 2200 Hierarchical ("Network") Database
  11. 11. OS 2200 Hierarchical ("Network") Database Relational Mapper
  12. 12. OS 2200 Hierarchical ("Network") Database Relational Mapper POSIX.1 Virtual Machine
  13. 13. OS 2200 Hierarchical ("Network") Database Relational Mapper POSIX.1 Virtual Machine COBOL Compiler ANSI SQL Library
  14. 14. Given enough time, and perversity, you can create any query model on top of any storage model.
  15. 15. SAY WHEN: The Importance of Response Time Distribution
  16. 16. FUNDAMENTAL PREMISE Black Box
  17. 17. THERE ARE THINGS YOU CANNOT KNOW Will a response arrive? When? Was it stored or computed? Is it still true?
  18. 18. ASYMMETRY OF TIME Send Request
  19. 19. ASYMMETRY OF TIME Send Request 100 ms 200 ms
  20. 20. ASYMMETRY OF TIME Send Request 100 ms 200 ms 300 ms 400 ms 500 ms 600 ms
  21. 21. ASYMMETRY OF TIME Send Request 100 ms 200 ms 300 ms 400 ms 500 ms 600 ms Time Out
  22. 22. [Chart: Confidence of Response before Timeout (0 to 1) vs. Time Elapsed (0 to 900 ms)]
  23. 23. ASYMMETRY OF TIME Send Request 100 ms 200 ms 300 ms 400 ms 500 ms 600 ms Time Out Response Arrives
  24. 24. To the observer, there is no difference between “too slow” and “not there”.
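Slides 18 to 24 can be illustrated with a small calculation. The sketch below is not from the talk; it assumes a list of past latencies (`samples`, hypothetical values) and a 600 ms timeout like the one marked on the slides, and estimates how likely a response still is once some time has already passed without one.

```python
import math

def confidence_of_response(latencies_ms, elapsed_ms, timeout_ms):
    """Estimate P(response before timeout | still waiting at elapsed_ms).

    latencies_ms holds past observations; math.inf marks a request that
    was never answered.
    """
    still_pending = [t for t in latencies_ms if t > elapsed_ms]
    if not still_pending:
        return 0.0  # no past request was ever still pending this late
    in_time = [t for t in still_pending if t <= timeout_ms]
    return len(in_time) / len(still_pending)

# Hypothetical sample: eight answered requests and two that never answered.
samples = [80, 120, 150, 200, 250, 400, 550, 900, math.inf, math.inf]

for elapsed in (0, 300, 560):
    print(elapsed, confidence_of_response(samples, elapsed, timeout_ms=600))
```

With these made-up numbers the estimate falls from 0.7 at send time to 0.4 after 300 ms and to 0.0 after 560 ms. The 900 ms response and the two missing ones land on the same side of the timeout, so the caller treats them identically.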
  25. 25. [Chart: Response Time Histogram; % of Responses (0% to 25%) vs. Response Time (0 to 1000 ms)]
  26. 26. [Chart: Response Time Histogram; % of Responses (0% to 25%) vs. Response Time (0 to 1000 ms)]
  27. 27. [Chart: Response Time Histogram; % of Responses (0% to 25%) vs. Response Time (0 to 1000 ms)]
  28. 28. [Chart: Response Time Histogram; % of Responses (0% to 100%) vs. Response Time (0 to 900 ms, then ∞)]
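The last histogram's axis ends at ∞ rather than at a finite time. One way to build such a histogram is sketched here, under assumptions that are not in the slides: latencies arrive as a plain list, buckets are 100 ms wide, and anything at or beyond a hypothetical 1000 ms timeout is counted in a single ∞ bucket.

```python
import math
from collections import Counter

def response_histogram(latencies_ms, bucket_ms=100, timeout_ms=1000):
    """Return {bucket_start_ms: fraction}, with math.inf as the timeout bucket."""
    counts = Counter()
    for t in latencies_ms:
        if t >= timeout_ms:
            # Too slow and never answered are indistinguishable to the caller.
            counts[math.inf] += 1
        else:
            counts[int(t // bucket_ms) * bucket_ms] += 1
    total = len(latencies_ms)
    return {bucket: counts[bucket] / total for bucket in sorted(counts)}

# Hypothetical sample: one 1500 ms response and one request that never answered.
samples = [80, 120, 150, 200, 250, 400, 550, 900, 1500, math.inf]
histogram = response_histogram(samples)
```

Here `histogram[math.inf]` holds the share of requests the caller gave up on, whether the service was slow or gone.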
  29. 29. This is a talk about data and data storage. Why am I talking so much about observers and response time? What about scalability?
  30. 30. Why do we worry about scalability?
  31. 31. [Chart: Response Time Histogram; % of Responses (0% to 100%) vs. Response Time (0 to 1000 ms)] Empty Box Time: Response time of a single request on an unloaded system.
  32. 32. Scalability is a means, not an end. What we need is fast response time, under all loads.
  33. 33. App Web Media Web API Gateway CMS App layer Cache farm Search farm Indexer Search Admin Jobbers Caching Proxy/LB Caching Proxy/LB CDN Egress
  34. 34. HOW TRUE?
  35. 35. App Web Media Web API Gateway CMS App layer Cache farm Search farm Indexer Search Admin Jobbers Caching Proxy/LB Caching Proxy/LB CDN Egress
  36. 36. App Web Media Web API Gateway CMS App layer Cache farm Search farm Indexer Search Admin Jobbers Caching Proxy/LB Caching Proxy/LB CDN Egress ISP
  37. 37. UNCERTAINTY PRINCIPLE You cannot be sure the data is unchanged since your observation, except by making another observation. P(unch) = F(dC/dt, dt)
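The slide leaves F unspecified. As one possible instance, and purely as an assumption, suppose changes to the data arrive as a Poisson process with a constant rate. Then the probability that nothing has changed since the observation decays exponentially with the observation's age. The function name and parameters below are hypothetical.

```python
import math

def p_unchanged(change_rate_per_s, age_s):
    """P(no change in age_s seconds) under an assumed Poisson change process."""
    if change_rate_per_s < 0 or age_s < 0:
        raise ValueError("rate and age must be non-negative")
    return math.exp(-change_rate_per_s * age_s)

# Data that changes about once every 10 seconds, observed 10 seconds ago.
print(p_unchanged(0.1, 10))
```

Other change models give a different F; the point that survives is that confidence depends on both the rate of change and the age of the observation.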
  38. 38. THE ROLE OF SURPRISE Unlikely answers are often more interesting.
  39. 39. SAYS WHO? Thoughts on Consistency
  40. 40. OBSERVABILITY
  41. 41. OBSERVABILITY Steve Brian
  42. 42. OBSERVABILITY Left Right
  43. 43. OBSERVABILITY
      #  Steve  Brian
      1  R      R
      2  L      R
      3  R      L
  44. 44. STATE SPACE X1 = {L, R} X2 = {L, R}
  45. 45. SUPER-OBSERVER Has a view which dominates the views of all other observers.
  46. 46. SUPER-OBSERVER There are no one-to-many mappings from the super-observer’s states to any other observer’s states.
  47. 47. SUPER-OBSERVER A super-observer is maximally present if it can discriminate among the Cartesian product of all other observations.
      Observer        Set of States
      Steve           {L, R}
      Brian           {L, R}
      Super-Observer  {L → B, R → F} × {L, R}
  48. 48. OBSERVABILITY
      #  Steve  Brian  Super-Observer
      1  R      R      {F, R}
      2  L      R      {B, R}
      3  R      L      {F, L}
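The super-observer definitions can be checked mechanically. This sketch assumes one reading of slide 47: the super-observer sees Steve's axis as back/front (B/F) and Brian's axis as left/right (L/R). The names `SUPER_TO_STEVE`, `steve_view`, and `brian_view` are made up for the illustration.

```python
from itertools import product

# Assumed mapping from the super-observer's first axis to Steve's view.
SUPER_TO_STEVE = {"B": "L", "F": "R"}

# The super-observer's state space: {B, F} x {L, R}.
super_states = set(product("BF", "LR"))

def steve_view(state):
    return SUPER_TO_STEVE[state[0]]

def brian_view(state):
    return state[1]

# Each super-observer state yields exactly one (Steve, Brian) pair,
# so the mapping is never one-to-many.
joint_view = {s: (steve_view(s), brian_view(s)) for s in super_states}

# Maximally present: every combination of the other observers' states
# corresponds to some super-observer state.
covers_product = set(joint_view.values()) == set(product("LR", "LR"))
```

Because `joint_view` is a plain function from super-observer states to pairs, it satisfies slide 46, and `covers_product` being true corresponds to the condition on slide 47.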
  49. 49. PORKY PIG’S WINDOW SHADE If Porky Pig is looking at the window shade, he always observes it to be down. If he is looking away from the window shade, it rolls up.
  50. 50. FIRST DIMENSION X1 = {looking, not looking}
  51. 51. SECOND DIMENSION X1 = {looking, not looking} X2 = {shade open, shade closed}
  52. 52. FORBIDDEN STATES X1 = {looking, not looking} X2 = {shade open, shade closed}
  53. 53. STATE SPACE Cartesian product of all possible sets of states. Example: 1,000,000 bytes of RAM, 8 bits per byte, 2 states per bit. That is 8,000,000 dimensions with 2 values each, or 1,000,000 dimensions with 256 values each.
  54. 54. STATE SPACE 10,000,000 rows in a table, 20 columns. Whole database is a single point in a 200,000,000-dimensional space. Changes to data are transforms of that point. State over time is the trajectory of that point.
  55. 55. CONSISTENCY Not every point in state space is allowed.
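The Porky Pig example is small enough to enumerate. The sketch below builds the state space as a Cartesian product and filters it with a consistency rule. Which state is forbidden is an assumption inferred from slide 49: the shade is never open while Porky is looking.

```python
from itertools import product

X1 = ("looking", "not looking")
X2 = ("shade open", "shade closed")

# The full state space is the Cartesian product of the two dimensions.
state_space = list(product(X1, X2))

def is_allowed(state):
    """Assumed consistency rule: the shade is never open while Porky looks."""
    attention, shade = state
    return not (attention == "looking" and shade == "shade open")

allowed = [s for s in state_space if is_allowed(s)]
forbidden = [s for s in state_space if not is_allowed(s)]
```

Of the four points in this state space, three are allowed and one is forbidden. A real database has vastly more dimensions, but consistency has the same shape: a predicate that carves the allowed region out of the product.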
  56. 56. BLACK BOX HYPOTHESIS
  57. 57. BLACK BOX HYPOTHESIS External observers can only ever ask for projections of the state space, at defined points in time.
  58. 58. BLACK BOX HYPOTHESIS State space trajectories may cross into forbidden states, as long as those are not revealed to observers.
  59. 59. PROJECTION X1 = {looking, not looking} Is Porky looking at the window shade? P1 P2
  60. 60. BLACK BOX HYPOTHESIS Even two clustered machines have their own state spaces. It’s impossible for either to be a superobserver.
  61. 61. OBSERVED CONSISTENCY Sufficient to ensure that forbidden states cannot be observed.
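A single-process illustration of observed consistency, offered as a sketch rather than anything from the talk: a hypothetical `Ledger` whose rule is that the two balances always sum to the same total. The transfer passes through a forbidden state, but a lock keeps observers from seeing it.

```python
import threading

class Ledger:
    """Two balances whose sum must stay constant (the consistency rule)."""

    def __init__(self, a, b):
        self._lock = threading.Lock()
        self._a, self._b = a, b

    def transfer(self, amount):
        with self._lock:
            self._a -= amount  # total is off by `amount` here: forbidden state
            self._b += amount  # total restored before the lock is released

    def observe(self):
        with self._lock:
            return self._a, self._b

ledger = Ledger(100, 0)
totals_seen = set()

def watch(times):
    for _ in range(times):
        totals_seen.add(sum(ledger.observe()))

watcher = threading.Thread(target=watch, args=(10_000,))
watcher.start()
for i in range(10_000):
    ledger.transfer(1 if i % 2 == 0 else -1)
watcher.join()
```

Every total the watcher records is 100, even though the internal state is briefly wrong during each transfer. This only works because one lock guards one process; across clustered machines, as slide 60 notes, no such shared vantage point exists.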
  62. 62. DOES A SUPEROBSERVER EXIST? Only if there is exactly one single-threaded CPU, in exactly one computer.
  63. 63. CONSEQUENCES Consistency doesn’t exist in most systems today. Sometimes we can fake it. Many times, it doesn’t really matter.
  64. 64. WHAT ABOUT CAP? Consistency: “...there must exist a total order on all operations such that each operation looks as if it were completed at a single instant.” Seth Gilbert and Nancy Lynch. 2002. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. SIGACT News 33, 2 (June 2002), 51-59. DOI=10.1145/564585.564601 http://doi.acm.org/10.1145/564585.564601
  65. 65. WHAT ABOUT CAP? Linearizability Seth Gilbert and Nancy Lynch. 2002. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. SIGACT News 33, 2 (June 2002), 51-59. DOI=10.1145/564585.564601 http://doi.acm.org/10.1145/564585.564601
  66. 66. The data base consists of entities which are related in certain ways. These relationships are best thought of as assertions about the data.
  67. 67. Examples of such assertions are: “Names is an index for Telephone_numbers.” “The value of Count_of_X gives the number of employees in department X.”
  68. 68. The data base is said to be consistent if it satisfies all its assertions. In some cases, the data base must become temporarily inconsistent in order to transform it to a new consistent state. From "Granularity of Locks and Degrees of Consistency in a Shared Data Base", J.N. Gray, R.A. Lorie, G.R. Putzolu, I.L. Traiger, 1976
  69. 69. Consistency is a predicate C on entities and their values. The predicate is generally not known to the system but is embodied in the structure of the transactions. From "Transactions and Consistency in Distributed Database Systems", I.L. Traiger, J.N. Gray, C.A. Galtieri, and B.G. Lindsay, 1982
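The Count_of_X assertion from the quoted paper can be written as an executable predicate. This is a minimal sketch with made-up data and names (`employees`, `count_of`, `is_consistent`). It shows the data passing through a temporarily inconsistent state while an employee is added.

```python
from collections import Counter

def is_consistent(employees, count_of):
    """Assertion: count_of[X] equals the number of employees in department X."""
    actual = Counter(dept for _, dept in employees)
    departments = set(actual) | set(count_of)
    return all(actual[d] == count_of.get(d, 0) for d in departments)

employees = [("ann", "X"), ("bob", "X"), ("cy", "Y")]
count_of = {"X": 2, "Y": 1}
before = is_consistent(employees, count_of)   # consistent

employees.append(("dee", "Y"))                # step 1: row added, count is stale
during = is_consistent(employees, count_of)   # temporarily inconsistent

count_of["Y"] += 1                            # step 2: count brought up to date
after = is_consistent(employees, count_of)    # consistent again
```

The predicate holds before and after the two-step update but not between the steps, which is the temporary inconsistency the 1976 paper describes.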
  70. 70. “C” VERSUS “A”? Response Time Consistency Resilience See also: http://goo.gl/1Yv3 → http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html
  71. 71. WHAT’S THAT? Data Models and Composability
  72. 72. DATA MODEL DEFINED The system’s representation of the consistency predicate C.
  73. 73. LATENT MODEL Implicit in the structure of application code.
  74. 74. EXPLICIT Visible to storage engine or applications, expressed in machine-readable form.
  75. 75. HOMOICONIC Explicit, and available for expressions, computations, and validation together with statements about the data itself.
  76. 76. CHALLENGES Non-uniform.
  77. 77. CHALLENGES Non-uniform. Layered.
  78. 78. CHALLENGES Non-uniform. Layered. Confined.
  79. 79. CHALLENGES Non-uniform. Layered. Confined. Non-composable.
  80. 80. ENFORCING C Must account for overlapping wavefronts of information. There is no master clock. Simultaneity is positional. Make C explicit in the application, don’t rely on storage engine.
  81. 81. HOW LONG? On Lifecycles and Lifespans
  82. 82. DOWN WITH THE IRON FIST Throw out the DBAs Throw out the schemas Unstructured Semi-structured
  83. 83. DOWN WITH THE IRON FIST Put the application in charge.
  84. 84. DOWN WITH THE IRON FIST but...
  85. 85. DIFFICULTIES Application versions Validating correct behavior Capturing knowledge about that behavior
  86. 86. UGLY TRUTH Data routinely outlives applications.
  87. 87. Response Time Total Cost Sustainability Agility
  88. 88. WHERE NOW?
  89. 89. WHERE TO PUT YOUR DATA? Data exists everywhere.
  90. 90. WHERE TO PUT YOUR DATA? Nothing lasts forever.
  91. 91. WHERE TO PUT YOUR DATA? Understand freshness.
  92. 92. WHERE TO PUT YOUR DATA? Engineer a good response time distribution.
  93. 93. WHERE TO PUT YOUR DATA? Select the consistency model you need.
  94. 94. WHERE TO PUT YOUR DATA? Be agile and adaptable.
  95. 95. WHERE TO PUT YOUR DATA? Make it sustainable.
  96. 96. Michael T. Nygard michael.nygard@n6consulting.com @mtnygard © 2010 N6 Consulting, LLC. All Rights Reserved.
