Transcript of "Runaway complexity in Big Data... and a plan to stop it"

  1. Runaway complexity in Big Data, and a plan to stop it. Nathan Marz (@nathanmarz)
  2. Agenda
     • Common sources of complexity in data systems
     • Design for a fundamentally better data system
  3. What is a data system? A system that manages the storage and querying of data
  4. What is a data system? A system that manages the storage and querying of data with a lifetime measured in years
  5. What is a data system? A system that manages the storage and querying of data with a lifetime measured in years, encompassing every version of the application to ever exist
  6. What is a data system? A system that manages the storage and querying of data with a lifetime measured in years, encompassing every version of the application to ever exist, every hardware failure
  7. What is a data system? A system that manages the storage and querying of data with a lifetime measured in years, encompassing every version of the application to ever exist, every hardware failure, and every human mistake ever made
  8. Common sources of complexity
     • Lack of human fault-tolerance
     • Conflation of data and queries
     • Schemas done wrong
  9. Lack of human fault-tolerance
  10. Human fault-tolerance
     • Bugs will be deployed to production over the lifetime of a data system
     • Operational mistakes will be made
     • Humans are part of the overall system, just like your hard disks, CPUs, memory, and software
     • Must design for human error like you'd design for any other fault
  11. Human fault-tolerance: examples of human error
     • Deploy a bug that increments counters by two instead of by one
     • Accidentally delete data from the database
     • Accidental DoS on an important internal service
  12. The worst consequence is data loss or data corruption
  13. As long as an error doesn't lose or corrupt good data, you can fix what went wrong
  14. Mutability
     • The U and D in CRUD
     • A mutable system updates the current state of the world
     • Mutable systems inherently lack human fault-tolerance
     • Easy to corrupt or lose data
  15. Immutability
     • An immutable system captures a historical record of events
     • Each event happens at a particular time and is always true
  16. Capturing change with a mutable data model (Person, Location). Before: (Sally, Philadelphia), (Bob, Chicago). After Sally moves to New York, her row is overwritten: (Sally, New York), (Bob, Chicago).
  17. Capturing change with an immutable data model (Person, Location, Time). Before: (Sally, Philadelphia, 1318358351), (Bob, Chicago, 1327928370). After Sally moves to New York, a new record is appended: (Sally, Philadelphia, 1318358351), (Bob, Chicago, 1327928370), (Sally, New York, 1338469380).
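As an illustration (not from the deck), here is a minimal sketch of the immutable model above: the master dataset is an append-only list of (person, location, time) facts, and the current state is derived from those facts rather than stored.

```python
import time

# Master dataset: an append-only list of immutable facts.
# Each fact records that a person was at a location at a point in time.
facts = []

def record_location(person, location, timestamp=None):
    """Append a new fact; existing facts are never updated or deleted."""
    facts.append({
        "person": person,
        "location": location,
        "time": timestamp if timestamp is not None else int(time.time()),
    })

def current_location(person):
    """Derive the current state by taking the person's most recent fact."""
    person_facts = [f for f in facts if f["person"] == person]
    return max(person_facts, key=lambda f: f["time"])["location"] if person_facts else None

record_location("Sally", "Philadelphia", 1318358351)
record_location("Bob", "Chicago", 1327928370)
record_location("Sally", "New York", 1338469380)  # Sally moves; the old fact is kept

assert current_location("Sally") == "New York"
```

A mistake here can at worst append a bad fact, which can later be filtered out; it cannot silently overwrite good data.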
  18. Immutability greatly restricts the range of errors that can cause data loss or data corruption
  19. Vastly more human fault-tolerant
  20. Immutability: other benefits
     • Fundamentally simpler
     • CR instead of CRUD
     • Only write operation is appending new units of data
     • Easy to implement on top of a distributed filesystem (file = list of data records; append = add a new file into a directory)
     Basing a system on mutability is like pouring gasoline on your house ("but don't worry, I checked all the wires carefully to make sure there won't be any sparks"): when someone makes a mistake, who knows what will burn.
  21. Immutability: please watch Rich Hickey's talks to learn more about the enormous benefits of immutability
  22. Conflation of data and queries
  23. Conflation of data and queries: normalization vs. denormalization. Normalized schema: a Person table (ID, Name, Location ID) with rows (1, Sally, 3), (2, George, 1), (3, Bob, 3), and a Location table (Location ID, City, State, Population) with rows (1, New York, NY, 8.2M), (2, San Diego, CA, 1.3M), (3, Chicago, IL, 2.7M).
  24. Join is too expensive, so denormalize...
  25. Denormalized schema: the Person table now embeds location columns (ID, Name, Location ID, City, State) with rows (1, Sally, 3, Chicago, IL), (2, George, 1, New York, NY), (3, Bob, 3, Chicago, IL), alongside the Location table (Location ID, City, State, Population) with rows (1, New York, NY, 8.2M), (2, San Diego, CA, 1.3M), (3, Chicago, IL, 2.7M).
  26. Obviously, you prefer all data to be fully normalized
  27. But you are forced to denormalize for performance
  28. Because the way data is modeled, stored, and queried is complected
  29. We will come back to how to build data systems in which these are disassociated
  30. Schemas done wrong
  31. Schemas have a bad rap
  32. Schemas
     • Hard to change
     • Get in the way
     • Add development overhead
     • Require annoying configuration
  33. I know! Use a schemaless database!
  34. This is an overreaction
  35. Confuses the poor implementation of schemas with the value that schemas provide
  36. What is a schema exactly?
  37. function(data unit)
  38. That says whether this data is valid or not
  39. This is useful
  40. Value of schemas
     • Structural integrity
     • Guarantees on what can and can't be stored
     • Prevents corruption
  41. Otherwise you'll detect corruption issues at read-time
  42. Potentially long after the corruption happened
  43. With little insight into the circumstances of the corruption
  44. Much better to get an exception where the mistake is made, before it corrupts the database
  45. Saves enormous amounts of time
  46. Why are schemas considered painful?
     • Changing the schema is hard (e.g., adding a column to a table)
     • The schema is overly restrictive (e.g., it cannot do nested objects)
     • Schemas require translation layers (e.g., ORMs)
     • Schemas require more typing (development overhead)
  47. None of these are fundamentally linked with function(data unit)
  48. These are problems in the implementation of schemas, not in schemas themselves
  49. Ideal schema tool
     • Data is represented as maps
     • Schema tool is a library that helps construct the schema function:
        • Concisely specify required fields and types
        • Insert custom validation logic for fields (e.g., ages are between 0 and 200)
     • Built-in support for evolving the schema over time
     • Fast and space-efficient serialization/deserialization
     • Cross-language
     This is easy to use and gets out of your way. I use Apache Thrift, but it lacks the custom validation logic; I think it could be done better with a Clojure-like data-as-maps approach. Given the parameters of a data system (long-lived, ever changing, with mistakes being made), the amount of work it takes to make a schema (not that much) is absolutely worth it. A minimal sketch of a schema as a validation function is shown below.
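As a rough illustration of the schema-as-function idea with data represented as maps, here is a minimal sketch; the `make_schema` helper and the example fields are hypothetical, and a real tool would also handle schema evolution and efficient serialization, as the slide notes.

```python
# Hedged sketch of "schema = function(data unit)" with data represented as maps.
# The make_schema helper and the field names below are illustrative, not from the deck.

def make_schema(required_fields, validators=None):
    """Build a schema function: it returns True only if the data unit is valid."""
    validators = validators or {}

    def schema(data_unit):
        if not isinstance(data_unit, dict):
            return False
        for field, expected_type in required_fields.items():
            if field not in data_unit or not isinstance(data_unit[field], expected_type):
                return False
        # Custom per-field validation logic (e.g., ages are between 0 and 200)
        return all(check(data_unit[field]) for field, check in validators.items())

    return schema

person_schema = make_schema(
    required_fields={"name": str, "age": int},
    validators={"age": lambda age: 0 <= age <= 200},
)

assert person_schema({"name": "Sally", "age": 25})
assert not person_schema({"name": "Bob", "age": 999})  # caught at write time, not read time
```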
  50. Let's get provocative
  51. The relational database will be a footnote in history
  52. Not because of SQL, restrictive schemas, or scalability issues
  53. But because of fundamental flaws in the RDBMS approach to managing data
  54. Mutability
  55. Conflating the storage of data with how it is queried. Back in the day, these flaws were features, because space was at a premium. The landscape has changed, and this is no longer the constraint it once was. So these properties of mutability and conflating data and queries are now major, glaring flaws, because there are better ways to design data systems.
  56. "NewSQL" is misguided
  57. Let's use our ability to cheaply store massive amounts of data
  58. To do data right
  59. And not inherit the complexities of the past
  60. I know! Use a NoSQL database! (If SQL's wrong, and NoSQL isn't SQL, then NoSQL must be right.)
  61. NoSQL databases are generally not a step in the right direction
  62. Some aspects are, but not the ones that get all the attention
  63. Still based on mutability and not general purpose
  64. Let's see how you design a data system that doesn't suffer from these complexities. Let's start from scratch.
  65. What does a data system do?
  66. Retrieve data that you previously stored? Put and Get?
  67. Not really...
  68. Counterexamples: store location information on people. How many people live in a particular location? Where does Sally live? What are the most populous locations?
  69. Counterexamples: store pageview information. How many pageviews on September 2nd? How many unique visitors over time?
  70. Counterexamples: store transaction history for a bank account. How much money does George have? How much money do people spend on housing?
  71. What does a data system do? Query = Function(All data)
  72. Sometimes you retrieve what you stored
  73. Oftentimes you do transformations, aggregations, etc.
  74. Expressing queries as pure functions that take all data as input is the most general formulation
  75. Example query: total number of pageviews to a URL over a range of time
  76. Example query: implementation
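The code on this slide isn't captured in the transcript; what follows is a minimal stand-in sketch, assuming pageview records with `url`, `user`, and `time` fields, that expresses the query literally as a pure function over all data (in the deck this would be written in something like Cascalog or Scalding).

```python
# Hedged sketch of the query as a pure function of all data. The slide's actual
# implementation isn't in the transcript; the record fields here are assumptions.
def total_pageviews(all_data, url, start_time, end_time):
    """Query = Function(All data): count pageviews to a URL over a time range."""
    return sum(
        1
        for pageview in all_data
        if pageview["url"] == url and start_time <= pageview["time"] < end_time
    )

all_data = [
    {"url": "/about", "user": "sally", "time": 1338469380},
    {"url": "/about", "user": "bob",   "time": 1338469395},
    {"url": "/",      "user": "bob",   "time": 1338469400},
]
print(total_pageviews(all_data, "/about", 1338469000, 1338470000))  # 2
```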
  77. Too slow: "all data" is petabyte-scale
  78. On-the-fly computation: all data -> query
  79. Precomputation: all data -> precomputed view -> query
  80. Example query: pageview records (all data) -> precomputed view (a count, e.g. 2930) -> query
  81. Precomputation: all data -> precomputed view -> query
  82. Precomputation: all data -> function -> precomputed view -> function -> query
  83. Data system: all data -> function -> precomputed view -> function -> query. Two problems to solve.
  84. Data system, problem 1: how to compute views
  85. Data system, problem 2: how to compute queries from views
  86. Computing views: all data -> function -> precomputed view
  87. Function that takes in all data as input
  88. Batch processing
  89. MapReduce
  90. MapReduce is a framework for computing arbitrary functions on arbitrary data
  91. Expressing those functions: Cascalog, Scalding
  92. MapReduce precomputation: all data -> MapReduce workflow -> batch view #1; all data -> MapReduce workflow -> batch view #2
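As a hedged, in-memory stand-in for such a workflow (the real thing would run on Hadoop, expressed in Cascalog or Scalding), here is what the batch view computation for the pageview example might look like, bucketing counts by URL and hour.

```python
from collections import defaultdict

HOUR = 3600

# Illustrative stand-in for a MapReduce workflow; the batch-view logic is the point.
def compute_batch_view(all_data):
    """Map each pageview to a ((url, hour), 1) pair, then reduce by summing per key."""
    mapped = (((pv["url"], pv["time"] // HOUR), 1) for pv in all_data)
    view = defaultdict(int)
    for key, count in mapped:      # reduce step: sum the counts for each (url, hour)
        view[key] += count
    return dict(view)

def total_pageviews(batch_view, url, start_time, end_time):
    """Answer the example query from the (much smaller) batch view, at hour granularity."""
    return sum(
        count
        for (view_url, hour), count in batch_view.items()
        if view_url == url and start_time // HOUR <= hour < end_time // HOUR
    )
```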
  93. Batch view database: need a database that...
     • Is batch-writable from MapReduce
     • Has fast random reads
     • Examples: ElephantDB, Voldemort
  94. Batch view database: no random writes required!
  95. Properties (all data -> function -> batch view): Simple (ElephantDB is only a few thousand lines of code)
  96. Properties: Scalable
  97. Properties: Highly available
  98. Properties: Can be heavily optimized (because there are no random writes)
  99. Properties: Normalized
  100. Properties: "Denormalized". Not exactly denormalization, because you're doing more than just retrieving data that you stored (you can do aggregations). You're able to optimize data storage separately from data modeling, without the complexity typical of denormalization in relational databases. This is because the batch view is a pure function of all data: it's hard for the view to get out of sync, and if there's ever a problem (like a bug in your code that computes the wrong batch view) you can recompute it. It's also easy to debug problems, since you have the input that produced the batch view; this is not true in a mutable system based on incremental updates.
  101. So we're done, right?
  102. Not quite...
     • A batch workflow is too slow
     • Views are out of date
     On the timeline, everything older than the latest batch run has been absorbed into the batch views; only the last few hours of data have not been absorbed.
  103. Properties: Eventually consistent
  104. Properties: Eventually consistent (without the associated complexities)
  105. Properties: Eventually consistent (such as divergent values, vector clocks, etc.)
  106. What's left? Precompute views for the last few hours of data
  107. Realtime views
  108. New data stream -> stream processor -> realtime view #1, realtime view #2 (NoSQL databases)
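As an illustration only (the deck's realtime layer is a Storm topology writing to a random-write database), here is a minimal sketch of a stream processor incrementally maintaining hourly pageview counts for recent data.

```python
from collections import defaultdict

HOUR = 3600

# The realtime view: in the deck this lives in a random-write NoSQL database and is
# updated by a stream processor (Storm); here a dict stands in for both.
realtime_view = defaultdict(int)

def process_new_pageview(pageview):
    """Incrementally update the realtime view as events arrive on the new-data stream."""
    key = (pageview["url"], pageview["time"] // HOUR)
    realtime_view[key] += 1

def discard_absorbed_hours(batch_cutoff):
    """Drop realtime state for hours the batch layer has already absorbed."""
    for key in [k for k in realtime_view if k[1] < batch_cutoff // HOUR]:
        del realtime_view[key]
```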
  109. Application queries: batch view + realtime view, merged at query time
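A hedged sketch of that merge, reusing the hypothetical hourly views from the sketches above and assuming hour-aligned boundaries: hours already absorbed by the batch layer are answered from the batch view, and the remaining few hours from the realtime view.

```python
HOUR = 3600  # same hourly bucketing as the view sketches above

def pageviews_query(batch_view, realtime_view, batch_cutoff, url, start_time, end_time):
    """Serve hours before batch_cutoff from the batch view, later hours from the realtime view."""
    batch_part = sum(
        count for (view_url, hour), count in batch_view.items()
        if view_url == url and start_time // HOUR <= hour < min(end_time, batch_cutoff) // HOUR
    )
    realtime_part = sum(
        count for (view_url, hour), count in realtime_view.items()
        if view_url == url and max(start_time, batch_cutoff) // HOUR <= hour < end_time // HOUR
    )
    return batch_part + realtime_part
```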
  110. Precomputation: all data -> precomputed view -> query
  111. Precomputation: all data -> precomputed batch view; new data stream -> precomputed realtime view; query merges both. The "Lambda Architecture".
  112. The precomputed realtime view is the most complex part of the system
  113. Random-write databases are much more complex; the realtime view is where things like vector clocks have to be dealt with if you use an eventually consistent NoSQL database
  114. But the realtime view only represents a few hours of data
  115. You can continuously discard realtime views, keeping them small; if anything goes wrong, the system auto-corrects
  116. CAP: the realtime layer decides whether to guarantee C or A
     • If it chooses consistency, queries are consistent
     • If it chooses availability, queries are eventually consistent
     All the complexity of dealing with the CAP theorem (like read repair) is isolated in the realtime layer, and if anything goes wrong it's auto-corrected. CAP is now a choice, as it should be, rather than a complexity burden. Making a mistake with respect to eventual consistency won't corrupt your data.
  117. Eventual accuracy: sometimes it's hard to compute the exact answer in realtime
  118. Eventual accuracy: example, unique count
  119. Eventual accuracy: you can compute the exact answer in the batch layer and an approximate answer in the realtime layer. (Though for functions which can be computed exactly in the realtime layer, e.g. counting, you can achieve full accuracy.)
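To make this concrete, here is one possible (illustrative) split for unique counts: the batch layer computes the exact answer over all data, while the realtime layer keeps a small K-minimum-values sketch. The deck doesn't prescribe a particular estimator; HyperLogLog is another common choice.

```python
import bisect
import hashlib

def exact_unique_count(all_data):
    """Batch layer: exact distinct-visitor count computed over all data."""
    return len({pv["user"] for pv in all_data})

class KMVSketch:
    """Realtime layer: approximate distinct count via the k smallest hash values."""
    def __init__(self, k=256):
        self.k = k
        self.smallest = []                      # sorted k smallest hash fractions in [0, 1)

    def add(self, item):
        h = int(hashlib.sha1(str(item).encode()).hexdigest()[:15], 16) / 16.0 ** 15
        if h not in self.smallest:
            bisect.insort(self.smallest, h)
            del self.smallest[self.k:]          # keep only the k smallest

    def estimate(self):
        if len(self.smallest) < self.k:
            return len(self.smallest)           # still exact while few distinct items seen
        return int((self.k - 1) / self.smallest[-1])
```

At query time the application merges the exact batch count with the realtime estimate, so any approximation error is confined to the most recent few hours and disappears once the batch layer catches up.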
  120. Eventual accuracy: best of both worlds of performance and accuracy
  121. Tools: all data (Hadoop) -> precomputed batch view (ElephantDB, Voldemort); new data stream (Kafka) -> stream processing (Storm) -> precomputed realtime view (Cassandra, Riak, HBase); query merges both. The "Lambda Architecture".
  122. Lambda Architecture
     • Can discard batch views and realtime views and recreate everything from scratch
     • Mistakes corrected via recomputation
     • Data storage layer optimized independently from query resolution layer
     What mistakes can be made? Wrote bad data? Remove the data and recompute the views. Bug in the functions that compute a view? Recompute the view. Bug in a query function? Just deploy the fix.
  123. Future
     • Abstraction over batch and realtime
     • More data structure implementations for batch and realtime views
  124. Learn more: http://manning.com/marz
  125. Questions?