A Call for Sanity in NoSQL

Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1nyhaC6.

Nathan Marz discusses building NoSQL-based data systems that are scalable and easy to reason about. Filmed at qconlondon.com.

Nathan Marz is the creator of many open source projects, including Cascalog and Storm, which are relied upon by over 50 companies around the world. Nathan is also working on a book for Manning Publications entitled "Big Data: principles and best practices of scalable realtime data systems". Nathan was previously the lead engineer at BackType before it was acquired by Twitter in 2011.

A Call for Sanity in NoSQL

  1. 1. A call for sanity in NoSQL Nathan Marz @nathanmarz 1
  2. 2. InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/nosql-cons http://www.infoq.com/presentations/nasa-big-data
  3. 3. Presented at QCon London www.qconlondon.com Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide
  4. 4. “Doofus programmer”
  5. 5. You broke the build!
  6. 6. Where are the tests?
  7. 7. The site is down!
  8. 8. DELETE FROM users...
  9. 9. Oops, I forgot the WHERE clause!
  10. 10. =
  11. 11. =
  12. 12. No problem...
  13. 13. Mistakes are guaranteed
  14. 14. Create Read Update Delete
  15. 15. Mutable database = Guaranteed corruption
  16. 16. Insane
  17. 17. Schemaless databases
  18. 18. Insane
  19. 19. Avoiding complexity
  20. 20. counter++
  21. 21. Insane
  22. 22. Denormalization
  23. 23. Normalized schema

      ID  Name    Location ID
      1   Sally   3
      2   George  1
      3   Bob     3

      Location ID  City       State  Population
      1            New York   NY     8.2M
      2            San Diego  CA     1.3M
      3            Chicago    IL     2.7M
  24. 24. Join is too expensive, so denormalize...
  25. 25. Denormalized schema

      ID  Name    Location ID  City      State
      1   Sally   3            Chicago   IL
      2   George  1            New York  NY
      3   Bob     3            Chicago   IL

      Location ID  City       State  Population
      1            New York   NY     8.2M
      2            San Diego  CA     1.3M
      3            Chicago    IL     2.7M
  26. 26. Insane
  27. 27. What is the source of the insanity?
  28. 28. Code is deterministic. I understand logic. Therefore, I can write correct code. Programming fallacy
  29. 29. Your code is wrong
  30. 30. Your code Dependency 1 Dependency 2 Dependency 3
  31. 31. Dependency 1 Dependency 4 Dependency 5
  32. 32. Dependency 4 Dependency 6 Dependency 9 Dependency 7 Dependency 8
  33. 33. Dependency 3,000,000 Hardware
  34. 34. Electronics
  35. 35. Chemistry
  36. 36. Atomic physics
  37. 37. Quantum mechanics
  38. 38. I think I can safely say that nobody understands quantum mechanics. Richard Feynman
  39. 39. Your code is wrong
  40. 40. Your code ...
  41. 41. Infinite regress
  42. 42. All the software you’ve used has had bugs in it
  43. 43. Including the software you’ve written
  44. 44. Insanity: Mutability, Schemaless databases, Eventual consistency / read-repair, Denormalization
  45. 45. Mutability

      Person  Location
      Sally   New York
      Bob     Chicago
  46. 46. Mutability

      Person  Location
      Sally   London
      Bob     Chicago
  47. 47. Immutability

      Person  Location  Time
      Sally   New York  1318358351
      Bob     Chicago   1327928370
  48. 48. Immutability

      Person  Location  Time
      Sally   New York  1318358351
      Bob     Chicago   1327928370
      Sally   London    1338273801
  49. 49. Lambda Architecture [Diagram: All data -> Precomputed batch view -> Query; New data stream -> Precomputed realtime view -> Query]
  50. 50. Relational database
  51. 51. NewSQL
  52. 52. All data problems?
  53. 53. What does a data system do?
  54. 54. Retrieve data that you previously stored? (Get / Put)
  55. 55. Counterexamples. Store location information on people: Where does Sally live? What are the most populous locations? How many people live in a particular location?
  56. 56. Counterexamples. Store pageview information: How many unique visitors over time? How many pageviews on September 2nd?
  57. 57. Counterexamples. Store transaction history for bank account: How much money do people spend on housing? How much money does George have?
  58. 58. What does a data system do? Query = Function(All data)
  59. 59. Example query: Total number of pageviews to a URL over a range of time
  60. 60. Example query: Implementation
  61. 61. On-the-fly computation [Diagram: All data -> Query]
  62. 62. Precomputation [Diagram: All data -> Precomputed view -> Query]
  63. 63. Example query [Diagram: All data (Pageview, Pageview, Pageview, Pageview, Pageview) -> Precomputed view -> Query (2930)]
  64. 64. Precomputation [Diagram: All data -> Precomputed view -> Query, with the label "(Immutable)"]
  65. 65. Precomputation [Diagram: All data -> Function -> Precomputed view -> Function -> Query]
  66. 66. Computing views [Diagram: All data -> Function -> Precomputed view]
  67. 67. Function that takes in all data as input
  68. 68. Batch processing
  69. 69. MapReduce
  70. 70. MapReduce is a framework for computing arbitrary functions on arbitrary data
  71. 71. MapReduce precomputation [Diagram: All data -> MapReduce workflow -> Batch view #1; All data -> MapReduce workflow -> Batch view #2]
  72. 72. [Diagram: All data -> Precomputed view -> Query]
  73. 73. [Diagram: All data (Normalized) -> Precomputed view (Denormalized) -> Query]
  74. 74. Human fault-tolerant
  75. 75. Not quite... • A batch workflow is too slow • Views are out of date [Timeline ending at "Now": older data is "Absorbed into batch views", the most recent data is "Not absorbed". Just a few hours of data!]
  76. 76. Precomputation: Compute parallel realtime views [Diagram: All data -> Precomputed batch view -> Query; New data stream -> Precomputed realtime view -> Query]
  77. 77. NoSQL databases [Diagram: New data stream -> Stream processor -> Realtime view #1, Realtime view #2]
  78. 78. Precomputation (same diagram as slide 76): Most complex part of system
  79. 79. Precomputation (same diagram as slide 76): But only represents few hours of data
  80. 80. Precomputation (same diagram as slide 76): So can be kept small
  81. 81. Precomputation (same diagram as slide 76): If anything goes wrong, auto-corrects
  82. 82. Precomputation (same diagram as slide 76): “Complexity isolation”
  83. 83. Precomputation (same diagram as slide 76): No random writes needed
  84. 84. “Lambda Architecture” [Diagram: All data -> Precomputed batch view -> Query; New data stream -> Precomputed realtime view -> Query]
  85. 85. Applying the Lambda Architecture
  86. 86. Unique visitors over a range of hours [Figure: Pageviews plotted against Hours 1 to 4; visitor labels in slide order: A B C A B C D E C D E]
  87. 87. Unique visitors over a range of hours [Same figure, with Equivs: A = B, B = D]
  88. 88. What should view look like?
  89. 89. Attempt #1

      URL           Hour range  Count
      foo.com/blog  1-1         3
                    1-2         6
                    1-3         6
                    2-2         4
                    2-3         5
                    3-3         1
      bar.com/foo   1-1         3
                    1-2         6
                    1-3         6
                    2-2         4
                    2-3         5
                    3-3         1
  90. 90. Attempt #2

      URL           Hour  Count
      foo.com/blog  1     3
                    2     6
                    3     6
                    4     8
                    5     5
                    6     1
      bar.com/foo   1     1
                    2     11
                    3     23
                    4     8
                    5     9
                    6     0
  91. 91. Attempt #3

      URL           Hour  Visitors
      foo.com/blog  1     A, B, C
                    2     A
                    3     D, E
                    4     A
                    5     B, C, D
                    6     F, G
      bar.com/foo   1     A
                    2     B, C
                    3     A, C
                    4     D
                    5     A, E
                    6     A
  92. 92. Attempt #4

      URL           Hour  HyperLogLog set
      foo.com/blog  1     <HLL>
                    2     <HLL>
                    3     <HLL>
                    4     <HLL>
                    5     <HLL>
                    6     <HLL>
      bar.com/foo   1     <HLL>
                    2     <HLL>
                    3     <HLL>
                    4     <HLL>
                    5     <HLL>
                    6     <HLL>
  93. 93. Final attempt [Three tables with the same layout as Attempt #4, bucketed by Hour, by Month, and by Week; each maps buckets 1 to 6 of foo.com/blog and bar.com/foo to a HyperLogLog set <HLL>]
  94. 94. Batch workflow: All data -> Normalize pageview URLs -> Equiv connected-component labeling -> Normalize pageview userids -> Aggregate HyperLogLog sets per bucket -> Index into batch view
  95. 95. Equiv graph [Graph with nodes 1, 3, 4, 5, 11, 2, 7, 6, 9]
  96. 96. Equiv graph [Same nodes, with components labeled Person A and Person B]
  97. 97. Equiv graph: 1 -> A, 2 -> A, 3 -> A, 4 -> A, 5 -> A, 11 -> A, 6 -> B, 7 -> B, 9 -> B
  98. 98. Batch workflow: All data -> Normalize pageview URLs -> Equiv connected-component labeling -> Normalize pageview userids -> Aggregate HyperLogLog sets per bucket -> Index into batch view
  99. 99. Realtime workflow: The equiv problem
  100. 100. The equiv problem [Same table as slide 92 (Attempt #4): URL, Hour, HyperLogLog set]
  101. 101. The equiv problem [Same table as slide 91 (Attempt #3): URL, Hour, Visitors]
  102. 102. Solving the equiv problem. Batch: All data -> Equiv connected-component labeling -> Index into batch view of "userid -> normalized userid"
  103. 103. Realtime workflow. Batch: All data -> Equiv connected-component labeling -> Index into batch view of "userid -> normalized userid". Pageviews stream -> Normalize userid -> Aggregate realtime HyperLogLog buckets -> Realtime view
  104. 104. Schema
  105. 105. Schema
  106. 106. Schema
  107. 107. Schema
  108. 108. Schema
  109. 109. Schema
  110. 110. Conclusions from example - Avoids every single insane complexity I talked about - Powerful to be able to extract more out of a piece of data the longer you have it - Eventual accuracy is a super useful technique - Schemas can be useful and non-painful - A Lambda Architecture is fundamentally easy to extend with new views/queries
  111. 111. Thank you
  112. 112. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/nosql-cons
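
Slides 58 to 84 describe a query as a function of all data, served from a precomputed batch view plus a small realtime view covering the data the batch run has not yet absorbed. The following is a minimal sketch of that idea for the example query on slide 59 (total pageviews to a URL over a range of time). It is not the speaker's implementation: the `Pageview` record, the `compute_view` and `query` helpers, and the sample data are all assumptions made for illustration, and a real realtime view would be updated incrementally by a stream processor rather than recomputed.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Pageview(NamedTuple):
    """Hypothetical immutable pageview record."""
    url: str
    hour: int  # hour bucket in which the pageview occurred


def compute_view(pageviews: Iterable[Pageview]) -> Counter:
    """Precompute pageview counts keyed by (url, hour)."""
    view = Counter()
    for pv in pageviews:
        view[(pv.url, pv.hour)] += 1
    return view


def query(batch_view: Counter, realtime_view: Counter,
          url: str, start_hour: int, end_hour: int) -> int:
    """Total pageviews for url over [start_hour, end_hour], merging both views."""
    return sum(
        batch_view[(url, h)] + realtime_view[(url, h)]
        for h in range(start_hour, end_hour + 1)
    )


# Append-only master dataset already absorbed by the last batch run (sample data).
all_data = [
    Pageview("foo.com/blog", 1),
    Pageview("foo.com/blog", 1),
    Pageview("foo.com/blog", 2),
    Pageview("bar.com/foo", 2),
]
# Pageviews that arrived after the last batch run (sample data).
new_data = [Pageview("foo.com/blog", 3)]

batch_view = compute_view(all_data)
realtime_view = compute_view(new_data)

total = query(batch_view, realtime_view, "foo.com/blog", 1, 3)
print(total)
```

Because the batch view is always recomputed from the immutable master dataset, a bug in the view logic can be fixed and the view rebuilt, which is the "human fault-tolerant" property from slide 74.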

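
Slides 86 to 103 work through unique visitors over a range of hours, where "equivs" say that two userids belong to the same person. The sketch below illustrates the two steps named on the slides: connected-component labeling of the equiv graph, then normalizing userids before aggregating per-hour buckets. It makes several assumptions: union-find is one common way to label components (the slides do not say which algorithm is used), plain Python sets stand in for the HyperLogLog sets, and the function names and the pageview sample data are hypothetical. Only the equivs A = B and B = D come from slide 87.

```python
from collections import defaultdict


def label_components(equivs):
    """Map each userid in the equiv edges to one representative per component."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in equivs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller id as the root so labels are deterministic.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    return {user: find(user) for user in list(parent)}


def unique_visitors(pageviews, normalized, start_hour, end_hour):
    """Count distinct normalized visitors over [start_hour, end_hour].

    Sets are used here in place of HyperLogLog sets; both can be merged
    across buckets, but sets are exact and use far more space.
    """
    per_hour = defaultdict(set)
    for hour, user in pageviews:
        per_hour[hour].add(normalized.get(user, user))
    merged = set()
    for h in range(start_hour, end_hour + 1):
        merged |= per_hour.get(h, set())
    return len(merged)


equivs = [("A", "B"), ("B", "D")]  # equivs from slide 87
normalized = label_components(equivs)

# Hypothetical (hour, userid) pageviews for a single URL.
pageviews = [(1, "A"), (1, "B"), (1, "C"), (2, "A"), (3, "D"), (3, "E")]

count = unique_visitors(pageviews, normalized, 1, 3)
print(normalized, count)
```

With the equivs applied, A, B and D collapse into one person, so the three hours contain three unique visitors instead of five. In the realtime workflow on slide 103, the `normalized` mapping plays the role of the batch view of userid to normalized userid that the stream consults.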