Using Simplicity to Make Hard Big Data Problems Easy

Presented at Data Day Texas on January 10th, 2015

  1. Using simplicity to make hard Big Data problems easy
  2. Lambda Architecture: Master Dataset -> Batch views; New Data -> Realtime views; Query
  3. Proposed alternative (problematic): Master Dataset -> Stream processor -> R/W databases
  4. Easy problem
  5. struct PageView { UserID id, String url, Timestamp timestamp }
  6. Implement: function NumUniqueVisitors(String url, int startHour, int endHour)
  7. Unique visitors over a range of hours
  8. Notes: • Not limiting ourselves to current tooling • Reasonable variations of existing tooling are acceptable • Interested in what’s fundamentally possible
  9. Traditional Architectures: Synchronous (Application -> Databases); Asynchronous (Application -> Queue -> Stream processor -> Databases)
  10. Approach #1 • Use Key->Set database • Key = [URL, hour bucket] • Value = Set of UserIDs
  11. Approach #1 • Queries: • Get all sets for all hours in range of query • Union sets together • Compute count of merged set
  12. Approach #1 • Lots of database lookups for large ranges • Potentially a lot of items in sets, so lots of work to merge/count • Database will use a lot of space
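
As an illustration of Approach #1, here is a minimal sketch of the query path, assuming a hypothetical KeyToSetDatabase interface (the interface name and the use of String user IDs are assumptions, not from the slides):

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical Key->Set database: key is [URL, hour bucket], value is a set of UserIDs.
interface KeyToSetDatabase {
    Set<String> getUserIds(String url, int hourBucket);
}

class Approach1Query {
    private final KeyToSetDatabase db;

    Approach1Query(KeyToSetDatabase db) { this.db = db; }

    // One database lookup per hour in the range, then union and count.
    long numUniqueVisitors(String url, int startHour, int endHour) {
        Set<String> merged = new HashSet<>();
        for (int hour = startHour; hour <= endHour; hour++) {
            merged.addAll(db.getUserIds(url, hour));  // potentially very large sets
        }
        return merged.size();
    }
}
```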
  13. Approach #2: Use HyperLogLog
  14. interface HyperLogLog { boolean add(Object o); long size(); HyperLogLog merge(HyperLogLog... otherSets); }
  15. Approach #2 • Use Key->HyperLogLog database • Key = [URL, hour bucket] • Value = HyperLogLog structure
  16. Approach #2 • Queries: • Get all HyperLogLog structures for all hours in range of query • Merge structures together • Retrieve count from merged structure
  17. Approach #2 • Much more efficient use of storage • Less work at query time • Mild accuracy tradeoff
  18. Approach #2 • Large ranges still require lots of database lookups / work
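
A sketch of the Approach #2 query path using the HyperLogLog interface from slide 14; the KeyToHllDatabase interface is assumed for illustration:

```java
// HyperLogLog interface as given on slide 14.
interface HyperLogLog {
    boolean add(Object o);
    long size();
    HyperLogLog merge(HyperLogLog... otherSets);
}

// Hypothetical Key->HyperLogLog database: key is [URL, hour bucket].
interface KeyToHllDatabase {
    HyperLogLog getHll(String url, int hourBucket);
}

class Approach2Query {
    private final KeyToHllDatabase db;

    Approach2Query(KeyToHllDatabase db) { this.db = db; }

    long numUniqueVisitors(String url, int startHour, int endHour) {
        // Merging fixed-size sketches is far cheaper than unioning raw UserID sets,
        // at the cost of a small, bounded error in the count.
        HyperLogLog merged = db.getHll(url, startHour);
        for (int hour = startHour + 1; hour <= endHour; hour++) {
            merged = merged.merge(db.getHll(url, hour));
        }
        return merged.size();
    }
}
```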
  19. Approach #3 • Use Key->HyperLogLog database • Key = [URL, bucket, granularity] • Value = HyperLogLog structure
  20. Approach #3 • Queries: • Compute minimal number of database lookups to satisfy range • Get all HyperLogLog structures in range • Merge structures together • Retrieve count from merged structure
  21. Approach #3 • All benefits of #2 • Minimal number of lookups for any range, so less variation in latency • Minimal increase in storage • Requires more work at write time
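
One way to compute the minimal set of lookups for Approach #3 is a greedy cover over nested granularities. The specific granularities below (hour, 4-hour, day, 30-day) are assumptions for illustration; the slides only require that each [URL, bucket, granularity] key has its own HyperLogLog structure:

```java
import java.util.ArrayList;
import java.util.List;

// Plans the [URL, bucket, granularity] lookups covering an hour range, assuming
// nested granularities where each coarser bucket is aligned to the finer ones.
class BucketPlanner {
    // Granularities in hours, coarsest first (assumed values, not from the slides).
    private static final int[] GRANULARITIES = {24 * 30, 24, 4, 1};

    record Lookup(String url, int bucket, int granularity) {}

    static List<Lookup> plan(String url, int startHour, int endHour) {
        List<Lookup> lookups = new ArrayList<>();
        int hour = startHour;
        while (hour <= endHour) {
            for (int g : GRANULARITIES) {
                // Use the coarsest bucket that starts here and fits entirely in the range.
                if (hour % g == 0 && hour + g - 1 <= endHour) {
                    lookups.add(new Lookup(url, hour / g, g));
                    hour += g;
                    break;
                }
            }
        }
        return lookups;
    }
}
```

Each planned lookup returns one HyperLogLog structure; merging them as in Approach #2 yields the count with far fewer lookups for large ranges.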
  22. Hard problem
  23. struct Equiv { UserID id1, UserID id2 } struct PageView { UserID id, String url, Timestamp timestamp }
  24. Diagram: Person A, Person B
  25. Implement: function NumUniqueVisitors(String url, int startHour, int endHour)
  26. [“foo.com/page1”, 0] -> {A, B, C}; [“foo.com/page1”, 1] -> {B}; [“foo.com/page1”, 2] -> {A, C, D, E}; ...; [“foo.com/page1”, 1002] -> {A, B, C, F, Z}
  27. Same buckets, with a new equiv arriving: A <-> C
  28. Any single equiv could change any bucket
  29. No way to take advantage of HyperLogLog
  30. Approach #1 • [URL, hour] -> Set of PersonIDs • UserID -> Set of buckets • Indexes to incrementally normalize UserIDs into PersonIDs
  31. Approach #1 • Getting complicated • Large indexes • Operations require a lot of work
  32. Approach #2 • [URL, bucket] -> Set of UserIDs • UserID -> PersonID • Like Approach #1, incrementally normalize UserIDs
  33. Approach #2 • Query: • Retrieve all UserID sets for range • Merge sets together • Convert UserIDs -> PersonIDs to produce new set • Get count of new set
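
A sketch of that query path for the hard problem's Approach #2, assuming hypothetical interfaces for the two indexes ([URL, bucket] -> Set of UserIDs and UserID -> PersonID):

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical indexes for the hard problem, Approach #2 (names are assumptions).
interface BucketIndex {
    Set<Long> getUserIds(String url, int hourBucket);  // [URL, bucket] -> Set of UserIDs
}
interface PersonIndex {
    long toPersonId(long userId);                       // UserID -> PersonID
}

class HardApproach2Query {
    private final BucketIndex buckets;
    private final PersonIndex persons;

    HardApproach2Query(BucketIndex buckets, PersonIndex persons) {
        this.buckets = buckets;
        this.persons = persons;
    }

    long numUniqueVisitors(String url, int startHour, int endHour) {
        Set<Long> userIds = new HashSet<>();
        for (int hour = startHour; hour <= endHour; hour++) {
            userIds.addAll(buckets.getUserIds(url, hour));
        }
        // Convert to PersonIDs so that equivalent UserIDs are counted only once.
        Set<Long> personIds = new HashSet<>();
        for (long userId : userIds) {
            personIds.add(persons.toPersonId(userId));
        }
        return personIds.size();
    }
}
```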
  34. Incremental UserID normalization
  35. Attempt 1: • Maintain index from UserID -> PersonID • When we receive A <-> B: • Find what each is normalized to, and transitively normalize all reachable IDs to the “smallest” value
  36. Example: receive 1 <-> 4: produces 1 -> 1, 4 -> 1; receive 2 <-> 5: produces 5 -> 2, 2 -> 2; receive 5 <-> 3: produces 3 -> 2; receive 4 <-> 5: produces 5 -> 1 and 2 -> 1, but 3 -> 1 never gets produced!
  37. Attempt 2: • Maintain UserID -> PersonID and PersonID -> Set of UserIDs • When we receive A <-> B: • Find what each is normalized to, and choose one for both to be normalized to • Update all UserIDs in both normalized sets
  38. Example: receive 1 <-> 4: 1 -> 1, 4 -> 1, 1 -> {1, 4}; receive 2 <-> 5: 5 -> 2, 2 -> 2, 2 -> {2, 5}; receive 5 <-> 3: 3 -> 2, 2 -> {2, 3, 5}; receive 4 <-> 5: 5 -> 1, 2 -> 1, 3 -> 1, 1 -> {1, 2, 3, 4, 5}
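
A minimal in-memory sketch of Attempt 2, reproducing the slide 38 example; a real implementation would keep both indexes in a database and handle the fault-tolerance and concurrency challenges noted next:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Attempt 2: maintain UserID -> PersonID and PersonID -> Set of UserIDs together.
class IncrementalNormalizer {
    private final Map<Long, Long> userToPerson = new HashMap<>();
    private final Map<Long, Set<Long>> personToUsers = new HashMap<>();

    void receiveEquiv(long a, long b) {
        long personA = personOf(a);
        long personB = personOf(b);
        if (personA == personB) return;

        // Choose the "smallest" PersonID and repoint every UserID in the other set,
        // so that 3 -> 1 is produced in the slide 38 example.
        long winner = Math.min(personA, personB);
        long loser = Math.max(personA, personB);
        Set<Long> reassigned = personToUsers.remove(loser);
        for (long userId : reassigned) {
            userToPerson.put(userId, winner);
        }
        personToUsers.get(winner).addAll(reassigned);
    }

    private long personOf(long userId) {
        Long person = userToPerson.get(userId);
        if (person != null) return person;
        // First time we see this UserID: it is its own person.
        userToPerson.put(userId, userId);
        personToUsers.computeIfAbsent(userId, id -> new HashSet<>()).add(userId);
        return userId;
    }
}
```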
  39. Challenges • Fault-tolerance / ensuring consistency between indexes • Concurrency challenges
  40. General challenges with traditional architectures • Redundant storage of information (“denormalization”) • Brittle to human error • Operational challenges of enormous installations of very complex databases
  41. Master Dataset -> Stream processor -> Indexes for uniques over time: no fully incremental approach will work!
  42. Let’s take a completely different approach!
  43. Some Rough Definitions • Complicated: lots of parts
  44. Some Rough Definitions • Complex: intertwinement between separate functions
  45. Some Rough Definitions • Simple: the opposite of complex
  46. Real World Example
  47. Normalization vs Denormalization (normalized schema) • Person table (ID, Name, Location ID): (1, Sally, 3), (2, George, 1), (3, Bob, 3) • Location table (Location ID, City, State, Population): (1, New York, NY, 8.2M), (2, San Diego, CA, 1.3M), (3, Chicago, IL, 2.7M)
  48. Join is too expensive, so denormalize...
  49. Denormalized schema • Person table (ID, Name, Location ID, City, State): (1, Sally, 3, Chicago, IL), (2, George, 1, New York, NY), (3, Bob, 3, Chicago, IL) • Location table (Location ID, City, State, Population): (1, New York, NY, 8.2M), (2, San Diego, CA, 1.3M), (3, Chicago, IL, 2.7M)
  50. Complexity: data model robustness and query performance become intertwined
  51. Allow queries to be out of date by hours
  52. Store every Equiv and PageView in the Master Dataset
  53. Master Dataset -> continuously recompute -> Indexes for uniques over time
  54. Indexes = function(all data)
  55. Iterative graph algorithm
  56. Join
  57. Basic aggregation
  58. Sidenote on tooling • Batch processing systems are tools to implement function(all data) scalably • Implementing this is easy
  59. Diagram: UserID normalization (equiv graph resolving to Person 1 and Person 6)
  60. UserID normalization
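
A minimal in-memory sketch of the iterative graph algorithm for batch UserID normalization: repeatedly propagate the smallest reachable ID along equiv edges until nothing changes. In the batch layer each pass would be a job over the full set of Equiv records (e.g. in a MapReduce-style system); the names here are assumptions:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Iterative normalization: every UserID converges to the smallest ID reachable
// through equiv edges, which becomes its PersonID.
class BatchNormalizer {
    static Map<Long, Long> normalize(List<long[]> equivs) {
        Map<Long, Long> personOf = new HashMap<>();
        for (long[] e : equivs) {
            personOf.putIfAbsent(e[0], e[0]);
            personOf.putIfAbsent(e[1], e[1]);
        }
        boolean changed = true;
        while (changed) {
            changed = false;
            for (long[] e : equivs) {
                long a = personOf.get(e[0]);
                long b = personOf.get(e[1]);
                long min = Math.min(a, b);
                if (a != min) { personOf.put(e[0], min); changed = true; }
                if (b != min) { personOf.put(e[1], min); changed = true; }
            }
        }
        return personOf;   // UserID -> PersonID
    }
}
```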
  61. Conclusions • Easy to understand and implement • Scalable • Concurrency / fault-tolerance easily abstracted away from you • Great query performance
  62. Conclusions • But... always out of date
  63. Timeline diagram: data absorbed into batch views vs. not yet absorbed (recent data, up to now); the unabsorbed part is just a small percentage of the data!
  64. Master Dataset -> Batch views; New Data -> Realtime views; Query
  65. Get historical buckets from batch views and recent buckets from realtime views
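
A sketch of that query-time merge, assuming the HyperLogLog interface from slide 14 and a hypothetical BucketView interface exposed by both the batch and realtime views:

```java
// Both layers expose the same per-bucket view (assumed interface, not from the slides).
interface BucketView {
    HyperLogLog getHll(String url, int hourBucket);
}

class LambdaQuery {
    private final BucketView batchViews;
    private final BucketView realtimeViews;
    private final int lastHourInBatch;   // most recent hour already absorbed by the batch layer

    LambdaQuery(BucketView batchViews, BucketView realtimeViews, int lastHourInBatch) {
        this.batchViews = batchViews;
        this.realtimeViews = realtimeViews;
        this.lastHourInBatch = lastHourInBatch;
    }

    long numUniqueVisitors(String url, int startHour, int endHour) {
        HyperLogLog merged = null;
        for (int hour = startHour; hour <= endHour; hour++) {
            // Historical buckets come from the batch views, recent ones from the realtime views.
            BucketView layer = (hour <= lastHourInBatch) ? batchViews : realtimeViews;
            HyperLogLog hll = layer.getHll(url, hour);
            merged = (merged == null) ? hll : merged.merge(hll);
        }
        return merged == null ? 0 : merged.size();
    }
}
```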
  66. Implementing the realtime layer • Isn’t this the exact same problem we faced before we went down the path of batch computation?
  67. Approach #1 • Use the exact same approach as the fully incremental implementation • Query performance only degraded for recent buckets • e.g., a “last month” range computes the vast majority of the query from efficient batch indexes
  68. Approach #1 • Relatively small number of buckets in realtime layer • So not that much effect on storage costs
  69. Approach #1 • Complexity of realtime layer is softened by existence of batch layer • Batch layer continuously overrides realtime layer, so mistakes are auto-fixed
  70. Approach #1 • Still going to be a lot of work to implement this realtime layer • Recent buckets with lots of uniques will still cause bad query performance • No way to apply recent equivs to batch views without restructuring batch views
  71. Approach #2 • Approximate! • Ignore realtime equivs
  72. Approach #2 data flow: Pageview -> Convert UserID to PersonID (using the UserID -> PersonID index from batch) -> [URL, bucket] -> HyperLogLog
  73. Approach #2 • Highly efficient • Great performance • Easy to implement
  74. Approach #2 • Only inaccurate for recent equivs • Intuitively, shouldn’t be that much inaccuracy • Should quantify the additional error
  75. Approach #2 • Extra inaccuracy is automatically weeded out over time • “Eventual accuracy”
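
A sketch of the realtime-layer update path for Approach #2: each incoming pageview is resolved against the batch-produced UserID -> PersonID index (ignoring equivs that arrived since the last batch run) and added to the bucket's HyperLogLog sketch. The interface names are assumptions:

```java
// Batch-produced index; a UserID unknown to the batch layer maps to itself.
interface BatchPersonIndex {
    long toPersonId(long userId);
}

// Realtime view: [URL, hour bucket] -> HyperLogLog (assumed interface).
interface RealtimeHllStore {
    HyperLogLog getOrCreate(String url, int hourBucket);
}

class RealtimeUpdater {
    private final BatchPersonIndex personIndex;
    private final RealtimeHllStore store;

    RealtimeUpdater(BatchPersonIndex personIndex, RealtimeHllStore store) {
        this.personIndex = personIndex;
        this.store = store;
    }

    // Called by the stream processor for every new PageView.
    void onPageView(long userId, String url, int hourBucket) {
        long personId = personIndex.toPersonId(userId);   // recent equivs are ignored
        store.getOrCreate(url, hourBucket).add(personId);
    }
}
```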
  76. Simplicity
  77. Input: normalize vs. denormalize. Output: data model robustness and query performance
  78. Master Dataset (normalized: robust data model) -> Batch views (denormalized: optimized for queries)
  79. Normalization problem solved • Maintaining consistency in views is easy because they are defined as function(all data) • Can recompute if anything ever goes wrong
  80. Human fault-tolerance
  81. Complexity of Read/Write Databases
  82. Black box fallacy
  83. Incremental compaction • Databases write to a write-ahead log before modifying disk and memory indexes • Need to occasionally compact the log and indexes
  84. Diagram: memory, disk, write-ahead log
  85. Diagram: memory, disk, write-ahead log (continued)
  86. Diagram: memory, disk, write-ahead log, compaction
  87. Incremental compaction • Notorious for causing huge, sudden changes in performance • Machines can seem locked up • Necessitated by random writes • Extremely complex to deal with
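
A toy sketch of the write path being described, not modeled on any particular database: every write is appended to a log before the in-memory index is mutated, and the log and on-disk indexes must later be compacted (not shown), which is where the operational pain comes from:

```java
import java.io.FileWriter;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Toy write-ahead-log write path for a key/value store.
class ToyWriteAheadLog {
    private final FileWriter log;
    private final Map<String, String> memIndex = new HashMap<>();

    ToyWriteAheadLog(String logPath) throws IOException {
        this.log = new FileWriter(logPath, true);   // append-only log on disk
    }

    void put(String key, String value) throws IOException {
        log.write(key + "\t" + value + "\n");
        log.flush();                // make the write durable first
        memIndex.put(key, value);   // then update the in-memory index
        // Periodic compaction would fold the log into the on-disk index and truncate it.
    }
}
```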
  88. More Complexity • Dealing with CAP / eventual consistency • The “Call Me Maybe” blog posts found data loss problems in many popular databases • Redis • Cassandra • ElasticSearch
  89. Master Dataset -> Batch views; New Data -> Realtime views; Query
  90. Master Dataset -> Batch views; New Data -> Realtime views; Query. No random writes!
  91. Master Dataset -> Stream processor -> R/W databases
  92. (Synchronous version) Diagram: Application, Master Dataset, R/W databases
  93. Lambda Architecture: Master Dataset -> Batch views; New Data -> Realtime views; Query
  94. Lambda = Function. Query = Function(All Data)
  95. Lambda Architecture • This is the most basic form of it • Many variants incorporate more and/or different kinds of layers
