Using Simplicity to Make Hard Big Data Problems Easy

  1. Using simplicity to make hard Big Data problems easy
  2. Master Dataset → Batch views; New Data → Realtime views; Query merges both (Lambda Architecture)
  3. Master Dataset → Stream processor → R/W databases: proposed alternative (problematic)
  4. Easy problem
  5. struct PageView { UserID id, String url, Timestamp timestamp }
  6. Implement: function NumUniqueVisitors( String url, int startHour, int endHour)
  7. Unique visitors over a range of hours
  8. Notes: • Not limiting ourselves to current tooling • Reasonable variations of existing tooling are acceptable • Interested in what’s fundamentally possible
  9. Traditional Architectures. Synchronous: Application → Databases. Asynchronous: Application → Queue → Stream processor → Databases
  10. Approach #1 • Use Key->Set database • Key = [URL, hour bucket] • Value = Set of UserIDs
  11. Approach #1 • Queries: • Get all sets for all hours in range of query • Union sets together • Compute count of merged set
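      A minimal sketch of this query path, assuming a hypothetical SetStore interface for the Key->Set database (the interface name and String UserIDs are illustrative, not from the talk):

          import java.util.HashSet;
          import java.util.Set;

          // Hypothetical key->set database: get(url, hour) returns the UserID set
          // stored under the [URL, hour bucket] key (UserIDs shown as Strings here).
          interface SetStore {
              Set<String> get(String url, int hourBucket);
          }

          class Approach1 {
              // Union the per-hour sets, then count the merged set.
              static long numUniqueVisitors(SetStore db, String url, int startHour, int endHour) {
                  Set<String> merged = new HashSet<>();
                  for (int hour = startHour; hour <= endHour; hour++) {
                      merged.addAll(db.get(url, hour));   // one database lookup per hour in the range
                  }
                  return merged.size();
              }
          }

      Every hour in the range is a separate lookup, and the full sets travel to the query: exactly the costs the next slide calls out.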
  12. Approach #1 • Lots of database lookups for large ranges • Potentially a lot of items in sets, so lots of work to merge/count • Database will use a lot of space
  13. Approach #2 Use HyperLogLog
  14. interface HyperLogLog { boolean add(Object o); long size(); HyperLogLog merge(HyperLogLog... otherSets); }
  15. Approach #2 • Use Key->HyperLogLog database • Key = [URL, hour bucket] • Value = HyperLogLog structure
  16. Approach #2 • Queries: • Get all HyperLogLog structures for all hours in range of query • Merge structures together • Retrieve count from merged structure
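      A sketch of the same query against the Key->HyperLogLog store, using the HyperLogLog interface from slide 14 (HyperLogLogStore is an assumed lookup interface):

          // Hypothetical key->HyperLogLog database, keyed by [URL, hour bucket].
          interface HyperLogLogStore {
              HyperLogLog get(String url, int hourBucket);
          }

          class Approach2 {
              // Merge the per-hour HyperLogLog structures, then read the approximate count.
              static long numUniqueVisitors(HyperLogLogStore db, String url, int startHour, int endHour) {
                  HyperLogLog merged = db.get(url, startHour);
                  for (int hour = startHour + 1; hour <= endHour; hour++) {
                      merged = merged.merge(db.get(url, hour));   // fixed-size merge, no raw UserIDs touched
                  }
                  return merged.size();                           // approximate number of uniques
              }
          }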
  17. Approach #2 • Much more efficient use of storage • Less work at query time • Mild accuracy tradeoff
  18. Approach #2 • Large ranges still require lots of database lookups / work
  19. Approach #3 • Use Key->HyperLogLog database • Key = [URL, bucket, granularity] • Value = HyperLogLog structure
  20. Approach #3 • Queries: • Compute minimal number of database lookups to satisfy range • Get all HyperLogLog structures in range • Merge structures together • Retrieve count from merged structure
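      One way to sketch the “minimal number of database lookups” step: a greedy planner that walks the range and always takes the coarsest aligned bucket that still fits. The class name and granularity sizes are assumptions; each returned pair becomes one HyperLogLog fetch, merged as in Approach #2:

          import java.util.ArrayList;
          import java.util.List;

          // Greedy sketch: cover [startHour, endHour] with the coarsest aligned bucket at each step.
          class RangePlanner {
              static final int[] GRANULARITIES = {24 * 28, 24 * 7, 24, 1};   // 4-week, week, day, hour (in hours)

              // Returns [granularityInHours, bucketStartHour] pairs whose buckets exactly cover the range.
              static List<int[]> plan(int startHour, int endHour) {
                  List<int[]> lookups = new ArrayList<>();
                  int cursor = startHour;
                  while (cursor <= endHour) {
                      for (int g : GRANULARITIES) {
                          // A bucket of size g is usable if it starts on a g-aligned boundary
                          // and ends inside the requested range; g = 1 always matches.
                          if (cursor % g == 0 && cursor + g - 1 <= endHour) {
                              lookups.add(new int[] {g, cursor});
                              cursor += g;
                              break;
                          }
                      }
                  }
                  return lookups;
              }
          }

      For example, plan(168, 335) returns a single week bucket instead of 168 hour buckets.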
  21. Approach #3 • All benefits of #2 • Minimal number of lookups for any range, so less variation in latency • Minimal increase in storage • Requires more work at write time
  22. Hard problem
  23. struct Equiv { UserID id1, UserID id2 } struct PageView { UserID id, String url, Timestamp timestamp }
  24. Person A Person B
  25. Implement: function NumUniqueVisitors( String url, int startHour, int endHour)
  26. [“foo.com/page1”, 0] -> {A, B, C}; [“foo.com/page1”, 1] -> {B}; [“foo.com/page1”, 2] -> {A, C, D, E}; ... [“foo.com/page1”, 1002] -> {A, B, C, F, Z}
  27. [“foo.com/page1”, 0] -> {A, B, C}; [“foo.com/page1”, 1] -> {B}; [“foo.com/page1”, 2] -> {A, C, D, E}; ... [“foo.com/page1”, 1002] -> {A, B, C, F, Z}, plus a new equiv: A <-> C
  28. Any single equiv could change any bucket
  29. No way to take advantage of HyperLogLog
  30. Approach #1 • [URL, hour] -> Set of PersonIDs • UserID -> Set of buckets • Indexes to incrementally normalize UserIDs into PersonIDs
  31. Approach #1 • Getting complicated • Large indexes • Operations require a lot of work
  32. Approach #2 • [URL, bucket] -> Set of UserIDs • Like Approach #1, incrementally normalize UserIDs • UserID -> PersonID
  33. Approach #2 • Query: • Retrieve all UserID sets for range • Merge sets together • Convert UserIDs -> PersonIDs to produce new set • Get count of new set
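      A sketch of this query, reusing the hypothetical SetStore interface from the earlier Approach #1 sketch; PersonIndex is an assumed UserID -> PersonID lookup maintained by the incremental normalization described next:

          import java.util.HashSet;
          import java.util.Set;

          // Hypothetical UserID -> PersonID lookup (the normalization index).
          interface PersonIndex {
              String personIdFor(String userId);
          }

          class HardApproach2 {
              // Merge per-hour UserID sets, normalize to PersonIDs at query time, then count.
              static long numUniqueVisitors(SetStore db, PersonIndex people,
                                            String url, int startHour, int endHour) {
                  Set<String> persons = new HashSet<>();
                  for (int hour = startHour; hour <= endHour; hour++) {
                      for (String userId : db.get(url, hour)) {
                          persons.add(people.personIdFor(userId));
                      }
                  }
                  return persons.size();   // uniques counted per person, not per UserID
              }
          }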
  34. Incremental UserID normalization
  35. Attempt 1: • Maintain index from UserID -> PersonID • When receiving A <-> B: • Find what each is normalized to, and transitively normalize all reachable IDs to the “smallest” value
  36. Attempt 1 trace:
      1 <-> 4: produces 1 -> 1, 4 -> 1
      2 <-> 5: produces 5 -> 2, 2 -> 2
      5 <-> 3: produces 3 -> 2
      4 <-> 5: produces 5 -> 1, 2 -> 1 ... but 3 -> 1 never gets produced!
  37. Attempt 2: • UserID -> PersonID • PersonID -> Set of UserIDs • When receiving A <-> B: • Find what each is normalized to, and choose one for both to be normalized to • Update all UserIDs in both normalized sets
  38. Attempt 2 trace:
      1 <-> 4: 1 -> 1, 4 -> 1, 1 -> {1, 4}
      2 <-> 5: 5 -> 2, 2 -> 2, 2 -> {2, 5}
      5 <-> 3: 3 -> 2, 2 -> {2, 3, 5}
      4 <-> 5: 5 -> 1, 2 -> 1, 3 -> 1, 1 -> {1, 2, 3, 4, 5}
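      A small in-memory sketch of Attempt 2, with two plain maps standing in for the two indexes; a real version would keep these in a database and face the fault-tolerance and locking issues on the next slide. UserIDs appear as Strings:

          import java.util.HashMap;
          import java.util.HashSet;
          import java.util.Map;
          import java.util.Set;

          class Normalizer {
              final Map<String, String> personOf = new HashMap<>();        // UserID -> PersonID
              final Map<String, Set<String>> membersOf = new HashMap<>();  // PersonID -> set of UserIDs

              void addEquiv(String a, String b) {
                  String pa = personOf.getOrDefault(a, a);                 // what a is currently normalized to
                  String pb = personOf.getOrDefault(b, b);                 // what b is currently normalized to
                  if (pa.equals(pb)) return;                               // already normalized together
                  String keep = pa.compareTo(pb) < 0 ? pa : pb;            // choose the "smallest" PersonID
                  String drop = keep.equals(pa) ? pb : pa;
                  Set<String> merged = membersOf.computeIfAbsent(keep, k -> new HashSet<>(Set.of(k)));
                  Set<String> moving = membersOf.getOrDefault(drop, new HashSet<>(Set.of(drop)));
                  for (String userId : moving) {                           // re-point every member of the dropped person
                      personOf.put(userId, keep);
                      merged.add(userId);
                  }
                  personOf.put(a, keep);
                  personOf.put(b, keep);
                  merged.add(a);
                  merged.add(b);
                  membersOf.remove(drop);
              }
          }

      Feeding the equivs from the trace above through addEquiv in order ends with 1 -> {1, 2, 3, 4, 5}.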
  39. Challenges • Fault-tolerance / ensuring consistency between indexes • Concurrency challenges
  40. General challenges with traditional architectures • Redundant storage of information (“denormalization”) • Brittle to human error • Operational challenges of enormous installations of very complex databases
  41. Master Dataset → Stream processor → Indexes for uniques over time: no fully incremental approach will work!
  42. Let’s take a completely different approach!
  43. Some Rough Definitions Complicated: lots of parts
  44. Some Rough Definitions Complex: intertwinement between separate functions
  45. Some Rough Definitions Simple: the opposite of complex
  46. Real World Example
  47. Normalization vs Denormalization: normalized schema
      Users table:
        ID | Name   | Location ID
        1  | Sally  | 3
        2  | George | 1
        3  | Bob    | 3
      Locations table:
        Location ID | City      | State | Population
        1           | New York  | NY    | 8.2M
        2           | San Diego | CA    | 1.3M
        3           | Chicago   | IL    | 2.7M
  48. Join is too expensive, so denormalize...
  49. Denormalized schema
      Users table (city/state stored redundantly):
        ID | Name   | Location ID | City     | State
        1  | Sally  | 3           | Chicago  | IL
        2  | George | 1           | New York | NY
        3  | Bob    | 3           | Chicago  | IL
      Locations table:
        Location ID | City      | State | Population
        1           | New York  | NY    | 8.2M
        2           | San Diego | CA    | 1.3M
        3           | Chicago   | IL    | 2.7M
  50. Complexity: robust data model and query performance are intertwined
  51. Allow queries to be out of date by hours
  52. Store every Equiv and PageView in the Master Dataset
  53. Master Dataset → continuously recompute indexes → Indexes for uniques over time
  54. Indexes = function(all data)
  55. Iterative graph algorithm
  56. Join
  57. Basic aggregation
  58. Sidenote on tooling • Batch processing systems are tools to implement function(all data) scalably • Implementing this is easy
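      To make “function(all data)” concrete, here is the whole recomputation as plain in-memory Java, reusing the Normalizer sketch above. A real batch job would express the same three steps (iterative graph algorithm, join, basic aggregation) in a batch-processing framework and emit HyperLogLog structures instead of plain sets; the record fields are simplifications of the structs on slide 23:

          import java.util.HashMap;
          import java.util.HashSet;
          import java.util.List;
          import java.util.Map;
          import java.util.Set;

          class BatchRecompute {
              record Equiv(String id1, String id2) {}
              record PageView(String userId, String url, long timestampSecs) {}

              // Recompute every [URL, hour bucket] -> set-of-PersonIDs view from the master dataset.
              static Map<String, Set<String>> buildBatchViews(List<Equiv> equivs, List<PageView> views) {
                  Normalizer people = new Normalizer();                       // step 1: UserID normalization
                  for (Equiv e : equivs) people.addEquiv(e.id1(), e.id2());

                  Map<String, Set<String>> index = new HashMap<>();
                  for (PageView v : views) {
                      String person = people.personOf.getOrDefault(v.userId(), v.userId());  // step 2: join
                      String key = v.url() + "|" + (v.timestampSecs() / 3600);               // [URL, hour bucket]
                      index.computeIfAbsent(key, k -> new HashSet<>()).add(person);          // step 3: aggregate
                  }
                  return index;
              }
          }

      Because everything is recomputed from scratch, the concurrency and consistency problems of the incremental attempts never come up.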
  59. UserID normalization (diagram: UserIDs grouped into Person 1 and Person 6)
  60. UserID normalization
  61. Conclusions • Easy to understand and implement • Scalable • Concurrency / fault-tolerance easily abstracted away from you • Great query performance
  62. Conclusions • But... always out of date
  63. (Timeline diagram: data absorbed into batch views vs. not absorbed, with “Now” marked on the time axis) Just a small percentage of data!
  64. Master Dataset → Batch views; New Data → Realtime views; Query merges both
  65. Get historical buckets from batch views and recent buckets from realtime views
  66. Implementing realtime layer • Isn’t this the exact same problem we faced before we went down the path of batch computation?
  67. Approach #1 • Use the exact same approach as we did in fully incremental implementation • Query performance only degraded for recent buckets • e.g., “last month” range computes vast majority of query from efficient batch indexes
  68. Approach #1 • Relatively small number of buckets in realtime layer • So not that much effect on storage costs
  69. Approach #1 • Complexity of realtime layer is softened by existence of batch layer • Batch layer continuously overrides realtime layer, so mistakes are auto-fixed
  70. Approach #1 • Still going to be a lot of work to implement this realtime layer • Recent buckets with lots of uniques will still cause bad query performance • No way to apply recent equivs to batch views without restructuring batch views
  71. Approach #2 • Approximate! • Ignore realtime equivs
  72. Approach #2 (diagram): Pageview → convert UserID to PersonID (using the UserID -> PersonID map from batch) → [URL, bucket] -> HyperLogLog
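      A sketch of this realtime path as stream-processor logic. BatchPersonIndex and RealtimeHyperLogLogStore are assumed interfaces; only HyperLogLog comes from slide 14:

          // Read-only UserID -> PersonID map produced by the batch layer, and a realtime
          // store of per-bucket HyperLogLogs. Both interfaces are hypothetical.
          interface BatchPersonIndex {
              String personIdFor(String userId);      // assumed to return the UserID itself if no mapping exists
          }

          interface RealtimeHyperLogLogStore {
              HyperLogLog getOrCreate(String url, int hourBucket);
          }

          class RealtimeLayer {
              // Handle one incoming pageview: normalize with the (possibly stale) batch map, update the bucket.
              static void handlePageView(BatchPersonIndex people, RealtimeHyperLogLogStore store,
                                         String userId, String url, long timestampSecs) {
                  String person = people.personIdFor(userId);         // ignores equivs newer than the last batch run
                  int hourBucket = (int) (timestampSecs / 3600);
                  store.getOrCreate(url, hourBucket).add(person);     // error here is weeded out by the next batch run
              }
          }

      Any inaccuracy from a stale UserID -> PersonID mapping lives only in the realtime views until the next batch run replaces them.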
  73. Approach #2 • Highly efficient • Great performance • Easy to implement
  74. Approach #2 • Only inaccurate for recent equivs • Intuitively, shouldn’t be that much inaccuracy • Should quantify additional error
  75. Approach #2 • Extra inaccuracy is automatically weeded out over time • “Eventual accuracy”
  76. Simplicity
  77. Input: normalize/denormalize. Outputs: data model robustness, query performance
  78. Master Dataset (normalized: robust data model) → Batch views (denormalized: optimized for queries)
  79. Normalization problem solved • Maintaining consistency in views easy because defined as function(all data) • Can recompute if anything ever goes wrong
  80. Human fault-tolerance
  81. Complexity of Read/Write Databases
  82. Black box fallacy
  83. Incremental compaction • Databases write to write-ahead log before modifying disk and memory indexes • Need to occasionally compact the log and indexes
  84. (Diagram) Memory / Disk / Write-ahead log
  85. (Diagram) Memory / Disk / Write-ahead log
  86. (Diagram) Memory / Disk / Write-ahead log: compaction
  87. Incremental compaction • Notorious for causing huge, sudden changes in performance • Machines can seem locked up • Necessitated by random writes • Extremely complex to deal with
  88. More Complexity • Dealing with CAP / eventual consistency • “Call Me Maybe” blog posts found data loss problems in many popular databases • Redis • Cassandra • ElasticSearch
  89. Master Dataset → Batch views; New Data → Realtime views; Query merges both
  90. Master Dataset → Batch views; New Data → Realtime views; Query merges both. No random writes!
  91. Master Dataset → Stream processor → R/W databases
  92. Master Dataset, Application, R/W databases (synchronous version)
  93. Master Dataset → Batch views; New Data → Realtime views; Query merges both (Lambda Architecture)
  94. Lambda = Function Query = Function(All Data)
  95. Lambda Architecture • This is the most basic form of it • Many variants incorporate more and/or different kinds of layers

Editor's Notes

  1. clear up confusion around it. lambda architecture addresses a lot of nasty, fundamental complexities that aren’t talked about enough. most of the talk won’t even be about the LA; we’ll work on an example problem and you’ll see the LA naturally emerge
  2. this isn’t even capable of solving the problem we’re going to look at
  3. i want this talk to be interactive... going deep into technical details please do not hesitate to jump in with any questions
  4. uniques for just hour 1 = 3; uniques for hours 1 and 2 = 3; uniques for hours 1 to 3 = 5; uniques for hours 2 to 4 = 4
  5. synchronous vs. asynchronous; both are characterized by maintaining state incrementally as data comes in and serving queries off of that same state
  6. 1 KB to estimate size up to 1B with only 2% error
  7. it’s not a stretch to imagine a database that can do hyperloglog natively, so updates don’t require fetching the entire set
  8. it’s not a stretch to imagine a database that can do hyperloglog natively, so updates don’t require fetching the entire set
  9. example: in 1 month there are ~720 hours, 30 days, 4 weeks, 1 month... adding all granularities makes 755 stored values total instead of 720 values, only a ~4.8% increase in storage
  10. except now userids should be normalized, so if there’s an equiv, that user only counts once even if they appear under multiple ids
  11. equiv can change ANY or ALL buckets in the past
  12. will get back to incrementally updating userids
  13. will get back to incrementally updating userids
  14. offload a lot of the work to read time
  15. this is still a lot of work at read time overall
  16. if using a distributed database to store the indexes and computing everything concurrently, then when equivs for 4<->3 and 3<->1 are received at the same time, you will need some sort of locking so they don’t step on each other
  17. e.g. granularities, the 2 indexes for user id normalization... we know it’s a bad idea to store the same thing in multiple places... it opens up the possibility of them getting out of sync if you don’t handle every case perfectly. if you have a bug that accidentally sets the second value of all equivs to 1, you’re in trouble. even the version without equivs suffers from these problems
  18. 2 functions: produce water of a certain strength, and produce water of a certain temperature. the faucet on the left gives you “hot” and “cold” inputs which each affect BOTH outputs - complex to use. the faucet on the right gives you independent “heat” and “strength” inputs, so it is SIMPLE to use. neither is very complicated
  19. so just a quick overview of denormalization. here’s a schema that stores user information and location information. each is in its own table, and a user’s location is a reference to a row in the location table. this is pretty standard relational database stuff. now let’s say a really common query is getting the city and state a person lives in. to do this you have to join the tables together as part of your query
  20. you might find joins are too expensive, they use too many resources
  21. so you denormalize the schema for performance: you redundantly store the city and state in the users table to make that query faster, because now it doesn’t require a join. now obviously, this sucks. the same data is now stored in multiple places, which we all know is a bad idea. whenever you need to change something about a location you need to change it everywhere it’s stored, but since people make mistakes, inevitably things become inconsistent. but you have no choice: you want to normalize, but you have to denormalize for performance
  22. i hope you are looking at this and asking the question... we still have to compute uniques over time and deal with the equivs problem, so how are we better off than before?
  23. options for taking different approaches to the problem without having to sacrifice too much
  24. people say it does “key/value”, so I can use it when I need key/value operations... and they stop there. you can’t treat it as a black box; that doesn’t tell the full story
  25. some of his tests were seeing over 30% data loss during partitions
  26. major operational simplification to not require random writes. i’m not saying you can’t make a database that does incremental compaction and deals with the other complexities of random writes well, but it’s clearly a fundamental complexity, and i feel it’s better to not have to deal with it at all. remember, we’re talking about what’s POSSIBLE, not what currently exists. my experience with elephantdb
  27. Does not avoid any of the complexities of massive distributed r/w databases
  28. Does not avoid any of the complexities of massive distributed r/w databases or dealing with eventual consistency
  29. everything i’ve talked about completely generalizes, applies to both AP and CP architectures