Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Physical Design for Non-Relational Data Systems

10 views

Published on

My thesis defense talk

Published in: Technology
  • Be the first to comment

  • Be the first to like this

Physical Design for Non-Relational Data Systems

  1. 1. Physical Design for Non-relational Data Systems Michael Mior • University of Waterloo
  2. 2. Proper design and configuration of data systems is critical for achieving good performance 2
  3. 3. 3 Many tools exist for relational database design optimization Source: https://www.databasejournal.com/features/mssql/article.php/10894_3523616_2/Index-Tuning-Wizard.htm https://dev.mysql.com/doc/mysql-monitor/4.0/en/mem-qanal-using-ui.html Microsoft AutoAdmin (1998) DB2 Design Advisor (2004) Oracle SQL Tuning (2004)
  4. 4. We want applications to be up 24/7 We're frequently dealing with changing data or with unstructured data We require sub-second responses to queries 4 Source: Mike Loukides, VP Content Strategy, O’Reilly Media Relational databases are not always sufficient for these uses
  5. 5. “Over 30 years, we've learned how to write business intelligence applications on top of relational databases -- there are patterns. With NoSQL today, we have no cookie cutters. We don't have any blueprints.” --Ravi Krishnappa, NetApp solutions architect 5 Source: TechTarget, 2015
  6. 6. • NoSQL Database Design Optimization • Understanding Existing NoSQL Designs • Optimizing Big Data Applications
  7. 7. • NoSQL Database Design Optimization • Understanding Existing NoSQL Designs • Optimizing Big Data Applications
  8. 8. Model column families around query patterns But start your design with entities and relationships, if you can De-normalize and duplicate for read performance But don’t de-normalize if you don’t need to Leverage wide rows for ordering, grouping, and filtering But don’t go too wide Schema Design Best Practices Source: http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/ But But But ? ? ? 8
  9. 9. NoSQL Application Development Requirements ImplementationData Model App LogicDB Access NoSE[MSAL, ICDE ‘16] [MSAL, TKDE ‘17]9
  10. 10. Database Design Example Comment com_id com_date text User user_id nickname Post user_id post_date title 10
  11. 11. Database Design Example 11 SELECT post_id, post_title FROM users u JOIN comments c ON u.user_id = c.user_id JOIN posts p ON p.post_id = c.post_id ORDER BY p.post_date Query Find information on all posts a user has commented on in order by post date
  12. 12. Database Design Example user_id ↓ nickname comment_id post_id ↓ title post_date comment_id ↓ post_id post_date nickname nickname ↓ title post_date Execution A Execution B 12
  13. 13. NoSE Workload Query Plans Data Model 1. Candidate Enumeration 13 2. Query Planning 3. Design Optimization 4. Plan Recommendation Database Design
  14. 14. 14 3 4 5 Database Design Optimization NoSE considers all possible query plans and picks the one with minimum expected cost
  15. 15. Evaluation 15 Overall workload performance improves by 5x
  16. 16. • NoSQL Database Design Optimization • Understanding Existing NoSQL Designs • Optimizing Big Data Applications
  17. 17. Physical Logical 17 {user_id: 1, post_date: "2017-04-05", com_id: 3, …} {user_id: 2, post_date: "2017-04-05", com_id: 7, …} {post_id: 6, com_date: "2017-04-03", com_id: 3, user_id: 1, …} {post_id: 6, com_date: "2017-04-01", com_id: 7, user_id: 2, …} ? Existing NoSQL designs are a black box ?!? JSON!
  18. 18. Removes redundancy implied by both functional and inclusion dependencies Recovering Logical Schemas Extract the structure of existing data Discover dependencies Produce a logical model of the database 18
  19. 19. user_comments {░░░░░░░: ░, ░░░░░░░░░: "░░░░░░░░░░", ░░░░░░: ░, …} {░░░░░░░: ░, ░░░░░░░░░: "░░░░░░░░░░", ░░░░░░: ░, …} comments_by_date {░░░░░░░: ░, ░░░░░░░░: "░░░░░░░░░░", ░░░░░░: ░, ░░░░░░░: ░, …} {░░░░░░░: ░, ░░░░░░░░: "░░░░░░░░░░", ░░░░░░: ░, ░░░░░░░: ░, …} We want to go from raw data to a logical model Comment User Post 19 [MS, ER ‘18] (to appear)
  20. 20. 20 user_comments user_id post_date com_id post_id title 1 2017-04-05 3 6 Stargate 2 2017-04-05 7 6 Stargate Data on the same logical entity appears multiple times
  21. 21. user_comments user_id com_id post_id 1 3 6 2 7 6 posts post_date post_id title 2017-04-05 6 Stargate 21 Post data can be (logically) extracted to normalize
  22. 22. 22 user_comments_user user_id user_comments_post post_id post_date, title comments_by_date_post post_id comments_by_date_com com_id com_date, text comments_by_date_user user_id, nickname22
  23. 23. 2323 posts post_id post_date, title comments com_id com_date, text users user_id, nickname This is the original logical model! Comment User Post
  24. 24. • NoSQL Database Design Optimization • Understanding Existing NoSQL Designs • Optimizing Big Data Applications
  25. 25. Apache Spark Model ▸ Series of lazy transformations which are followed by actions that force evaluation of all transformations ▸ Each step produces a resilient distributed dataset (RDD) ▸ Intermediate results can be cached on memory or disk, optionally serialized 25
  26. 26. Caching is very useful for applications that re-use an RDD multiple times. Caching all of the generated RDDs is not a good strategy… Caching is very useful for applications that re-use an RDD multiple times. Caching all of the generated RDDs is not a good strategy… …deciding which ones to cache may be challenging. Spark Caching Best Practices Source: https://unraveldata.com/to-cache-or-not-to-cache/26
  27. 27. PageRank Example var rankGraph = graph.outerJoinVertices(...).map(...) var iteration = 0 while (iteration < numIter) { rankGraph.persist() val rankUpdates = rankGraph.aggregateMessages(...) prevRankGraph = rankGraph rankGraph = rankGraph.outerJoinVertices(rankUpdates) .persist() rankGraph.edges.foreachPartition(...) prevRankGraph.unpersist() } rankGraph.vertices.values.sum() 27
  28. 28. Transformations var rankGraph = graph var iteration = 0 while (iteration < numIter) { rankGraph.persist() val rankUpdates = rankGraph prevRankGraph = rankGraph rankGraph = rankGraph .persist() rankGraph.edges.foreachPartition(...) prevRankGraph.unpersist() } rankGraph.vertices.values.sum() .outerJoinVertices(...).map(...) .aggregateMessages(...) .outerJoinVertices(rankUpdates) 28
  29. 29. Actions var rankGraph = graph.outerJoinVertices(...).map(...) var iteration = 0 while (iteration < numIter) { rankGraph.persist() val rankUpdates = rankGraph.aggregateMessages(...) prevRankGraph = rankGraph rankGraph = rankGraph.outerJoinVertices(rankUpdates) .persist() prevRankGraph.unpersist() } rankGraph.edges.foreachPartition(...) rankGraph.vertices.values.sum() 29
  30. 30. 30 PageRank RDDs Some RDDs are used more than once
  31. 31. Spark Model Caching var rankGraph = graph.outerJoinVertices(...).map(...) var iteration = 0 while (iteration < numIter) { val rankUpdates = rankGraph.aggregateMessages(...) prevRankGraph = rankGraph rankGraph = rankGraph.outerJoinVertices(rankUpdates) rankGraph.edges.foreachPartition(...) } rankGraph.vertices.values.sum() rankGraph.persist() .persist() prevRankGraph.unpersist() 31
  32. 32. ReSpark var rankGraph = graph.outerJoinVertices(...).map(...) var iteration = 0 whileLoop (sc, iteration < numIter { rankGraph.persist() val rankUpdates = rankGraph.aggregateMessages(...) prevRankGraph = rankGraph rankGraph = rankGraph.outerJoinVertices(rankUpdates) .persist() rankGraph.edges.foreachPartition(...) prevRankGraph.unpersist() }) rankGraph.vertices.values.sum() 32
  33. 33. ReSpark var rankGraph = graph.outerJoinVertices(...).map(...) var iteration = 0 whileLoop (sc, iteration < numIter { val rankUpdates = rankGraph.aggregateMessages(...) prevRankGraph = rankGraph rankGraph = rankGraph.outerJoinVertices(rankUpdates) }) rankGraph.vertices.values.sum() 33 rankGraph: 0
  34. 34. ReSpark var rankGraph = graph.outerJoinVertices(...).map(...) var iteration = 0 whileLoop (sc, iteration < numIter { val rankUpdates = rankGraph.aggregateMessages(...) prevRankGraph = rankGraph rankGraph = rankGraph.outerJoinVertices(rankUpdates) }) rankGraph.vertices.values.sum() 34 rankGraph: 2 Persist!
  35. 35. PageRank on ReSpark 35 Without any caching, many jobs take hours!
  36. 36. 36
  37. 37. Questions? xkcd.com

×