Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Tactical data engineering

514 views

Published on

A talk given by Julian Hyde at DataCouncil SF on April 18, 2019

How do you organize your data so that your users get the right answers at the right time? That question is a pretty good definition of data engineering — but it is also describes the purpose of every DBMS (database management system). And it’s not a coincidence that these are so similar.

This talk looks at the patterns that reoccur throughout data management — such as caching, partitioning, sorting, and derived data sets. As the speaker is the author of Apache Calcite, we first look at these patterns through the lens of Relational Algebra and DBMS architecture. But then we apply these patterns to the modern data pipeline, ETL and analytics. As a case study, we look at how Looker’s “derived tables” blur the line between ETL and caching, and leverage the power of cloud databases.

Published in: Data & Analytics
  • Be the first to comment

  • Be the first to like this

Tactical data engineering

  1. 1. Tactical data engineering Julian Hyde April 17–18, 2019 San Francisco
  2. 2. @julianhyde
  3. 3. DBMS Data pipeline & analytics DBMS tricks Tactical data engineering Evolving the data pipeline Adaptive data systems
  4. 4. 1. DBMS
  5. 5. File system vs. DBMS file 1 file 2 program table 1 query table 2
  6. 6. File system vs. DBMS file 1 file 2 program file 1 file 2 program query
  7. 7. Efficient join: reorganize the data and rewrite the program sorted file 1 sorted file 2 program file 1 file 2 program (merge join) query sorted file 2
  8. 8. ● Abstraction ● Declarative language ● Planning ● Easily reorganize data, add new algorithms ● Governance ● Metadata ● Security And, I propose: ● Adaptability DBMS adds value
  9. 9. 2. Data pipeline
  10. 10. The data pipeline: Extract - Load - Transform Cloud DB table table table table table table source source
  11. 11. The data pipeline: Extract - Load - Transform Cloud DB table table table table table table source source SQL query business query interactive users
  12. 12. File system vs. DBMS file 1 file 2 program file 1 file 2 program query
  13. 13. File system vs. DBMS vs. analytic data system file 1 file 2 program file 1 file 2 program query Cloud DB table table table table table table SQL query business query
  14. 14. File system vs. DBMS vs. analytic data system file 1 file 2 program file 1 file 2 program query Cloud DB table table table table table table SQL query business query business users analystsprogrammers
  15. 15. 3. DBMS tricks
  16. 16. Re-organize data a 1 c 3 c 4 b 2 a .. b c .. c Index a 1 b 2 c 3 c 4 Sort Raw data a 1 c 4 c 3 b 2 a 1 c 3 c 4 b 2 Partition a 1 c 3 c 4 b 2 Replicate a 1 1 c 7 2 b 2 1 Summarize
  17. 17. Caching a 1 c 3 c 4 b 2 Raw data a 1 c 3 c 4 b 2 Copy of data in memory
  18. 18. Apache Calcite Apache top-level project Query planning framework used in many projects and products Also works standalone: federated query engine with SQL / JDBC front end Apache community development model calcite.apache.org github.com/apache/calcite
  19. 19. SELECT d.name, COUNT(*) AS c FROM Emps AS e JOIN Depts AS d USING (deptno) WHERE e.age < 40 GROUP BY d.deptno HAVING COUNT(*) > 5 ORDER BY c DESC Relational algebra Based on set theory, plus operators: Project, Filter, Aggregate, Union, Join, Sort Requires: declarative language (SQL), query planner Original goal: data independence Enables: query optimization, new algorithms and data structures Scan [Emps] Scan [Depts] Join [e.deptno = d.deptno] Filter [e.age < 30] Aggregate [deptno, COUNT(*) AS c] Filter [c > 5] Project [name, c] Sort [c DESC]
  20. 20. SELECT d.name, COUNT(*) AS c FROM (SELECT * FROM Emps WHERE e.age > 50) AS e JOIN Depts AS d USING (deptno) GROUP BY d.deptno HAVING COUNT(*) > 5 ORDER BY c DESC Algebraic rewrite Optimize by applying rewrite rules that preserve semantics Hopefully the result is less expensive; but it’s OK if it’s not (planner keeps “before” and “after”) Planner uses dynamic programming, seeking the lowest total cost Scan [Emps] Scan [Depts] Join [e.deptno = d.deptno] Filter [e.age > 50] Aggregate [deptno, COUNT(*) AS c] Filter [c > 5] Project [name, c] Sort [c DESC]
  21. 21. SELECT deptno, MIN(salary) FROM Managers WHERE age > 50 GROUP BY deptno Views Scan [Emps] Scan [Emps] Join [e.id = underling.manager] Project [id, deptno, salary, age] Aggregate [manager] CREATE VIEW Managers AS SELECT * FROM Emps AS e WHERE EXISTS ( SELECT * FROM Emps AS underling WHERE underling.manager = e.id) Filter [age > 50] Aggregate [deptno, MIN(salary)] Scan [Managers]
  22. 22. SELECT deptno, MIN(salary) FROM Managers WHERE age > 50 GROUP BY deptno View query (after expansion) Scan [Emps] Scan [Emps] Join [e.id = underling.manager] Project [id, deptno, salary, age] Aggregate [manager] CREATE VIEW Managers AS SELECT * FROM Emps AS e WHERE EXISTS ( SELECT * FROM Emps AS underling WHERE underling.manager = e.id) Filter [age > 50] Aggregate [deptno, MIN(salary)]
  23. 23. CREATE MATERIALIZED VIEW EmpSummary AS SELECT deptno, gender, COUNT(*) AS c, SUM(sal) AS s FROM Emps GROUP BY deptno, gender Materialized view Scan [Emps] SELECT COUNT(*) AS c FROM Emps WHERE deptno = 10 AND gender = ‘M’ Filter [deptno = 10 AND gender = ‘M’] Aggregate [COUNT(*)] Scan [EmpSummary] = Scan [Emps] Aggregate [deptno, gender, COUNT(*), SUM(salary)]
  24. 24. CREATE MATERIALIZED VIEW EmpSummary AS SELECT deptno, gender, COUNT(*) AS c, SUM(sal) AS s FROM Emps GROUP BY deptno, gender Materialized view: rewrite query to match Scan [Emps] SELECT COUNT(*) AS c FROM Emps WHERE deptno = 10 AND gender = ‘M’ Filter [deptno = 10 AND gender = ‘M’] Scan [EmpSummary] = Scan [Emps] Aggregate [deptno, gender, COUNT(*), SUM(salary)] Aggregate [deptno, gender, COUNT(*), SUM(salary)] Project [c]
  25. 25. CREATE MATERIALIZED VIEW EmpSummary AS SELECT deptno, gender, COUNT(*) AS c, SUM(sal) AS s FROM Emps GROUP BY deptno, gender Materialized view: rewrite query to match Scan [Emps] SELECT COUNT(*) AS c FROM Emps WHERE deptno = 10 AND gender = ‘M’ Filter [deptno = 10 AND gender = ‘M’] Scan [EmpSummary] = Scan [Emps] Aggregate [deptno, gender, COUNT(*), SUM(salary)] Aggregate [deptno, gender, COUNT(*), SUM(salary)] Project [c]
  26. 26. CREATE MATERIALIZED VIEW EmpSummary AS SELECT deptno, gender, COUNT(*) AS c, SUM(sal) AS s FROM Emps GROUP BY deptno, gender Materialized view: substitute table scan SELECT COUNT(*) AS c FROM Emps WHERE deptno = 10 AND gender = ‘M’ Filter [deptno = 10 AND gender = ‘M’] Scan [EmpSummary] = Scan [Emps] Aggregate [deptno, gender, COUNT(*), SUM(salary)] Project [c] Scan [EmpSummary]
  27. 27. CREATE MATERIALIZED VIEW EmpSummary AS SELECT deptno, gender, COUNT(*) AS c, SUM(sal) AS s FROM Emps GROUP BY deptno, gender Materialized view: substitute table scan SELECT c FROM EmpSummary WHERE deptno = 10 AND gender = ‘M’ Filter [deptno = 10 AND gender = ‘M’] Scan [EmpSummary] = Scan [Emps] Aggregate [deptno, gender, COUNT(*), SUM(salary)] Project [c] Scan [EmpSummary]
  28. 28. 4. Analytics
  29. 29. “orders” view in LookML view: orders { dimension: id { primary_key: yes type: number sql: ${TABLE}.id ;; } dimension: customer_id { # field: orders.customer_id sql: ${TABLE}.customer_id ;; } dimension: amount { # field: orders.amount type: number value_format: "0.00" sql: ${TABLE}.amount ;; } measure: count { # field: orders.count type: count # creates a sql COUNT(*) } measure: total_amount { type: sum sql: ${amount} ;; } }
  30. 30. 5. Evolving the data pipeline
  31. 31. Cloud DB table table table table table table source source SQL query business query interactive users
  32. 32. Data engineering table table table table table table
  33. 33. Data engineering is not a static problem table table table table table table table file In memory table table
  34. 34. data engineer Who is responsible for data engineering? table table table table table table table file In memory table table
  35. 35. system (runtime adaptation) data scientist analystdata engineer Data engineering - empower users, reduce friction table table table table table table table file In memory table table
  36. 36. LookML - derived table (based on SQL) view: customer_order_facts { derived_table: { sql: SELECT customer_id, MIN(DATE(time)) AS first_order_date, SUM(amount) AS lifetime_amount FROM order GROUP BY customer_id ;; } dimension: customer_id { type: number primary_key: yes sql: ${TABLE}.customer_id ;; } dimension_group: first_order { type: time timeframes: [date, week, month] sql: ${TABLE}.first_order_date ;; } dimension: lifetime_amount { type: number value_format: "0.00" sql: ${TABLE}.lifetime_amount ;; } }
  37. 37. LookML - derived table (based on an Explore) view: customer_order_facts { derived_table: { explore_source: orders { column: customer_id { field: order.customer_id } column: first_order { field: order.first_order } column: lifetime_amount { field: order.lifetime_amount } } } dimension: customer_id { type: number primary_key: yes sql: ${TABLE}.customer_id ;; } dimension_group: first_order { type: time timeframes: [date, week, month] sql: ${TABLE}.first_order_date ;; }
  38. 38. Flavors of derived table Derived table flavor Purpose SQL equivalent Ephemeral Query expansion CREATE VIEW Persistent Query is executed once, used by several queries until it expires CREATE TABLE AS SELECT Transparent Populated as persistent DT, but can be used even if the business query does not reference it by name CREATE MATERIALIZED VIEW Each flavor comes can be based on either an Explore or SQL
  39. 39. Building materialized views Challenges: ● Design Which materializations to create? ● Populate Load them with data ● Maintain Incrementally populate when data changes ● Rewrite Transparently rewrite queries to use materializations ● Adapt Design and populate new materializations, drop unused ones ● Express Need a rich algebra, to model how data is derived Initial focus: summary tables (materialized views over star schemas)
  40. 40. CREATE LATTICE Sales AS SELECT t.*, c.*, COUNT(*), SUM(s.units) FROM Sales AS s JOIN Time AS t USING (timeId) JOIN Customers AS c USING (customerId) JOIN Products AS p USING (productId); Designing summary tables via lattices CREATE MATERIALIZED VIEW SalesYearZipcode AS SELECT t.year, c.state, c.zipcode, COUNT(*), SUM(units) FROM Sales AS s JOIN Time AS t USING (timeId) JOIN Customers AS c USING (customerId) GROUP BY 1, 2, 3; product product class sales customers time
  41. 41. Many possible summary tables Key z zipcode (43k) s state (50) g gender (2) y year (5) m month (12) () 1 (z, s, g, y, m) 912k (s, g, y, m) 6k (z) 43k (s) 50 (g) 2 (y) 5 (m) 12 raw 1m (y, m) 60(g, y) 10 (z, s) 43.4k (g, y, m) 120 Fewer than you would expect, because 5m combinations cannot occur in 1m row table Fewer than you would expect, because state depends on zipcode
  42. 42. Algorithm: Design summary tables Given a database with 30 columns, 10M rows. Find X summary tables with under Y rows that improve query response time the most. AdaptiveMonteCarlo algorithm [1]: ● Based on research [2] ● Greedy algorithm that takes a combination of summary tables and tries to find the table that yields the greatest cost/benefit improvement ● Models “benefit” of the table as query time saved over simulated query load ● The “cost” of a table is its size [1] org.pentaho.aggdes.algorithm.impl.AdaptiveMonteCarloAlgorithm [2] Harinarayan, Rajaraman, Ullman (1996). “Implementing data cubes efficiently”
  43. 43. Lattice (optimized) () 1 (z, s, g, y, m) 912k (s, g, y, m) 6k (z) 43k (s) 50 (g) 2 (y) 5 (m) 12 (z, g, y, m) 909k (z, s, y, m) 831k raw 1m (z, s, g, m) 644k (z, s, g, y) 392k (y, m) 60 (z, s) 43.4k (z, s, g) 83.6k (g, y) 10 (g, y, m) 120 (g, m) 24 Key z zipcode (43k) s state (50) g gender (2) y year (5) m month (12)
  44. 44. system (runtime adaptation) data scientist analystdata engineer Data engineering - empower users, reduce friction table table table table table table table file In memory table table
  45. 45. data scientist system (runtime adaptation) analystdata engineer Data engineering - productionize table table table table table table table In memory table file table
  46. 46. Adaptive data systems queries DML statistics adaptations recommender Goals ● Improve response time, throughput, storage cost ● Predictable, adaptive (short and long term), allow human intervention How? ● Humans ● Adaptive systems ● Smart algorithms Example adaptations ● Cache disk blocks in memory ● Cached query results ● Data organization, e.g. partition on a different key ● Secondary structures, e.g. b-tree and r-tree indexes
  47. 47. Thank you! Any questions? @julianhyde www.looker.com calcite.apache.org

×