Autonomous ETL with Materialized Views
Abhishek Somani, Adesh Rao
May 2018
Agenda
1. Standard techniques for structuring data for SQL-on-Hadoop (Hive, Presto, Spark, etc.)
2. Difficulties in structuring data
3. A case for Materialized Views
4. Challenges with Materialized Views
5. Solution
Data structuring for SQL-on-Hadoop
● Partitioning
● Columnar File Formats
  ○ ORC
  ○ Parquet
● Sorting
● Bucketing
[Figure: Speedup of unsorted vs. sorted ORC data on TPC-DS scale 1000]
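Why sorting helps columnar formats can be illustrated with a toy model. ORC keeps min/max statistics for blocks of rows, so a reader can skip blocks whose range cannot contain the filter value. The sketch below is a simplified stand-in for that mechanism, not ORC's actual reader; `Stripe`, `build_stripes`, and `stripes_to_read` are hypothetical names, and the data is synthetic.

```python
from dataclasses import dataclass


@dataclass
class Stripe:
    """A block of rows plus the min/max statistics kept for it (toy model)."""
    rows: list
    min_id: int
    max_id: int


def build_stripes(customer_ids, stripe_size):
    """Split the column into fixed-size blocks and record min/max per block."""
    stripes = []
    for start in range(0, len(customer_ids), stripe_size):
        chunk = customer_ids[start:start + stripe_size]
        stripes.append(Stripe(chunk, min(chunk), max(chunk)))
    return stripes


def stripes_to_read(stripes, wanted_id):
    """Keep only the blocks whose [min, max] range may contain wanted_id."""
    return [s for s in stripes if s.min_id <= wanted_id <= s.max_id]


# Synthetic data: a shuffled-looking permutation of 0..99 (7 and 100 are coprime).
ids = [(i * 7) % 100 for i in range(100)]

unsorted_hits = len(stripes_to_read(build_stripes(ids, 10), 42))
sorted_hits = len(stripes_to_read(build_stripes(sorted(ids), 10), 42))

print("blocks read, unsorted:", unsorted_hits)
print("blocks read, sorted:  ", sorted_hits)
```

With sorted data each block covers a narrow, non-overlapping range, so a point lookup touches a single block; with unsorted data the ranges overlap and more blocks must be read.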
Difficulties in Structuring Data
● Evolving query patterns
● Data pipeline dependencies
● Large number of consumers
● Data admin involvement
● Downtime
What is needed:
● Workload-aware identification of the optimal data structure
● Flexibility of data structuring
● Seamless restructuring
● Continuous and automatic maintenance
NO DOWNTIME!
Basics: Materialized View
● A materialized view is a database object that contains the results of a query.
● It is a view for which the data has been materialized.
● Materialized views can be consumed automatically by the query engine.
Example:
CREATE MATERIALIZED VIEW mv AS SELECT seller_id, seller_name, num_item*cost AS value FROM sales;
Effect: Query rewrite
SELECT seller_id, num_item*cost AS value FROM sales;
is rewritten to
SELECT seller_id, value FROM mv;
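The equivalence behind the rewrite can be checked with a small runnable sketch. SQLite has no materialized views, so this example assumes a plain table built with `CREATE TABLE ... AS SELECT` as a stand-in for `mv`; the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (seller_id INTEGER, seller_name TEXT, "
    "num_item INTEGER, cost REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [(1, "alice", 3, 2.5), (2, "bob", 4, 10.0)],  # made-up sample rows
)

# Stand-in for CREATE MATERIALIZED VIEW: store the query result as a table.
conn.execute(
    "CREATE TABLE mv AS "
    "SELECT seller_id, seller_name, num_item*cost AS value FROM sales"
)

# The original query computes the expression at query time ...
original = conn.execute(
    "SELECT seller_id, num_item*cost AS value FROM sales ORDER BY seller_id"
).fetchall()

# ... while the rewritten query reads the precomputed column from mv.
rewritten = conn.execute(
    "SELECT seller_id, value FROM mv ORDER BY seller_id"
).fetchall()

print(original == rewritten)
```

Both queries return the same rows, which is the property a query engine relies on when it substitutes the materialized view for the base table.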
Materialized Views in Hive for Data Restructuring
Interesting properties of Materialized Views in Hive:
● A copy of the data (full, partial or transformed)
● Used automatically by the engine based on cost analysis
● Can be stored as ORC, Parquet, etc.
● Multiple materialized views can coexist; the engine chooses the optimal one
Plus: Storage is cheap
Idea: Create multiple materialized views of the full data with desired structures
Materialized Views for Data Restructuring
Example:
Original Table T1:
● Partitioned on Year, Month, Day
● Stored as Text
Materialized View MV1:
● Partitioned on Year, Month, Day
● Sorted on Customer_Id
● Stored as ORC
Materialized View MV2:
● Partitioned on Year, Month, Day
● Sorted on Seller_Id
● Stored as ORC
Query1: SELECT * FROM T1 WHERE customer_id = 26988 AND month = 'January';
Rewritten: SELECT * FROM MV1 WHERE customer_id = 26988 AND month = 'January';
Query2: SELECT * FROM T1 WHERE seller_id = 121 AND month = 'January';
Rewritten: SELECT * FROM MV2 WHERE seller_id = 121 AND month = 'January';
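The choice between MV1 and MV2 in this example can be sketched as a toy rule: prefer the copy whose sort column appears in the query's filters. A real engine decides by cost analysis, so the rule below is only an assumption for illustration; `Layout`, `pick_layout`, and `rewrite` are hypothetical names, and the rewrite is a naive string substitution rather than a plan transformation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Layout:
    """Physical layout of a table or of a materialized view over it."""
    name: str
    partition_cols: tuple
    sort_col: Optional[str]
    file_format: str


PARTS = ("year", "month", "day")
T1 = Layout("T1", PARTS, None, "text")
MV1 = Layout("MV1", PARTS, "customer_id", "orc")
MV2 = Layout("MV2", PARTS, "seller_id", "orc")


def pick_layout(filter_cols, base, views):
    """Toy rule: use the first view sorted on a filtered column, else the base."""
    for view in views:
        if view.sort_col in filter_cols:
            return view
    return base


def rewrite(query, base, chosen):
    """Naive rewrite: swap the table name in the FROM clause."""
    return query.replace(f"FROM {base.name} ", f"FROM {chosen.name} ")


query1 = "SELECT * FROM T1 WHERE customer_id = 26988 AND month = 'January';"
chosen1 = pick_layout({"customer_id", "month"}, T1, [MV1, MV2])
print(rewrite(query1, T1, chosen1))

query2 = "SELECT * FROM T1 WHERE seller_id = 121 AND month = 'January';"
chosen2 = pick_layout({"seller_id", "month"}, T1, [MV1, MV2])
print(rewrite(query2, T1, chosen2))
```

A query that filters on neither sort column falls back to the base table, which keeps the rule safe even when no view fits.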
Materialized Views in SQL-on-Hadoop engines
● Basic implementation available in Apache Hive 2.3.0
○ Uses Apache Calcite for query optimization and query rewrite
○ Multi-file-format support; uses ORC by default for optimized columnar storage of materialized queries
● Not available in Presto
● Not available in Spark
Challenges with Materialized Views
● Invalidation
○ Only a subset of use cases can work with stale data
● Rebuilds and Refreshes
○ Prohibitively expensive for full data copies
● Maintenance Isolation
○ Ongoing queries get affected
FastCopy: A framework for Autonomous Materialized Views
● Materialized Views for Sorting, Partitioning and Bucketing for structuring data
● Synchronous invalidation on table updates
● Asynchronous automatic refreshes
● Maintenance isolation by running refreshes in their own scheduler queues, or even on their own cluster
● Recommendation Engine to suggest Materialized Views
● Cross engine support for using Materialized Views
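The split between synchronous invalidation and asynchronous refresh listed above can be sketched as follows. This is a minimal single-process model under stated assumptions, not FastCopy's implementation: `FastCopyRegistry` and its methods are hypothetical names, and the "asynchronous" worker is simply a function called later.

```python
from collections import deque


class FastCopyRegistry:
    """Toy registry tracking which materialized views are safe to use."""

    def __init__(self):
        self.sources = {}             # view name -> source table
        self.valid = {}               # view name -> is the copy up to date?
        self.refresh_queue = deque()  # views waiting to be rebuilt

    def register(self, view, table):
        self.sources[view] = table
        self.valid[view] = True

    def on_table_update(self, table):
        """Synchronous step: runs as part of the write, before it returns."""
        for view, source in self.sources.items():
            if source == table and self.valid[view]:
                self.valid[view] = False
                self.refresh_queue.append(view)

    def usable_views(self, table):
        """Views the rewriter may use; stale copies are never returned."""
        return [v for v, s in self.sources.items() if s == table and self.valid[v]]

    def run_refresh_worker(self, rebuild):
        """Asynchronous step: a separate worker drains the queue later."""
        while self.refresh_queue:
            view = self.refresh_queue.popleft()
            rebuild(view)
            self.valid[view] = True


registry = FastCopyRegistry()
registry.register("mv1", "T1")
registry.register("mv2", "T1")

before_update = registry.usable_views("T1")
registry.on_table_update("T1")          # the write invalidates both copies
during_staleness = registry.usable_views("T1")

rebuilt = []
registry.run_refresh_worker(rebuilt.append)  # stand-in for the real rebuild job
after_refresh = registry.usable_views("T1")

print(before_update, during_staleness, after_refresh)
```

Because invalidation happens inside the write path, queries issued between the update and the refresh fall back to the base table instead of reading stale data.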
Qubole FastCopy Infrastructure
[Diagram sequence: FastCopy Creation; Incoming query for rewrite; Query Rewrite; Invalidation and Refresh]
Fun Details
● Auto-detect added, dropped or updated partitions using partition-level tokens
● Multi-Version Concurrency Control (MVCC) for FastCopy
● Minion clusters for workload isolation
● Top Tables
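Detecting partition changes with partition-level tokens can be sketched as a comparison of two snapshots. The slides do not say what a token contains, so this example assumes it is an opaque value that changes whenever a partition's data changes; `diff_partitions` and the sample tokens are hypothetical.

```python
def diff_partitions(seen, current):
    """Compare the tokens recorded at the last refresh with the current ones.

    Both arguments map a partition name to its token. Returns three sorted
    lists: partitions that were added, dropped, and updated.
    """
    added = sorted(p for p in current if p not in seen)
    dropped = sorted(p for p in seen if p not in current)
    updated = sorted(p for p in current if p in seen and current[p] != seen[p])
    return added, dropped, updated


# Tokens recorded when the copy was last refreshed (made-up values).
last_refresh = {
    "month=January": "tok-1",
    "month=February": "tok-2",
    "month=March": "tok-3",
}

# Tokens observed now: February was rewritten, March dropped, April added.
now = {
    "month=January": "tok-1",
    "month=February": "tok-9",
    "month=April": "tok-4",
}

added, dropped, updated = diff_partitions(last_refresh, now)
print("added:", added)
print("dropped:", dropped)
print("updated:", updated)
```

Only the partitions in these three lists need work during a refresh, which is what makes refreshing a full data copy affordable.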
Recommendations
● Top Tables
● Column Usage as Filter predicates
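The "column usage as filter predicates" signal can be sketched by counting how often each column appears in WHERE clauses across a query log. The slides only name the signal, so the approach here is an assumption: a crude regular expression stands in for a real SQL parser, and `filter_columns` and the sample log are hypothetical.

```python
import re
from collections import Counter


def filter_columns(query):
    """Return column names that appear left of =, < or > in the WHERE clause.

    Crude on purpose: a production system would inspect parsed query plans.
    """
    match = re.search(r"\bwhere\b(.*)", query, flags=re.IGNORECASE | re.DOTALL)
    if not match:
        return []
    return re.findall(r"\b([A-Za-z_][A-Za-z0-9_]*)\s*(?:=|<|>)", match.group(1))


# Made-up query log against table T1.
query_log = [
    "SELECT * FROM T1 WHERE customer_id = 26988 AND month = 'January'",
    "SELECT * FROM T1 WHERE seller_id = 121 AND month = 'January'",
    "SELECT * FROM T1 WHERE customer_id = 7",
    "SELECT count(*) FROM T1",
]

usage = Counter()
for query in query_log:
    usage.update(filter_columns(query))

for column, count in usage.most_common():
    print(column, count)
```

Columns that are filtered on often but are not partition columns are natural candidates for the sort or bucket column of a recommended materialized view.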
Recap
1. Standard techniques for structuring data for SQL-on-Hadoop (Hive, Presto, Spark, etc.)
2. Difficulties in structuring data
3. A case for Materialized Views
4. Challenges with Materialized Views
5. Solution
Status
● FastCopy is in internal alpha
● Will be released as a beta for customers in the next quarter
● Contribute to open source
Thank You