1 © Hortonworks Inc. 2011–2018. All rights reserved
Accelerating query processing with
materialized views in Apache Hive
Jesús Camacho Rodríguez
DataWorks Summit Berlin
April 18, 2018
2 © Hortonworks Inc. 2011–2018. All rights reserved
Apache Hive
• Initial use case: batch processing
• Read-only data
• HiveQL (SQL-like query language)
• MapReduce
• Effort to take Hive beyond its batch processing roots
• Started in Apache Hive 0.10.0 (January 2013)
• Upcoming release: Apache Hive 3.0 (May 2018)
• Extensive renovation to improve three different axes
• Latency: allow interactive and sub-second queries
• Scalability: from TB to PB of data
• SQL support: move from HiveQL to SQL standard
3 © Hortonworks Inc. 2011–2018. All rights reserved
Apache Hive
• Multiple execution engines: Apache Tez and Apache Spark
• More efficient join execution algorithms
• Vectorized query execution
• Integration with columnar storage formats:
Apache ORC, Apache Parquet
• LLAP (Live Long and Process)
• Persistent deamons for low-latency queries
• Rule-based and cost-based optimizer
• Better statistics
• Tighter integration with other data processing systems: Druid
Important internals improvements
4 © Hortonworks Inc. 2011–2018. All rights reserved
Accelerating query processing
• Change data physical properties (distribute, sort)
• Filter rows
• Denormalize
• Preaggregate
Optimization based on access patterns
5 © Hortonworks Inc. 2011–2018. All rights reserved
Accelerating query processing
• Establish relationship between original and new tables
• Has a similar table already been created?
• Rewrite your queries to use new tables
• What happens when access patterns change?
• Maintain your new tables when original tables change
• Do I have to fully rebuild new tables?
Optimization based on access patterns
Currently, Hive users
have to do it manually
6 © Hortonworks Inc. 2011–2018. All rights reserved
Materialized views
• A materialized view is an entity that contains the result of a evaluating a query
• Important property  Awareness of the materialized view definition semantics
• Optimizer can exploit them for automatic query rewriting
• System can handle maintenance of the materialized views
• Generally, materializations can be created in different forms depending on the scope
• DBA writes “CREATE MATERIALIZED VIEW” statement
• Daemon creates materialized view based on recent query activity
• Cached result of previous similar query
• Query factorization identifies common pieces within a single query
7 © Hortonworks Inc. 2011–2018. All rights reserved
Possible workflow
1. Create materialized view using Hive tables
• Stored by Hive or Druid
2. User or dashboard sends queries to Hive
• Hive rewrites queries using available materialized views
• Execute rewitten query
Dashboards, BI tools
CREATE MATERIALIZED VIEW `ssb_mv`
STORED AS 'org.apache.hadoop.hive.druid.DruidStorageHandler'
ENABLE REWRITE
AS
<query>;
DBA, recommendation system
①
②
Data
Queries
8 © Hortonworks Inc. 2011–2018. All rights reserved
Materialized views in Apache Hive
• First implementation will be part of Apache Hive 3.0
• Multiple storage options: Hive, Druid
• Automatic rewriting of incoming queries to use materialized views
• Efficient view maintenance
• Incremental refresh
• Multiple options to control materialized views lifecycle
9 © Hortonworks Inc. 2011–2018. All rights reserved
Management of
materialized views in Hive
10 © Hortonworks Inc. 2011–2018. All rights reserved
Materialized view creation
• CREATE MATERIALIZED VIEW statement
CREATE MATERIALIZED VIEW [IF NOT EXISTS] [db_name.]materialized_view_name
[ENABLE REWRITE | DISABLE REWRITE]
[COMMENT materialized_view_comment]
[
[ROW FORMAT row_format]
[STORED AS file_format]
| STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]
]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)]
AS
<query>;
⇢ Supports custom table properties, storage format, etc.
11 © Hortonworks Inc. 2011–2018. All rights reserved
Materialized view creation (stored in Druid)
• CREATE MATERIALIZED VIEW statement
CREATE MATERIALIZED VIEW druid_wiki_mv
STORED AS 'org.apache.hadoop.hive.druid.DruidStorageHandler'
AS
SELECT __time, page, user, c_added, c_removed
FROM src;
Hive materialized view name
Hive storage handler classname
12 © Hortonworks Inc. 2011–2018. All rights reserved
Other operations for materialized view management
DROP MATERIALIZED VIEW [db_name.]materialized_view_name;
SHOW MATERIALIZED VIEWS [IN database_name] ['identifier_with_wildcards’];
DESCRIBE [EXTENDED | FORMATTED] [db_name.]materialized_view_name;
⇢ More operations to be added and extended
13 © Hortonworks Inc. 2011–2018. All rights reserved
Materialized view-based
query rewriting
14 © Hortonworks Inc. 2011–2018. All rights reserved
Materialized view-based rewriting algorithm
• Automatically rewrite incoming queries using materialized views
• Optimizer exploits materialized view definition semantics
• Built on the ideas presented in [GL01] using Apache Calcite
• Supports queries containing TableScan, Project, Filter, Join, Aggregate operators
• Includes some extensions
• Generation of additional rewritings without needing to do join permutation
• Partial rewritings using union operators
• More information about the rewriting coverage
• http://calcite.apache.org/docs/materialized_views#rewriting-using-plan-structural-information
[GL01] Jonathan Goldstein and Per-åke Larson. Optimizing queries using materialized views: A practical,
scalable solution. In Proc. ACM SIGMOD Conf., 2001.
15 © Hortonworks Inc. 2011–2018. All rights reserved
Enable materialized view-based rewriting
• Global property to enable materialized view rewriting for queries
SET hive.materializedview.rewriting=true;
• User can selectively use enable/disable materialized views for rewriting
• Materialized views are enabled by default for rewriting
• Behavior can be altered after materialized view has been created
ALTER MATERIALIZED VIEW [db_name.]materialized_view_name ENABLE|DISABLE REWRITE;
16 © Hortonworks Inc. 2011–2018. All rights reserved
depts
Materialized view-based rewriting (example)
• Materialized view definition
Employees that were hired after 2016
CREATE MATERIALIZED VIEW mv
AS
SELECT empid, deptname, hire_date
FROM emps JOIN depts
ON (emps.deptno = depts.deptno)
WHERE hire_date >= '2016-01-01';
• Query
Employees that were hired last quarter
SELECT empid, deptname
FROM emps JOIN depts
ON (emps.deptno = depts.deptno)
WHERE hire_date >= '2018-01-01'
AND hire_date <= '2018-03-31';
• Materialized view-based rewriting
SELECT empid, deptname
FROM mv
WHERE hire_date >= '2018-01-01'
AND hire_date <= '2018-03-31';
deptsemps
empid depname hire_date
10001 IT 2016-03-01
10002 IT 2017-01-02
10003 HR 2017-07-01
10004 Finance 2018-01-15
10005 HR 2018-02-02
mv contents
empid depname
10004 Finance
10005 HR
Query results
17 © Hortonworks Inc. 2011–2018. All rights reserved
Materialized view-based rewriting (example 2)
• Materialized view definition
CREATE MATERIALIZED VIEW mv AS
SELECT <dims>,
lo_revenue,
lo_extprice * lo_disc AS d_price,
lo_revenue - lo_supplycost,
FROM
customer, dates, lineorder, part, supplier
WHERE
lo_orderdate = d_datekey
and lo_partkey = p_partkey
and lo_suppkey = s_suppkey
and lo_custkey = c_custkey;
• Query
SELECT sum(lo_extendedprice*lo_discount)
FROM
lineorder, dates
WHERE
lo_orderdate = d_datekey
and d_year = 2013
and lo_discount between 1 and 3;
• Materialized view-based rewriting
SELECT SUM(d_price)
FROM mv
WHERE
d_year = 2013
and lo_discount between 1 and 3;
supplier
part
dates
customerlineorder
Exploit SQL PK-FK and
NOT NULL constraints
d_year lo_discount <dims> d_price
2013 2 ... 7.55
2014 4 ... 432.60
2013 2 ... 34.45
2012 2 ... 2.05
… … ... …
mv contents
sum
42.0
…
Query results
18 © Hortonworks Inc. 2011–2018. All rights reserved
Materialized view-based rewriting (example 3)
• Materialized view definition
CREATE MATERIALIZED VIEW mv AS
SELECT floor(time to minute), page,
SUM(added) AS c_added,
SUM(removed) AS c_rmv
FROM wiki
GROUP BY floor(time to minute), page;
• Query
SELECT floor(time to month),
SUM(added) AS c_added
FROM wiki
GROUP BY floor(time to month);
• Materialized view-based rewriting
SELECT floor(time to month),
SUM(c_added) as c_added
FROM mv
GROUP BY floor(time to month);
wiki
__time page c_added c_rmv
2011-01-01 01:05:00 Justin 1800 25
2011-01-20 19:00:00 Justin 2912 42
2011-01-01 11:06:00 Ke$ha 1953 17
2011-02-02 13:15:00 Ke$ha 3194 170
2011-01-02 18:00:00 Miley 2232 34
mv contents
__time c_added
2011-01-01 00:00:00 8897
2011-02-01 00:00:00 3194
Query results
19 © Hortonworks Inc. 2011–2018. All rights reserved
Materialized view
maintenance
20 © Hortonworks Inc. 2011–2018. All rights reserved
Rebuilding materialized views
• Rebuild needs to be triggered manually by user
ALTER MATERIALIZED VIEW [db_name.]materialized_view_name REBUILD;
• Incremental materialized view maintenance
• Only refresh data that has changed in source tables
• Multiple benefits
• Decrease rebuild step execution time
• Preserves LLAP cache for existing data
• Materialized view should only use transactional tables (micromanaged or ACID)
• Current implementation only supports incremental rebuild for insert operations
• Update/delete operations force full rebuild
• Optimizer will attempt incremental rebuild
• Otherwise, fallback to full rebuild (INSERT OVERWRITE with MV definition)
21 © Hortonworks Inc. 2011–2018. All rights reserved
Incremental view maintenance algorithm
• Relies on materialized view rewriting algorithm
• Materialized view stores write ID for its tables when it is created/refreshed
• Write ID associates rows with transactions
• When rebuild is triggered, introduce filter condition on write ID column in MV definition
• Read only new rows from source tables
• Execute materialized view rewriting
• Rewrite INSERT OVERWRITE (full rebuild) into more efficient plan
• INSERT (table scan, filter, project, join)
• MERGE (table scan, filter, project, join, aggregate)
22 © Hortonworks Inc. 2011–2018. All rights reserved
CREATE MATERIALIZED VIEW mv1 AS
SELECT page, user,
SUM(added) AS c_added,
SUM(removed) AS c_rmv
FROM wiki
GROUP BY page, user;
Incremental view maintenance algorithm (example)
mv1 contents
page user c_added c_rmv
Justin Boxer 1800 25
Justin Reach 2912 42
Ke$ha Xeno 1953 17
Ke$ha Helz 3194 170
Miley Ashu 2232 34
page user … added removed … writeID
… … … … … … …
Miley Ashu … 68 16 … 10000
Justin Zaka … 392 239 … 10000
wiki contents
New records
⇢ ALTER MATERIALIZED VIEW mv1 REBUILD;
23 © Hortonworks Inc. 2011–2018. All rights reserved
CREATE MATERIALIZED VIEW mv1 AS
SELECT page, user,
SUM(added) AS c_added,
SUM(removed) AS c_rmv
FROM wiki
GROUP BY page, user;
① Rebuild statement rewriting
INSERT OVERWRITE mv1
SELECT page, user, SUM(added) AS c_added, SUM(removed) AS c_rmv
FROM (
SELECT page, user, c_added, c_removed
FROM mv1
UNION ALL
SELECT page, user, SUM(added) AS c_added, SUM(removed) AS c_rmv
FROM wiki
WHERE writeID > 9999
GROUP BY page, user) subq
GROUP BY page, user;
Incremental view maintenance algorithm (example)
Rollup data
mv1 contents
page user c_added c_rmv
Justin Boxer 1800 25
Justin Reach 2912 42
Ke$ha Xeno 1953 17
Ke$ha Helz 3194 170
Miley Ashu 2232 34
page user … added removed … writeID
… … … … … … …
Miley Ashu … 68 16 … 10000
Justin Zaka … 392 239 … 10000
wiki contents
New records
24 © Hortonworks Inc. 2011–2018. All rights reserved
CREATE MATERIALIZED VIEW mv1 AS
SELECT page, user,
SUM(added) AS c_added,
SUM(removed) AS c_rmv
FROM wiki
GROUP BY page, user;
② Rewrite INSERT OVERWRITE into MERGE statement
MERGE INTO mv1
USING (
SELECT page, user, SUM(added) AS c_added, SUM(removed) AS c_rmv
FROM wiki
WHERE writeID > 9999
GROUP BY page, user) src
ON mv1.page = src.page AND mv1.user = src.user
WHEN MATCHED
THEN UPDATE SET c_added = mv1.c_added + src.c_added,
c_removed = mv1.c_removed + src.c_rmv
WHEN NOT MATCHED
THEN INSERT VALUES (page, user, c_added, c_rmv);
Incremental view maintenance algorithm (example)
mv1 contents
page user c_added c_rmv
Justin Boxer 1800 25
Justin Reach 2912 42
Ke$ha Xeno 1953 17
Ke$ha Helz 3194 170
Miley Ashu 2232 34
page user … added removed … writeID
… … … … … … …
Miley Ashu … 68 16 … 10000
Justin Zaka … 392 239 … 10000
wiki contents
New records
25 © Hortonworks Inc. 2011–2018. All rights reserved
Materialized view lifecycle
26 © Hortonworks Inc. 2011–2018. All rights reserved
Management of materialized view lifecycle
• Do not accept stale data (default)
• If content of the materialized view is not fresh, we do not use it for automatic query rewriting
• Still possible to trigger partial rewritings that read both the stale materialized view and new data
from source tables
• Accept stale data
• Freshness defined as a time parameter
• If MV was not rebuilt for a certain time period and there were changes in base tables, ignore
• SET hive.materializedview.rewriting.time.window=10min;
• Can also be overriden by a certain materialized view using table properties
• Periodically rebuild materialized view, e.g., every 5 minutes
t=0min t=10min t=20min
Create MV Rebuild Rebuild Rebuild Rebuild
t=5min t=15min
27 © Hortonworks Inc. 2011–2018. All rights reserved
Road ahead
28 © Hortonworks Inc. 2011–2018. All rights reserved
Road ahead
• Improvements to current materialized views implementation
• Rewriting performance and scalability
• Single/many MVs
• Control physical distribution of data
• DISTRIBUTE BY, SORT BY, CLUSTER BY
• Increase incremental view maintenance coverage
• Support update/delete in source tables
• Materialized view recommender
• Ease the identification of access patterns for a given workload
29 © Hortonworks Inc. 2011–2018. All rights reserved
Demo
30 © Hortonworks Inc. 2011–2018. All rights reserved
Thank you
https://cwiki.apache.org/confluence/display/Hive/Materialized+views

Accelerating query processing with materialized views in Apache Hive

  • 1.
    1 © HortonworksInc. 2011–2018. All rights reserved Accelerating query processing with materialized views in Apache Hive Jesús Camacho Rodríguez DataWorks Summit Berlin April 18, 2018
  • 2.
    2 © HortonworksInc. 2011–2018. All rights reserved Apache Hive • Initial use case: batch processing • Read-only data • HiveQL (SQL-like query language) • MapReduce • Effort to take Hive beyond its batch processing roots • Started in Apache Hive 0.10.0 (January 2013) • Upcoming release: Apache Hive 3.0 (May 2018) • Extensive renovation to improve three different axes • Latency: allow interactive and sub-second queries • Scalability: from TB to PB of data • SQL support: move from HiveQL to SQL standard
  • 3.
    3 © HortonworksInc. 2011–2018. All rights reserved Apache Hive • Multiple execution engines: Apache Tez and Apache Spark • More efficient join execution algorithms • Vectorized query execution • Integration with columnar storage formats: Apache ORC, Apache Parquet • LLAP (Live Long and Process) • Persistent deamons for low-latency queries • Rule-based and cost-based optimizer • Better statistics • Tighter integration with other data processing systems: Druid Important internals improvements
  • 4.
    4 © HortonworksInc. 2011–2018. All rights reserved Accelerating query processing • Change data physical properties (distribute, sort) • Filter rows • Denormalize • Preaggregate Optimization based on access patterns
  • 5.
    5 © HortonworksInc. 2011–2018. All rights reserved Accelerating query processing • Establish relationship between original and new tables • Has a similar table already been created? • Rewrite your queries to use new tables • What happens when access patterns change? • Maintain your new tables when original tables change • Do I have to fully rebuild new tables? Optimization based on access patterns Currently, Hive users have to do it manually
  • 6.
    6 © HortonworksInc. 2011–2018. All rights reserved Materialized views • A materialized view is an entity that contains the result of a evaluating a query • Important property  Awareness of the materialized view definition semantics • Optimizer can exploit them for automatic query rewriting • System can handle maintenance of the materialized views • Generally, materializations can be created in different forms depending on the scope • DBA writes “CREATE MATERIALIZED VIEW” statement • Daemon creates materialized view based on recent query activity • Cached result of previous similar query • Query factorization identifies common pieces within a single query
  • 7.
    7 © HortonworksInc. 2011–2018. All rights reserved Possible workflow 1. Create materialized view using Hive tables • Stored by Hive or Druid 2. User or dashboard sends queries to Hive • Hive rewrites queries using available materialized views • Execute rewitten query Dashboards, BI tools CREATE MATERIALIZED VIEW `ssb_mv` STORED AS 'org.apache.hadoop.hive.druid.DruidStorageHandler' ENABLE REWRITE AS <query>; DBA, recommendation system ① ② Data Queries
  • 8.
    8 © HortonworksInc. 2011–2018. All rights reserved Materialized views in Apache Hive • First implementation will be part of Apache Hive 3.0 • Multiple storage options: Hive, Druid • Automatic rewriting of incoming queries to use materialized views • Efficient view maintenance • Incremental refresh • Multiple options to control materialized views lifecycle
  • 9.
    9 © HortonworksInc. 2011–2018. All rights reserved Management of materialized views in Hive
  • 10.
    10 © HortonworksInc. 2011–2018. All rights reserved Materialized view creation • CREATE MATERIALIZED VIEW statement CREATE MATERIALIZED VIEW [IF NOT EXISTS] [db_name.]materialized_view_name [ENABLE REWRITE | DISABLE REWRITE] [COMMENT materialized_view_comment] [ [ROW FORMAT row_format] [STORED AS file_format] | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)] ] [LOCATION hdfs_path] [TBLPROPERTIES (property_name=property_value, ...)] AS <query>; ⇢ Supports custom table properties, storage format, etc.
  • 11.
    11 © HortonworksInc. 2011–2018. All rights reserved Materialized view creation (stored in Druid) • CREATE MATERIALIZED VIEW statement CREATE MATERIALIZED VIEW druid_wiki_mv STORED AS 'org.apache.hadoop.hive.druid.DruidStorageHandler' AS SELECT __time, page, user, c_added, c_removed FROM src; Hive materialized view name Hive storage handler classname
  • 12.
    12 © HortonworksInc. 2011–2018. All rights reserved Other operations for materialized view management DROP MATERIALIZED VIEW [db_name.]materialized_view_name; SHOW MATERIALIZED VIEWS [IN database_name] ['identifier_with_wildcards’]; DESCRIBE [EXTENDED | FORMATTED] [db_name.]materialized_view_name; ⇢ More operations to be added and extended
  • 13.
    13 © HortonworksInc. 2011–2018. All rights reserved Materialized view-based query rewriting
  • 14.
    14 © HortonworksInc. 2011–2018. All rights reserved Materialized view-based rewriting algorithm • Automatically rewrite incoming queries using materialized views • Optimizer exploits materialized view definition semantics • Built on the ideas presented in [GL01] using Apache Calcite • Supports queries containing TableScan, Project, Filter, Join, Aggregate operators • Includes some extensions • Generation of additional rewritings without needing to do join permutation • Partial rewritings using union operators • More information about the rewriting coverage • http://calcite.apache.org/docs/materialized_views#rewriting-using-plan-structural-information [GL01] Jonathan Goldstein and Per-åke Larson. Optimizing queries using materialized views: A practical, scalable solution. In Proc. ACM SIGMOD Conf., 2001.
  • 15.
    15 © HortonworksInc. 2011–2018. All rights reserved Enable materialized view-based rewriting • Global property to enable materialized view rewriting for queries SET hive.materializedview.rewriting=true; • User can selectively use enable/disable materialized views for rewriting • Materialized views are enabled by default for rewriting • Behavior can be altered after materialized view has been created ALTER MATERIALIZED VIEW [db_name.]materialized_view_name ENABLE|DISABLE REWRITE;
  • 16.
    16 © HortonworksInc. 2011–2018. All rights reserved depts Materialized view-based rewriting (example) • Materialized view definition Employees that were hired after 2016 CREATE MATERIALIZED VIEW mv AS SELECT empid, deptname, hire_date FROM emps JOIN depts ON (emps.deptno = depts.deptno) WHERE hire_date >= '2016-01-01'; • Query Employees that were hired last quarter SELECT empid, deptname FROM emps JOIN depts ON (emps.deptno = depts.deptno) WHERE hire_date >= '2018-01-01' AND hire_date <= '2018-03-31'; • Materialized view-based rewriting SELECT empid, deptname FROM mv WHERE hire_date >= '2018-01-01' AND hire_date <= '2018-03-31'; deptsemps empid depname hire_date 10001 IT 2016-03-01 10002 IT 2017-01-02 10003 HR 2017-07-01 10004 Finance 2018-01-15 10005 HR 2018-02-02 mv contents empid depname 10004 Finance 10005 HR Query results
  • 17.
    17 © HortonworksInc. 2011–2018. All rights reserved Materialized view-based rewriting (example 2) • Materialized view definition CREATE MATERIALIZED VIEW mv AS SELECT <dims>, lo_revenue, lo_extprice * lo_disc AS d_price, lo_revenue - lo_supplycost, FROM customer, dates, lineorder, part, supplier WHERE lo_orderdate = d_datekey and lo_partkey = p_partkey and lo_suppkey = s_suppkey and lo_custkey = c_custkey; • Query SELECT sum(lo_extendedprice*lo_discount) FROM lineorder, dates WHERE lo_orderdate = d_datekey and d_year = 2013 and lo_discount between 1 and 3; • Materialized view-based rewriting SELECT SUM(d_price) FROM mv WHERE d_year = 2013 and lo_discount between 1 and 3; supplier part dates customerlineorder Exploit SQL PK-FK and NOT NULL constraints d_year lo_discount <dims> d_price 2013 2 ... 7.55 2014 4 ... 432.60 2013 2 ... 34.45 2012 2 ... 2.05 … … ... … mv contents sum 42.0 … Query results
  • 18.
    18 © HortonworksInc. 2011–2018. All rights reserved Materialized view-based rewriting (example 3) • Materialized view definition CREATE MATERIALIZED VIEW mv AS SELECT floor(time to minute), page, SUM(added) AS c_added, SUM(removed) AS c_rmv FROM wiki GROUP BY floor(time to minute), page; • Query SELECT floor(time to month), SUM(added) AS c_added FROM wiki GROUP BY floor(time to month); • Materialized view-based rewriting SELECT floor(time to month), SUM(c_added) as c_added FROM mv GROUP BY floor(time to month); wiki __time page c_added c_rmv 2011-01-01 01:05:00 Justin 1800 25 2011-01-20 19:00:00 Justin 2912 42 2011-01-01 11:06:00 Ke$ha 1953 17 2011-02-02 13:15:00 Ke$ha 3194 170 2011-01-02 18:00:00 Miley 2232 34 mv contents __time c_added 2011-01-01 00:00:00 8897 2011-02-01 00:00:00 3194 Query results
  • 19.
    19 © HortonworksInc. 2011–2018. All rights reserved Materialized view maintenance
  • 20.
    20 © HortonworksInc. 2011–2018. All rights reserved Rebuilding materialized views • Rebuild needs to be triggered manually by user ALTER MATERIALIZED VIEW [db_name.]materialized_view_name REBUILD; • Incremental materialized view maintenance • Only refresh data that has changed in source tables • Multiple benefits • Decrease rebuild step execution time • Preserves LLAP cache for existing data • Materialized view should only use transactional tables (micromanaged or ACID) • Current implementation only supports incremental rebuild for insert operations • Update/delete operations force full rebuild • Optimizer will attempt incremental rebuild • Otherwise, fallback to full rebuild (INSERT OVERWRITE with MV definition)
  • 21.
    21 © HortonworksInc. 2011–2018. All rights reserved Incremental view maintenance algorithm • Relies on materialized view rewriting algorithm • Materialized view stores write ID for its tables when it is created/refreshed • Write ID associates rows with transactions • When rebuild is triggered, introduce filter condition on write ID column in MV definition • Read only new rows from source tables • Execute materialized view rewriting • Rewrite INSERT OVERWRITE (full rebuild) into more efficient plan • INSERT (table scan, filter, project, join) • MERGE (table scan, filter, project, join, aggregate)
  • 22.
    22 © HortonworksInc. 2011–2018. All rights reserved CREATE MATERIALIZED VIEW mv1 AS SELECT page, user, SUM(added) AS c_added, SUM(removed) AS c_rmv FROM wiki GROUP BY page, user; Incremental view maintenance algorithm (example) mv1 contents page user c_added c_rmv Justin Boxer 1800 25 Justin Reach 2912 42 Ke$ha Xeno 1953 17 Ke$ha Helz 3194 170 Miley Ashu 2232 34 page user … added removed … writeID … … … … … … … Miley Ashu … 68 16 … 10000 Justin Zaka … 392 239 … 10000 wiki contents New records ⇢ ALTER MATERIALIZED VIEW mv1 REBUILD;
  • 23.
    23 © HortonworksInc. 2011–2018. All rights reserved CREATE MATERIALIZED VIEW mv1 AS SELECT page, user, SUM(added) AS c_added, SUM(removed) AS c_rmv FROM wiki GROUP BY page, user; ① Rebuild statement rewriting INSERT OVERWRITE mv1 SELECT page, user, SUM(added) AS c_added, SUM(removed) AS c_rmv FROM ( SELECT page, user, c_added, c_removed FROM mv1 UNION ALL SELECT page, user, SUM(added) AS c_added, SUM(removed) AS c_rmv FROM wiki WHERE writeID > 9999 GROUP BY page, user) subq GROUP BY page, user; Incremental view maintenance algorithm (example) Rollup data mv1 contents page user c_added c_rmv Justin Boxer 1800 25 Justin Reach 2912 42 Ke$ha Xeno 1953 17 Ke$ha Helz 3194 170 Miley Ashu 2232 34 page user … added removed … writeID … … … … … … … Miley Ashu … 68 16 … 10000 Justin Zaka … 392 239 … 10000 wiki contents New records
  • 24.
    24 © HortonworksInc. 2011–2018. All rights reserved CREATE MATERIALIZED VIEW mv1 AS SELECT page, user, SUM(added) AS c_added, SUM(removed) AS c_rmv FROM wiki GROUP BY page, user; ② Rewrite INSERT OVERWRITE into MERGE statement MERGE INTO mv1 USING ( SELECT page, user, SUM(added) AS c_added, SUM(removed) AS c_rmv FROM wiki WHERE writeID > 9999 GROUP BY page, user) src ON mv1.page = src.page AND mv1.user = src.user WHEN MATCHED THEN UPDATE SET c_added = mv1.c_added + src.c_added, c_removed = mv1.c_removed + src.c_rmv WHEN NOT MATCHED THEN INSERT VALUES (page, user, c_added, c_rmv); Incremental view maintenance algorithm (example) mv1 contents page user c_added c_rmv Justin Boxer 1800 25 Justin Reach 2912 42 Ke$ha Xeno 1953 17 Ke$ha Helz 3194 170 Miley Ashu 2232 34 page user … added removed … writeID … … … … … … … Miley Ashu … 68 16 … 10000 Justin Zaka … 392 239 … 10000 wiki contents New records
  • 25.
    25 © HortonworksInc. 2011–2018. All rights reserved Materialized view lifecycle
  • 26.
    26 © HortonworksInc. 2011–2018. All rights reserved Management of materialized view lifecycle • Do not accept stale data (default) • If content of the materialized view is not fresh, we do not use it for automatic query rewriting • Still possible to trigger partial rewritings that read both the stale materialized view and new data from source tables • Accept stale data • Freshness defined as a time parameter • If MV was not rebuilt for a certain time period and there were changes in base tables, ignore • SET hive.materializedview.rewriting.time.window=10min; • Can also be overriden by a certain materialized view using table properties • Periodically rebuild materialized view, e.g., every 5 minutes t=0min t=10min t=20min Create MV Rebuild Rebuild Rebuild Rebuild t=5min t=15min
  • 27.
    27 © HortonworksInc. 2011–2018. All rights reserved Road ahead
  • 28.
    28 © HortonworksInc. 2011–2018. All rights reserved Road ahead • Improvements to current materialized views implementation • Rewriting performance and scalability • Single/many MVs • Control physical distribution of data • DISTRIBUTE BY, SORT BY, CLUSTER BY • Increase incremental view maintenance coverage • Support update/delete in source tables • Materialized view recommender • Ease the identification of access patterns for a given workload
  • 29.
    29 © HortonworksInc. 2011–2018. All rights reserved Demo
  • 30.
    30 © HortonworksInc. 2011–2018. All rights reserved Thank you https://cwiki.apache.org/confluence/display/Hive/Materialized+views

Editor's Notes

  • #3 Intro-Hive evolution from batch to interactive (modify a bit original slide)
  • #4 Important improvements in Hive in general Integration with Druid Mention improvements to optimization too, to link it with next slide about accelerating query processing using materializations
  • #5 Access patterns
  • #6 Access patterns
  • #7 A traditional technique to accelerating query execution is precalculation of materialized views Awareness of semantics enables materialized view rewriting and automatic maintenance of materialized views
  • #8 Possible workflow Three important points. You can query the materialized view as with any other table. Druid integration goes beyond materialized views: you can just query Druid from Hive. Materialized views do not work exclusively with Druid; in fact, we expect them to play well with LLAP.
  • #9 Work that we have done in Hive, main goals
  • #15 Implemented in Calcite, based on paper
  • #16 How to enable? Enabled by default in Hive 3.0, can alter materialized view to enable-disable
  • #18 Example 2 (materialized views exploit constraints)
  • #19 Example 3 (rollup based on time, richer semantics)
  • #21 Manual rebuild: full vs incremental
  • #22 Manual rebuild: full vs incremental
  • #23 Manual rebuild: full vs incremental
  • #24 Manual rebuild: full vs incremental
  • #25 Manual rebuild: full vs incremental
  • #27 Data freshness different options: fresh data vs accept data staleness
  • #29 Control physical distribution of data (distributed by, sorted by, cluster by) MV recommender From other slides, e.g, scaling as number of materialized views grow