SlideShare a Scribd company logo
1 © Hortonworks Inc. 2011–2018. All rights reserved
Accelerating query processing with
materialized views in Apache Hive
Jesús Camacho Rodríguez
DataWorks Summit Berlin
April 18, 2018
2 © Hortonworks Inc. 2011–2018. All rights reserved
Apache Hive
• Initial use case: batch processing
• Read-only data
• HiveQL (SQL-like query language)
• MapReduce
• Effort to take Hive beyond its batch processing roots
• Started in Apache Hive 0.10.0 (January 2013)
• Upcoming release: Apache Hive 3.0 (May 2018)
• Extensive renovation to improve three different axes
• Latency: allow interactive and sub-second queries
• Scalability: from TB to PB of data
• SQL support: move from HiveQL to SQL standard
3 © Hortonworks Inc. 2011–2018. All rights reserved
Apache Hive
• Multiple execution engines: Apache Tez and Apache Spark
• More efficient join execution algorithms
• Vectorized query execution
• Integration with columnar storage formats:
Apache ORC, Apache Parquet
• LLAP (Live Long and Process)
• Persistent deamons for low-latency queries
• Rule-based and cost-based optimizer
• Better statistics
• Tighter integration with other data processing systems: Druid
Important internals improvements
4 © Hortonworks Inc. 2011–2018. All rights reserved
Accelerating query processing
• Change data physical properties (distribute, sort)
• Filter rows
• Denormalize
• Preaggregate
Optimization based on access patterns
5 © Hortonworks Inc. 2011–2018. All rights reserved
Accelerating query processing
• Establish relationship between original and new tables
• Has a similar table already been created?
• Rewrite your queries to use new tables
• What happens when access patterns change?
• Maintain your new tables when original tables change
• Do I have to fully rebuild new tables?
Optimization based on access patterns
Currently, Hive users
have to do it manually
6 © Hortonworks Inc. 2011–2018. All rights reserved
Materialized views
• A materialized view is an entity that contains the result of a evaluating a query
• Important property  Awareness of the materialized view definition semantics
• Optimizer can exploit them for automatic query rewriting
• System can handle maintenance of the materialized views
• Generally, materializations can be created in different forms depending on the scope
• DBA writes “CREATE MATERIALIZED VIEW” statement
• Daemon creates materialized view based on recent query activity
• Cached result of previous similar query
• Query factorization identifies common pieces within a single query
7 © Hortonworks Inc. 2011–2018. All rights reserved
Possible workflow
1. Create materialized view using Hive tables
• Stored by Hive or Druid
2. User or dashboard sends queries to Hive
• Hive rewrites queries using available materialized views
• Execute rewitten query
Dashboards, BI tools
CREATE MATERIALIZED VIEW `ssb_mv`
STORED AS 'org.apache.hadoop.hive.druid.DruidStorageHandler'
ENABLE REWRITE
AS
<query>;
DBA, recommendation system
①
②
Data
Queries
8 © Hortonworks Inc. 2011–2018. All rights reserved
Materialized views in Apache Hive
• First implementation will be part of Apache Hive 3.0
• Multiple storage options: Hive, Druid
• Automatic rewriting of incoming queries to use materialized views
• Efficient view maintenance
• Incremental refresh
• Multiple options to control materialized views lifecycle
9 © Hortonworks Inc. 2011–2018. All rights reserved
Management of
materialized views in Hive
10 © Hortonworks Inc. 2011–2018. All rights reserved
Materialized view creation
• CREATE MATERIALIZED VIEW statement
CREATE MATERIALIZED VIEW [IF NOT EXISTS] [db_name.]materialized_view_name
[ENABLE REWRITE | DISABLE REWRITE]
[COMMENT materialized_view_comment]
[
[ROW FORMAT row_format]
[STORED AS file_format]
| STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]
]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)]
AS
<query>;
⇢ Supports custom table properties, storage format, etc.
11 © Hortonworks Inc. 2011–2018. All rights reserved
Materialized view creation (stored in Druid)
• CREATE MATERIALIZED VIEW statement
CREATE MATERIALIZED VIEW druid_wiki_mv
STORED AS 'org.apache.hadoop.hive.druid.DruidStorageHandler'
AS
SELECT __time, page, user, c_added, c_removed
FROM src;
Hive materialized view name
Hive storage handler classname
12 © Hortonworks Inc. 2011–2018. All rights reserved
Other operations for materialized view management
DROP MATERIALIZED VIEW [db_name.]materialized_view_name;
SHOW MATERIALIZED VIEWS [IN database_name] ['identifier_with_wildcards’];
DESCRIBE [EXTENDED | FORMATTED] [db_name.]materialized_view_name;
⇢ More operations to be added and extended
13 © Hortonworks Inc. 2011–2018. All rights reserved
Materialized view-based
query rewriting
14 © Hortonworks Inc. 2011–2018. All rights reserved
Materialized view-based rewriting algorithm
• Automatically rewrite incoming queries using materialized views
• Optimizer exploits materialized view definition semantics
• Built on the ideas presented in [GL01] using Apache Calcite
• Supports queries containing TableScan, Project, Filter, Join, Aggregate operators
• Includes some extensions
• Generation of additional rewritings without needing to do join permutation
• Partial rewritings using union operators
• More information about the rewriting coverage
• http://calcite.apache.org/docs/materialized_views#rewriting-using-plan-structural-information
[GL01] Jonathan Goldstein and Per-åke Larson. Optimizing queries using materialized views: A practical,
scalable solution. In Proc. ACM SIGMOD Conf., 2001.
15 © Hortonworks Inc. 2011–2018. All rights reserved
Enable materialized view-based rewriting
• Global property to enable materialized view rewriting for queries
SET hive.materializedview.rewriting=true;
• User can selectively use enable/disable materialized views for rewriting
• Materialized views are enabled by default for rewriting
• Behavior can be altered after materialized view has been created
ALTER MATERIALIZED VIEW [db_name.]materialized_view_name ENABLE|DISABLE REWRITE;
16 © Hortonworks Inc. 2011–2018. All rights reserved
depts
Materialized view-based rewriting (example)
• Materialized view definition
Employees that were hired after 2016
CREATE MATERIALIZED VIEW mv
AS
SELECT empid, deptname, hire_date
FROM emps JOIN depts
ON (emps.deptno = depts.deptno)
WHERE hire_date >= '2016-01-01';
• Query
Employees that were hired last quarter
SELECT empid, deptname
FROM emps JOIN depts
ON (emps.deptno = depts.deptno)
WHERE hire_date >= '2018-01-01'
AND hire_date <= '2018-03-31';
• Materialized view-based rewriting
SELECT empid, deptname
FROM mv
WHERE hire_date >= '2018-01-01'
AND hire_date <= '2018-03-31';
deptsemps
empid depname hire_date
10001 IT 2016-03-01
10002 IT 2017-01-02
10003 HR 2017-07-01
10004 Finance 2018-01-15
10005 HR 2018-02-02
mv contents
empid depname
10004 Finance
10005 HR
Query results
17 © Hortonworks Inc. 2011–2018. All rights reserved
Materialized view-based rewriting (example 2)
• Materialized view definition
CREATE MATERIALIZED VIEW mv AS
SELECT <dims>,
lo_revenue,
lo_extprice * lo_disc AS d_price,
lo_revenue - lo_supplycost,
FROM
customer, dates, lineorder, part, supplier
WHERE
lo_orderdate = d_datekey
and lo_partkey = p_partkey
and lo_suppkey = s_suppkey
and lo_custkey = c_custkey;
• Query
SELECT sum(lo_extendedprice*lo_discount)
FROM
lineorder, dates
WHERE
lo_orderdate = d_datekey
and d_year = 2013
and lo_discount between 1 and 3;
• Materialized view-based rewriting
SELECT SUM(d_price)
FROM mv
WHERE
d_year = 2013
and lo_discount between 1 and 3;
supplier
part
dates
customerlineorder
Exploit SQL PK-FK and
NOT NULL constraints
d_year lo_discount <dims> d_price
2013 2 ... 7.55
2014 4 ... 432.60
2013 2 ... 34.45
2012 2 ... 2.05
… … ... …
mv contents
sum
42.0
…
Query results
18 © Hortonworks Inc. 2011–2018. All rights reserved
Materialized view-based rewriting (example 3)
• Materialized view definition
CREATE MATERIALIZED VIEW mv AS
SELECT floor(time to minute), page,
SUM(added) AS c_added,
SUM(removed) AS c_rmv
FROM wiki
GROUP BY floor(time to minute), page;
• Query
SELECT floor(time to month),
SUM(added) AS c_added
FROM wiki
GROUP BY floor(time to month);
• Materialized view-based rewriting
SELECT floor(time to month),
SUM(c_added) as c_added
FROM mv
GROUP BY floor(time to month);
wiki
__time page c_added c_rmv
2011-01-01 01:05:00 Justin 1800 25
2011-01-20 19:00:00 Justin 2912 42
2011-01-01 11:06:00 Ke$ha 1953 17
2011-02-02 13:15:00 Ke$ha 3194 170
2011-01-02 18:00:00 Miley 2232 34
mv contents
__time c_added
2011-01-01 00:00:00 8897
2011-02-01 00:00:00 3194
Query results
19 © Hortonworks Inc. 2011–2018. All rights reserved
Materialized view
maintenance
20 © Hortonworks Inc. 2011–2018. All rights reserved
Rebuilding materialized views
• Rebuild needs to be triggered manually by user
ALTER MATERIALIZED VIEW [db_name.]materialized_view_name REBUILD;
• Incremental materialized view maintenance
• Only refresh data that has changed in source tables
• Multiple benefits
• Decrease rebuild step execution time
• Preserves LLAP cache for existing data
• Materialized view should only use transactional tables (micromanaged or ACID)
• Current implementation only supports incremental rebuild for insert operations
• Update/delete operations force full rebuild
• Optimizer will attempt incremental rebuild
• Otherwise, fallback to full rebuild (INSERT OVERWRITE with MV definition)
21 © Hortonworks Inc. 2011–2018. All rights reserved
Incremental view maintenance algorithm
• Relies on materialized view rewriting algorithm
• Materialized view stores write ID for its tables when it is created/refreshed
• Write ID associates rows with transactions
• When rebuild is triggered, introduce filter condition on write ID column in MV definition
• Read only new rows from source tables
• Execute materialized view rewriting
• Rewrite INSERT OVERWRITE (full rebuild) into more efficient plan
• INSERT (table scan, filter, project, join)
• MERGE (table scan, filter, project, join, aggregate)
22 © Hortonworks Inc. 2011–2018. All rights reserved
CREATE MATERIALIZED VIEW mv1 AS
SELECT page, user,
SUM(added) AS c_added,
SUM(removed) AS c_rmv
FROM wiki
GROUP BY page, user;
Incremental view maintenance algorithm (example)
mv1 contents
page user c_added c_rmv
Justin Boxer 1800 25
Justin Reach 2912 42
Ke$ha Xeno 1953 17
Ke$ha Helz 3194 170
Miley Ashu 2232 34
page user … added removed … writeID
… … … … … … …
Miley Ashu … 68 16 … 10000
Justin Zaka … 392 239 … 10000
wiki contents
New records
⇢ ALTER MATERIALIZED VIEW mv1 REBUILD;
23 © Hortonworks Inc. 2011–2018. All rights reserved
CREATE MATERIALIZED VIEW mv1 AS
SELECT page, user,
SUM(added) AS c_added,
SUM(removed) AS c_rmv
FROM wiki
GROUP BY page, user;
① Rebuild statement rewriting
INSERT OVERWRITE mv1
SELECT page, user, SUM(added) AS c_added, SUM(removed) AS c_rmv
FROM (
SELECT page, user, c_added, c_removed
FROM mv1
UNION ALL
SELECT page, user, SUM(added) AS c_added, SUM(removed) AS c_rmv
FROM wiki
WHERE writeID > 9999
GROUP BY page, user) subq
GROUP BY page, user;
Incremental view maintenance algorithm (example)
Rollup data
mv1 contents
page user c_added c_rmv
Justin Boxer 1800 25
Justin Reach 2912 42
Ke$ha Xeno 1953 17
Ke$ha Helz 3194 170
Miley Ashu 2232 34
page user … added removed … writeID
… … … … … … …
Miley Ashu … 68 16 … 10000
Justin Zaka … 392 239 … 10000
wiki contents
New records
24 © Hortonworks Inc. 2011–2018. All rights reserved
CREATE MATERIALIZED VIEW mv1 AS
SELECT page, user,
SUM(added) AS c_added,
SUM(removed) AS c_rmv
FROM wiki
GROUP BY page, user;
② Rewrite INSERT OVERWRITE into MERGE statement
MERGE INTO mv1
USING (
SELECT page, user, SUM(added) AS c_added, SUM(removed) AS c_rmv
FROM wiki
WHERE writeID > 9999
GROUP BY page, user) src
ON mv1.page = src.page AND mv1.user = src.user
WHEN MATCHED
THEN UPDATE SET c_added = mv1.c_added + src.c_added,
c_removed = mv1.c_removed + src.c_rmv
WHEN NOT MATCHED
THEN INSERT VALUES (page, user, c_added, c_rmv);
Incremental view maintenance algorithm (example)
mv1 contents
page user c_added c_rmv
Justin Boxer 1800 25
Justin Reach 2912 42
Ke$ha Xeno 1953 17
Ke$ha Helz 3194 170
Miley Ashu 2232 34
page user … added removed … writeID
… … … … … … …
Miley Ashu … 68 16 … 10000
Justin Zaka … 392 239 … 10000
wiki contents
New records
25 © Hortonworks Inc. 2011–2018. All rights reserved
Materialized view lifecycle
26 © Hortonworks Inc. 2011–2018. All rights reserved
Management of materialized view lifecycle
• Do not accept stale data (default)
• If content of the materialized view is not fresh, we do not use it for automatic query rewriting
• Still possible to trigger partial rewritings that read both the stale materialized view and new data
from source tables
• Accept stale data
• Freshness defined as a time parameter
• If MV was not rebuilt for a certain time period and there were changes in base tables, ignore
• SET hive.materializedview.rewriting.time.window=10min;
• Can also be overriden by a certain materialized view using table properties
• Periodically rebuild materialized view, e.g., every 5 minutes
t=0min t=10min t=20min
Create MV Rebuild Rebuild Rebuild Rebuild
t=5min t=15min
27 © Hortonworks Inc. 2011–2018. All rights reserved
Road ahead
28 © Hortonworks Inc. 2011–2018. All rights reserved
Road ahead
• Improvements to current materialized views implementation
• Rewriting performance and scalability
• Single/many MVs
• Control physical distribution of data
• DISTRIBUTE BY, SORT BY, CLUSTER BY
• Increase incremental view maintenance coverage
• Support update/delete in source tables
• Materialized view recommender
• Ease the identification of access patterns for a given workload
29 © Hortonworks Inc. 2011–2018. All rights reserved
Demo
30 © Hortonworks Inc. 2011–2018. All rights reserved
Thank you
https://cwiki.apache.org/confluence/display/Hive/Materialized+views

More Related Content

What's hot

The Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemThe Apache Spark File Format Ecosystem
The Apache Spark File Format Ecosystem
Databricks
 
ksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database SystemksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database System
confluent
 
Dynamic Allocation in Spark
Dynamic Allocation in SparkDynamic Allocation in Spark
Dynamic Allocation in SparkDatabricks
 
ORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big DataORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big Data
DataWorks Summit
 
Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things Right
Databricks
 
Performance Analysis of Apache Spark and Presto in Cloud Environments
Performance Analysis of Apache Spark and Presto in Cloud EnvironmentsPerformance Analysis of Apache Spark and Presto in Cloud Environments
Performance Analysis of Apache Spark and Presto in Cloud Environments
Databricks
 
Securing Hadoop with Apache Ranger
Securing Hadoop with Apache RangerSecuring Hadoop with Apache Ranger
Securing Hadoop with Apache Ranger
DataWorks Summit
 
Apache NiFi in the Hadoop Ecosystem
Apache NiFi in the Hadoop Ecosystem Apache NiFi in the Hadoop Ecosystem
Apache NiFi in the Hadoop Ecosystem
DataWorks Summit/Hadoop Summit
 
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxData
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Databricks
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Dremio Corporation
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Databricks
 
Optimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkOptimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache Spark
Databricks
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Flink Forward
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm Chandler Huang
 
Building large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBuilding large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudi
Bill Liu
 
Delta: Building Merge on Read
Delta: Building Merge on ReadDelta: Building Merge on Read
Delta: Building Merge on Read
Databricks
 
The Flux Capacitor of Kafka Streams and ksqlDB (Matthias J. Sax, Confluent) K...
The Flux Capacitor of Kafka Streams and ksqlDB (Matthias J. Sax, Confluent) K...The Flux Capacitor of Kafka Streams and ksqlDB (Matthias J. Sax, Confluent) K...
The Flux Capacitor of Kafka Streams and ksqlDB (Matthias J. Sax, Confluent) K...
HostedbyConfluent
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internals
Kostas Tzoumas
 
Getting Started with Confluent Schema Registry
Getting Started with Confluent Schema RegistryGetting Started with Confluent Schema Registry
Getting Started with Confluent Schema Registry
confluent
 

What's hot (20)

The Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemThe Apache Spark File Format Ecosystem
The Apache Spark File Format Ecosystem
 
ksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database SystemksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database System
 
Dynamic Allocation in Spark
Dynamic Allocation in SparkDynamic Allocation in Spark
Dynamic Allocation in Spark
 
ORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big DataORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big Data
 
Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things Right
 
Performance Analysis of Apache Spark and Presto in Cloud Environments
Performance Analysis of Apache Spark and Presto in Cloud EnvironmentsPerformance Analysis of Apache Spark and Presto in Cloud Environments
Performance Analysis of Apache Spark and Presto in Cloud Environments
 
Securing Hadoop with Apache Ranger
Securing Hadoop with Apache RangerSecuring Hadoop with Apache Ranger
Securing Hadoop with Apache Ranger
 
Apache NiFi in the Hadoop Ecosystem
Apache NiFi in the Hadoop Ecosystem Apache NiFi in the Hadoop Ecosystem
Apache NiFi in the Hadoop Ecosystem
 
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Optimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkOptimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache Spark
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
 
Building large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBuilding large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudi
 
Delta: Building Merge on Read
Delta: Building Merge on ReadDelta: Building Merge on Read
Delta: Building Merge on Read
 
The Flux Capacitor of Kafka Streams and ksqlDB (Matthias J. Sax, Confluent) K...
The Flux Capacitor of Kafka Streams and ksqlDB (Matthias J. Sax, Confluent) K...The Flux Capacitor of Kafka Streams and ksqlDB (Matthias J. Sax, Confluent) K...
The Flux Capacitor of Kafka Streams and ksqlDB (Matthias J. Sax, Confluent) K...
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internals
 
Getting Started with Confluent Schema Registry
Getting Started with Confluent Schema RegistryGetting Started with Confluent Schema Registry
Getting Started with Confluent Schema Registry
 

Similar to Accelerating query processing with materialized views in Apache Hive

Accelerating query processing with materialized views in Apache Hive
Accelerating query processing with materialized views in Apache HiveAccelerating query processing with materialized views in Apache Hive
Accelerating query processing with materialized views in Apache Hive
Sahil Takiar
 
Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?
DataWorks Summit
 
Hive Performance Dataworks Summit Melbourne February 2019
Hive Performance Dataworks Summit Melbourne February 2019Hive Performance Dataworks Summit Melbourne February 2019
Hive Performance Dataworks Summit Melbourne February 2019
alanfgates
 
Fast SQL on Hadoop, Really?
Fast SQL on Hadoop, Really?Fast SQL on Hadoop, Really?
Fast SQL on Hadoop, Really?
DataWorks Summit
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - Tokyo
DataWorks Summit
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?
DataWorks Summit
 
Hive 3 New Horizons DataWorks Summit Melbourne February 2019
Hive 3 New Horizons DataWorks Summit Melbourne February 2019Hive 3 New Horizons DataWorks Summit Melbourne February 2019
Hive 3 New Horizons DataWorks Summit Melbourne February 2019
alanfgates
 
What is New in Apache Hive 3.0?
What is New in Apache Hive 3.0?What is New in Apache Hive 3.0?
What is New in Apache Hive 3.0?
DataWorks Summit
 
Hive 3 - a new horizon
Hive 3 - a new horizonHive 3 - a new horizon
Hive 3 - a new horizon
Thejas Nair
 
Hive 3 a new horizon
Hive 3  a new horizonHive 3  a new horizon
Hive 3 a new horizon
Abdelkrim Hadjidj
 
What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?
DataWorks Summit
 
OpenNTF Webinar, October 2020
OpenNTF Webinar, October 2020OpenNTF Webinar, October 2020
OpenNTF Webinar, October 2020
Howard Greenberg
 
Liquibase – a time machine for your data
Liquibase – a time machine for your dataLiquibase – a time machine for your data
Liquibase – a time machine for your data
Neev Technologies
 
Hive 3 a new horizon
Hive 3  a new horizonHive 3  a new horizon
Hive 3 a new horizon
Artem Ervits
 
Oracle DV V4 new features overview
Oracle DV V4 new features overviewOracle DV V4 new features overview
Oracle DV V4 new features overview
Philippe Lions
 
The Central View of your Data with Postgres
The Central View of your Data with PostgresThe Central View of your Data with Postgres
The Central View of your Data with Postgres
EDB
 
Sql on everything with drill
Sql on everything with drillSql on everything with drill
Sql on everything with drill
Julien Le Dem
 
Developing Kafka Streams Applications with Upgradability in Mind with Neil Bu...
Developing Kafka Streams Applications with Upgradability in Mind with Neil Bu...Developing Kafka Streams Applications with Upgradability in Mind with Neil Bu...
Developing Kafka Streams Applications with Upgradability in Mind with Neil Bu...
HostedbyConfluent
 
Ansible Meetup FI - Ansible use cases with enterprise application
Ansible Meetup FI - Ansible use cases with enterprise application Ansible Meetup FI - Ansible use cases with enterprise application
Ansible Meetup FI - Ansible use cases with enterprise application
Tanja REPO 🦊
 
SharePoint 2014: Where to save my data, for devs!
SharePoint 2014: Where to save my data, for devs!SharePoint 2014: Where to save my data, for devs!
SharePoint 2014: Where to save my data, for devs!Ben Steinhauser
 

Similar to Accelerating query processing with materialized views in Apache Hive (20)

Accelerating query processing with materialized views in Apache Hive
Accelerating query processing with materialized views in Apache HiveAccelerating query processing with materialized views in Apache Hive
Accelerating query processing with materialized views in Apache Hive
 
Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?
 
Hive Performance Dataworks Summit Melbourne February 2019
Hive Performance Dataworks Summit Melbourne February 2019Hive Performance Dataworks Summit Melbourne February 2019
Hive Performance Dataworks Summit Melbourne February 2019
 
Fast SQL on Hadoop, Really?
Fast SQL on Hadoop, Really?Fast SQL on Hadoop, Really?
Fast SQL on Hadoop, Really?
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - Tokyo
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?
 
Hive 3 New Horizons DataWorks Summit Melbourne February 2019
Hive 3 New Horizons DataWorks Summit Melbourne February 2019Hive 3 New Horizons DataWorks Summit Melbourne February 2019
Hive 3 New Horizons DataWorks Summit Melbourne February 2019
 
What is New in Apache Hive 3.0?
What is New in Apache Hive 3.0?What is New in Apache Hive 3.0?
What is New in Apache Hive 3.0?
 
Hive 3 - a new horizon
Hive 3 - a new horizonHive 3 - a new horizon
Hive 3 - a new horizon
 
Hive 3 a new horizon
Hive 3  a new horizonHive 3  a new horizon
Hive 3 a new horizon
 
What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?
 
OpenNTF Webinar, October 2020
OpenNTF Webinar, October 2020OpenNTF Webinar, October 2020
OpenNTF Webinar, October 2020
 
Liquibase – a time machine for your data
Liquibase – a time machine for your dataLiquibase – a time machine for your data
Liquibase – a time machine for your data
 
Hive 3 a new horizon
Hive 3  a new horizonHive 3  a new horizon
Hive 3 a new horizon
 
Oracle DV V4 new features overview
Oracle DV V4 new features overviewOracle DV V4 new features overview
Oracle DV V4 new features overview
 
The Central View of your Data with Postgres
The Central View of your Data with PostgresThe Central View of your Data with Postgres
The Central View of your Data with Postgres
 
Sql on everything with drill
Sql on everything with drillSql on everything with drill
Sql on everything with drill
 
Developing Kafka Streams Applications with Upgradability in Mind with Neil Bu...
Developing Kafka Streams Applications with Upgradability in Mind with Neil Bu...Developing Kafka Streams Applications with Upgradability in Mind with Neil Bu...
Developing Kafka Streams Applications with Upgradability in Mind with Neil Bu...
 
Ansible Meetup FI - Ansible use cases with enterprise application
Ansible Meetup FI - Ansible use cases with enterprise application Ansible Meetup FI - Ansible use cases with enterprise application
Ansible Meetup FI - Ansible use cases with enterprise application
 
SharePoint 2014: Where to save my data, for devs!
SharePoint 2014: Where to save my data, for devs!SharePoint 2014: Where to save my data, for devs!
SharePoint 2014: Where to save my data, for devs!
 

More from DataWorks Summit

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
DataWorks Summit
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
DataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
DataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
DataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
DataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
DataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
DataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
DataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
DataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
DataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
DataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
DataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
James Anderson
 
GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...
ThomasParaiso2
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
Dorra BARTAGUIZ
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
Neo4j
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
Neo4j
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
nkrafacyberclub
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Nexer Digital
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 

Recently uploaded (20)

Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 

Accelerating query processing with materialized views in Apache Hive

  • 1. 1 © Hortonworks Inc. 2011–2018. All rights reserved Accelerating query processing with materialized views in Apache Hive Jesús Camacho Rodríguez DataWorks Summit Berlin April 18, 2018
  • 2. 2 © Hortonworks Inc. 2011–2018. All rights reserved Apache Hive • Initial use case: batch processing • Read-only data • HiveQL (SQL-like query language) • MapReduce • Effort to take Hive beyond its batch processing roots • Started in Apache Hive 0.10.0 (January 2013) • Upcoming release: Apache Hive 3.0 (May 2018) • Extensive renovation to improve three different axes • Latency: allow interactive and sub-second queries • Scalability: from TB to PB of data • SQL support: move from HiveQL to SQL standard
  • 3. 3 © Hortonworks Inc. 2011–2018. All rights reserved Apache Hive • Multiple execution engines: Apache Tez and Apache Spark • More efficient join execution algorithms • Vectorized query execution • Integration with columnar storage formats: Apache ORC, Apache Parquet • LLAP (Live Long and Process) • Persistent deamons for low-latency queries • Rule-based and cost-based optimizer • Better statistics • Tighter integration with other data processing systems: Druid Important internals improvements
  • 4. 4 © Hortonworks Inc. 2011–2018. All rights reserved Accelerating query processing • Change data physical properties (distribute, sort) • Filter rows • Denormalize • Preaggregate Optimization based on access patterns
  • 5. 5 © Hortonworks Inc. 2011–2018. All rights reserved Accelerating query processing • Establish relationship between original and new tables • Has a similar table already been created? • Rewrite your queries to use new tables • What happens when access patterns change? • Maintain your new tables when original tables change • Do I have to fully rebuild new tables? Optimization based on access patterns Currently, Hive users have to do it manually
  • 6. 6 © Hortonworks Inc. 2011–2018. All rights reserved Materialized views • A materialized view is an entity that contains the result of a evaluating a query • Important property  Awareness of the materialized view definition semantics • Optimizer can exploit them for automatic query rewriting • System can handle maintenance of the materialized views • Generally, materializations can be created in different forms depending on the scope • DBA writes “CREATE MATERIALIZED VIEW” statement • Daemon creates materialized view based on recent query activity • Cached result of previous similar query • Query factorization identifies common pieces within a single query
  • 7. 7 © Hortonworks Inc. 2011–2018. All rights reserved Possible workflow 1. Create materialized view using Hive tables • Stored by Hive or Druid 2. User or dashboard sends queries to Hive • Hive rewrites queries using available materialized views • Execute rewitten query Dashboards, BI tools CREATE MATERIALIZED VIEW `ssb_mv` STORED AS 'org.apache.hadoop.hive.druid.DruidStorageHandler' ENABLE REWRITE AS <query>; DBA, recommendation system ① ② Data Queries
  • 8. 8 © Hortonworks Inc. 2011–2018. All rights reserved Materialized views in Apache Hive • First implementation will be part of Apache Hive 3.0 • Multiple storage options: Hive, Druid • Automatic rewriting of incoming queries to use materialized views • Efficient view maintenance • Incremental refresh • Multiple options to control materialized views lifecycle
  • 9. 9 © Hortonworks Inc. 2011–2018. All rights reserved Management of materialized views in Hive
  • 10. 10 © Hortonworks Inc. 2011–2018. All rights reserved Materialized view creation • CREATE MATERIALIZED VIEW statement CREATE MATERIALIZED VIEW [IF NOT EXISTS] [db_name.]materialized_view_name [ENABLE REWRITE | DISABLE REWRITE] [COMMENT materialized_view_comment] [ [ROW FORMAT row_format] [STORED AS file_format] | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)] ] [LOCATION hdfs_path] [TBLPROPERTIES (property_name=property_value, ...)] AS <query>; ⇢ Supports custom table properties, storage format, etc.
  • 11. 11 © Hortonworks Inc. 2011–2018. All rights reserved Materialized view creation (stored in Druid) • CREATE MATERIALIZED VIEW statement CREATE MATERIALIZED VIEW druid_wiki_mv STORED AS 'org.apache.hadoop.hive.druid.DruidStorageHandler' AS SELECT __time, page, user, c_added, c_removed FROM src; Hive materialized view name Hive storage handler classname
  • 12. 12 © Hortonworks Inc. 2011–2018. All rights reserved Other operations for materialized view management DROP MATERIALIZED VIEW [db_name.]materialized_view_name; SHOW MATERIALIZED VIEWS [IN database_name] ['identifier_with_wildcards’]; DESCRIBE [EXTENDED | FORMATTED] [db_name.]materialized_view_name; ⇢ More operations to be added and extended
  • 13. 13 © Hortonworks Inc. 2011–2018. All rights reserved Materialized view-based query rewriting
  • 14. 14 © Hortonworks Inc. 2011–2018. All rights reserved Materialized view-based rewriting algorithm • Automatically rewrite incoming queries using materialized views • Optimizer exploits materialized view definition semantics • Built on the ideas presented in [GL01] using Apache Calcite • Supports queries containing TableScan, Project, Filter, Join, Aggregate operators • Includes some extensions • Generation of additional rewritings without needing to do join permutation • Partial rewritings using union operators • More information about the rewriting coverage • http://calcite.apache.org/docs/materialized_views#rewriting-using-plan-structural-information [GL01] Jonathan Goldstein and Per-åke Larson. Optimizing queries using materialized views: A practical, scalable solution. In Proc. ACM SIGMOD Conf., 2001.
  • 15. 15 © Hortonworks Inc. 2011–2018. All rights reserved Enable materialized view-based rewriting • Global property to enable materialized view rewriting for queries SET hive.materializedview.rewriting=true; • User can selectively use enable/disable materialized views for rewriting • Materialized views are enabled by default for rewriting • Behavior can be altered after materialized view has been created ALTER MATERIALIZED VIEW [db_name.]materialized_view_name ENABLE|DISABLE REWRITE;
  • 16. 16 © Hortonworks Inc. 2011–2018. All rights reserved depts Materialized view-based rewriting (example) • Materialized view definition Employees that were hired after 2016 CREATE MATERIALIZED VIEW mv AS SELECT empid, deptname, hire_date FROM emps JOIN depts ON (emps.deptno = depts.deptno) WHERE hire_date >= '2016-01-01'; • Query Employees that were hired last quarter SELECT empid, deptname FROM emps JOIN depts ON (emps.deptno = depts.deptno) WHERE hire_date >= '2018-01-01' AND hire_date <= '2018-03-31'; • Materialized view-based rewriting SELECT empid, deptname FROM mv WHERE hire_date >= '2018-01-01' AND hire_date <= '2018-03-31'; deptsemps empid depname hire_date 10001 IT 2016-03-01 10002 IT 2017-01-02 10003 HR 2017-07-01 10004 Finance 2018-01-15 10005 HR 2018-02-02 mv contents empid depname 10004 Finance 10005 HR Query results
  • 17. 17 © Hortonworks Inc. 2011–2018. All rights reserved Materialized view-based rewriting (example 2) • Materialized view definition CREATE MATERIALIZED VIEW mv AS SELECT <dims>, lo_revenue, lo_extprice * lo_disc AS d_price, lo_revenue - lo_supplycost, FROM customer, dates, lineorder, part, supplier WHERE lo_orderdate = d_datekey and lo_partkey = p_partkey and lo_suppkey = s_suppkey and lo_custkey = c_custkey; • Query SELECT sum(lo_extendedprice*lo_discount) FROM lineorder, dates WHERE lo_orderdate = d_datekey and d_year = 2013 and lo_discount between 1 and 3; • Materialized view-based rewriting SELECT SUM(d_price) FROM mv WHERE d_year = 2013 and lo_discount between 1 and 3; supplier part dates customerlineorder Exploit SQL PK-FK and NOT NULL constraints d_year lo_discount <dims> d_price 2013 2 ... 7.55 2014 4 ... 432.60 2013 2 ... 34.45 2012 2 ... 2.05 … … ... … mv contents sum 42.0 … Query results
  • 18. 18 © Hortonworks Inc. 2011–2018. All rights reserved Materialized view-based rewriting (example 3) • Materialized view definition CREATE MATERIALIZED VIEW mv AS SELECT floor(time to minute), page, SUM(added) AS c_added, SUM(removed) AS c_rmv FROM wiki GROUP BY floor(time to minute), page; • Query SELECT floor(time to month), SUM(added) AS c_added FROM wiki GROUP BY floor(time to month); • Materialized view-based rewriting SELECT floor(time to month), SUM(c_added) as c_added FROM mv GROUP BY floor(time to month); wiki __time page c_added c_rmv 2011-01-01 01:05:00 Justin 1800 25 2011-01-20 19:00:00 Justin 2912 42 2011-01-01 11:06:00 Ke$ha 1953 17 2011-02-02 13:15:00 Ke$ha 3194 170 2011-01-02 18:00:00 Miley 2232 34 mv contents __time c_added 2011-01-01 00:00:00 8897 2011-02-01 00:00:00 3194 Query results
  • 19. 19 © Hortonworks Inc. 2011–2018. All rights reserved Materialized view maintenance
  • 20. 20 © Hortonworks Inc. 2011–2018. All rights reserved Rebuilding materialized views • Rebuild needs to be triggered manually by user ALTER MATERIALIZED VIEW [db_name.]materialized_view_name REBUILD; • Incremental materialized view maintenance • Only refresh data that has changed in source tables • Multiple benefits • Decrease rebuild step execution time • Preserves LLAP cache for existing data • Materialized view should only use transactional tables (micromanaged or ACID) • Current implementation only supports incremental rebuild for insert operations • Update/delete operations force full rebuild • Optimizer will attempt incremental rebuild • Otherwise, fallback to full rebuild (INSERT OVERWRITE with MV definition)
  • 21. 21 © Hortonworks Inc. 2011–2018. All rights reserved Incremental view maintenance algorithm • Relies on materialized view rewriting algorithm • Materialized view stores write ID for its tables when it is created/refreshed • Write ID associates rows with transactions • When rebuild is triggered, introduce filter condition on write ID column in MV definition • Read only new rows from source tables • Execute materialized view rewriting • Rewrite INSERT OVERWRITE (full rebuild) into more efficient plan • INSERT (table scan, filter, project, join) • MERGE (table scan, filter, project, join, aggregate)
  • 22. 22 © Hortonworks Inc. 2011–2018. All rights reserved CREATE MATERIALIZED VIEW mv1 AS SELECT page, user, SUM(added) AS c_added, SUM(removed) AS c_rmv FROM wiki GROUP BY page, user; Incremental view maintenance algorithm (example) mv1 contents page user c_added c_rmv Justin Boxer 1800 25 Justin Reach 2912 42 Ke$ha Xeno 1953 17 Ke$ha Helz 3194 170 Miley Ashu 2232 34 page user … added removed … writeID … … … … … … … Miley Ashu … 68 16 … 10000 Justin Zaka … 392 239 … 10000 wiki contents New records ⇢ ALTER MATERIALIZED VIEW mv1 REBUILD;
  • 23. 23 © Hortonworks Inc. 2011–2018. All rights reserved CREATE MATERIALIZED VIEW mv1 AS SELECT page, user, SUM(added) AS c_added, SUM(removed) AS c_rmv FROM wiki GROUP BY page, user; ① Rebuild statement rewriting INSERT OVERWRITE mv1 SELECT page, user, SUM(added) AS c_added, SUM(removed) AS c_rmv FROM ( SELECT page, user, c_added, c_removed FROM mv1 UNION ALL SELECT page, user, SUM(added) AS c_added, SUM(removed) AS c_rmv FROM wiki WHERE writeID > 9999 GROUP BY page, user) subq GROUP BY page, user; Incremental view maintenance algorithm (example) Rollup data mv1 contents page user c_added c_rmv Justin Boxer 1800 25 Justin Reach 2912 42 Ke$ha Xeno 1953 17 Ke$ha Helz 3194 170 Miley Ashu 2232 34 page user … added removed … writeID … … … … … … … Miley Ashu … 68 16 … 10000 Justin Zaka … 392 239 … 10000 wiki contents New records
  • 24. 24 © Hortonworks Inc. 2011–2018. All rights reserved CREATE MATERIALIZED VIEW mv1 AS SELECT page, user, SUM(added) AS c_added, SUM(removed) AS c_rmv FROM wiki GROUP BY page, user; ② Rewrite INSERT OVERWRITE into MERGE statement MERGE INTO mv1 USING ( SELECT page, user, SUM(added) AS c_added, SUM(removed) AS c_rmv FROM wiki WHERE writeID > 9999 GROUP BY page, user) src ON mv1.page = src.page AND mv1.user = src.user WHEN MATCHED THEN UPDATE SET c_added = mv1.c_added + src.c_added, c_removed = mv1.c_removed + src.c_rmv WHEN NOT MATCHED THEN INSERT VALUES (page, user, c_added, c_rmv); Incremental view maintenance algorithm (example) mv1 contents page user c_added c_rmv Justin Boxer 1800 25 Justin Reach 2912 42 Ke$ha Xeno 1953 17 Ke$ha Helz 3194 170 Miley Ashu 2232 34 page user … added removed … writeID … … … … … … … Miley Ashu … 68 16 … 10000 Justin Zaka … 392 239 … 10000 wiki contents New records
  • 25. 25 © Hortonworks Inc. 2011–2018. All rights reserved Materialized view lifecycle
  • 26. 26 © Hortonworks Inc. 2011–2018. All rights reserved Management of materialized view lifecycle • Do not accept stale data (default) • If content of the materialized view is not fresh, we do not use it for automatic query rewriting • Still possible to trigger partial rewritings that read both the stale materialized view and new data from source tables • Accept stale data • Freshness defined as a time parameter • If MV was not rebuilt for a certain time period and there were changes in base tables, ignore • SET hive.materializedview.rewriting.time.window=10min; • Can also be overriden by a certain materialized view using table properties • Periodically rebuild materialized view, e.g., every 5 minutes t=0min t=10min t=20min Create MV Rebuild Rebuild Rebuild Rebuild t=5min t=15min
  • 27. 27 © Hortonworks Inc. 2011–2018. All rights reserved Road ahead
  • 28. 28 © Hortonworks Inc. 2011–2018. All rights reserved Road ahead • Improvements to current materialized views implementation • Rewriting performance and scalability • Single/many MVs • Control physical distribution of data • DISTRIBUTE BY, SORT BY, CLUSTER BY • Increase incremental view maintenance coverage • Support update/delete in source tables • Materialized view recommender • Ease the identification of access patterns for a given workload
  • 29. 29 © Hortonworks Inc. 2011–2018. All rights reserved Demo
  • 30. 30 © Hortonworks Inc. 2011–2018. All rights reserved Thank you https://cwiki.apache.org/confluence/display/Hive/Materialized+views

Editor's Notes

  1. Intro-Hive evolution from batch to interactive (modify a bit original slide)
  2. Important improvements in Hive in general Integration with Druid Mention improvements to optimization too, to link it with next slide about accelerating query processing using materializations
  3. Access patterns
  4. Access patterns
  5. A traditional technique to accelerating query execution is precalculation of materialized views Awareness of semantics enables materialized view rewriting and automatic maintenance of materialized views
  6. Possible workflow Three important points. You can query the materialized view as with any other table. Druid integration goes beyond materialized views: you can just query Druid from Hive. Materialized views do not work exclusively with Druid; in fact, we expect them to play well with LLAP.
  7. Work that we have done in Hive, main goals
  8. Implemented in Calcite, based on paper
  9. How to enable? Enabled by default in Hive 3.0, can alter materialized view to enable-disable
  10. Example 2 (materialized views exploit constraints)
  11. Example 3 (rollup based on time, richer semantics)
  12. Manual rebuild: full vs incremental
  13. Manual rebuild: full vs incremental
  14. Manual rebuild: full vs incremental
  15. Manual rebuild: full vs incremental
  16. Manual rebuild: full vs incremental
  17. Data freshness different options: fresh data vs accept data staleness
  18. Control physical distribution of data (distributed by, sorted by, cluster by) MV recommender From other slides, e.g, scaling as number of materialized views grow