SlideShare a Scribd company logo
Practical Performance Analysis
and Tuning for Cloudera
Impala
Headline Goes Here
Marcel Kornacker | marcel@cloudera.com 
Speaker Name or Subhead Goes Here
2013-11-11

Copyright © 2013 Cloudera Inc. All rights reserved.

Copyright © 2013 Cloudera Inc.
All rights reserved.
Just how fast is Impala?
You might just be surprised

!2
How fast is Impala?

“DeWitt Clause” prohibits
using DBMS vendor name
!3

Copyright © 2013 Cloudera Inc. All rights reserved.
Practical Performance
Running with Impala

!4
Pre-execution Checklist
• Data

types
• Partitioning
• File Format

!5

Copyright © 2013 Cloudera Inc. All rights reserved.
Data Type Choices
• Define

integer columns as INT/BIGINT

• Operations

• Convert

on INT/BIGINT more efficient than STRING

“external” data to good “internal” types on

load
• e.g.

CAST date strings to TIMESTAMPS
• This avoids expensive CASTs in queries later

!6

Copyright © 2013 Cloudera Inc. All rights reserved.
Partitioning
• “The

fastest I/O is the one that never takes place.”
• Understand your query filter predicates
• For

time-series data, this is usually the date/timestamp column
• Use this/these column(s) for a partition key(s)
• Validate

queries leverage partition pruning using EXPLAIN
• You can have too much of a good thing
•A

few thousand partitions per table is probably OK
• Tens of thousands partitions is probably too much
• Partitions/Files should be no less than a few hundred MBs
!7

Copyright © 2013 Cloudera Inc. All rights reserved.
Partition Pruning in EXPLAIN
explain
select count(*)
from sales_fact
where sold_date in('2000-01-01','2000-01-02');

Note the partition
count and missing
“predicates” filter due
to parse time

!

— Partition Pruning (partitioned table)
0:SCAN HDFS
table=grahn.sales_fact #partitions=2 size=803.15MB
tuple ids: 0
!

— Filter Predicate (non-partitioned table)
0:SCAN HDFS
table=grahn.sales_fact #partitions=1 size=799.42MB
predicates: sold_date IN ('2000-01-01', '2000-01-02')
tuple ids: 0
!8

Copyright © 2013 Cloudera Inc. All rights reserved.
Why use Parquet Columnar Format for
HDFS?
• Well

defined open format - http://parquet.io/

• Works

in Impala, Pig, Hive & Map/Reduce

• I/O

reduction by only reading necessary columns
• Columnar layout compresses/encodes better
• Supports nested data by shredding columns
• Uses

techniques used by Google’s ColumnIO 

• Impala
• Gzip

• Quick
!9

loads use Snappy compression by default

available: set PARQUET_COMPRESSION_CODEC=gzip;

word on Snappy vs. Gzip
Copyright © 2013 Cloudera Inc. All rights reserved.
Parquet: A Motivating Example

!10

Copyright © 2013 Cloudera Inc. All rights reserved.
Quick Note on Compression
• Snappy
• Faster

compression/decompression speeds
• Less CPU cycles
• Lower

compression ratio
• Gzip/Zlib
• Slower

compression/decompression speeds 
• More CPU cycles
• Higher

compression ratio
• It’s all about trade-offs
!11

Copyright © 2013 Cloudera Inc. All rights reserved.
It’s all about the compression codec
Hortonworks
graphic fails to call out
that different codecs are used.
Gzip compresses better than
Snappy, but we all knew
that.

Source: http://hortonworks.com/blog/orcfile-in-hdp-2-better-compression-better-performance/

Compressed
with Snappy
!12

Copyright © 2013 Cloudera Inc. All rights reserved.

Compressed
with Gzip
TPC-DS 500GB Scale Factor
% of Original Size - Lower is Better
32.3%
34.8%

Snappy

26.9%
27.4%

Gzip

0

!13

Using the same
compression codec
produces
comparable results

0.088

0.175

0.263

Copyright © 2013 Cloudera Inc. All rights reserved.

0.35

Parquet
ORCFile
Query Execution
What happens behind the scenes

!14
Left-Deep Join Tree
largest* table should
be listed first in the FROM
clause
• Joins are done in the order
tables are listed in FROM
clause
• Filter early - most
selective joins/tables first
• v1.2.1 will do JOIN ordering

⨝

• The

!15

⨝
⨝ R3
R1 R2

Copyright © 2013 Cloudera Inc. All rights reserved.

R4
Two Types of Hash Joins
• Default

hash join type is BROADCAST (aka replicated)

• Each

node ends up with a copy of the right table(s)*
• Left side, read locally and streamed through local hash join(s)
• Best choice for “star join”, single large fact table, multiple small
dims
• Alternate

hash join type is SHUFFLE (aka partitioned)

• Right

side hashed and shuffled; each node gets ~1/Nth the data
• Left side hashed and shuffled, then streamed through join
• Best choice for “large_table JOIN large_table”
• Only available if ANALYZE was used to gather table/column stats*
!16

Copyright © 2013 Cloudera Inc. All rights reserved.
How to use ANALYZE
• Table

Stats (from Hive shell)

• analyze

• Column

table T1 [partition(partition_key)] compute statistics;

Stats (from Hive shell)

• analyze

table T1 [partition(partition_key)]
compute statistics for columns c1,c2,…

• Impala

1.2.1 will have a built-in ANALYZE command

!

!17

Copyright © 2013 Cloudera Inc. All rights reserved.
Hinting Joins
select ...
from large_fact
join [broadcast] small_dim
!
!

select ...
from large_fact
join [shuffle] large_dim
!

*square brackets required

!18

Copyright © 2013 Cloudera Inc. All rights reserved.
Determining Join Type From EXPLAIN
explain
select
s_state,
count(*)
from store_sales
join store on (ss_store_sk = s_store_sk)
group by
s_state;
!

!

2:HASH JOIN
| join op: INNER JOIN (BROADCAST)
| hash predicates:
|
ss_store_sk = s_store_sk
| tuple ids: 0 1
|
|----4:EXCHANGE
|
tuple ids: 1
|
0:SCAN HDFS
table=tpcds.store_sales
tuple ids: 0

!19

explain
select
c_preferred_cust_flag,
count(*)
from store_sales
join customer on (ss_customer_sk = c_customer_sk)
group by
c_preferred_cust_flag;
2:HASH JOIN
| join op: INNER JOIN (PARTITIONED)
| hash predicates:
|
ss_customer_sk = c_customer_sk
| tuple ids: 0 1
|
|----5:EXCHANGE
|
tuple ids: 1
|
4:EXCHANGE
tuple ids: 0

Copyright © 2013 Cloudera Inc. All rights reserved.
Memory Requirements for Joins &
Aggregates
• Impala

does not “spill” to disk -- pipelines are in-memory
• Operators’ mem usage need to fit within the memory limit
• This is not the same as “all data needs to fit in memory”
• Buffered

data generally significantly smaller than total accessed

data
• Aggregations’ mem usage proportional to number of groups
• Applies

for each in-flight query (sum of total)
• Minimum of 128GB of RAM is recommended for impala
nodes
!20

Copyright © 2013 Cloudera Inc. All rights reserved.
Impala Debug Web Pages
A great resource for debug information

!21
!23
Quick Review
Impala performance checklist

!25
Impala Performance Checklist
• Choose

appropriate data types
• Leverage partition pruning
• Adopt Parquet format
• Gather table & column stats
• Order JOINS optimally
Automatic in v1.2.1
• Validate JOIN type
• Monitor hardware resources

!26

Copyright © 2013 Cloudera Inc. All rights reserved.
Test drive Impala
• Impala

Community:

• Download:
• Github:
• User

!27

http://cloudera.com/impala

https://github.com/cloudera/impala

group: impala-user on groups.cloudera.org

Copyright © 2013 Cloudera Inc. All rights reserved.
SELECT questions FROM audience;
Thank you.

!28

More Related Content

What's hot

Real-time Big Data Analytics Engine using Impala
Real-time Big Data Analytics Engine using ImpalaReal-time Big Data Analytics Engine using Impala
Real-time Big Data Analytics Engine using Impala
Jason Shih
 
Impala presentation
Impala presentationImpala presentation
Impala presentation
trihug
 
Introduction to Impala
Introduction to ImpalaIntroduction to Impala
Introduction to Impala
markgrover
 
Friction-free ETL: Automating data transformation with Impala | Strata + Hado...
Friction-free ETL: Automating data transformation with Impala | Strata + Hado...Friction-free ETL: Automating data transformation with Impala | Strata + Hado...
Friction-free ETL: Automating data transformation with Impala | Strata + Hado...
Cloudera, Inc.
 
Impala Architecture presentation
Impala Architecture presentationImpala Architecture presentation
Impala Architecture presentation
hadooparchbook
 
Nested Types in Impala
Nested Types in ImpalaNested Types in Impala
Nested Types in Impala
Myung Ho Yun
 
The Impala Cookbook
The Impala CookbookThe Impala Cookbook
The Impala Cookbook
Cloudera, Inc.
 
Cloudera impala
Cloudera impalaCloudera impala
Cloudera impala
Swiss Big Data User Group
 
Cloudera Impala: A Modern SQL Engine for Apache Hadoop
Cloudera Impala: A Modern SQL Engine for Apache HadoopCloudera Impala: A Modern SQL Engine for Apache Hadoop
Cloudera Impala: A Modern SQL Engine for Apache Hadoop
Cloudera, Inc.
 
Impala 2.0 Update #impalajp
Impala 2.0 Update #impalajpImpala 2.0 Update #impalajp
Impala 2.0 Update #impalajp
Cloudera Japan
 
Performance evaluation of cloudera impala (with Comparison to Hive)
Performance evaluation of cloudera impala (with Comparison to Hive)Performance evaluation of cloudera impala (with Comparison to Hive)
Performance evaluation of cloudera impala (with Comparison to Hive)Yukinori Suda
 
Apache Impala (incubating) 2.5 Performance Update
Apache Impala (incubating) 2.5 Performance UpdateApache Impala (incubating) 2.5 Performance Update
Apache Impala (incubating) 2.5 Performance Update
Cloudera, Inc.
 
Hadoop Summit - Scheduling policies in YARN - San Jose 2016
Hadoop Summit - Scheduling policies in YARN - San Jose 2016Hadoop Summit - Scheduling policies in YARN - San Jose 2016
Hadoop Summit - Scheduling policies in YARN - San Jose 2016
Wangda Tan
 
Hive vs. Impala
Hive vs. ImpalaHive vs. Impala
Hive vs. Impala
Omid Vahdaty
 
Apache Drill - Why, What, How
Apache Drill - Why, What, HowApache Drill - Why, What, How
Apache Drill - Why, What, How
mcsrivas
 
An Introduction to Impala – Low Latency Queries for Apache Hadoop
An Introduction to Impala – Low Latency Queries for Apache HadoopAn Introduction to Impala – Low Latency Queries for Apache Hadoop
An Introduction to Impala – Low Latency Queries for Apache Hadoop
Chicago Hadoop Users Group
 
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast DataKudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
Cloudera, Inc.
 
Cloudera Impala
Cloudera ImpalaCloudera Impala
Cloudera Impala
Scott Leberknight
 
Node labels in YARN
Node labels in YARNNode labels in YARN
Node labels in YARN
Wangda Tan
 
Cloudera Impala: A modern SQL Query Engine for Hadoop
Cloudera Impala: A modern SQL Query Engine for HadoopCloudera Impala: A modern SQL Query Engine for Hadoop
Cloudera Impala: A modern SQL Query Engine for Hadoop
Cloudera, Inc.
 

What's hot (20)

Real-time Big Data Analytics Engine using Impala
Real-time Big Data Analytics Engine using ImpalaReal-time Big Data Analytics Engine using Impala
Real-time Big Data Analytics Engine using Impala
 
Impala presentation
Impala presentationImpala presentation
Impala presentation
 
Introduction to Impala
Introduction to ImpalaIntroduction to Impala
Introduction to Impala
 
Friction-free ETL: Automating data transformation with Impala | Strata + Hado...
Friction-free ETL: Automating data transformation with Impala | Strata + Hado...Friction-free ETL: Automating data transformation with Impala | Strata + Hado...
Friction-free ETL: Automating data transformation with Impala | Strata + Hado...
 
Impala Architecture presentation
Impala Architecture presentationImpala Architecture presentation
Impala Architecture presentation
 
Nested Types in Impala
Nested Types in ImpalaNested Types in Impala
Nested Types in Impala
 
The Impala Cookbook
The Impala CookbookThe Impala Cookbook
The Impala Cookbook
 
Cloudera impala
Cloudera impalaCloudera impala
Cloudera impala
 
Cloudera Impala: A Modern SQL Engine for Apache Hadoop
Cloudera Impala: A Modern SQL Engine for Apache HadoopCloudera Impala: A Modern SQL Engine for Apache Hadoop
Cloudera Impala: A Modern SQL Engine for Apache Hadoop
 
Impala 2.0 Update #impalajp
Impala 2.0 Update #impalajpImpala 2.0 Update #impalajp
Impala 2.0 Update #impalajp
 
Performance evaluation of cloudera impala (with Comparison to Hive)
Performance evaluation of cloudera impala (with Comparison to Hive)Performance evaluation of cloudera impala (with Comparison to Hive)
Performance evaluation of cloudera impala (with Comparison to Hive)
 
Apache Impala (incubating) 2.5 Performance Update
Apache Impala (incubating) 2.5 Performance UpdateApache Impala (incubating) 2.5 Performance Update
Apache Impala (incubating) 2.5 Performance Update
 
Hadoop Summit - Scheduling policies in YARN - San Jose 2016
Hadoop Summit - Scheduling policies in YARN - San Jose 2016Hadoop Summit - Scheduling policies in YARN - San Jose 2016
Hadoop Summit - Scheduling policies in YARN - San Jose 2016
 
Hive vs. Impala
Hive vs. ImpalaHive vs. Impala
Hive vs. Impala
 
Apache Drill - Why, What, How
Apache Drill - Why, What, HowApache Drill - Why, What, How
Apache Drill - Why, What, How
 
An Introduction to Impala – Low Latency Queries for Apache Hadoop
An Introduction to Impala – Low Latency Queries for Apache HadoopAn Introduction to Impala – Low Latency Queries for Apache Hadoop
An Introduction to Impala – Low Latency Queries for Apache Hadoop
 
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast DataKudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
 
Cloudera Impala
Cloudera ImpalaCloudera Impala
Cloudera Impala
 
Node labels in YARN
Node labels in YARNNode labels in YARN
Node labels in YARN
 
Cloudera Impala: A modern SQL Query Engine for Hadoop
Cloudera Impala: A modern SQL Query Engine for HadoopCloudera Impala: A modern SQL Query Engine for Hadoop
Cloudera Impala: A modern SQL Query Engine for Hadoop
 

Viewers also liked

Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Cloudera, Inc.
 
Impala 2.0 - The Best Analytic Database for Hadoop
Impala 2.0 - The Best Analytic Database for HadoopImpala 2.0 - The Best Analytic Database for Hadoop
Impala 2.0 - The Best Analytic Database for Hadoop
Cloudera, Inc.
 
Admission Control in Impala
Admission Control in ImpalaAdmission Control in Impala
Admission Control in ImpalaCloudera, Inc.
 
Cloudera Impala Overview (via Scott Leberknight)
Cloudera Impala Overview (via Scott Leberknight)Cloudera Impala Overview (via Scott Leberknight)
Cloudera Impala Overview (via Scott Leberknight)
Cloudera, Inc.
 
Impala Performance Update
Impala Performance UpdateImpala Performance Update
Impala Performance Update
Cloudera, Inc.
 
Five Tips for Running Cloudera on AWS
Five Tips for Running Cloudera on AWSFive Tips for Running Cloudera on AWS
Five Tips for Running Cloudera on AWS
Cloudera, Inc.
 
Hadoop AWS infrastructure cost evaluation
Hadoop AWS infrastructure cost evaluationHadoop AWS infrastructure cost evaluation
Hadoop AWS infrastructure cost evaluation
mattlieber
 
Presto - Hadoop Conference Japan 2014
Presto - Hadoop Conference Japan 2014Presto - Hadoop Conference Japan 2014
Presto - Hadoop Conference Japan 2014Sadayuki Furuhashi
 
Hive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it final
Hortonworks
 
Pattern Mining: Extracting Value from Log Data
Pattern Mining: Extracting Value from Log DataPattern Mining: Extracting Value from Log Data
Pattern Mining: Extracting Value from Log Data
Turi, Inc.
 
Steps involved in Sugar Production
Steps involved in Sugar ProductionSteps involved in Sugar Production
Steps involved in Sugar Productionmustafeex
 
Types of brand
Types of brandTypes of brand
Types of brand
Chao Onlamai
 
Product life cycle & marketing strategy
Product life cycle & marketing strategyProduct life cycle & marketing strategy
Product life cycle & marketing strategyHitesh Sunny
 
10 Ways to Improve Internal Communication
10 Ways to Improve Internal Communication10 Ways to Improve Internal Communication
10 Ways to Improve Internal Communication
Weekdone.com
 
Congratulations Graduate! Eleven Reasons Why I Will Never Hire You.
Congratulations Graduate! Eleven Reasons Why I Will Never Hire You.Congratulations Graduate! Eleven Reasons Why I Will Never Hire You.
Congratulations Graduate! Eleven Reasons Why I Will Never Hire You.
Mark O'Toole
 
Product life cycle & marketing strategies
Product life cycle & marketing strategiesProduct life cycle & marketing strategies
Product life cycle & marketing strategiesAmar Ingale
 
XII Marketing Project Work
XII Marketing Project WorkXII Marketing Project Work
XII Marketing Project Work
Rahil Jain
 

Viewers also liked (19)

Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
 
Impala 2.0 - The Best Analytic Database for Hadoop
Impala 2.0 - The Best Analytic Database for HadoopImpala 2.0 - The Best Analytic Database for Hadoop
Impala 2.0 - The Best Analytic Database for Hadoop
 
Admission Control in Impala
Admission Control in ImpalaAdmission Control in Impala
Admission Control in Impala
 
Cloudera Impala Overview (via Scott Leberknight)
Cloudera Impala Overview (via Scott Leberknight)Cloudera Impala Overview (via Scott Leberknight)
Cloudera Impala Overview (via Scott Leberknight)
 
Impala Performance Update
Impala Performance UpdateImpala Performance Update
Impala Performance Update
 
Five Tips for Running Cloudera on AWS
Five Tips for Running Cloudera on AWSFive Tips for Running Cloudera on AWS
Five Tips for Running Cloudera on AWS
 
Hadoop AWS infrastructure cost evaluation
Hadoop AWS infrastructure cost evaluationHadoop AWS infrastructure cost evaluation
Hadoop AWS infrastructure cost evaluation
 
Presto - Hadoop Conference Japan 2014
Presto - Hadoop Conference Japan 2014Presto - Hadoop Conference Japan 2014
Presto - Hadoop Conference Japan 2014
 
Hive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it final
 
Pattern Mining: Extracting Value from Log Data
Pattern Mining: Extracting Value from Log DataPattern Mining: Extracting Value from Log Data
Pattern Mining: Extracting Value from Log Data
 
Swimming
SwimmingSwimming
Swimming
 
A2 Sociology: Defining Relgion
A2 Sociology: Defining RelgionA2 Sociology: Defining Relgion
A2 Sociology: Defining Relgion
 
Steps involved in Sugar Production
Steps involved in Sugar ProductionSteps involved in Sugar Production
Steps involved in Sugar Production
 
Types of brand
Types of brandTypes of brand
Types of brand
 
Product life cycle & marketing strategy
Product life cycle & marketing strategyProduct life cycle & marketing strategy
Product life cycle & marketing strategy
 
10 Ways to Improve Internal Communication
10 Ways to Improve Internal Communication10 Ways to Improve Internal Communication
10 Ways to Improve Internal Communication
 
Congratulations Graduate! Eleven Reasons Why I Will Never Hire You.
Congratulations Graduate! Eleven Reasons Why I Will Never Hire You.Congratulations Graduate! Eleven Reasons Why I Will Never Hire You.
Congratulations Graduate! Eleven Reasons Why I Will Never Hire You.
 
Product life cycle & marketing strategies
Product life cycle & marketing strategiesProduct life cycle & marketing strategies
Product life cycle & marketing strategies
 
XII Marketing Project Work
XII Marketing Project WorkXII Marketing Project Work
XII Marketing Project Work
 

Similar to Cloudera Impala technical deep dive

Impala tech-talk by Dimitris Tsirogiannis
Impala tech-talk by Dimitris TsirogiannisImpala tech-talk by Dimitris Tsirogiannis
Impala tech-talk by Dimitris Tsirogiannis
Felicia Haggarty
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
Cloudera, Inc.
 
Optimizing Presto Connector on Cloud Storage
Optimizing Presto Connector on Cloud StorageOptimizing Presto Connector on Cloud Storage
Optimizing Presto Connector on Cloud Storage
Kai Sasaki
 
Using spider for sharding in production
Using spider for sharding in productionUsing spider for sharding in production
Using spider for sharding in production
Kentoku
 
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
Yahoo Developer Network
 
Kudu: Fast Analytics on Fast Data
Kudu: Fast Analytics on Fast DataKudu: Fast Analytics on Fast Data
Kudu: Fast Analytics on Fast Data
michaelguia
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
Swiss Big Data User Group
 
Accelerate and Scale Big Data Analytics with Disaggregated Compute and Storage
Accelerate and Scale Big Data Analytics with Disaggregated Compute and StorageAccelerate and Scale Big Data Analytics with Disaggregated Compute and Storage
Accelerate and Scale Big Data Analytics with Disaggregated Compute and Storage
Alluxio, Inc.
 
Maaz Anjum - IOUG Collaborate 2013 - An Insight into Space Realization on ODA...
Maaz Anjum - IOUG Collaborate 2013 - An Insight into Space Realization on ODA...Maaz Anjum - IOUG Collaborate 2013 - An Insight into Space Realization on ODA...
Maaz Anjum - IOUG Collaborate 2013 - An Insight into Space Realization on ODA...Maaz Anjum
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
huguk
 
Intro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupIntro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application Meetup
Mike Percy
 
Oracle business analytics best practices
Oracle business analytics best practicesOracle business analytics best practices
Oracle business analytics best practices
Nitai Partners Inc
 
ORC Deep Dive 2020
ORC Deep Dive 2020ORC Deep Dive 2020
ORC Deep Dive 2020
Owen O'Malley
 
Real World Storage in Treasure Data
Real World Storage in Treasure DataReal World Storage in Treasure Data
Real World Storage in Treasure Data
Kai Sasaki
 
Introduction to MySQL Cluster
Introduction to MySQL ClusterIntroduction to MySQL Cluster
Introduction to MySQL Cluster
Abel Flórez
 
45 ways to speed up firebird database
45 ways to speed up firebird database45 ways to speed up firebird database
45 ways to speed up firebird database
Fabio Codebue
 
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Cloudera, Inc.
 
Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...
Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...
Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...
Databricks
 
Simplifying Hadoop: A Secure and Unified Data Access Path for Computer Framew...
Simplifying Hadoop: A Secure and Unified Data Access Path for Computer Framew...Simplifying Hadoop: A Secure and Unified Data Access Path for Computer Framew...
Simplifying Hadoop: A Secure and Unified Data Access Path for Computer Framew...
Dataconomy Media
 
Kudu austin oct 2015.pptx
Kudu austin oct 2015.pptxKudu austin oct 2015.pptx
Kudu austin oct 2015.pptx
Felicia Haggarty
 

Similar to Cloudera Impala technical deep dive (20)

Impala tech-talk by Dimitris Tsirogiannis
Impala tech-talk by Dimitris TsirogiannisImpala tech-talk by Dimitris Tsirogiannis
Impala tech-talk by Dimitris Tsirogiannis
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
 
Optimizing Presto Connector on Cloud Storage
Optimizing Presto Connector on Cloud StorageOptimizing Presto Connector on Cloud Storage
Optimizing Presto Connector on Cloud Storage
 
Using spider for sharding in production
Using spider for sharding in productionUsing spider for sharding in production
Using spider for sharding in production
 
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
 
Kudu: Fast Analytics on Fast Data
Kudu: Fast Analytics on Fast DataKudu: Fast Analytics on Fast Data
Kudu: Fast Analytics on Fast Data
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Accelerate and Scale Big Data Analytics with Disaggregated Compute and Storage
Accelerate and Scale Big Data Analytics with Disaggregated Compute and StorageAccelerate and Scale Big Data Analytics with Disaggregated Compute and Storage
Accelerate and Scale Big Data Analytics with Disaggregated Compute and Storage
 
Maaz Anjum - IOUG Collaborate 2013 - An Insight into Space Realization on ODA...
Maaz Anjum - IOUG Collaborate 2013 - An Insight into Space Realization on ODA...Maaz Anjum - IOUG Collaborate 2013 - An Insight into Space Realization on ODA...
Maaz Anjum - IOUG Collaborate 2013 - An Insight into Space Realization on ODA...
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Intro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupIntro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application Meetup
 
Oracle business analytics best practices
Oracle business analytics best practicesOracle business analytics best practices
Oracle business analytics best practices
 
ORC Deep Dive 2020
ORC Deep Dive 2020ORC Deep Dive 2020
ORC Deep Dive 2020
 
Real World Storage in Treasure Data
Real World Storage in Treasure DataReal World Storage in Treasure Data
Real World Storage in Treasure Data
 
Introduction to MySQL Cluster
Introduction to MySQL ClusterIntroduction to MySQL Cluster
Introduction to MySQL Cluster
 
45 ways to speed up firebird database
45 ways to speed up firebird database45 ways to speed up firebird database
45 ways to speed up firebird database
 
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
 
Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...
Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...
Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...
 
Simplifying Hadoop: A Secure and Unified Data Access Path for Computer Framew...
Simplifying Hadoop: A Secure and Unified Data Access Path for Computer Framew...Simplifying Hadoop: A Secure and Unified Data Access Path for Computer Framew...
Simplifying Hadoop: A Secure and Unified Data Access Path for Computer Framew...
 
Kudu austin oct 2015.pptx
Kudu austin oct 2015.pptxKudu austin oct 2015.pptx
Kudu austin oct 2015.pptx
 

More from huguk

Data Wrangling on Hadoop - Olivier De Garrigues, Trifacta
Data Wrangling on Hadoop - Olivier De Garrigues, TrifactaData Wrangling on Hadoop - Olivier De Garrigues, Trifacta
Data Wrangling on Hadoop - Olivier De Garrigues, Trifacta
huguk
 
ether.camp - Hackathon & ether.camp intro
ether.camp - Hackathon & ether.camp introether.camp - Hackathon & ether.camp intro
ether.camp - Hackathon & ether.camp intro
huguk
 
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and HadoopGoogle Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop
huguk
 
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
huguk
 
Extracting maximum value from data while protecting consumer privacy. Jason ...
Extracting maximum value from data while protecting consumer privacy.  Jason ...Extracting maximum value from data while protecting consumer privacy.  Jason ...
Extracting maximum value from data while protecting consumer privacy. Jason ...
huguk
 
Intelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM Watson
Intelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM WatsonIntelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM Watson
Intelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM Watson
huguk
 
Streaming Dataflow with Apache Flink
Streaming Dataflow with Apache Flink Streaming Dataflow with Apache Flink
Streaming Dataflow with Apache Flink
huguk
 
Lambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLLambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale ML
huguk
 
Today’s reality Hadoop with Spark- How to select the best Data Science approa...
Today’s reality Hadoop with Spark- How to select the best Data Science approa...Today’s reality Hadoop with Spark- How to select the best Data Science approa...
Today’s reality Hadoop with Spark- How to select the best Data Science approa...
huguk
 
Jonathon Southam: Venture Capital, Funding & Pitching
Jonathon Southam: Venture Capital, Funding & PitchingJonathon Southam: Venture Capital, Funding & Pitching
Jonathon Southam: Venture Capital, Funding & Pitching
huguk
 
Signal Media: Real-Time Media & News Monitoring
Signal Media: Real-Time Media & News MonitoringSignal Media: Real-Time Media & News Monitoring
Signal Media: Real-Time Media & News Monitoring
huguk
 
Dean Bryen: Scaling The Platform For Your Startup
Dean Bryen: Scaling The Platform For Your StartupDean Bryen: Scaling The Platform For Your Startup
Dean Bryen: Scaling The Platform For Your Startup
huguk
 
Peter Karney: Intro to the Digital catapult
Peter Karney: Intro to the Digital catapultPeter Karney: Intro to the Digital catapult
Peter Karney: Intro to the Digital catapult
huguk
 
Cytora: Real-Time Political Risk Analysis
Cytora:  Real-Time Political Risk AnalysisCytora:  Real-Time Political Risk Analysis
Cytora: Real-Time Political Risk Analysis
huguk
 
Cubitic: Predictive Analytics
Cubitic: Predictive AnalyticsCubitic: Predictive Analytics
Cubitic: Predictive Analytics
huguk
 
Bird.i: Earth Observation Data Made Social
Bird.i: Earth Observation Data Made SocialBird.i: Earth Observation Data Made Social
Bird.i: Earth Observation Data Made Social
huguk
 
Aiseedo: Real Time Machine Intelligence
Aiseedo: Real Time Machine IntelligenceAiseedo: Real Time Machine Intelligence
Aiseedo: Real Time Machine Intelligence
huguk
 
Secrets of Spark's success - Deenar Toraskar, Think Reactive
Secrets of Spark's success - Deenar Toraskar, Think Reactive Secrets of Spark's success - Deenar Toraskar, Think Reactive
Secrets of Spark's success - Deenar Toraskar, Think Reactive
huguk
 
TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...
TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...
TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...
huguk
 
Hadoop - Looking to the Future By Arun Murthy
Hadoop - Looking to the Future By Arun MurthyHadoop - Looking to the Future By Arun Murthy
Hadoop - Looking to the Future By Arun Murthy
huguk
 

More from huguk (20)

Data Wrangling on Hadoop - Olivier De Garrigues, Trifacta
Data Wrangling on Hadoop - Olivier De Garrigues, TrifactaData Wrangling on Hadoop - Olivier De Garrigues, Trifacta
Data Wrangling on Hadoop - Olivier De Garrigues, Trifacta
 
ether.camp - Hackathon & ether.camp intro
ether.camp - Hackathon & ether.camp introether.camp - Hackathon & ether.camp intro
ether.camp - Hackathon & ether.camp intro
 
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and HadoopGoogle Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop
 
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
 
Extracting maximum value from data while protecting consumer privacy. Jason ...
Extracting maximum value from data while protecting consumer privacy.  Jason ...Extracting maximum value from data while protecting consumer privacy.  Jason ...
Extracting maximum value from data while protecting consumer privacy. Jason ...
 
Intelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM Watson
Intelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM WatsonIntelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM Watson
Intelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM Watson
 
Streaming Dataflow with Apache Flink
Streaming Dataflow with Apache Flink Streaming Dataflow with Apache Flink
Streaming Dataflow with Apache Flink
 
Lambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLLambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale ML
 
Today’s reality Hadoop with Spark- How to select the best Data Science approa...
Today’s reality Hadoop with Spark- How to select the best Data Science approa...Today’s reality Hadoop with Spark- How to select the best Data Science approa...
Today’s reality Hadoop with Spark- How to select the best Data Science approa...
 
Jonathon Southam: Venture Capital, Funding & Pitching
Jonathon Southam: Venture Capital, Funding & PitchingJonathon Southam: Venture Capital, Funding & Pitching
Jonathon Southam: Venture Capital, Funding & Pitching
 
Signal Media: Real-Time Media & News Monitoring
Signal Media: Real-Time Media & News MonitoringSignal Media: Real-Time Media & News Monitoring
Signal Media: Real-Time Media & News Monitoring
 
Dean Bryen: Scaling The Platform For Your Startup
Dean Bryen: Scaling The Platform For Your StartupDean Bryen: Scaling The Platform For Your Startup
Dean Bryen: Scaling The Platform For Your Startup
 
Peter Karney: Intro to the Digital catapult
Peter Karney: Intro to the Digital catapultPeter Karney: Intro to the Digital catapult
Peter Karney: Intro to the Digital catapult
 
Cytora: Real-Time Political Risk Analysis
Cytora:  Real-Time Political Risk AnalysisCytora:  Real-Time Political Risk Analysis
Cytora: Real-Time Political Risk Analysis
 
Cubitic: Predictive Analytics
Cubitic: Predictive AnalyticsCubitic: Predictive Analytics
Cubitic: Predictive Analytics
 
Bird.i: Earth Observation Data Made Social
Bird.i: Earth Observation Data Made SocialBird.i: Earth Observation Data Made Social
Bird.i: Earth Observation Data Made Social
 
Aiseedo: Real Time Machine Intelligence
Aiseedo: Real Time Machine IntelligenceAiseedo: Real Time Machine Intelligence
Aiseedo: Real Time Machine Intelligence
 
Secrets of Spark's success - Deenar Toraskar, Think Reactive
Secrets of Spark's success - Deenar Toraskar, Think Reactive Secrets of Spark's success - Deenar Toraskar, Think Reactive
Secrets of Spark's success - Deenar Toraskar, Think Reactive
 
TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...
TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...
TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...
 
Hadoop - Looking to the Future By Arun Murthy
Hadoop - Looking to the Future By Arun MurthyHadoop - Looking to the Future By Arun Murthy
Hadoop - Looking to the Future By Arun Murthy
 

Recently uploaded

LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Nexer Digital
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
Assure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyesAssure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
Welocme to ViralQR, your best QR code generator.
Welocme to ViralQR, your best QR code generator.Welocme to ViralQR, your best QR code generator.
Welocme to ViralQR, your best QR code generator.
ViralQR
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
Pierluigi Pugliese
 

Recently uploaded (20)

LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
Assure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyesAssure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyes
 
Welocme to ViralQR, your best QR code generator.
Welocme to ViralQR, your best QR code generator.Welocme to ViralQR, your best QR code generator.
Welocme to ViralQR, your best QR code generator.
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 

Cloudera Impala technical deep dive

  • 1. Practical Performance Analysis and Tuning for Cloudera Impala Headline Goes Here Marcel Kornacker | marcel@cloudera.com Speaker Name or Subhead Goes Here 2013-11-11 Copyright © 2013 Cloudera Inc. All rights reserved. Copyright © 2013 Cloudera Inc. All rights reserved.
  • 2. Just how fast is Impala? You might just be surprised !2
  • 3. How fast is Impala? “DeWitt Clause” prohibits using DBMS vendor name !3 Copyright © 2013 Cloudera Inc. All rights reserved.
  • 5. Pre-execution Checklist • Data types • Partitioning • File Format !5 Copyright © 2013 Cloudera Inc. All rights reserved.
  • 6. Data Type Choices • Define integer columns as INT/BIGINT • Operations • Convert on INT/BIGINT more efficient than STRING “external” data to good “internal” types on load • e.g. CAST date strings to TIMESTAMPS • This avoids expensive CASTs in queries later !6 Copyright © 2013 Cloudera Inc. All rights reserved.
  • 7. Partitioning • “The fastest I/O is the one that never takes place.” • Understand your query filter predicates • For time-series data, this is usually the date/timestamp column • Use this/these column(s) for a partition key(s) • Validate queries leverage partition pruning using EXPLAIN • You can have too much of a good thing •A few thousand partitions per table is probably OK • Tens of thousands partitions is probably too much • Partitions/Files should be no less than a few hundred MBs !7 Copyright © 2013 Cloudera Inc. All rights reserved.
  • 8. Partition Pruning in EXPLAIN explain select count(*) from sales_fact where sold_date in('2000-01-01','2000-01-02'); Note the partition count and missing “predicates” filter due to parse time ! — Partition Pruning (partitioned table) 0:SCAN HDFS table=grahn.sales_fact #partitions=2 size=803.15MB tuple ids: 0 ! — Filter Predicate (non-partitioned table) 0:SCAN HDFS table=grahn.sales_fact #partitions=1 size=799.42MB predicates: sold_date IN ('2000-01-01', '2000-01-02') tuple ids: 0 !8 Copyright © 2013 Cloudera Inc. All rights reserved.
  • 9. Why use Parquet Columnar Format for HDFS? • Well defined open format - http://parquet.io/ • Works in Impala, Pig, Hive & Map/Reduce • I/O reduction by only reading necessary columns • Columnar layout compresses/encodes better • Supports nested data by shredding columns • Uses techniques used by Google’s ColumnIO • Impala • Gzip • Quick !9 loads use Snappy compression by default available: set PARQUET_COMPRESSION_CODEC=gzip; word on Snappy vs. Gzip Copyright © 2013 Cloudera Inc. All rights reserved.
  • 10. Parquet: A Motivating Example !10 Copyright © 2013 Cloudera Inc. All rights reserved.
  • 11. Quick Note on Compression • Snappy • Faster compression/decompression speeds • Less CPU cycles • Lower compression ratio • Gzip/Zlib • Slower compression/decompression speeds • More CPU cycles • Higher compression ratio • It’s all about trade-offs !11 Copyright © 2013 Cloudera Inc. All rights reserved.
  • 12. It’s all about the compression codec Hortonworks graphic fails to call out that different codecs are used. Gzip compresses better than Snappy, but we all knew that. Source: http://hortonworks.com/blog/orcfile-in-hdp-2-better-compression-better-performance/ Compressed with Snappy !12 Copyright © 2013 Cloudera Inc. All rights reserved. Compressed with Gzip
  • 13. TPC-DS 500GB Scale Factor % of Original Size - Lower is Better 32.3% 34.8% Snappy 26.9% 27.4% Gzip 0 !13 Using the same compression codec produces comparable results 0.088 0.175 0.263 Copyright © 2013 Cloudera Inc. All rights reserved. 0.35 Parquet ORCFile
  • 14. Query Execution What happens behind the scenes !14
  • 15. Left-Deep Join Tree largest* table should be listed first in the FROM clause • Joins are done in the order tables are listed in FROM clause • Filter early - most selective joins/tables first • v1.2.1 will do JOIN ordering ⨝ • The !15 ⨝ ⨝ R3 R1 R2 Copyright © 2013 Cloudera Inc. All rights reserved. R4
  • 16. Two Types of Hash Joins • Default hash join type is BROADCAST (aka replicated) • Each node ends up with a copy of the right table(s)* • Left side, read locally and streamed through local hash join(s) • Best choice for “star join”, single large fact table, multiple small dims • Alternate hash join type is SHUFFLE (aka partitioned) • Right side hashed and shuffled; each node gets ~1/Nth the data • Left side hashed and shuffled, then streamed through join • Best choice for “large_table JOIN large_table” • Only available if ANALYZE was used to gather table/column stats* !16 Copyright © 2013 Cloudera Inc. All rights reserved.
  • 17. How to use ANALYZE • Table Stats (from Hive shell) • analyze • Column table T1 [partition(partition_key)] compute statistics; Stats (from Hive shell) • analyze table T1 [partition(partition_key)] compute statistics for columns c1,c2,… • Impala 1.2.1 will have a built-in ANALYZE command ! !17 Copyright © 2013 Cloudera Inc. All rights reserved.
  • 18. Hinting Joins select ... from large_fact join [broadcast] small_dim ! ! select ... from large_fact join [shuffle] large_dim ! *square brackets required !18 Copyright © 2013 Cloudera Inc. All rights reserved.
  • 19. Determining Join Type From EXPLAIN explain select s_state, count(*) from store_sales join store on (ss_store_sk = s_store_sk) group by s_state; ! ! 2:HASH JOIN | join op: INNER JOIN (BROADCAST) | hash predicates: | ss_store_sk = s_store_sk | tuple ids: 0 1 | |----4:EXCHANGE | tuple ids: 1 | 0:SCAN HDFS table=tpcds.store_sales tuple ids: 0 !19 explain select c_preferred_cust_flag, count(*) from store_sales join customer on (ss_customer_sk = c_customer_sk) group by c_preferred_cust_flag; 2:HASH JOIN | join op: INNER JOIN (PARTITIONED) | hash predicates: | ss_customer_sk = c_customer_sk | tuple ids: 0 1 | |----5:EXCHANGE | tuple ids: 1 | 4:EXCHANGE tuple ids: 0 Copyright © 2013 Cloudera Inc. All rights reserved.
  • 20. Memory Requirements for Joins & Aggregates • Impala does not “spill” to disk -- pipelines are in-memory • Operators’ mem usage need to fit within the memory limit • This is not the same as “all data needs to fit in memory” • Buffered data generally significantly smaller than total accessed data • Aggregations’ mem usage proportional to number of groups • Applies for each in-flight query (sum of total) • Minimum of 128GB of RAM is recommended for impala nodes !20 Copyright © 2013 Cloudera Inc. All rights reserved.
  • 21. Impala Debug Web Pages A great resource for debug information !21
  • 22.
  • 23. !23
  • 24.
  • 26. Impala Performance Checklist • Choose appropriate data types • Leverage partition pruning • Adopt Parquet format • Gather table & column stats • Order JOINS optimally Automatic in v1.2.1 • Validate JOIN type • Monitor hardware resources !26 Copyright © 2013 Cloudera Inc. All rights reserved.
  • 27. Test drive Impala • Impala Community: • Download: • Github: • User !27 http://cloudera.com/impala https://github.com/cloudera/impala group: impala-user on groups.cloudera.org Copyright © 2013 Cloudera Inc. All rights reserved.
  • 28. SELECT questions FROM audience; Thank you. !28