Deep dive into enterprise data lake
through Impala
Evans Ye
2014.7.28
8/2/2014 Confidential | Copyright 2013 TrendMicro Inc. 1
Agenda
• Introduction to Impala
• Impala Architecture
• Query Execution
• Getting Started
• Parquet File Format
• ACL via Sentry
Introduction to
Impala
Impala
• General-purpose SQL query engine
• Supports queries that take from milliseconds to
hours (near real-time)
• Supports most Hadoop file formats
– Text, SequenceFile, Avro, RCFile, Parquet
• Suitable for data scientists or business
analysts
Why do we need it?
Current ad hoc query solution - Pig
• Run an hourly count on the Akamai log
– A = load '/test_data/2014/07/20/00'
using PigStorage();
B = foreach (group A all) generate COUNT_STAR(A);
dump B;
– …
0% complete
100% complete
(194202349)
Using Impala
• No memory cache
– > select count(*) from test_data
where day=20140720 and hour=0
– 194202349
• with OS cache
• Do a further query:
– select count(*) from test_data where
day=20140720 and hour=00 and c='US';
– 41118019
Status quo
• Developed by Cloudera
• Open source under Apache License
• 1.0 released in 05/2013
• Current version is 1.4
• Connect via ODBC, JDBC, Hue, or impala-shell
• Authenticate via Kerberos or LDAP
• Fine-grained ACL via Apache Sentry
Benefits
• High Performance
– C++
– direct access to data (no MapReduce)
– in-memory query execution
• Flexibility
– Query across existing data (no duplication)
– supports multiple Hadoop file formats
• Scalable
– scale out by adding nodes
Impala
Architecture
Impala Architecture
[Diagram: active and standby master nodes (NN, JT, HM); four worker nodes, each running Datanode, Tasktracker, Regionserver, and an impala daemon; State store, Catalog, and Hive Metastore]
Components
• Impala daemon
– collocated with datanodes
– handles all Impala internal requests related to query
execution
– users can submit a request to the impala daemon
running on any node; that node then serves as the
coordinator node
• State store daemon
– communicates with impala daemons to confirm
which nodes are healthy and can accept new work
• Catalog daemon
– broadcasts metadata changes from Impala SQL
statements to all the impala daemons
Fault tolerance
• No fault tolerance for impala daemons
– if a node fails, the query fails
• State store offline
– query execution still functions normally
– cannot update metadata (create, alter…)
– if another impala daemon goes down, the
entire cluster cannot execute any query
• Catalog offline
– cannot update metadata
Query Execution
Getting Started
Impala-shell (secure cluster)
• $ yum install impala-shell
• $ kinit -kt evans_ye.keytab evans_ye
• $ impala-shell --kerberos
--impalad IMPALA_HOST
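The same connection can be scripted. A minimal sketch in Python, assuming impala-shell is on the PATH and a Kerberos ticket has already been obtained with kinit. The helper name impala_shell_command is hypothetical, and -q is the impala-shell option for running a single query non-interactively:

```python
import shlex

def impala_shell_command(host, query):
    """Build the argv list for a one-off query on a Kerberized cluster."""
    return [
        "impala-shell",
        "--kerberos",        # authenticate with the ticket obtained by kinit
        "--impalad", host,   # the daemon we connect to acts as coordinator
        "-q", query,         # run one query and exit
    ]

cmd = impala_shell_command("IMPALA_HOST", "show tables")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell-quoting problems when the query contains quotes or spaces.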
Create and insert
• > create table t1 (col1 string, col2 int);
• > insert into t1 values ('foo', 10);
– only supports writing to TEXT and PARQUET
tables
– every insert creates one tiny HDFS file
– by default, the file is stored under
/user/hive/warehouse/t1/
– use it for setting up small dimension tables,
for experiments, or with HBase tables
Create external table to read existing files
• > create external table t2
(col1 string, col2 int)
row format delimited fields terminated by '\t'
location '/user/evans_ye/test_data';
– location must be a directory
(for example, a Pig output directory)
– files to read:
• read: /user/evans_ye/test_data/part-r-00000
• not read: /user/evans_ye/test_data/_logs/history
• not read: /user/evans_ye/test_data/20140701/00/part-r-00000
• not recursive
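The file-selection rule can be sketched in Python. This is an illustration only, not Impala code: the function name files_read_by_table is hypothetical, and skipping names that start with "_" or "." is assumed from the usual Hadoop hidden-file convention.

```python
import posixpath

def files_read_by_table(location, hdfs_paths):
    """Return the paths a non-partitioned table at `location` would scan."""
    location = location.rstrip("/")
    selected = []
    for path in hdfs_paths:
        parent, name = posixpath.split(path)
        if parent != location:
            continue            # files in nested directories are not scanned
        if name.startswith(("_", ".")):
            continue            # assumed: hidden files are skipped
        selected.append(path)
    return selected

paths = [
    "/user/evans_ye/test_data/part-r-00000",
    "/user/evans_ye/test_data/_logs/history",
    "/user/evans_ye/test_data/20140701/00/part-r-00000",
]
print(files_read_by_table("/user/evans_ye/test_data", paths))
# ['/user/evans_ye/test_data/part-r-00000']
```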
Not recursive?
• Then how do we add external data with a directory
structure like this:
– /user/evans_ye/test_data/20140701/00
/user/evans_ye/test_data/20140701/01
…
/user/evans_ye/test_data/20140701/02
…
/user/evans_ye/test_data/20140702/00
Partitioning
Create the table with partitions
• > create external table t3
(col1 string, col2 int)
partitioned by (`date` int, hour tinyint)
row format delimited fields terminated by '\t';
• No need to specify the location on create
Add partitions into the table
• > alter table t3
add partition (`date`=20130330, hour=0)
location
'/user/evans_ye/test_data/20130330/00';
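Adding one partition per hourly directory by hand gets tedious, so the statements are usually generated. A minimal sketch in Python, assuming the date/hour directory layout shown earlier. The function name add_partition_statements is hypothetical, and the output is meant to be pasted or piped into impala-shell:

```python
from datetime import date, timedelta

def add_partition_statements(table, base_dir, start, days):
    """Yield one ALTER TABLE ... ADD PARTITION statement per hourly directory."""
    for offset in range(days):
        day = (start + timedelta(days=offset)).strftime("%Y%m%d")
        for hour in range(24):
            yield (
                f"alter table {table} add partition "
                f"(`date`={day}, hour={hour}) "
                f"location '{base_dir}/{day}/{hour:02d}';"
            )

statements = list(add_partition_statements(
    "t3", "/user/evans_ye/test_data", date(2013, 3, 30), days=2))
print(len(statements))   # 48
print(statements[0])
```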
Partition number
• thousands of partitions per table
– OK
• tens of thousands of partitions per table
– Not OK
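A quick calculation shows how fast hourly partitioning approaches that limit. The sketch below assumes a table partitioned by (date, hour) as in t3; the retention periods are illustrative:

```python
def partition_count(days, partitions_per_day=24):
    """Number of partitions for a table partitioned by (date, hour)."""
    return days * partitions_per_day

for label, days in [("1 month", 30), ("1 year", 365), ("3 years", 3 * 365)]:
    print(label, partition_count(days))
# 1 month 720
# 1 year 8760
# 3 years 26280
```

One year of hourly partitions is still in the thousands, while three years is already in the tens of thousands.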
Compute table statistics
• > compute stats t3;
• > show table stats t3;
• Helps Impala optimize join queries by choosing between a
broadcast join and a partitioned join
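Why statistics matter for that choice can be shown with a deliberately simplified cost model. This is a sketch, not Impala's planner: it assumes cost is only the number of bytes sent over the network, and the function name and table sizes are hypothetical.

```python
def choose_join_strategy(lhs_bytes, rhs_bytes, num_nodes):
    """Pick the strategy that moves less data over the network."""
    broadcast_cost = rhs_bytes * num_nodes      # right side copied to every node
    partitioned_cost = lhs_bytes + rhs_bytes    # both sides hash-redistributed once
    return "broadcast" if broadcast_cost <= partitioned_cost else "partitioned"

GB = 1024 ** 3
MB = 1024 ** 2
print(choose_join_strategy(500 * GB, 10 * MB, num_nodes=20))   # broadcast
print(choose_join_strategy(500 * GB, 400 * GB, num_nodes=20))  # partitioned
```

Without table statistics the sizes on both sides are unknown, so a comparison like this cannot be made reliably.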
Parquet File
Format
Parquet
• Apache incubator project
• column-oriented file format
– compresses better, since all values in a column have the
same type
– encodes better, since values in a column are often
repeated
– SQL projection avoids reading and decoding
columns that a query does not need
• Supported by Pig, Impala, Hive, MR and Cascading
• Impala uses Snappy compression with Parquet by default
• Impala + Parquet = Google Dremel
– Dremel doesn't support joins
– Impala doesn't support nested data structures (yet)
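The encoding and projection benefits can be illustrated with a toy column store in Python. This is not the Parquet on-disk format: the sample rows are made up, and plain run-length encoding stands in for Parquet's actual encodings.

```python
from itertools import groupby

rows = [
    ("US", 200), ("US", 200), ("US", 404),
    ("JP", 200), ("JP", 200), ("JP", 200),
]

# Column-oriented layout: all values of one column are stored together.
columns = {
    "country": [r[0] for r in rows],
    "status": [r[1] for r in rows],
}

def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

# Projection: a query that needs only `country` touches only that column.
print(run_length_encode(columns["country"]))  # [('US', 3), ('JP', 3)]
print(run_length_encode(columns["status"]))   # [(200, 2), (404, 1), (200, 3)]
```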
Convert a text-file table to Parquet format
• > create table t4 like t3 stored as parquet;
• > insert overwrite t4
partition (`date`, hour)
select * from t3;
Using parquet in Pig
• $ yum install parquet
• $ pig
• > A = load '/user/hive/warehouse/t4'
using parquet.pig.ParquetLoader
as (x: chararray, y: int);
• > store A into '/user/evans_ye/parquet_out'
using parquet.pig.ParquetStorer;
ACL via Sentry
Sentry
• Apache incubator project
• provides fine-grained, role-based
authorization
• currently integrates with Hive and Impala
• requires strong authentication such as
Kerberos
Enable Sentry for Impala
• turn on Sentry authorization for Impala
– add two lines to the impala daemon's
configuration file
(/etc/default/impala)
– auth-policy.ini: the Sentry policy file
– server1: a symbolic name used in the policy file
• all impala daemons must specify the same server name
Sentry policy file example
• roles:
– on server1,
spn_user_role has permission to
read(SELECT) all tables in spn database
• groups:
– evans_ye group has role spn_user_role
Sentry policy file example
• roles:
– evans_data has permission to access
/user/evans_ye
• allows you to add data under /user/evans_ye as
partitions
– foo_db_role can do anything in foo database
• create, alter, insert, drop
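How groups map to roles and roles to privileges can be sketched in Python. The policy text below is a reconstruction of the examples described above, assuming the file-based Sentry syntax of [groups] and [roles] sections with privileges written as key=value pairs joined by "->". The namenode address in the URI and the function name privileges_for_group are hypothetical.

```python
import configparser

POLICY = """
[groups]
evans_ye = spn_user_role

[roles]
spn_user_role = server=server1->db=spn->table=*->action=SELECT
evans_data = server=server1->uri=hdfs://namenode:8020/user/evans_ye
foo_db_role = server=server1->db=foo
"""

def privileges_for_group(policy_text, group):
    """Resolve a group to the list of privileges granted through its roles."""
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    parser.optionxform = str   # keep group and role names case-sensitive
    parser.read_string(policy_text)
    roles = [r.strip() for r in parser["groups"][group].split(",")]
    privileges = []
    for role in roles:
        for rule in parser["roles"][role].split(","):
            # "server=server1->db=spn" becomes {"server": "server1", "db": "spn"}
            parts = rule.strip().split("->")
            privileges.append(dict(p.split("=", 1) for p in parts))
    return privileges

print(privileges_for_group(POLICY, "evans_ye"))
# [{'server': 'server1', 'db': 'spn', 'table': '*', 'action': 'SELECT'}]
```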
Impala 2014 Roadmap
• 1.4 (now available)
– order by without limit
• 2.0
– nested data types (structs, arrays, maps)
– disk-based joins and aggregations
Q&A

Network Traffic Search using Apache HBase
 
Vagrant
VagrantVagrant
Vagrant
 
Building hadoop based big data environment
Building hadoop based big data environmentBuilding hadoop based big data environment
Building hadoop based big data environment
 

Recently uploaded

TIME TABLE MANAGEMENT SYSTEM testing.pptx
TIME TABLE MANAGEMENT SYSTEM testing.pptxTIME TABLE MANAGEMENT SYSTEM testing.pptx
TIME TABLE MANAGEMENT SYSTEM testing.pptx
CVCSOfficial
 
smart pill dispenser is designed to improve medication adherence and safety f...
smart pill dispenser is designed to improve medication adherence and safety f...smart pill dispenser is designed to improve medication adherence and safety f...
smart pill dispenser is designed to improve medication adherence and safety f...
um7474492
 
Call For Paper -3rd International Conference on Artificial Intelligence Advan...
Call For Paper -3rd International Conference on Artificial Intelligence Advan...Call For Paper -3rd International Conference on Artificial Intelligence Advan...
Call For Paper -3rd International Conference on Artificial Intelligence Advan...
ijseajournal
 
Software Engineering and Project Management - Introduction, Modeling Concepts...
Software Engineering and Project Management - Introduction, Modeling Concepts...Software Engineering and Project Management - Introduction, Modeling Concepts...
Software Engineering and Project Management - Introduction, Modeling Concepts...
Prakhyath Rai
 
An Introduction to the Compiler Designss
An Introduction to the Compiler DesignssAn Introduction to the Compiler Designss
An Introduction to the Compiler Designss
ElakkiaU
 
Generative AI Use cases applications solutions and implementation.pdf
Generative AI Use cases applications solutions and implementation.pdfGenerative AI Use cases applications solutions and implementation.pdf
Generative AI Use cases applications solutions and implementation.pdf
mahaffeycheryld
 
Open Channel Flow: fluid flow with a free surface
Open Channel Flow: fluid flow with a free surfaceOpen Channel Flow: fluid flow with a free surface
Open Channel Flow: fluid flow with a free surface
Indrajeet sahu
 
Transformers design and coooling methods
Transformers design and coooling methodsTransformers design and coooling methods
Transformers design and coooling methods
Roger Rozario
 
ITSM Integration with MuleSoft.pptx
ITSM  Integration with MuleSoft.pptxITSM  Integration with MuleSoft.pptx
ITSM Integration with MuleSoft.pptx
VANDANAMOHANGOUDA
 
Supermarket Management System Project Report.pdf
Supermarket Management System Project Report.pdfSupermarket Management System Project Report.pdf
Supermarket Management System Project Report.pdf
Kamal Acharya
 
AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...
AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...
AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...
Paris Salesforce Developer Group
 
Introduction to Computer Networks & OSI MODEL.ppt
Introduction to Computer Networks & OSI MODEL.pptIntroduction to Computer Networks & OSI MODEL.ppt
Introduction to Computer Networks & OSI MODEL.ppt
Dwarkadas J Sanghvi College of Engineering
 
Zener Diode and its V-I Characteristics and Applications
Zener Diode and its V-I Characteristics and ApplicationsZener Diode and its V-I Characteristics and Applications
Zener Diode and its V-I Characteristics and Applications
Shiny Christobel
 
一比一原版(CalArts毕业证)加利福尼亚艺术学院毕业证如何办理
一比一原版(CalArts毕业证)加利福尼亚艺术学院毕业证如何办理一比一原版(CalArts毕业证)加利福尼亚艺术学院毕业证如何办理
一比一原版(CalArts毕业证)加利福尼亚艺术学院毕业证如何办理
ecqow
 
Mechatronics material . Mechanical engineering
Mechatronics material . Mechanical engineeringMechatronics material . Mechanical engineering
Mechatronics material . Mechanical engineering
sachin chaurasia
 
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
nedcocy
 
Object Oriented Analysis and Design - OOAD
Object Oriented Analysis and Design - OOADObject Oriented Analysis and Design - OOAD
Object Oriented Analysis and Design - OOAD
PreethaV16
 
AI-Based Home Security System : Home security
AI-Based Home Security System : Home securityAI-Based Home Security System : Home security
AI-Based Home Security System : Home security
AIRCC Publishing Corporation
 
Accident detection system project report.pdf
Accident detection system project report.pdfAccident detection system project report.pdf
Accident detection system project report.pdf
Kamal Acharya
 
Pressure Relief valve used in flow line to release the over pressure at our d...
Pressure Relief valve used in flow line to release the over pressure at our d...Pressure Relief valve used in flow line to release the over pressure at our d...
Pressure Relief valve used in flow line to release the over pressure at our d...
cannyengineerings
 

Recently uploaded (20)

TIME TABLE MANAGEMENT SYSTEM testing.pptx
TIME TABLE MANAGEMENT SYSTEM testing.pptxTIME TABLE MANAGEMENT SYSTEM testing.pptx
TIME TABLE MANAGEMENT SYSTEM testing.pptx
 
smart pill dispenser is designed to improve medication adherence and safety f...
smart pill dispenser is designed to improve medication adherence and safety f...smart pill dispenser is designed to improve medication adherence and safety f...
smart pill dispenser is designed to improve medication adherence and safety f...
 
Call For Paper -3rd International Conference on Artificial Intelligence Advan...
Call For Paper -3rd International Conference on Artificial Intelligence Advan...Call For Paper -3rd International Conference on Artificial Intelligence Advan...
Call For Paper -3rd International Conference on Artificial Intelligence Advan...
 
Software Engineering and Project Management - Introduction, Modeling Concepts...
Software Engineering and Project Management - Introduction, Modeling Concepts...Software Engineering and Project Management - Introduction, Modeling Concepts...
Software Engineering and Project Management - Introduction, Modeling Concepts...
 
An Introduction to the Compiler Designss
An Introduction to the Compiler DesignssAn Introduction to the Compiler Designss
An Introduction to the Compiler Designss
 
Generative AI Use cases applications solutions and implementation.pdf
Generative AI Use cases applications solutions and implementation.pdfGenerative AI Use cases applications solutions and implementation.pdf
Generative AI Use cases applications solutions and implementation.pdf
 
Open Channel Flow: fluid flow with a free surface
Open Channel Flow: fluid flow with a free surfaceOpen Channel Flow: fluid flow with a free surface
Open Channel Flow: fluid flow with a free surface
 
Transformers design and coooling methods
Transformers design and coooling methodsTransformers design and coooling methods
Transformers design and coooling methods
 
ITSM Integration with MuleSoft.pptx
ITSM  Integration with MuleSoft.pptxITSM  Integration with MuleSoft.pptx
ITSM Integration with MuleSoft.pptx
 
Supermarket Management System Project Report.pdf
Supermarket Management System Project Report.pdfSupermarket Management System Project Report.pdf
Supermarket Management System Project Report.pdf
 
AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...
AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...
AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...
 
Introduction to Computer Networks & OSI MODEL.ppt
Introduction to Computer Networks & OSI MODEL.pptIntroduction to Computer Networks & OSI MODEL.ppt
Introduction to Computer Networks & OSI MODEL.ppt
 
Zener Diode and its V-I Characteristics and Applications
Zener Diode and its V-I Characteristics and ApplicationsZener Diode and its V-I Characteristics and Applications
Zener Diode and its V-I Characteristics and Applications
 
一比一原版(CalArts毕业证)加利福尼亚艺术学院毕业证如何办理
一比一原版(CalArts毕业证)加利福尼亚艺术学院毕业证如何办理一比一原版(CalArts毕业证)加利福尼亚艺术学院毕业证如何办理
一比一原版(CalArts毕业证)加利福尼亚艺术学院毕业证如何办理
 
Mechatronics material . Mechanical engineering
Mechatronics material . Mechanical engineeringMechatronics material . Mechanical engineering
Mechatronics material . Mechanical engineering
 
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
 
Object Oriented Analysis and Design - OOAD
Object Oriented Analysis and Design - OOADObject Oriented Analysis and Design - OOAD
Object Oriented Analysis and Design - OOAD
 
AI-Based Home Security System : Home security
AI-Based Home Security System : Home securityAI-Based Home Security System : Home security
AI-Based Home Security System : Home security
 
Accident detection system project report.pdf
Accident detection system project report.pdfAccident detection system project report.pdf
Accident detection system project report.pdf
 
Pressure Relief valve used in flow line to release the over pressure at our d...
Pressure Relief valve used in flow line to release the over pressure at our d...Pressure Relief valve used in flow line to release the over pressure at our d...
Pressure Relief valve used in flow line to release the over pressure at our d...
 

Deep dive into enterprise data lake through Impala

  • 1. Deep dive into enterprise data lake through Impala Evans Ye 2014.7.28 8/2/2014 Confidential | Copyright 2013 TrendMicro Inc. 1
  • 2. Agenda
      • Introduction to Impala
      • Impala Architecture
      • Query Execution
      • Getting Started
      • Parquet File Format
      • ACL via Sentry
  • 3. Introduction to Impala
  • 4. Impala
      • General-purpose SQL query engine
      • Supports queries that take from milliseconds to hours (near real-time)
      • Supports most of the Hadoop file formats
          – Text, SequenceFile, Avro, RCFile, Parquet
      • Suitable for data scientists or business analysts
  • 5. Why do we need it?
  • 6. (image-only slide)
  • 7. Current ad hoc query solution: Pig
      • Do an hourly count on the Akamai log
          – A = load '/test_data/2014/07/20/00' using PigStorage();
            B = foreach (group A all) generate COUNT_STAR(A);
            dump B;
          – … 0% complete 100% complete (194202349)
  • 8. Using Impala
      • No memory cache
          – > select count(*) from test_data where day=20140720 and hour=0
          – 194202349
      • With OS cache
      • Do a further query:
          – select count(*) from test_data where day=20140720 and hour=00 and c='US';
          – 41118019
  • 9. (image-only slide)
  • 10. Status quo
      • Developed by Cloudera
      • Open source under the Apache License
      • 1.0 available in 05/2013
      • Current version is 1.4
      • Connect via ODBC/JDBC/Hue/impala-shell
      • Authenticate via Kerberos or LDAP
      • Fine-grained ACL via Apache Sentry
  • 11. Benefits
      • High performance
          – C++
          – direct access to data (no MapReduce)
          – in-memory query execution
      • Flexibility
          – query across existing data (no duplication)
          – supports multiple Hadoop file formats
      • Scalable
          – scale out by adding nodes
  • 12. Impala Architecture
  • 13. Impala Architecture (diagram)
      • NN, JT, HM (active); NN, JT, HM (standby)
      • Four nodes, each running Datanode, Tasktracker, Regionserver, and impala daemon
      • State store, Catalog, Hive Metastore
  • 14. Components
      • Impala daemon
          – co-located with the datanodes
          – handles all Impala-internal requests related to query execution
          – a user can submit a request to the impala daemon running on any node; that node then serves as the coordinator node
      • State store daemon
          – communicates with the impala daemons to confirm which nodes are healthy and can accept new work
      • Catalog daemon
          – broadcasts metadata changes from Impala SQL statements to all the impala daemons
  • 15. Fault tolerance
      • No fault tolerance for impala daemons
          – if a node fails, the query fails
      • State store offline
          – query execution still functions normally
          – cannot update metadata (create, alter…)
          – if another impala daemon goes down, the entire cluster cannot execute any query
      • Catalog offline
          – cannot update metadata
  • 16. Query Execution
  • 17. (image-only slide)
  • 18. (image-only slide)
  • 19. (image-only slide)
  • 20. (image-only slide)
  • 21. Getting Started
  • 22. Impala-shell (secure cluster)
      • $ yum install impala-shell
      • $ kinit -kt evans_ye.keytab evans_ye
      • $ impala-shell --kerberos --impalad IMPALA_HOST
  • 23. Create and insert
      • > create table t1 (col1 string, col2 int);
      • > insert into t1 values ('foo', 10);
          – only supports writing to TEXT and PARQUET tables
          – every insert creates one tiny HDFS file
          – by default, the file is stored under /user/hive/warehouse/t1/
          – use it for setting up small dimension tables, for experiments, or with HBase tables
  • 24. Create external table to read existing files
      • > create external table t2 (col1 string, col2 int) row format delimited fields terminated by '\t' location '/user/evans_ye/test_data';
          – location must be a directory (for example, a Pig output directory)
          – files to read:
              • read: /user/evans_ye/test_data/part-r-00000
              • not read: /user/evans_ye/test_data/_logs/history
              • not read: /user/evans_ye/test_data/20140701/00/part-r-00000
              • the location is not read recursively
  • 25. Not recursive?
      • Then how do we add external data with a folder structure like this:
          – /user/evans_ye/test_data/20140701/00
            /user/evans_ye/test_data/20140701/01
            …
            /user/evans_ye/test_data/20140701/02
            …
            /user/evans_ye/test_data/20140702/00
  • 26. Partitioning
  • 27. Create the table with partitions
      • > create external table t3 (col1 string, col2 int) partitioned by (`date` int, hour tinyint) row format delimited fields terminated by '\t';
      • No need to specify the location on create
  • 28. Add partitions into the table
      • > alter table t3 add partition (`date`=20130330, hour=0) location '/user/evans_ye/test_data/20130330/00';
  • 29. Partition number
      • Thousands of partitions per table: OK
      • Tens of thousands of partitions per table: not OK
  • 30. Compute table statistics
      • > compute stats t3;
      • > show table stats t3;
      • Helps Impala optimize join queries: broadcast join, partitioned join
  • 31. Parquet File Format
  • 32. Parquet
      • Apache Incubator project
      • Column-oriented file format
          – compression is better since all the values in a column are of the same type
          – encoding is better since values are often the same and repeated
          – SQL projection can avoid unnecessary reading and decoding of columns
      • Supported by Pig, Impala, Hive, MR and Cascading
      • Impala uses Snappy with Parquet by default
      • Impala + Parquet = Google Dremel
          – Dremel doesn't support joins
          – Impala doesn't support nested data structures (yet)
  • 33. Transform a text-file table into Parquet format
      • > create table t4 like t3 stored as parquet;
      • > insert overwrite t4 partition (`date`, hour) select * from t3;
  • 34. Using Parquet in Pig
      • $ yum install parquet
      • $ pig
      • > A = load '/user/hive/warehouse/t4' using parquet.pig.ParquetLoader as (x: chararray, y: int);
      • > store A into '/user/evans_ye/parquet_out' using parquet.pig.ParquetStorer;
  • 35. ACL via Sentry
  • 36. Sentry
      • Apache Incubator project
      • Provides fine-grained, role-based authorization
      • Currently integrates with Hive and Impala
      • Requires strong authentication such as Kerberos
  • 37. Enable Sentry for Impala
      • Turn on Sentry authorization for Impala
          – add two lines to the impala daemon's configuration file (/etc/default/impala)
          – auth-policy.ini: the Sentry policy file
          – server1: a symbolic name used in the policy file
              • all impala daemons must specify the same server name
  • 38. Sentry policy file example
      • roles:
          – on server1, spn_user_role has permission to read (SELECT) all tables in the spn database
      • groups:
          – the evans_ye group has the role spn_user_role
  • 39. Sentry policy file example
      • roles:
          – evans_data has permission to access /user/evans_ye
              • allows you to add data under /user/evans_ye as partitions
          – foo_db_role can do anything in the foo database
              • create, alter, insert, drop
  • 40. Impala 2014 Roadmap
      • 1.4 (now available)
          – order by without limit
      • 2.0
          – nested data types (structs, arrays, maps)
          – disk-based joins and aggregations
  • 41. Q&A
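
The partitioning slides add one date/hour directory at a time with `alter table ... add partition`. For a directory tree like the one on slide 25, those statements can be generated rather than typed by hand. The following is a minimal sketch, assuming the `YYYYMMDD/HH` layout shown in the slides; the function name `partition_statements` and its parameters are hypothetical, and the output is only the SQL text, which would still need to be run through impala-shell.

```python
from datetime import date, timedelta


def partition_statements(table, base_dir, start, days, hours=range(24)):
    """Yield one ALTER TABLE ... ADD PARTITION statement per date/hour directory.

    Hypothetical helper: assumes directories are laid out as
    <base_dir>/<YYYYMMDD>/<HH>, as in the slides.
    """
    for offset in range(days):
        day = (start + timedelta(days=offset)).strftime("%Y%m%d")
        for hour in hours:
            # Hour directories are zero-padded (00, 01, ...), the partition value is not.
            location = f"{base_dir}/{day}/{hour:02d}"
            yield (
                f"alter table {table} add partition "
                f"(`date`={day}, hour={hour}) location '{location}';"
            )


if __name__ == "__main__":
    # Two days, first two hours of each day.
    for stmt in partition_statements(
        "t3", "/user/evans_ye/test_data", date(2014, 7, 1), days=2, hours=range(2)
    ):
        print(stmt)
```

Keeping the generator free of any cluster access means the statements can be reviewed before they are applied, which matters given the slide 29 advice to keep the partition count per table in the thousands.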

Editor's Notes

  1. Each Impala node caches all of this metadata for reuse by future queries against the same table.
  2. Group sum at each datanode, then group sum at the coordinator.
  3. Values of the same type mean that the underlying stored bits are pretty much the same.
  4. A policy file may be used by multiple Hadoop ecosystem components.
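
Note 2 describes a two-phase aggregation: each node sums its local rows per group, and the coordinator then sums those partial results. The sketch below illustrates that idea only; it is not Impala's implementation, and the function names and the `(key, value)` row shape are assumptions made for the example.

```python
from collections import Counter


def partial_group_sum(rows):
    """Phase 1 (per node): sum the values of local (key, value) rows by key."""
    sums = Counter()
    for key, value in rows:
        sums[key] += value
    return sums


def merge_group_sums(partials):
    """Phase 2 (coordinator): add up the partial sums received from every node."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds values for matching keys
    return dict(total)


if __name__ == "__main__":
    # Hypothetical rows held by two nodes: (country code, hit count).
    node_a = [("US", 3), ("TW", 1), ("US", 2)]
    node_b = [("US", 4), ("JP", 5)]
    partials = [partial_group_sum(node_a), partial_group_sum(node_b)]
    print(merge_group_sums(partials))
```

Only one small partial result per group leaves each node, so the coordinator merges a handful of sums instead of receiving every row.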