Deep dive into enterprise data lake through Impala
Evans Ye
2014.7.28
Agenda
• Introduction to Impala
• Impala Architecture
• Query Execution
• Getting Started
• Parquet File Format
• ACL via Sentry
Introduction to Impala
Impala
• General-purpose SQL query engine
• Supports queries that take from milliseconds to hours (near real-time)
• Supports most of the Hadoop file formats
– Text, SequenceFile, Avro, RCFile, Parquet
• Suitable for data scientists or business analysts
Why do we need it?
Current ad-hoc query solution: Pig
• Do an hourly count on the Akamai log:
– A = load '/test_data/2014/07/20/00' using PigStorage();
  B = foreach (group A all) generate COUNT_STAR(A);
  dump B;
– …
  0% complete
  100% complete
  (194202349)
Using Impala
• No memory cache:
– > select count(*) from test_data where day=20140720 and hour=0;
– 194202349
• With the OS cache warm, do a further query:
– > select count(*) from test_data where day=20140720 and hour=0 and c='US';
– 41118019
Status quo
• Developed by Cloudera
• Open source under the Apache License
• 1.0 released in May 2013
• Current version: 1.4
• Connect via ODBC, JDBC, Hue, or impala-shell
• Authenticate via Kerberos or LDAP
• Fine-grained ACL via Apache Sentry
Benefits
• High performance
– written in C++
– direct access to data (no MapReduce)
– in-memory query execution
• Flexibility
– query across existing data (no duplication)
– supports multiple Hadoop file formats
• Scalability
– scale out by adding nodes
Impala Architecture
Impala Architecture
[Architecture diagram: an impala daemon runs on every worker node, collocated with the DataNode, TaskTracker, and RegionServer; the master services (NameNode, JobTracker, HBase Master) run as an active/standby pair; the State Store, Catalog service, and Hive Metastore run alongside them.]
Components
• Impala daemon
– collocated with the DataNodes
– handles all Impala-internal requests related to query execution
– users can submit a request to the impala daemon on any node; that node then serves as the coordinator for the query (see the shell sketch after this list)
• State store daemon
– communicates with the impala daemons to track which nodes are healthy and can accept new work
• Catalog daemon
– broadcasts metadata changes from Impala SQL statements to all the impala daemons
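A minimal sketch of the "any node can coordinate" point above: the hostnames node01/node02 are purely illustrative, and 21000 is the default impalad port that impala-shell connects to.

# Whichever daemon you connect to acts as the coordinator for that query.
$ impala-shell -i node01:21000 -q "select count(*) from test_data"
$ impala-shell -i node02:21000 -q "select count(*) from test_data"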
Fault tolerance
• No fault tolerance for impala daemons
– if a node fails, the query fails
• State store offline
– query execution still functions normally
– metadata cannot be updated (create, alter, …)
– if another impala daemon then goes down, the entire cluster can no longer execute any query
• Catalog offline
– metadata cannot be updated
Query Execution
[Query execution flow diagrams]
Getting Started
Impala-shell (secure cluster)
• $ yum install impala-shell
• $ kinit -kt evans_ye.keytab evans_ye
• $ impala-shell --kerberos --impalad IMPALA_HOST
Create and insert
• > create table t1 (col1 string, col2 int);
• > insert into t1 values ('foo', 10);
– insert only supports writing to TEXT and PARQUET tables
– every insert creates one tiny HDFS file
– by default, the files are stored under /user/hive/warehouse/t1/
– use it for setting up small dimension tables, for experiments, or with HBase tables
Create an external table to read existing files
• > create external table t2
    (col1 string, col2 int)
    row format delimited fields terminated by '\t'
    location '/user/evans_ye/test_data';
– location must be a directory
  (for example, a Pig output directory)
– which files are read (V) or skipped (X):
  • V /user/evans_ye/test_data/part-r-00000
  • X /user/evans_ye/test_data/_logs/history
  • X /user/evans_ye/test_data/20140701/00/part-r-00000
• not recursive
Not recursive?
• Then how do we add external data with a folder structure like this?
– /user/evans_ye/test_data/20140701/00
  /user/evans_ye/test_data/20140701/01
  …
  /user/evans_ye/test_data/20140701/02
  …
  /user/evans_ye/test_data/20140702/00
Partitioning
Create the table with partitions
• > create external table t3
    (col1 string, col2 int)
    partitioned by (`date` int, hour tinyint)
    row format delimited fields terminated by '\t';
• No need to specify the location on create
Add partitions into the table
• > alter table t3
    add partition (`date`=20130330, hour=0)
    location '/user/evans_ye/test_data/20130330/00';
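To register a whole day of the hourly directories from the "Not recursive?" slide, one partition per hour, a small shell loop can generate the alter statements. This is only a sketch: it assumes the same path layout as above, reuses the IMPALA_HOST placeholder, and relies on impala-shell's -q option to run a single statement.

$ for h in $(seq -w 0 23); do
    # ${h} is the two-digit directory name; ${h#0} strips a leading zero for the hour value
    impala-shell --kerberos --impalad IMPALA_HOST -q \
      "alter table t3 add partition (\`date\`=20130330, hour=${h#0}) \
       location '/user/evans_ye/test_data/20130330/${h}'"
  done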
Number of partitions
• thousands of partitions per table
– OK
• tens of thousands of partitions per table
– not OK
Compute table statistics
• > compute stats t3;
• > show table stats t3;
• Helps Impala optimize join queries by choosing between broadcast and partitioned joins (a hinted-join sketch follows)
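When stats are missing, the join strategy can also be forced with a hint. A sketch only: small_dim is a hypothetical dimension table, and the square-bracket hint syntax ([broadcast] / [shuffle]) is the form Impala documents for choosing between the two strategies.

• > select count(*)
    from t3 join [broadcast] small_dim
    on t3.col1 = small_dim.col1;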
Parquet File Format
Parquet
• Apache Incubator project
• Column-oriented file format
– compression is better, since all the values in a column have the same type
– encoding is better, since values are often the same and repeated
– SQL projection can avoid unnecessary reads and decoding of columns
• Supported by Pig, Impala, Hive, MapReduce, and Cascading
• Impala uses Snappy with Parquet by default
• Impala + Parquet ≈ Google Dremel
– Dremel doesn't support joins
– Impala doesn't support nested data structures (yet)
Transform a text-file table into Parquet format
• > create table t4 like t3 stored as parquet;
• > insert overwrite t4
    partition (`date`, hour)
    select * from t3;
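Impala writes Snappy-compressed Parquet by default (as noted above). A sketch of switching the codec for one insert, assuming the Impala 1.x query-option name PARQUET_COMPRESSION_CODEC (later renamed COMPRESSION_CODEC): the first set trades write speed for better compression, and the last set restores the default.

• > set PARQUET_COMPRESSION_CODEC=gzip;
• > insert overwrite t4 partition (`date`, hour) select * from t3;
• > set PARQUET_COMPRESSION_CODEC=snappy;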
Using Parquet in Pig
• $ yum install parquet
• $ pig
• > A = load '/user/hive/warehouse/t4'
    using parquet.pig.ParquetLoader
    as (x: chararray, y: int);
• > store A into '/user/evans_ye/parquet_out'
    using parquet.pig.ParquetStorer;
ACL via Sentry
Sentry
• Apache Incubator project
• Provides fine-grained, role-based authorization
• Currently integrates with Hive and Impala
• Requires strong authentication such as Kerberos
Enable Sentry for Impala
• To turn on Sentry authorization for Impala:
– add two lines to the impala daemon's configuration file (/etc/default/impala); a sketch follows this list
– auth-policy.ini → the Sentry policy file
– server1 → a symbolic server name used in the policy file
• all impala daemons must specify the same server name
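A sketch of what that configuration might look like in /etc/default/impala. The flag names (-server_name, -authorization_policy_file) are the policy-file-based Sentry flags for the impala daemon; the HDFS path for auth-policy.ini is purely illustrative.

# /etc/default/impala: append the two Sentry flags to the daemon's startup arguments
IMPALA_SERVER_ARGS=" \
    ${IMPALA_SERVER_ARGS} \
    -server_name=server1 \
    -authorization_policy_file=/user/hive/warehouse/auth-policy.ini"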
Sentry policy file example
• roles:
– on server1, spn_user_role has permission to read (SELECT) all tables in the spn database
• groups:
– the evans_ye group has the role spn_user_role
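A sketch of how that could be written in auth-policy.ini. The [groups]/[roles] sections and the privilege string follow the Sentry policy-file format, but treat the exact strings as illustrative rather than a tested configuration.

[groups]
# members of the evans_ye group get the spn_user_role
evans_ye = spn_user_role

[roles]
# read-only (SELECT) access to every table in the spn database on server1
spn_user_role = server=server1->db=spn->table=*->action=select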
Sentry policy file example
• roles:
– evans_data has permission to access /user/evans_ye
  • allows you to add data under /user/evans_ye as partitions
– foo_db_role can do anything in the foo database
  • create, alter, insert, drop
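Continuing the sketch, the corresponding [roles] entries might look like the following; the namenode URI is a placeholder, and granting a database-level privilege without an action is assumed to mean all actions on that database.

[roles]
# URI privilege: lets partitions and external tables point at data under /user/evans_ye
evans_data = server=server1->uri=hdfs://namenode:8020/user/evans_ye
# full privileges (create, alter, insert, drop, select) on the foo database
foo_db_role = server=server1->db=foo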
Impala 2014 Roadmap
• 1.4 (now available)
– order by without limit
• 2.0
– nested data types (structs, arrays, maps)
– disk-based joins and aggregations
Q&A


Editor's Notes

  • #15 Each Impala node caches all of this metadata and reuses it for future queries against the same table.
  • #21 Partial (group) sums are computed at the DataNodes and then combined at the coordinator.
  • #33 "Same type" indicates that the underlying bits of storage are largely the same.
  • #38 A single policy file may be shared by multiple Hadoop ecosystem components.