Deep dive into enterprise data lake through Impala

A Hadoop near real-time solution: Impala

Published in: Engineering

Slide notes
  • The Impala node caches all of this metadata to reuse for future queries against the same table.
  • Group sum at the datanodes, then group sum at the coordinator.
  • "Same type" indicates that the underlying bits of storage are pretty much the same.
  • A policy file may be used by multiple Hadoop ecosystem components.
  • Transcript

    • 1. Deep dive into enterprise data lake through Impala Evans Ye 2014.7.28 8/2/2014 Confidential | Copyright 2013 TrendMicro Inc. 1
    • 2. Agenda • Introduction to Impala • Impala Architecture • Query Execution • Getting Started • Parquet File Format • ACL via Sentry
    • 3. Introduction to Impala
    • 4. Impala • General-purpose SQL query engine • Supports queries that take from milliseconds to hours (near real-time) • Supports most Hadoop file formats – Text, SequenceFile, Avro, RCFile, Parquet • Suitable for data scientists and business analysts
    • 5. Why do we need it?
    • 6. (image slide)
    • 7. Current ad-hoc query solution - Pig • Do an hourly count on Akamai logs – A = load '/test_data/2014/07/20/00' using PigStorage(); B = foreach (group A all) generate COUNT_STAR(A); dump B; – … 0% complete 100% complete (194202349)
    • 8. Using Impala • No memory cache – > select count(*) from test_data where day=20140720 and hour=0; – 194202349 • With OS cache • Do a further query: – > select count(*) from test_data where day=20140720 and hour=0 and c='US'; – 41118019
    • 9. (image slide)
    • 10. Status quo • Developed by Cloudera • Open source under the Apache License • 1.0 became available in 05/2013 • Current version is 1.4 • Connect via ODBC/JDBC/Hue/impala-shell • Authenticate via Kerberos or LDAP • Fine-grained ACLs via Apache Sentry
    • 11. Benefits • High performance – C++ – Direct access to data (no MapReduce) – In-memory query execution • Flexibility – Query across existing data (no duplication) – Supports multiple Hadoop file formats • Scalable – Scale out by adding nodes
    • 12. Impala Architecture
    • 13. Impala Architecture – (architecture diagram) each worker node runs a Datanode, Tasktracker, Regionserver, and an impala daemon; the masters run NN, JT, HM in Active/Standby pairs; supporting services are the State store, the Catalog, and the Hive Metastore
    • 14. Components • Impala daemon – Co-located with datanodes – Handles all Impala internal requests related to query execution – Users can submit a request to the impala daemon running on any node, and that node serves as the coordinator node • State store daemon – Communicates with impala daemons to confirm which nodes are healthy and can accept new work • Catalog daemon – Broadcasts metadata changes from Impala SQL statements to all the impala daemons
    • 15. Fault tolerance • No fault tolerance for impala daemons – If a node fails, the query fails • State store offline – Query execution still functions normally – Cannot update metadata (create, alter, …) – If another impala daemon goes down, the entire cluster cannot execute any query • Catalog offline – Cannot update metadata
    • 16. Query Execution
    • 17.–20. (image slides: query execution flow)
    • 21. Getting Started
    • 22. Impala-shell (secure cluster) • $ yum install impala-shell • $ kinit -kt evans_ye.keytab evans_ye • $ impala-shell --kerberos --impalad IMPALA_HOST
    • 23. Create and insert • > create table t1 (col1 string, col2 int); • > insert into t1 values ('foo', 10); – Only supports writing to TEXT and PARQUET tables – Every insert creates 1 tiny HDFS file – By default, the file will be stored under /user/hive/warehouse/t1/ – Use it for setting up small dimension tables, for experiments, or with HBase tables
    • 24. Create an external table to read existing files • > create external table t2 (col1 string, col2 int) row format delimited fields terminated by '\t' location '/user/evans_ye/test_data'; – location must be a directory (for example, a Pig output directory) – Files to read: • V /user/evans_ye/test_data/part-r-00000 • X /user/evans_ye/test_data/_logs/history • X /user/evans_ye/test_data/20140701/00/part-r-00000 • Not recursive
    • 25. Not recursive? • Then how do we add external data with a folder structure like this? – /user/evans_ye/test_data/20140701/00 /user/evans_ye/test_data/20140701/01 … /user/evans_ye/test_data/20140701/02 … /user/evans_ye/test_data/20140702/00
    • 26. Partitioning
    • 27. Create the table with partitions • > create external table t3 (col1 string, col2 int) partitioned by (`date` int, hour tinyint) row format delimited fields terminated by '\t'; • No need to specify the location on create
    • 28. Add partitions into the table • > alter table t3 add partition (`date`=20130330, hour=0) location '/user/evans_ye/test_data/20130330/00';
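Since external locations are not scanned recursively, each day/hour directory has to be registered as its own partition. A small shell sketch of that loop (the table `t3` and the path layout come from the slides; the helper name `emit_add_partitions` is hypothetical) that prints one ALTER TABLE statement per directory:

```shell
# Emit one ALTER TABLE ... ADD PARTITION statement per day/hour
# directory under /user/evans_ye/test_data/<day>/<hour>.
emit_add_partitions() {
  for day in "$@"; do
    for hour in 00 01 02 03; do
      # ${hour#0} strips the leading zero so the partition value is numeric
      printf "alter table t3 add partition (\`date\`=%s, hour=%s) location '/user/evans_ye/test_data/%s/%s';\n" \
        "$day" "${hour#0}" "$day" "$hour"
    done
  done
}

emit_add_partitions 20140701 20140702
```

The output could then be piped into the shell from slide 22, e.g. `emit_add_partitions 20140701 20140702 | impala-shell --kerberos --impalad IMPALA_HOST`.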
    • 29. Partition number • Thousands of partitions per table – OK • Tens of thousands of partitions per table – Not OK
    • 30. Compute table statistics • > compute stats t3; • > show table stats t3; • Helps Impala optimize join queries: broadcast join, partitioned join
    • 31. Parquet File Format
    • 32. Parquet • An Apache incubator project • Column-oriented file format – Compression is better since all the values in a column are of the same type – Encoding is better since values are often the same and repeated – SQL projection can avoid unnecessary reads and decoding of columns • Supported by Pig, Impala, Hive, MR, and Cascading • Impala by default uses Snappy with Parquet • Impala + Parquet ≈ Google Dremel – Dremel doesn't support joins – Impala doesn't support nested data structures (yet)
    • 33. Transform a text-file table into Parquet format • > create table t4 like t3 stored as parquet; • > insert overwrite t4 partition (`date`, hour) select * from t3;
    • 34. Using Parquet in Pig • $ yum install parquet • $ pig • > A = load '/user/hive/warehouse/t4' using parquet.pig.ParquetLoader as (x: chararray, y: int); • > store A into '/user/evans_ye/parquet_out' using parquet.pig.ParquetStorer;
    • 35. ACL via Sentry
    • 36. Sentry • An Apache incubator project • Provides fine-grained role-based authorization • Currently integrates with Hive and Impala • Requires strong authentication such as Kerberos
    • 37. Enable Sentry for Impala • Turn on Sentry authorization for Impala – Add two lines to the impala daemon's configuration file (/etc/default/impala) – auth-policy.ini → the Sentry policy file – server1 → a symbolic name used in the policy file • All impala daemons must specify the same server name
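The two configuration lines were shown as an image on the slide. A minimal sketch of what they could look like, assuming the Impala 1.x startup flags `-server_name` and `-authorization_policy_file`, and a hypothetical HDFS location for the policy file:

```
# /etc/default/impala (sketch; the HDFS path is an assumption)
IMPALA_SERVER_ARGS="${IMPALA_SERVER_ARGS} \
    -server_name=server1 \
    -authorization_policy_file=hdfs://namenode:8020/user/impala/auth-policy.ini"
```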
    • 38. Sentry policy file example • Roles: – On server1, spn_user_role has permission to read (SELECT) all tables in the spn database • Groups: – The evans_ye group has the role spn_user_role
    • 39. Sentry policy file example • Roles: – evans_data has permission to access /user/evans_ye • Allows you to add data under /user/evans_ye as partitions – foo_db_role can do anything in the foo database • create, alter, insert, drop
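The two example slides above describe the policy content, but the file text itself was shown as images. A hedged reconstruction in the standard Sentry policy-file INI format (the HDFS URI and the group-to-role mapping are assumptions drawn from the slide descriptions):

```
[groups]
evans_ye = spn_user_role, evans_data, foo_db_role

[roles]
# read-only (SELECT) access to every table in the spn database on server1
spn_user_role = server=server1->db=spn->table=*->action=select
# URI privilege: allows adding data under /user/evans_ye as partitions
evans_data = server=server1->uri=hdfs://namenode:8020/user/evans_ye
# full control over the foo database (create, alter, insert, drop)
foo_db_role = server=server1->db=foo
```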
    • 40. Impala 2014 Roadmap • 1.4 (now available) – order by without limit • 2.0 – nested data types (structs, arrays, maps) – disk-based joins and aggregations
    • 41. Q&A