Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Deep dive into enterprise data lake through Impala

1,229 views

Published on

A Hadoop near real-time solution Impala

Published in: Engineering
  • Be the first to comment

Deep dive into enterprise data lake through Impala

  1. 1. Deep dive into enterprise data lake through Impala Evans Ye 2014.7.28 8/2/2014 Confidential | Copyright 2013 TrendMicro Inc. 1
  2. 2. Agenda • Introduction to Impala • Impala Architecture • Query Execution • Getting Started • Parquet File Format • ACL via Sentry 8/2/2014 Confidential | Copyright 2013 TrendMicro Inc. 2
  3. 3. Introduction to Impala 8/2/2014 Confidential | Copyright 2013 TrendMicro Inc. 3
  4. 4. Impala • General-purpose SQL query engine • Support queries takes from milliseconds to hours (near real-time) • Support most of the Hadoop file formats – Text, SequenseFile, Avro, RCFile, Parquet • Suitable for data scientists or business analysts 8/2/2014 Confidential | Copyright 2013 TrendMicro Inc. 2
  5. 5. Why do we need it? 8/2/2014 Confidential | Copyright 2013 TrendMicro Inc. 5
  6. 6. 8/2/2014 Confidential | Copyright 2013 TrendMicro Inc. 2
  7. 7. Current adhoc-query solution - Pig • Do hourly count on akamai log – A = load ‘/test_data/2014/07/20/00' using PigStorage(); B = foreach (group A all) COUNT_STAR(A); dump B; – … 0% complete 100% complete (194202349) 8/2/2014 Confidential | Copyright 2013 TrendMicro Inc. 2
  8. 8. Using Impala • No memory cache – > select count(*) from test_data where day=20140720 and hour=0 – 194202349 • with OS cache • Do a further query: – select count(*) from test_data where day=20140720 and hour=00 and c='US'; – 41118019 8/2/2014 Confidential | Copyright 2013 TrendMicro Inc. 2
  9. 9. 8/2/2014 Confidential | Copyright 2013 TrendMicro Inc. 2
  10. 10. Status quo • Developed by Cloudera • Open source under Apache License • 1.0 available in 05/2013 • current version is 1.4 • connect via ODBC/JDBC/hue/impala-shell • authenticate via Kerberos or LDAP • fine-grained ACL via Apache Sentry 8/2/2014 Confidential | Copyright 2013 TrendMicro Inc. 2
  11. 11. Benefits • High Performance – C++ – direct access to data (no Mapreduce) – in-memory query execution • Flexibility – Query across existing data(no duplication) – support multiple Hadoop file format • Scalable – scale out by adding nodes 8/2/2014 Confidential | Copyright 2013 TrendMicro Inc. 2
  12. 12. Impala Architecture 8/2/2014 Confidential | Copyright 2013 TrendMicro Inc. 12
  13. 13. Impala Architecture Datanode Tasktracker Regionserver impala daemon 8/2/2014 Confidential | Copyright 2013 TrendMicro Inc. 2 NN, JT, HM Active NN, JT, HM Standby Datanode Tasktracker Regionserver impala daemon Datanode Tasktracker Regionserver impala daemon Datanode Tasktracker Regionserver impala daemon State store Catalog Hive Metastore
  14. 14. Components • Impala daemon – collocate with datanodes – handle all impala internal requests related to query execution – User can submit request to impala daemon running on any node and that node serve as coordinator node • State store daemon – communicates with impala daemons to confirm which node is healthy and can accept new work • Catalog daemon – broadcast metadata changes from impala SQL statements to all the impala daemons 8/2/2014 Confidential | Copyright 2013 TrendMicro Inc. 2
  15. 15. Fault tolerance • No fault tolerance for impala daemons – A node failed, the query failed • state-store offline – query execution still function normally – can not update metadata(create, alter…) – if another impala daemon goes down, then entire cluster can not execute any query • catalog offline – can not update metadata 8/2/2014 Confidential | Copyright 2013 TrendMicro Inc. 2
  16. 16. Query Execution 8/2/2014 Confidential | Copyright 2013 TrendMicro Inc. 16
  17. 17. 8/2/2014 Confidential | Copyright 2013 TrendMicro Inc. 2
  18. 18. 8/2/2014 Confidential | Copyright 2013 TrendMicro Inc. 2
  19. 19. 8/2/2014 Confidential | Copyright 2013 TrendMicro Inc. 2
  20. 20. 8/2/2014 Confidential | Copyright 2013 TrendMicro Inc. 2
  21. 21. Getting Started 8/2/2014 Confidential | Copyright 2013 TrendMicro Inc. 21
  22. 22. Impala-shell (sercure cluster) • $ yum install impala-shell • $ kinit –kt evans_ye.keytab evans_ye • $ impala-shell --kerberos --impalad IMPALA_HOST 8/2/2014 Confidential | Copyright 2013 TrendMicro Inc. 2
  23. 23. Create and insert • > create table t1 (col1 string, col2 int); • > insert into t1 values (‘foo’, 10); – only supports writing to TEXT and PARQUET tables – every insert creates 1 tiny hdfs file – by default, the file will be stored under /user/hive/warehouse/t1/ – use it for setting up small dimension table, experiment purpose, or with HBase table 8/2/2014 Confidential | Copyright 2013 TrendMicro Inc. 2
  24. 24. Create external table to read existing files • > create external table t2 (col1 string, col2 int) row format delimited fields terminated by ‘t’ location ‘/user/evans_ye/test_data’; – location must be a directory (for example, pig output directory) – files to read: • V /user/evans_ye/test_data/part-r-00000 • X /user/evans_ye/test_data/_logs/history • X /user/evans_ye/test_data/20140701/00/part-r-00000 • no recursive 8/2/2014 Confidential | Copyright 2013 TrendMicro Inc. 2
  25. 25. No recursive? • Then how to add external data with folder structure like this: – /user/evans_ye/test_data/20140701/00 /user/evans_ye/test_data/20140701/01 … /user/evans_ye/test_data/20140701/02 … /user/evans_ye/test_data/20140702/00 8/2/2014 Confidential | Copyright 2013 TrendMicro Inc. 2
  26. 26. Partitioning 8/2/2014 Confidential | Copyright 2013 TrendMicro Inc. 26
  27. 27. Create the table with partitions • > create external table t3 (col1 string, col2 int) partitioned by (`date` int, hour tinyint) row format delimited fields terminated by ‘t’; • No need to specify the location on create 8/2/2014 Confidential | Copyright 2013 TrendMicro Inc. 2
  28. 28. Add partitions into the table • > alter table t3 add partition (`date`=20130330, hour=0) location ‘/user/evans_ye/test_data/20130330/00‘; 8/2/2014 Confidential | Copyright 2013 TrendMicro Inc. 2
  29. 29. Partition number • thousands of partitions per table – OK • tens of thousands partitions per table – Not OK 8/2/2014 Confidential | Copyright 2013 TrendMicro Inc. 2
  30. 30. Compute table statistics • > compute stats t3; • > show table stats t3; • Help impala to optimize join query: broadcast join, partitioned join 8/2/2014 Confidential | Copyright 2013 TrendMicro Inc. 2
  31. 31. Parquet File Format 8/2/2014 Confidential | Copyright 2013 TrendMicro Inc. 31
  32. 32. Parquet • apache incubator project • column-oriented file format – compression is better since all the value would be the same type – encoding is better since value could often be the same and repeated – SQL projection can avoid unnecessary read and decoding on columns • Supported by Pig, Impala, Hive, MR and Cascading • impala by default use snappy with parquet • impala + parquet = google dremel – dremel doesn’t support join – impala doesn’t support nested data structure(yet) 8/2/2014 Confidential | Copyright 2013 TrendMicro Inc. 2
  33. 33. Transform text files table into parquet format • > create table t4 like t3 stored as parquet; • > insert overwrite t4 partition (`date`, hour) select * from t3 8/2/2014 Confidential | Copyright 2013 TrendMicro Inc. 2
  34. 34. Using parquet in Pig • $ yum install parquet • $ pig • > A = load ‘/user/hive/warehouse/t4’ using parquet.pig.ParquetLoader as (x: chararray, y: int); • > store A into ‘/user/evans_ye/parquet_out’ using parquet.pig.ParquetStorer; 8/2/2014 Confidential | Copyright 2013 TrendMicro Inc. 2
  35. 35. ACL via 8/2/2014 Confidential | Copyright 2013 TrendMicro Inc. 35
  36. 36. Sentry • apache incubator project • provide fine-grained role based authorization • currently integrates with Hive and Impala • require strong authentication such as kerberos 8/2/2014 Confidential | Copyright 2013 TrendMicro Inc. 2
  37. 37. Enable Sentry for Impala • turns on Sentry authorization for Impala – add two lines into impala daemon’s configuration file (/etc/default/impala) – auth-policy.ini  Sentry policy file – server1  a symbolic name used in policy file • all impala daemons must specify same server name 8/2/2014 Confidential | Copyright 2013 TrendMicro Inc. 2
  38. 38. Sentry policy file example • roles: – on server1, spn_user_role has permission to read(SELECT) all tables in spn database • groups – evans_ye group has role spn_user_role 8/2/2014 Confidential | Copyright 2013 TrendMicro Inc. 2
  39. 39. Sentry policy file example • roles: – evans_data has permission to access /user/evans_ye • allows you to add data under /user/evans_ye as partitions – foo_db_role can do anything in foo database • create, alter, insert, drop 8/2/2014 Confidential | Copyright 2013 TrendMicro Inc. 2
  40. 40. Impala 2014 Roadmap • 1.4 (now available) – order by without limit • 2.0 – nested data types (structs, arrays, maps) – disk-based joins and aggregations 8/2/2014 Confidential | Copyright 2013 TrendMicro Inc. 2
  41. 41. Q&A 8/2/2014 41Confidential | Copyright 2013 TrendMicro Inc.

×