Impala benchmarks and tuning tips - HadoopCon 2014 in Taiwan

Published in: Technology

  1. Impala Benchmarks and Tuning Tips. Simon Hsu (徐瑞興), September 13, 2014
  2. HadoopCon 2014. About Me • 徐瑞興 (Simon Hsu) – Started working with Hadoop during M.S. studies (2010) • “A Transparent Approach to Run MapReduce Programs on Collaborative Hadoops” – IEEE BigData 2014 – FOXCONN – RD Dept. • Hadoop Product Development – Etu – RD Dept. • Hadoop Solution (Etu/Cloudera) / Product Development
  3. Outline • Impala Performance Tuning Tips – “Practical Performance Analysis and Tuning for Cloudera Impala”, Greg Rahn @ Hadoop World 2013 • Impala Benchmarks – TPC-DS Kit for Impala
  4. Brief History of Impala (slides 4 to 8) http://mt.orz.at/archives/2012/12/hadoop.html
  9. Hive & Impala • Hive: runs queries as MapReduce jobs
  10. Hive & Impala • Impala: runs queries on an in-memory, distributed SQL query engine • Hive: runs queries as MapReduce jobs
  11. Impala Features • Fast – Low-latency response • Bypasses the HDFS DataNode (reads directly from disk) • Optimized for data warehouse queries (especially with Parquet) • Easy to adopt – Uses the same database metadata as Hive • Works with tools such as Sqoop – Common HDFS file formats are supported • Can query existing files on HDFS
  12. No more predicting column lengths!
  13. Impala Overview (diagram, steps 1 to 5) http://www.slideshare.net/cloudera/impala-v1update130709222849phpapp01
  14. Impala Performance Tuning Tips. Pre-execution: • Data Types • Partitioning • File Format • Compression. Query Execution: • Gather Table / Column Stats • Join Type • Query Profile. Overall Review: • Use Case • Experience
  15. http://www.safaribooksonline.com/library/view/strata-conference-new/9781491945551/part131.html
  16. Impala Performance Tuning Tips. Pre-execution: • Data Types • Partitioning • File Format • Compression. Query Execution: • Gather Table / Column Stats • Join Type • Query Profile. Overall Review: • Use Case • Experience
  17. Pre-execution • Data Types • Partitioning • File Format • Compression
  18. Data Types • Change each column to the appropriate data type – Avoids type-casting overhead • Ex. – TIMESTAMP for time – INT for integers • Even though STRING is powerful...
  19. Partition • Create table partitions to reduce disk I/O – The partition key depends on the general query pattern • Partitioned by month • Partitioned by state
  20. Partition Files in HDFS • Table files with partitions • Table files without partitions • Directories • Files
  21. Query Test with Partitions • With partition • Without partition
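
The way partitions reduce disk I/O can be sketched in a few lines of Python. This is a toy model, not Impala's planner: the warehouse path, the `month=` key, and the file names are hypothetical and only stand for "one HDFS directory per partition value". A filter on the partition column picks directories before any file is opened.

```python
# Toy model of partition pruning: one directory per partition value.
# The table path, partition key (month), and file names are hypothetical.
partitions = {
    "/warehouse/sales/month=2014-07": ["part-0.parq", "part-1.parq"],
    "/warehouse/sales/month=2014-08": ["part-0.parq"],
    "/warehouse/sales/month=2014-09": ["part-0.parq", "part-1.parq", "part-2.parq"],
}


def files_to_scan(partitions, wanted_months=None):
    """Return the files a scan must read.

    With no predicate on the partition key, every file is read.
    With a predicate, directories whose key value does not match are skipped.
    """
    selected = []
    for directory, files in sorted(partitions.items()):
        month = directory.rsplit("month=", 1)[1]
        if wanted_months is None or month in wanted_months:
            selected.extend(f"{directory}/{name}" for name in files)
    return selected


full_scan = files_to_scan(partitions)
pruned_scan = files_to_scan(partitions, wanted_months={"2014-08"})
print(len(full_scan), "files without a partition filter")
print(len(pruned_scan), "files with month = '2014-08'")
```

The saving depends entirely on queries filtering on the partition key, which is why the key should follow the general query pattern.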
  22. File Format • Text – Default Impala table format • Parquet – Optimized for working with large data files • Typically 1 GB per file – Reorganizes data for maximum performance of data warehouse-style queries • Column-oriented binary file format
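
Why a column-oriented format suits data warehouse-style queries can be illustrated with a small Python sketch. The table, its column names, and its values are invented for illustration; real Parquet files add row groups, encodings, and compression on top of this basic idea.

```python
# Toy comparison of row-oriented and column-oriented layouts.
# The table and its column names are made up for illustration.
rows = [
    ("item-1", 3, 9.99, "CA"),
    ("item-2", 1, 4.50, "NY"),
    ("item-3", 7, 2.25, "CA"),
]
column_names = ("item_id", "quantity", "price", "state")

# Row layout: the values of one record are stored next to each other.
row_layout = [value for row in rows for value in row]

# Column layout: the values of one column are stored next to each other.
column_layout = {
    name: [row[i] for row in rows] for i, name in enumerate(column_names)
}

# An aggregate over a single column only needs that column's values.
total_quantity = sum(column_layout["quantity"])

print("sum(quantity) =", total_quantity)
print("values walked in row layout:   ", len(row_layout))
print("values walked in column layout:", len(column_layout["quantity"]))
```

With the 40 to 50 column tables mentioned later in the deck, a query that references only a few columns skips most of the stored values in a columnar file.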
  23. Compression • Snappy – Less CPU time – Lower compression ratio • Gzip – More CPU time – Higher compression ratio
  24. Query Time with Different Compression Codecs
      • Test table – Number of records: 183,364,043
      • Test query – [master.etu.im:21000] > SELECT COUNT(*) FROM store_sales;
      • Setting the compression codec – [master.etu.im:21000] > SET parquet.compression=[SNAPPY/GZIP/NONE/etc.]
      Codec    Table size on HDFS (GB)   Query time (s)
      Snappy   9.2                       0.91
      Gzip     6.8                       1.22
      None     16.5                      1.21
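
The trade-off in these measurements is easier to read as ratios against the uncompressed table. The short Python sketch below only re-derives numbers from the figures on this slide; the variable names are illustrative.

```python
# Figures copied from the slide (store_sales, 183,364,043 records).
results = {
    "Snappy": {"size_gb": 9.2, "query_s": 0.91},
    "Gzip": {"size_gb": 6.8, "query_s": 1.22},
    "None": {"size_gb": 16.5, "query_s": 1.21},
}

baseline = results["None"]

# Compression ratio relative to the uncompressed table (higher = smaller files).
ratios = {
    codec: baseline["size_gb"] / r["size_gb"] for codec, r in results.items()
}
# Query speedup relative to the uncompressed table (above 1 = faster).
speedups = {
    codec: baseline["query_s"] / r["query_s"] for codec, r in results.items()
}

for codec in results:
    print(f"{codec:<7} ratio {ratios[codec]:.2f}x  speedup {speedups[codec]:.2f}x")
```

In this single test Gzip produced the smallest table, while Snappy was the only codec that also shortened the query time.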
  25. Compression Codec Differs http://hortonworks.com/blog/orcfile-in-hdp-2-better-compression-better-performance/
  26. Compression Codec Differs (cont.) http://www.safaribooksonline.com/library/view/strata-conference-new/9781491945551/part131.html
  27. Impala Performance Tuning Tips. Pre-execution: • Data Types • Partitioning • File Format • Compression. Query Execution: • Gather Table / Column Stats • Join Type • Query Profile. Overall Review: • Use Case • Experience
  28. Query Execution • Gather Table / Column Stats • Join Type • Query Profile
  29. Usage of the EXPLAIN Statement • Query: – [master.etu.im:21000] > explain select * from store_sales where ss_sold_date_sk between 2451911 and 2451941 limit 10; • Query time: 0.31 s with partition, 2.21 s without partition
  30. Compute Table Stats • [master.etu.im:21000] > COMPUTE STATS customer; • [master.etu.im:21000] > SHOW TABLE STATS customer;
  31. Compute Table Stats • [master.etu.im:21000] > COMPUTE STATS customer; • [master.etu.im:21000] > SHOW TABLE STATS customer; • Ladies and gentlemen: 2 files
  32. Gather Column Stats • [master.etu.im:21000] > SHOW COLUMN STATS tpcds_parquet.customer;
  33. Join Type • Two types of join – Broadcast join • The default join. Typically, broadcast joins are more efficient in cases where one table is much smaller than the other. – Shuffle join • Typically, shuffle joins are more efficient for joins between large tables of similar size. • Join order optimization – If automatic optimization is not sufficient • Consider adding STRAIGHT_JOIN after SELECT http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_hints.html
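
The rule of thumb for choosing between the two join types can be sketched as a rough network-cost comparison. This is a simplified illustration and not Impala's actual planner logic; the function names, table sizes, and node count are all assumptions made up for the example.

```python
# Simplified network-cost model for choosing a join strategy.
# Illustration only, not Impala's planner; all inputs below are made up.

def broadcast_cost(small_bytes, num_nodes):
    """Broadcast join: the smaller table is copied to every node."""
    return small_bytes * num_nodes


def shuffle_cost(big_bytes, small_bytes):
    """Shuffle join: both tables are hash-partitioned on the join key,
    so each row of each table crosses the network about once."""
    return big_bytes + small_bytes


def cheaper_join(big_bytes, small_bytes, num_nodes):
    """Pick the strategy that moves fewer bytes in this toy model."""
    b = broadcast_cost(small_bytes, num_nodes)
    s = shuffle_cost(big_bytes, small_bytes)
    return "broadcast" if b <= s else "shuffle"


GB = 1024 ** 3
MB = 1024 ** 2

# One table much smaller than the other: broadcasting the small one wins.
print(cheaper_join(big_bytes=100 * GB, small_bytes=50 * MB, num_nodes=10))
# Two large tables of similar size: shuffling both moves less data.
print(cheaper_join(big_bytes=100 * GB, small_bytes=80 * GB, num_nodes=10))
```

A comparison like this needs table sizes as input, which is one reason gathering table and column stats comes before join tuning in this section.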
  34. Query Profile • impalad web console http://Impalad_IP:25000/
  35. Impala Performance Tuning Tips. Pre-execution: • Configurations Check • Data Types • Partitioning • File Format. Query Execution: • Gather Table / Column Stats • Join Type • Query Profile. Overall Review: • Use Case • Experience
  36. Overall Review • Use Case • Experience
  37. Use Case • Use case for partitioning – Lifetime value (L.T.V.) of online gaming • Average days • Average deposit • How many people fall in each interval http://goo.gl/TPoqvk
  38. Use Case • Use case for file format – Improving query time in a hospital • Reduce query time to 30%~50% • Number of columns in each table: 40~50 • Number of records in the largest table: over 100,000,000 • “Taking a rest helps you go further.” http://goo.gl/RL6LSa
  39. Notes on Configs • HDFS balancer bandwidth – dfs.datanode.balance.bandwidthPerSec • Default value: 10 MB/s • Memory usage of the Impala daemon – Impala Daemon Memory Limit • (ex.) mem_limit: 80% • Enable HDFS short-circuit reads – dfs.client.read.shortcircuit = true
  40. Notes during Operations • Preserve the Parquet block size when copying files – $ bin/hadoop distcp -pb srcPath dstPath • CREATE EXTERNAL TABLE vs. CREATE TABLE – Determines whether the raw data is preserved when the table is dropped • Be careful with INSERT INTO ... VALUES ... – Generates many small files
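
The small-files warning can be made concrete with a toy count in Python. It rests on a simplifying assumption that every INSERT statement writes at least one new data file; the row counts and the rows-per-file figure are made up, and the real number of files also depends on the number of nodes and partitions involved.

```python
# Toy model: count data files produced by two loading styles.
# Assumption (simplified): every INSERT statement writes at least one new file.
import math


def files_from_single_row_inserts(n_rows):
    """One INSERT ... VALUES statement per row: one small file per row."""
    return n_rows


def files_from_bulk_load(n_rows, rows_per_file):
    """One bulk statement (for example INSERT ... SELECT) writing large files."""
    return math.ceil(n_rows / rows_per_file)


n_rows = 100_000
small_files = files_from_single_row_inserts(n_rows)
bulk_files = files_from_bulk_load(n_rows, rows_per_file=1_000_000)
print(small_files, "files from row-at-a-time inserts")
print(bulk_files, "file(s) from a bulk load")
```

Many tiny files work against the large-file layout that Parquet is optimized for, which is the reason for the warning on this slide.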
  41. Turn off pretty-printing (-B)
  42. Impala Benchmarks • TPC Benchmark™ DS (TPC-DS) – The new decision support benchmark standard • Although the underlying business model of TPC-DS is a retail product supplier, the database schema, data population, queries, data maintenance model and implementation rules have been designed to be broadly representative of modern decision support systems. https://github.com/cloudera/impala-tpcds-kit
  43. Procedure of the TPC-DS Benchmark (Impala). Preparation: • tpcds-env.sh • hdfs-mkdirs.sh. Data Generation: • gen-dims.sh • gen-facts.sh. Data Loading: • impala-create-external-tables.sh • impala-load-dims.sh • impala-load-store_sales.sh
  44. Store Sales ER Diagram (fact table) http://www.tpc.org/tpcds/spec/tpcds_1.1.0.pdf
  45. Query 7 – Intro. • Compute the average quantity, list price, discount, and sales price for promotional items sold in stores where the promotion is not offered by mail or a special event. – Restrict the results to a specific gender, marital and educational status.
  46. select i_item_id, avg(ss_quantity) agg1, avg(ss_list_price) agg2, avg(ss_coupon_amt) agg3, avg(ss_sales_price) agg4
      from store_sales, customer_demographics, date_dim, item, promotion
      where ss_sold_date_sk = d_date_sk and ss_item_sk = i_item_sk and ss_cdemo_sk = cd_demo_sk and ss_promo_sk = p_promo_sk and cd_gender = 'F' and cd_marital_status = 'W' and cd_education_status = 'Primary' and (p_channel_email = 'N' or p_channel_event = 'N') and d_year = 1998 and ss_sold_date_sk between 2450815 and 2451179
      group by i_item_id
      order by i_item_id
      limit 100;
      http://www.minddevelopmentanddesign.com/blog/leaving-las-vagues-or-focus-your-seo-keywords/
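
To see what Query 7 computes, it can be run against a handful of invented rows. The sketch below uses Python's built-in `sqlite3` module purely so that it runs anywhere; it is not Impala, the tables contain only the columns the query touches, and every row is made up for illustration.

```python
import sqlite3

# Tiny in-memory stand-in for the TPC-DS tables used by Query 7.
# Only the referenced columns exist, and all rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE store_sales (ss_sold_date_sk INT, ss_item_sk INT, ss_cdemo_sk INT,
                          ss_promo_sk INT, ss_quantity INT, ss_list_price REAL,
                          ss_coupon_amt REAL, ss_sales_price REAL);
CREATE TABLE customer_demographics (cd_demo_sk INT, cd_gender TEXT,
                                    cd_marital_status TEXT,
                                    cd_education_status TEXT);
CREATE TABLE date_dim (d_date_sk INT, d_year INT);
CREATE TABLE item (i_item_sk INT, i_item_id TEXT);
CREATE TABLE promotion (p_promo_sk INT, p_channel_email TEXT,
                        p_channel_event TEXT);

INSERT INTO customer_demographics VALUES (1, 'F', 'W', 'Primary'),
                                         (2, 'M', 'S', 'College');
INSERT INTO date_dim VALUES (2450900, 1998), (2451500, 1999);
INSERT INTO item VALUES (10, 'ITEM_A'), (20, 'ITEM_B');
INSERT INTO promotion VALUES (100, 'N', 'Y'), (200, 'Y', 'Y');
INSERT INTO store_sales VALUES
    (2450900, 10, 1, 100, 2, 10.0, 1.0, 9.0),   -- matches every predicate
    (2450900, 10, 1, 100, 4, 20.0, 3.0, 17.0),  -- matches every predicate
    (2450900, 20, 2, 100, 5, 30.0, 0.0, 30.0),  -- wrong demographics
    (2450900, 20, 1, 200, 6, 40.0, 0.0, 40.0),  -- promotion used both channels
    (2451500, 10, 1, 100, 8, 50.0, 0.0, 50.0);  -- sold in 1999
""")

QUERY_7 = """
SELECT i_item_id,
       AVG(ss_quantity)    AS agg1,
       AVG(ss_list_price)  AS agg2,
       AVG(ss_coupon_amt)  AS agg3,
       AVG(ss_sales_price) AS agg4
FROM store_sales, customer_demographics, date_dim, item, promotion
WHERE ss_sold_date_sk = d_date_sk
  AND ss_item_sk = i_item_sk
  AND ss_cdemo_sk = cd_demo_sk
  AND ss_promo_sk = p_promo_sk
  AND cd_gender = 'F'
  AND cd_marital_status = 'W'
  AND cd_education_status = 'Primary'
  AND (p_channel_email = 'N' OR p_channel_event = 'N')
  AND d_year = 1998
  AND ss_sold_date_sk BETWEEN 2450815 AND 2451179
GROUP BY i_item_id
ORDER BY i_item_id
LIMIT 100;
"""

rows = conn.execute(QUERY_7).fetchall()
for row in rows:
    print(row)
```

Only the first two sales rows satisfy every predicate, so the result is a single group for `ITEM_A` holding the averages of those two rows.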
  50. Conclusion • Consider the table format: Parquet • Compression codec trade-offs • Disk I/O reduction by table partitioning • See query profiles for more information • Run Impala benchmarks and enjoy yourself – TPC-DS (Decision Support Benchmark)
  51. Thank you. Simon Hsu – Sr. Software Engineer • 318, Rueiguang Rd., Taipei 114, Taiwan • 0912-166-961 • simonhsu@etusolution.com
