Optimizing Hive Queries

15,747
-1

Published on

Apache Hive is a data warehousing system for large volumes of data stored in Hadoop. However, the data is useless unless you can use it to add value to your company. Hive provides a SQL-based query language that dramatically simplifies the process of querying your large data sets. That is especially important while your data scientists are developing and refining their queries to improve their understanding of the data. In many companies, such as Facebook, Hive accounts for a large percentage of the total MapReduce queries that are run on the system. Although Hive makes writing large data queries easier for the user, there are many performance traps for the unwary. Many of them are artifacts of the way Hive has evolved over the years and the requirement that the default behavior must be safe for all users. This talk will present examples of how Hive users have made mistakes that made their queries run much much longer than necessary. It will also present guidelines for how to get better performance for your queries and how to look at the query plan to understand what Hive is doing.

Published in: Technology
4 Comments
39 Likes
Statistics
Notes
  • http://dbmanagement.info/Tutorials/Apache_Hive.htm
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
  • hive is unusably slow in my use cases. select count(*) from foo limit 1 uses mapreduce and takes over a minute. create table foo as select * from bar limit 1 uses mapreduce and takes forever. why? i shouldn't have to analyze simple queries like this to find workarounds that make them reasonably performant. just use impala.
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
  • Check out:
    http://www.youtube.com/watch?v=pftoccjZFOM&feature=youtu.be
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
  • Is there any recording available from the presenation itself? If would be nice to clarify some points.
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
No Downloads
Views
Total Views
15,747
On Slideshare
0
From Embeds
0
Number of Embeds
11
Actions
Shares
0
Downloads
0
Comments
4
Likes
39
Embeds 0
No embeds

No notes for slide

Optimizing Hive Queries

  1. 1. Optimizing Hive QueriesOwen O’MalleyFounder and Architectowen@hortonworks.com@owen_omalley© Hortonworks Inc. 2013: Page 1
  2. 2. Who Am I?• Founder and Architect at Hortonworks – Working on Hive, working with customer – Formerly Hadoop MapReduce & Security – Been working on Hadoop since beginning• Apache Hadoop, ASF – Hadoop PMC (Original VP) – Tez, Ambari, Giraph PMC – Mentor for: Accumulo, Kafka, Knox – Apache Member © Hortonworks Inc. 2013 Page 2
  3. 3. Outline• Data Layout• Data Format• Joins• Debugging © Hortonworks Inc. 2013 Page 3
  4. 4. Data LayoutLocation, Location, Location© Hortonworks Inc. 2013 Page 4
  5. 5. Fundamental Questions• What is your primary use case? – What kind of queries and filters?• How do you need to access the data? – What information do you need together?• How much data do you have? – What is your year to year growth?• How do you get the data? © Hortonworks Inc. 2013 Page 5
  6. 6. HDFS Characteristics• Provides Distributed File System – Very high aggregate bandwidth – Extreme scalability (up to 100 PB) – Self-healing storage – Relatively simple to administer• Limitations – Can’t modify existing files – Single writer for each file – Heavy bias for large files ( > 100 MB) © Hortonworks Inc. 2013 Page 6
  7. 7. Choices for Layout• Partitions – Top level mechanism for pruning – Primary unit for updating tables (& schema) – Directory per value of specified column• Bucketing – Hashed into a file, good for sampling – Controls write parallelism• Sort order – The order the data is written within file © Hortonworks Inc. 2013 Page 7
  8. 8. Example Hive Layout• Directory Structure warehouse/$database/$table• Partitioning /part1=$partValue/part2=$partValue• Bucketing /$bucket_$attempt (eg. 000000_0)• Sort – Each file is sorted within the file © Hortonworks Inc. 2013 Page 8
  9. 9. Layout Guidelines• Limit the number of partitions – 1,000 partitions is much faster than 10,000 – Nested partitions are almost always wrong• Gauge the number of buckets – Calculate file size and keep big (200-500MB) – Don’t forget number of files (Buckets * Parts)• Layout related tables the same way – Partition – Bucket and sort order © Hortonworks Inc. 2013 Page 9
  10. 10. Normalization• Most databases suggest normalization – Keep information about each thing together – Customer, Sales, Returns, Inventory tables• Has lots of good properties, but… – Is typically slow to query• Often best to denormalize during load – Write once, read many times – Additionally provides snapshots in time. © Hortonworks Inc. 2013 Page 10
  11. 11. Data FormatLocation, Location, Location© Hortonworks Inc. 2013 Page 11
  12. 12. Choice of Format• Serde – How each record is encoded?• Input/Output (aka File) Format – How are the files stored?• Primary Choices – Text – Sequence File – RCFile – ORC (Coming Soon!) © Hortonworks Inc. 2013 Page 12
  13. 13. Text Format• Critical to pick a Serde – Default - ^A’s between fields – JSON – top level JSON record – CSV – commas between fields (on github)• Slow to read and write• Can’t split compressed files – Leads to huge maps• Need to read/decompress all fields © Hortonworks Inc. 2013 Page 13
  14. 14. Sequence File• Traditional MapReduce binary file format – Stores keys and values as classes – Not a good fit for Hive, which has SQL types – Hive always stores entire row as value• Splittable but only by searching file – Default block size is 1 MB• Need to read and decompress all fields © Hortonworks Inc. 2013 Page 14
  15. 15. RC (Row Columnar) File• Columns stored separately – Read and decompress only needed ones – Better compression• Columns stored as binary blobs – Depends on metastore to supply types• Larger blocks – 4 MB by default – Still search file for split boundary © Hortonworks Inc. 2013 Page 15
  16. 16. ORC (Optimized Row Columnar)• Columns stored separately• Knows types – Uses type-specific encoders – Stores statistics (min, max, sum, count)• Has light-weight index – Skip over blocks of rows that don’t matter• Larger blocks – 256 MB by default – Has an index for block boundaries © Hortonworks Inc. 2013 Page 16
  17. 17. ORC - File Layout © Hortonworks Inc. 2013 Page 17
  18. 18. Example File Sizes from TPC-DS © Hortonworks Inc. 2013 Page 18
  19. 19. Compression• Need to pick level of compression – None – LZO or Snappy – fast but sloppy – Best for temporary tables – ZLIB – slow and complete – Best for long term storage © Hortonworks Inc. 2013 Page 19
  20. 20. JoinsPutting the pieces together© Hortonworks Inc. 2013 Page 20
  21. 21. Default Assumption• Hive assumes users are either: – Noobies – Hive developers• Default behavior is always finish – Little Engine that Could!• Experts could override default behaviors – Get better performance, but riskier• We’re working on improving heuristics © Hortonworks Inc. 2013 Page 21
  22. 22. Shuffle Join• Default choice – Always works (I’ve sorted a petabyte!) – Worst case scenario• Each process – Reads from part of one of the tables – Buckets and sorts on join key – Sends one bucket to each reduce• Works everytime! © Hortonworks Inc. 2013 Page 22
  23. 23. Map Join• One table is small (eg. dimension table) – Fits in memory• Each process – Reads small table into memory hash table – Streams through part of the big file – Joining each record from hash table• Very fast, but limited © Hortonworks Inc. 2013 Page 23
  24. 24. Sort Merge Bucket (SMB) Join• If both tables are: – Sorted the same – Bucketed the same – And joining on the sort/bucket column• Each process: – Reads a bucket from each table – Process the row with the lowest value• Very efficient if applicable © Hortonworks Inc. 2013 Page 24
  25. 25. DebuggingWhat could possibly go wrong?© Hortonworks Inc. 2013 Page 25
  26. 26. Performance Question• Which of the following is faster? – select count(distinct(Col)) from Tbl – select count(*) from (select distict(Col) from Tbl) © Hortonworks Inc. 2013 Page 26
  27. 27. Count Distinct © Hortonworks Inc. 2013 Page 27
  28. 28. Answer• Surprisingly the second is usually faster – In the first case: – Maps send each value to the reduce – Single reduce counts them all – In the second case: – Maps split up the values to many reduces – Each reduce generates its list – Final job counts the size of each list – Singleton reduces are almost always BAD © Hortonworks Inc. 2013 Page 28
  29. 29. Communication is Good!• Hive doesn’t tell you what is wrong. – Expects you to know! – “Lucy, you have some ‘splaining to do!”• Explain tool provides query plan – Filters on input – Numbers of jobs – Numbers of maps and reduces – What the jobs are sorting by – What directories are they reading or writing © Hortonworks Inc. 2013 Page 29
  30. 30. Blinded by Science• The explanation tool is confusing. – It takes practice to understand. – It doesn’t include some critical details like partition pruning.• Running the query makes things clearer! – Pay attention to the details – Look at JobConf and job history files © Hortonworks Inc. 2013 Page 30
  31. 31. Skew• Skew is typical in real datasets.• A user complained that his job was slow – He had 100 reduces – 98 of them finished fast – 2 ran really slow• The key was a boolean… © Hortonworks Inc. 2013 Page 31
  32. 32. Root Cause Analysis• Ambari – Apache project building Hadoop installation and management tool – Provides metrics (Ganglia & Nagios) – Root Cause Analysis – Processes MapReduce job logs – Displays timing of each part of query plan © Hortonworks Inc. 2013 Page 32
  33. 33. Root Cause Analysis Screenshots © Hortonworks Inc. 2013 Page 33
  34. 34. Root Cause Analysis Screenshots © Hortonworks Inc. 2013 Page 34
  35. 35. Thank You!Questions & Answers@owen_omalley © Hortonworks Inc. 2012: DO NOT SHARE. CONTAINS HORTONWORKS CONFIDENTIAL & PROPRIETARY INFORMATION Page 35
  36. 36. ORCFile - Comparison RC File Trevni ORC File Hive Type Model N N Y Separate complex columns N Y Y Splits found quickly N Y Y Default column group size 4MB 64MB* 250MB Files per a bucket 1 >1 1 Store min, max, sum, count N N Y Versioned metadata N Y Y Run length data encoding N N Y Store strings in dictionary N N Y Store row count N Y Y Skip compressed blocks N N Y Store internal indexes N N Y © Hortonworks Inc. 2013 Page 36

×