Hive Partitioning
Best Practices
(Developer audience)
Nabeel Moidu
Solutions Architect
Cloudera Professional Services
Topics
• Data Warehouses in Hadoop
• Hadoop Data Modelling
• Columnar data storage
• Partitioning
• Bucketing
• Best Practices
• Small Files
• Storage optimizations
• Query optimisation
• Debugging Hive queries
Data Warehouses in Hadoop
❏ Targeted for analytical query processing (OLAP).
❏ Query processing is handled differently by each engine:
❏ Hive uses MapReduce as its execution engine.
❏ CDP versions include the Tez engine.
❏ SparkSQL uses the Spark engine.
❏ Both Hive and Spark use table metadata stored in the Hive metastore.
❏ Both Hive and Spark are based on the “schema-on-read” approach.
❏ Traditional RDBMSs use the “schema-on-write” approach.
Key differences in query patterns from RDBMS (OLTP):
❏ Queries may need only a subset of columns from a table.
❏ The entire set of columns for the selected rows is rarely required.
❏ Queries may involve large join operations.
❏ Denormalized tables are preferred.
❏ The query optimisation focus includes minimizing the amount of data fetched from disk (pruning).
Hadoop Data Modelling
General guidelines for data modelling in Hadoop:
1. Denormalize tables that are joined frequently**.
2. RDBMS implementations of comparable features differ significantly from those in Hadoop/Hive.
3. A significant amount of optimisation happens outside of partitioning.
4. ORC/Parquet file formats with embedded metadata greatly improve predicate pushdown.
5. Numerical data stored as strings prevents predicate pushdown during query optimisation (see the sketch below).
6. Query engines assess the data involved and optimize the SQL statement before actual execution.
7. Excessive partitioning/bucketing can hurt overall performance due to the added overhead.
** (Note: Very wide tables may have memory implications)
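
To illustrate point 5, a minimal sketch with a hypothetical sales table: declaring the numeric column with a numeric type lets the engine use per-column min/max statistics for pushdown, whereas a STRING column does not.

```sql
-- Hypothetical example: the same column typed as STRING vs. DECIMAL.
-- Stored as STRING, a filter such as amount > 100 compares lexicographically
-- (or forces casts) and defeats min/max-based predicate pushdown.
CREATE TABLE sales_raw (
  sale_id STRING,
  amount  STRING          -- anti-pattern: numeric data stored as a string
)
STORED AS PARQUET;

CREATE TABLE sales (
  sale_id STRING,
  amount  DECIMAL(12,2)   -- numeric type: filters on amount can be pushed down
)
STORED AS PARQUET;
```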
Columnar data storage
Physical data organization on disk is optimised for OLAP access patterns.
1. Columnar data organization improves the performance of data retrieval.
2. Storing similar values together makes storage compression very efficient.
a. Low-cardinality columns are replaced with dictionary-mapped references
b. Repeated values in a column are stored only once
c. Numerically close values are stored as deltas
d. Bits are packed for efficient use of disk space
e. Per-column metadata is stored separately in the footer/trailer
3. Efficient storage on disk significantly improves the disk I/O rate for data fetches.
4. Columnar storage optimisations are applied in both Parquet and ORC.
5. Common file formats like CSV, JSON, XML, etc. are not optimal for performance.
Partitioning
1. Achieves data pruning by dividing the table’s backend storage location into sub-folders.
2. Each folder is named “key=value” and corresponds to a unique value of the partition column.
3. Partitioning on multiple columns creates multiple levels of folders in the backend storage.
4. Each level of subfolder corresponds to one of the designated partition columns.
5. Partitioned directories are selectively chosen during the data fetch for a query.
6. Files don’t have to be opened and read to identify the partition column’s value.
7. A partition corresponds to a single value, not a range of values.
8. Partition only on columns with low cardinality, e.g. year, country (see the sketch below).
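
A minimal sketch of the layout described above, with hypothetical table and column names:

```sql
-- Hypothetical example: a table partitioned by year and country.
CREATE TABLE web_logs (
  log_id STRING,
  url    STRING,
  ts     TIMESTAMP
)
PARTITIONED BY (year INT, country STRING)
STORED AS PARQUET;

-- Rows land in nested folders such as:
--   .../web_logs/year=2020/country=US/
--   .../web_logs/year=2020/country=DE/

-- A filter on the partition columns lets Hive read only the matching folders:
SELECT url, ts
FROM web_logs
WHERE year = 2020 AND country = 'US';
```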
Bucketing
1. Divides the table’s data into a preset number of buckets based on a column.
2. A hash is computed on the column value to place each row into one of the buckets.
3. Each unique column value will always end up in the same bucket file.
4. Helps isolate the data fetch to a single file when joining on the bucketed column.
5. Bucketing and partitioning can be used together (see the sketch below), but set a target file size of 200 MB - 1 GB *
6. Too much partitioning and bucketing can end in the small-file issue.
7. Parquet/ORC storage optimisations largely cover the same ground as bucketing.
* During the session this was mentioned as 1 GB - 2 GB. Please note the correction.
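
A minimal sketch combining both techniques, with hypothetical names:

```sql
-- Hypothetical example: partitioned by year, bucketed on user_id into 32 buckets.
CREATE TABLE user_events (
  user_id    BIGINT,
  event_type STRING,
  ts         TIMESTAMP
)
PARTITIONED BY (year INT)
CLUSTERED BY (user_id) INTO 32 BUCKETS
STORED AS ORC;
```

Rows with the same user_id hash to the same bucket file, which is what allows joins on user_id to be resolved bucket by bucket.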
Best Practices ( partitioning & bucketing )
● Choose no more than two levels of partitioning
● Partition on columns that are likely to be used as filters in queries on the data
● Keep the partition count of a table at a maximum of 1000-2000 for optimal performance
● Never partition on columns with high cardinality / unique values per row
● Target an optimal range of 200 MB - 1 GB of file size inside each partition/bucket
● Ensure all files inside one partition are merged into one during the ingestion process itself (see the sketch below)
● If the cluster has small-file issues, bucketing is better skipped for ORC/Parquet tables
● For inputs on bucketing optimisation, refer to:
https://community.cloudera.com/t5/Support-Questions/Hive-Deciding-the-number-of-buckets/m-p/129310
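
One way to approach the merge-on-ingest recommendation, sketched with real Hive merge properties (the size values are illustrative, not recommendations; staging_logs is a hypothetical source table):

```sql
-- Ask Hive to merge small output files at the end of a write job.
SET hive.merge.mapfiles=true;                 -- merge outputs of map-only jobs
SET hive.merge.mapredfiles=true;              -- merge outputs of MapReduce jobs
SET hive.merge.tezfiles=true;                 -- merge outputs when running on Tez
SET hive.merge.smallfiles.avgsize=268435456;  -- trigger a merge if avg file size < 256 MB
SET hive.merge.size.per.task=1073741824;      -- target ~1 GB per merged file

-- Hypothetical ingest into the web_logs table sketched earlier:
INSERT OVERWRITE TABLE web_logs PARTITION (year=2020, country='US')
SELECT log_id, url, ts
FROM staging_logs;
```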
Small Files
1. The Hadoop cluster filesystem and the YARN processing framework involve some overhead, which is:
a. Negligible when the chunk of data stored and processed by each storage block or task is large enough
b. Expensive when the data involved is proportionately small.
2. Hadoop and big data processing in general are optimized for large file sizes.
3. Small files make disk reads random during data fetch and significantly reduce the performance of data retrieval.
4. No hard boundary defines what constitutes a small file, but:
a. Files < 30 MB are not optimal.
b. Files < 1 MB will significantly impact overall performance.
5. Numerous small files increase the size of the metadata on the cluster filesystem’s master nodes.
a. This impacts the responsiveness of the filesystem.
6. Metadata processing for numerous individual small files impacts the query planning stages in Hive and Spark.
a. Each file’s internal structure has to be read individually to get the footer/trailer metadata.
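
For tables that already suffer from small files, a sketch of an in-place repair (CONCATENATE is a real Hive statement, but it applies to ORC tables only; the names are hypothetical):

```sql
-- Merge the small ORC files of one partition in place.
ALTER TABLE user_events PARTITION (year=2020) CONCATENATE;
```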
Storage optimizations
Parquet and ORC are the main file formats that are efficient and perform well in Hive.
1. Parquet was originally developed by Cloudera and Twitter - available in CDH clusters
2. ORC was originally developed by Hortonworks, before the merger with Cloudera - will be available in CDP clusters
3. Both formats store the data in the file in columnar form.
a. Both keep useful metadata in the footer of the file.
4. Both formats first split the data into sets of rows.
a. ORC calls this a stripe, while Parquet refers to it as a row group.
5. Within each set of rows, the data for each column is stored together.
a. This helps in both efficient compression and efficient retrieval of column-based query output.
6. Data within columns is generally stored using dictionary encoding and RLE (run-length encoding).
7. Use the parquet-tools utility to view the metadata of a Parquet file.
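
A minimal sketch of declaring these formats explicitly at table creation (the table names are hypothetical; orc.compress is a real ORC table property):

```sql
-- Explicitly choose the storage format and its compression codec.
CREATE TABLE events_orc (
  event_id BIGINT,
  payload  STRING
)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY");

CREATE TABLE events_parquet (
  event_id BIGINT,
  payload  STRING
)
STORED AS PARQUET;
```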
Query optimisation
1. Partitioning on the filter column enables data to be fetched only from the selected subfolders in HDFS.
2. Column metadata is present in the footer of each ORC/Parquet file.
3. Footer data includes, among other things, the min and max values for each column.
4. Footers also contain a dictionary of column values for lower-cardinality columns.
5. These and similar pieces of metadata enable predicate pushdown to optimise queries.
6. Query engines can skip entire files during processing based on this metadata, e.g.:
a. If the filter criteria fall outside the range identified by the per-column min-max values, the entire file can be skipped.
b. If the filter value is not among the keys in the dictionary of column values, the entire file can be skipped.
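
A quick way to verify that pruning and pushdown actually happen is EXPLAIN; a sketch against the hypothetical web_logs table from earlier:

```sql
-- The plan should show only the matching partitions being scanned, with the
-- partition filter absorbed by pruning rather than evaluated per row.
EXPLAIN
SELECT url, ts
FROM web_logs
WHERE year = 2020 AND country = 'US';
```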
Debugging Hive queries
1. Use the PerfLogger setting on the HiveServer2 instance to debug Hive query performance (see the sketch below).
2. Once PerfLogger is enabled, each stage in Hive query execution is separately timed and logged.
3. The compile and YARN stages are separately identifiable.
4. Compile stages should normally take no more than a few seconds.
a. For complicated queries this may go up to a minute or so.
b. 90% or more of the query time is often spent in the YARN stage.
5. Use the Hive session ID and the assigned thread ID to track individual user sessions in HiveServer2.
6. Use the application ID logged against the Hive query ID to track the progress of the query in YARN.
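
A sketch of enabling PerfLogger output, assuming a classic hive-log4j.properties setup (the exact syntax depends on the log4j version your distribution ships):

```properties
# Raise the PerfLogger class to DEBUG in HiveServer2's log4j configuration;
# per-stage timing lines then appear in the HiveServer2 log for every query.
log4j.logger.org.apache.hadoop.hive.ql.log.PerfLogger=DEBUG
```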
Q&A - Part 1
1. How can tables with skewed data columns be partitioned?
a. Use the SKEWED BY option
2. Why is there a difference in performance between “BETWEEN” and “GREATER THAN OR EQUAL TO”?
a. Needs to be investigated using the EXPLAIN output
b. Speculative possibilities include differences in predicate pushdown or vectorisation
3. If downstream use patterns are not known at the time of table design, what should be done?
a. Optimise based on generally known patterns (query by year/month, etc.)
b. Optimise based on the ingest pattern
4. Can bucketing be applied to tables after data ingest?
a. No. The data would have to be fully re-organized, so it’s better to create another bucketed table and copy into it (see the sketch below).
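
A sketch of the answer to question 4, with hypothetical names:

```sql
-- Create a bucketed copy of an existing (unbucketed) table.
CREATE TABLE user_events_bucketed (
  user_id    BIGINT,
  event_type STRING,
  ts         TIMESTAMP
)
CLUSTERED BY (user_id) INTO 32 BUCKETS
STORED AS ORC;

-- On older Hive versions, bucketed writes must be enforced explicitly:
SET hive.enforce.bucketing=true;

-- Repopulate; Hive hashes user_id to route each row to its bucket file.
INSERT OVERWRITE TABLE user_events_bucketed
SELECT user_id, event_type, ts
FROM user_events;
```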
Q&A - Part 2
1. How is the HBase columnar format different from Hive’s columnar optimizations?
a. HBase stores the data of different column families in different folders in HDFS.
b. Inside those, the column data is still stored with the key alongside every cell value.
c. HBase is extremely fast for key-based lookups and key-range-based lookups.
d. Queries filtering on column values are an anti-pattern for HBase and will not perform well.
e. Column-based filtering in HBase has to scan through all columns of the selected rows.
f. The HBase file format design is not optimized for join-type operations.
g. Schema design in HBase primarily focuses on how the row key is designed.
Q&A - Part 3
1. Why does Spark not have its own metastore?
a. Spark was designed as a data processing framework separate from Hadoop.
b. The initial focus of Spark optimizations was the query engine.
c. When Spark was introduced, data in Hadoop was already being catalogued in the Hive metastore.
d. With Spark and Hive both being open source, Spark could talk directly to the Hive metastore.
e. Hence there was no need for Spark to introduce a separate catalog.
2. Can Hive PerfLogger output be made available to users without access to HiveServer2, the way Spark logs are accessible?
a. The Driver process is where a query is compiled and passed to YARN in Hadoop.
b. In Spark, the Driver sits on the Application Master of the launched job, so it is segregated per job.
c. In Hive, the Driver sits on the HiveServer2 instance, so the logs for all queries sit on that central server.
Thank you
