ETL Design for Impala Zero Touch Metadata
Manish Maheshwari
2 © Cloudera, Inc. All rights reserved.
Zero Touch Metadata
CatalogD polls Hive Metastore (HMS) notification events to:
• Invalidate tables when it receives ALTER TABLE events, or ALTER, ADD, or DROP events for their partitions.
• Add tables or databases when it receives CREATE
TABLE or CREATE DATABASE events.
• Remove tables from catalogd when it receives
DROP TABLE or DROP DATABASE events.
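The event handling above can be sketched as a simple dispatch table. This is illustrative only: the event-type names mirror the HMS notification types named on this slide, but the real logic lives inside catalogd and is far more involved.

```python
# Hypothetical sketch of how catalogd reacts to HMS notification events.
# The actions are the ones described on this slide, not catalogd's
# actual implementation.

EVENT_ACTIONS = {
    "ALTER_TABLE": "invalidate",      # table altered -> invalidate cached metadata
    "ALTER_PARTITION": "invalidate",
    "ADD_PARTITION": "invalidate",
    "DROP_PARTITION": "invalidate",
    "CREATE_TABLE": "add",            # new table/database -> add to catalog
    "CREATE_DATABASE": "add",
    "DROP_TABLE": "remove",           # dropped -> remove from catalogd
    "DROP_DATABASE": "remove",
}

def handle_event(event_type: str) -> str:
    """Return the catalog action for an HMS notification event type."""
    return EVENT_ACTIONS.get(event_type, "ignore")
```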
3 © Cloudera, Inc. All rights reserved.
Zero Touch Metadata - Benefits
• Automatic sync of metadata operations between Impala and all
other tools
• No need to run any commands on Impala via impala-shell,
JDBC, etc.
• Avoids query failures due to stale metadata
4 © Cloudera, Inc. All rights reserved.
Zero Touch Metadata – Some edge cases
Avoid writing HDFS files directly
• Use the LOAD DATA command instead, so that the HMS
notification is generated
When to run legacy commands:
• Invalidate Metadata
• Block locations changed due to running the HDFS balancer
(run the balancer on weekends)
• Recover Partitions
• New "folders" were created in Hive/Spark without an "alter
table add partition" command
• Refresh Table / Refresh Table Partition
• Files were added/removed/overwritten in HDFS via Hive/Spark
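The edge cases above each map to one Impala statement. A minimal sketch that pairs each scenario with the statement that fixes it; the table name is a placeholder, while the SQL itself is standard Impala DDL:

```python
# Illustrative mapping of the legacy-command edge cases to Impala SQL.
# 'stg.customers' is a hypothetical table name used as a placeholder.

def legacy_command(scenario: str, table: str = "stg.customers") -> str:
    if scenario == "hdfs_balancer_moved_blocks":
        # Block locations changed -> reload all cached metadata for the table
        return f"INVALIDATE METADATA {table}"
    if scenario == "partition_dirs_added_without_ddl":
        # New partition "folders" created without ALTER TABLE ADD PARTITION
        return f"ALTER TABLE {table} RECOVER PARTITIONS"
    if scenario == "files_changed_in_hdfs":
        # Files added/removed/overwritten in HDFS via Hive/Spark
        return f"REFRESH {table}"
    raise ValueError(f"unknown scenario: {scenario}")
```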
6 © Cloudera, Inc. All rights reserved.
Zero Touch Metadata – Handling Inserts into Tables
• Spark SQL can be used for ETL, with the limitation below
• Don't use df.write.save, as it does not generate an HMS notification
• df.write.save("/user/hive/warehouse/stg.db/customers/da..)
• Use
• spark.sql("INSERT OVERWRITE TABLE xxx PARTITION (date = , …)
SELECT * FROM spark_dataframe")
• The same applies to Hive
• Use Load Data …
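The recommended pattern is to register the DataFrame as a temporary view and issue the INSERT OVERWRITE through Spark SQL, so that HMS emits a notification event for catalogd. A minimal sketch; the table, partition, and view names are hypothetical:

```python
# Sketch of the Spark SQL pattern used instead of df.write.save().
# All identifiers here are placeholders for illustration.

def insert_overwrite_sql(target: str, partition_spec: str, view: str) -> str:
    """Build the INSERT OVERWRITE statement to pass to spark.sql()."""
    return (
        f"INSERT OVERWRITE TABLE {target} "
        f"PARTITION ({partition_spec}) "
        f"SELECT * FROM {view}"
    )

# With a live SparkSession the pattern would be (not executed here):
#   df.createOrReplaceTempView("staged_customers")
#   spark.sql(insert_overwrite_sql("stg.customers",
#                                  "load_date = '2019-06-01'",
#                                  "staged_customers"))
```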
© 2019 Cloudera, Inc. All rights reserved. 7
Adhoc ETL
- End users can choose to run ETL in
- Hive / Spark / Impala
- Using Hive/Spark is recommended over Impala
- Metadata is managed automatically, as explained in the
previous slides
- If using Impala, no metadata management is needed
Recommendations - ETL and Data Ingestion
© 2019 Cloudera, Inc. All rights reserved. 8
End User BI
- Use Impala
- All Impala best practices would be implemented by PS for
- Admission Control, Pool Design, Memory Limits
- Dedicated coordinators and Metadata V2
- Compute stats, query timeouts
- Data cache (only if a disk can be spared; use SSD)
- Load balance using HAProxy / F5
- Scripts to be written to set stats manually for Impala on weekends / as part
of data load
- Number of rows and column cardinalities (NDVs)
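Setting stats manually, as the last bullet suggests, can be done with Impala's ALTER TABLE syntax instead of a full COMPUTE STATS. A hedged sketch that builds those statements; the table and column names are placeholders:

```python
# Illustrative builders for the manual-stats DDL mentioned above.
# 'numRows' (table property) and SET COLUMN STATS are Impala syntax;
# the identifiers are hypothetical.

def set_table_rowcount_sql(table: str, num_rows: int) -> str:
    """DDL to set a table's row count without running COMPUTE STATS."""
    return f"ALTER TABLE {table} SET TBLPROPERTIES('numRows'='{num_rows}')"

def set_column_ndv_sql(table: str, column: str, ndv: int) -> str:
    """DDL to set a column's cardinality (number of distinct values)."""
    return f"ALTER TABLE {table} SET COLUMN STATS {column} ('numDVs'='{ndv}')"
```

A weekend script would emit one such statement per loaded table/partitioned column and run them through impala-shell or JDBC.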
Recommendations - ETL and Data Ingestion
© 2019 Cloudera, Inc. All rights reserved. 9
- UI Tools
- Enable LDAP authentication for HS2 and Impala
- Hue
- PS to implement all best practices for Hue
- Query/session timeouts, download limits, concurrent queries,
etc.
- Can be used to upload data; limit usage for heavy users
- DBeaver
- Power users can use the thick client
Recommendations - ETL and Data Ingestion
Thanks