Bridging the gap of Relational to Hadoop using Sqoop @ Expedia

Published in: Technology

  1. 1. Bridging the gap of Relational to Hadoop using Sqoop @ Expedia (Enhancing Sqoop for Synchronization) • Shashank Tandon, Expedia • Kopal Niranjan, Expedia
  2. 2. Agenda • Problem statement • Why Sqoop • Expedia enhancements for Sqoop • New tool: Hive Merge • Data synchronization • Demo | Expedia Inc. Proprietary & Confidential
  3. 3. Data Synchronization
  4. 4. Problem Statement • Import the large volume of data held in an RDBMS into Hive tables. • Support multiple Hive partitions while importing. • Handle the regular updates happening on the RDBMS: - Merge new and updated data into the Hive tables. - Merge the data in parallel.
  5. 5. Community Solution - Sqoop • Sqoop is an open source tool designed to efficiently transfer bulk data between Hadoop and structured data stores such as relational databases. • Supports various relational databases such as Teradata, SQL Server, Oracle, MySQL, and DB2.
  6. 6. Enhanced Sqoop Features • Sqoop features enhanced for community and business needs. • Hive Merge: - Merges the incremental data migrated to HDFS into your existing Hive tables. - Supports merging on composite keys. - Merges into older partitions and adds new partitions.
  7. 7. Enhanced Sqoop Features - Hive dynamic partitioning - Hive dynamic partitioning with a partition format - Hive external tables - Compression such as Snappy
  8. 8. HCatalog for Hive - HCatalog is a Java wrapper on top of the Hive metastore. - Sqoop supports all the latest Hive features using HCatalog.
  9. 9. External tables with HCatalog
  10. 10. Sqoop Import to Hive Managed Table • Sqoop connects to the MySQL database test. • It imports the table MYTABLE into the Hive managed table test_part1. • The Hive managed table is located under /apps/hive/warehouse.
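The command shown on this slide is not part of the transcript. As a rough sketch only, an import like the one described could be assembled as below. The host `dbhost` and the user `sqoop_user` are placeholders, not values from the talk; `--connect`, `--username`, `-P`, `--table`, `--hive-import`, and `--hive-table` are standard Sqoop import arguments.

```python
import shlex


def hive_managed_import(jdbc_url, user, source_table, hive_table):
    """Build the argv for a Sqoop import into a Hive managed table."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", user,
        "-P",                        # prompt for the password on the console
        "--table", source_table,
        "--hive-import",             # load the imported data into Hive
        "--hive-table", hive_table,
    ]


# Placeholder host and user; database and table names follow the slide.
cmd = hive_managed_import(
    "jdbc:mysql://dbhost/test", "sqoop_user", "MYTABLE", "test_part1"
)
print(shlex.join(cmd))
```

Building the argument list first, rather than a single string, keeps quoting explicit if the command is later handed to `subprocess.run`.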
  11. 11.
  12. 12. New Enhancement: Import to Hive External Table • The command shown on this slide creates a Hive table in the user-managed directory /user/root/test_part2.
  13. 13.
  14. 14. Dynamic Partitioning with HCatalog
  15. 15. Sqoop Import to Hive Static Partition • Only one static partition can be passed as a Sqoop argument.
  16. 16. Sqoop Import to Hive Static Partition • Check the Hive partition.
  17. 17. Sqoop Import to Hive Static Partition on Date Column • Only one static partition can be passed as a Sqoop argument, with the date value specified manually.
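To make the limitation concrete, the sketch below generates one Sqoop invocation per date value, since each run accepts a single static partition value. The column names `booking_date` and `load_date`, the host, and the Hive table name are hypothetical; `--where`, `--hive-partition-key`, and `--hive-partition-value` are standard Sqoop import arguments.

```python
from datetime import date, timedelta


def static_partition_imports(jdbc_url, table, hive_table,
                             date_column, partition_key, start, days):
    """Yield one Sqoop argv per day: one run per static partition value."""
    for offset in range(days):
        day = (start + timedelta(days=offset)).isoformat()
        yield [
            "sqoop", "import",
            "--connect", jdbc_url,
            "--table", table,
            "--where", f"{date_column} = '{day}'",  # rows for this day only
            "--hive-import",
            "--hive-table", hive_table,
            "--hive-partition-key", partition_key,
            "--hive-partition-value", day,          # typed in by hand today
        ]


jobs = list(static_partition_imports(
    "jdbc:mysql://dbhost/test", "MYTABLE", "test_part3",
    "booking_date", "load_date", date(2016, 1, 1), 3,
))
print(len(jobs))  # 3 separate Sqoop jobs for 3 partitions
```

With a few hundred partitions this loop becomes a few hundred sequential jobs, which is the problem the questions on the next slide raise.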
  18. 18. Questions • How do I import data if there are more than 200 partitions? • Should I manually run these jobs again and again? • How do I import data if the partition date format is month, day, or year? • Is there any way to pass the format?
  19. 19. New Enhancement: Import to Hive Dynamic Partition • A new argument, --hcatalog-dynamic-partition-keys, is passed to Sqoop. • It works alongside the existing static partition key. • If both are passed, the static partition key takes precedence.
  20. 20.
  21. 21. New Enhancement: Import to Hive Dynamic Partition with Date Format • A new argument, --hcatalog-dynamic-partition-key-format, is passed together with --hcatalog-dynamic-partition-keys. • Check the Hive partitions after the Sqoop import. • The partitions are created in the user-specified format.
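The effect of a partition key format can be illustrated in memory: each row lands in the partition named by its date column rendered in the requested format. This is a sketch of the idea, not the patch itself. The slides do not show the format syntax the new argument accepts, so Python's `strftime` codes are used here purely for illustration, and the column `booked_on` is hypothetical.

```python
from collections import defaultdict
from datetime import date


def partition_rows(rows, key, fmt):
    """Group rows by the formatted value of a date column."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key].strftime(fmt)].append(row)
    return dict(partitions)


rows = [
    {"id": 1, "booked_on": date(2016, 1, 15)},
    {"id": 2, "booked_on": date(2016, 1, 31)},
    {"id": 3, "booked_on": date(2016, 2, 1)},
]

by_month = partition_rows(rows, "booked_on", "%Y-%m")
by_year = partition_rows(rows, "booked_on", "%Y")
print(sorted(by_month))  # ['2016-01', '2016-02']
print(sorted(by_year))   # ['2016']
```

The same three rows produce two partitions at month granularity and one at year granularity, with no per-partition job to launch.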
  22. 22.
  23. 23. Password Encrypted in Sqoop Metastore • The password is now saved in the Sqoop metastore in encrypted form. • The logic is the same as for file encryption, where a generic passkey and algorithm are passed on the command line.
  24. 24. Issues with Sqoop Merge Tool • Designed to merge two directories on HDFS; it needs modification to support merging Hive tables. • The output directory must be specified when performing the merge. • Supports merging on a single column only. • Merging many partitions requires a separate, sequential Sqoop job for each.
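These constraints show up directly in the shape of a `sqoop merge` invocation. The sketch below builds one per partition directory; all paths, the jar name, and the record class name are hypothetical, while `--new-data`, `--onto`, `--target-dir`, `--jar-file`, `--class-name`, and `--merge-key` are the stock merge tool's arguments.

```python
def sqoop_merge(new_dir, old_dir, out_dir, jar, record_class, key):
    """Build the argv for one run of the stock Sqoop merge tool."""
    return [
        "sqoop", "merge",
        "--new-data", new_dir,       # HDFS directory with the newer records
        "--onto", old_dir,           # HDFS directory with the older records
        "--target-dir", out_dir,     # output directory is mandatory
        "--jar-file", jar,
        "--class-name", record_class,
        "--merge-key", key,          # a single key column
    ]


partitions = ["dt=2016-01-01", "dt=2016-01-02"]
jobs = [
    sqoop_merge(
        f"/staging/new/{p}", f"/warehouse/mytable/{p}",
        f"/staging/merged/{p}", "MYTABLE.jar", "MYTABLE", "id",
    )
    for p in partitions
]
print(len(jobs))  # one merge job per partition directory
```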
  25. 25. Merge Incremental Data using Sqoop and Hive External Table • Import records from the base table to an HDFS directory. • Import updates using incremental imports to another HDFS directory. • Create a Hive external table for each of the two directories. • Create a view that combines the record sets from the Base (base_table) and Change (incremental_table) tables.
  26. 26. Merge Incremental Data using Sqoop and Hive External Table • The view now contains the most up-to-date set of records. • Generate a table from the view created in the step above. • Replace the base table with the entries from the generated table.
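The logic of the combining view can be sketched in memory: take the union of `base_table` and `incremental_table` and keep, for each key, the most recently modified record. The column names `id` and `modified_date` are assumptions for illustration; the slides do not name the key or timestamp columns.

```python
def reconcile(base_table, incremental_table, key="id", modified="modified_date"):
    """Return the latest version of every record across both tables."""
    latest = {}
    # Base rows come first, so on equal timestamps the incremental row wins.
    for row in base_table + incremental_table:
        current = latest.get(row[key])
        if current is None or row[modified] >= current[modified]:
            latest[row[key]] = row
    return sorted(latest.values(), key=lambda r: r[key])


base_table = [
    {"id": 1, "city": "Seattle", "modified_date": "2016-01-01"},
    {"id": 2, "city": "Delhi", "modified_date": "2016-01-01"},
]
incremental_table = [
    {"id": 2, "city": "Gurgaon", "modified_date": "2016-02-01"},  # update
    {"id": 3, "city": "London", "modified_date": "2016-02-01"},   # new row
]

merged = reconcile(base_table, incremental_table)
print([(r["id"], r["city"]) for r in merged])
# [(1, 'Seattle'), (2, 'Gurgaon'), (3, 'London')]
```

In Hive the result of this step would be materialized as a table and then swapped in for the base table, as the bullets describe.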
  27. 27. New Tool: Hive Merge • Import the original base table into Hive.
  28. 28. New Tool: Hive Merge • Import incremental data into Hive.
  29. 29. New Tool: Hive Merge • Finally, merge the data using the hive-merge tool.
  30. 30. Acquiring Locks during Hive Merge • To allow only a single Hive merge to run on a given table at a time, the tool acquires a lock at the start and releases it once it finishes.
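The acquire-at-start, release-at-finish pattern can be sketched with an exclusive lock file. This is a local stand-in under stated assumptions: the slides do not say which locking mechanism the tool uses, and `table_lock` and the lock file naming are hypothetical.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def table_lock(lock_dir, table):
    """Hold an exclusive per-table lock for the duration of a merge."""
    path = os.path.join(lock_dir, f"{table}.lock")
    # O_CREAT | O_EXCL fails atomically if the lock file already exists.
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        yield path
    finally:
        # Release the lock even if the merge raised an error.
        os.close(fd)
        os.remove(path)


lock_dir = tempfile.mkdtemp()
with table_lock(lock_dir, "test_part1"):
    try:
        with table_lock(lock_dir, "test_part1"):
            second_merge_ran = True
    except FileExistsError:
        second_merge_ran = False   # a concurrent merge is rejected

print(second_merge_ran)  # False
```

Releasing in a `finally` clause matters here: a merge that fails midway must not leave the table locked for later runs.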
  31. 31. Performance Metrics: Hive Merge Tool
  32. 32. Other Key Enhancements • Save encrypted password in Sqoop metastore • Teradata varchar/char support • Teradata current timestamp support • Sqoop job runs for incremental import • Snappy compression support in HCatalog
  33. 33. Apache Sqoop JIRAs • These are a few of the JIRAs for which we have provided patches: • SQOOP-2332: Dynamic Partition in Sqoop HCatalog- if Hive table does not exists & add support for Partition Date Format • SQOOP-2335: Support for Hive External Table in Sqoop – Hcatalog
  34. 34. • SQOOP-2585: Merging hive tables using sqoop • SQOOP-2596: Precision of varchar/char column cannot be retrieved from teradata database during sqoop import • SQOOP-2801: Secure RDBMS password in Sqoop Metastore in a encrypted form. • SQOOP-2331: Snappy Compression Support in Sqoop- Hcatalog
  35. 35. Demo
  36. 36. Questions
  37. 37. Hive Merge Internal Architecture • Step 1: Identify partitions to update. Skip this step for non-partitioned tables.
  38. 38. Hive Merge Internal Architecture • Step 2: Merge the new partitions with the old partitions (only for partitioned tables).
  39. 39. Hive Merge Internal Architecture • Step 3: Delete older versions.
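The architecture diagrams are not in the transcript, so the following is only one in-memory reading of the three step captions, not the tool's implementation. Tables are modeled as dictionaries from partition name to rows, and the composite key columns `hotel_id` and `room_id` are hypothetical.

```python
def hive_merge(base, incoming, keys):
    """Merge incoming partitions into base using a composite key."""
    # Step 1: only partitions present in the incoming data need updating.
    touched = set(incoming)

    merged = {}
    for part in touched:
        # Step 2: merge new rows with old rows; on a key match the new row wins.
        by_key = {tuple(r[k] for k in keys): r for r in base.get(part, [])}
        by_key.update({tuple(r[k] for k in keys): r for r in incoming[part]})
        merged[part] = list(by_key.values())

    # Step 3: drop the older versions of touched partitions, keep the rest.
    result = {p: rows for p, rows in base.items() if p not in touched}
    result.update(merged)
    return result


base = {
    "2016-01": [
        {"hotel_id": 1, "room_id": 1, "price": 100},
        {"hotel_id": 1, "room_id": 2, "price": 120},
    ],
    "2016-02": [{"hotel_id": 2, "room_id": 1, "price": 90}],
}
incoming = {
    "2016-01": [{"hotel_id": 1, "room_id": 2, "price": 130}],  # update
    "2016-03": [{"hotel_id": 3, "room_id": 1, "price": 80}],   # new partition
}

result = hive_merge(base, incoming, keys=("hotel_id", "room_id"))
print(sorted(result))  # ['2016-01', '2016-02', '2016-03']
```

Because each touched partition is merged independently, the loop in step 2 is the part that could run in parallel, which matches the parallel-merge requirement in the problem statement.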