
Advanced Sqoop


This is a slide pack which explains some of the not-so-known features of Sqoop, an integral part of the Hadoop ecosystem.

Published in: Technology


  1. Sqoop – Advanced Options, 2015
  2. Contents
     1. What is Sqoop?
     2. Import and export data using Sqoop
     3. Import and export commands in Sqoop
     4. Saved jobs in Sqoop
     5. Option file
     6. Important Sqoop options
  3. What is Sqoop?
     Sqoop is a tool designed for efficiently transferring bulk data between Hadoop and structured data stores such as relational databases.
  4. Import and Export using Sqoop
     The import command in Sqoop transfers data from an RDBMS to HDFS/Hive/HBase. The export command in Sqoop transfers data from HDFS/Hive/HBase back to the RDBMS.
  5. Import command in Sqoop
     The command to import data into Hive:
         sqoop import --connect <connect-string>/dbname --username uname -P --table table_name --hive-import -m 1
     The command to import data into HDFS:
         sqoop import --connect <connect-string>/dbname --username uname -P --table table_name -m 1
     The command to import data into HBase:
         sqoop import --connect <connect-string>/dbname --username uname -P --table table_name --hbase-table table_name --column-family col_fam_name --hbase-row-key row_key_name --hbase-create-table -m 1
  6. Export command in Sqoop
     The command to export data from Hive back to the RDBMS:
         sqoop export --connect <connect-string>/db_name --table table_name -m 1 --export-dir <path_to_export_dir>
     The command to export data from HDFS back to the RDBMS:
         sqoop export --connect <connect-string>/db_name --table table_name -m 1 --export-dir <path_to_export_dir>
     (For a Hive table, --export-dir points at the table's data directory in HDFS, which is why the two commands look the same.)
     Limitations of the import and export commands: they are convenient when data has to be transferred between an RDBMS and HDFS/Hive/HBase only a limited number of times. So what if the same import or export has to be executed several times a day? In such situations a saved Sqoop job can save your time.
  7. Saved Jobs in Sqoop
     A saved Sqoop job remembers the parameters used by a job, so the job can be re-executed any number of times.
     The following command creates a saved job:
         sqoop job --create job_name -- import --connect <connect-string>/dbname --table table_name
     The command above only creates a job with the name you specify; the job is now available in your saved-jobs list and can be executed later.
     The following command executes a saved job:
         sqoop job --exec job_name -- --username uname -P
  8. Sample Saved Job
         sqoop job --create JOB1 -- import \
           --connect jdbc:mysql:// \
           --username XXX --password XXX \
           --table transactionhistory \
           --target-dir /user/cloudera/datasets/trans \
           -m 1 \
           --columns "TransactionID,ProductId,TransactionDate" \
           --check-column TransactionDate \
           --incremental lastmodified \
           --last-value "2004-09-01 00:00:00"
  9. Important Options in Saved Jobs in Sqoop
     Sqoop option             Usage
     --connect                Connection string for the source database
     --table                  Source table name
     --columns                Columns to be extracted
     --username               User name for accessing the source table
     --password               Password for accessing the source table
     --check-column           Specifies the column to be examined when determining which rows to import
     --incremental            Specifies how Sqoop determines which rows are new
     --last-value             Specifies the maximum value of the check column from the previous import. Rows whose check column holds a value greater than this are imported; after each run of a saved job, Sqoop updates the value automatically.
     --target-dir             Target HDFS directory
     -m                       Number of mapper tasks
     --compress               Applies compression while loading data into the target
     --fields-terminated-by   Field separator in the output directory
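As a sketch of how these options combine, an append-mode incremental saved job might look like the following. The table, column, and host names are hypothetical, not from this deck, and running it requires a Sqoop/Hadoop installation:

```shell
# Hypothetical append-mode incremental job: each run imports only rows
# whose orders.order_id exceeds the last value Sqoop recorded.
sqoop job --create orders_incr -- import \
  --connect jdbc:mysql://dbhost/sales \
  --username uname -P \
  --table orders \
  --check-column order_id \
  --incremental append \
  --last-value 0 \
  --target-dir /user/cloudera/datasets/orders \
  -m 1
```

After each `sqoop job --exec orders_incr`, the saved job updates --last-value automatically, so the next run resumes where the previous one stopped.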
  10. Sqoop Metastore
     • A Sqoop metastore keeps track of all saved jobs.
     • By default, the metastore is contained in your home directory under .sqoop and is only used for your own jobs. If you want to share jobs, you need to install a JDBC-compliant database and use the --meta-connect argument to specify its location when issuing job commands.
     • Important Sqoop job commands:
         sqoop job --list        Lists all jobs available in the metastore
         sqoop job --exec JOB1   Executes JOB1
         sqoop job --show JOB1   Displays the metadata of JOB1
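A sketch of sharing jobs through a metastore, assuming Sqoop 1's built-in shared HSQLDB metastore service and a hypothetical host name (16000 is the default metastore port):

```shell
# On a designated node, run the shared metastore service
# (sketch only; requires a Sqoop installation):
sqoop metastore &

# Other users then point their job commands at it with --meta-connect.
# "metastore-host" is a placeholder for the node running the service.
sqoop job --list \
  --meta-connect jdbc:hsqldb:hsql://metastore-host:16000/sqoop
```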
  11. Option File
     Certain arguments to the import and export commands and to saved jobs have to be written every time you execute them. What would be an alternative to this repetitive work? For instance, the connection string, user name, and password are used repeatedly in import and export commands as well as in saved jobs.
     • These arguments can be saved in a single text file, say option.txt, with one argument per line:
           import
           --connect
           jdbc:mysql://localhost/db_name
           --username
           uname
           -P
     • While executing the command, include this file with the --options-file argument:
           sqoop --options-file <path_to_option_file> --table table_name
  12. Option File
     1. Each argument in the option file must be on a new line; an option and its value go on separate lines.
     2. Options in the file are written exactly as they would appear on the command line (e.g. --connect, not -connect).
     3. Blank lines and comment lines beginning with # are ignored.
     4. An option file is generally used when a large number of Sqoop jobs share a common set of parameters, such as:
        1. Source RDBMS user name and password
        2. Source database URL
        3. Field separator
        4. Compression type
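As a concrete sketch of the rules above (the file name and its contents are illustrative, not from this deck), an options file can be written and checked from the shell:

```shell
# Build a minimal options file: one token per line, options written
# exactly as on the command line; '#' lines are comments.
cat > option.txt <<'EOF'
# shared arguments for several Sqoop jobs
import
--connect
jdbc:mysql://localhost/db_name
--username
uname
EOF

# It would then be consumed as (requires a Sqoop installation):
#   sqoop --options-file option.txt --table table_name
```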
  13. Sqoop Design Guidelines for Performance
     1. Sqoop imports data in parallel from database sources. You can specify the number of map tasks (parallel processes) used for the import with the -m argument. Some databases may see improved performance by increasing this value to 8 or 16. Do not increase the degree of parallelism beyond what is available within your MapReduce cluster.
     2. By default, the import process uses JDBC. Some databases can perform imports in a higher-performance fashion by using database-specific data-movement tools. For example, MySQL provides the mysqldump tool, which can export data from MySQL to other systems very quickly. By supplying the --direct argument, you specify that Sqoop should attempt the direct import channel.
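Putting both guidelines together, a higher-throughput MySQL import might look like the sketch below. All names are placeholders, and --direct requires the mysqldump binary to be present on the worker nodes:

```shell
# Sketch: 8 parallel mappers plus MySQL's direct (mysqldump) channel.
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username uname -P \
  --table transactionhistory \
  --direct \
  -m 8 \
  --target-dir /user/cloudera/datasets/trans
```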
  14. Thank You