Effective Sqoop: Best Practices, Pitfalls and Lessons


A fast paced, in-depth, "no frills" talk about how to effectively use Sqoop as part of your data flow and ingestion pipeline. We will cover topics such as delimiters in text files, Hadoop, MapReduce execution and map tasks with Sqoop, parallelism, boundary queries and splitting data, connectors, different file formats available in Sqoop, batch exports, Hive, Hive exports and HiveQL.


Effective Sqoop

Alex Silva
Principal Software Engineer
alex.silva@rackspace.com
Ten Best Practices

1. It pays to use formatting arguments.
2. With the power of parallelism comes great responsibility!
3. Use direct connectors for fast prototyping and performance.
4. Use a boundary query for better performance.
5. Do not use the same table for import and export.
6. Use an options file for reusability.
7. Use the proper file format for your needs.
8. Prefer batch mode when exporting.
9. Use a staging table.
10. Aggregate data in Hive.

Copyright 2014 Rackspace
Formatting Arguments

The default delimiters are: comma (,) for fields, newline (\n) for records, no quote character, and no escape character.

Formatting argument       What is it for?
enclosed-by               The field enclosing character.
escaped-by                The escape character.
fields-terminated-by      The field separator character.
lines-terminated-by       The end-of-line character.
mysql-delimiters          MySQL's default delimiters: fields (,), lines (\n), escaped-by (\), optionally-enclosed-by (').
optionally-enclosed-by    The field enclosing character, applied only when needed.
ID  LABEL            STATUS
1   Critical, test.  ACTIVE
3   By "agent-nd01"  DISABLED

$ sqoop import …
1,Critical, test.,ACTIVE
3,By "agent-nd01",DISABLED

$ sqoop import --fields-terminated-by , --escaped-by \\ --enclosed-by '"' ...
"1","Critical, test.","ACTIVE"
"3","By \"agent-nd01\"","DISABLED"

$ sqoop import --fields-terminated-by , --escaped-by \\ --optionally-enclosed-by '"' ...
1,"Critical, test.",ACTIVE
3,"By \"agent-nd01\"",DISABLED

Sometimes the problem doesn't show up until much later…
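The broken default-delimiter case above can be reproduced without Sqoop or a database; a minimal sketch using throwaway file names:

```shell
# A record whose LABEL field contains a comma, written with the
# default delimiters: the field boundary is now ambiguous.
echo '1,Critical, test.,ACTIVE' > naive.txt
awk -F',' '{ print NF }' naive.txt   # 4 "fields" instead of 3

# The same record with an enclosing character: a CSV-aware reader
# can recover the original three fields.
echo '"1","Critical, test.","ACTIVE"' > enclosed.txt
cat enclosed.txt
```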
Taming the Elephant

• Sqoop delegates all processing to Hadoop:
  • Each mapper transfers a slice of the table.
  • The --num-mappers parameter (defaults to 4) tells Sqoop how many mappers to use to slice the data.
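As a sketch, dialing parallelism down might look like this; the connection string, credentials, and table name are hypothetical placeholders:

```shell
# Hypothetical import using 2 mappers instead of the default 4,
# reducing concurrent load on the source database.
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username reporting \
  --table orders \
  --num-mappers 2
```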
How Many Mappers?

• The optimal number depends on a few variables:
  • The database type.
  • How does it handle parallelism internally?
  • The server hardware and infrastructure.
  • The overall impact on other requests.
Gotchas!

• More mappers can lead to faster jobs, but only up to a saturation point. This varies by table, job parameters, time of day, and server availability.
• Too many mappers will increase the load on the database: people will notice!
Connectors

• Two types of connectors: common (JDBC) and direct (vendor-specific batch tools).

Common connectors: MySQL, PostgreSQL, Oracle, SQL Server, DB2, Generic.
Direct connectors: MySQL, PostgreSQL, Oracle, Teradata, and others.
Direct Connectors

• Performance!
• Enabled with the --direct parameter.
• The vendor utilities need to be available on all task nodes.
• Escape characters, type mapping, and column and row delimiters may not be supported.
• Binary formats don't work.
Splitting Data

• By default, the primary key is used.
• Prior to starting the transfer, Sqoop will retrieve the min/max values for this column.
• Change the split column with the --split-by parameter:
  • Required for tables with no index columns or with multi-column keys.
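A minimal sketch of overriding the split column; the database and column names are made up, and customer_id stands in for any well-distributed, indexed column:

```shell
# Hypothetical: the table has a multi-column key, so split on a
# single well-distributed indexed column instead.
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --table orders \
  --split-by customer_id
```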
Boundary Queries

What if your split-by column is skewed, the table is not indexed, or the boundaries can be retrieved from another table? Use a boundary query to create the splits.

select min(<split-by>), max(<split-by>) from <table name>
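As a sketch, the boundaries can come from a cheap lookup instead of a full min/max scan; the order_bounds table here is a hypothetical, separately maintained bounds table:

```shell
# Hypothetical: fetch split boundaries from a precomputed table,
# avoiding a min/max scan over a skewed or unindexed column.
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --table orders \
  --split-by id \
  --boundary-query 'SELECT min_id, max_id FROM order_bounds'
```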
Splitting Free-Form Queries

• By default, Sqoop will use the entire query as a subquery to calculate min/max: INEFFECTIVE!
• Solution: use a --boundary-query.
• Good choices:
  • Store boundary values in a separate table.
    • Good for incremental imports (--last-value).
  • Run a query prior to Sqoop and save its output in a temporary table.
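A hedged sketch of a free-form import, with made-up table and column names; Sqoop substitutes each mapper's split predicate for the literal $CONDITIONS token, and the explicit boundary query keeps the full subquery from being run just to find min/max:

```shell
# Hypothetical free-form import; $CONDITIONS must appear in the
# WHERE clause and must not be expanded by the shell.
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --query 'SELECT o.id, o.total FROM orders o WHERE $CONDITIONS' \
  --split-by o.id \
  --boundary-query 'SELECT min(id), max(id) FROM orders' \
  --target-dir /data/orders
```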
Options Files

• Hold reusable arguments that do not change between runs.
• Pass them to the command line via the --options-file argument.
• Composition: more than one options file is allowed.
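As a sketch, a hypothetical options file might hold the connection arguments that never change (the file name, connection string, and user are made up):

```shell
# Create a hypothetical options file: one option or value per line;
# lines starting with '#' are comments.
cat > import-common.txt <<'EOF'
import
--connect
jdbc:mysql://db.example.com/sales
--username
reporting
EOF

# Reuse it, supplying only the per-run arguments:
sqoop --options-file import-common.txt --table orders
```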
File Formats

• Text (default):
  • Non-binary data types.
  • Simple and human-readable.
  • Platform independent.
• Binary (Avro and sequence files):
  • Precise representation with efficient storage.
  • Good for text containing separators.
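A minimal sketch of choosing the binary format at import time (connection details are hypothetical):

```shell
# Hypothetical: import as Avro data files to get a compact,
# splittable binary encoding that is safe for text containing
# delimiter characters.
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --table orders \
  --as-avrodatafile
```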
Environment

• Mostly a combination of text and Avro files.
• Why Avro?
  • Compact, splittable binary encoding.
  • Supports versioning and is language agnostic.
  • Also used as a container for smaller files.
Exports

• Experiment with batching multiple insert statements together:
  • The --batch parameter.
  • The sqoop.export.records.per.statement property (default 100).
  • The sqoop.export.statements.per.transaction property (default 100).
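As a sketch, the three batching knobs can be combined in one export; the connection string, table, and directory are hypothetical, and the -D properties must come right after the tool name:

```shell
# Hypothetical export: 100 rows per INSERT, 100 INSERTs per
# transaction, JDBC batch API enabled.
sqoop export \
  -Dsqoop.export.records.per.statement=100 \
  -Dsqoop.export.statements.per.transaction=100 \
  --connect jdbc:mysql://db.example.com/sales \
  --table daily_totals \
  --export-dir /data/daily_totals \
  --batch
```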
Batch Exports

• The --batch parameter uses the JDBC batch API (addBatch/executeBatch).
• However…
  • The implementation can vary among drivers.
  • Some drivers actually perform worse in batch mode! (Serialization and internal caches.)
Batch Exports

• The sqoop.export.records.per.statement property aggregates multiple rows into one single insert statement.
• However…
  • Not supported by all databases (though most do support it).
  • Be aware that most databases have limits on the maximum query size.
Batch Exports

• The sqoop.export.statements.per.transaction property controls how many insert statements are issued per transaction.
• However…
  • The exact behavior depends on the database.
  • Be aware of table-level write locks.
Which Is Better?

• There is no silver bullet that applies to all use cases.
• Start by enabling batch mode.
• Find out the maximum query size for your database.
• Set the number of rows per statement to roughly that value.
• Go from there.
Staging Tables Are Our Friends

• All data is written to the staging table first.
• Data is copied to the final destination only if all tasks succeed: all-or-nothing semantics.
• The structure must match the target exactly: columns and types.
• The staging table must exist beforehand and must be empty (see the --clear-staging-table parameter).
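A minimal sketch of a staged export; the staging table name follows a made-up convention, and the connection details are hypothetical:

```shell
# Hypothetical: write to an empty staging table first; rows move to
# the target table only if every export task succeeds.
sqoop export \
  --connect jdbc:mysql://db.example.com/sales \
  --table daily_totals \
  --staging-table daily_totals_stage \
  --clear-staging-table \
  --export-dir /data/daily_totals
```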
Hive

• The --hive-import parameter.
• BONUS: if the table doesn't exist, Sqoop will create it for you!
• Override default type mappings with --map-column-hive.
• Data is first loaded into HDFS and then loaded into Hive.
• The default behavior is append (use --hive-overwrite to replace).
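As a sketch, a Hive import with an overridden column type; the database, table, and column are hypothetical placeholders:

```shell
# Hypothetical: import straight into Hive, mapping the "total"
# column to a Hive DECIMAL instead of the default, and replacing
# existing data rather than appending.
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --table orders \
  --hive-import \
  --map-column-hive total=DECIMAL \
  --hive-overwrite
```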
Hive Partitions

• Two parameters:
  • --hive-partition-key
  • --hive-partition-value
• Current limitations:
  • One level of partitioning only.
  • The partition value has to be an actual value, not a column name.
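A minimal sketch of a partitioned Hive import under those limitations; the partition key and literal value are made up:

```shell
# Hypothetical: one partition level only, and the value must be a
# literal, not a column name.
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --table orders \
  --hive-import \
  --hive-partition-key ingest_date \
  --hive-partition-value 2014-06-01
```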
Hive and AVRO

• Currently not compatible!
• The workaround is to create an EXTERNAL table:

CREATE EXTERNAL TABLE cs_atom_events
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/user/cloud-analytics/snapshot/atom_events/cloud-servers'
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/cloud-analytics/avro/cs_cff_atom.avsc');
Data Pipeline

[Architecture diagram slide.]
Call to Action

www.rackspace.com/cloud/big-data (On-Metal Free Trial)

• Try it out!
• Deploy a CBD cluster and connect it to your RDBMS.
• Extract value from your data!
Thank you!

Alex Silva
alex.silva@rackspace.com