
Sqoop2 refactoring for generic data transfer - Hadoop Strata Sqoop Meetup

Sqoop2 is Sqoop as a service. Its focus is on ease of use, ease of extensibility, and security. Recently, Sqoop2 was refactored to handle generic data transfer needs.

  1. Sqoop 2: Refactoring for generic data transfer (Abraham Elmahrek)
  2. Cloudera Ingest!
  3. Introduction to Sqoop 2. Ease of use: provide a REST API and a Java API for easy integration; existing clients include a Hue UI and a command-line client. Extensible: provide a connector SDK and focus on pluggability; existing connectors include the Generic JDBC connector and the HDFS connector. Security: emphasize separation of responsibilities, eventually adding ACLs or RBAC.
  4. Life of a Request • Client: talks to the server over REST + JSON and does nothing but send requests • Server: extracts metadata from the data source, delegates to the execution engine, and does all the heavy lifting • MapReduce: parallelizes execution of the job
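     As a rough illustration of the client side, the sketch below issues a plain REST + JSON request to a Sqoop 2 server. The host name, port, and resource path are assumptions made for this sketch, not the documented endpoint layout.

     import java.io.BufferedReader;
     import java.io.InputStreamReader;
     import java.net.HttpURLConnection;
     import java.net.URL;

     public class SqoopRestClientSketch {
         public static void main(String[] args) throws Exception {
             // Hypothetical server location and resource path; the real
             // Sqoop 2 REST endpoints may differ from this sketch.
             URL url = new URL("http://sqoop-server:12000/sqoop/version");
             HttpURLConnection conn = (HttpURLConnection) url.openConnection();
             conn.setRequestMethod("GET");
             conn.setRequestProperty("Accept", "application/json");

             // The client only sends requests; the server does the heavy lifting.
             try (BufferedReader in = new BufferedReader(
                     new InputStreamReader(conn.getInputStream()))) {
                 String line;
                 while ((line = in.readLine()) != null) {
                     System.out.println(line); // JSON payload returned by the server
                 }
             }
         }
     }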
  5. Workflow
  6. Job Types: IMPORT into Hadoop and EXPORT out of Hadoop
  7. Responsibilities: transferring data from Connector A to Hadoop is split between connector responsibilities and Sqoop framework responsibilities
  8. Connector Definitions • Connectors define how to connect to a data source, how to extract data from a data source, and how to load data into a data source:
     public Importer getImporter();                          // supply the extract method
     public Exporter getExporter();                          // supply the load method
     public Class getConnectionConfigurationClass();
     public Class getJobConfigurationClass(MJob.Type type);  // MJob.Type is IMPORT or EXPORT
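     As a rough illustration of those hooks, the sketch below shows the shape of a connector built against the signatures above. Importer, Exporter, and MJob come from the Sqoop connector SDK; every Example* name is a placeholder invented for this sketch, not an actual Sqoop class.

     // Sketch only: the Example* classes are hypothetical placeholders.
     public class ExampleConnector {

         // Extract side: how to pull records out of the data source.
         public Importer getImporter() {
             return new ExampleImporter();
         }

         // Load side: how to push records into the data source.
         public Exporter getExporter() {
             return new ExampleExporter();
         }

         // Configuration needed to connect (URL, credentials, ...).
         public Class getConnectionConfigurationClass() {
             return ExampleConnectionConfig.class;
         }

         // Per-job configuration; MJob.Type is IMPORT or EXPORT.
         public Class getJobConfigurationClass(MJob.Type type) {
             return type == MJob.Type.IMPORT
                 ? ExampleImportJobConfig.class
                 : ExampleExportJobConfig.class;
         }
     }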
  9. Intermediate Data Format • Describes a single record as it moves through Sqoop • Currently available: CSV (one record per line: col1,col2,col3,...)
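     For example, two three-column records might travel through Sqoop as the CSV text lines 1,'Alice','2014-05-20' and 2,'Bob',NULL; the quoting and NULL conventions shown here are illustrative, not the exact IDF encoding rules.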
  10. What’s Wrong with the Current Implementation? • Treating Hadoop as a first-class citizen prevents transfers between components within the Hadoop ecosystem: HBase to HDFS is not supported, and HDFS to Accumulo is not supported • The Hadoop ecosystem is not well defined: Accumulo was not considered part of it, and what’s next? Kafka?
  11. Refactoring • Connectors already define extractors and loaders, so refactor the connector SDK • Pull HDFS integration out into its own connector • Improve Schema integration • Goal: transfer data from Connector A to Connector B
  12. Connector SDK • Connectors assume all roles: the connector responsibilities cover both ends of a transfer • Add a Direction for FROM and TO • Add initializers and destroyers for both directions
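     A minimal sketch of what that direction-aware SDK shape implies follows; the class and method names are guesses made for illustration and are not the exact post-refactoring Sqoop 2 API (Extractor, Loader, Initializer, and Destroyer stand in for the SDK's per-role types).

     // Illustrative sketch only: after the refactoring a single connector can
     // act as either end of a transfer, selected by a Direction value.
     public abstract class DirectionAwareConnectorSketch {

         public enum Direction { FROM, TO }

         // Extract side, used when this connector is the FROM end of a job.
         public abstract Extractor getExtractor();

         // Load side, used when this connector is the TO end of a job.
         public abstract Loader getLoader();

         // Separate setup and teardown hooks for each direction.
         public abstract Initializer getInitializer(Direction direction);
         public abstract Destroyer getDestroyer(Direction direction);

         // Per-direction job configuration (connection/link config omitted).
         public abstract Class getJobConfigurationClass(Direction direction);
     }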
  13. HDFS Connector • Move the Hadoop role into a connector • Schemaless • Data formats: Text (CSV), Sequence, etc.
  14. Schema Improvements • Schema per connector • The intermediate data format (IDF) has a Schema • Introduce a matcher • The Schema represents data as it moves through the system
  15. Matcher • The matcher ensures data goes to the right place • Combinations: FROM and TO schema, FROM schema only, TO schema only; no schema at all is an error
  16. Matcher types: Location, Name, User defined • The Location matcher ensures that the FROM schema matches the TO schema by the index location of each column in the Schema
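     As a toy illustration of the location strategy, the sketch below copies fields from a FROM record into a TO record purely by column position; it is not Sqoop's actual matcher code, just the index-based idea.

     // Toy index-based (location) matcher: field i of the FROM record lands in
     // field i of the TO record; extra TO columns stay null, extra FROM columns
     // are dropped. Purely illustrative.
     public class LocationMatcherSketch {

         public Object[] match(Object[] fromRecord, int toColumnCount) {
             if (fromRecord == null) {
                 // Nothing to match against is treated as an error.
                 throw new IllegalArgumentException("No FROM record to match");
             }
             Object[] toRecord = new Object[toColumnCount];
             for (int i = 0; i < Math.min(fromRecord.length, toColumnCount); i++) {
                 toRecord[i] = fromRecord[i];
             }
             return toRecord;
         }
     }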
  17. Check out http://ingest.tips for general ingest topics
  18. Thank you
