0
Apache Sqoop:                        Highlights of Sqoop 2   Kathleen Ting (kathleen at cloudera dot com)   Alexander Alte...
What is Sqoop?     • Bulk data transfer tool           o   Import/Export from relational database, enterprise             ...
Sqoop 1 ArchitectureWednesday, May 23, 12
Sqoop 1 Challenges      • Cryptic, contextual command line arguments      • Tight coupling between data transfer and      ...
Sqoop 2 ArchitectureWednesday, May 23, 12
Agenda     • Ease of Use        o Sqoop 1: Client-side Tool        o Sqoop 2: Sqoop as a Service        o Client Interface...
Sqoop 1: Client-side Tool       • Sqoop 1 is a client-side tool             o   Client-side installation + configuration  ...
Sqoop 2: Sqoop as a Service      • Server-side installation + configuration                  Connectors configured in one...
Client Interface      • Sqoop 1 client interface:           o    Command-Line Interface (CLI) based, thus scriptable      ...
Sqoop 1: Service Level Integration      • Hive, HBase           o   Requires local installation      • Oozie           o  ...
Sqoop 2: Service Level Integration      • Hive, HBase           o   Server-side integration      • Oozie           o   RES...
Ease of Use (summary)     Sqoop 1                           Sqoop 2     Client-side install               Server-side inst...
Agenda      • Ease of Use         o Sqoop 1: Client-side Tool         o Sqoop 2: Sqoop as a Service         o Client Inter...
Sqoop 1: Implementing Connectors        • Connectors forced to follow JDBC model              o   Connectors limited/requi...
Sqoop 2: Implementing Connectors      • Connectors are not restricted to JDBC model           o   Connectors can define ow...
Different Options, Different Results  Which is running MySQL?  $ sqoop import --connect jdbc:mysql://localhost/db   --user...
Sqoop 1: Using Connectors      • Choice of connector is implicit           o   In a simple case, based on the URL in the -...
Sqoop 2: Using Connectors      • User makes explicit connector choice           o   Less error-prone, more predictable    ...
Ease of Extension (summary)     Sqoop 1                                   Sqoop 2     Connector forced to follow JDBC mode...
Agenda      • Ease of Use         o Sqoop 1: Client-side Tool         o Sqoop 2: Sqoop as a Service         o Client Inter...
Sqoop 1: Security      • Inherits/propagates Kerberos principal for the        jobs it launches      • Access to files on ...
Sqoop 2: Security     • Inherits/propagates Kerberos principal for the       jobs it launches     • Access to files on HDF...
Sqoop 1: Accessing External Systems      • Every invocation requires necessary credentials        to access external syste...
Sqoop 2: Accessing External Systems      • Sqoop 2 introduces Connections as First-Class        Objects           o   Conn...
Sqoop 1: Resource Management      • No explicit resource management policy           o   User specifies number of map jobs...
Sqoop 2: Resource Management      • Connections allow specification of resource        policy           o   Admin can limi...
Security (summary)      Sqoop 1                                   Sqoop 2      Support only for Hadoop security          S...
Takeaway    Sqoop 2 Highights:      • Ease of Use: Sqoop as a Service      • Ease of Extension: Connectors benefit from sh...
Current Status: work-in-progress      • Sqoop 2 Development: https://        issues.apache.org/jira/browse/SQOOP-365      ...
Upcoming SlideShare
Loading in...5
×

Highlights Of Sqoop2

2,559

Published on

The highlights of Sqoop2, Presentation from Kathleen Ting, Sqoop Commiter.

Published in: Technology, Business
0 Comments
7 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
2,559
On Slideshare
0
From Embeds
0
Number of Embeds
3
Actions
Shares
0
Downloads
125
Comments
0
Likes
7
Embeds 0
No embeds

No notes for slide

Transcript of "Highlights Of Sqoop2"

  1. 1. Apache Sqoop: Highlights of Sqoop 2 Kathleen Ting (kathleen at cloudera dot com) Alexander Alten-Lorenz (alexander at cloudera dot com) May 2012Wednesday, May 23, 12
  2. 2. What is Sqoop? • Bulk data transfer tool o Import/Export from relational database, enterprise data warehouse, NoSQL systems o Populate tables in Hive, HBase o Schedule Oozie automated import/export tasks o Support plugins via Connector based architectureWednesday, May 23, 12
  3. 3. Sqoop 1 ArchitectureWednesday, May 23, 12
  4. 4. Sqoop 1 Challenges • Cryptic, contextual command line arguments • Tight coupling between data transfer and serialization format • Security concerns with openly shared credentials • Not easy to manage config/install • Not easy to monitor map job • Connectors are forced to follow JDBC modelWednesday, May 23, 12
  5. 5. Sqoop 2 ArchitectureWednesday, May 23, 12
  6. 6. Agenda • Ease of Use o Sqoop 1: Client-side Tool o Sqoop 2: Sqoop as a Service o Client Interface o Sqoop 1: Service Level Integration o Sqoop 2: Service Level Integration • Ease of Extension o Sqoop 1: Implementing Connectors o Sqoop 2: Implementing Connectors o Sqoop 1: Using Connectors o Sqoop 2: Using Connectors • Security o Sqoop 1: Security o Sqoop 2: Security o Sqoop 1: Accessing External Systems o Sqoop 2: Accessing External Systems o Sqoop 1: Resource Management o Sqoop 2: Resource ManagementWednesday, May 23, 12
  7. 7. Sqoop 1: Client-side Tool • Sqoop 1 is a client-side tool o Client-side installation + configuration  Connectors locally installed  Local configuration, requiring root privileges  JDBC drivers needed locally  Database connectivity needed locallyWednesday, May 23, 12
  8. 8. Sqoop 2: Sqoop as a Service • Server-side installation + configuration  Connectors configured in one place, managed by Admin/ run by Operator  JDBC drivers in one place  Database connectivity needed on the serverWednesday, May 23, 12
  9. 9. Client Interface • Sqoop 1 client interface: o Command-Line Interface (CLI) based, thus scriptable • Sqoop 2 client interface: o CLI based, thus scriptable o Web based, thus accessible o REST API exposed for external tool integrationWednesday, May 23, 12
  10. 10. Sqoop 1: Service Level Integration • Hive, HBase o Requires local installation • Oozie o von Neumann(esque) integration:  Packaged Sqoop as an action  Then ran Sqoop from node machines, causing one MR job to be dependent on another MR job  Error-prone, difficult to debugWednesday, May 23, 12
  11. 11. Sqoop 2: Service Level Integration • Hive, HBase o Server-side integration • Oozie o REST API integrationWednesday, May 23, 12
  12. 12. Ease of Use (summary) Sqoop 1 Sqoop 2 Client-side install Server-side install CLI based CLI + Web based Client access to Hive, HBase Server access to Hive, HBase Oozie and Sqoop tightly coupled Oozie finds REST APIWednesday, May 23, 12
  13. 13. Agenda • Ease of Use o Sqoop 1: Client-side Tool o Sqoop 2: Sqoop as a Service o Client Interface o Sqoop 1: Service Level Integration o Sqoop 2: Service Level Integration • Ease of Extension o Sqoop 1: Implementing Connectors o Sqoop 2: Implementing Connectors o Sqoop 1: Using Connectors o Sqoop 2: Using Connectors • Security o Sqoop 1: Security o Sqoop 2: Security o Sqoop 1: Accessing External Systems o Sqoop 2: Accessing External Systems o Sqoop 1: Resource Management o Sqoop 2: Resource ManagementWednesday, May 23, 12
  14. 14. Sqoop 1: Implementing Connectors • Connectors forced to follow JDBC model o Connectors limited/required to use common JDBC vocabulary (URL, database, table, etc) • Connectors must implement all Sqoop functionality that they want to support o New functionality not avail for old connectorsWednesday, May 23, 12
  15. 15. Sqoop 2: Implementing Connectors • Connectors are not restricted to JDBC model o Connectors can define own vocabulary • Common functionality abstracted out of connectors o Connectors only responsible for data transport o Common Reduce phase implements functionality o Ensures that connectors benefit from future dev of functionalityWednesday, May 23, 12
  16. 16. Different Options, Different Results Which is running MySQL? $ sqoop import --connect jdbc:mysql://localhost/db --username foo --table TEST $ sqoop import --connect jdbc:mysql://localhost/db --driver com.mysql.jdbc.Driver --username foo --table TEST – Different options can lead to unpredictable results – Sqoop 2 requires explicit selection of connector thus disambiguating the processWednesday, May 23, 12
  17. 17. Sqoop 1: Using Connectors • Choice of connector is implicit o In a simple case, based on the URL in the --connect string used to access the database o Specification of different options can lead to different connector selection o Error-prone but good for power users • Requires knowledge of database idiosyncrasies o e.g. Couchbase doesn’t need to specify a table name, which is required causing --table to get overloaded as backfill or dump operation o e.g. --null-string representation not supported by all connectors • Functionality limited to what the implicitly chosen connector supportsWednesday, May 23, 12
  18. 18. Sqoop 2: Using Connectors • User makes explicit connector choice o Less error-prone, more predictable • User need not be aware of the functionality of all connectors o Couchbase users need not care that other connectors use tables • Common functionality available to all connectors o Connectors need not worry about downstream functionality, transformations, integration with other systemsWednesday, May 23, 12
  19. 19. Ease of Extension (summary) Sqoop 1 Sqoop 2 Connector forced to follow JDBC model Connector given free rein Connectors must implement functionality Connectors benefit from common framework of functionality Connector selection is implicit Connector selection is explicitWednesday, May 23, 12
  20. 20. Agenda • Ease of Use o Sqoop 1: Client-side Tool o Sqoop 2: Sqoop as a Service o Client Interface o Sqoop 1: Service Level Integration o Sqoop 2: Service Level Integration • Ease of Extension o Sqoop 1: Implementing Connectors o Sqoop 2: Implementing Connectors o Sqoop 1: Using Connectors o Sqoop 2: Using Connectors • Security o Sqoop 1: Security o Sqoop 2: Security o Sqoop 1: Accessing External Systems o Sqoop 2: Accessing External Systems o Sqoop 1: Resource Management o Sqoop 2: Resource ManagementWednesday, May 23, 12
  21. 21. Sqoop 1: Security • Inherits/propagates Kerberos principal for the jobs it launches • Access to files on HDFS can be controlled via HDFS security • Sqoop operates as command line Hadoop client • No support for securing access to external systems o E.g. relational databaseWednesday, May 23, 12
  22. 22. Sqoop 2: Security • Inherits/propagates Kerberos principal for the jobs it launches • Access to files on HDFS can be controlled via HDFS security • Sqoop operates as server based application • Support for securing access to external systems via role-based access to Connection objects o Admins create/edit/delete Connections o Operators use Connections • Audit trail loggingWednesday, May 23, 12
  23. 23. Sqoop 1: Accessing External Systems • Every invocation requires necessary credentials to access external systems (e.g. relational database) o Workaround: Admin creates a limited access user in lieu of giving out password  Doesnt scale  Permission granularity is hard to obtain • Hard to prevent misuse once credentials are givenWednesday, May 23, 12
  24. 24. Sqoop 2: Accessing External Systems • Sqoop 2 introduces Connections as First-Class Objects o Connection encompass credentials o Connections created once, then used many times for various import/export Jobs o Connections created by Admin, used by Operator  Safeguard credential access from end user • Restrict scope: connections can be restricted based on operation (import/export) o Operators cannot abuse credentialsWednesday, May 23, 12
  25. 25. Sqoop 1: Resource Management • No explicit resource management policy o User specifies number of map jobs to run o Can’t throttle load on external systems o Can overwhelmed a cluster o Can acquire the whole stack memory on complex queriesWednesday, May 23, 12
  26. 26. Sqoop 2: Resource Management • Connections allow specification of resource policy o Admin can limit the total number of physical Connections open at one time o Connections can be disabledWednesday, May 23, 12
  27. 27. Security (summary) Sqoop 1 Sqoop 2 Support only for Hadoop security Support for Hadoop security and role-based access control to external systems High risk of abusing access to external Reduced risk of abusing access to external systems systems No resource management policy Resource management policyWednesday, May 23, 12
  28. 28. Takeaway Sqoop 2 Highights: • Ease of Use: Sqoop as a Service • Ease of Extension: Connectors benefit from shared functionality • Security: Connections as First-Class objects, Role- based SecurityWednesday, May 23, 12
  29. 29. Current Status: work-in-progress • Sqoop 2 Development: https:// issues.apache.org/jira/browse/SQOOP-365 • Sqoop 2 Design: https://cwiki.apache.org/ confluence/display/SQOOP/Sqoop+2Wednesday, May 23, 12
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×