Hadoop Summit 2012 | A New Generation of Data Transfer Tools for Hadoop: Sqoop 2
 

Apache Sqoop (incubating) was created to efficiently transfer big data between Hadoop-related systems (such as HDFS, Hive, and HBase) and structured data stores (such as relational databases, data warehouses, and NoSQL systems). The popularity of Sqoop in enterprise systems confirms that Sqoop does bulk transfer admirably. In the meantime, we have encountered many new challenges that have outgrown the abilities of the current infrastructure. To fulfill more data integration use cases as well as become easier to manage and operate, a new generation of Sqoop, also known as Sqoop 2, is under development to address several key areas, including ease of use, ease of extension, and security. This session covers Sqoop 2 from both the development and operations perspectives.

Speaker Notes

  • Apache Sqoop was created to efficiently transfer bulk data between Hadoop and external structured datastores, because databases are not easily accessible by Hadoop. The popularity of Sqoop in enterprise systems confirms that Sqoop does bulk transfer admirably. That said, to enhance its functionality, Sqoop needs to fulfill more data integration use cases as well as become easier to manage and operate.
  • It’s an Apache TLP (Top-Level Project) now.
  • Different connectors interpret these options differently. Some options are not understood for the same operation by different connectors, while some connectors have custom options that do not apply to others. This is confusing for users and detrimental to effective use.
    – Due to tight coupling between data transfer and the serialization format, some connectors may support a certain data format that others don’t (e.g. the direct MySQL connector can’t support sequence files), as illustrated in the sketch below.
    – Cryptic and contextual command line arguments can lead to error-prone connector matching, resulting in user errors.
    – There are security concerns with openly shared credentials.
    – By requiring root privileges, local configuration and installation are not easy to manage.
    – Debugging the map job is limited to turning on the verbose flag.
    – Connectors are forced to follow the JDBC model and are required to use common JDBC vocabulary (URL, database, table, etc.), regardless of whether it is applicable.
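    To make the data-format coupling concrete, here is a minimal sketch of the kind of failure described above. The flags (--direct, --as-sequencefile) are real Sqoop 1 options, but the database, table, and exact failure behavior are illustrative and vary by version:

        # Works: the direct (mysqldump-based) MySQL path writes text output.
        $ sqoop import --connect jdbc:mysql://db.example.com/shop \
            --username sqoop_user --table orders --direct

        # Fails: the direct MySQL connector bypasses the serialization layer,
        # so it cannot honor a SequenceFile output format.
        $ sqoop import --connect jdbc:mysql://db.example.com/shop \
            --username sqoop_user --table orders --direct --as-sequencefile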
  • Sqoop 1’s challenges will be addressed by the Sqoop 2 architecture:
    – Connectors only focus on connectivity.
    – Serialization, format conversion, and Hive/HBase integration should be uniformly available via the framework.
    – Repository Manager: creates/edits connectors, connections, jobs.
    – Connector Manager: registers new connectors, enables/disables connectors.
    – Job Manager: submits new jobs to MapReduce, gets job progress, kills specific jobs.
  • Pause for questions. Kathleen to take over. Sqoop 2 is a work in progress and you are welcome to join our weekly conference calls discussing its design and implementation. Please see me afterwards for more details if interested. During our discussions we’ve identified three pain points for Sqoop 2 to address; those of you who have used Sqoop will find these points apparent: ease of use, ease of extension, and security.
  • There are 4 points we want to address with Sqoop 2’s ease of use.
  • Like other client-side tools, Sqoop 1 requires everything – connectors, root privileges, drivers, db connectivity – to be installed and configured locally.
  • Sqoop 2 will be a service, and as such you can install it once and then run it everywhere. This means that connectors will be configured in one place, managed by the Admin role and run by the Operator role, which will be discussed in detail later. Likewise, JDBC drivers will be in one place and database connectivity will only be needed on the server. Sqoop 2 is a web-based service that exposes a REST API, front-ended by a CLI and a browser, and back-ended by a metadata repository. An example of a document-based system is Couchbase. Sqoop 1 has something called the Sqoop metastore, which is similar to a repository for metadata but not quite. That said, the model of operation for Sqoop 1 and Sqoop 2 is very different: Sqoop 1 was a limited-vocabulary tool while Sqoop 2 is more metadata driven. The design of Sqoop 2’s metadata repository is such that it can be replaced by other providers. A sketch of talking to such a service follows below.
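    A minimal sketch of interacting with the Sqoop 2 server over REST. The host, port, and resource paths (/sqoop/v1/...) are illustrative assumptions, not the final API, which was still under design at the time of this talk:

        # Hypothetical: list the connectors registered on the Sqoop 2 server.
        $ curl http://sqoop-server.example.com:12000/sqoop/v1/connectors

        # Hypothetical: fetch the status of a running job by id.
        $ curl http://sqoop-server.example.com:12000/sqoop/v1/jobs/42/status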
  • Sqoop 1 was intended for power users, as evidenced by its CLI. Sqoop 2 has two modes: one for the power user and one for the newbie. Those new to Sqoop will appreciate the interactive UI, which walks you through import/export setup, eliminating redundant or incorrect options. Various connectors are added in one place, with connectors exposing the necessary options to the Sqoop framework and with the user only required to provide info relevant to their use case. The client is not bound to a terminal and has well-documented return codes. A sketch of an interactive session follows below.
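    A hypothetical interactive session for the newbie mode. The prompt, command names, and questions asked are assumptions based on the design direction described here, not shipped syntax:

        $ sqoop client
        sqoop:000> create connection
        Connector to use: generic-jdbc
        JDBC connection string: jdbc:mysql://db.example.com/shop
        Username: sqoop_user
        Connection created with id 1
        sqoop:000> create job --connection 1 --type import
        Table name: orders
        Job created with id 7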
  • In Sqoop 1, Hive and HBase require local installation. Currently Oozie launches Sqoop by bundling it and running it on the cluster, which is error-prone and difficult to debug.
  • With Sqoop 2, Hive and HBase integration happens not from the client but from the backend. Hive does not need to be installed alongside Sqoop at all; Sqoop will submit requests to the HiveServer over the wire. Exposing a REST API for operation and management will help Sqoop integrate better with external systems such as Oozie. Oozie and Sqoop will be decoupled: if you install a new Sqoop connector, you don’t need to install it in Oozie as well. Hive will not invoke anything in Sqoop, while Oozie does invoke Sqoop, so the REST API does not benefit Hive in any way, but it does benefit Oozie. Which Hive/HBase server the data will be put into is the responsibility of the reduce phase, which will have its own configuration, and since both of these systems are on Hadoop, we don’t need any added security besides passing down the Kerberos principal.
  • Four points.
  • Pause for questions.
  • Because Sqoop is heavily JDBC centric, it’s not easy to work with non-relational databases. The Couchbase implementation required a different interpretation. There are inconsistencies between connectors.
  • Two phases: first, transfer; second, transform/integration with other components.
    – Option to opt out of downstream processing (i.e. revert to Sqoop 1 behavior).
    – Trade-off between ease of connector/tooling development and faster performance.
    – Separating data transfer (Map) from data transform (Reduce) allows connectors to specialize.
    – Connectors benefit from a common framework of functionality and don’t have to worry about forward compatibility.
    – Functionally, Sqoop 2 is a superset of Sqoop 1, but it does things in a different way. It is too early in the design process to tell whether the same CLI commands could be used, but most likely not, primarily because it is a fundamentally incompatible change.
    – The Reduce phase is limited to stream transformations (no aggregation to start with).
  • The former is running the MySQL connector, because specifying the --driver option prevents the MySQL connector from being selected, i.e. the latter would end up using the generic JDBC connector.
  • Based on the URL in the --connect string used to access the database (which is JDBC centric and something Sqoop 2 is moving away from), Sqoop attempts to predict which driver it should load. What are connectors? They are plugin components based on Sqoop’s extension framework that efficiently transfer data between Hadoop and external stores, and they are meant for optimized import/export or for stores that don’t support native JDBC. Bundled connectors: MySQL, PostgreSQL, Oracle, SQL Server, JDBC. High-performance data transfer: Direct MySQL, Direct PostgreSQL.
  • Cryptic and contextual command line arguments can lead to error-prone connector matching, resulting in user errors. Due to tight coupling between data transfer and the serialization format, some connectors may support a certain data format that others don't (e.g. direct MySQL connector can't support sequence files).
  • With the user making an explicit connector choice in Sqoop 2, connector selection will be less error-prone and more predictable. Connectors are no longer forced to follow the JDBC model and are no longer required to use common JDBC vocabulary (URL, database, table, etc.), regardless of whether it is applicable. A sketch of explicit selection follows below.
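    A hypothetical sketch of explicit connector selection. The command shape and connector names are assumptions based on the design direction, not shipped syntax:

        # Hypothetical: the connector is named up front, so Sqoop never
        # guesses from the JDBC URL...
        sqoop:000> create connection --connector direct-mysql

        # ...and non-JDBC stores need not pretend to have URLs or tables.
        sqoop:000> create connection --connector couchbase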
  • Common functionality will be abstracted out of connectors, holding them responsible only for data transport. The reduce phase will implement common functionality, ensuring that connectors benefit from future development of functionality.
  • Pause for questions. Bilung to take over.
  • No code generation and no compilation allow Sqoop to run where there are no compilers, which makes it more secure by preventing bad code from running. Previously, direct access to Hive/HBase was required. It is more secure because requests are routed through the Sqoop server rather than opening up access for all clients to perform jobs.
  • Connections are only for external systems.
  • No need to disable the user in the database.

Presentation Transcript

  • A New Generation of Data Transfer Tools for Hadoop: Sqoop 2 | Bilung Lee (blee at cloudera dot com), Kathleen Ting (kathleen at cloudera dot com) | Hadoop Summit 2012, 6/13/12
  • Who Are We?
    – Bilung Lee: Apache Sqoop Committer; Software Engineer, Cloudera
    – Kathleen Ting: Apache Sqoop Committer; Support Manager, Cloudera
  • What is Sqoop?
    – Bulk data transfer tool (a typical invocation is sketched below): import/export from/to relational databases, enterprise data warehouses, and NoSQL systems
    – Populate tables in HDFS, Hive, and HBase
    – Integrate with Oozie as an action
    – Support plugins via connector based architecture
    – Timeline: May ’09, first version (HADOOP-5815); March ’10, moved to GitHub; August ’11, moved to Apache; April ’12, Apache Top Level Project
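    A minimal sketch of the kind of bulk transfer Sqoop 1 performs. The flags are standard Sqoop 1 options; the connect string, table names, and paths are made-up examples:

        # Import a relational table into HDFS using 4 parallel map tasks.
        $ sqoop import --connect jdbc:mysql://db.example.com/shop \
            --username sqoop_user --table orders \
            --target-dir /data/orders -m 4

        # Export HDFS data back into a relational table.
        $ sqoop export --connect jdbc:mysql://db.example.com/shop \
            --username sqoop_user --table order_summary \
            --export-dir /data/order_summary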
  • Sqoop 1 Architecture [diagram: a Sqoop command launches Hadoop Map Tasks that transfer data between external stores (document based systems, enterprise data warehouses, relational databases) and HDFS/HBase/Hive]
  • Sqoop 1 Challenges
    – Cryptic, contextual command line arguments
    – Tight coupling between data transfer and output format
    – Security concerns with openly shared credentials
    – Not easy to manage installation/configuration
    – Connectors are forced to follow JDBC model
  • Sqoop 2 Architecture [architecture diagram]
  • Sqoop 2 Themes
    – Ease of Use
    – Ease of Extension
    – Security
  • Ease of Use (Sqoop 1 vs. Sqoop 2)
    – Client-only architecture vs. client/server architecture
    – CLI based vs. CLI + web based
    – Client access to Hive, HBase vs. server access to Hive, HBase
    – Oozie and Sqoop tightly coupled vs. Oozie finds REST API
  • Sqoop 1: Client-side Tool
    – Client-side installation + configuration:
      • Connectors are installed/configured locally
      • Local installation requires root privileges
      • JDBC drivers are needed locally
      • Database connectivity is needed locally
  • Sqoop 2: Sqoop as a Service
    – Server-side installation + configuration:
      • Connectors are installed/configured in one place
      • Managed by administrator and run by operator
      • JDBC drivers are needed in one place
      • Database connectivity is needed on the server
  • Client Interface
    – Sqoop 1 client interface:
      • Command line interface (CLI) based
      • Can be automated via scripting
    – Sqoop 2 client interface:
      • CLI based (in either interactive or script mode)
      • Web based (remotely accessible)
      • REST API is exposed for external tool integration
  • Sqoop 1: Service Level Integration
    – Hive, HBase: require local installation
    – Oozie: von Neumann(esque) integration:
      • Package Sqoop as an action
      • Then run Sqoop from node machines, causing one MR job to be dependent on another MR job
      • Error-prone, difficult to debug
  • Sqoop 2: Service Level Integration
    – Hive, HBase: server-side integration
    – Oozie: REST API integration
  • Ease of Use (Sqoop 1 vs. Sqoop 2, recap)
    – Client-only architecture vs. client/server architecture
    – CLI based vs. CLI + web based
    – Client access to Hive, HBase vs. server access to Hive, HBase
    – Oozie and Sqoop tightly coupled vs. Oozie finds REST API
  • Sqoop 2 Themes
    – Ease of Use
    – Ease of Extension
    – Security
  • Ease of Extension (Sqoop 1 vs. Sqoop 2)
    – Connector forced to follow JDBC model vs. connector given free rein
    – Connectors must implement functionality vs. connectors benefit from common framework of functionality
    – Connector selection is implicit vs. connector selection is explicit
  • Sqoop 1: Implementing Connectors
    – Connectors are forced to follow the JDBC model:
      • Connectors are limited/required to use common JDBC vocabulary (URL, database, table, etc.)
    – Connectors must implement all Sqoop functionality they want to support:
      • New functionality may not be available for previously implemented connectors
  • Sqoop 2: Implementing Connectors
    – Connectors are not restricted to the JDBC model:
      • Connectors can define their own domain
    – Common functionality is abstracted out of connectors:
      • Connectors are only responsible for data transfer
      • A common Reduce phase implements data transformation and system integration
      • Connectors can benefit from future development of common functionality
  • Different Options, Different Results
    Which is running MySQL?
      $ sqoop import --connect jdbc:mysql://localhost/db --username foo --table TEST
      $ sqoop import --connect jdbc:mysql://localhost/db --driver com.mysql.jdbc.Driver --username foo --table TEST
    – Different options may lead to unpredictable results:
      • Sqoop 2 requires explicit selection of a connector, thus disambiguating the process
  • Sqoop 1: Using Connectors
    – Choice of connector is implicit:
      • In a simple case, based on the URL in the --connect string used to access the database
      • Specification of different options can lead to different connector selection
      • Error-prone but good for power users
  • Sqoop 1: Using Connectors
    – Requires knowledge of database idiosyncrasies:
      • e.g. Couchbase does not need a table name, yet --table is required, causing it to get overloaded as a backfill or dump operation
      • e.g. the --null-string representation is not supported by all connectors
    – Functionality is limited to what the implicitly chosen connector supports
  • Sqoop 2: Using Connectors
    – Users make an explicit connector choice:
      • Less error-prone, more predictable
    – Users need not be aware of the functionality of all connectors:
      • Couchbase users need not care that other connectors use tables
  • Sqoop 2: Using Connectors
    – Common functionality is available to all connectors:
      • Connectors need not worry about common downstream functionality, such as transformation into various formats and integration with other systems
  • Ease of Extension (Sqoop 1 vs. Sqoop 2, recap)
    – Connector forced to follow JDBC model vs. connector given free rein
    – Connectors must implement functionality vs. connectors benefit from common framework of functionality
    – Connector selection is implicit vs. connector selection is explicit
  • Sqoop 2 Themes
    – Ease of Use
    – Ease of Extension
    – Security
  • Security (Sqoop 1 vs. Sqoop 2)
    – Support only for Hadoop security vs. support for Hadoop security and role-based access control to external systems
    – High risk of abusing access to external systems vs. reduced risk of abusing access to external systems
    – No resource management policy vs. resource management policy
  • Sqoop 1: Security
    – Inherits/propagates the Kerberos principal for the jobs it launches
    – Access to files on HDFS can be controlled via HDFS security
    – Limited support (user/password) for secure access to external systems
  • Sqoop 2: Security
    – Inherits/propagates the Kerberos principal for the jobs it launches
    – Access to files on HDFS can be controlled via HDFS security
    – Support for secure access to external systems via role-based access to connection objects:
      • Administrators create/edit/delete connections
      • Operators use connections
  • Sqoop 1: External System Access
    – Every invocation requires the necessary credentials to access external systems (e.g. a relational database):
      • Workaround: create a user with limited access in lieu of giving out the password
      • Does not scale
      • Permission granularity is hard to obtain
    – Hard to prevent misuse once credentials are given (see the sketch below)
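    A minimal sketch of the credential exposure described above, using real Sqoop 1 flags; the connect string and account are made-up examples:

        # The database password travels with every invocation: it ends up in
        # shell history and is visible to anyone who can run `ps` on the box.
        $ sqoop import --connect jdbc:mysql://db.example.com/shop \
            --username sqoop_user --password s3cret --table orders

        # Prompting with -P keeps it out of the process list, but every
        # end user still has to know the shared database credential.
        $ sqoop import --connect jdbc:mysql://db.example.com/shop \
            --username sqoop_user -P --table orders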
  • Sqoop 2: External System Access
    – Connections are enabled as first-class objects:
      • Connections encompass credentials
      • Connections are created once and then used many times for various import/export jobs
      • Connections are created by the administrator and used by the operator, safeguarding credential access from end users
    – Connections can be restricted in scope based on operation (import/export):
      • Operators cannot abuse credentials
    A hypothetical admin/operator workflow is sketched below.
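    A hypothetical sketch of the admin/operator split around connection objects. The command names and flags are assumptions based on the design described here, not shipped syntax:

        # Admin: create a connection once, embedding the credentials,
        # and restrict it to import jobs only.
        sqoop:000> create connection --connector generic-jdbc \
            --url jdbc:mysql://db.example.com/shop \
            --username sqoop_user --scope import
        Connection created with id 3

        # Operator: reference the connection by id; the password itself
        # is never handed to the operator.
        sqoop:000> create job --connection 3 --type import --table orders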
  • Sqoop 1: Resource Management
    – No explicit resource management policy:
      • Users specify the number of map jobs to run
      • Cannot throttle load on external systems
  • Sqoop 2: Resource Management
    – Connections allow specification of a resource management policy:
      • Administrators can limit the total number of physical connections open at one time
      • Connections can also be disabled
    A hypothetical throttling example is sketched below.
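    A hypothetical example of the throttling policy described above; the commands and flags are illustrative assumptions, not shipped syntax:

        # Admin: cap this connection at 10 simultaneous physical
        # database connections, regardless of how many jobs use it.
        sqoop:000> update connection --id 3 --max-connections 10

        # Admin: take the connection out of service entirely.
        sqoop:000> disable connection --id 3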
  • Security (Sqoop 1 vs. Sqoop 2, recap)
    – Support only for Hadoop security vs. support for Hadoop security and role-based access control to external systems
    – High risk of abusing access to external systems vs. reduced risk of abusing access to external systems
    – No resource management policy vs. resource management policy
  • Demo Screenshots [five slides of screenshots, not captured in the transcript]
  • Takeaway: Sqoop 2 Highlights
    – Ease of Use: Sqoop as a Service
    – Ease of Extension: connectors benefit from shared functionality
    – Security: connections as first-class objects and role-based security
  • Current Status: work in progress
    – Sqoop2 Development: http://issues.apache.org/jira/browse/SQOOP-365
    – Sqoop2 Blog Post: http://blogs.apache.org/sqoop/entry/apache_sqoop_highlights_of_sqoop
    – Sqoop2 Design: http://cwiki.apache.org/confluence/display/SQOOP/Sqoop+2
  • Current Status: work in progress (continued)
    – Sqoop2 Quickstart: http://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Quickstart
    – Sqoop2 Resource Layout: http://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+-+Resource+Layout
    – Sqoop2 Feature Requests: http://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Feature+Requests