May 2013 HUG: Apache Sqoop 2 - A next generation of data transfer tools
Apache Sqoop 2 is the next generation of the massively successful open source tool designed to transfer data from traditional SQL databases and data warehouses into Apache Hadoop. Sqoop 2 is designed as a client-server system with a repository that stores connection and job information, and it supports secure job submission and multiple user roles. In this talk, we discuss the issues users faced with Sqoop 1, the design of Sqoop 2, and how Sqoop 2 addresses those issues.

Presenter(s): Hari Shreedharan, Software Engineer, Cloudera


© All Rights Reserved

  • A common problem that new users of Hadoop face is how to get their data into Hadoop. Using a tool called distcp, Hadoop users can efficiently copy data between clusters. This raises the question: how do I get all my data into Hadoop in the first place? One such ingest tool is Sqoop, which was created to efficiently transfer bulk data between Hadoop and external structured datastores, since databases are not directly accessible from Hadoop. The canonical use case is performing a nightly dump of all the data in a transactional relational database into Hadoop for offline analysis. The popularity of Sqoop in enterprise systems confirms that Sqoop handles bulk transfer admirably. That said, there are further data-integration use cases that Sqoop needs to address, and it must become easier to manage and operate.
  • Bulk data transfer tool. Imports/exports from/to relational databases, enterprise data warehouses, and NoSQL systems. Populates tables in HDFS, Hive, and HBase.
  • * You might ask: why Sqoop? What is the additional value? * Why not simply run mysqldump?
  • Command-line tool: a client application configured via command-line arguments (60+!), plus connector-specific extra arguments, plus Hadoop -D arguments. Pluggable, database-specific connectors. Uses local Hadoop binaries and configuration. Workflow: command input from the client, connect to the database and fetch metadata, generate code, submit the MapReduce job.
  • * The strength of Sqoop is one tool for all database systems * Heterogeneous environments * The scope of a connector is not limited to metadata operations * Going back to my previous example, Sqoop can run mysqldump for you * One drawback of this approach: too much power, not always used wisely
  • The driver option is non-intuitive: specifying it means you want the generic connector instead of the specialized built-in connector. Security concerns: credentials are openly shared, every client needs to know the database credentials, and database access can't be centrally throttled. Some connectors may support a data format that others don't (e.g., the direct MySQL connector can't support sequence files). Sqoop users should not need to concern themselves with Sqoop admin responsibilities. Couchbase does not need a table name, yet one is required, causing --table to be overloaded as a backfill or dump operation.
  • * Sqoop is a great tool that does the job very well * The challenges that both users and developers need to overcome were covered on the previous slide * Too much power means too much responsibility
  • In Sqoop 2's client-server architecture, the client interacts with the server and no longer talks directly to Hadoop or the database. Connectors and JDBC drivers are installed and configured on the server, which interfaces with the metadata repository and performs resource throttling. Serialization, format conversion, and Hive/HBase integration become uniformly available via the Sqoop framework. Sqoop 2 is a service: install once, run everywhere. Connectors are configured in one place, managed by the Admin role and run by the Operator role; likewise, JDBC drivers live in one place, and database connectivity is only needed on the server. With Sqoop 2, Hive and HBase integration happens on the server, not the client. In fact, Hive does not need to be installed on the Sqoop server at all: Sqoop submits requests to HiveServer over the wire. Exposing a REST API for operation and management helps Sqoop integrate with external systems such as Oozie, and the two become decoupled: installing a new Sqoop connector no longer requires installing it in Oozie as well. (Hive does not invoke Sqoop, while Oozie does, so the REST API benefits Oozie rather than Hive.) Which Hive/HBase server the data lands in is the responsibility of the reduce phase, which has its own configuration; since both of these systems run on Hadoop, no added security is needed beyond passing down the Kerberos principal.
  • * The cost of improving the existing Sqoop 1 would be too high, with no clear path to addressing some of the challenges
  • What happens when you submit a Sqoop job? You create a connection and submit a job. With the user making an explicit connector choice in Sqoop 2, the process is less error-prone and more predictable; users need not be aware of the functionality of every connector (e.g., Couchbase users need not care that other connectors use tables). Connectors are no longer forced to follow the JDBC model and are no longer required to use common JDBC vocabulary (URL, database, table, etc.) regardless of whether it applies. Connections become first-class objects that encompass credentials and are created once, then used many times for various import/export jobs. Metadata is divided into two groups: Connections and Jobs. Sqoop 2 saves the metadata in the server, which houses a metadata repository, so you don't have to keep retyping it, nor are you responsible for securing it.
  • * So far we've talked a lot about high-level structures, so let's take a look at the workflow of creating a job and how it differs from Sqoop 1 * Let's discuss how it will work from the user's perspective
  • No code generation and no compilation: Sqoop can run where there are no compilers, which is more secure because no generated code is executed. Previously, clients required direct access to Hive/HBase; routing everything through the Sqoop server is more secure than opening access for all clients to perform jobs. Connections are created by an administrator and used by an operator (safeguarding credential access from end users) and can be restricted in scope by operation (import/export), so operators cannot abuse credentials. Operators are allowed to create jobs that import/export specific tables. This prevents misuse and abuse.
  • Register for HBaseCon 2013. If you have questions, Jarcec will have office hours in the Cloudera booth today at 5pm, and Kate will have office hours tomorrow at 11am.
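To make the argument-heavy Sqoop 1 workflow from the notes above concrete, here is a sketch of a typical import invocation assembled as an argument list. The flags are standard Sqoop 1 options; the host, credentials, table, and paths are illustrative placeholders:

```python
# Sketch of a typical Sqoop 1 import command line. The flags shown are
# standard Sqoop 1 options; every value below is a placeholder.
sqoop_import = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost/sales",  # JDBC URL to the source database
    "--username", "etl_user",
    "--password", "secret",                    # credentials exposed on every client
    "--table", "orders",                       # source table to dump
    "--target-dir", "/warehouse/orders",       # destination directory in HDFS
    "--num-mappers", "4",                      # parallelism of the MapReduce job
]
print(" ".join(sqoop_import))
```

Every client machine needs this full set of arguments, the JDBC driver, and the database credentials, which is exactly the operational burden Sqoop 2 moves onto the server.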
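The Connection/Job metadata split described in the notes can be sketched as a toy model. This is a conceptual illustration of the server-side repository, not Sqoop 2's actual classes:

```python
from dataclasses import dataclass

# Conceptual model of Sqoop 2 metadata -- not the real Sqoop 2 classes.
# A Connection is created once per database (by an admin) and holds the
# credentials; a Job is created per table and merely references a Connection.

@dataclass
class Connection:
    conn_id: int
    jdbc_url: str
    username: str
    password: str  # stored server-side, never handed out to clients

@dataclass
class Job:
    name: str
    connection_id: int  # many jobs reuse the same Connection
    table: str

sales_db = Connection(1, "jdbc:mysql://dbhost/sales", "etl_user", "secret")
jobs = [Job("import-orders", sales_db.conn_id, "orders"),
        Job("import-customers", sales_db.conn_id, "customers")]
print([j.name for j in jobs])
```

The point of the split: the credentials live in one Connection object on the server, while operators only ever touch Job objects.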
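As a sketch of what "operation via a REST API" means in practice, the fragment below builds the kind of JSON document a client might POST to the Sqoop 2 server to create a job. The endpoint path and field names here are hypothetical illustrations, not the actual Sqoop 2 wire format:

```python
import json

# Hypothetical REST payload for creating an import job on the Sqoop 2 server.
# The endpoint path and field names are assumptions for illustration only.
endpoint = "/sqoop/job"  # assumed path on the Sqoop 2 server
payload = {
    "name": "nightly-orders-import",
    "connection-id": 1,           # references a Connection stored on the server
    "type": "IMPORT",
    "table": "orders",
    "output-directory": "/warehouse/orders",
}
body = json.dumps(payload)
print("POST", endpoint, body)
```

Because the request carries only a connection id, an external system such as Oozie can submit jobs without ever holding database credentials or connector binaries.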

Presentation Transcript

  • Apache Sqoop 2: The next generation of data transfer
    Hari Shreedharan, Software Engineer, Cloudera
    Contributor, Apache Sqoop; Committer/PMC Member, Apache Flume
  • What is Sqoop?
    • Apache Top-Level Project
    • SQl to hadOOP
    • Tool to transfer data from relational databases: Teradata, MySQL, PostgreSQL, Oracle, Netezza
    • To the Hadoop ecosystem: HDFS (text, sequence file), Hive, HBase, Avro
    • And vice versa
  • Why Sqoop?
    • Efficient/controlled resource utilization: concurrent connections, time of operation
    • Datatype mapping and conversion: automatic, with user override
    • Metadata propagation: Sqoop Record, Hive Metastore, Avro
  • Sqoop 1
  • Sqoop 1
    • Based on connectors, responsible for metadata lookups and data transfer
    • Majority of connectors are JDBC based
    • Non-JDBC (direct) connectors for optimized data transfer
    • Connectors responsible for all supported functionality: HBase import, Avro support, ...
  • Sqoop 1 Challenges
    • Cryptic, contextual command-line arguments
    • Security concerns
    • Type mapping is not clearly defined
    • Client needs access to Hadoop binaries/configuration and the database
    • JDBC model is enforced
  • Sqoop 1 Challenges
    • Non-uniform functionality: different connectors support different capabilities
    • Overlapped/duplicated functionality: different connectors may implement the same capabilities differently
    • High coupling with Hadoop: database vendors required to understand Hadoop idiosyncrasies in order to build connectors
  • Sqoop 2
  • Sqoop 2 – Design Goals
    • Security and separation of concerns: role-based access and use
    • Ease of extension: no low-level Hadoop knowledge needed; no functional overlap between connectors
    • Ease of use: uniform functionality; domain-specific interactions
  • Sqoop 2: Connection vs Job metadata
    There are two distinct sets of options to pass into Sqoop:
    • Job (distinct per table)
    • Connection (distinct per database)
  • Sqoop 2: Workings
    • Connectors register metadata
    • Metadata enables creation of Connections and Jobs
    • Connections and Jobs are stored in the metadata repository
    • Operators run Jobs that use appropriate Connections
    • Admins set policy for connection use
  • Sqoop 2: Security
    • Support for secure access to external systems via role-based access to connection objects
    • Administrators create/edit/delete connections
    • Operators use connections
  • Current Status: Sqoop 2
    • Primary focus of the Sqoop community
    • Current release: 1.99.2
    • Bits and docs: http://sqoop.apache.org/
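The role separation on the security slide can be sketched in a few lines. The role names and the permission table below are assumptions for illustration; Sqoop 2's real authorization model is configured on the server:

```python
# Toy illustration of role-based access to connection objects.
# Role names and the permission table are assumptions for illustration.
PERMISSIONS = {
    "admin":    {"create_connection", "edit_connection", "delete_connection"},
    "operator": {"use_connection", "run_job"},
}

def allowed(role: str, action: str) -> bool:
    """Return True if the given role may perform the action."""
    return action in PERMISSIONS.get(role, set())

print(allowed("operator", "run_job"))            # operators may run jobs
print(allowed("operator", "create_connection"))  # but may not create connections
```

Because the check runs on the server rather than on each client, credentials never leave the connection objects that administrators manage.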