Published in: Technology

  1. Apache Sqoop: Connecting Couchbase with Hadoop
     Arvind Prabhakar, Cloudera Inc.
     July 29, 2011
  2. Agenda
     • Background and Motivation
     • Design of Sqoop
     • Couchbase Plugin
     • Demo
  3. Apache Hadoop
     • A framework for Data Intensive and Distributed Applications
     • Inspired by Google’s MapReduce and Google File System papers
     [Diagram: HDFS with a Name Node and Data Nodes 1 to 3; MapReduce with a Job Tracker and Task Trackers 1 to 3]
  4. Data Storage
     Data storage is costly. Deleting data may be costlier!
     Hadoop:
     • Data Archival
     • Open Data Formats
     • Healthy Ecosystem
  5. Data Analysis
     • Structured Data Stores
     • Semi-Structured Data Stores
     • Ad-hoc Structured Data
     • Unstructured Data
  6. Introducing Sqoop
     • Easily import data into Hadoop
     • Generate datatypes for use in MapReduce applications
     • Integrate with Hive and HBase
     • Easily export data from Hadoop
  7. Motivation
     Without Sqoop:
     • Requires direct access to data from within Hadoop
     • Loss of efficiency due to network overhead
     • Impedance mismatch. MapReduce requires fast data access.
     • Can overwhelm external systems
     Using Sqoop:
     • Data Locality
     • Efficient operation for
     • Integration with Hadoop based systems – Hive, HBase
     • Optimized transfer speeds based on native tools
  8. Key Features
     • Command Line Interface
       – Scriptable
     • Integrates with Hadoop Ecosystem
       – Hive, HBase, Oozie
     • Automatic code generation
       – Use your data in MapReduce workflows
     • Connector based architecture
       – Support for connector specific optimizations
  9. Design Overview
     1. Metadata Lookup
     2. Generate Code
     3. Submit MR Job
     [Diagram: Sqoop, Datastore, Sqoop Record, and a MapReduce Job of four Map tasks, each paired with HDFS]
  10. Design Overview: Map-Only Implementation
      • InputFormat:
        – Selects Input Source
        – Defines Splits
        – Creates Record Readers
      • OutputFormat:
        – Selects Destination
        – Creates Record Writers
      [Diagram: InputFormat, three Split / RecordReader pairs, Map … Map, OutputFormat]
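The division of labor on this slide can be sketched in a few lines of Python. This is only an illustration of the roles, not Sqoop's or Hadoop's actual Java API: `ListInputFormat`, `ListOutputFormat`, and `run_map_only_job` are hypothetical names, the "input source" is an in-memory list, and the mappers run sequentially rather than on a cluster.

```python
from typing import Callable, Iterator, List, Tuple

Split = Tuple[int, int]  # half-open row range [start, end)


class ListInputFormat:
    """Hypothetical input format: selects the source, defines splits, creates readers."""

    def __init__(self, rows: List[str], num_splits: int):
        self.rows = rows              # the selected input source
        self.num_splits = num_splits

    def get_splits(self) -> List[Split]:
        n = len(self.rows)
        if n == 0:
            return []
        size = -(-n // self.num_splits)  # ceiling division
        return [(start, min(start + size, n)) for start in range(0, n, size)]

    def create_record_reader(self, split: Split) -> Iterator[str]:
        start, end = split
        return iter(self.rows[start:end])


class ListOutputFormat:
    """Hypothetical output format: selects the destination, creates writers."""

    def __init__(self) -> None:
        self.destination: List[str] = []

    def create_record_writer(self) -> Callable[[str], None]:
        return self.destination.append


def run_map_only_job(input_format, output_format, map_fn) -> None:
    # One mapper per split; there is no reduce phase.
    for split in input_format.get_splits():
        reader = input_format.create_record_reader(split)
        writer = output_format.create_record_writer()
        for record in reader:
            writer(map_fn(record))


inp = ListInputFormat(["a", "b", "c", "d", "e"], num_splits=2)
out = ListOutputFormat()
run_map_only_job(inp, out, str.upper)
```

Because each mapper writes straight to the destination, swapping the input or output format is enough to change where data comes from or goes to, which is presumably what makes the map-only shape a good fit for import and export.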
  11. Metadata Management
      • Sqoop Record
        – Dynamically generated
        – Independently packaged
          • May be used without Sqoop
        – Maintains type mapping
        – Different Serial Formats
          • Text
          • Binary
          • Avro Data File
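To make "maintains type mapping" concrete, here is a minimal Python sketch of what a generated record might look like, assuming a table with three columns. The class name `EmployeeRecord`, its columns, and the comma-delimited text format are all invented for illustration; the real generated classes are Java and are not shown in the deck.

```python
from typing import Any, Callable, Dict


class EmployeeRecord:
    """Hypothetical stand-in for a generated record class."""

    # Column name -> converter from the text form to the in-memory type.
    COLUMN_TYPES: Dict[str, Callable[[str], Any]] = {
        "id": int,
        "name": str,
        "salary": float,
    }

    def __init__(self, **values: Any) -> None:
        # Keep only the known columns, in declaration order.
        self.values = {col: values[col] for col in self.COLUMN_TYPES}

    def to_text(self, delimiter: str = ",") -> str:
        """Serialize to the delimited text format."""
        return delimiter.join(str(self.values[col]) for col in self.COLUMN_TYPES)

    @classmethod
    def from_text(cls, line: str, delimiter: str = ",") -> "EmployeeRecord":
        """Parse a delimited line, applying the type mapping column by column."""
        parts = line.split(delimiter)
        parsed = {
            col: convert(raw)
            for (col, convert), raw in zip(cls.COLUMN_TYPES.items(), parts)
        }
        return cls(**parsed)


record = EmployeeRecord.from_text("7,Ada,1250.5")
```

Since the class carries its own type mapping and serialization, it has no dependency on the tool that generated it, which matches the slide's point that the record may be used without Sqoop.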
  12. Import Operation
      • Generate SqoopRecord
        – Or use provided SqoopRecord
      • Create Input Splits
      • Spin Mappers to consume splits
      • Direct output to HDFS or HBase
        – Control compression, file type based on user input
      • Populate Hive Metastore
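The slide does not say how input splits are computed. One common approach, assumed here purely for illustration, is to range-partition a numeric key between its minimum and maximum so that each mapper reads a contiguous slice. The function name `range_splits` is hypothetical.

```python
from typing import List, Tuple


def range_splits(lo: int, hi: int, num_mappers: int) -> List[Tuple[int, int]]:
    """Split the inclusive key range [lo, hi] into at most num_mappers
    contiguous half-open ranges [start, end) of near-equal size."""
    total = hi - lo + 1
    if total <= 0 or num_mappers <= 0:
        return []
    base, extra = divmod(total, num_mappers)
    splits = []
    start = lo
    for i in range(num_mappers):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first splits
        if size == 0:
            break  # fewer keys than mappers
        splits.append((start, start + size))
        start += size
    return splits


splits = range_splits(1, 10, 4)
```

Each resulting range would be handed to one mapper, which reads only its own slice and writes it to HDFS or HBase.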
  13. Export Operation
      • Generate SqoopRecord
        – Or use provided SqoopRecord
      • Spin Mappers to consume input files
      • Each Mapper writes straight to external store
        – Optionally stage data before final export
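The optional staging step can be sketched as follows. The assumption here is that staging exists so a failed mapper does not leave a partial export visible in the external store; the slide itself only says staging is optional. The function `export_with_staging` and its `fail_on` parameter (which simulates a mapper failure) are hypothetical.

```python
from typing import Dict, List, Optional


def export_with_staging(
    input_files: Dict[str, List[str]],
    final_store: List[str],
    fail_on: Optional[str] = None,
) -> bool:
    """Write every input file to a staging area first, and promote the staged
    data to the final store only if all mappers succeed."""
    staging: List[str] = []
    for name, records in input_files.items():
        if name == fail_on:
            # Simulated mapper failure: abandon staging, leave final_store untouched.
            return False
        staging.extend(records)
    final_store.extend(staging)  # single promotion step after all mappers finish
    return True


files = {"part-0": ["k1=v1", "k2=v2"], "part-1": ["k3=v3"]}

store_ok: List[str] = []
ok = export_with_staging(files, store_ok)

store_failed: List[str] = []
failed = export_with_staging(files, store_failed, fail_on="part-1")
```

Without staging, the records from `part-0` would already be in the external store when `part-1` fails, which is the trade-off for the faster direct-write path.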
  14. Typical Workflow
      • Data imported from external systems
        – Periodic / Incremental imports for new data
      • Hadoop Analytics Processing
        – Hive / HBase tables
        – MapReduce Processing
      • Processed Data exported to external systems
        – Periodic / Incremental exports for new data
      • Workflow automation using Oozie
  15. Connectors
      • Drop-in Sqoop Extension
      • Specializes in connectivity with a particular system
      • Provides optimal data transfer mechanism
      • Based on Connector Mechanism of Sqoop
        – Varying degree of control
  16. Couchbase Plugin
      • Based on the Couchbase Tap Interface
      • Allows importing and exporting of entire database or of future key mutations
      1. Data imported via Tap mechanism
      2. Hadoop Processing
      3. Data exported back to Couchbase
      [Diagram: Couchbase and HDFS]
  17. Couchbase Import
      $ sqoop import --connect http://localhost:8091/pools --table DUMP
      $ sqoop import --connect http://localhost:8091/pools --table BACKFILL_5
      $ sqoop export --connect http://localhost:8091/pools --table DUMP --export-dir DUMP
      • For Imports, table must be:
        – DUMP: All keys currently in Couchbase
        – BACKFILL_n: All key mutations for n minutes
      • For Exports, table option is ignored
      • Specified --username maps to bucket
        – By default set to “default” bucket
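The two accepted forms of the import table option can be checked with a small parser. This is a sketch of the naming rule stated on the slide, not the plugin's own validation code; `parse_import_table` and the `("dump", None)` / `("backfill", n)` return shape are hypothetical.

```python
import re
from typing import Optional, Tuple


def parse_import_table(table: str) -> Tuple[str, Optional[int]]:
    """Interpret the --table value for an import.

    Returns ("dump", None) for DUMP, or ("backfill", n) for BACKFILL_n,
    where n is the number of minutes of key mutations to capture.
    """
    if table == "DUMP":
        return ("dump", None)
    match = re.fullmatch(r"BACKFILL_(\d+)", table)
    if match:
        return ("backfill", int(match.group(1)))
    raise ValueError(f"table must be DUMP or BACKFILL_n, got {table!r}")


dump_mode = parse_import_table("DUMP")
backfill_mode = parse_import_table("BACKFILL_5")
```

So `--table BACKFILL_5` in the second command above would be read as "all key mutations for 5 minutes".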
  18. Demo
  19. Thank You!
      • Couchbase: –
      • Hadoop: –
      • Sqoop: –
      • Cloudera: –