Couchconf-SF-Couchbase-Hadoop-Integration
 

Couchconf-SF-Couchbase-Hadoop-Integration Presentation Transcript

  • 1. Apache Sqoop: Connecting Couchbase with Hadoop. Arvind Prabhakar, Cloudera Inc. July 29, 2011
  • 2. Agenda
    • Background and Motivation
    • Design of Sqoop
    • Couchbase Plugin
    • Demo
  • 3. Apache Hadoop
    • A framework for Data Intensive and Distributed Applications.
    • Inspired by Google’s MapReduce and Google File System papers.
    (Diagram: HDFS with a Name Node and Data Nodes 1-3; MapReduce with a Job Tracker and Task Trackers 1-3.)
  • 4. Data Storage
    Data storage is costly. Deleting data may be costlier!
    Hadoop:
    • Data Archival
    • Open Data Formats
    • Healthy Ecosystem
  • 5. Data Analysis
    • Structured Data Stores
    • Semi-Structured Data Stores
    • Ad-hoc Structured Data
    • Unstructured Data
  • 6. Introducing Sqoop
    • Easily import data into Hadoop
    • Generate datatypes for use in MapReduce applications
    • Integrate with Hive and HBase
    • Easily export data from Hadoop
  • 7. Motivation
    Without Sqoop:
    • Requires direct access to data from within Hadoop
    • Loss of efficiency due to network overhead
    • Impedance mismatch: MapReduce requires fast data access
    • Can overwhelm external systems
    Using Sqoop:
    • Data locality
    • Efficient operation
    • Integration with Hadoop-based systems – Hive, HBase
    • Optimized transfer speeds based on native tools
  • 8. Key Features
    • Command Line Interface – Scriptable
    • Integrates with Hadoop Ecosystem – Hive, HBase, Oozie
    • Automatic code generation – Use your data in MapReduce workflows
    • Connector-based architecture – Support for connector-specific optimizations
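    As a rough illustration of the scriptable command-line interface, a typical Sqoop 1.x import looks like the sketch below; the JDBC URL, database, table, and credentials are placeholders, not values from the talk.

    # Hypothetical example: pull a relational table into HDFS from a script.
    $ sqoop import \
        --connect jdbc:mysql://db.example.com/sales \
        --username analyst -P \
        --table orders \
        --target-dir /data/sales/orders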
  • 9. Design Overview
    1. Metadata Lookup – Sqoop queries the datastore
    2. Generate Code – produces the Sqoop Record
    3. Submit MR Job – a MapReduce job whose map tasks write to HDFS
    (Diagram: Sqoop, the external datastore, and a map-only MapReduce job writing to HDFS.)
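    The code-generation step (2) is also exposed as a standalone Sqoop tool; a minimal sketch, with a placeholder connection URL and table name:

    # Hypothetical example: generate the record class without running an import.
    $ sqoop codegen \
        --connect jdbc:mysql://db.example.com/sales \
        --table orders \
        --outdir /tmp/sqoop-generated
    # Writes the generated record class (orders.java) under /tmp/sqoop-generated.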
  • 10. Design Overview: Map-Only Implementation
    • InputFormat:
      – Selects Input Source
      – Defines Splits
      – Creates Record Readers
    • OutputFormat:
      – Selects Destination
      – Creates Record Writers
    (Diagram: the InputFormat produces splits, each read by a RecordReader that feeds a Map task; the OutputFormat collects the map output.)
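    From the command line, the number of splits (and therefore map tasks) is typically steered with two options; the table and column names below are illustrative only.

    $ sqoop import \
        --connect jdbc:mysql://db.example.com/sales \
        --table orders \
        --split-by order_id \
        -m 8
    # --split-by names the column the InputFormat uses to define splits;
    # -m 8 requests eight splits, i.e. eight map tasks.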
  • 11. Metadata Management
    • Sqoop Record
      – Dynamically generated
      – Independently packaged
        • May be used without Sqoop
      – Maintains type mapping
      – Different serialization formats
        • Text
        • Binary
        • Avro Data File
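    On the import side, the serialization format is usually selected with one of the --as-* options; a minimal sketch, with placeholder connection details:

    # Text output (the default)
    $ sqoop import --connect jdbc:mysql://db.example.com/sales --table orders --as-textfile
    # Binary (SequenceFile) output
    $ sqoop import --connect jdbc:mysql://db.example.com/sales --table orders --as-sequencefile
    # Avro Data File output
    $ sqoop import --connect jdbc:mysql://db.example.com/sales --table orders --as-avrodatafile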
  • 12. Import Operation
    • Generate SqoopRecord
      – Or use a provided SqoopRecord
    • Create input splits
    • Spin up Mappers to consume the splits
    • Direct output to HDFS or HBase
      – Control compression and file type based on user input
    • Populate the Hive Metastore
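    A hedged example of an import exercising the compression and Hive options mentioned above; connection details and the table name are placeholders.

    $ sqoop import \
        --connect jdbc:mysql://db.example.com/sales \
        --table orders \
        --hive-import \
        --compress
    # --hive-import creates/updates the table in the Hive metastore;
    # --compress enables output compression (a specific codec can be
    # chosen with --compression-codec where supported).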
  • 13. Export Operation
    • Generate SqoopRecord
      – Or use a provided SqoopRecord
    • Spin up Mappers to consume the input files
    • Each Mapper writes straight to the external store
      – Optionally stage data before the final export
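    The staging option maps to a pair of export flags; a sketch with placeholder names (the staging table must already exist with the same schema as the target):

    $ sqoop export \
        --connect jdbc:mysql://db.example.com/sales \
        --table orders \
        --export-dir /data/sales/orders \
        --staging-table orders_staging \
        --clear-staging-table
    # Rows land in orders_staging first and are moved to orders only if the export succeeds.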
  • 14. Typical Workflow
    • Data imported from external systems
      – Periodic / incremental imports for new data
    • Hadoop analytics processing
      – Hive / HBase tables
      – MapReduce processing
    • Processed data exported to external systems
      – Periodic / incremental exports for new data
    • Workflow automation using Oozie
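    Periodic incremental imports are typically expressed with Sqoop's incremental options; the check column and last value below are illustrative.

    $ sqoop import \
        --connect jdbc:mysql://db.example.com/sales \
        --table orders \
        --incremental append \
        --check-column order_id \
        --last-value 1000000
    # Only rows with order_id > 1000000 are imported; a scheduler such as Oozie
    # can rerun this periodically, supplying the new --last-value each time.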
  • 15. Connectors
    • Drop-in Sqoop extension
    • Specializes in connectivity with a particular system
    • Provides an optimal data transfer mechanism
    • Based on the connector mechanism of Sqoop
      – Varying degrees of control
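    A specific connector can usually be forced from the command line instead of letting Sqoop infer it from the connect string; both the URL scheme and the manager class below are hypothetical placeholders, not real connector names.

    $ sqoop import \
        --connect jdbc:example://host/db \
        --connection-manager com.example.sqoop.ExampleConnManager \
        --table events
    # --connection-manager names the connector implementation explicitly.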
  • 16. Couchbase Plugin
    • Based on the Couchbase Tap Interface
    • Allows importing and exporting of the entire database or of future key mutations
    (Diagram: 1. Data imported from Couchbase via the Tap mechanism into HDFS; 2. Hadoop processing; 3. Data exported back to Couchbase.)
  • 17. Couchbase Import
    $ sqoop import --connect http://localhost:8091/pools --table DUMP
    $ sqoop import --connect http://localhost:8091/pools --table BACKFILL_5
    $ sqoop export --connect http://localhost:8091/pools --table DUMP --export-dir DUMP
    • For imports, table must be:
      – DUMP: all keys currently in Couchbase
      – BACKFILL_n: all key mutations for n minutes
    • For exports, the table option is ignored
    • The specified --username maps to the bucket
      – By default set to the “default” bucket
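    Assuming the bucket mapping above, a non-default bucket would presumably be selected with --username; the bucket name here is a made-up placeholder.

    # Hypothetical: dump the entire contents of the "beer-sample" bucket.
    $ sqoop import \
        --connect http://localhost:8091/pools \
        --table DUMP \
        --username beer-sample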
  • 18. Demo
  • 19. Thank You!
    • Couchbase: www.couchbase.com
    • Hadoop: hadoop.apache.org
    • Sqoop: incubator.apache.org/projects/sqoop.html
    • Cloudera: www.cloudera.com