Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison, Quest Software
 

As Hadoop graduates from pilot project to a mission-critical component of the enterprise IT infrastructure, integrating information held in Hadoop and in enterprise RDBMS becomes imperative. We’ll look at key scenarios driving Hadoop and RDBMS integration and review technical options. In particular, we’ll take a deep dive into the Apache SQOOP project, which expedites data movement between Hadoop and any JDBC database and provides a framework that allows developers and vendors to create connectors optimized for specific targets such as Oracle and Netezza.


Presentation Transcript

  • Integrating Hadoop with Enterprise RDBMS Using Apache SQOOP and Other Tools. Guy Harrison, Quest Software. ©2011 Quest Software, Inc. All rights reserved.
  • Introductions
  • Agenda
    • Scenarios for RDBMS-Hadoop interaction
    • Case study: Quest extension to SQOOP
    • Other RDBMS-Hadoop integrations
  • Hadoop meets RDBMS – scenarios
  • Scenario #1: Reference data in RDBMS [diagram labels: PRODUCTS, CUSTOMERS, WEBLOGS, HDFS, RDBMS]
  • Scenario #2: Hadoop for off-line analytics [diagram labels: PRODUCTS, CUSTOMERS, SALES HISTORY, HDFS, RDBMS]
  • Scenario #3: MapReduce output to RDBMS [diagram labels: DB QUERY TOOL, WEBLOGS SUMMARY, WEBLOGS, HDFS, RDBMS]
  • Scenario #4: Hadoop as RDBMS “active archive” [diagram labels: QUERY TOOL, SALES 2011, SALES 2010, SALES 2009, SALES 2008, HDFS, RDBMS]
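Scenario #1 can be read as a lookup join: weblog records held in HDFS are enriched with reference data (such as PRODUCTS) that lives in the RDBMS. The sketch below is only an illustration of that idea in plain Python; the record layout, the function names and the sample rows are all hypothetical and do not come from the slides.

```python
def load_products(rows):
    """Build an in-memory lookup from (product_id, name) reference rows.

    In a real job these rows would first be imported from the RDBMS;
    here they are passed in directly (hypothetical layout).
    """
    return {product_id: name for product_id, name in rows}


def enrich_weblogs(lines, products):
    """Attach the product name to each tab-separated weblog line.

    Assumed line layout: timestamp, customer_id, product_id.
    """
    for line in lines:
        timestamp, customer_id, product_id = line.rstrip("\n").split("\t")
        yield timestamp, customer_id, products.get(product_id, "UNKNOWN")


if __name__ == "__main__":
    products = load_products([("p1", "Widget"), ("p2", "Gadget")])
    weblogs = ["2011-11-08T10:00\tc42\tp1", "2011-11-08T10:01\tc7\tp9"]
    for record in enrich_weblogs(weblogs, products):
        print(record)
```

Keeping the reference table small enough to hold in memory is what makes this style of join cheap; a large reference table would need a different join strategy.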
  • Case Study: extending SQOOP for Oracle
  • SQOOP extensibility
    • SQOOP implements a generic approach to RDBMS/Hadoop data transfer
    • But database optimization is highly platform specific
      • Each RDBMS has distinct optimization strategies
    • For Oracle, optimization requires:
      • Bypassing Oracle caching layers
      • Avoiding Oracle optimizer meddling
      • Exploiting Oracle metadata to balance mapper load
  • Reading from Oracle – default SQOOP [diagram: two MAPPERs with the predicates "ID > 0 and ID < MAX/2" and "ID > MAX/2"; each has its own ORACLE SESSION performing a RANGE SCAN over index blocks, via the CACHE, into the ORACLE TABLE]
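The default approach pictured on this slide, where each mapper is given a contiguous range of ID values, can be sketched roughly as follows. This is a simplified illustration, not SQOOP's actual implementation: the function names and the sample key distribution are hypothetical. It also shows why equal-width ID ranges do not guarantee equal work when the keys are skewed.

```python
def range_splits(min_id, max_id, num_mappers):
    """Divide [min_id, max_id] into contiguous, roughly equal-width ID ranges."""
    span = max_id - min_id + 1
    bounds = [min_id + (span * i) // num_mappers for i in range(num_mappers + 1)]
    return [(bounds[i], bounds[i + 1] - 1) for i in range(num_mappers)]


def split_predicate(column, lo, hi):
    """Render the WHERE clause one mapper would use for its range."""
    return f"{column} >= {lo} AND {column} <= {hi}"


def rows_per_split(ids, splits):
    """Count how many actual rows fall in each range (reveals skew)."""
    return [sum(1 for i in ids if lo <= i <= hi) for lo, hi in splits]


if __name__ == "__main__":
    # Hypothetical skewed key distribution: most rows have low IDs.
    ids = list(range(1, 91)) + [500, 1000]
    splits = range_splits(min(ids), max(ids), 2)
    for lo, hi in splits:
        print(split_predicate("ID", lo, hi))
    print(rows_per_split(ids, splits))
```

With this sample data the two ranges are the same width, yet almost every row lands in the first one, so one mapper does nearly all of the work.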
  • Oracle – parallelism gone bad (1) [diagram labels: four Hadoop mappers, Oracle, HDFS, SALES table]
  • Oracle – parallelism gone bad (2) [diagram labels: four Hadoop mappers, HDFS, Oracle table]
  • Ideal architecture [diagram labels: four Hadoop mappers, each paired with an Oracle session; HDFS; Oracle table]
  • Design goals
    • Partition data based on physical storage
    • Bypass Oracle buffering
    • Bypass Oracle parallelism
    • Do not require or use indexes
    • Never read the same data block more than once
    • Support Oracle datatypes
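Two of these goals, partitioning by physical storage and never reading a block twice, amount to handing each mapper a disjoint set of storage extents with similar total size. The slides do not say how the Quest connector does this, so the following is only one plausible sketch: a greedy largest-first assignment over a hypothetical `Extent` record (file, starting block, block count).

```python
import heapq
from collections import namedtuple

# Hypothetical description of one contiguous run of table blocks.
Extent = namedtuple("Extent", "file_id start_block num_blocks")


def assign_extents(extents, num_mappers):
    """Give every extent to exactly one mapper, balancing total blocks.

    Greedy heuristic: take extents largest first and always hand the next
    one to the mapper with the fewest blocks assigned so far.
    """
    heap = [(0, m) for m in range(num_mappers)]  # (assigned_blocks, mapper)
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_mappers)]
    for ext in sorted(extents, key=lambda e: e.num_blocks, reverse=True):
        load, mapper = heapq.heappop(heap)
        assignment[mapper].append(ext)
        heapq.heappush(heap, (load + ext.num_blocks, mapper))
    return assignment


if __name__ == "__main__":
    extents = [
        Extent(1, 0, 128), Extent(1, 128, 64), Extent(2, 0, 128),
        Extent(2, 128, 32), Extent(3, 0, 64), Extent(3, 64, 32),
        Extent(3, 96, 8),
    ]
    for mapper, owned in enumerate(assign_extents(extents, 2)):
        print(mapper, sum(e.num_blocks for e in owned), owned)
```

Because each extent appears in exactly one mapper's list, no block is read more than once, and no index is needed to locate the rows.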
  • Import Throughput [chart: elapsed time (ms, 0 to 7,000) against number of mappers (0 to 35); series: SQOOP, SQOOP with Quest Connector]
  • 16 mappers, 50M rows, 50 GB clustered data [chart: percentage reduction per metric]
    • IO time: 98.71
    • IO requests: 99.08
    • Network round trips: 98.95
    • CPU time: 89.72
    • Elapsed time: 80.84
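A percentage reduction can be easier to compare when restated as a ratio: a reduction of p percent leaves (100 - p) percent of the original cost, which is a factor of 100 / (100 - p). The short sketch below applies that arithmetic to the figures on this slide; it assumes the reductions are measured relative to the default SQOOP run, which the slide implies but does not state.

```python
def factor_from_reduction(pct_reduction):
    """Convert a percentage reduction into an 'N times less' factor."""
    if not 0 <= pct_reduction < 100:
        raise ValueError("reduction must be in the range [0, 100)")
    return 100.0 / (100.0 - pct_reduction)


# Percentage reductions as listed on the slide.
REDUCTIONS = {
    "IO time": 98.71,
    "IO requests": 99.08,
    "Network round trips": 98.95,
    "CPU time": 89.72,
    "Elapsed time": 80.84,
}

if __name__ == "__main__":
    for metric, pct in REDUCTIONS.items():
        print(f"{metric}: {pct}% lower, about {factor_from_reduction(pct):.1f}x less")
```

On this reading, the 80.84 percent drop in elapsed time corresponds to a run that finishes roughly five times faster.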
  • Export Throughput [chart: seconds (0 to 3,000) against number of mappers (0 to 25); series: SQOOP, SQOOP with Quest Connect]
  • Export load [chart: database time (s, 0 to 30,000) against number of mappers (0 to 30); series: SQOOP, SQOOP with Quest connect]
  • Working with the SQOOP framework
    • SQOOP lets you concentrate on the RDBMS logic, not the Hadoop plumbing:
      • Extend ManagerFactory (what to handle)
      • Extend ConnManager (DB connection and metadata)
      • For imports:
        • Extend DataDrivenDBInputFormat (gets the data)
        • Data allocation (getSplits())
        • Split serialization (“io.serializations” property)
        • Data access logic (createDBRecordReader(), getSelectQuery())
        • Implement progress (nextKeyValue(), getProgress())
      • Similar procedure for extending exports
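The import-side contract listed above (allocate splits, then read each split while reporting progress) is defined by Java classes in Hadoop and SQOOP. The Python below is only a language-neutral mirror of that contract to show how the pieces fit together; `Split`, `get_splits` and `SplitRecordReader` are hypothetical names, and an in-memory list stands in for the database table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Split:
    """One mapper's share of the table: rows [start, end)."""
    start: int
    end: int


def get_splits(total_rows, num_mappers):
    """Data allocation: cut the row range into one split per mapper."""
    return [
        Split(total_rows * i // num_mappers, total_rows * (i + 1) // num_mappers)
        for i in range(num_mappers)
    ]


class SplitRecordReader:
    """Reads the rows of one split and reports how far it has got."""

    def __init__(self, table, split):
        self._table = table
        self._split = split
        self._next = split.start
        self._current = None

    def next_key_value(self):
        """Advance to the next row; return False when the split is exhausted."""
        if self._next >= self._split.end:
            return False
        self._current = (self._next, self._table[self._next])
        self._next += 1
        return True

    def get_current(self):
        """Return the (key, value) pair read by the last successful advance."""
        return self._current

    def get_progress(self):
        """Fraction of this split consumed so far, from 0.0 to 1.0."""
        size = self._split.end - self._split.start
        return 1.0 if size == 0 else (self._next - self._split.start) / size


if __name__ == "__main__":
    table = [f"row-{i}" for i in range(10)]
    for split in get_splits(len(table), 3):
        reader = SplitRecordReader(table, split)
        while reader.next_key_value():
            print(split, reader.get_current(), round(reader.get_progress(), 2))
```

A database-specific connector would change how the splits are chosen and how each split is turned into a query, while the reading loop stays the same.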
  • Extensions to native SQOOP
    • MERGE functionality
      • Update if exists, insert otherwise
    • Hive connector
      • Source defined as HQL query rather than HDFS directory
    • Eclipse UI
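The "update if exists, insert otherwise" behaviour can be illustrated with a small, self-contained example. This sketch uses Python's built-in `sqlite3` module as a stand-in database and a hypothetical `target` table; it shows the semantics only and says nothing about how the Quest connector issues its statements against Oracle.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with a hypothetical target table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, value TEXT)")
    conn.executemany(
        "INSERT INTO target (id, value) VALUES (?, ?)",
        [(1, "old"), (2, "keep")],
    )
    conn.commit()
    return conn


def merge_rows(conn, rows):
    """Update each row if its key exists, insert it otherwise.

    Returns a (updated, inserted) pair of counts.
    """
    updated = inserted = 0
    for key, value in rows:
        cur = conn.execute(
            "UPDATE target SET value = ? WHERE id = ?", (value, key)
        )
        if cur.rowcount == 0:
            # No existing row matched the key, so add a new one.
            conn.execute(
                "INSERT INTO target (id, value) VALUES (?, ?)", (key, value)
            )
            inserted += 1
        else:
            updated += 1
    conn.commit()
    return updated, inserted


if __name__ == "__main__":
    conn = make_demo_db()
    print(merge_rows(conn, [(1, "new"), (3, "added")]))
    print(conn.execute("SELECT id, value FROM target ORDER BY id").fetchall())
```

Exporting with merge semantics makes a rerun of the same export safe, since rows already present are overwritten rather than duplicated or rejected.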
  • Availability
    • Apache licensed source available from: https://github.com/QuestSoftwareTCD/OracleSQOOPconnector
    • Download from (Quest): http://www.quest.com/hadoop/
    • Download from (Cloudera): http://ccp.cloudera.com/display/SUPPORT/Downloads
  • Other SQOOP connectors
    • Microsoft SQL Server: http://www.microsoft.com/download/en/details.aspx?id=27584
    • Teradata: https://ccp.cloudera.com/display/con/Cloudera+Connector+for+Teradata+User+Guide%2C+version+1.0-beta-u4
    • MicroStrategy: https://ccp.cloudera.com/display/con/MicroStrategy+Free+Download+License+Agreement
    • Netezza: https://ccp.cloudera.com/display/con/Netezza+Free+Download+License+Agreement
    • VoltDB: http://voltdb.com/company/blog/sqoop-voltdb-export-and-hadoop-integration
  • Other Hadoop – RDBMS integrations
  • Oracle Big Data Appliance
    • 18 Sun X4270 M2 servers
      • 48GB per node (864GB total)
      • 2x6 Core CPU per node (216 cores total)
      • 12x2TB HDD per node (216 spindles, 432 TB)
      • 40Gb/s InfiniBand between nodes
      • 10Gb/s Ethernet to datacenter
    • Apache Hadoop
    • Oracle NoSQL
    • Oracle loader for Hadoop
      • Multi-stage C-optimized unidirectional loader
    • www.oracle.com/us/bigdata/index.html
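The cluster totals follow from multiplying the per-node figures by the 18 nodes, which is easy to check. The constant names below are introduced only for this check. Note that 12 disks of 2 TB across 18 nodes gives 432 TB of raw disk.

```python
NODES = 18
RAM_GB_PER_NODE = 48
CORES_PER_NODE = 2 * 6      # two six-core CPUs
DISKS_PER_NODE = 12
TB_PER_DISK = 2

total_ram_gb = NODES * RAM_GB_PER_NODE
total_cores = NODES * CORES_PER_NODE
total_spindles = NODES * DISKS_PER_NODE
total_raw_tb = total_spindles * TB_PER_DISK

if __name__ == "__main__":
    print(f"RAM: {total_ram_gb} GB")
    print(f"Cores: {total_cores}")
    print(f"Spindles: {total_spindles}")
    print(f"Raw disk: {total_raw_tb} TB")
```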
  • [diagram labels: ORACLE BIG DATA APPLIANCE, ORACLE EXALOGIC, ORACLE EXALYTICS, ORACLE WEBLOGIC, ORACLE NOSQL, ORACLE ESSBASE, ORACLE EXADATA, ORACLE LOADER FOR HADOOP, APACHE HADOOP, ORACLE RDBMS, ORACLE TIMES TEN]
  • Microsoft
  • Hadapt
    • Formerly HadoopDB – Hadoop/Postgres hybrid
    • Postgres servers on data nodes allow for accelerated (indexed) Hive queries
    • Extensions to the Hive optimizer
    • http://www.hadapt.com/
  • Greenplum
    • SQL based access to HDFS data via in-DB MapReduce
    • http://www.greenplum.com/sites/default/files/EMC_Greenplum_Hadoop_DB_TB_0.pdf
  • Toad for Cloud Databases
    • Federated SQL queries across Hive, HBase, NoSQL, RDBMS
  • Conclusions
    • RDBMS-Hadoop interoperability is key to Enterprise Hadoop adoption
    • SQOOP provides a good general purpose framework for transferring data between any JDBC database and Hadoop
      • We’d like to see it become a standard
    • Each RDBMS offers distinct tuning opportunities, so optimized SQOOP extensions offer real value
    • Hadoop-RDBMS integration projects are proliferating rapidly