Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison, Quest Software


As Hadoop graduates from pilot project to a mission-critical component of the enterprise IT infrastructure, integrating information held in Hadoop and in enterprise RDBMS becomes imperative. We’ll look at key scenarios driving Hadoop and RDBMS integration and review technical options. In particular, we’ll take a deep dive into the Apache SQOOP project, which expedites data movement between Hadoop and any JDBC database, and provides a framework that allows developers and vendors to create connectors optimized for specific targets such as Oracle and Netezza.

Published in: Technology, News & Politics
  • Insanely popular – literally millions of users
  • Transcript of "Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison, Quest Software"

    1. Integrating Hadoop with Enterprise RDBMS Using Apache SQOOP and Other Tools. Guy Harrison, Quest Software. ©2011 Quest Software, Inc. All rights reserved.
    2. Introductions
    5. Agenda
       • Scenarios for RDBMS-Hadoop interaction
       • Case study: Quest extension to SQOOP
       • Other RDBMS-Hadoop integrations
    6. Hadoop meets RDBMS – scenarios
    7. Scenario #1: Reference data in RDBMS [diagram labels: PRODUCTS, CUSTOMERS, HDFS, WEBLOGS, RDBMS]
    8. Scenario #2: Hadoop for off-line analytics [diagram labels: PRODUCTS, CUSTOMERS, HDFS, SALES HISTORY, RDBMS]
    9. Scenario #3: MapReduce output to RDBMS [diagram labels: DB QUERY TOOL, WEBLOGS SUMMARY, HDFS, WEBLOGS, RDBMS]
    10. Scenario #4: Hadoop as RDBMS “active archive” [diagram labels: QUERY TOOL, SALES 2011, SALES 2010, SALES 2009, SALES 2009, SALES 2008, SALES 2008, HDFS, RDBMS]
    11. Case Study: extending SQOOP for Oracle
    12. SQOOP extensibility
       • SQOOP implements a generic approach to RDBMS/Hadoop data transfer
       • But database optimization is highly platform specific
          • Each RDBMS has distinct optimization strategies
       • For Oracle, optimization requires:
          • Bypassing Oracle caching layers
          • Avoiding Oracle optimizer meddling
          • Exploiting Oracle metadata to balance mapper load
    13. Reading from Oracle – default SQOOP [diagram labels: mapper predicates "ID > 0 and ID < MAX/2" and "ID > MAX/2"; MAPPER (×2), CACHE, ORACLE SESSION (×2), RANGE SCAN (×2), Index block (×6), ORACLE TABLE]
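
The slide above shows the default approach of giving each mapper a value range of an ID column. As a minimal sketch of that idea (not SQOOP's actual code; the function name and the half-open range convention are assumptions made for illustration), the predicates can be generated like this:

```python
def range_split_predicates(column, lo, hi, num_mappers):
    """Return one WHERE-clause predicate per mapper, covering lo..hi inclusive.

    Hypothetical helper: the value range is cut into equal-width, half-open
    intervals, the same idea as "ID < MAX/2" and "ID > MAX/2" on the slide.
    """
    if num_mappers < 1:
        raise ValueError("num_mappers must be at least 1")
    span = hi - lo + 1  # number of distinct integer values to cover
    predicates = []
    for i in range(num_mappers):
        start = lo + (span * i) // num_mappers
        end = lo + (span * (i + 1)) // num_mappers  # exclusive upper bound
        if start == end:
            continue  # more mappers than values: skip empty ranges
        predicates.append(f"{column} >= {start} AND {column} < {end}")
    return predicates


if __name__ == "__main__":
    for predicate in range_split_predicates("ID", 1, 100, 2):
        print(predicate)
```

Equal-width value ranges do not guarantee equal row counts per mapper, and each range is typically resolved through an index range scan, which is the behavior the slide illustrates.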
    14. Oracle – parallelism gone bad (1) [diagram labels: Hadoop Mapper (×4), Oracle, HDFS, SALES table]
    15. Oracle – parallelism gone bad (2) [diagram labels: HADOOP MAPPER (×4), HDFS, ORACLE TABLE]
    16. Ideal architecture [diagram labels: HADOOP MAPPER (×4), ORACLE SESSION (×4), HDFS, ORACLE TABLE]
    17. Design goals
       • Partition data based on physical storage
       • Bypass Oracle buffering
       • Bypass Oracle parallelism
       • Do not require or use indexes
       • Never read the same data block more than once
       • Support Oracle datatypes
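
The slides do not describe how the connector divides physical storage among mappers, so the following is only a sketch of one way to meet the first and fifth goals. It assumes the table's storage is available as a list of extents (the `Extent` fields and `assign_extents` are hypothetical names) and hands each extent to exactly one mapper, always choosing the mapper with the fewest blocks so far:

```python
import heapq
from typing import NamedTuple


class Extent(NamedTuple):
    """Hypothetical description of a contiguous run of table blocks."""
    file_id: int
    start_block: int
    block_count: int


def assign_extents(extents, num_mappers):
    """Greedily assign each extent to the currently least-loaded mapper.

    Every extent goes to exactly one mapper, so no block is read twice.
    Largest extents are placed first to keep the block totals balanced.
    """
    heap = [(0, mapper) for mapper in range(num_mappers)]  # (blocks, mapper)
    heapq.heapify(heap)
    assignment = {mapper: [] for mapper in range(num_mappers)}
    for extent in sorted(extents, key=lambda e: e.block_count, reverse=True):
        load, mapper = heapq.heappop(heap)
        assignment[mapper].append(extent)
        heapq.heappush(heap, (load + extent.block_count, mapper))
    return assignment


if __name__ == "__main__":
    demo = [Extent(1, 0, 8), Extent(1, 8, 8), Extent(2, 0, 16), Extent(2, 16, 32)]
    for mapper, owned in assign_extents(demo, 2).items():
        print(mapper, sum(e.block_count for e in owned), owned)
```

Because the split is driven by block counts rather than column values, it needs no index and balances load by the amount of data each mapper actually reads.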
    18. Import Throughput [chart: Elapsed Time (ms) vs. Number of mappers; series: SQOOP, SQOOP with Quest Connector]
    19. 16 mappers, 50M rows, 50 GB clustered data [bar chart: Pct reduction]
       • IO time: 98.71
       • IO requests: 99.08
       • Network round trips: 98.95
       • CPU time: 89.72
       • Elapsed time: 80.84
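
A percentage reduction can be easier to compare when restated as an improvement factor. The short sketch below does that arithmetic for the figures on the slide above; the function name is an assumption, and the factor is simply 100 / (100 - reduction):

```python
# Percentage reductions as listed on the slide.
REDUCTIONS = {
    "IO time": 98.71,
    "IO requests": 99.08,
    "Network round trips": 98.95,
    "CPU time": 89.72,
    "Elapsed time": 80.84,
}


def improvement_factor(pct_reduction):
    """Convert a percentage reduction into a 'times smaller' factor."""
    if not 0 <= pct_reduction < 100:
        raise ValueError("reduction must be in the range [0, 100)")
    return 100.0 / (100.0 - pct_reduction)


if __name__ == "__main__":
    for metric, pct in REDUCTIONS.items():
        print(f"{metric}: {pct}% reduction = {improvement_factor(pct):.1f}x")
```

For example, the 80.84% reduction in elapsed time corresponds to the job finishing in roughly one fifth of the original time.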
    20. Export Throughput [chart: Seconds vs. No. of mappers; series: SQOOP, SQOOP with Quest Connector]
    21. Export load [chart: Database time (s) vs. No. of mappers; series: SQOOP, SQOOP with Quest Connector]
    22. Working with the SQOOP framework
       • SQOOP lets you concentrate on the RDBMS logic, not the Hadoop plumbing:
          • Extend ManagerFactory (what to handle)
          • Extend ConnManager (DB connection and metadata)
          • For imports:
             • Extend DataDrivenDBInputFormat (gets the data)
             • Data allocation (getSplits())
             • Split serialization (“io.serializations” property)
             • Data access logic (createDBRecordReader(), getSelectQuery())
             • Implement progress (nextKeyValue(), getProgress())
          • Similar procedure for extending exports
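
The import-side responsibilities listed above (allocate splits, read records, report progress) can be illustrated with a toy analog. This is plain Python written for illustration, not the Hadoop or SQOOP Java API: the class names are hypothetical, the method names only mirror the Java ones named on the slide, and split serialization is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RowRangeSplit:
    """Hypothetical split: a half-open range of row positions."""
    start: int  # inclusive
    end: int    # exclusive


class ToyRecordReader:
    """Reads the rows of one split and reports how far it has progressed."""

    def __init__(self, split, rows):
        self._split = split
        self._rows = rows
        self._next = split.start
        self.current_key = None
        self.current_value = None

    def next_key_value(self):
        """Advance to the next row; return False once the split is exhausted."""
        if self._next >= self._split.end:
            return False
        self.current_key = self._next
        self.current_value = self._rows[self._next]
        self._next += 1
        return True

    def get_progress(self):
        """Fraction of the split consumed so far, between 0.0 and 1.0."""
        total = self._split.end - self._split.start
        if total == 0:
            return 1.0
        return (self._next - self._split.start) / total


class ToyInputFormat:
    """Decides how the data is divided and creates a reader per split."""

    def __init__(self, rows):
        self._rows = rows

    def get_splits(self, num_mappers):
        n = len(self._rows)
        bounds = [n * i // num_mappers for i in range(num_mappers + 1)]
        return [RowRangeSplit(a, b) for a, b in zip(bounds, bounds[1:]) if a < b]

    def create_record_reader(self, split):
        return ToyRecordReader(split, self._rows)


if __name__ == "__main__":
    fmt = ToyInputFormat(["a", "b", "c", "d", "e"])
    for split in fmt.get_splits(2):
        reader = fmt.create_record_reader(split)
        while reader.next_key_value():
            print(split, reader.current_key, reader.current_value, reader.get_progress())
```

The point of the separation is that a connector author supplies only the database-specific parts (how to split and how to read a split), while the framework drives the readers.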
    23. Extensions to native SQOOP
       • MERGE functionality
          • Update if exists, insert otherwise
       • Hive connector
          • Source defined as HQL query rather than HDFS directory
       • Eclipse UI
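
The "update if exists, insert otherwise" semantics of the MERGE extension can be sketched as follows. This uses an in-memory SQLite database from the Python standard library as a stand-in; the `products` table, its columns, and both function names are assumptions for illustration, and the connector's own implementation against Oracle is not shown in the slides.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with one existing row (demo data only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO products (id, name) VALUES (1, 'widget')")
    conn.commit()
    return conn


def merge_rows(conn, rows):
    """Update each (id, name) row if the id exists, insert it otherwise.

    Returns a (updated, inserted) pair of counts.
    """
    updated = inserted = 0
    for row_id, name in rows:
        cur = conn.execute(
            "UPDATE products SET name = ? WHERE id = ?", (name, row_id)
        )
        if cur.rowcount == 0:
            # No existing row matched the key, so this is a new record.
            conn.execute(
                "INSERT INTO products (id, name) VALUES (?, ?)", (row_id, name)
            )
            inserted += 1
        else:
            updated += 1
    conn.commit()
    return updated, inserted


if __name__ == "__main__":
    db = make_demo_db()
    print(merge_rows(db, [(1, "widget v2"), (2, "gadget")]))
    print(db.execute("SELECT id, name FROM products ORDER BY id").fetchall())
```

Oracle provides a MERGE statement that expresses the same logic in a single SQL statement; the two-step form here is used only because it is easy to follow.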
    24. Availability
       • Apache licensed source available from: https://github.com/QuestSoftwareTCD/OracleSQOOPconnector
       • Download from (Quest): http://www.quest.com/hadoop/
       • Download from (Cloudera): http://ccp.cloudera.com/display/SUPPORT/Downloads
    25. Other SQOOP connectors
       • Microsoft SQL Server:
          • http://www.microsoft.com/download/en/details.aspx?id=27584
       • Teradata:
          • https://ccp.cloudera.com/display/con/Cloudera+Connector+for+Teradata+User+Guide%2C+version+1.0-beta-u4
       • MicroStrategy:
          • https://ccp.cloudera.com/display/con/MicroStrategy+Free+Download+License+Agreement
       • Netezza:
          • https://ccp.cloudera.com/display/con/Netezza+Free+Download+License+Agreement
       • VoltDB:
          • http://voltdb.com/company/blog/sqoop-voltdb-export-and-hadoop-integration
    26. Other Hadoop – RDBMS integrations
    27. Oracle Big Data Appliance
       • 18 Sun X4270 M2 servers
          • 48GB per node (864GB total)
          • 2x6 Core CPU per node (216 total)
          • 12x2TB HDD per node (216 spindles, 864 TB)
          • 40Gb/s Infiniband between nodes
          • 10Gb/s Ethernet to datacenter
       • Apache Hadoop
       • Oracle NoSQL
       • Oracle loader for Hadoop
          • Multi-stage C-optimized unidirectional loader
       www.oracle.com/us/bigdata/index.html
    28. [diagram labels: ORACLE BIG DATA APPLIANCE, ORACLE EXALOGIC, ORACLE EXALYTICS, ORACLE WEBLOGIC, ORACLE NOSQL, ORACLE ESSBASE, ORACLE EXADATA, ORACLE LOADER FOR HADOOP, APACHE HADOOP, ORACLE RDBMS, ORACLE TIMES TEN]
    29. Microsoft
    30. Hadapt
       • Formerly HadoopDB – Hadoop/Postgres hybrid
       • Postgres servers on data nodes allow for accelerated (indexed) Hive queries
       • Extensions to the Hive optimizer
       http://www.hadapt.com/
    31. Greenplum
       • SQL based access to HDFS data via in-DB MapReduce
       http://www.greenplum.com/sites/default/files/EMC_Greenplum_Hadoop_DB_TB_0.pdf
    32. Toad for Cloud Databases
       • Federated SQL queries across Hive, HBase, NoSQL, RDBMS
    33. Conclusions
       • RDBMS-Hadoop interoperability is key to Enterprise Hadoop adoption
       • SQOOP provides a good general purpose framework for transferring data between any JDBC database and Hadoop
          • We’d like to see it become a standard
       • Each RDBMS offers distinct tuning opportunities, so optimized SQOOP extensions offer real value
       • Hadoop-RDBMS integration projects are proliferating rapidly