Hadoop World 2011: Integrating Hadoop with Enterprise RDBMS Using Apache Sqoop and Other Tools - Guy Harrison, Quest Software & Arvind Prabhakar, Cloudera

  • Transcript of "Hadoop World 2011: Integrating Hadoop with Enterprise RDBMS Using Apache Sqoop and Other Tools - Guy Harrison, Quest Software & Arvind Prabhakar, Cloudera"

    1. Apache Sqoop (Incubating): Integrating Hadoop with Enterprise RDBMS – Part I
         • November 2011
         • Arvind Prabhakar (arvind at apache dot org)
         • Apache Sqoop Committer and Software Engineer at Cloudera
    2. Hadoop Data Processing
    3. Hadoop Data Processing
    4. Hadoop Data Processing
    5. Hadoop Data Processing
    6. In This Session…
         • How Sqoop Works
         • Roadmap
    7. Data Import
    8. Data Import
    9. Data Import
    10. Data Import
    11. Data Import
    12. Sqoop Overview
    13. Pre-processing
    14. Code Generation
    15. Type Mapping
    16. Data Transfer
    17. Data Transfer
    18. Data Transfer
    19. Post-Processing
    20. Sqoop Connectors
         • Oracle – Developed by Quest Software
         • Couchbase – Developed by Couchbase
         • Netezza – Developed by Cloudera
         • Teradata – Developed by Cloudera
         • SQL Server – Developed by Microsoft
         • Microsoft PDW – Developed by Microsoft
         • VoltDB – Developed by VoltDB
    21. Sqoop Roadmap
         • SQOOP-365: Proposal for Sqoop 2.0
           https://issues.apache.org/jira/browse/SQOOP-365
         • Highlights:
             • Sqoop as a Service
             • Connections as First Class Objects
             • Role-based Security
    22. Sqoop 2 Architecture (proposed)
    23. For More Information
         • Website: http://incubator.apache.org/sqoop/
         • Mailing Lists:
             • incubator-sqoop-user-subscribe@apache.org
             • incubator-sqoop-dev-subscribe@apache.org
         • Issue Tracker: http://issues.apache.org/jira/browse/SQOOP
    24. Thank You!
         • Q & A will be after Part II of this session.
    25. Integrating Hadoop with Enterprise RDBMS Using Apache SQOOP and Other Tools
         • Guy Harrison, Quest Software
    26. Introductions
    27. (no text)
    28. (no text)
    29. Agenda
         • Scenarios for RDBMS-Hadoop interaction
         • Case study: Quest extension to SQOOP
         • Other RDBMS-Hadoop integrations
    30. Hadoop meets RDBMS – scenarios
    31. Scenario #1: Reference data in RDBMS
         • (diagram labels: PRODUCTS, HDFS, CUSTOMERS, WEBLOGS, RDBMS)
    32. Scenario #2: Hadoop for off-line analytics
         • (diagram labels: PRODUCTS, HDFS, CUSTOMERS, SALES HISTORY, RDBMS)
    33. Scenario #3: MapReduce output to RDBMS
         • (diagram labels: DB QUERY TOOL, WEBLOGS SUMMARY, HDFS, WEBLOGS, RDBMS)
    34. Scenario #4: Hadoop as RDBMS “active archive”
         • (diagram labels: QUERY TOOL, SALES 2011, SALES 2010, HDFS, SALES 2009 (twice), SALES 2008 (twice), RDBMS)
    35. Case Study: extending SQOOP for Oracle
    36. SQOOP extensibility
         • SQOOP implements a generic approach to RDBMS/Hadoop data transfer
         • But database optimization is highly platform specific
         • Each RDBMS has distinct optimization strategies
         • For Oracle, optimization requires:
             • Bypassing Oracle caching layers
             • Avoiding Oracle optimizer meddling
             • Exploiting Oracle metadata to balance mapper load
    37. Reading from Oracle – default SQOOP
         • (diagram: two mappers, one with the predicate "ID > 0 and ID < MAX/2" and the other with "ID > MAX/2"; labels: MAPPER (twice), ORACLE SESSION (twice), RANGE SCAN (twice), CACHE, Index block (six times), ORACLE TABLE)
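
The slide above shows each mapper being handed a range of ID values, which it then reads through its own Oracle session. A minimal sketch of that idea follows, assuming an integer split column with known minimum and maximum values. The function names `id_range_splits` and `split_predicates` are hypothetical and are not taken from Sqoop's code base.

```python
def id_range_splits(min_id, max_id, num_mappers):
    """Split the inclusive range [min_id, max_id] into contiguous,
    non-overlapping sub-ranges, one per mapper (hypothetical helper)."""
    if num_mappers < 1:
        raise ValueError("num_mappers must be at least 1")
    if max_id < min_id:
        raise ValueError("max_id must not be smaller than min_id")
    total = max_id - min_id + 1
    count = min(num_mappers, total)      # never create empty ranges
    base, extra = divmod(total, count)   # first `extra` ranges get one more ID
    splits = []
    lo = min_id
    for i in range(count):
        size = base + (1 if i < extra else 0)
        hi = lo + size - 1
        splits.append((lo, hi))
        lo = hi + 1
    return splits


def split_predicates(column, splits):
    """Turn each (lo, hi) pair into a WHERE-clause fragment for one mapper."""
    return [f"{column} >= {lo} AND {column} <= {hi}" for lo, hi in splits]


if __name__ == "__main__":
    for predicate in split_predicates("ID", id_range_splits(1, 100, 2)):
        print(predicate)
```

Equal-width ID ranges only balance the work when IDs are evenly distributed. With gaps or skew in the ID column, some mappers receive far more rows than others, which is one reason slide 36 lists balancing mapper load as an Oracle-specific concern.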
    38. Oracle – parallelism gone bad (1)
         • (diagram labels: HDFS, Oracle SALES table, Hadoop Mapper (four times))
    39. Oracle – parallelism gone bad (2)
         • (diagram labels: HDFS, Oracle table, Hadoop Mapper (four times))
    40. Ideal architecture
         • (diagram labels: HDFS, ORACLE TABLE, and four pairs of ORACLE SESSION and Hadoop Mapper)
    41. Design goals
         • Partition data based on physical storage
         • By-pass Oracle buffering
         • By-pass Oracle parallelism
         • Do not require or use indexes
         • Never read the same data block more than once
         • Support Oracle datatypes
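
Two of the design goals above (partition by physical storage, never read the same block twice) can be illustrated with a small sketch. The slides do not say how the Quest connector does this, so the following is an assumption: the table's storage is described as a list of block ranges, each a hypothetical `(file_id, start_block, block_count)` tuple, and every range is given to exactly one mapper using a greedy largest-first assignment.

```python
import heapq


def assign_block_ranges(block_ranges, num_mappers):
    """Assign each (file_id, start_block, block_count) range to exactly one
    mapper, always giving the next-largest range to the least-loaded mapper.
    Illustrative only; not the connector's actual algorithm."""
    if num_mappers < 1:
        raise ValueError("num_mappers must be at least 1")
    # Min-heap of (blocks assigned so far, mapper index).
    heap = [(0, mapper) for mapper in range(num_mappers)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_mappers)]
    for block_range in sorted(block_ranges, key=lambda r: r[2], reverse=True):
        load, mapper = heapq.heappop(heap)
        assignment[mapper].append(block_range)
        heapq.heappush(heap, (load + block_range[2], mapper))
    return assignment


if __name__ == "__main__":
    ranges = [(1, 0, 100), (1, 100, 50), (2, 0, 50), (2, 50, 100)]
    for mapper, assigned in enumerate(assign_block_ranges(ranges, 2)):
        print(mapper, assigned)
```

Because each range is assigned once, no block is read by two mappers, and because the split is driven by block counts rather than key values, no index is needed to build it.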
    42. Import Throughput
    43. (no text)
    44. Export Throughput
    45. Export load
    46. Working with the SQOOP framework
         • SQOOP lets you concentrate on the RDBMS logic, not the Hadoop plumbing:
         • Extend ManagerFactory (what to handle)
         • Extend ConnManager (DB connection and metadata)
         • For imports, extend DataDrivenDBInputFormat (gets the data):
             • Data allocation (getSplits())
             • Split serialization (“io.serializations” property)
             • Data access logic (createDBRecordReader(), getSelectQuery())
             • Implement progress (nextKeyValue(), getProgress())
         • Similar procedure for extending exports
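
The import-side responsibilities listed above (allocate splits, read the data for one split, report progress) can be sketched as a toy model. This is not the Sqoop or Hadoop API: the real extension points are Java classes, and every class and method name below (`ToySplit`, `ToyInputFormat`, `ToyRecordReader`) is hypothetical. An in-memory list stands in for the database table.

```python
class ToySplit:
    """A half-open row range [start, end) handled by one mapper."""
    def __init__(self, start, end):
        self.start = start
        self.end = end


class ToyRecordReader:
    """Reads the rows of one split and reports progress between 0.0 and 1.0."""
    def __init__(self, rows, split):
        self._rows = rows[split.start:split.end]
        self._pos = 0
        self.current = None

    def next_key_value(self):
        # Advance to the next row; return False when the split is exhausted.
        if self._pos >= len(self._rows):
            return False
        self.current = (self._pos, self._rows[self._pos])
        self._pos += 1
        return True

    def get_progress(self):
        if not self._rows:
            return 1.0
        return self._pos / len(self._rows)


class ToyInputFormat:
    """Decides how rows are allocated to splits and builds a reader per split."""
    def __init__(self, rows, num_splits):
        self._rows = rows
        self._num_splits = max(1, num_splits)

    def get_splits(self):
        chunk = -(-len(self._rows) // self._num_splits)  # ceiling division
        if chunk == 0:
            return []
        return [ToySplit(start, min(start + chunk, len(self._rows)))
                for start in range(0, len(self._rows), chunk)]

    def create_record_reader(self, split):
        return ToyRecordReader(self._rows, split)


if __name__ == "__main__":
    fmt = ToyInputFormat(["a", "b", "c", "d", "e"], num_splits=2)
    for split in fmt.get_splits():
        reader = fmt.create_record_reader(split)
        while reader.next_key_value():
            print(reader.current, reader.get_progress())
```

The point of the separation is that the database-specific decisions live in the split allocation and the reader, while the framework drives the loop.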
    47. Extensions to native SQOOP
         • MERGE functionality: update if exists, insert otherwise
         • Hive connector: source defined as HQL query rather than HDFS directory
         • Eclipse UI
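
The "update if exists, insert otherwise" behaviour named above can be shown with a small sketch. It uses Python's built-in `sqlite3` module purely to illustrate the semantics; it is not the connector's implementation and does not use an Oracle MERGE statement. The `products` table and the `merge_rows` helper are hypothetical.

```python
import sqlite3


def merge_rows(conn, rows):
    """For each (id, name) pair, update the matching row if one exists,
    otherwise insert a new row. Illustrates MERGE semantics only."""
    cur = conn.cursor()
    for product_id, name in rows:
        cur.execute("UPDATE products SET name = ? WHERE id = ?",
                    (name, product_id))
        if cur.rowcount == 0:  # no existing row matched, so insert
            cur.execute("INSERT INTO products (id, name) VALUES (?, ?)",
                        (product_id, name))
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO products VALUES (1, 'widget')")
    merge_rows(conn, [(1, "widget v2"), (2, "gadget")])
    print(conn.execute("SELECT id, name FROM products ORDER BY id").fetchall())
```

Running the same batch twice leaves the table unchanged the second time, which is what makes this kind of export safe to repeat.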
    48. Availability
         • Apache licensed source available from:
           https://github.com/QuestSoftwareTCD/OracleSQOOPconnector
         • Download from (Quest):
           http://www.quest.com/hadoop/
         • Download from (Cloudera):
           http://ccp.cloudera.com/display/SUPPORT/Downloads
    49. Other SQOOP connectors
         • Microsoft SQL Server:
           http://www.microsoft.com/download/en/details.aspx?id=27584
         • Teradata:
           https://ccp.cloudera.com/display/con/Cloudera+Connector+for+Teradata+User+Guide%2C+version+1.0-beta-u4
         • MicroStrategy:
           https://ccp.cloudera.com/display/con/MicroStrategy+Free+Download+License+Agreement
         • Netezza:
           https://ccp.cloudera.com/display/con/Netezza+Free+Download+License+Agreement
         • VoltDB:
           http://voltdb.com/company/blog/sqoop-voltdb-export-and-hadoop-integration
    50. Other Hadoop – RDBMS integrations
    51. Oracle Big Data Appliance
         • 18 Sun X4270 M2 servers
         • 48GB per node (864GB total)
         • 2x6 Core CPU per node (216 total)
         • 12x2TB HDD per node (216 spindles, 864 TB)
         • 40Gb/s Infiniband between nodes
         • 10Gb/s Ethernet to datacenter
         • Apache Hadoop
         • Oracle NoSQL
         • Oracle loader for Hadoop
             • Multi-stage C-optimized unidirectional loader
         • www.oracle.com/us/bigdata/index.html
    52. (diagram labels: Oracle Exalytics, Oracle Exalogic, Oracle Big Data Appliance, Oracle WebLogic, Oracle Essbase, Oracle NoSQL, Oracle Exadata, Oracle Loader for Hadoop, Apache Hadoop, Oracle RDBMS, Oracle TimesTen)
    53. Microsoft
    54. Hadapt
         • Formerly HadoopDB – Hadoop/Postgres hybrid
         • Postgres servers on data nodes allow for accelerated (indexed) Hive queries
         • Extensions to the Hive optimizer
         • http://www.hadapt.com/
    55. Greenplum
         • SQL based access to HDFS data via in-DB MapReduce
         • http://www.greenplum.com/sites/default/files/EMC_Greenplum_Hadoop_DB_TB_0.pdf
    56. Toad for Cloud Databases
         • Federated SQL queries across Hive, HBase, NoSQL, RDBMS
    57. Conclusions
         • RDBMS-Hadoop interoperability is key to Enterprise Hadoop adoption
         • SQOOP provides a good general purpose framework for transferring data between any JDBC database and Hadoop
             • We’d like to see it become a standard
         • Each RDBMS offers distinct tuning opportunities, so optimized SQOOP extensions offer real value
         • Hadoop-RDBMS integration projects are proliferating rapidly