
Hadoop World 2011: Integrating Hadoop with Enterprise RDBMS Using Apache Sqoop and Other Tools - Guy Harrison, Quest Software & Arvind Prabhakar, Cloudera

Published in: Technology


  1. November 2011. Apache Sqoop (Incubating): Integrating Hadoop with Enterprise RDBMS – Part I. Arvind Prabhakar (arvind at apache dot org), Apache Sqoop Committer and Software Engineer at Cloudera
  2. Hadoop Data Processing
  3. Hadoop Data Processing
  4. Hadoop Data Processing
  5. Hadoop Data Processing
  6. In This Session… How Sqoop Works; Roadmap
  7. Data Import
  8. Data Import
  9. Data Import
  10. Data Import
  11. Data Import
  12. Sqoop Overview
  13. Pre-processing
  14. Code Generation
  15. Type Mapping
  16. Data Transfer
  17. Data Transfer
  18. Data Transfer
  19. Post-Processing
  20. Sqoop Connectors: Oracle (developed by Quest Software); Couchbase (developed by Couchbase); Netezza (developed by Cloudera); Teradata (developed by Cloudera); SQL Server (developed by Microsoft); Microsoft PDW (developed by Microsoft); VoltDB (developed by VoltDB)
  21. Sqoop Roadmap: SQOOP-365, Proposal for Sqoop 2.0 (https://issues.apache.org/jira/browse/SQOOP-365). Highlights: Sqoop as a Service; Connections as First Class Objects; Role-based Security
  22. Sqoop 2 Architecture (proposed)
  23. For More Information. Website: http://incubator.apache.org/sqoop/ Mailing lists: incubator-sqoop-user-subscribe@apache.org and incubator-sqoop-dev-subscribe@apache.org. Issue tracker: http://issues.apache.org/jira/browse/SQOOP
  24. Thank You! Q & A will be after Part II of this session.
  25. Guy Harrison, Quest Software: Integrating Hadoop with Enterprise RDBMS Using Apache SQOOP and Other Tools
  26. Introductions
  27. [no text]
  28. [no text]
  29. Agenda: scenarios for RDBMS-Hadoop interaction; case study (Quest extension to SQOOP); other RDBMS-Hadoop integrations
  30. Hadoop meets RDBMS – scenarios
  31. Scenario #1: Reference data in RDBMS [diagram labels: PRODUCTS, CUSTOMERS, WEBLOGS, HDFS, RDBMS]
  32. Scenario #2: Hadoop for off-line analytics [diagram labels: PRODUCTS, CUSTOMERS, SALES HISTORY, HDFS, RDBMS]
  33. Scenario #3: MapReduce output to RDBMS [diagram labels: DB QUERY TOOL, WEBLOGS SUMMARY, WEBLOGS, HDFS, RDBMS]
  34. Scenario #4: Hadoop as RDBMS “active archive” [diagram labels: QUERY TOOL, SALES 2011, SALES 2010, SALES 2009 (twice), SALES 2008 (twice), HDFS, RDBMS]
  35. Case Study: extending SQOOP for Oracle
  36. SQOOP extensibility: SQOOP implements a generic approach to RDBMS/Hadoop data transfer, but database optimization is highly platform specific. Each RDBMS has distinct optimization strategies. For Oracle, optimization requires: bypassing Oracle caching layers; avoiding Oracle optimizer meddling; exploiting Oracle metadata to balance mapper load
  37. Reading from Oracle – default SQOOP [diagram labels: two MAPPERs with the predicates "ID > 0 and ID < MAX/2" and "ID > MAX/2"; two ORACLE SESSIONs, each performing a RANGE SCAN; CACHE; index blocks; ORACLE TABLE]
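The range predicates on the slide above can be illustrated with a small sketch. It divides an integer key range evenly among mappers and emits one WHERE clause per mapper. The class and method names (`IdRangeSplitter`, `whereClauses`) and the `SALES` table are hypothetical, and the code is a simplified illustration of range-based splitting, not Sqoop's actual splitter.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: not part of Sqoop, only an illustration of range splitting.
final class IdRangeSplitter {

    /** Builds one inclusive range predicate per mapper, covering [min, max] exactly once. */
    static List<String> whereClauses(String column, long min, long max, int mappers) {
        if (mappers < 1 || max < min) {
            throw new IllegalArgumentException("need mappers >= 1 and max >= min");
        }
        long span = max - min + 1;                     // candidate key values (assumed to fit in a long)
        int n = (int) Math.min((long) mappers, span);  // never create more splits than key values
        List<String> clauses = new ArrayList<>();
        long lo = min;
        for (int i = 0; i < n; i++) {
            // The first (span % n) splits take one extra value so the ranges stay balanced.
            long size = span / n + (i < span % n ? 1 : 0);
            long hi = lo + size - 1;
            clauses.add(column + " >= " + lo + " AND " + column + " <= " + hi);
            lo = hi + 1;
        }
        return clauses;
    }

    public static void main(String[] args) {
        for (String clause : whereClauses("ID", 1, 1000, 2)) {
            System.out.println("SELECT * FROM SALES WHERE " + clause);
        }
    }
}
```

Even ranges over the key do not guarantee even work: if the IDs are skewed, or if each range is resolved through an index range scan as the diagram suggests, mappers can end up with very different loads.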
  38. Oracle – parallelism gone bad (1) [diagram labels: HDFS, Oracle SALES table, four Hadoop Mappers]
  39. Oracle – parallelism gone bad (2) [diagram labels: HDFS, Oracle table, four Hadoop Mappers]
  40. Ideal architecture [diagram labels: HDFS, ORACLE TABLE, four ORACLE SESSIONs, four Hadoop Mappers]
  41. Design goals: partition data based on physical storage; bypass Oracle buffering; bypass Oracle parallelism; do not require or use indexes; never read the same data block more than once; support Oracle datatypes
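One way to picture the first and fifth design goals is a sketch that treats the table as a list of storage extents and hands each extent to exactly one mapper, balancing by block count. Everything here is an assumption for illustration: the `Extent` fields stand in for extent metadata such as Oracle's `DBA_EXTENTS` view exposes (`FILE_ID`, `BLOCK_ID`, `BLOCKS`), and the greedy assignment is a generic load-balancing heuristic, not the algorithm used by the Quest connector.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: assigns each storage extent to exactly one mapper.
final class ExtentBalancer {

    /** A contiguous run of table blocks (stand-in for one row of extent metadata). */
    static final class Extent {
        final int fileId;
        final long startBlock;
        final long blocks;

        Extent(int fileId, long startBlock, long blocks) {
            this.fileId = fileId;
            this.startBlock = startBlock;
            this.blocks = blocks;
        }
    }

    /** Greedy balancing: largest extents first, each to the currently least-loaded mapper. */
    static List<List<Extent>> assign(List<Extent> extents, int mappers) {
        if (mappers < 1) {
            throw new IllegalArgumentException("need at least one mapper");
        }
        List<List<Extent>> buckets = new ArrayList<>();
        for (int i = 0; i < mappers; i++) {
            buckets.add(new ArrayList<>());
        }
        long[] load = new long[mappers];
        List<Extent> sorted = new ArrayList<>(extents);
        sorted.sort(Comparator.comparingLong((Extent e) -> e.blocks).reversed());
        for (Extent e : sorted) {
            int target = 0;
            for (int m = 1; m < mappers; m++) {
                if (load[m] < load[target]) {
                    target = m;
                }
            }
            buckets.get(target).add(e);   // every extent lands in exactly one bucket
            load[target] += e.blocks;
        }
        return buckets;
    }

    /** Sums the block counts of the extents given to one mapper. */
    static long totalBlocks(List<Extent> extents) {
        long total = 0;
        for (Extent e : extents) {
            total += e.blocks;
        }
        return total;
    }
}
```

Because each extent is assigned once, no block is read by two mappers, and no index is needed to locate the data. A real connector would still have to turn each extent into a query the database can execute, which this sketch leaves out.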
  42. Import Throughput
  43. [no text]
  44. Export Throughput
  45. Export load
  46. Working with the SQOOP framework: SQOOP lets you concentrate on the RDBMS logic, not the Hadoop plumbing. Extend ManagerFactory (what to handle). Extend ConnManager (DB connection and metadata). For imports, extend DataDrivenDBInputFormat (gets the data): data allocation (getSplits()); split serialization (“io.serializations” property); data access logic (createDBRecordReader(), getSelectQuery()); implement progress (nextKeyValue(), getProgress()). Similar procedure for extending exports.
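The division of responsibilities listed on this slide can be sketched without the Sqoop or Hadoop libraries. The types below are self-contained stand-ins written for illustration only: they borrow method names from the slide (`getSelectQuery`, `nextKeyValue`, `getProgress`) but do not extend or reproduce the real `ManagerFactory`, `ConnManager`, or `DataDrivenDBInputFormat` classes, whose signatures differ.

```java
import java.util.Arrays;
import java.util.List;

// Stand-in types only: these mirror the roles on the slide, not the real Sqoop API.
final class ConnectorSketch {

    /** Role of the factory ("what to handle"): claim only Oracle JDBC URLs. */
    static boolean accepts(String jdbcUrl) {
        return jdbcUrl != null && jdbcUrl.startsWith("jdbc:oracle:");
    }

    /** Role of the data access logic: build the query one mapper runs for its split. */
    static String getSelectQuery(String table, String splitPredicate) {
        return "SELECT * FROM " + table + " WHERE " + splitPredicate;
    }

    /** Role of the record reader: step through rows and report progress to the framework. */
    static final class RowReader {
        private final List<String> rows;
        private int consumed = 0;

        RowReader(List<String> rows) {
            this.rows = rows;
        }

        /** Advances to the next row; returns false once the split is exhausted. */
        boolean nextKeyValue() {
            if (consumed >= rows.size()) {
                return false;
            }
            consumed++;
            return true;
        }

        /** Valid only after nextKeyValue() has returned true. */
        String currentValue() {
            return rows.get(consumed - 1);
        }

        /** Fraction of this split already consumed, between 0.0 and 1.0. */
        float getProgress() {
            return rows.isEmpty() ? 1.0f : (float) consumed / rows.size();
        }
    }

    public static void main(String[] args) {
        System.out.println(accepts("jdbc:oracle:thin:@//dbhost:1521/ORCL"));
        System.out.println(getSelectQuery("SALES", "ID >= 1 AND ID <= 500"));
        RowReader reader = new RowReader(Arrays.asList("row1", "row2"));
        while (reader.nextKeyValue()) {
            System.out.println(reader.currentValue() + " progress=" + reader.getProgress());
        }
    }
}
```

The point of the split is that only these three pieces carry database-specific logic; scheduling mappers, retrying tasks, and writing to HDFS stay with the framework.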
  47. Extensions to native SQOOP: MERGE functionality (update if exists, insert otherwise); Hive connector (source defined as HQL query rather than HDFS directory); Eclipse UI
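The "update if exists, insert otherwise" behaviour corresponds to Oracle's MERGE statement. The sketch below builds such a statement as a parameterized string for a single key column. The class name, method name, and the `SALES` example are hypothetical, and this is not the SQL the Quest connector generates; it only shows the general shape of the statement.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: builds an Oracle MERGE statement with JDBC-style placeholders.
final class MergeSqlBuilder {

    /**
     * Returns a MERGE that updates the non-key columns when the key matches
     * and inserts a new row otherwise. Identifiers are concatenated as-is,
     * so they must come from trusted metadata, never from user input.
     */
    static String mergeSql(String table, String keyColumn, List<String> otherColumns) {
        if (otherColumns.isEmpty()) {
            throw new IllegalArgumentException("need at least one non-key column to update");
        }
        List<String> all = new ArrayList<>();
        all.add(keyColumn);
        all.addAll(otherColumns);

        List<String> select = new ArrayList<>();
        List<String> values = new ArrayList<>();
        for (String c : all) {
            select.add("? AS " + c);   // one bind placeholder per column
            values.add("s." + c);
        }
        List<String> set = new ArrayList<>();
        for (String c : otherColumns) {
            set.add("t." + c + " = s." + c);
        }
        return "MERGE INTO " + table + " t"
            + " USING (SELECT " + String.join(", ", select) + " FROM dual) s"
            + " ON (t." + keyColumn + " = s." + keyColumn + ")"
            + " WHEN MATCHED THEN UPDATE SET " + String.join(", ", set)
            + " WHEN NOT MATCHED THEN INSERT (" + String.join(", ", all) + ")"
            + " VALUES (" + String.join(", ", values) + ")";
    }

    public static void main(String[] args) {
        System.out.println(mergeSql("SALES", "ID", Arrays.asList("AMOUNT", "REGION")));
    }
}
```

A single statement of this kind avoids a separate existence check per row, which matters when an export writes many rows in batches.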
  48. Availability. Apache-licensed source: https://github.com/QuestSoftwareTCD/OracleSQOOPconnector Download (Quest): http://www.quest.com/hadoop/ Download (Cloudera): http://ccp.cloudera.com/display/SUPPORT/Downloads
  49. Other SQOOP connectors. Microsoft SQL Server: http://www.microsoft.com/download/en/details.aspx?id=27584 Teradata: https://ccp.cloudera.com/display/con/Cloudera+Connector+for+Teradata+User+Guide%2C+version+1.0-beta-u4 MicroStrategy: https://ccp.cloudera.com/display/con/MicroStrategy+Free+Download+License+Agreement Netezza: https://ccp.cloudera.com/display/con/Netezza+Free+Download+License+Agreement VoltDB: http://voltdb.com/company/blog/sqoop-voltdb-export-and-hadoop-integration
  50. Other Hadoop – RDBMS integrations
  51. Oracle Big Data Appliance: 18 Sun X4270 M2 servers; 48GB per node (864GB total); 2x6 core CPU per node (216 cores total); 12x2TB HDD per node (216 spindles, 432 TB); 40Gb/s InfiniBand between nodes; 10Gb/s Ethernet to datacenter; Apache Hadoop; Oracle NoSQL; Oracle Loader for Hadoop (multi-stage C-optimized unidirectional loader); www.oracle.com/us/bigdata/index.html
  52. [diagram labels: ORACLE EXALYTICS, ORACLE EXALOGIC, ORACLE Big Data Appliance, ORACLE EXADATA, Oracle WEBLOGIC, Oracle ESSBASE, Oracle NoSQL, ORACLE LOADER FOR HADOOP, APACHE HADOOP, Oracle RDBMS, Oracle TIMES TEN]
  53. Microsoft
  54. Hadapt: formerly HadoopDB, a Hadoop/Postgres hybrid; Postgres servers on data nodes allow for accelerated (indexed) Hive queries; extensions to the Hive optimizer; http://www.hadapt.com/
  55. Greenplum: SQL-based access to HDFS data via in-DB MapReduce; http://www.greenplum.com/sites/default/files/EMC_Greenplum_Hadoop_DB_TB_0.pdf
  56. Toad for Cloud Databases: federated SQL queries across Hive, HBase, NoSQL, RDBMS
  57. Conclusions: RDBMS-Hadoop interoperability is key to enterprise Hadoop adoption. SQOOP provides a good general-purpose framework for transferring data between any JDBC database and Hadoop; we’d like to see it become a standard. Each RDBMS offers distinct tuning opportunities, so optimized SQOOP extensions offer real value. Hadoop-RDBMS integration projects are proliferating rapidly.
