Hadoop World 2011: Integrating Hadoop with Enterprise RDBMS Using Apache Sqoop and Other Tools - Guy Harrison, Quest Software & Arvind Prabhakar, Cloudera

  1. November 2011
     Apache Sqoop (Incubating)
     Integrating Hadoop with Enterprise RDBMS – Part I
     Arvind Prabhakar (arvind at apache dot org)
     Apache Sqoop Committer and Software Engineer at Cloudera
  2. Hadoop Data Processing
  3. Hadoop Data Processing
  4. Hadoop Data Processing
  5. Hadoop Data Processing
  6. In This Session…
     • How Sqoop Works
     • Roadmap
  7. Data Import
  8. Data Import
  9. Data Import
  10. Data Import
  11. Data Import
  12. Sqoop Overview
  13. Pre-processing
  14. Code Generation
  15. Type Mapping
  16. Data Transfer
  17. Data Transfer
  18. Data Transfer
  19. Post-Processing
  20. Sqoop Connectors
      • Oracle – Developed by Quest Software
      • Couchbase – Developed by Couchbase
      • Netezza – Developed by Cloudera
      • Teradata – Developed by Cloudera
      • SQL Server – Developed by Microsoft
      • Microsoft PDW – Developed by Microsoft
      • VoltDB – Developed by VoltDB
  21. Sqoop Roadmap
      SQOOP-365: Proposal for Sqoop 2.0
      https://issues.apache.org/jira/browse/SQOOP-365
      Highlights:
      • Sqoop as a Service
      • Connections as First Class Objects
      • Role based Security
  22. Sqoop 2 Architecture (proposed)
  23. For More Information
      • Website: http://incubator.apache.org/sqoop/
      • Mailing Lists:
        incubator-sqoop-user-subscribe@apache.org
        incubator-sqoop-dev-subscribe@apache.org
      • Issue Tracker: http://issues.apache.org/jira/browse/SQOOP
  24. Thank You!
      Q&A will follow Part II of this session.
  25. Guy Harrison, Quest Software
      Integrating Hadoop with Enterprise RDBMS Using Apache SQOOP and Other Tools
  26. Introductions
  27. (slide has no text)
  28. (slide has no text)
  29. Agenda
      • Scenarios for RDBMS-Hadoop interaction
      • Case study: Quest extension to SQOOP
      • Other RDBMS-Hadoop integrations
  30. Hadoop meets RDBMS – scenarios
  31. Scenario #1: Reference data in RDBMS
      (diagram labels: PRODUCTS, HDFS, CUSTOMERS, WEBLOGS, RDBMS)
  32. Scenario #2: Hadoop for off-line analytics
      (diagram labels: PRODUCTS, HDFS, CUSTOMERS, SALES HISTORY, RDBMS)
  33. Scenario #3: MapReduce output to RDBMS
      (diagram labels: DB QUERY TOOL, WEBLOGS SUMMARY, HDFS, WEBLOGS, RDBMS)
  34. Scenario #4: Hadoop as RDBMS “active archive”
      (diagram labels: QUERY TOOL, SALES 2011, SALES 2010, HDFS, SALES 2009, SALES 2009, SALES 2008, SALES 2008, RDBMS)
  35. Case Study: extending SQOOP for Oracle
  36. SQOOP extensibility
      • SQOOP implements a generic approach to RDBMS/Hadoop data transfer
      • But database optimization is highly platform specific
      • Each RDBMS has distinct optimization strategies
      • For Oracle, optimization requires:
        • Bypassing Oracle caching layers
        • Avoiding Oracle optimizer meddling
        • Exploiting Oracle metadata to balance mapper load
  37. Reading from Oracle – default SQOOP
      (diagram labels: mapper predicates “ID > 0 and ID < MAX/2” and “ID > MAX/2”; MAPPER (×2); CACHE; ORACLE SESSION (×2); RANGE SCAN (×2); Index block (×6); ORACLE TABLE)
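
The default behaviour pictured on slide 37, where each mapper receives a contiguous range of the split column, can be sketched roughly as follows. This is a minimal illustration and not Sqoop's actual code: the helper name `range_splits`, the column name `ID`, and the assumption of an integer split column are all made up for the example.

```python
def range_splits(lo, hi, num_mappers, column="ID"):
    """Divide the closed integer range [lo, hi] of a split column into
    contiguous WHERE-clause predicates, one per mapper (hypothetical helper)."""
    if num_mappers < 1 or hi < lo:
        raise ValueError("need at least one mapper and lo <= hi")
    span = hi - lo + 1
    # Never create more splits than there are distinct values.
    n = min(num_mappers, span)
    predicates = []
    start = lo
    for i in range(n):
        # Spread any remainder over the first (span % n) splits.
        size = span // n + (1 if i < span % n else 0)
        end = start + size - 1
        predicates.append(f"{column} >= {start} AND {column} <= {end}")
        start = end + 1
    return predicates


if __name__ == "__main__":
    for where in range_splits(1, 1000, 2):
        print(where)
```

Equal-width ranges balance the mappers only when the column values are spread evenly across the range; a skewed key leaves some mappers with far more rows than others.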
  38. Oracle – parallelism gone bad (1)
      (diagram labels: HDFS, Oracle SALES table, Hadoop Mapper (×4))
  39. Oracle – parallelism gone bad (2)
      (diagram labels: HDFS, Oracle table, Hadoop Mapper (×4))
  40. Ideal architecture
      (diagram labels: HDFS, ORACLE TABLE, ORACLE SESSION (×4), Hadoop Mapper (×4))
  41. Design goals
      • Partition data based on physical storage
      • Bypass Oracle buffering
      • Bypass Oracle parallelism
      • Do not require or use indexes
      • Never read the same data block more than once
      • Support Oracle datatypes
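
Two of the design goals above, partitioning by physical storage and never reading a block twice, can be pictured with a small sketch. It is only an illustration under stated assumptions and is not the Quest connector's algorithm: the function `assign_extents`, the `(file_id, first_block, block_count)` tuple layout, and the greedy balancing rule are all hypothetical.

```python
import heapq


def assign_extents(extents, num_mappers):
    """Assign each storage extent to exactly one mapper, keeping the number
    of blocks per mapper roughly balanced (hypothetical helper).

    extents: list of (file_id, first_block, block_count) tuples.
    Returns one list of extents per mapper.
    """
    if num_mappers < 1:
        raise ValueError("need at least one mapper")
    # Min-heap of (blocks assigned so far, mapper index).
    heap = [(0, m) for m in range(num_mappers)]
    heapq.heapify(heap)
    plan = [[] for _ in range(num_mappers)]
    # Largest extents first, each going to the least-loaded mapper.
    for extent in sorted(extents, key=lambda e: e[2], reverse=True):
        load, mapper = heapq.heappop(heap)
        plan[mapper].append(extent)
        heapq.heappush(heap, (load + extent[2], mapper))
    return plan


if __name__ == "__main__":
    demo = [(1, 0, 8), (1, 8, 8), (2, 0, 16), (2, 16, 4)]
    for mapper, assigned in enumerate(assign_extents(demo, 2)):
        print(mapper, assigned)
```

Because every extent is handed to a single mapper, no block is read more than once, and the split does not depend on any index or on the distribution of key values.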
  42. Import Throughput
  43. (slide has no text)
  44. Export Throughput
  45. Export load
  46. Working with the SQOOP framework
      SQOOP lets you concentrate on the RDBMS logic, not the Hadoop plumbing:
      • Extend ManagerFactory (what to handle)
      • Extend ConnManager (DB connection and metadata)
      • For imports:
        • Extend DataDrivenDBInputFormat (gets the data)
        • Data allocation (getSplits())
        • Split serialization (“io.serializations” property)
        • Data access logic (createDBRecordReader(), getSelectQuery())
        • Implement progress (nextKeyValue(), getProgress())
      • Similar procedure for extending exports
  47. Extensions to native SQOOP
      • MERGE functionality
        • Update if exists, insert otherwise
      • Hive connector
        • Source defined as HQL query rather than HDFS directory
      • Eclipse UI
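
The “update if exists, insert otherwise” semantics of the MERGE extension can be sketched as below. This is an illustration only, not the connector's implementation: SQLite stands in for the target RDBMS so the example is self-contained, and the `sales` table, its columns, and the `merge_rows` helper are hypothetical. On Oracle the same effect would normally come from a single MERGE statement.

```python
import sqlite3


def merge_rows(conn, rows):
    """Update each (id, amount) row if its id already exists in the
    hypothetical sales table, insert it otherwise.
    Returns (rows updated, rows inserted)."""
    updated = inserted = 0
    for row_id, amount in rows:
        cur = conn.execute(
            "UPDATE sales SET amount = ? WHERE id = ?", (amount, row_id)
        )
        if cur.rowcount == 0:
            # No existing row matched, so this is a new record.
            conn.execute(
                "INSERT INTO sales (id, amount) VALUES (?, ?)", (row_id, amount)
            )
            inserted += 1
        else:
            updated += 1
    conn.commit()
    return updated, inserted


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")
    conn.execute("INSERT INTO sales (id, amount) VALUES (1, 10.0)")
    print(merge_rows(conn, [(1, 15.0), (2, 20.0)]))
    print(conn.execute("SELECT id, amount FROM sales ORDER BY id").fetchall())
```

Row-at-a-time statements keep the sketch readable; a real export would batch the work rather than issue two statements per row.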
  48. Availability
      • Apache licensed source available from:
        https://github.com/QuestSoftwareTCD/OracleSQOOPconnector
      • Download from (Quest):
        http://www.quest.com/hadoop/
      • Download from (Cloudera):
        http://ccp.cloudera.com/display/SUPPORT/Downloads
  49. Other SQOOP connectors
      • Microsoft SQL Server:
        http://www.microsoft.com/download/en/details.aspx?id=27584
      • Teradata:
        https://ccp.cloudera.com/display/con/Cloudera+Connector+for+Teradata+User+Guide%2C+version+1.0-beta-u4
      • MicroStrategy:
        https://ccp.cloudera.com/display/con/MicroStrategy+Free+Download+License+Agreement
      • Netezza:
        https://ccp.cloudera.com/display/con/Netezza+Free+Download+License+Agreement
      • VoltDB:
        http://voltdb.com/company/blog/sqoop-voltdb-export-and-hadoop-integration
  50. Other Hadoop – RDBMS integrations
  51. Oracle Big Data Appliance
      • 18 Sun X4270 M2 servers
        • 48GB per node (864GB total)
        • 2x6 Core CPU per node (216 total)
        • 12x2TB HDD per node (216 spindles, 864 TB)
        • 40Gb/s Infiniband between nodes
        • 10Gb/s Ethernet to datacenter
      • Apache Hadoop
      • Oracle NoSQL
      • Oracle loader for Hadoop
        • Multi-stage C-optimized unidirectional loader
      • www.oracle.com/us/bigdata/index.html
  52. (diagram labels: ORACLE EXALYTICS, ORACLE EXALOGIC, ORACLE Big Data Appliance, Oracle WEBLOGIC, Oracle ESSBASE, Oracle NoSQL, ORACLE EXADATA, ORACLE LOADER FOR HADOOP, APACHE HADOOP, Oracle RDBMS, Oracle TIMES TEN)
  53. Microsoft
  54. Hadapt
      • Formerly HadoopDB – Hadoop/Postgres hybrid
      • Postgres servers on data nodes allow for accelerated (indexed) Hive queries
      • Extensions to the Hive optimizer
      • http://www.hadapt.com/
  55. Greenplum
      • SQL based access to HDFS data via in-DB MapReduce
      • http://www.greenplum.com/sites/default/files/EMC_Greenplum_Hadoop_DB_TB_0.pdf
  56. Toad for Cloud Databases
      • Federated SQL queries across Hive, HBase, NoSQL, RDBMS
  57. Conclusions
      • RDBMS-Hadoop interoperability is key to Enterprise Hadoop adoption
      • SQOOP provides a good general purpose framework for transferring data between any JDBC database and Hadoop
      • We’d like to see it become a standard
      • Each RDBMS offers distinct tuning opportunities, so optimized SQOOP extensions offer real value
      • Hadoop-RDBMS integration projects are proliferating rapidly

Editor's Notes

  • Insanely popular – literally millions of users
