
Hadoop and RDBMS with SQOOP



Presentation given at Hadoop World NYC 2011. Moving data between Hadoop and RDBMS with SQOOP



  1. Exchanging data with the Elephant: Connecting Hadoop and an RDBMS using SQOOP
     Guy Harrison
     Director, R&D Melbourne
     @guyharrison
  2. Introductions
  3. (image slide, no text)
  4. Agenda
     - RDBMS-Hadoop interoperability scenarios
     - Interoperability options
     - Cloudera SQOOP
     - Extending SQOOP
     - Quest OraOop extension for Cloudera SQOOP
     - Performance comparisons
     - Lessons learned and best practices
  5. Scenario #1: Reference data in RDBMS
     (diagram) RDBMS: Customers, Products; HDFS: WebLogs
  6. Scenario #2: Hadoop for off-line analytics
     (diagram) RDBMS: Customers, Products; HDFS: Sales History
  7. Scenario #3: Hadoop for RDBMS archive
     (diagram) RDBMS: Sales 2010; HDFS: Sales 2009, Sales 2008
  8. Scenario #4: MapReduce results to RDBMS
     (diagram) HDFS: WebLogs; RDBMS: WebLog Summary
  9. Options for RDBMS inter-op
     DBInputFormat:
     - Allows database records to be used as mapper inputs
     BUT:
     - Not inherently scalable or efficient
     - For repeated analysis, better to stage the data in Hadoop
     - Tedious coding of DBWritable classes for each table
     SQOOP:
     - Open source utility provided by Cloudera
     - Configurable command-line interface to copy RDBMS -> HDFS
     - Support for Hive, HBase
     - Generates Java classes for future MapReduce tasks
     - Extensible to provide optimized adaptors for specific targets
     - Bi-directional
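As a minimal sketch of the command-line interface mentioned above, a basic SQOOP import might look like the following. The JDBC URL, credentials, table name, and target directory are placeholders, not details from the presentation:

```shell
# Import the CUSTOMERS table into HDFS as text files,
# splitting the work across four parallel mappers.
sqoop import \
  --connect jdbc:oracle:thin:@dbhost:1521:orcl \
  --username scott \
  --password tiger \
  --table CUSTOMERS \
  --num-mappers 4 \
  --target-dir /data/customers
```

The same tool handles the reverse direction via `sqoop export`, which is what makes it bi-directional.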
  10. SQOOP details
      SQOOP import:
      - Divides the table into ranges using primary key max/min
      - Creates a mapper for each range
      - Mappers write to multiple HDFS nodes
      - Creates text or sequence files
      - Generates a Java class for the resulting HDFS file
      - Generates a Hive definition and auto-loads into Hive
      SQOOP export:
      - Reads files in an HDFS directory via MapReduce
      - Bulk parallel insert into the database table
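The export path described above can be sketched as follows; the connection string, table, and HDFS directory are hypothetical:

```shell
# Read the files under /data/weblog_summary via MapReduce and
# bulk-insert the rows into the WEBLOG_SUMMARY database table.
sqoop export \
  --connect jdbc:oracle:thin:@dbhost:1521:orcl \
  --username scott \
  --password tiger \
  --table WEBLOG_SUMMARY \
  --export-dir /data/weblog_summary
```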
  11. SQOOP details
      SQOOP features:
      - Compatible with almost any JDBC-enabled database
      - Auto-load into Hive
      - HBase support
      - Special handling for database LOBs
      - Job management
      - Cluster configuration (jar file distribution)
      - WHERE clause support
      - Open source, and included in Cloudera distributions
      SQOOP fast paths and plug-ins:
      - Invokes mysqldump/mysqlimport for MySQL jobs
      - Similar fast paths for PostgreSQL
      - Extensibility architecture for third parties (like Quest): Teradata, Netezza, etc.
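The WHERE clause support and Hive auto-load listed above combine naturally in one invocation; the database, table, and predicate below are illustrative placeholders:

```shell
# Import only the 2010 sales rows and register the result
# as a Hive table in one step.
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl \
  --table SALES \
  --where "sale_year = 2010" \
  --hive-import
```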
  12. Working with Oracle
      The SQOOP approach is generic and applicable to all RDBMSs. However, for Oracle it is sub-optimal in some respects:
      - Oracle may parallelize and serialize individual mappers
      - The Oracle optimizer may decline to use index range scans
      - Oracle physical storage is often deliberately not in primary key order (reverse key indexes, hash partitioning, etc.)
      - Primary keys are often not evenly distributed
      - Index range scans use single-block random reads, vs. faster multi-block table scans
      - Index range scans load into the Oracle buffer cache:
        - Pollutes the cache, increasing IO for other users
        - Limited help to SQOOP, since rows are only read once
      Luckily, SQOOP extensibility allows us to add optimizations for specific targets.
  13. Oracle – parallelism
      (diagram) Client (JDBC) -> Oracle master (QC) -> two layers of Oracle PQ slaves performing scan, then sort and aggregate, against the Oracle SALES table, for the query:

        SELECT cust_id, SUM (amount_sold)
        FROM sh.sales
        GROUP BY cust_id
        ORDER BY 2 DESC
  14. Oracle – parallelism
      (diagram) Four Hadoop mappers reading the Oracle SALES table into HDFS
  15. Oracle – parallelism
      (diagram) Four Hadoop mappers reading the Oracle SALES table into HDFS
  16. Index range scans
      (diagram) Two Hadoop mappers (ID > 0 AND ID < MAX/2; ID > MAX/2), each with an Oracle session performing an index range scan through the buffer cache, walking index blocks into the Oracle table
  17. Ideal architecture
      (diagram) Four Hadoop mappers, each with a dedicated Oracle session reading the Oracle SALES table directly into HDFS
  18. Quest/Cloudera OraOop for SQOOP
      Design goals:
      - Partition data based on physical storage
      - Bypass Oracle buffering
      - Bypass Oracle parallelism
      - Do not require or use indexes
      - Never read the same data block more than once
      - Support esoteric datatypes (eventually)
      - Support RAC clusters
      Availability:
      - Freely available from
      - Packaged with Cloudera Enterprise
      - Commercial support from Quest/Cloudera within the Enterprise distribution
  19. OraOop throughput
      (chart)
  20. Oracle overhead
      (chart)
  21. Extending SQOOP
      SQOOP lets you concentrate on the RDBMS logic, not the Hadoop plumbing:
      - Extend ManagerFactory (what to handle)
      - Extend ConnManager (DB connection and metadata)
      For imports:
      - Extend DataDrivenDBInputFormat (gets the data)
        - Data allocation (getSplits())
        - Split serialization ("io.serializations" property)
        - Data access logic (createDBRecordReader(), getSelectQuery())
        - Implement progress (nextKeyValue(), getProgress())
      A similar procedure applies for extending exports.
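Once a third-party ManagerFactory is built, it is typically made visible to SQOOP through the `sqoop.connection.factories` property (in sqoop-site.xml or on the command line). The factory class name below is illustrative of how an extension such as OraOop plugs in, not a documented value from the presentation:

```shell
# Register a third-party ManagerFactory so SQOOP can delegate to it
# when it recognises the connection string (class name illustrative).
sqoop import \
  -D sqoop.connection.factories=com.quest.oraoop.OraOopManagerFactory \
  --connect jdbc:oracle:thin:@dbhost:1521:orcl \
  --username scott \
  --table SALES \
  --target-dir /data/sales
```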
  22. SQOOP/OraOop best practices
      - Use sequence files for LOBs, OR set inline-lob-limit
      - Directly control datanodes for the widest destination bandwidth; you can't rely on mapred.max.maps.per.node
      - Set the number of mappers realistically
      - Disable speculative execution (our default); otherwise it leads to duplicate DB reads
      - Set the Oracle row fetch size extra high to keep the mappers streaming to HDFS
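Several of the practices above map directly onto SQOOP arguments. A hypothetical import applying them might look like this (connection details and values are placeholders; the specific numbers are not recommendations from the presentation):

```shell
# Raised fetch size keeps mappers streaming; inline LOB limit controls
# when LOBs spill out of the record; speculative execution is disabled
# to avoid duplicate database reads.
sqoop import \
  -D mapred.map.tasks.speculative.execution=false \
  --connect jdbc:oracle:thin:@dbhost:1521:orcl \
  --username scott \
  --table ORDERS \
  --fetch-size 10000 \
  --inline-lob-limit 16384 \
  --num-mappers 8 \
  --target-dir /data/orders
```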
  23. Conclusion
      - RDBMS-Hadoop interoperability is key to enterprise Hadoop adoption
      - SQOOP provides a good general-purpose tool for transferring data between any JDBC database and Hadoop
      - SQOOP extensions can provide optimizations for specific targets
      - Each RDBMS offers distinct tuning opportunities, so SQOOP extensions offer real value
      - Try out OraOop for SQOOP!