Hadoop and RDBMS with SQOOP


Presentation given at Hadoop World NYC 2011. Moving data between Hadoop and RDBMS with SQOOP

Published in: Technology

  1. 1. Exchanging data with the Elephant: Connecting Hadoop and an RDBMS using SQOOP<br />Guy Harrison<br />Director, R&D Melbourne<br />www.guyharrison.net<br />Guy.harrison@quest.com<br />@guyharrison<br />
  2. 2. Introductions <br />
  3. 3.
  4. 4. Agenda<br />RDBMS-Hadoop interoperability scenarios<br />Interoperability options<br />Cloudera SQOOP<br />Extending SQOOP <br />Quest OraOop extension for Cloudera SQOOP<br />Performance comparisons <br />Lessons learned and best practices<br />
  5. 5. Scenario #1: Reference data in RDBMS <br />RDBMS<br />Customers<br />HDFS<br />Products<br />WebLogs<br />
  6. 6. Scenario #2: Hadoop for off-line analytics<br />RDBMS<br />Customers<br />HDFS<br />Products<br />Sales History<br />
  7. 7. Scenario #3: Hadoop for RDBMS archive<br />RDBMS<br />HDFS<br />Sales 2010<br />Sales 2009 <br />Sales 2008 <br />Sales 2008 <br />
  8. 8. Scenario #4: MapReduce results to RDBMS <br />RDBMS<br />HDFS<br />WebLog<br />Summary<br />WebLogs<br />
  9. 9. Options for RDBMS inter-op<br />DBInputFormat:<br />Allows database records to be used as mapper inputs.<br />BUT:<br />Not inherently scalable or efficient<br />For repeated analysis, better to stage in Hadoop<br />Tedious coding of DBWritable classes for each table<br />SQOOP<br />Open source utility provided by Cloudera<br />Configurable command line interface to copy RDBMS->HDFS<br />Support for Hive, HBase<br />Generates Java classes for future M-R tasks<br />Extensible to provide optimized adaptors for specific targets<br />Bidirectional<br />
  10. 10. SQOOP Details<br />SQOOP import<br />Divide table into ranges using primary key max/min<br />Create mappers for each range<br />Mappers write to multiple HDFS nodes<br />Creates text or sequence files<br />Generates Java class for resulting HDFS file<br />Generates Hive definition and auto-loads into Hive<br />SQOOP export<br />Read files in HDFS directory via MapReduce<br />Bulk parallel insert into database table<br />
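The "divide table into ranges using primary key max/min" step on the slide above can be illustrated with a small sketch. This is a simplified model in Python, not SQOOP's actual splitter code: the function names `split_ranges` and `where_clause` are hypothetical, and it assumes an integer primary key whose minimum and maximum are already known.

```python
def split_ranges(min_id, max_id, num_mappers):
    """Divide the inclusive key range [min_id, max_id] into contiguous
    sub-ranges, one per mapper (fewer if there are fewer keys than mappers).

    Hypothetical helper for illustration; not part of SQOOP.
    """
    if num_mappers < 1:
        raise ValueError("num_mappers must be at least 1")
    if max_id < min_id:
        raise ValueError("max_id must not be smaller than min_id")

    total = max_id - min_id + 1
    n = min(num_mappers, total)
    base, extra = divmod(total, n)

    ranges = []
    lo = min_id
    for i in range(n):
        # The first `extra` ranges absorb one leftover key each.
        size = base + (1 if i < extra else 0)
        hi = lo + size - 1
        ranges.append((lo, hi))
        lo = hi + 1
    return ranges


def where_clause(column, lo, hi):
    """Render the predicate a single mapper would use for its range."""
    return f"{column} >= {lo} AND {column} <= {hi}"


if __name__ == "__main__":
    for lo, hi in split_ranges(1, 100, 4):
        print(where_clause("ID", lo, hi))
```

Each mapper then issues its own query with one of these predicates, which is why the evenness of the key distribution matters for load balance.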
  11. 11. SQOOP details<br />SQOOP features:<br />Compatible with almost any JDBC-enabled database<br />Auto load into Hive<br />HBase support<br />Special handling for database LOBs<br />Job management<br />Cluster configuration (jar file distribution)<br />WHERE clause support<br />Open source, and included in Cloudera distributions<br />SQOOP fast paths & plug-ins<br />Invoke mysqldump, mysqlimport for MySQL jobs<br />Similar fast paths for PostgreSQL<br />Extensibility architecture for 3rd parties (like Quest)<br />Teradata, Netezza, etc.<br />
  12. 12. Working with Oracle<br />SQOOP approach is generic and applicable to all RDBMS<br />However, for Oracle it is sub-optimal in some respects:<br />Oracle may parallelize and serialize individual mappers<br />Oracle optimizer may decline to use index range scans<br />Oracle physical storage often deliberately not in primary key order (reverse key indexes, hash partitioning, etc.)<br />Primary keys are often not evenly distributed<br />Index range scans use single block random reads<br />vs. faster multi-block table scans<br />Index range scans load into Oracle buffer cache<br />Pollutes cache, increasing IO for other users<br />Limited help to SQOOP since rows are only read once<br />Luckily, SQOOP extensibility allows us to add optimizations for specific targets<br />
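The point about unevenly distributed primary keys can be made concrete with a short sketch. It assumes a naive equal-width split of the key range between the minimum and maximum (a simplification, not SQOOP's exact logic) and counts how many rows each mapper would receive. The function name `rows_per_range` and the sample data are hypothetical.

```python
def rows_per_range(ids, num_mappers):
    """Count how many ids fall into each of `num_mappers` equal-width
    slices of the range [min(ids), max(ids)].

    Hypothetical helper for illustration only.
    """
    lo, hi = min(ids), max(ids)
    width = (hi - lo + 1) / num_mappers
    counts = [0] * num_mappers
    for value in ids:
        # Clamp so the maximum value always lands in the last slice.
        index = min(int((value - lo) / width), num_mappers - 1)
        counts[index] += 1
    return counts


if __name__ == "__main__":
    # Evenly spread keys: every mapper gets the same share.
    print(rows_per_range(list(range(1, 101)), 4))

    # One outlier key stretches the range: the first mapper gets almost
    # every row while two mappers get none.
    print(rows_per_range(list(range(1, 901)) + [1_000_000], 4))
```

With skewed keys the job runs only as fast as its busiest mapper, which is one motivation for splitting on physical storage rather than on key values.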
  13. 13. Oracle – parallelism<br />Aggregate<br />Sort<br />Scan<br />Oracle SALES table<br />Oracle PQ Slave<br />Oracle PQ Slave<br />Oracle PQ Slave<br />Oracle PQ Slave<br />Oracle PQ Slave<br />Oracle PQ Slave<br />Oracle Master (QC)<br />Client (JDBC)<br />Oracle PQ Slave<br />Oracle PQ Slave<br />Oracle PQ Slave<br />Oracle PQ Slave<br />Oracle PQ Slave<br />Oracle PQ Slave<br />SELECT cust_id, SUM (amount_sold)<br />FROM sh.sales<br />GROUP BY cust_id<br />ORDER BY 2 DESC<br />
  14. 14. Oracle – parallelism<br />HDFS<br />Oracle SALES table<br />Hadoop Mapper<br />Hadoop Mapper<br />Hadoop Mapper<br />Hadoop Mapper<br />
  15. 15. Oracle – parallelism<br />HDFS<br />Oracle SALES table<br />Hadoop Mapper<br />Hadoop Mapper<br />Hadoop Mapper<br />Hadoop Mapper<br />
  16. 16. Index range scans <br />Hadoop Mapper<br />ID > 0 and ID < MAX/2<br />Hadoop Mapper<br />ID > MAX/2<br />Oracle<br />Session<br />Oracle<br />Session<br />Index range scan<br />Buffer cache<br />Index range scan<br />Index block<br />Index block<br />Index block<br />Index block<br />Index block<br />Index block<br />Oracle table<br />
  17. 17. Ideal architecture<br />HDFS<br />Oracle SALES table<br />Oracle Session<br />Hadoop Mapper<br />Oracle Session<br />Hadoop Mapper<br />Oracle Session<br />Hadoop Mapper<br />Oracle Session<br />Hadoop Mapper<br />
  18. 18. Quest/Cloudera OraOop for SQOOP<br />Design goals<br />Partition data based on physical storage<br />Bypass Oracle buffering<br />Bypass Oracle parallelism<br />Do not require or use indexes<br />Never read the same data block more than once<br />Support esoteric datatypes (eventually)<br />Support RAC clusters<br />Availability:<br />Freely available from www.quest.com/ora-oop<br />Packaged with Cloudera Enterprise<br />Commercial support from Quest/Cloudera within Enterprise distribution<br />
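The design goals "partition data based on physical storage" and "never read the same data block more than once" can be sketched as a work-assignment problem. The slides do not describe OraOop's actual algorithm, so the following is only one plausible approach under stated assumptions: the table is described as a list of extents with block counts, and a greedy heuristic hands each extent to exactly one mapper. The function name `assign_extents` and the extent tuples are hypothetical.

```python
import heapq


def assign_extents(extents, num_mappers):
    """Assign each (extent_id, block_count) pair to exactly one mapper,
    always giving the next-largest extent to the least-loaded mapper.

    Hypothetical illustration; not OraOop's actual algorithm.
    """
    # Min-heap of (blocks_assigned_so_far, mapper_index).
    heap = [(0, mapper) for mapper in range(num_mappers)]
    heapq.heapify(heap)
    assignment = {mapper: [] for mapper in range(num_mappers)}

    for extent_id, blocks in sorted(extents, key=lambda e: e[1], reverse=True):
        load, mapper = heapq.heappop(heap)
        assignment[mapper].append(extent_id)
        heapq.heappush(heap, (load + blocks, mapper))
    return assignment


if __name__ == "__main__":
    extents = [("e1", 8), ("e2", 8), ("e3", 4), ("e4", 4)]
    print(assign_extents(extents, 2))
```

Because every extent belongs to a single mapper, no block is scanned twice, and the split does not depend on indexes or on how key values are distributed.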
  19. 19. OraOop Throughput <br />
  20. 20. Oracle overhead <br />
  21. 21. Extending SQOOP<br /> SQOOP lets you concentrate on the RDBMS logic, not the Hadoop plumbing:<br />Extend ManagerFactory (what to handle)<br />Extend ConnManager (DB connection and metadata)<br />For imports:<br />Extend DataDrivenDBInputFormat (gets the data)<br />Data allocation (getSplits())<br />Split serialization (“io.serializations” property)<br />Data access logic (createDBRecordReader(), getSelectQuery())<br />Implement progress (nextKeyValue(), getProgress())<br />Similar procedure for extending exports<br />
  22. 22. SQOOP/OraOop best practices <br />Use sequence files for LOBs OR<br />Set inline-lob-limit <br />Directly control datanodes for widest destination bandwidth<br />Can’t rely on mapred.max.maps.per.node<br />Set number of mappers realistically <br />Disable speculative execution (our default)<br />Leads to duplicate DB reads <br />Set Oracle row fetch size extra high <br />Keeps the mappers streaming to HDFS<br />
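Several of these practices map onto SQOOP import options. The sketch below assembles a command line in Python as an illustration. The connection string, user, table, and the numeric defaults are placeholder assumptions, and the helper name `build_import_command` is hypothetical. Flag availability (`--fetch-size`, `--inline-lob-limit`, `--as-sequencefile`) and the speculative-execution property name depend on the SQOOP and Hadoop versions in use, so check them against your installation.

```python
def build_import_command(connect, username, table, target_dir,
                         num_mappers=4, fetch_size=5000,
                         inline_lob_limit=None, sequence_file=False):
    """Build an argv list for a `sqoop import` run.

    Hypothetical helper; the default values are illustrative, not
    recommendations taken from the presentation.
    """
    cmd = [
        "sqoop", "import",
        # Generic Hadoop -D options go directly after the tool name.
        "-D", "mapred.map.tasks.speculative.execution=false",
        "--connect", connect,
        "--username", username,
        "--table", table,
        "--target-dir", target_dir,
        "--num-mappers", str(num_mappers),
        "--fetch-size", str(fetch_size),
    ]
    if sequence_file:
        cmd.append("--as-sequencefile")
    if inline_lob_limit is not None:
        cmd += ["--inline-lob-limit", str(inline_lob_limit)]
    return cmd


if __name__ == "__main__":
    # Placeholder connection details for illustration only.
    print(" ".join(build_import_command(
        "jdbc:oracle:thin:@//dbhost:1521/ORCL", "scott", "SALES",
        "/data/sales", num_mappers=8, sequence_file=True)))
```

Returning a list rather than a single string makes the command easy to pass to `subprocess.run` without shell quoting problems.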
  23. 23. Conclusion<br />RDBMS-Hadoop interoperability is key to Enterprise Hadoop adoption<br />SQOOP provides a good general purpose tool for transferring data between any JDBC database and Hadoop<br />SQOOP extensions can provide optimizations for specific targets<br />Each RDBMS offers distinct tuning opportunities, so SQOOP extensions offer real value <br />Try out OraOop for SQOOP!<br />