Hadoop and RDBMS with SQOOP

Exchanging data with the Elephant: Connecting Hadoop and an RDBMS using SQOOP
Guy Harrison, Director, R&D Melbourne
www.guyharrison.net
Guy.harrison@quest.com
@guyharrison

Agenda
- RDBMS-Hadoop interoperability scenarios
- Interoperability options
- Cloudera SQOOP
- Extending SQOOP
- Quest OraOop extension for Cloudera SQOOP
- Performance comparisons
- Lessons learned and best practices

Scenario #1: Reference data in RDBMS
[Diagram labels: RDBMS, Customers, HDFS, Products, WebLogs]

Scenario #2: Hadoop for off-line analytics
[Diagram labels: RDBMS, Customers, HDFS, Products, Sales History]

Scenario #3: Hadoop for RDBMS archive
[Diagram labels: RDBMS, HDFS, Sales 2010, Sales 2009, Sales 2008, Sales 2008]

Scenario #4: MapReduce results to RDBMS
[Diagram labels: RDBMS, HDFS, WebLog Summary, WebLogs]

Options for RDBMS inter-op
DBInputFormat:
- Allows database records to be used as mapper inputs
- BUT:
  - Not inherently scalable or efficient
  - For repeated analysis, better to stage in Hadoop
  - Tedious coding of DBWritable classes for each table
SQOOP:
- Open source utility provided by Cloudera
- Configurable command line interface to copy RDBMS->HDFS
- Support for Hive, HBase
- Generates Java classes for future M-R tasks
- Extensible to provide optimized adaptors for specific targets
- Bi-directional

SQOOP details
SQOOP import:
- Divide table into ranges using primary key max/min
- Create mappers for each range
- Mappers write to multiple HDFS nodes
- Creates text or sequence files
- Generates Java class for resulting HDFS file
- Generates Hive definition and auto-loads into Hive
SQOOP export:
- Read files in HDFS directory via MapReduce
- Bulk parallel insert into database table
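The range-splitting step of the import can be illustrated with a small sketch. This is not SQOOP's own splitter, only a minimal Python illustration of the idea, assuming an integer primary key; the function names `split_ranges` and `split_queries` and the table and column names are hypothetical.

```python
def split_ranges(min_id, max_id, num_mappers):
    """Divide the closed key range [min_id, max_id] into contiguous,
    non-overlapping sub-ranges, at most one per mapper."""
    if num_mappers < 1:
        raise ValueError("num_mappers must be at least 1")
    if max_id < min_id:
        raise ValueError("max_id must not be smaller than min_id")
    total = max_id - min_id + 1
    num = min(num_mappers, total)      # never create empty ranges
    base, extra = divmod(total, num)   # the first `extra` ranges get one more key
    ranges = []
    lo = min_id
    for i in range(num):
        size = base + (1 if i < extra else 0)
        hi = lo + size - 1
        ranges.append((lo, hi))
        lo = hi + 1
    return ranges


def split_queries(table, key, ranges):
    """Build one bounded SELECT per range (illustrative SQL text only)."""
    return [
        f"SELECT * FROM {table} WHERE {key} >= {lo} AND {key} <= {hi}"
        for lo, hi in ranges
    ]


if __name__ == "__main__":
    # Hypothetical table and key column, four mappers.
    for query in split_queries("sales", "id", split_ranges(1, 100, 4)):
        print(query)
```

Each mapper would run one of these bounded queries and write its rows to HDFS, so the ranges together cover every key exactly once.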

SQOOP details
SQOOP features:
- Compatible with almost any JDBC-enabled database
- Auto load into Hive
- HBase support
- Special handling for database LOBs
- Job management
- Cluster configuration (jar file distribution)
- WHERE clause support
- Open source, and included in Cloudera distributions
SQOOP fast paths & plug-ins:
- Invoke mysqldump, mysqlimport for MySQL jobs
- Similar fast paths for PostgreSQL
- Extensibility architecture for 3rd parties (like Quest)
  - Teradata, Netezza, etc.

Working with Oracle
- SQOOP approach is generic and applicable to all RDBMS
- However, for Oracle it is sub-optimal in some respects:
  - Oracle may parallelize and serialize individual mappers
  - Oracle optimizer may decline to use index range scans
  - Oracle physical storage is often deliberately not in primary key order (reverse key indexes, hash partitioning, etc.)
  - Primary keys are often not evenly distributed
  - Index range scans use single-block random reads vs. faster multi-block table scans
  - Index range scans load into the Oracle buffer cache
    - Pollutes the cache, increasing IO for other users
    - Limited help to SQOOP since rows are only read once
- Luckily, SQOOP extensibility allows us to add optimizations for specific targets
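The point about unevenly distributed primary keys can be made concrete with a short sketch. It counts how many rows fall into each equal-width key range, which is what each mapper would have to read under a min/max split. The function name `rows_per_range` and the sample key sets are invented for illustration.

```python
def rows_per_range(keys, num_mappers):
    """Count how many keys land in each of num_mappers equal-width
    ranges spanning min(keys)..max(keys)."""
    lo, hi = min(keys), max(keys)
    width = (hi - lo + 1) / num_mappers
    counts = [0] * num_mappers
    for k in keys:
        # Clamp so the largest key always lands in the last range.
        idx = min(int((k - lo) / width), num_mappers - 1)
        counts[idx] += 1
    return counts


if __name__ == "__main__":
    uniform = list(range(1, 1001))             # keys 1..1000, no gaps
    skewed = list(range(1, 901)) + [1_000_000]  # one outlier stretches the range
    print(rows_per_range(uniform, 4))
    print(rows_per_range(skewed, 4))
```

With the uniform keys every mapper gets the same share of rows. With the skewed keys one mapper does almost all of the work while the others sit idle, even though the key ranges are equal in width.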

Oracle – parallelism
[Diagram: a Client (JDBC) connects to the Oracle Master (QC), which coordinates sets of Oracle PQ Slaves performing Scan, Sort and Aggregate steps against the Oracle SALES table]
    SELECT cust_id, SUM (amount_sold)
    FROM sh.sales
    GROUP BY cust_id
    ORDER BY 2 DESC

Oracle – parallelism
[Diagram labels: HDFS, Oracle SALES table, four Hadoop Mappers]

Index range scans
[Diagram: two Hadoop Mappers, one reading ID > 0 and ID < MAX/2 and the other reading ID > MAX/2, each with its own Oracle Session performing an index range scan over index blocks, with the buffer cache and the Oracle table]

Ideal architecture
[Diagram labels: HDFS, Oracle SALES table, four Hadoop Mappers, each paired with its own Oracle Session]

Quest/Cloudera OraOop for SQOOP
Design goals:
- Partition data based on physical storage
- Bypass Oracle buffering
- Bypass Oracle parallelism
- Do not require or use indexes
- Never read the same data block more than once
- Support esoteric datatypes (eventually)
- Support RAC clusters
Availability:
- Freely available from www.quest.com/ora-oop
- Packaged with Cloudera Enterprise
- Commercial support from Quest/Cloudera within Enterprise distribution
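One way to picture "partition data based on physical storage" and "never read the same data block more than once" is to hand out storage extents, rather than key ranges, to mappers. The sketch below is not OraOop's algorithm, which the slides do not describe; it is a generic greedy balancer under the assumption that the table's extents are known as (file, start block, block count) triples. The `Extent` type and `assign_extents` function are hypothetical.

```python
import heapq
from typing import NamedTuple


class Extent(NamedTuple):
    """A contiguous run of data blocks (hypothetical representation)."""
    file_id: int
    start_block: int
    num_blocks: int


def assign_extents(extents, num_mappers):
    """Give every extent to exactly one mapper, always choosing the mapper
    with the fewest blocks so far (largest extents are placed first)."""
    heap = [(0, m) for m in range(num_mappers)]  # (blocks assigned, mapper id)
    heapq.heapify(heap)
    assignment = {m: [] for m in range(num_mappers)}
    for ext in sorted(extents, key=lambda e: e.num_blocks, reverse=True):
        load, mapper = heapq.heappop(heap)
        assignment[mapper].append(ext)
        heapq.heappush(heap, (load + ext.num_blocks, mapper))
    return assignment


if __name__ == "__main__":
    extents = [
        Extent(1, 0, 8), Extent(1, 8, 8),
        Extent(2, 0, 16), Extent(2, 16, 4), Extent(3, 0, 4),
    ]
    for mapper, exts in assign_extents(extents, 2).items():
        print(mapper, sum(e.num_blocks for e in exts), exts)
```

Because each extent goes to exactly one mapper, no block is read twice, and the balance depends on block counts rather than on how primary key values happen to be distributed.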

Extending SQOOP
SQOOP lets you concentrate on the RDBMS logic, not the Hadoop plumbing:
- Extend ManagerFactory (what to handle)
- Extend ConnManager (DB connection and metadata)
- For imports: extend DataDrivenDBInputFormat (gets the data)
  - Data allocation (getSplits())
  - Split serialization (“io.serializations” property)
  - Data access logic (createDBRecordReader(), getSelectQuery())
  - Implement progress (nextKeyValue(), getProgress())
- Similar procedure for extending exports

SQOOP/OraOop best practices
- Use sequence files for LOBs OR set inline-lob-limit
- Directly control datanodes for widest destination bandwidth
  - Can’t rely on mapred.max.maps.per.node
- Set number of mappers realistically
- Disable speculative execution (our default)
  - Speculative execution leads to duplicate DB reads
- Set Oracle row fetch size extra high
  - Keeps the mappers streaming to HDFS
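Several of these practices map onto command-line settings. The sketch below assembles a `sqoop import` argument list in Python. It assumes the Sqoop 1.x option names `--num-mappers`, `--fetch-size`, `--as-sequencefile` and `--inline-lob-limit`, and the classic Hadoop property `mapred.map.tasks.speculative.execution`; availability of each depends on the versions in use, so check them against your distribution. The helper `build_sqoop_import`, the connection string, the table name, the target directory and the default values are all placeholders.

```python
import shlex


def build_sqoop_import(connect, table, target_dir, num_mappers=4,
                       fetch_size=5000, sequence_file=False,
                       inline_lob_limit=None):
    """Return an argv list for a 'sqoop import' run (illustrative only)."""
    cmd = [
        "sqoop", "import",
        # Generic Hadoop options go directly after the tool name.
        "-D", "mapred.map.tasks.speculative.execution=false",
        "--connect", connect,
        "--table", table,
        "--target-dir", target_dir,
        "--num-mappers", str(num_mappers),
        "--fetch-size", str(fetch_size),
    ]
    if sequence_file:
        cmd.append("--as-sequencefile")
    if inline_lob_limit is not None:
        cmd += ["--inline-lob-limit", str(inline_lob_limit)]
    return cmd


if __name__ == "__main__":
    argv = build_sqoop_import(
        "jdbc:oracle:thin:@//dbhost:1521/orcl",  # placeholder connect string
        "SH.SALES", "/data/sales",
        num_mappers=8, sequence_file=True,
    )
    print(" ".join(shlex.quote(a) for a in argv))
```

The mapper count and fetch size here are arbitrary illustrations, not recommendations; the slide's advice is to size them realistically for the database and cluster in question.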

Conclusion
- RDBMS-Hadoop interoperability is key to Enterprise Hadoop adoption
- SQOOP provides a good general purpose tool for transferring data between any JDBC database and Hadoop
- SQOOP extensions can provide optimizations for specific targets
- Each RDBMS offers distinct tuning opportunities, so SQOOP extensions offer real value
- Try out OraOop for SQOOP!
