Hadoop World 2011: Integrating Hadoop with Enterprise RDBMS Using Apache Sqoop and Other Tools - Guy Harrison, Quest Software &  Arvind Prabhakar, Cloudera
 

Presentation Transcript

  • November 2011
    Apache Sqoop (Incubating)
    Integrating Hadoop with Enterprise RDBMS – Part I
    Arvind Prabhakar (arvind at apache dot org)
    Apache Sqoop Committer and Software Engineer at Cloudera
  • Hadoop Data Processing
    1
  • Hadoop Data Processing
    2
  • Hadoop Data Processing
    3
  • Hadoop Data Processing
    4
  • In This Session…
    How Sqoop Works
    Roadmap
    5
  • Data Import
    6
  • Data Import
    7
  • Data Import
    8
  • Data Import
    9
  • Data Import
    10
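The import flow shown above splits the source table across parallel mappers by a key range. A minimal sketch of that data-driven split idea, assuming a numeric split column named `id` (the column name, bounds, and mapper count here are illustrative, not Sqoop's actual code):

```java
// Illustrative sketch only: divide the key range [min, max] of a split
// column into contiguous per-mapper WHERE clauses, the way a
// data-driven import parallelizes a table scan.
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {
    // Produce numMappers contiguous range predicates covering [min, max].
    static List<String> splits(String col, long min, long max, int numMappers) {
        List<String> clauses = new ArrayList<>();
        long size = (max - min + 1) / numMappers;
        long lo = min;
        for (int i = 0; i < numMappers; i++) {
            // Last split absorbs any remainder so the full range is covered.
            long hi = (i == numMappers - 1) ? max : lo + size - 1;
            clauses.add(col + " >= " + lo + " AND " + col + " <= " + hi);
            lo = hi + 1;
        }
        return clauses;
    }

    public static void main(String[] args) {
        for (String c : splits("id", 1, 1000, 4)) {
            System.out.println(c);
        }
    }
}
```

Each clause becomes one mapper's query, which is also why a skewed or sparse key distribution can unbalance the mappers, as the Oracle discussion later in this session illustrates.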
  • Sqoop Overview
    11
  • Pre-processing
    12
  • Code Generation
    13
  • Type Mapping
    14
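Type mapping translates each column's JDBC type into a Java type for the generated record class. A simplified illustration of that idea using `java.sql.Types`; Sqoop's real mapping table covers many more types and configuration options:

```java
// Simplified sketch of JDBC-to-Java type mapping as performed during
// code generation. Not Sqoop's actual mapping table.
import java.sql.Types;

public class TypeMapSketch {
    static String javaTypeFor(int sqlType) {
        switch (sqlType) {
            case Types.INTEGER:   return "Integer";
            case Types.BIGINT:    return "Long";
            case Types.VARCHAR:
            case Types.CHAR:      return "String";
            case Types.DOUBLE:    return "Double";
            case Types.NUMERIC:
            case Types.DECIMAL:   return "java.math.BigDecimal";
            case Types.TIMESTAMP: return "java.sql.Timestamp";
            default:              return "Object"; // fallback for unmapped types
        }
    }

    public static void main(String[] args) {
        System.out.println(javaTypeFor(Types.VARCHAR));
    }
}
```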
  • Data Transfer
    15
  • Data Transfer
    16
  • Data Transfer
    17
  • Post-Processing
    18
  • Sqoop Connectors
    Oracle – Developed by Quest Software
    Couchbase – Developed by Couchbase
    Netezza – Developed by Cloudera
    Teradata – Developed by Cloudera
    SQL Server – Developed by Microsoft
    Microsoft PDW – Developed by Microsoft
    VoltDB – Developed by VoltDB
    19
  • Sqoop Roadmap
    SQOOP-365: Proposal for Sqoop 2.0
    https://issues.apache.org/jira/browse/SQOOP-365
    Highlights
    Sqoop as a Service
    Connections as First Class Objects
    Role based Security
    20
  • Sqoop 2 Architecture (proposed)
    21
  • For More Information
    Website:
    http://incubator.apache.org/sqoop/
    Mailing Lists:
    incubator-sqoop-user-subscribe@apache.org
    incubator-sqoop-dev-subscribe@apache.org
    Issue Tracker:
    http://issues.apache.org/jira/browse/SQOOP
    22
  • Thank You!
    Q & A will be after part II of this session.
    23
  • Guy Harrison, Quest Software
    Integrating Hadoop with Enterprise RDBMS Using Apache SQOOP and Other Tools
  • Introductions
  • 27
  • Agenda
    Scenarios for RDBMS-Hadoop interaction
    Case study: Quest extension to SQOOP
    Other RDBMS-Hadoop integrations
  • Hadoop meets RDBMS – scenarios
  • Scenario #1: Reference data in RDBMS
    (diagram: PRODUCTS and CUSTOMERS tables in the RDBMS; WEBLOGS in HDFS)
  • Scenario #2: Hadoop for off-line analytics
    (diagram: PRODUCTS, CUSTOMERS, and SALES HISTORY tables in the RDBMS; HDFS)
  • Scenario #3: MapReduce output to RDBMS
    (diagram: WEBLOGS in HDFS; WEBLOGS SUMMARY in the RDBMS, read by a DB query tool)
  • Scenario #4: Hadoop as RDBMS “active archive”
    (diagram: recent SALES partitions, 2010 and 2011, in the RDBMS; older SALES
    partitions, 2008 and 2009, archived in HDFS; a query tool spans both)
  • Case Study: extending SQOOP for Oracle
  • SQOOP extensibility
    SQOOP implements a generic approach to RDBMS/Hadoop data transfer
    But database optimization is highly platform specific
    Each RDBMS has distinct optimization strategies
    For Oracle, optimization requires:
    Bypassing Oracle caching layers
    Avoiding Oracle optimizer meddling
    Exploiting Oracle metadata to balance mapper load
  • Reading from Oracle – default SQOOP
    ID > MAX/2
    ID > 0 and ID < MAX/2
    MAPPER
    MAPPER
    CACHE
    ORACLE SESSSION
    ORACLE SESSION
    RANGE SCAN
    RANGE SCAN
    Index block
    Index block
    Index block
    Index block
    Index block
    Index block
    ORACLE TABLE
  • Oracle – parallelism gone bad (1)
    (diagram: four Hadoop mappers reading the Oracle SALES table into HDFS)
  • Oracle – parallelism gone bad (2)
    (diagram: four Hadoop mappers reading an Oracle table into HDFS)
  • Ideal architecture
    (diagram: each of four Hadoop mappers has its own Oracle session reading
    the Oracle table into HDFS)
  • Design goals
    Partition data based on physical storage
    Bypass Oracle buffering
    Bypass Oracle parallelism
    Do not require or use indexes
    Never read the same data block more than once
    Support Oracle datatypes
  • Import Throughput
  • Export Throughput
  • Export load
  • Working with the SQOOP framework
     SQOOP lets you concentrate on the RDBMS logic, not the Hadoop plumbing:
    Extend ManagerFactory (what to handle)
    Extend ConnManager (DB connection and metadata)
    For imports:
    Extend DataDrivenDBInputFormat (gets the data)
    Data allocation (getSplits())
    Split serialization (“io.serializations” property)
    Data access logic (createDBRecordReader(), getSelectQuery())
    Implement progress (nextKeyValue(), getProgress())
    Similar procedure for extending exports
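As a rough sketch of the "what to handle" extension point listed above, the stub below mimics a factory that accepts Oracle JDBC connect strings and declines everything else so the framework can fall through to another handler. The interface here is a simplified stand-in defined locally, not Sqoop's actual ManagerFactory/ConnManager API:

```java
// Simplified, self-contained stand-in for the connector extension points.
// Real Sqoop classes (ManagerFactory, ConnManager) have much fuller APIs.
public class ConnectorSketch {
    // Local stub playing the role of ConnManager (connection + metadata).
    interface ConnManager { String describe(); }

    // "What to handle": return a manager for connect strings this
    // connector owns, or null so the next factory gets a chance.
    static ConnManager accept(String connectString) {
        if (connectString != null && connectString.startsWith("jdbc:oracle:")) {
            return () -> "custom Oracle manager";
        }
        return null;
    }

    public static void main(String[] args) {
        ConnManager m = accept("jdbc:oracle:thin:@//db:1521/orcl");
        System.out.println(m == null ? "not handled" : m.describe());
    }
}
```

The import-side pieces (getSplits(), createDBRecordReader(), getSelectQuery()) then hang off the manager in a similar accept-or-decline style.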
  • Extensions to native SQOOP
    MERGE functionality
    Update if exists, insert otherwise
    Hive connector
    Source defined as HQL query rather than HDFS directory
    Eclipse UI
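The MERGE ("update if exists, insert otherwise") behavior can be pictured as generating an Oracle MERGE statement against a staging source. The builder below is purely illustrative; the staging-table alias and column names are hypothetical, not the connector's actual SQL:

```java
// Illustrative only: builds an Oracle-style MERGE statement of the shape
// that "update if exists, insert otherwise" implies. Table, key, and
// value column names are hypothetical.
public class MergeSqlSketch {
    static String mergeSql(String table, String keyCol, String valCol) {
        return "MERGE INTO " + table + " t USING staged s"
             + " ON (t." + keyCol + " = s." + keyCol + ")"
             + " WHEN MATCHED THEN UPDATE SET t." + valCol + " = s." + valCol
             + " WHEN NOT MATCHED THEN INSERT (" + keyCol + ", " + valCol + ")"
             + " VALUES (s." + keyCol + ", s." + valCol + ")";
    }

    public static void main(String[] args) {
        System.out.println(mergeSql("SALES", "ID", "AMOUNT"));
    }
}
```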
  • Availability
    Apache-licensed source available from:
    https://github.com/QuestSoftwareTCD/OracleSQOOPconnector
    Download from (Quest):
    http://www.quest.com/hadoop/
    Download from (Cloudera):
    http://ccp.cloudera.com/display/SUPPORT/Downloads
  • Other SQOOP connectors
    Microsoft SQL Server:
    http://www.microsoft.com/download/en/details.aspx?id=27584
    Teradata:
    https://ccp.cloudera.com/display/con/Cloudera+Connector+for+Teradata+User+Guide%2C+version+1.0-beta-u4
    Microstrategy:
    https://ccp.cloudera.com/display/con/MicroStrategy+Free+Download+License+Agreement
    Netezza:
    https://ccp.cloudera.com/display/con/Netezza+Free+Download+License+Agreement
    VoltDB:
    http://voltdb.com/company/blog/sqoop-voltdb-export-and-hadoop-integration
  • Other Hadoop – RDBMS integrations
  • Oracle Big Data Appliance
    18 Sun X4270 M2 servers
    48GB per node (864GB total)
    2x6 Core CPU per node (216 total)
    12x2TB HDD per node (216 spindles, 432 TB)
    40Gb/s Infiniband between nodes
    10Gb/s Ethernet to datacenter
    Apache Hadoop
    Oracle NoSQL
    Oracle loader for Hadoop
    Multi-stage C-optimized unidirectional loader
    www.oracle.com/us/bigdata/index.html
  • Oracle Exalytics
    (diagram: Oracle Exalogic, Big Data Appliance, and Exadata, linked by the
    Oracle Loader for Hadoop; components shown include Oracle WebLogic,
    Essbase, NoSQL, Apache Hadoop, Oracle RDBMS, and TimesTen)
  • Microsoft
  • Hadapt
    Formerly HadoopDB – Hadoop/Postgres hybrid
    Postgres servers on data nodes allow for accelerated (indexed) Hive queries
    Extensions to the Hive optimizer
    http://www.hadapt.com/
  • Greenplum
    SQL based access to HDFS data via in-DB MapReduce
    http://www.greenplum.com/sites/default/files/EMC_Greenplum_Hadoop_DB_TB_0.pdf
  • Toad for Cloud Databases
    Federated SQL queries across Hive, HBase, NoSQL, RDBMS
  • Conclusions
    RDBMS-Hadoop interoperability is key to Enterprise Hadoop adoption
    SQOOP provides a good general purpose framework for transferring data between any JDBC database and Hadoop
    We’d like to see it become a standard
    Each RDBMS offers distinct tuning opportunities, so optimized SQOOP extensions offer real value
    Hadoop-RDBMS integration projects are proliferating rapidly