Hadoop World 2011: Integrating Hadoop with Enterprise RDBMS Using Apache Sqoop and Other Tools - Guy Harrison, Quest Software & Arvind Prabhakar, Cloudera

Published in: Technology

0 Comments
13 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
4,678
On Slideshare
0
From Embeds
0
Number of Embeds
6
Actions
Shares
0
Downloads
279
Comments
0
Likes
13
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide
  • Insanely popular – literally millions of users
  • Transcript

    • 1. November 2011
      Apache Sqoop (Incubating)
      Integrating Hadoop with Enterprise RDBMS – Part I
      Arvind Prabhakar (arvind at apache dot org)
      Apache Sqoop Committer and Software Engineer at Cloudera
    • 2. Hadoop Data Processing
    • 3. Hadoop Data Processing
    • 4. Hadoop Data Processing
    • 5. Hadoop Data Processing
    • 6. In This Session…
      How Sqoop Works
      Roadmap
    • 7. Data Import
    • 8. Data Import
    • 9. Data Import
    • 10. Data Import
    • 11. Data Import
    • 12. Sqoop Overview
    • 13. Pre-processing
    • 14. Code Generation
    • 15. Type Mapping
    • 16. Data Transfer
    • 17. Data Transfer
    • 18. Data Transfer
    • 19. Post-Processing
    • 20. Sqoop Connectors
      Oracle – Developed by Quest Software
      Couchbase – Developed by Couchbase
      Netezza – Developed by Cloudera
      Teradata – Developed by Cloudera
      SQL Server – Developed by Microsoft
      Microsoft PDW – Developed by Microsoft
      VoltDB – Developed by VoltDB
    • 21. Sqoop Roadmap
      SQOOP-365: Proposal for Sqoop 2.0
      https://issues.apache.org/jira/browse/SQOOP-365
      Highlights
      Sqoop as a Service
      Connections as First Class Objects
      Role based Security
    • 22. Sqoop 2 Architecture (proposed)
    • 23. For More Information
      Website:
      http://incubator.apache.org/sqoop/
      Mailing Lists:
      incubator-sqoop-user-subscribe@apache.org
      incubator-sqoop-dev-subscribe@apache.org
      Issue Tracker:
      http://issues.apache.org/jira/browse/SQOOP
    • 24. Thank You!
      Q & A will be after part II of this session.
    • 25. Guy Harrison, Quest Software
      Integrating Hadoop with Enterprise RDBMS Using Apache SQOOP and Other Tools
    • 26. Introductions
    • 27.
    • 28.
    • 29. Agenda
      Scenarios for RDBMS-Hadoop interaction
      Case study: Quest extension to SQOOP
      Other RDBMS-Hadoop integrations
    • 30. Hadoop meets RDBMS – scenarios
    • 31. Scenario #1: Reference data in RDBMS
      [Diagram labels: PRODUCTS, CUSTOMERS, WEBLOGS, HDFS, RDBMS]
    • 32. Scenario #2: Hadoop for off-line analytics
      [Diagram labels: PRODUCTS, CUSTOMERS, SALES HISTORY, HDFS, RDBMS]
    • 33. Scenario #3: MapReduce output to RDBMS
      [Diagram labels: DB QUERY TOOL, WEBLOGS SUMMARY, WEBLOGS, HDFS, RDBMS]
    • 34. Scenario #4: Hadoop as RDBMS “active archive”
      [Diagram labels: QUERY TOOL, SALES 2011, SALES 2010, SALES 2009, SALES 2008, HDFS, RDBMS; SALES 2009 and SALES 2008 each appear twice]
    • 35. Case Study: extending SQOOP for Oracle
    • 36. SQOOP extensibility
      SQOOP implements a generic approach to RDBMS/Hadoop data transfer
      But database optimization is highly platform specific
      Each RDBMS has distinct optimization strategies
      For Oracle, optimization requires:
      Bypassing Oracle caching layers
      Avoiding Oracle optimizer meddling
      Exploiting Oracle metadata to balance mapper load
    • 37. Reading from Oracle – default SQOOP
      [Diagram labels: two MAPPERs with the predicates "ID > 0 and ID < MAX/2" and "ID > MAX/2"; two ORACLE SESSIONs, each with a RANGE SCAN; CACHE; six index blocks; ORACLE TABLE]
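
The range predicates on slide 37 can be illustrated with a small sketch. This is not Sqoop's actual splitter code: the function name and the closed-interval form are assumptions made for illustration. It only shows the general idea of dividing a numeric key range into one WHERE-clause predicate per mapper.

```python
def range_split_predicates(column, min_val, max_val, num_mappers):
    """Divide the closed integer range [min_val, max_val] into
    contiguous, non-overlapping predicates, one per mapper.

    Hypothetical helper for illustration only; not part of Sqoop.
    """
    if num_mappers < 1:
        raise ValueError("num_mappers must be at least 1")
    if max_val < min_val:
        raise ValueError("max_val must not be smaller than min_val")

    total = max_val - min_val + 1
    base, extra = divmod(total, num_mappers)

    predicates = []
    lo = min_val
    for i in range(num_mappers):
        # The first `extra` mappers take one additional key each.
        count = base + (1 if i < extra else 0)
        if count == 0:
            continue  # more mappers than key values
        hi = lo + count - 1
        predicates.append(f"{column} >= {lo} AND {column} <= {hi}")
        lo = hi + 1
    return predicates


if __name__ == "__main__":
    for predicate in range_split_predicates("ID", 1, 100, 2):
        print(predicate)
```

Each predicate of this kind is evaluated by the database as an ordinary query, which is why the slide shows every mapper's session performing an index range scan.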
    • 38. Oracle – parallelism gone bad (1)
      [Diagram labels: HDFS, Oracle SALES table, four Hadoop Mappers]
    • 39. Oracle – parallelism gone bad (2)
      [Diagram labels: HDFS, Oracle table, four Hadoop Mappers]
    • 40. Ideal architecture
      [Diagram labels: HDFS, ORACLE TABLE, four ORACLE SESSIONs, four Hadoop Mappers]
    • 41. Design goals
      Partition data based on physical storage
      Bypass Oracle buffering
      Bypass Oracle parallelism
      Do not require or use indexes
      Never read the same data block more than once
      Support Oracle datatypes
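
The slides do not show how the connector partitions data by physical storage, so the following is only a sketch of the idea under stated assumptions. It assumes a list of table extents is already available (for Oracle this could plausibly come from a dictionary view such as DBA_EXTENTS, which is an assumption here). The `Extent` type and `allocate_extents` function are hypothetical names. Each extent goes to exactly one mapper, so no block is read twice, and a greedy largest-first pass keeps the block counts per mapper roughly even.

```python
import heapq
from typing import List, NamedTuple


class Extent(NamedTuple):
    """A contiguous run of data blocks (hypothetical type for illustration)."""
    file_id: int
    start_block: int
    block_count: int


def allocate_extents(extents: List[Extent], num_mappers: int) -> List[List[Extent]]:
    """Assign every extent to exactly one mapper, balancing total blocks.

    Greedy strategy: take extents from largest to smallest and give each
    one to the mapper that currently has the fewest blocks.
    """
    if num_mappers < 1:
        raise ValueError("num_mappers must be at least 1")

    # Min-heap of (blocks assigned so far, mapper index).
    heap = [(0, m) for m in range(num_mappers)]
    heapq.heapify(heap)
    assignment: List[List[Extent]] = [[] for _ in range(num_mappers)]

    for extent in sorted(extents, key=lambda e: e.block_count, reverse=True):
        load, mapper = heapq.heappop(heap)
        assignment[mapper].append(extent)
        heapq.heappush(heap, (load + extent.block_count, mapper))
    return assignment


if __name__ == "__main__":
    demo = [Extent(1, 0, 8), Extent(1, 8, 8), Extent(2, 0, 4),
            Extent(2, 4, 4), Extent(2, 8, 4), Extent(2, 12, 4)]
    for i, chunk in enumerate(allocate_extents(demo, 2)):
        print(i, sum(e.block_count for e in chunk), chunk)
```

Because the allocation works on storage extents rather than key values, it needs no index on the table, which matches the "do not require or use indexes" goal.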
    • 42. Import Throughput
    • 43.
    • 44. Export Throughput
    • 45. Export load
    • 46. Working with the SQOOP framework
      SQOOP lets you concentrate on the RDBMS logic, not the Hadoop plumbing:
      Extend ManagerFactory (what to handle)
      Extend ConnManager (DB connection and metadata)
      For imports:
      Extend DataDrivenDBInputFormat (gets the data)
      Data allocation (getSplits())
      Split serialization (“io.serializations” property)
      Data access logic (createDBRecordReader(), getSelectQuery())
      Implement progress (nextKeyValue(), getProgress())
      Similar procedure for extending exports
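
The real extension points are Java classes in Sqoop and Hadoop. As a language-neutral illustration of the "implement progress" step, here is a minimal Python sketch of a reader that mirrors the roles of nextKeyValue() and getProgress(). The class name, constructor arguments and row source are all hypothetical. It assumes the reader knows in advance roughly how many rows its split contains.

```python
class SketchRecordReader:
    """Toy stand-in for a record reader over one split (illustration only)."""

    def __init__(self, rows, expected_rows):
        self._rows = iter(rows)
        self._expected = expected_rows
        self._read = 0
        self._current = None

    def next_key_value(self):
        """Advance to the next row; return False when the split is exhausted."""
        try:
            self._current = next(self._rows)
        except StopIteration:
            self._current = None
            return False
        self._read += 1
        return True

    def get_current_value(self):
        """Return the row most recently read, or None if there is none."""
        return self._current

    def get_progress(self):
        """Fraction of the split consumed so far, clamped to [0.0, 1.0]."""
        if self._expected <= 0:
            return 1.0
        return min(1.0, self._read / self._expected)


if __name__ == "__main__":
    reader = SketchRecordReader([("a",), ("b",), ("c",), ("d",)], expected_rows=4)
    while reader.next_key_value():
        print(reader.get_current_value(), reader.get_progress())
```

Clamping the result guards against an expected row count that turns out to be too low, for example when it is based on stale table statistics.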
    • 47. Extensions to native SQOOP
      MERGE functionality
      Update if exists, insert otherwise
      Hive connector
      Source defined as HQL query rather than HDFS directory
      Eclipse UI
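
"Update if exists, insert otherwise" corresponds to the SQL MERGE statement. The sketch below builds an Oracle-style MERGE with bind placeholders. It is not the connector's actual code: the function name is hypothetical, and the table and column names are placeholders. Identifiers are interpolated without quoting or validation, so it is for illustration only.

```python
def build_merge_sql(table, key_columns, value_columns):
    """Build an Oracle-style MERGE statement using named bind placeholders.

    key_columns decide whether a row already exists; value_columns are
    updated on a match. Hypothetical helper; identifiers are not escaped.
    """
    if not key_columns or not value_columns:
        raise ValueError("need at least one key column and one value column")

    all_columns = list(key_columns) + list(value_columns)
    source = ", ".join(f":{c} AS {c}" for c in all_columns)
    on_clause = " AND ".join(f"t.{c} = s.{c}" for c in key_columns)
    update_set = ", ".join(f"t.{c} = s.{c}" for c in value_columns)
    insert_cols = ", ".join(all_columns)
    insert_vals = ", ".join(f"s.{c}" for c in all_columns)

    return (
        f"MERGE INTO {table} t "
        f"USING (SELECT {source} FROM dual) s "
        f"ON ({on_clause}) "
        f"WHEN MATCHED THEN UPDATE SET {update_set} "
        f"WHEN NOT MATCHED THEN INSERT ({insert_cols}) VALUES ({insert_vals})"
    )


if __name__ == "__main__":
    print(build_merge_sql("sales", ["id"], ["amount"]))
```

Only the non-key columns appear in the UPDATE clause, since Oracle does not allow updating columns that are referenced in the ON condition.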
    • 48. Availability
      Apache licensed source available from:
      https://github.com/QuestSoftwareTCD/OracleSQOOPconnector
      Download from (Quest):
      http://www.quest.com/hadoop/
      Download from (Cloudera):
      http://ccp.cloudera.com/display/SUPPORT/Downloads
    • 49. Other SQOOP connectors
      Microsoft SQL Server:
      http://www.microsoft.com/download/en/details.aspx?id=27584
      Teradata:
      https://ccp.cloudera.com/display/con/Cloudera+Connector+for+Teradata+User+Guide%2C+version+1.0-beta-u4
      MicroStrategy:
      https://ccp.cloudera.com/display/con/MicroStrategy+Free+Download+License+Agreement
      Netezza:
      https://ccp.cloudera.com/display/con/Netezza+Free+Download+License+Agreement
      VoltDB:
      http://voltdb.com/company/blog/sqoop-voltdb-export-and-hadoop-integration
    • 50. Other Hadoop – RDBMS integrations
    • 51. Oracle Big Data Appliance
      18 Sun X4270 M2 servers
      48GB per node (864GB total)
      2x6 Core CPU per node (216 cores total)
      12x2TB HDD per node (216 spindles, 432 TB)
      40Gb/s Infiniband between nodes
      10Gb/s Ethernet to datacenter
      Apache Hadoop
      Oracle NoSQL
      Oracle loader for Hadoop
      Multi-stage C-optimized unidirectional loader
      www.oracle.com/us/bigdata/index.html
    • 52. ORACLE EXALYTICS
      [Diagram labels: ORACLE EXALOGIC, ORACLE Big Data Appliance, Oracle WEBLOGIC, Oracle ESSBASE, Oracle NoSQL, ORACLE EXADATA, ORACLE LOADER FOR HADOOP, APACHE HADOOP, Oracle RDBMS, Oracle TIMES TEN]
    • 53. Microsoft
    • 54. Hadapt
      Formerly HadoopDB – Hadoop/Postgres hybrid
      Postgres servers on data nodes allow for accelerated (indexed) HIVE queries
      Extensions to the Hive optimizer
      http://www.hadapt.com/
    • 55. Greenplum
      SQL based access to HDFS data via in-DB MapReduce
      http://www.greenplum.com/sites/default/files/EMC_Greenplum_Hadoop_DB_TB_0.pdf
    • 56. Toad for Cloud Databases
      Federated SQL queries across Hive, HBase, NoSQL, RDBMS
    • 57. Conclusions
      RDBMS-Hadoop interoperability is key to Enterprise Hadoop adoption
      SQOOP provides a good general purpose framework for transferring data between any JDBC database and Hadoop
      We’d like to see it become a standard
      Each RDBMS offers distinct tuning opportunities, so optimized SQOOP extensions offer real value
      Hadoop-RDBMS integration projects are proliferating rapidly