SlideShare a Scribd company logo
November 2011 Apache Sqoop (Incubating) Integrating Hadoop with Enterprise RDBS – Part I Arvind Prabhakar (arvind at apache dot org) Apache Sqoop Committer and Software Engineer at Cloudera
Hadoop Data Processing 1
Hadoop Data Processing 2
Hadoop Data Processing 3
Hadoop Data Processing 4
In This Session…	 How Sqoop Works Roadmap 5
Data Import 6
Data Import 7
Data Import 8
Data Import 9
Data Import 10
Sqoop Overview 11
Pre-processing 12
Code Generation 13
Type Mapping 14
Data Transfer 15
Data Transfer 16
Data Transfer 17
Post-Processing 18
Sqoop Connectors Oracle – Developed by Quest Software Couchbase – Developed by Couchbase Netezza – Developed by Cloudera Teradata – Developed by Cloudera SQL Server – Developed by Microsoft Microsoft PDW – Developed by Microsoft Volt DB – Developed by Volt DB 19
Sqoop Roadmap SQOOP-365: Proposal for Sqoop 2.0 https://issues.apache.org/jira/browse/SQOOP-365 Highlights Sqoop as a Service Connections as First Class Objects Role based Security 20
Sqoop 2 Architecture (proposed) 21
For More Information Website: http://incubator.apache.org/sqoop/ Mailing Lists: incubator-sqoop-user-subscribe@apache.org incubator-sqoop-dev-subscribe@apache.org ,[object Object],http://issues.apache.org/jira/browse/SQOOP 22
Thank You! Q & A will be after part II of this session.  23
Guy Harrison, Quest Software Integrating Hadoop with Enterprise RDBMS Using Apache SQOOP and Other Tools
Introductions
27
Agenda Scenarios for RDBMS-Hadoop interaction Case study: Quest extension to SQOOP Other RDBMS-Hadoop integrations
Hadoop meets RDBMS – scenarios
Scenario #1: Reference data in RDBMS  PRODUCTS HDFS CUSTOMERS WEBlOGS RDBMS
Scenario #2: Hadoop for off-line analytics PRODUCTS HDFS CUSTOMERS SALES HISTORY RDBMS
Scenario #3: MapReduce output to RDBMS  DB QUERY TOOL WEBLOGS SUMMARY HDFS WEBlOGS RDBMS
Scenario #4: Hadoop as RDBMS “active archive” QUERY TOOL SALES 2011 SALES 2010 HDFS SALES 2009 SALES 2009 SALES 2008 SALES 2008 RDBMS
Case Study: extending SQOOP for Oracle
SQOOP extensibility SQOOP implements a generic approach to RDBMS/Hadoop data transfer But database optimization is highly platform specific Each RDBMS has distinct optimizations strategies For Oracle, optimization requires: Bypassing Oracle caching layers Avoiding Oracle optimizer meddling  Exploiting Oracle metadata to balance mapper load
Reading from Oracle – default SQOOP ID > MAX/2 ID > 0 and ID < MAX/2 MAPPER MAPPER CACHE ORACLE SESSSION ORACLE SESSION RANGE SCAN RANGE SCAN Index block Index block Index block Index block Index block Index block ORACLE TABLE
Oracle – parallelism gone bad (1)  HDFS OracleSALEStable Hadoop Mapper Hadoop Mapper Hadoop Mapper Hadoop Mapper
Oracle – parallelism gone bad (2)  HDFS Oracletable Hadoop  Mapper Hadoop  Mapper Hadoop  Mapper Hadoop  Mapper
Ideal architecture  HDFS ORACLE TABLE ORACLE  SESSION Hadoop  Mapper ORACLE  SESSION Hadoop  Mapper ORACLE  SESSION Hadoop  Mapper ORACLE  SESSION Hadoop  Mapper
Design goals Partition data based on physical storage By-pass Oracle buffering By-pass Oracle parallelism Do not require or use indexes Never read the same data block more than once Support Oracle datatypes
Import Throughput
Export Throughput
Export load
Working with the SQOOP framework  SQOOP lets you concentrate on the RDBMS logic, not the Hadoop plumbing: Extend ManagerFactory (what to handle) Extend ConnManager (DB connection and metadata) For imports: Extend DataDrivenDBInputFormat (gets the data) Data allocation (getSplits()) Split serialization (“io.serializations” property) Data access logic (createDBRecordReader(), getSelectQuery()) Implement progress (nextKeyValue(), getProgress()) Similar procedure for extending exports
Extensions to native SQOOP MERGE functionality Update if exists, insert otherwise Hive connector Source defined as HQL query rather than HDFS directory Eclipse UI
Availability Apache licensed source available from : https://github.com/QuestSoftwareTCD/OracleSQOOPconnector Download from (Quest): http://www.quest.com/hadoop/ Download from (Cloudera): http://ccp.cloudera.com/display/SUPPORT/Downloads
Other SQOOP connectors Microsoft SQL Server: http://www.microsoft.com/download/en/details.aspx?id=27584 Teradata: https://ccp.cloudera.com/display/con/Cloudera+Connector+for+Teradata+User+Guide%2C+version+1.0-beta-u4 Microstrategy: https://ccp.cloudera.com/display/con/MicroStrategy+Free+Download+License+Agreement Nettezza: https://ccp.cloudera.com/display/con/Netezza+Free+Download+License+Agreement VoltDB: http://voltdb.com/company/blog/sqoop-voltdb-export-and-hadoop-integration
Other Hadoop – RDBMS integrations
Oracle Big Data Appliance  18 Sun X4270 M2 servers 48GB per node (864GB total) 2x6 Core CPU per node (216 total) 12x2TB HDD per node (216 spindles, 864 TB) 40Gb/s Infiniband between nodes 10Gb/s Ethernet to datacenter Apache Hadoop Oracle NoSQL  Oracle loader for Hadoop Multi-stage C-optimized unidirectional loader www.oracle.com/us/bigdata/index.html
ORACLE EXALYTICS ORACLE EXALOGIC ORACLE Big Data Appliance Oracle WEBLOGIC Oracle ESSBASE Oracle NoSQL ORACLE EXADATA ORACLE LOADER FOR HADOOP ApACHE HADOOP Oracle RDBMS Oracle TIMES TEN
Microsoft
Hadapt Formally HadoopDB – Hadoop/Postgres hybrid Postgres servers on data nodes allow for accelerated (indexed) HIVE queries Extensions to the Hive optimizer  http://www.hadapt.com/
Greenplum SQL based access to HDFS data via in-DB MapReduce http://www.greenplum.com/sites/default/files/EMC_Greenplum_Hadoop_DB_TB_0.pdf
Toad for Cloud Databases Federated SQL queries across Hive, Hbase, NoSQL, RDBMS
Conclusions RDBMS-Hadoop interoperability is key to Enterprise Hadoop adoption SQOOP provides a good general purpose framework for transferring data between any JDBC database and Hadoop We’d like to see it become a standard Each RDBMS offers distinct tuning opportunities, so optimized SQOOP extensions offer real value  Hadoop-RDBMS integration projects are proliferating rapidly
Hadoop World 2011: Integrating Hadoop with Enterprise RDBMS Using Apache Sqoop and Other Tools - Guy Harrison, Quest Software &  Arvind Prabhakar, Cloudera

More Related Content

Viewers also liked

Big data: Loading your data with flume and sqoop
Big data:  Loading your data with flume and sqoopBig data:  Loading your data with flume and sqoop
Big data: Loading your data with flume and sqoop
Christophe Marchal
 
Hadoop and rdbms with sqoop
Hadoop and rdbms with sqoop Hadoop and rdbms with sqoop
Hadoop and rdbms with sqoop
Guy Harrison
 
Hadoop Summit 2012 | A New Generation of Data Transfer Tools for Hadoop: Sqoop 2
Hadoop Summit 2012 | A New Generation of Data Transfer Tools for Hadoop: Sqoop 2Hadoop Summit 2012 | A New Generation of Data Transfer Tools for Hadoop: Sqoop 2
Hadoop Summit 2012 | A New Generation of Data Transfer Tools for Hadoop: Sqoop 2
Cloudera, Inc.
 
Big data components - Introduction to Flume, Pig and Sqoop
Big data components - Introduction to Flume, Pig and SqoopBig data components - Introduction to Flume, Pig and Sqoop
Big data components - Introduction to Flume, Pig and Sqoop
Jeyamariappan Guru
 
Learning Apache HIVE - Data Warehouse and Query Language for Hadoop
Learning Apache HIVE - Data Warehouse and Query Language for HadoopLearning Apache HIVE - Data Warehouse and Query Language for Hadoop
Learning Apache HIVE - Data Warehouse and Query Language for Hadoop
Someshwar Kale
 
Spring for Apache Hadoop
Spring for Apache HadoopSpring for Apache Hadoop
Spring for Apache Hadoop
zenyk
 
Bridging the gap of Relational to Hadoop using Sqoop @ Expedia
Bridging the gap of Relational to Hadoop using Sqoop @ ExpediaBridging the gap of Relational to Hadoop using Sqoop @ Expedia
Bridging the gap of Relational to Hadoop using Sqoop @ Expedia
DataWorks Summit/Hadoop Summit
 
Introduction to Apache Sqoop
Introduction to Apache SqoopIntroduction to Apache Sqoop
Introduction to Apache Sqoop
Avkash Chauhan
 
Intro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezIntro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco Vasquez
MapR Technologies
 
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Spark Summit
 
Connecting Hadoop and Oracle
Connecting Hadoop and OracleConnecting Hadoop and Oracle
Connecting Hadoop and Oracle
Tanel Poder
 
Hadoop for beginners free course ppt
Hadoop for beginners   free course pptHadoop for beginners   free course ppt
Hadoop for beginners free course ppt
Njain85
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop Tutorial
Edureka!
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
Varun Narang
 
What is Big Data?
What is Big Data?What is Big Data?
What is Big Data?
Bernard Marr
 
Big data ppt
Big  data pptBig  data ppt
Big data ppt
Nasrin Hussain
 

Viewers also liked (16)

Big data: Loading your data with flume and sqoop
Big data:  Loading your data with flume and sqoopBig data:  Loading your data with flume and sqoop
Big data: Loading your data with flume and sqoop
 
Hadoop and rdbms with sqoop
Hadoop and rdbms with sqoop Hadoop and rdbms with sqoop
Hadoop and rdbms with sqoop
 
Hadoop Summit 2012 | A New Generation of Data Transfer Tools for Hadoop: Sqoop 2
Hadoop Summit 2012 | A New Generation of Data Transfer Tools for Hadoop: Sqoop 2Hadoop Summit 2012 | A New Generation of Data Transfer Tools for Hadoop: Sqoop 2
Hadoop Summit 2012 | A New Generation of Data Transfer Tools for Hadoop: Sqoop 2
 
Big data components - Introduction to Flume, Pig and Sqoop
Big data components - Introduction to Flume, Pig and SqoopBig data components - Introduction to Flume, Pig and Sqoop
Big data components - Introduction to Flume, Pig and Sqoop
 
Learning Apache HIVE - Data Warehouse and Query Language for Hadoop
Learning Apache HIVE - Data Warehouse and Query Language for HadoopLearning Apache HIVE - Data Warehouse and Query Language for Hadoop
Learning Apache HIVE - Data Warehouse and Query Language for Hadoop
 
Spring for Apache Hadoop
Spring for Apache HadoopSpring for Apache Hadoop
Spring for Apache Hadoop
 
Bridging the gap of Relational to Hadoop using Sqoop @ Expedia
Bridging the gap of Relational to Hadoop using Sqoop @ ExpediaBridging the gap of Relational to Hadoop using Sqoop @ Expedia
Bridging the gap of Relational to Hadoop using Sqoop @ Expedia
 
Introduction to Apache Sqoop
Introduction to Apache SqoopIntroduction to Apache Sqoop
Introduction to Apache Sqoop
 
Intro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezIntro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco Vasquez
 
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
 
Connecting Hadoop and Oracle
Connecting Hadoop and OracleConnecting Hadoop and Oracle
Connecting Hadoop and Oracle
 
Hadoop for beginners free course ppt
Hadoop for beginners   free course pptHadoop for beginners   free course ppt
Hadoop for beginners free course ppt
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop Tutorial
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
What is Big Data?
What is Big Data?What is Big Data?
What is Big Data?
 
Big data ppt
Big  data pptBig  data ppt
Big data ppt
 

More from Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
Cloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
Cloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
Cloudera, Inc.
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Cloudera, Inc.
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
Cloudera, Inc.
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Cloudera, Inc.
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
Cloudera, Inc.
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
Cloudera, Inc.
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
Cloudera, Inc.
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
Cloudera, Inc.
 

More from Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 

Recently uploaded

Leveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and StandardsLeveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and Standards
Neo4j
 
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansBiomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Neo4j
 
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...
AlexanderRichford
 
"What does it really mean for your system to be available, or how to define w...
"What does it really mean for your system to be available, or how to define w..."What does it really mean for your system to be available, or how to define w...
"What does it really mean for your system to be available, or how to define w...
Fwdays
 
"NATO Hackathon Winner: AI-Powered Drug Search", Taras Kloba
"NATO Hackathon Winner: AI-Powered Drug Search",  Taras Kloba"NATO Hackathon Winner: AI-Powered Drug Search",  Taras Kloba
"NATO Hackathon Winner: AI-Powered Drug Search", Taras Kloba
Fwdays
 
Getting the Most Out of ScyllaDB Monitoring: ShareChat's Tips
Getting the Most Out of ScyllaDB Monitoring: ShareChat's TipsGetting the Most Out of ScyllaDB Monitoring: ShareChat's Tips
Getting the Most Out of ScyllaDB Monitoring: ShareChat's Tips
ScyllaDB
 
Christine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptxChristine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptx
christinelarrosa
 
Demystifying Knowledge Management through Storytelling
Demystifying Knowledge Management through StorytellingDemystifying Knowledge Management through Storytelling
Demystifying Knowledge Management through Storytelling
Enterprise Knowledge
 
AI in the Workplace Reskilling, Upskilling, and Future Work.pptx
AI in the Workplace Reskilling, Upskilling, and Future Work.pptxAI in the Workplace Reskilling, Upskilling, and Future Work.pptx
AI in the Workplace Reskilling, Upskilling, and Future Work.pptx
Sunil Jagani
 
Introducing BoxLang : A new JVM language for productivity and modularity!
Introducing BoxLang : A new JVM language for productivity and modularity!Introducing BoxLang : A new JVM language for productivity and modularity!
Introducing BoxLang : A new JVM language for productivity and modularity!
Ortus Solutions, Corp
 
AppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSFAppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSF
Ajin Abraham
 
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
DanBrown980551
 
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsConnector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
DianaGray10
 
Lee Barnes - Path to Becoming an Effective Test Automation Engineer.pdf
Lee Barnes - Path to Becoming an Effective Test Automation Engineer.pdfLee Barnes - Path to Becoming an Effective Test Automation Engineer.pdf
Lee Barnes - Path to Becoming an Effective Test Automation Engineer.pdf
leebarnesutopia
 
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptxPRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
christinelarrosa
 
Christine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptxChristine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptx
christinelarrosa
 
GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)
Javier Junquera
 
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham HillinQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
LizaNolte
 
GlobalLogic Java Community Webinar #18 “How to Improve Web Application Perfor...
GlobalLogic Java Community Webinar #18 “How to Improve Web Application Perfor...GlobalLogic Java Community Webinar #18 “How to Improve Web Application Perfor...
GlobalLogic Java Community Webinar #18 “How to Improve Web Application Perfor...
GlobalLogic Ukraine
 
Dandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity serverDandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity server
Antonios Katsarakis
 

Recently uploaded (20)

Leveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and StandardsLeveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and Standards
 
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansBiomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
 
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...
 
"What does it really mean for your system to be available, or how to define w...
"What does it really mean for your system to be available, or how to define w..."What does it really mean for your system to be available, or how to define w...
"What does it really mean for your system to be available, or how to define w...
 
"NATO Hackathon Winner: AI-Powered Drug Search", Taras Kloba
"NATO Hackathon Winner: AI-Powered Drug Search",  Taras Kloba"NATO Hackathon Winner: AI-Powered Drug Search",  Taras Kloba
"NATO Hackathon Winner: AI-Powered Drug Search", Taras Kloba
 
Getting the Most Out of ScyllaDB Monitoring: ShareChat's Tips
Getting the Most Out of ScyllaDB Monitoring: ShareChat's TipsGetting the Most Out of ScyllaDB Monitoring: ShareChat's Tips
Getting the Most Out of ScyllaDB Monitoring: ShareChat's Tips
 
Christine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptxChristine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptx
 
Demystifying Knowledge Management through Storytelling
Demystifying Knowledge Management through StorytellingDemystifying Knowledge Management through Storytelling
Demystifying Knowledge Management through Storytelling
 
AI in the Workplace Reskilling, Upskilling, and Future Work.pptx
AI in the Workplace Reskilling, Upskilling, and Future Work.pptxAI in the Workplace Reskilling, Upskilling, and Future Work.pptx
AI in the Workplace Reskilling, Upskilling, and Future Work.pptx
 
Introducing BoxLang : A new JVM language for productivity and modularity!
Introducing BoxLang : A new JVM language for productivity and modularity!Introducing BoxLang : A new JVM language for productivity and modularity!
Introducing BoxLang : A new JVM language for productivity and modularity!
 
AppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSFAppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSF
 
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
 
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsConnector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
 
Lee Barnes - Path to Becoming an Effective Test Automation Engineer.pdf
Lee Barnes - Path to Becoming an Effective Test Automation Engineer.pdfLee Barnes - Path to Becoming an Effective Test Automation Engineer.pdf
Lee Barnes - Path to Becoming an Effective Test Automation Engineer.pdf
 
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptxPRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
 
Christine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptxChristine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptx
 
GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)
 
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham HillinQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
 
GlobalLogic Java Community Webinar #18 “How to Improve Web Application Perfor...
GlobalLogic Java Community Webinar #18 “How to Improve Web Application Perfor...GlobalLogic Java Community Webinar #18 “How to Improve Web Application Perfor...
GlobalLogic Java Community Webinar #18 “How to Improve Web Application Perfor...
 
Dandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity serverDandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity server
 

Hadoop World 2011: Integrating Hadoop with Enterprise RDBMS Using Apache Sqoop and Other Tools - Guy Harrison, Quest Software & Arvind Prabhakar, Cloudera

  • 1. November 2011 Apache Sqoop (Incubating) Integrating Hadoop with Enterprise RDBS – Part I Arvind Prabhakar (arvind at apache dot org) Apache Sqoop Committer and Software Engineer at Cloudera
  • 6. In This Session… How Sqoop Works Roadmap 5
  • 20. Sqoop Connectors Oracle – Developed by Quest Software Couchbase – Developed by Couchbase Netezza – Developed by Cloudera Teradata – Developed by Cloudera SQL Server – Developed by Microsoft Microsoft PDW – Developed by Microsoft Volt DB – Developed by Volt DB 19
  • 21. Sqoop Roadmap SQOOP-365: Proposal for Sqoop 2.0 https://issues.apache.org/jira/browse/SQOOP-365 Highlights Sqoop as a Service Connections as First Class Objects Role based Security 20
  • 22. Sqoop 2 Architecture (proposed) 21
  • 23.
  • 24. Thank You! Q & A will be after part II of this session. 23
  • 25. Guy Harrison, Quest Software Integrating Hadoop with Enterprise RDBMS Using Apache SQOOP and Other Tools
  • 27.
  • 28. 27
  • 29. Agenda Scenarios for RDBMS-Hadoop interaction Case study: Quest extension to SQOOP Other RDBMS-Hadoop integrations
  • 30. Hadoop meets RDBMS – scenarios
  • 31. Scenario #1: Reference data in RDBMS PRODUCTS HDFS CUSTOMERS WEBlOGS RDBMS
  • 32. Scenario #2: Hadoop for off-line analytics PRODUCTS HDFS CUSTOMERS SALES HISTORY RDBMS
  • 33. Scenario #3: MapReduce output to RDBMS DB QUERY TOOL WEBLOGS SUMMARY HDFS WEBlOGS RDBMS
  • 34. Scenario #4: Hadoop as RDBMS “active archive” QUERY TOOL SALES 2011 SALES 2010 HDFS SALES 2009 SALES 2009 SALES 2008 SALES 2008 RDBMS
  • 35. Case Study: extending SQOOP for Oracle
  • 36. SQOOP extensibility SQOOP implements a generic approach to RDBMS/Hadoop data transfer But database optimization is highly platform specific Each RDBMS has distinct optimizations strategies For Oracle, optimization requires: Bypassing Oracle caching layers Avoiding Oracle optimizer meddling Exploiting Oracle metadata to balance mapper load
  • 37. Reading from Oracle – default SQOOP ID > MAX/2 ID > 0 and ID < MAX/2 MAPPER MAPPER CACHE ORACLE SESSSION ORACLE SESSION RANGE SCAN RANGE SCAN Index block Index block Index block Index block Index block Index block ORACLE TABLE
  • 38. Oracle – parallelism gone bad (1) HDFS OracleSALEStable Hadoop Mapper Hadoop Mapper Hadoop Mapper Hadoop Mapper
  • 39. Oracle – parallelism gone bad (2) HDFS Oracletable Hadoop Mapper Hadoop Mapper Hadoop Mapper Hadoop Mapper
  • 40. Ideal architecture HDFS ORACLE TABLE ORACLE SESSION Hadoop Mapper ORACLE SESSION Hadoop Mapper ORACLE SESSION Hadoop Mapper ORACLE SESSION Hadoop Mapper
  • 41. Design goals Partition data based on physical storage By-pass Oracle buffering By-pass Oracle parallelism Do not require or use indexes Never read the same data block more than once Support Oracle datatypes
  • 43.
  • 46. Working with the SQOOP framework  SQOOP lets you concentrate on the RDBMS logic, not the Hadoop plumbing: Extend ManagerFactory (what to handle) Extend ConnManager (DB connection and metadata) For imports: Extend DataDrivenDBInputFormat (gets the data) Data allocation (getSplits()) Split serialization (“io.serializations” property) Data access logic (createDBRecordReader(), getSelectQuery()) Implement progress (nextKeyValue(), getProgress()) Similar procedure for extending exports
  • 47. Extensions to native SQOOP MERGE functionality Update if exists, insert otherwise Hive connector Source defined as HQL query rather than HDFS directory Eclipse UI
  • 48. Availability Apache licensed source available from : https://github.com/QuestSoftwareTCD/OracleSQOOPconnector Download from (Quest): http://www.quest.com/hadoop/ Download from (Cloudera): http://ccp.cloudera.com/display/SUPPORT/Downloads
  • 49. Other SQOOP connectors Microsoft SQL Server: http://www.microsoft.com/download/en/details.aspx?id=27584 Teradata: https://ccp.cloudera.com/display/con/Cloudera+Connector+for+Teradata+User+Guide%2C+version+1.0-beta-u4 Microstrategy: https://ccp.cloudera.com/display/con/MicroStrategy+Free+Download+License+Agreement Nettezza: https://ccp.cloudera.com/display/con/Netezza+Free+Download+License+Agreement VoltDB: http://voltdb.com/company/blog/sqoop-voltdb-export-and-hadoop-integration
  • 50. Other Hadoop – RDBMS integrations
  • 51. Oracle Big Data Appliance 18 Sun X4270 M2 servers 48GB per node (864GB total) 2x6 Core CPU per node (216 total) 12x2TB HDD per node (216 spindles, 864 TB) 40Gb/s Infiniband between nodes 10Gb/s Ethernet to datacenter Apache Hadoop Oracle NoSQL Oracle loader for Hadoop Multi-stage C-optimized unidirectional loader www.oracle.com/us/bigdata/index.html
  • 52. ORACLE EXALYTICS ORACLE EXALOGIC ORACLE Big Data Appliance Oracle WEBLOGIC Oracle ESSBASE Oracle NoSQL ORACLE EXADATA ORACLE LOADER FOR HADOOP ApACHE HADOOP Oracle RDBMS Oracle TIMES TEN
  • 54. Hadapt Formally HadoopDB – Hadoop/Postgres hybrid Postgres servers on data nodes allow for accelerated (indexed) HIVE queries Extensions to the Hive optimizer http://www.hadapt.com/
  • 55. Greenplum SQL based access to HDFS data via in-DB MapReduce http://www.greenplum.com/sites/default/files/EMC_Greenplum_Hadoop_DB_TB_0.pdf
  • 56. Toad for Cloud Databases Federated SQL queries across Hive, Hbase, NoSQL, RDBMS
  • 57. Conclusions RDBMS-Hadoop interoperability is key to Enterprise Hadoop adoption SQOOP provides a good general purpose framework for transferring data between any JDBC database and Hadoop We’d like to see it become a standard Each RDBMS offers distinct tuning opportunities, so optimized SQOOP extensions offer real value Hadoop-RDBMS integration projects are proliferating rapidly

Editor's Notes

  1. Insanely popular – literally millions of users