BIG DATA ANALYSIS: SQOOP
YEAR: 2018-19
SUBMITTED TO
Er. SHRIYA MAM
Asst. Prof.
SUBMITTED BY
DUSHHYANT KUMAR
ROLL NUMBER - 06
INDEX
1. What is Sqoop?
2. Why we use Sqoop?
3. Sqoop Architecture
4. What Is REST API
5. Difference Between Sqoop1 & Sqoop2
6. Features of Sqoop
7. Five Stages of Sqoop Import Overview
8. Importing & Exporting Data using Sqoop
9. Sqoop Limitations
What is Sqoop?
▪ Apache Sqoop is a tool in the Hadoop ecosystem designed to transfer data between
HDFS (Hadoop storage) and relational database servers such as MySQL, Oracle,
SQLite, Teradata, Netezza, Postgres, etc.
▪ It efficiently transfers bulk data between Hadoop and external data stores such as
enterprise data warehouses and relational databases.
▪ This is how Sqoop got its name: "SQL to Hadoop & Hadoop to SQL".
▪ Sqoop transfers data between Hadoop and relational DB servers.
▪ Sqoop is used to import data from relational DBs such as MySQL and Oracle.
▪ Sqoop is used to export data from HDFS to relational DBs.
Why we use Sqoop?
✓ A big data developer's work starts once the data is in a Hadoop system such as HDFS,
Hive, or HBase; from there they dig out the golden information hidden in such a huge
amount of data.
✓ Before Sqoop came, developers had to write their own code to import and export data
between Hadoop and RDBMSs, and a dedicated tool was needed for the task.
✓ Sqoop uses the MapReduce mechanism for its import and export operations, which
gives it parallelism as well as fault tolerance.
✓ With Sqoop, developers just need to specify the source and destination; the rest of the
work is done by the Sqoop tool.
✓ Sqoop came and filled this gap in data transfer between relational databases and the
Hadoop system.
Sqoop Architecture
Here one side is a source, an RDBMS such as MySQL, and the other is a destination such as
HBase or HDFS; Sqoop performs the import and export operations between them.
[Diagram: the Sqoop tool sits between an RDBMS (MySQL, Oracle) and the Hadoop file
system (HDFS, HBase, Hive), importing in one direction and exporting in the other.]
Difference Between Sqoop1 & Sqoop2
S.N.  Sqoop1                            Sqoop2
1     Client-only architecture          Client/server architecture
2     CLI based                         CLI + web based
3     Client access to Hive, HBase      Server access to Hive, HBase
4     Oozie and Sqoop tightly coupled   Oozie talks to Sqoop via its REST API
[Diagrams: Sqoop1's client-only architecture and Sqoop2's client/server architecture,
shown in detail on the next two slides.]
Sqoop1 Architecture
[Diagram: the user issues a command to the Sqoop client, which launches map-only tasks
to move data between sources (data warehouse, relational database, document-based
system) and HDFS/HBase.]
When Sqoop starts functioning, only the mapper job runs; no reducer is required. Here is a
detailed view of the Sqoop architecture with the mapper:
1. Sqoop provides a command-line interface to end users and can also be accessed
using the Java API.
2. Only the Map phase runs; Reduce is not required because the complete import and
export process doesn't involve any aggregation, so there is no need for reducers in
Sqoop.
Sqoop2 Architecture
[Diagram: the Sqoop client (CLI or the user's browser) talks to the Sqoop server over
REST/UI; inside the server, a job manager works with a connector manager, connectors,
and a metadata repository, and launches map and reduce tasks that move data between
sources (data warehouse, relational database, document-based system) and
HDFS/HBase/Hive in Hadoop.]
What Is REST API
REST stands for Representational State Transfer; a REST API is an application
programming interface built in this architectural style. REST services can be written in
Scala, Java, PHP, etc., and use the HTTP protocol.
HTTP METHODS:
1. GET
2. POST
3. PUT
4. DELETE
This section explains how to use the Sqoop network API to allow external applications to
interact with the Sqoop server.
The REST API is a lower-level API than the Sqoop client API, which gives you the freedom
to execute commands on the Sqoop server from any tool or programming language.
The REST API is invoked via HTTP requests and uses the JSON format to encode data
content.
What Is Client-Server REST API
[Diagram slides: the client-server interaction over the Sqoop REST API.]
REST API Example
[Example slides: sample REST requests to the Sqoop server and their JSON responses.]
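As a minimal sketch of calling the REST API, the request below asks the Sqoop2 server for
its version. The host, the default port 12000, and the /sqoop/version path are assumptions
about a typical Sqoop2 installation; adjust them to your setup.

$ # Query the Sqoop2 server's version resource; the reply is a JSON document
$ curl -X GET http://localhost:12000/sqoop/version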
Features of Sqoop
1. Full Load: Apache Sqoop can load a whole table with a single command. You can also load
all the tables from a database using a single command (see the sketch after this list).
2. Incremental Load: Apache Sqoop also provides incremental load, where you can load just
the parts of a table that were updated. Sqoop import supports two types of incremental
imports: 1. Append 2. Last modified.
3. Parallel import/export: Sqoop uses the YARN framework to import and export the data,
which provides fault tolerance on top of parallelism.
4. Import results of SQL query: You can also import into HDFS the result returned by an SQL
query.
5. Compression: You can compress your data using the deflate (gzip) algorithm with the
--compress argument, or by specifying the --compression-codec argument. You can also load
a compressed table into Apache Hive.
6. Connectors for all major RDBMS databases: Apache Sqoop provides connectors for
multiple RDBMS databases, covering almost the entire spectrum.
7. Load data directly into Hive/HBase: You can load data directly into Apache Hive for
analysis, and also dump your data into HBase, which is a NoSQL database.
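A minimal sketch of the full-load and query-import features above; the database name,
credentials, query, and paths are hypothetical:

$ # Full load: import every table of a database with a single command
$ sqoop import-all-tables --connect jdbc:mysql://localhost/shop \
    --username dbuser --password dbpass

$ # Query import: Sqoop substitutes $CONDITIONS with split predicates
$ sqoop import --connect jdbc:mysql://localhost/shop \
    --username dbuser --password dbpass \
    --query 'SELECT id, total FROM orders WHERE $CONDITIONS' \
    --split-by id --target-dir /user/hadoop/order_totals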
Incremental import option skeleton:
$ sqoop import --connect <jdbc-url> --table <table> --username <user> --password <password> \
    --incremental <append|lastmodified> --check-column <column> --last-value <value>
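Filled in, a hedged example of an append-mode incremental import; the table, key column,
and last value are hypothetical:

$ # Import only rows whose id exceeds the last imported value (1000)
$ sqoop import --connect jdbc:mysql://localhost/shop --table orders \
    --username dbuser --password dbpass \
    --incremental append --check-column id --last-value 1000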
Five Stages of Sqoop Import Overview
[Diagram: Sqoop map tasks read from RDBMS sources (MySQL, SQL Server, Oracle, DB2)
and write to data sinks (HDFS, Hive, HBase) via MapReduce.]
1. Run the import from the Sqoop client.
2. Pull metadata from the database.
3. Launch the MapReduce job.
4. Pull data from the database.
5. Write to the data sink.
Importing Data using Sqoop
[Diagram: the user submits a Sqoop import; the Sqoop job 1. gathers metadata from the
RDBMS (e.g. an ORDERS table in Oracle) and 2. submits a map-only job to the Hadoop
cluster, whose map tasks write files into HDFS.]
Sqoop Import Data
▪ SQOOP Import
▪ Imports individual tables from an RDBMS to HDFS.
▪ Each row in a table is treated as a record in HDFS.
▪ All records are stored as text data in text files or as binary data in binary files.
▪ The generic syntax and a worked example of importing a table into HDFS are sketched
below. The key options:
--connect - Takes the JDBC URL and connects to the database.
--table - Source table name to be imported.
--username - Username to connect to the database.
--password - Password of the connecting user.
--target-dir - Imports data to the specified directory.
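A minimal sketch of the generic form and a concrete table import; the database, table,
credentials, and target directory are hypothetical:

Generic form:
sqoop import (generic-args) (import-args)

$ # Import the "orders" table into an HDFS directory
$ sqoop import --connect jdbc:mysql://localhost/shop --table orders \
    --username dbuser --password dbpass \
    --target-dir /user/hadoop/orders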
Exporting Data using Sqoop
[Diagram: the user submits a Sqoop export; the Sqoop job 1. gathers metadata and
2. submits a map-only job to the Hadoop cluster, whose map tasks read files from HDFS
and write rows back to the RDBMS (e.g. an ORDERS table in Oracle).]
Sqoop Export Data
▪ SQOOP Export
▪ Exports a set of files from HDFS back to an RDBMS.
▪ The files given as input to Sqoop contain records, which become rows in the table.
▪ The generic syntax and a worked example of exporting to an RDBMS table are sketched
below. The key options:
--connect - Takes the JDBC URL and connects to the database.
--table - Target table to be populated with the exported rows.
--username - Username to connect to the database.
--password - Password of the connecting user.
--export-dir - HDFS directory from which data is exported.
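A minimal sketch of the generic form and a concrete export, mirroring the hypothetical
import example above:

Generic form:
sqoop export (generic-args) (export-args)

$ # Push the HDFS files back into the "orders" table of the RDBMS
$ sqoop export --connect jdbc:mysql://localhost/shop --table orders \
    --username dbuser --password dbpass \
    --export-dir /user/hadoop/orders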
Limitations of Sqoop
1. Sqoop cannot be paused and resumed. It is an atomic step; if it fails, we need to clean
things up and start again.
2. Sqoop export performance also depends on the hardware configuration (memory, hard
disk) of the RDBMS server.
3. Sqoop is slow because it still uses MapReduce for its backend processing.
4. Failures need special handling in case of a partial import or export.
5. For a few databases Sqoop provides a bulk connector with faster performance; otherwise
it uses a generic JDBC connection to the RDBMS data store, which can be inefficient and
less performant.