Oozie and Sqoop are tools that make Hadoop more efficient.
Oozie acts as a boon for Hadoop; it was developed by Yahoo and donated to Apache for further development.
Yahoo uses the MapReduce technique, developed by Google, to tackle and monitor Big Data.
Oozie & Sqoop by Pradeep
1. Introduction to Oozie
Tasks performed in Hadoop sometimes require multiple Map/Reduce jobs to be chained together to achieve a goal. Within the Hadoop ecosystem there is a relatively new component, Oozie, which allows one to combine multiple Map/Reduce jobs into a single logical unit of work.
2. What is Oozie?
It is a Java web application that runs in a Java servlet container (Tomcat) and uses a database to store:
• Workflow definitions
• Currently running workflow instances, including
instance states and variables
An Oozie workflow is a collection of actions (i.e., Hadoop Map/Reduce jobs, Pig jobs) arranged in a control-dependency DAG (Directed Acyclic Graph), which specifies the sequence in which the actions execute. This graph is written in hPDL (an XML Process Definition Language), as sketched below.
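A minimal hPDL sketch of a workflow with a single Map/Reduce action follows; the application name, the ${jobTracker}/${nameNode} parameters, and the mapper class are illustrative placeholders, not values from this deck:

    <workflow-app xmlns="uri:oozie:workflow:0.2" name="my-wf">
        <start to="mr-node"/>
        <action name="mr-node">
            <map-reduce>
                <job-tracker>${jobTracker}</job-tracker>
                <name-node>${nameNode}</name-node>
                <configuration>
                    <!-- hypothetical mapper; substitute your own job classes -->
                    <property>
                        <name>mapred.mapper.class</name>
                        <value>org.example.MyMapper</value>
                    </property>
                </configuration>
            </map-reduce>
            <ok to="end"/>
            <error to="fail"/>
        </action>
        <kill name="fail">
            <message>Map/Reduce failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
        </kill>
        <end name="end"/>
    </workflow-app>

The control-flow nodes (start, kill, end) define the DAG; each action node names its successor for the success (ok) and failure (error) paths.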
3. Applications of Oozie
• Apache Oozie is a Java Web application used
to schedule Apache Hadoop jobs.
• Oozie combines multiple jobs sequentially
into one logical unit of work.
• It is integrated with the Hadoop stack, with YARN as its architectural center, and supports Hadoop jobs for Apache MapReduce, Apache Pig, and Apache Hive.
• Apache Oozie can also schedule system-specific jobs, such as Java programs or shell scripts (a shell-action sketch follows this list).
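For instance, a shell action can sit alongside Map/Reduce actions in the same workflow; the script name cleanup.sh below is a hypothetical placeholder:

    <action name="shell-node">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <!-- cleanup.sh is a hypothetical script shipped with the workflow -->
            <exec>cleanup.sh</exec>
            <file>cleanup.sh#cleanup.sh</file>
        </shell>
        <ok to="end"/>
        <error to="fail"/>
    </action>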
4. Continue…
Apache Oozie is a tool for Hadoop operations that
allows cluster administrators to build complex
data transformations out of multiple component
tasks. This provides greater control over jobs and
also makes it easier to repeat those jobs at
predetermined intervals. At its core, Oozie helps
administrators derive more value from Hadoop.
5. Types of Oozie jobs:
• Oozie Workflow jobs are Directed Acyclic Graphs (DAGs), specifying a sequence of actions to execute; each action has to wait for its predecessors in the graph to complete.
• Oozie Coordinator jobs are recurrent Oozie Workflow jobs that are triggered by time and data availability (a minimal coordinator sketch follows this list).
• Oozie Bundle provides a way to package multiple
coordinator and workflow jobs and to manage
the lifecycle of those jobs.
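As a sketch of a time-triggered Coordinator, the definition below runs a workflow once a day; the frequency, date range, and application path are assumed placeholders:

    <coordinator-app name="daily-wf" frequency="${coord:days(1)}"
                     start="2016-01-01T00:00Z" end="2016-12-31T00:00Z"
                     timezone="UTC" xmlns="uri:oozie:coordinator:0.2">
        <action>
            <workflow>
                <!-- path to the deployed workflow application (illustrative) -->
                <app-path>hdfs://namenode/user/pradeep/workflows/my-wf</app-path>
            </workflow>
        </action>
    </coordinator-app>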
6. Installing Oozie
• Oozie can be installed on an existing Hadoop system from a tarball, an RPM, or a Debian package. Our Hadoop installation is Cloudera's CDH3, which already contains Oozie; as a result, we just used yum to pull it down and perform the installation on an edge node (see the example commands below).
• There are two components in the Oozie distribution: the Oozie client and the Oozie server.
• Depending on the size of your cluster, you may have both components on the same edge server or on separate machines. The Oozie server contains the components for launching and controlling jobs, while the client contains the components a user needs to launch Oozie jobs and communicate with the Oozie server.
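On a CDH-style system the yum step looks roughly like this; the package names are assumed from CDH and should be verified against your own repository:

    # On the edge node (package names assumed from CDH; verify for your distro)
    sudo yum install oozie          # Oozie server
    sudo yum install oozie-client   # Oozie command-line client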
7. • Note: in addition to the installation process, it is recommended to set the shell variable OOZIE_URL:
export OOZIE_URL=http://localhost:11000/oozie
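With OOZIE_URL set, the oozie client no longer needs an explicit -oozie flag on every invocation; for example (job.properties and the job ID below are illustrative):

    # Submit and start a workflow; the server address comes from $OOZIE_URL
    oozie job -run -config job.properties

    # Check the status of a running job
    oozie job -info 0000001-160101000000000-oozie-oozi-W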
8. Limitations
• Only SAS DATA step batch programs can be scheduled.
• Only a single time event can be specified as a trigger
for the job.
• Subflows are not supported.
• Only AND conditions are supported in process flow
diagrams.
• Job events are the only supported type of dependency
for a scheduled flow.
• Deployed flows cannot be unscheduled in the Schedule
Manager plug-in, and flow history is not available.
9. • Jobs deployed to HDFS or MAPRFS cannot be exported.
• XML is verbose.
• Control flow is somewhat restrictive.
• Directed Acyclic Graph: it is hard to rerun only one component after a failure (the model does go perfectly with Pig, though, since Pig scripts also define a DAG).
• User interface.
10. Sqoop
Sqoop is a tool designed to transfer data between Hadoop and relational database servers. It is used to import data from relational databases such as MySQL and Oracle into Hadoop HDFS, and to export data from the Hadoop file system back to relational databases.
11. Introduction
• The traditional application management system, that is, the interaction of applications with relational databases using an RDBMS, is one of the sources that generate Big Data. Such Big Data is stored in relational database servers in the relational database structure.
• When the Big Data stores and analyzers of the Hadoop ecosystem, such as MapReduce, Hive, HBase, Cassandra, and Pig, came into the picture, they required a tool to interact with relational database servers to import and export the Big Data residing in them. Here, Sqoop occupies a place in the Hadoop ecosystem, providing feasible interaction between relational database servers and Hadoop's HDFS.
12. Continue...
Sqoop: “SQL to Hadoop and Hadoop to SQL”
As described above, Sqoop transfers data between Hadoop and relational database servers; it is provided by the Apache Software Foundation.
14. Sqoop Import & Sqoop Export
• The import tool imports individual tables from
RDBMS to HDFS. Each row in a table is treated
as a record in HDFS. All records are stored as
text data in text files or as binary data in Avro
and Sequence files.
• The export tool exports a set of files from HDFS back to an RDBMS. The files given as input to Sqoop contain records, which are called rows in a table; those are read and parsed into a set of records and delimited with a user-specified delimiter. Example invocations of both tools follow.
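As a sketch of both tools, the commands below import a MySQL table into HDFS and export the files back; the connection string, credentials, table names, and paths are illustrative assumptions:

    # Import the "employees" table from MySQL into HDFS (values are placeholders)
    sqoop import \
      --connect jdbc:mysql://dbhost/corp \
      --username dbuser -P \
      --table employees \
      --target-dir /user/pradeep/employees

    # Export the files back into a pre-created RDBMS table
    sqoop export \
      --connect jdbc:mysql://dbhost/corp \
      --username dbuser -P \
      --table employees_copy \
      --export-dir /user/pradeep/employees \
      --input-fields-terminated-by ','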
15. Limitations
• the --all-tables option
• the free-form query (--query) option
• data imported into Hive or HBase
• table imports with the --where argument
• incremental imports (a sketch follows this list)
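As an illustration of the last item, an incremental append import fetches only rows past a saved watermark; the connection string, table, and column values are assumptions:

    # Append-mode incremental import: only rows with id > 100 are fetched
    sqoop import \
      --connect jdbc:mysql://dbhost/corp \
      --table employees \
      --incremental append \
      --check-column id \
      --last-value 100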
Sqoop has enjoyed enterprise adoption, and our experiences have exposed some recurring ease-of-use challenges, extensibility limitations, and security concerns that are difficult to address in the original design:
• Cryptic, context-dependent command-line arguments can lead to error-prone connector matching, resulting in user errors.
16. • Due to tight coupling between data transfer and the serialization format, some connectors may support a certain data format that others don't (e.g., the direct MySQL connector can't support sequence files).
• There are security concerns with openly shared credentials.
• Because root privileges are required, local configuration and installation are not easy to manage.
• Debugging the map job is limited to turning on the verbose flag.
• Connectors are forced to follow the JDBC model and are required to use common JDBC vocabulary (URL, database, table, etc.), regardless of whether it is applicable.