4. Components in Hadoop Architecture
• The gray components are purely open source, while the blue components are open source but contributed by other companies
5. HDFS Components
• Node – a computer (commodity hardware)
• Rack – a collection of nodes (30 to 40) on the same network; bandwidth within a rack and bandwidth between racks vary
• Cluster – Collection of Racks
• Distributed file system – the Hadoop Distributed File System (HDFS)
• MapReduce engine – with a built-in resource manager and scheduler
7. Flume and Sqoop
• Both are frameworks for transferring data to and from the Hadoop Distributed File System (HDFS)
• The main difference between Flume and Sqoop is that Flume is used to capture streams of moving data, whereas
Sqoop loads data from relational databases into HDFS
8. Flume
• This is an event-driven framework used to capture data that continuously flows into the system
• Flume runs as one or more agents, and each agent has three different components:
• Source
• Channels
• Sinks
9. Flume Agent
• Source – retrieves the data from a particular application, e.g. a web server
• Channel – acts as a pipe that temporarily buffers the data when the output rate is lower than the input rate
• Sink – processes the data and stores it in a specific destination, most commonly HDFS
[Diagram: Web Server → Source → Channel → Sink → HDFS, all inside a single agent]
• A single agent can have multiple sources, channels, and sinks (a sample agent configuration follows)
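As a minimal sketch of such an agent, the Flume properties below wire a web-server log source to an HDFS sink through a memory channel; the agent name agent1, the spool directory, and the HDFS path are illustrative assumptions, not values from the slides.

  # Flume agent: web-server logs -> memory channel -> HDFS
  agent1.sources  = src1
  agent1.channels = ch1
  agent1.sinks    = snk1

  # Source: read files dropped into a spooling directory by the web server
  agent1.sources.src1.type     = spooldir
  agent1.sources.src1.spoolDir = /var/log/webserver
  agent1.sources.src1.channels = ch1

  # Channel: in-memory buffer between source and sink
  agent1.channels.ch1.type     = memory
  agent1.channels.ch1.capacity = 10000

  # Sink: write the events into HDFS
  agent1.sinks.snk1.type       = hdfs
  agent1.sinks.snk1.hdfs.path  = hdfs://namenode:8020/flume/weblogs
  agent1.sinks.snk1.channel    = ch1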
10. Use of a Channel
• The source writes events into a channel
• The channel retains each event and removes it only after the sink has finished
processing that event
• There are two types of channel:
• In-memory – processes the events faster, but it is volatile
• File based – processes the events slower, but the data is persisted to disk (a configuration example follows)
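Switching between the two is purely a configuration choice; a hedged example of the durable variant, with illustrative directory paths:

  # File-based channel: slower, but events survive an agent restart
  agent1.channels.ch1.type          = file
  agent1.channels.ch1.checkpointDir = /var/flume/checkpoint
  agent1.channels.ch1.dataDirs      = /var/flume/data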
11. Multiplexing and Serialization
• Output from one agent can serve as input to another agent
• Avro is an Apache remote procedure call and serialization framework used to do
this effectively, as sketched below
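A sketch of chaining two agents over Avro; the host name and port number are illustrative assumptions.

  # Agent 1: forward events to the next agent over Avro
  agent1.sinks.avroSnk.type     = avro
  agent1.sinks.avroSnk.hostname = collector-host
  agent1.sinks.avroSnk.port     = 4545
  agent1.sinks.avroSnk.channel  = ch1

  # Agent 2: receive the events sent by agent 1
  agent2.sources.avroSrc.type     = avro
  agent2.sources.avroSrc.bind     = 0.0.0.0
  agent2.sources.avroSrc.port     = 4545
  agent2.sources.avroSrc.channels = ch1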
12. Fan out flow
• If the events from a single source are distributed to multiple channels, it is called fanning out the flow
• Replicating fan out – every event is copied to all of the configured channels
• Multiplexing fan out – each event is routed to a subset of the channels based on an event header
[Diagrams: a single source feeding Channel 1, Channel 2, and Channel 3 in replicating and multiplexing modes; a selector configuration sketch follows]
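Both modes are set through the source's channel selector; the header name, header values, and channel names below are illustrative assumptions.

  # Replicating fan out (the default): every event goes to all three channels
  agent1.sources.src1.channels      = ch1 ch2 ch3
  agent1.sources.src1.selector.type = replicating

  # Multiplexing fan out: route events by the value of a header
  agent1.sources.src1.selector.type           = multiplexing
  agent1.sources.src1.selector.header         = logType
  agent1.sources.src1.selector.mapping.error  = ch1
  agent1.sources.src1.selector.mapping.access = ch2
  agent1.sources.src1.selector.default        = ch3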
14. Why the name Pig?
• According to the Apache Pig philosophy, pigs eat anything, live anywhere and are domesticated
• In Hadoop, Pig is used for processing any kind of data (structured, semi-structured, and unstructured)
15. What’s so great about Pig
• Java is a low-level language (users must be aware of both what the
program does and how the program does it)
• Pig, by contrast, is a high-level language (users only need to be aware of
what the program does and need not worry about how it is done)
• It is extensible – Java classes can be defined separately and called
within a Pig program
16. Components of Pig
• Pig consists of two components:
• The language – Pig Latin
• The compiler – translates Pig Latin scripts into MapReduce jobs
17. Data Flow Language
• Pig is called a data flow language
• Users define a data stream
• Throughout the stream, several transformations are applied to the data
• Transformations include mathematical operations, grouping, filtering, etc.
Languages like C are called control flow languages because they have loops and if
statements
18. Steps involved in Data Flow
• Load – users can specify a single file or an entire directory
• Transform – filter, join, group, order, etc.
• Dump/Save – dump the results to the screen or save them in a file
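A minimal Pig Latin sketch of this load/transform/dump flow; the file name, field names, and types are assumptions made for illustration.

  -- Load: a single file or an entire directory can be given
  flights = LOAD 'flightdata/2015' AS (carrier:chararray, dest:chararray, delay:int);

  -- Transform: filter, group, and aggregate
  late    = FILTER flights BY delay > 30;
  by_dest = GROUP late BY dest;
  counts  = FOREACH by_dest GENERATE group AS dest, COUNT(late) AS late_flights;

  -- Dump the results to the console, or save them to a directory
  DUMP counts;
  STORE counts INTO 'output/late_flights';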
19. Pig – Data Types
Pig has four different data types
• Atom – a single value, either a string or a number; similar to int, long, or char in other programming languages
• Tuple – a record that consists of a series of fields; each field can contain a string or a number
• Bag – a collection of non-unique tuples; each tuple can have a different number of fields
• Map – a collection of key-value pairs; a value can be of any type, and keys must be unique
If a value is unknown, the keyword "null" can be used as a placeholder in the program
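A hedged illustration of how these types appear in a schema and in a record; the file and field names are assumptions.

  -- Schema combining an atom, a bag of tuples, and a map (keys must be unique)
  students = LOAD 'students.txt'
             AS (name:chararray,
                 scores:bag{t:(subject:chararray, mark:int)},
                 info:map[]);
  -- A matching record: (John, {(math,90),(physics,null)}, [city#Chennai])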
21. Pig – Debug and Troubleshoot
• There are a few commands which can be used for debugging, shown below
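Commonly used Pig diagnostic statements include the following, applied here to a hypothetical alias counts:

  DESCRIBE counts;    -- print the schema of an alias
  DUMP counts;        -- run the plan and print the results to the console
  EXPLAIN counts;     -- show the logical, physical, and MapReduce execution plans
  ILLUSTRATE counts;  -- step through the script on a small sample of the data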
22. Modes of Execution
Pig scripts can be executed in two different environments
Local Mode:
Pig is executed on a single node (a Linux machine) and does not require Hadoop or HDFS.
This mode is used for testing Pig logic.
pig -x local programname.pig
MapReduce Mode:
This is an actual Hadoop environment deployed along with HDFS.
pig -x mapreduce programname.pig
23. Packaging Pigs
Pig scripts can be packaged in three different ways
Script: This method is nothing more than a file containing Pig Latin commands, identified by the .pig suffix
(FlightData.pig, for example). Ending your Pig program with the .pig extension is a convention, not a requirement.
Grunt: Grunt acts as a command interpreter where you can interactively enter Pig Latin at the Grunt command
line and immediately see the response. This method is helpful for prototyping during initial development and
with what-if scenarios.
Embedded: Pig Latin statements can be executed within Java, Python, or JavaScript programs.
24. User Defined Functions
• There are a lot of User Defined Functions (UDFs) available for Pig
• These UDFs can be written in several languages and used with Pig
• Open source community members have already posted many useful UDFs online
• Pig can be embedded in host languages such as Java, Python, and JavaScript to integrate existing applications
with Pig
• We can even make Pig behave like a control flow language by placing a Pig Latin script inside an "if"
statement or a loop in the host language, so that a MapReduce job is run until the condition is met (a usage sketch follows)
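A hedged sketch of registering and calling a Java UDF from Pig Latin; the jar name myudfs.jar and the function myudfs.ToUpper are hypothetical.

  REGISTER myudfs.jar;                                  -- make the Java classes visible to Pig
  names = LOAD 'names.txt' AS (name:chararray);
  upper = FOREACH names GENERATE myudfs.ToUpper(name);  -- call the UDF like a built-in function
  DUMP upper;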
25. Sqoop
• Sqoop (a contraction of "SQL-to-Hadoop") acts as the SQL-oriented tool designed for Hadoop
• The main use of Sqoop is to load data from other, external data sources into the Hadoop
Distributed File System (HDFS)
• The other data sources can be structured, semi-structured, or even unstructured
26. Need for Sqoop
• Organizations have been storing data for many years in Relational Databases
• There are several types of RDBMS in common use (MySQL, Oracle, and so on)
27. Need for Sqoop
• That data has to be fed into HDFS for distributed processing
• Sqoop is the best command-line-based (and now web-based as well) tool
to perform the import/export operations to and from HDFS
• Similar to agents in Flume, Sqoop consists of different connectors
29. Sqoop job types
• Sqoop performs two important operations
• Import – copies data from another data source (an RDBMS, Cassandra, etc.) into the Hadoop Distributed File System
• Export – after data processing and analysis, copies the results from HDFS back into the other data source
[Diagram: Other Data Source ↔ HDFS, with Sqoop Import in one direction and Sqoop Export in the other]
• Because it moves data in both directions, Sqoop is called a bidirectional tool
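A hedged pair of commands showing both directions; the connection string, credentials, table names, and paths are assumptions.

  # Import: copy a relational table into HDFS
  sqoop import --connect jdbc:mysql://dbhost/sales --username etl -P \
      --table orders --target-dir /data/orders

  # Export: push processed results from HDFS back into a relational table
  sqoop export --connect jdbc:mysql://dbhost/sales --username etl -P \
      --table order_summary --export-dir /data/order_summary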
30. How Sqoop Works?
• Sqoop communicates with the MapReduce engine and asks it to copy data from
other data sources into HDFS
• MapReduce allocates mappers, which perform the actual copy operation
• Types of operations
• Import one table
• Import complete database
• Import selected tables
• Import selected columns from a particular table
• Filter out certain rows from a certain table, etc. (command sketches for these variants follow)
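Hedged sketches of these import variants; the connection strings, database, table, and column names are assumptions.

  # Import one table
  sqoop import --connect jdbc:mysql://dbhost/shop --table customers

  # Import the complete database
  sqoop import-all-tables --connect jdbc:mysql://dbhost/shop

  # Import selected columns from a particular table
  sqoop import --connect jdbc:mysql://dbhost/shop --table customers --columns "id,name,city"

  # Filter out certain rows from a table
  sqoop import --connect jdbc:mysql://dbhost/shop --table customers --where "city = 'Chennai'"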
31. Two important features
Import Data in Compressed Format
While Sqoop imports data and stores it on the HDFS file system, it can be set to
compress the data as it is stored, in order to reduce the overall utilization of the disk.
Well-known compressed file formats are GZIP, BZ2, etc.
Parallelism
By default, four mappers are allocated to copy data from the other database into
HDFS. Users can increase the number of mappers to 8 or even 16 (see the sketch below).
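A hedged command combining both features; the codec class is the standard Hadoop GZip codec, while the connection string, table, and target path are assumptions.

  # Compressed import with 8 parallel mappers
  sqoop import --connect jdbc:mysql://dbhost/shop --table orders \
      --compress --compression-codec org.apache.hadoop.io.compress.GzipCodec \
      --num-mappers 8 --target-dir /data/orders_gz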
32. JDBC Drivers
• JDBC acts as an interface between an application and its database
• An application can send data into the database or retrieve data from it
whenever it wants
• Sqoop connectors work along with the JDBC drivers (an illustrative command follows)
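For a database without a dedicated connector, Sqoop can be pointed at an explicit JDBC driver class; the driver class and connection URL below are illustrative assumptions.

  # Use an explicit JDBC driver class with a generic connection string
  sqoop import --driver com.mysql.jdbc.Driver \
      --connect jdbc:mysql://dbhost/shop --table customers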
34. Sqoop Latest version
• REST – Representational State Transfer, a software architecture style
• UI – a user interface
• Connectors – interfaces that communicate with other data sources
36. Difference between Flume and Sqoop
• Sqoop is used for importing data from structured data sources such as RDBMSs; Flume is used for moving bulk streaming data into HDFS.
• Sqoop has a connector-based architecture: connectors know how to connect to the respective data source and fetch the data. Flume has an agent-based architecture: the code that is written (called an 'agent') takes care of fetching the data.
• With Sqoop, HDFS is the destination of the data import; with Flume, data flows to HDFS through zero or more channels.
• Sqoop data loads are not event driven; Flume data loads can be driven by events.
• To import data from structured data sources, use Sqoop, because its connectors know how to interact with structured data sources and fetch data from them. To load streaming data such as tweets generated on Twitter or the log files of a web server, use Flume, whose agents are built for fetching streaming data.