2. Big Data
⮚ Big data refers to data that is so large, fast or complex that it is difficult or impossible to process using traditional methods.
⮚ The act of accessing and storing large amounts of information for analytics has been around for a long time, but the concept of big data gained momentum in the early 2000s.
⮚ Big Data is high-volume, high-velocity and/or high-variety information assets that require new forms of processing for enhanced decision making, insight discovery and process optimization (Gartner, 2012).
⮚ “Data of a very large size, typically to the extent that its manipulation
and management present significant logistical challenges”.
3. Types of Big Data
⮚ Big data is classified in three ways: Structured Data, Unstructured
Data and Semi-Structured Data.
⮚ Structured data is the easiest to work with. It is highly organized with
dimensions defined by set parameters. Structured data follows schemas:
essentially road maps to specific data points. These schemas outline
where each datum is and what it means. It’s all your quantitative data like
Age, Billing, Address etc.
⮚ Unstructured data is all your unorganized data. The hardest part of
analyzing unstructured data is teaching an application to understand the
information it’s extracting. More often than not, this means translating it
into some form of structured data.
4. ⮚ Semi-structured data toes the line between structured and unstructured. Most of the time, this translates to unstructured data with metadata attached to it. Examples of such metadata are a time or location stamp, a device ID or an email address, or a semantic tag attached to the data later. Semi-structured data has no set schema.
7. What is Hadoop?
⮚Hadoop is an Apache open source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models.
⮚The Hadoop framework application works in an environment that
provides distributed storage and computation across clusters of
computers.
⮚Hadoop is a framework that uses distributed storage and parallel
processing to store and manage Big Data.
⮚Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.
8. Hadoop Applications
Industry: Application; Use Case
Web: Social Network Analysis; Clickstream Sessionization
Media: Content Optimization; Clickstream Sessionization
Telco: Network Analytics; Mediation
Retail: Loyalty & Promotions Analysis; Data Factory
Financial: Fraud Analysis; Trade Reconciliation
Federal: Entity Analysis; SIGINT
Bioinformatics: Genome Mapping; Sequencing Analysis
9. Hadoop Core Principles
⮚Scale-Out rather than Scale-Up
⮚Bring code to data rather than data to code
⮚Deal with failures – they are common
⮚Abstract complexity of distributed and concurrent
applications
10. Scale-Out rather than Scale-Up
1) Scale-Up: it is harder and more expensive
i. Add additional resources to an existing node (CPU, RAM)
ii. Moore's Law cannot keep up with data growth
iii. New units must be purchased if the required resources cannot be added
iv. Also known as scaling vertically
2) Scale-Out
i. Add more nodes/machines to an existing distributed application
ii. The software layer is designed for node addition or removal
iii. Hadoop takes this approach: a set of nodes is bonded together as a single distributed system
iv. Very easy to scale down as well
11. Bring Code to Data rather than Data to Code
◆Hadoop co-locates processors and storage
◆Code is moved to data (size is tiny, usually in KBs)
◆Processors execute code and access underlying local storage
12. Hadoop is designed to cope with node failures
⮚If a node fails, the master will detect that failure and re-assign the work to
a different node on the system.
⮚Restarting a task does not require communication with nodes working on
other portions of the data.
⮚If a failed node restarts, it is automatically added back to the system and
assigned new tasks.
⮚If a node appears to be running slowly, the master can redundantly execute another instance of the same task (speculative execution), as in the configuration sketch below.
⮚ The results from the first instance to finish are used.
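The sketch below shows how this behaviour can be influenced from a job's configuration, assuming the Hadoop 2.x+ MapReduce API and property names (mapreduce.map.speculative, mapreduce.reduce.speculative, mapreduce.map.maxattempts); the job name is illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculativeConfigExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Allow the master to launch redundant (speculative) attempts of
        // slow-running tasks; the first attempt to finish wins.
        conf.setBoolean("mapreduce.map.speculative", true);
        conf.setBoolean("mapreduce.reduce.speculative", true);

        // Number of times a failed map task is retried on another node
        // before the whole job is declared failed.
        conf.setInt("mapreduce.map.maxattempts", 4);

        Job job = Job.getInstance(conf, "fault-tolerant-job"); // illustrative job name
        System.out.println("Speculative maps enabled: "
                + job.getConfiguration().getBoolean("mapreduce.map.speculative", false));
    }
}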
15. What is File System (FS)?
A file management system is used by the operating system to access the files and folders stored in a computer or on any external storage device.
⮚ A file system stores and organizes data and can be thought of as a type
of index for all the data contained in a storage device. These devices
can include hard drives, optical drives and flash drives.
⮚ Imagine a file management system as a big dictionary that contains information about file names, locations and types.
⮚ File systems specify conventions for naming files, including the
maximum number of characters in a name, which characters can be
used etc.
⮚ A file management system is capable of handling files within one machine only.
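As a small illustration of this "dictionary of names, locations and types", the sketch below uses the standard java.nio.file API to read the metadata a local file system keeps about a file; the path is a placeholder.

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.BasicFileAttributes;

public class LocalFsMetadata {
    public static void main(String[] args) throws Exception {
        Path file = Paths.get("/tmp/example.txt");   // placeholder path

        // The file system answers questions about the file without reading
        // its contents: name, size, type and timestamps.
        BasicFileAttributes attrs = Files.readAttributes(file, BasicFileAttributes.class);
        System.out.println("Name:      " + file.getFileName());
        System.out.println("Size:      " + attrs.size() + " bytes");
        System.out.println("Directory: " + attrs.isDirectory());
        System.out.println("Modified:  " + attrs.lastModifiedTime());
        System.out.println("Type:      " + Files.probeContentType(file));
    }
}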
16. What is a Distributed File System (DFS)?
⮚A Distributed File System (DFS) as the name suggests, is a file system
that is distributed on multiple file servers or multiple locations.
⮚It allows programs to access or store isolated files as they do with the
local ones, allowing programmers to access files from any network or
computer.
⮚The main purpose of the Distributed File System (DFS) is to allow users of physically distributed systems to share their data and resources by using a common file system.
⮚A collection of workstations and mainframes connected by a Local Area Network (LAN) is a typical configuration of a Distributed File System.
17. How does a Distributed File System (DFS) work?
A distributed file system works as follows:
a) Distribution: blocks of a data set are distributed across multiple nodes. Each node has its own computing power, which gives the DFS the ability to process data blocks in parallel.
b) Replication: the distributed file system also replicates data blocks on different clusters by copying the same pieces of information onto multiple clusters on different racks. This helps to achieve the following:
c) Fault Tolerance: data blocks can be recovered in case of a cluster or rack failure.
d) High Concurrency: the same piece of data can be processed by multiple clients at the same time, using the computing power of each node to process data blocks in parallel.
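The following toy sketch (plain Java, not actual DFS code) illustrates the distribution and replication steps above: a file is split into fixed-size blocks and each block is placed on several distinct nodes, so that losing a single node never loses a block. Block size, replication factor and node names are made up for the example.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ToyBlockPlacement {
    static final int BLOCK_SIZE = 128;   // bytes here; 128 MB in a real DFS
    static final int REPLICATION = 3;    // copies kept of every block

    public static void main(String[] args) {
        String[] nodes = {"node1", "node2", "node3", "node4", "node5"};
        byte[] file = new byte[600];      // pretend file contents

        // Distribution: split the file into blocks.
        int blockCount = (file.length + BLOCK_SIZE - 1) / BLOCK_SIZE;

        // Replication: place each block on REPLICATION distinct nodes.
        Map<Integer, List<String>> placement = new HashMap<>();
        for (int b = 0; b < blockCount; b++) {
            List<String> replicas = new ArrayList<>();
            for (int r = 0; r < REPLICATION; r++) {
                replicas.add(nodes[(b + r) % nodes.length]); // simple round-robin placement
            }
            placement.put(b, replicas);
        }
        placement.forEach((block, replicas) ->
                System.out.println("block " + block + " -> " + replicas));
    }
}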
18. DFS Advantages
a) Scalability: You can scale up your infrastructure by adding more racks or
clusters to your system.
b) Fault Tolerance: Data replication helps to achieve fault tolerance in the following cases: a cluster is down, a rack is down, a rack is disconnected from the network, or a job fails or restarts.
c) High Concurrency: utilize the compute power of each node to handle
multiple client requests (in a parallel way) at the same time.
19. DFS Disadvantages
a) In a Distributed File System, nodes and connections need to be secured; therefore we can say that security is at stake.
b) There is a possibility of losing messages and data in the network while they move from one node to another.
c) Database connection in the case of a Distributed File System is complicated.
d) Handling of the database is also not easy in a Distributed File System compared to a single-user system.
21. HDFS Basics
⮚ The Hadoop Distributed File System (HDFS) is based on the Google File
System (GFS)
⮚ Hadoop Distributed File System is responsible for storing data on the
cluster.
⮚ Data files are split into blocks and distributed across multiple nodes in the
cluster.
⮚ Each block is replicated multiple times
– The default is to replicate each block three times
– Replicas are stored on different nodes
– This ensures both reliability and availability
⮚ A distributed file system that provides high-throughput access to
application data.
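A minimal sketch of writing a file through the Hadoop FileSystem Java API and then asking where its blocks were placed; it assumes a reachable cluster picked up from the default Configuration, and the path is illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/sample.txt");   // illustrative path

        // Write a small file, asking for 3 replicas of each block
        // (3 is also the HDFS default).
        try (FSDataOutputStream out = fs.create(file, (short) 3)) {
            out.writeUTF("hello hdfs");
        }

        // Ask the NameNode which DataNodes hold the file's blocks.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println(String.join(",", block.getHosts()));
        }
        fs.close();
    }
}

Note that the client only ever talks to the NameNode for metadata; the data itself is streamed directly to and from the DataNodes.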
24. Hadoop Daemons
▪ Hadoop is comprised of five separate daemons
▪ NameNode: Holds the metadata for HDFS
▪ Secondary NameNode
– Performs housekeeping functions for the NameNode
– Is not a backup or hot standby for the NameNode!
▪ DataNode: Stores actual HDFS data blocks
▪ JobTracker: Manages MapReduce jobs, distributes individual tasks
▪ TaskTracker: Responsible for instantiating and monitoring individual Map and
Reduce tasks
25. Functions of Namenode
⮚ It is the master daemon that maintains and manages the DataNodes
(slave nodes)
⮚ It records the metadata of all the files stored in the cluster, e.g.
The location of blocks stored, the size of the files, permissions,
hierarchy, etc. There are two files associated with the metadata:
● FsImage: Complete state of the file system namespace since the start
of the NameNode.
● EditLogs: All the recent modifications made to the file system with
respect to the most recent FsImage.
⮚ It records each change that takes place to the file system metadata.
26. Functions of Namenode (Continued..)
⮚ It regularly receives a Heartbeat and a block report from all the
DataNodes in the cluster to ensure that the DataNodes are live.
⮚ It keeps a record of all the blocks in HDFS and in which nodes these
blocks are located.
⮚ The NameNode is also responsible for maintaining the replication factor.
⮚ In case of a DataNode failure, the NameNode chooses new DataNodes for new replicas, balances disk usage and manages the communication traffic to the DataNodes.
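The toy sketch below (invented for illustration, not the real NameNode code) mimics this bookkeeping: a map from block IDs to the DataNodes holding replicas, and a re-replication step that runs when a node is declared dead.

import java.util.*;

public class ToyNameNode {
    // block id -> set of DataNodes currently holding a replica
    private final Map<String, Set<String>> blockMap = new HashMap<>();
    private final List<String> liveNodes = new ArrayList<>(List.of("dn1", "dn2", "dn3", "dn4"));
    private static final int REPLICATION = 3;

    void addBlock(String blockId, String... nodes) {
        blockMap.put(blockId, new HashSet<>(Arrays.asList(nodes)));
    }

    // Called when heartbeats from a DataNode stop arriving.
    void markDead(String deadNode) {
        liveNodes.remove(deadNode);
        for (Map.Entry<String, Set<String>> e : blockMap.entrySet()) {
            Set<String> holders = e.getValue();
            holders.remove(deadNode);
            // Re-replicate until the replication factor is restored.
            for (String candidate : liveNodes) {
                if (holders.size() >= REPLICATION) break;
                holders.add(candidate);
            }
            System.out.println("block " + e.getKey() + " now on " + holders);
        }
    }

    public static void main(String[] args) {
        ToyNameNode nn = new ToyNameNode();
        nn.addBlock("blk_001", "dn1", "dn2", "dn3");
        nn.addBlock("blk_002", "dn2", "dn3", "dn4");
        nn.markDead("dn2");   // both blocks get a replacement replica
    }
}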
27. Functions of Datanode
⮚These are slave daemons or processes that run on each slave machine.
⮚The actual data is stored on the DataNodes.
⮚The DataNodes perform the low-level read and write requests from the file system's clients.
⮚They send heartbeats to the NameNode periodically to report the overall health of HDFS; by default, this interval is set to 3 seconds.
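A toy sketch of the heartbeat behaviour described above, using a scheduled task that reports in every 3 seconds (matching the default dfs.heartbeat.interval); the real DataNode implementation is more involved, and the node name here is made up.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ToyDataNodeHeartbeat {
    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

        // Send a "heartbeat" every 3 seconds, like an HDFS DataNode does by default.
        scheduler.scheduleAtFixedRate(
                () -> System.out.println("heartbeat: DataNode dn1 is alive, disk OK"),
                0, 3, TimeUnit.SECONDS);

        Thread.sleep(10_000);       // let a few heartbeats go out
        scheduler.shutdownNow();    // stop the toy DataNode
    }
}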
28. Functions of Secondary NameNode
⮚ The Secondary NameNode constantly reads the file system state and metadata from the RAM of the NameNode and writes it to the hard disk or the file system.
⮚ It is responsible for combining the EditLogs with the FsImage from the NameNode.
⮚ It downloads the EditLogs from the NameNode at regular intervals and applies them to the FsImage.
⮚ The new FsImage is copied back to the NameNode and is used the next time the NameNode is started.
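The checkpoint idea can be pictured with the toy sketch below: the "FsImage" is a snapshot map of the namespace, the "EditLog" is a list of changes made since that snapshot, and merging the two produces a new FsImage. The data structures are invented for illustration only.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ToyCheckpoint {
    public static void main(String[] args) {
        // FsImage: snapshot of the namespace (path -> size) at the last checkpoint.
        Map<String, Long> fsImage = new HashMap<>();
        fsImage.put("/user/demo/a.txt", 100L);

        // EditLog: modifications made since that FsImage was written.
        List<String[]> editLog = new ArrayList<>();
        editLog.add(new String[]{"CREATE", "/user/demo/b.txt", "250"});
        editLog.add(new String[]{"DELETE", "/user/demo/a.txt", ""});

        // Checkpoint: apply every edit to the image, producing a new FsImage.
        for (String[] edit : editLog) {
            switch (edit[0]) {
                case "CREATE" -> fsImage.put(edit[1], Long.parseLong(edit[2]));
                case "DELETE" -> fsImage.remove(edit[1]);
            }
        }
        editLog.clear();   // edits are now folded into the new image
        System.out.println("New FsImage: " + fsImage);
    }
}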
30. What is MapReduce?
■MapReduce is a processing technique and a programming model for distributed computing based on Java.
■The MapReduce algorithm contains two important tasks, namely Map and Reduce.
■Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
■The Reduce task takes the output from a map as its input and combines those data tuples into a smaller set of tuples.
■As the name MapReduce implies, the reduce task is always performed after the map job.
31. ■ MapReduce is the system used to process data in the Hadoop cluster.
■ It consists of two phases: Map, and then Reduce.
■ Each Map task operates on a discrete portion (one HDFS block) of the overall dataset.
■ The MapReduce system distributes the intermediate data to the nodes that perform the Reduce phase.
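The classic WordCount job is a compact way to see the Map and Reduce phases working together. The sketch below follows the standard Hadoop MapReduce Java API; the input and output paths are taken from the command line and are illustrative.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: each input line is broken into (word, 1) key/value pairs.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: all counts for the same word are combined into one total.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Each Map task emits (word, 1) pairs for its block of input; the framework then groups the intermediate pairs by word, and each Reduce task sums the counts for the words it receives.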