Hadoop Tutorial | Big Data Hadoop Tutorial For Beginners | Hadoop Certification Training | Edureka

Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Big Data Use-Cases
1. US Primary Election Analysis
2. Market Analysis for US Cab Startup

US Primary Election Analysis

US Election
STEP 1: Primaries & Caucuses
STEP 2: National Conventions
STEP 3: General Elections
STEP 4: Electoral College

US Primary Election
PROBLEM STATEMENT:
In the 2016 US primary elections, Hillary Clinton was nominated over Bernie Sanders by the Democratic Party, while Donald Trump was nominated by the Republican Party to contest the presidency.
As an analyst, you have been tasked with understanding, based on demographic features, the factors that led Hillary Clinton and Donald Trump to win the primary elections, in order to plan their next initiatives and campaigns.
Republican Democrat

US Primary Election Dataset
As a data analyst, you have two datasets available:
• US Primary Election Data Set
• US Demographic Features (County-wise) Data Set

US Primary Election Dataset
state: name of the US state
state_abbreviation: abbreviation of the US state
county: name of the county within the state
fips: Federal Information Processing Standards (FIPS) county code, which uniquely identifies a county
party: party of the candidate (i.e. Republican or Democrat)
candidate: name of the candidate in the US primary election
votes: number of votes gained by the candidate
fraction_votes: number of votes gained by the candidate / total votes gained by the party
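
The fraction_votes definition above can be sketched in a few lines of plain Python. This is only an illustration: the rows, the fips value and the vote counts are invented, the candidate names are placeholders, and it assumes the party total is taken per county.

```python
from collections import defaultdict

# Hypothetical rows for a single county; all values are made up for illustration.
rows = [
    {"fips": "00001", "party": "Democrat", "candidate": "Candidate A", "votes": 300},
    {"fips": "00001", "party": "Democrat", "candidate": "Candidate B", "votes": 100},
    {"fips": "00001", "party": "Republican", "candidate": "Candidate C", "votes": 500},
]


def add_fraction_votes(rows):
    """Return copies of the rows with fraction_votes = votes / party total in the county."""
    totals = defaultdict(int)
    for r in rows:
        totals[(r["fips"], r["party"])] += r["votes"]
    return [
        dict(r, fraction_votes=r["votes"] / totals[(r["fips"], r["party"])])
        for r in rows
    ]


result = add_fraction_votes(rows)
for r in result:
    print(r["candidate"], r["fraction_votes"])
```

In the use case itself this kind of aggregation would be done with Spark SQL rather than in-memory Python.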

US County Demographic Features Dataset
DETAILS:
Population, 2014 estimate
Population, 2010 (April 1) estimates base
Population, percent change - April 1, 2010 to July 1, 2014
Population, 2010
Persons under 5 years, percent, 2014
Persons under 18 years, percent, 2014
Persons 65 years and over, percent, 2014
Female persons, percent, 2014
White alone, percent, 2014 …

US Election Solution Strategy
1. US Primary Election Dataset
2. Storing Data in HDFS
3. Processing Data Using Spark Components
4. Transforming Data Using Spark SQL
5. Clustering Data Using Spark MLlib (K-Means)
6. Visualizing the Result Using Zeppelin


Visualization of Result

Market Analysis for US Cab Start-Ups
PROBLEM STATEMENT:
A US cab service start-up wants to meet customer demand optimally and maximize its profit.
It has therefore hired you as a data analyst to interpret the available Uber data set and find the busiest ("beehive") customer pick-up points and the peak hours, so that demand can be met profitably.

Uber Dataset
• Date/Time – Pickup Date & Time
• Lat – Latitude of Pickup
• Lon – Longitude of Pickup
• Base – TLC Base Code
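
As a rough sketch of how the Date/Time field could be used to find peak hours, the snippet below counts pick-ups per hour of day. The sample records, the base codes and the timestamp format `%m/%d/%Y %H:%M:%S` are assumptions made for illustration, not values taken from the actual data set.

```python
from collections import Counter
from datetime import datetime

# Hypothetical records in the order (Date/Time, Lat, Lon, Base).
pickups = [
    ("4/1/2014 0:11:00", 40.7690, -73.9549, "B00001"),
    ("4/1/2014 17:05:00", 40.7267, -74.0345, "B00001"),
    ("4/1/2014 17:32:00", 40.7316, -73.9873, "B00001"),
    ("4/2/2014 17:48:00", 40.7588, -73.9776, "B00002"),
]


def pickups_per_hour(records, fmt="%m/%d/%Y %H:%M:%S"):
    """Count pick-ups for each hour of the day (0-23)."""
    hours = Counter()
    for date_time, _lat, _lon, _base in records:
        hours[datetime.strptime(date_time, fmt).hour] += 1
    return hours


hourly = pickups_per_hour(pickups)
print(hourly.most_common(1))  # the busiest hour and its pick-up count
```

The pick-up locations (Lat, Lon) are handled separately by clustering, as outlined in the solution strategy.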

Market Analysis for US Cab Start-Ups Solution Strategy
1. Uber Pick-Up Locations Dataset
2. Storing Data in HDFS
3. Transforming Dataset
4. K-Means Clustering On Latitude & Longitude
Output: Predictions (Lat, Lon)

Let Us Know What It Takes…

Fundamentals Road Map
1. Introduction to Hadoop & Spark
2. HDFS (Hadoop Storage)
3. YARN (Hadoop Processing)
4. Apache Spark
5. K-Means & Zeppelin
6. Solution of Use-Cases

Introduction to Hadoop & Spark

Hadoop
Hadoop is a framework that allows you to store and process large data sets in a parallel and distributed fashion.
❖ Hadoop has two core components:
▪ HDFS: Allows you to store any kind of data across the cluster
▪ YARN: Allows parallel processing of the data stored in HDFS

Spark
Apache Spark is an open-source cluster-computing framework for real-time processing.
❖ Provides an interface for programming entire clusters with implicit data parallelism and fault tolerance
❖ Can run on top of YARN and extends the MapReduce model to efficiently support more types of computations

Spark Complementing Hadoop

Spark & Hadoop
Challenges Addressed:
1. Faster Analytics: Spark can process data up to 100 times faster than MapReduce for in-memory workloads
2. Cost Optimization: Spark applications can run on YARN, leveraging the Hadoop cluster
3. Avoid Duplication: Apache Spark can use HDFS as its storage
Combining Spark's strengths, i.e. high processing speed, advanced analytics and multiple integration support, with Hadoop's low-cost operation on commodity hardware gives the best results.

Big Data Use-Cases Solution Architecture
Kafka
Storing Big Data on HDFS
Processing through YARN framework
Tools used for processing (solution options): Apache Hive, MapReduce, Apache Spark

HDFS (Hadoop Storage)

HDFS
❖ HDFS stands for Hadoop Distributed File System
❖ HDFS is the storage unit of Hadoop
❖ HDFS creates an abstraction layer over the distributed storage resources, so that the whole of HDFS can be seen as a single unit
[Diagram: NameNode, Secondary NameNode and three DataNodes]

NameNode
• Master daemon
• Maintains and manages the DataNodes
• Records metadata, e.g. location of the stored blocks, size of the files, permissions, hierarchy, etc.
• Receives heartbeats and block reports from all the DataNodes

Secondary NameNode
• Checkpointing, performed by the Secondary NameNode, is the process of combining the edit logs with the FsImage
• Allows faster failover, as we have a backup of the metadata
• Checkpointing happens periodically (default: 1 hour)
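
The idea behind checkpointing can be pictured with a toy model: the FsImage is a snapshot of the namespace, the edit log is a list of changes made since that snapshot, and a checkpoint replays the log onto the snapshot. The dictionary layout and the `create`/`delete` operation names below are hypothetical and do not reflect the real HDFS file formats.

```python
def checkpoint(fsimage, edit_log):
    """Replay each logged edit onto a copy of the FsImage and return the merged image."""
    merged = dict(fsimage)
    for op, path in edit_log:
        if op == "create":
            merged[path] = {"replicas": 3}
        elif op == "delete":
            merged.pop(path, None)
    return merged


# Hypothetical snapshot and the edits recorded after it was taken.
fsimage = {"/data/a.txt": {"replicas": 3}}
edit_log = [("create", "/data/b.txt"), ("delete", "/data/a.txt")]

new_fsimage = checkpoint(fsimage, edit_log)
print(new_fsimage)
```

After a checkpoint the edit log can start empty again, which keeps the amount of log to replay small.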


DataNode
• Slave daemon
• Stores the actual data
• Serves read and write requests

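HDFS Architecture in Detail
[Diagram: clients send metadata ops to the NameNode, which holds the metadata (name, replicas, …), e.g. /hdfs/foo/data, 3, …; clients read from and write to the DataNodes on Rack 1 and Rack 2; the NameNode sends block ops to the DataNodes, and blocks are replicated between DataNodes]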

HDFS Block & Replication

HDFS Data Block
• Each file is stored on HDFS as blocks
• The default size of each block is 128 MB
• Let us say, I have a file example.txt of size 380 MB:
  Block 1: 128 MB, Block 2: 128 MB, Block 3: 124 MB
How many blocks will be created if a file of size 500 MB is copied to HDFS?
• A file of size 500 MB is stored as four blocks:
  Block 1: 128 MB, Block 2: 128 MB, Block 3: 128 MB, Block 4: 116 MB
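
The block arithmetic for the 380 MB and 500 MB files can be checked with a short helper. The function name `split_into_blocks` is made up for this sketch; it only models block sizes, not how HDFS actually writes data.

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the list of block sizes (in MB) for a file of the given size."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full_blocks + ([remainder] if remainder else [])


blocks_380 = split_into_blocks(380)
blocks_500 = split_into_blocks(500)
print(blocks_380)
print(blocks_500)
```

The last block only occupies as much space as the remaining data needs, so it is usually smaller than the configured block size.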

HDFS Block Replication
• A 248 MB file is stored as Block 1 (128 MB) and Block 2 (120 MB)
• Each data block is replicated (thrice by default), and the replicas are distributed across different DataNodes
Replication Factor = 3
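
As a small worked example of what a replication factor of 3 means for storage, the sketch below counts the block replicas and the raw space used by the 248 MB file from this slide. The helper name is hypothetical.

```python
def replicated_footprint(block_sizes_mb, replication_factor=3):
    """Return (number of block replicas, raw storage in MB) for the given blocks."""
    replicas = len(block_sizes_mb) * replication_factor
    raw_storage_mb = sum(block_sizes_mb) * replication_factor
    return replicas, raw_storage_mb


# Block 1 = 128 MB and Block 2 = 120 MB, as on the slide.
replicas, raw_storage_mb = replicated_footprint([128, 120])
print(replicas, raw_storage_mb)
```

So the 248 MB file occupies three times its size in raw cluster storage, spread over six block replicas.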

Rack Awareness
• The Rack Awareness Algorithm reduces latency and provides fault tolerance by controlling where the replicas of each data block are placed
• According to the Rack Awareness Algorithm, the first replica of a block is stored on the local rack and the next two replicas are stored on a different (remote) rack
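
The placement rule above can be written down as a tiny function. This is a simplified sketch: the rack names are invented, and the remote rack is simply the first rack that differs from the local one, whereas a real cluster chooses among the eligible racks and nodes.

```python
def place_replicas(local_rack, racks):
    """Return the rack chosen for each of the three replicas of one block."""
    # Assumption: take the first rack that is not the local rack as the remote rack.
    remote_rack = next(r for r in racks if r != local_rack)
    return [local_rack, remote_rack, remote_rack]


placement = place_replicas("rack-1", ["rack-1", "rack-2", "rack-3"])
print(placement)
```

Keeping two replicas on the same remote rack limits cross-rack traffic while still surviving the loss of an entire rack.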

Rack Awareness
Rack - 1 Rack - 2 Rack - 3

HDFS Fault Tolerance
If a DataNode fails, its data blocks can be recovered and retrieved from the replicas stored on other DataNodes.
Replication Factor = 3

Start Hadoop Daemons
1. ./sbin/start-all.sh
   Starts all the Hadoop daemons (HDFS & YARN)
2. ./sbin/stop-all.sh
   Stops all the Hadoop daemons
3. jps
   Checks all the daemons running on your machine

Writing & Deleting a File in Hadoop
1. hdfs dfs -put /test.txt /
   Copies a file from the local file system to HDFS
2. hdfs dfs -ls /
   Lists all the HDFS files/directories
3. hdfs dfs -rm /test.txt
   Deletes the file from HDFS
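
If these file operations need to be scripted, one option is to build the `hdfs dfs` argument lists in Python and hand them to `subprocess`. The sketch below only builds and prints the commands, so it runs without a Hadoop installation; the helper name `hdfs_dfs` is made up for this example.

```python
import shlex


def hdfs_dfs(*args):
    """Build the argument list for an `hdfs dfs` call without running it."""
    return ["hdfs", "dfs", *args]


commands = [
    hdfs_dfs("-put", "/test.txt", "/"),  # copy a local file to the HDFS root
    hdfs_dfs("-ls", "/"),                # list the HDFS root directory
    hdfs_dfs("-rm", "/test.txt"),        # delete the file from HDFS
]

for cmd in commands:
    print(shlex.join(cmd))
```

On a machine where Hadoop is installed, each list could be passed to `subprocess.run(cmd, check=True)` to execute it.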

YARN (Hadoop Processing)

What is YARN?
• Hadoop 2.0 introduced a new framework, YARN (Yet Another Resource Negotiator), which provides the ability to run non-MapReduce applications.
• It provides a paradigm for parallel processing over Hadoop.
• The YARN framework is responsible for integrating different tools, such as Spark, Hive and Pig, with Hadoop.

ResourceManager
• Receives the processing requests
• Passes the requests to the corresponding NodeManagers

NodeManager
• Installed on every DataNode
• Responsible for the execution of tasks on every single DataNode

YARN Architecture in Detail

YARN Workflow

Application Submission in YARN
[Diagram: MR Code and MR Job (Client Mode); Resource Manager (RM Node); NodeManager with AppMaster JVM; NodeManager with MR Task in a YARN child task JVM]
1. Run Job
2. Submit Job
3. Get application ID
4.1. Start Container
4.2. Launch AppMaster
5. Allocate Resources
6.1. Start Container
6.2. Launch YARN child task JVM
7. Execute

YARN Application Workflow
(RM = ResourceManager, NM = NodeManager, AM = ApplicationMaster)
1. Client submits an application
2. RM allocates a container to start the AM
3. AM registers with the RM
4. AM requests containers from the RM
5. AM notifies the NM to launch the containers
6. Application code is executed in the containers
7. Client contacts the RM/AM to monitor the application's status
8. AM unregisters with the RM

Hadoop Cluster Architecture =
HDFS + YARN

Hadoop Cluster Architecture
Master Slave
HDFS YARN

Hadoop Cluster Hardware Specification
NameNode
▪ RAM: 64 GB
▪ Hard disk: 1 TB
▪ Processor: Xeon with 8 cores
▪ Ethernet: 3 x 10 GB/s
▪ OS: 64-bit CentOS / Linux
▪ Power: Redundant Power Supply
Secondary NameNode
▪ RAM: 32 GB
▪ Hard disk: 1 TB
▪ Processor: Xeon with 4 cores
▪ Power: Redundant Power Supply
DataNode
▪ RAM: 16 GB
▪ Hard disk: 6 x 2 TB
▪ Processor: Xeon with 2 cores

Real Time Hadoop Cluster Deployment

Hadoop Cluster : Facebook Use Case
21 PB of storage in a
single HDFS cluster
2000 Machines
Per Cluster
12 TB of Data
Per Machine
1200 machines with
8 cores each + 800
machines with 16
cores each
32 GB of RAM
per machine
15 map-
reduce tasks
per machine
That's a total of more than 21 PB of configured storage capacity, which is larger than the previously known Yahoo! cluster of 14 PB.
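
The per-machine figures on this slide can be multiplied out to get cluster-wide totals. The snippet below is just that arithmetic; note that the raw disk total comes out higher than the quoted 21 PB of configured capacity, and the slide does not explain the difference.

```python
machines = 2000

raw_storage_tb = machines * 12          # 12 TB of data per machine
total_cores = 1200 * 8 + 800 * 16       # two machine types with 8 and 16 cores
total_ram_gb = machines * 32            # 32 GB of RAM per machine
max_mr_tasks = machines * 15            # 15 map-reduce tasks per machine

print(raw_storage_tb, total_cores, total_ram_gb, max_mr_tasks)
```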

Hadoop Cluster : Spotify Use Case
Use Hadoop for generating
music recommendations
1650 node cluster
~ 65 PB storage
70 TB RAM
25,000+ daily Hadoop jobs
43,000 virtualised cores

Apache Spark

Spark Core Components

Spark Core
Spark Core is the base engine for large-scale parallel and distributed data processing
It is responsible for:
▪ Memory management and fault recovery
▪ Scheduling, distributing and monitoring jobs on a cluster
▪ Interacting with storage systems

Spark Architecture

Spark SQL
• Spark SQL integrates relational processing with Spark’s functional programming
• Provides support for various data sources and makes it possible to weave SQL queries with code transformations
[Diagram: data sources (CSV, JSON, JDBC); Data Source API; DataFrame API; DataFrame DSL, Spark SQL and HQL]

Start Spark Daemons
1. ./sbin/start-all.sh
   Starts all the Spark daemons (Master & Worker)
2. jps
   Checks all the daemons running on your machine
3. ./bin/spark-shell
   Starts the Spark shell

K-Means & Zeppelin

K-Means Clustering
K-means clustering is the process by which objects are classified into a predefined number of groups so that they are as dissimilar as possible from one group to another, but as similar as possible within each group.
▪ The objects in group 1 should be as similar as possible
▪ But there should be a large difference between an object in group 1 and an object in group 2
▪ The attributes of the objects determine which objects should be grouped together
Total population
Group 1 Group 2 Group 3 Group 4
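
To make the grouping idea concrete, here is a minimal plain-Python sketch of the standard k-means iteration (Lloyd's algorithm): assign every point to its nearest centroid, then move each centroid to the mean of its points. It is not the Spark MLlib implementation used in the use cases, and the points and starting centroids are invented for illustration.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Run a fixed number of assign/update rounds and return (centroids, clusters)."""
    for _ in range(iterations):
        # Assignment step: put each point into the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical 2-D points forming two obvious groups, with k = 2 starting centroids.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
print(centroids)
print(clusters)
```

The number of groups is fixed in advance by the number of starting centroids, which matches the "predefined number of groups" in the definition.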

K-Means Clustering
▪ Consider a comparison on Income & Balance:
[Plot: Current Balance (Low, Medium, High) against Gross Monthly Income (Low, Medium, High)]
Example Cluster 1: High Balance, Low Income
Example Cluster 2: High Income, Low Balance
The objects in Cluster 1 have similar characteristics (High Balance and Low Income).
The objects in Cluster 2 also share the same characteristics (High Income and Low Balance).

Example
▪ The plot of students in an area is as given below
I need to find specific locations to build schools in this area so that the students don't have to travel much.

Example
▪ Using k-means clustering we got output as:

Apache Zeppelin

What is Zeppelin?
▪ A completely open web-based notebook that enables interactive data analytics
▪ Web-based notebook which brings data ingestion, data exploration, visualization, sharing and collaboration features to Hadoop & Spark
▪ Various languages are supported via Zeppelin language interpreters

Solution of Use-Case 1

US County Solution
1. Storing Data in HDFS
2. Analyzing Data using Spark SQL
3. Visualizing the data in Zeppelin

Solution of Use-Case 2

Edureka LMS

LMS: Getting Started

LMS: Pre-Recorded Session

LMS: Course Content

LMS: Projects
