SlideShare a Scribd company logo
1 of 107
Hadoop Tutorial
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Big Data Use Cases
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Big Data Use-Cases
US Primary Election
Analysis
Market Analysis for
US Cab Startup
1 2
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
US Primary Election Analysis
1
2
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
US Election
STEP 1: Primary & Caucuses
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
US Election
STEP 1: Primary & Caucuses
STEP 2: National Conventions
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
US Election
STEP 1: Primary & Caucuses
STEP 2: National Conventions
STEP 3: General Elections
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
US Election
STEP 1: Primary & Caucuses
STEP 2: National Conventions
STEP 4: Electoral CollegeSTEP 3: General Elections
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
US Primary Election
PROBLEM STATEMENT:
In the US Primary Election 2016, Hillary Clinton was nominated over Bernie Sanders from Democrats and on the other hand,
Donald Trump was nominated from Republican Party to contest for the presidential position.
As an analyst, you have been tasked to understand different factors that led to the winning of Hillary Clinton and Donald Trump in
the primary elections based on demographic features to plan their next initiatives and campaigns.
Republican Democrat
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
US Primary Election Dataset
US Primary Election Data Set
US Demographic Features
(County-wise) Data Set
Now as a data analyst you have 2 datasets available :
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
US Primary Election Dataset
state: List of US states
state_abbreviation: Abbreviation of each US state
county: List of counties in each US states
fips: FIPS county code is a Federal Information Processing
Standards (FIPS) code which uniquely identifies counties
party: Different parties in US (i.e. Republican & Democrat)
candidate: candidates in US primary election from different parties
votes: number of votes gained by a candidate
fraction_votes: total number of votes gained by a candidate/ total
votes gained by the party
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
US County Demographic Features Dataset
DETAILS:
Population, 2014 estimate
Population, 2010 (April 1) estimates base
Population, percent change - April 1, 2010 to July 1, 2014
Population, 2010
Persons under 5 years, percent, 2014
Persons under 18 years, percent, 2014
Persons 65 years and over, percent, 2014
Female persons, percent, 2014
White alone, percent, 2014 …
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
US Election Solution Strategy
US Primary
Election Dataset
1
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
US Election Solution Strategy
Storing Data
in HDFS
2
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
US Election Solution Strategy
Processing Data Using
Spark Components
3
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
US Election Solution Strategy
Transforming Data Using
Spark SQL
4
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
US Election Solution Strategy
Clustering Data Using
Spark MLlib (K-Means)
5
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
US Election Solution Strategy
Visualizing the Result
Using Zeppelin
6
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
US Election Solution Strategy
US Primary
Election Dataset
Storing Data
in HDFS
Processing Data Using
Spark Components
Transforming Data Using
Spark SQL
Clustering Data Using
Spark MLlib (K-Means)
Visualizing the Result
Using Zeppelin
1 2
3
456
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Visualization of Result
1
2
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
2
Market Analysis for US Cab Start-Ups
1
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Market Analysis for US Cab Start-Ups
PROBLEM STATEMENT:
A US cab service start-up wants to meet the demands in an optimum manner and maximize the profit.
Thus, they hired you as a data analyst to interpret the available Uber’s data set and find out the beehive customer pick-up points &
peak hours for meeting the demand in a profitable manner.
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Uber Dataset
• Date/Time – Pickup Date & Time
• Lat – Latitude of Pickup
• Lon – Longitude of Pickup
• Base – TLC Base Code
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Market Analysis for US Cab Start-Ups Solution Strategy
Uber Pick-Up
Locations Dataset
Predictions
1
Lat
Lon
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Market Analysis for US Cab Start-Ups Solution Strategy
Predictions
Storing Data
in HDFS
2
Lat
Lon
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Market Analysis for US Cab Start-Ups Solution Strategy
Predictions
Transforming
Dataset
3
Lat
Lon
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Market Analysis for US Cab Start-Ups Solution Strategy
Predictions
K-Means Clustering On
Latitude & Longitude
4
Lat
Lon
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Market Analysis for US Cab Start-Ups Solution Strategy
Uber Pick-Up
Locations Dataset
Predictions
Lat
Lon
Transforming
DataSet
K-Means Clustering On
Latitude & Longitude
1
3
4
Storing Data
in HDFS
2
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Let Us Know What It Takes…
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Fundamentals Road Map
Introduction to
Hadoop & Spark HDFS
(Hadoop Storage)
YARN
(Hadoop Processing)
Apache Spark
K-Means &
Zeppelin
Solution of
Use-Cases
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Introduction to
Hadoop & Spark
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Hadoop Spark
Introduction to Hadoop & Spark
Hadoop is a framework that allows you to store and process large
data sets in parallel and distributed fashion.
Apache Spark is an open-source cluster-computing framework for
real time processing
❖ Provides an interface for programming entire clusters with
implicit data parallelism and fault-tolerance
❖ Built on top of YARN and it extends the YARN model to efficiently
use more types of computations
❖ Hadoop has two core components:
▪ HDFS: Allows to dump any kind of data across the cluster
▪ YARN: Allows parallel processing of the data stored in HDFS
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Spark Complementing Hadoop
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Spark processes data 100 times faster than
MapReduce
Spark & Hadoop
1
Spark Applications can run on YARN leveraging
Hadoop cluster2
Apache Spark can use HDFS as its storage
3
Faster Analytic
Cost Optimization
Avoid Duplication
Challenges Addressed :
Combining Spark’s ability, i.e. high processing speed, advance analytics and multiple integration support
with Hadoop’s low cost operation on commodity hardware gives the best results
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Big Data Use-Cases
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Big Data Use-Cases
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Big Data Use-Cases
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Kafka
Big Data Use-Cases Solution Architecture
Apache HIVE
MapReduce
Apache Spark
SolutionOptions
Storing Big Data on
HDFS
Processing through
YARN framework
Tools used for processing
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Introduction to
Hadoop & Spark HDFS
(Hadoop Storage)
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
HDFS
❖ HDFS stands for Hadoop Distributed File System
❖ HDFS is the storage unit of Hadoop
HDFS creates an abstraction layer over the distributed storage resources, from where we can see the whole HDFS
as a single unit.
DataNodeDataNode DataNode
NameNode
Secondary
NameNode
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
NameNode
NameNode
Secondary
NameNode
NameNode
• Master daemon
• Maintains and Manages DataNodes
• Records metadata e.g. location of blocks stored, the size of the files,
permissions, hierarchy, etc.
• Receives heartbeat and block report from all the DataNodes
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Secondary NameNode
NameNode
Secondary
NameNode
Secondary NameNode
• Checkpointing is a process of combining edit logs with FsImage
• Allows faster Failover as we have a back up of the metadata
• Checkpointing happens periodically (default: 1 hour)
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Secondary NameNode
NameNode
Secondary
NameNode
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
DataNode
DataNode
• Slave daemons
• Stores actual data
• Serves read and write requests
DataNodeDataNode DataNode
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
HDFS Architecture in Detail
NameNode
Metadata (Name, replicas, …):
/hdfs/foo/data, 3, …
Client
Client
Replication
Block ops
DataNodesDataNodes
Read
Write
Metadata ops
Rack 2Rack 1
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
HDFS Block & Replication
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
HDFS Data Block
380 MB
How many blocks will be created if a file of
size 500 MB is copied to HDFS?
• Each file is stored on HDFS as block
• The default size of each block is 128 MB
• Let us say, I have a file example.txt of size 380 MB:
128
MB
128
MB
124
MB
Block 1 Block 2 Block 3
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
HDFS Data Block
• Each file is stored on HDFS as block
• The default size of each block is 128 MB
• Let us say, I have a file example.txt of size 500 MB:
How many blocks will be created if a file of
size 500 MB is copied to HDFS?
128
MB
128
MB
128
MB
500 MB
Block 1 Block 2 Block 3
116
MB
Block 4
380 MB
128
MB
128
MB
124
MB
Block 1 Block 2 Block 3
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
HDFS Block Replication
NameNode
DataNodeDataNode DataNode
Secondary
NameNode
128
MB
120
MB
248 MB Block 1 Block 2
Each data blocks are replicated (thrice by default)
and are distributed across different DataNodes
Replication Factor = 3
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Rack Awareness
• Rack Awareness Algorithm reduces latency as well as provide fault tolerance by replicating data block
• Rack Awareness Algorithm says that the first replica of a block will be stored on a local rack & the next two replicas
will be stored on a different (remote) rack
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Rack Awareness
Rack - 1 Rack - 2 Rack - 3
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Rack Awareness
Rack - 1 Rack - 2 Rack - 3
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Rack Awareness
Rack - 1 Rack - 2 Rack - 3
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Rack Awareness
Rack - 1 Rack - 2 Rack - 3
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Rack Awareness
Rack - 1 Rack - 2 Rack - 3
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Rack Awareness
Rack - 1 Rack - 2 Rack - 3
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
HDFS Fault Tolerance
If a DataNode fails, the data blocks can be
recovered and retrieved from the replicas stored on
another DataNodes.
Replication Factor = 3
NameNode
DataNodeDataNode DataNode
Secondary
NameNode
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
HDFS Fault Tolerance
If a DataNode fails, the data blocks can be
recovered and retrieved from the replicas stored on
another DataNodes.
Replication Factor = 3
NameNode
DataNodeDataNode DataNode
Secondary
NameNode
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Start Hadoop Daemons
./sbin/start-all.sh
1
./sbin/stop-all.sh
2
jps
3
Starts all the Hadoop daemons(HDFS & YARN)
Stops all the Hadoop daemons
Checks all the daemons running on you machines
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Writing & Deleting a File in Hadoop
hdfs fs –put /test.txt /
1
hdfs dfs –ls /
2
hdfs fs –rm /test.txt /
3
Coping a file from local file system to HDFS
Lists all the HDFS files/directories
Deleting the file from HDFS
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Introduction to
Hadoop & Spark HDFS
(Hadoop Storage)
YARN
(Hadoop Processing)
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
What is YARN ?
• Hadoop 2.0 came up with new framework YARN ( Yet Another Resource Negotiator ), which provides ability to run Non-
MapReduce application.
• It provides a paradigm for parallel processing over Hadoop.
• YARN framework is responsible for integration of different tools with Hadoop like Spark, Hive, Pig.
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
ResourceManager
ResourceManager
ResourceManager
• Receives the processing requests
• Passes the requests to corresponding NodeManagers
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
NodeManager
NodeManagerNodeManager NodeManager
NodeManager
• Installed on every DataNode
• Responsible for execution of task on every single DataNode
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
YARN Architecture in Detail
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
YARN Workflow
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Application Submission in YARN
MR
Code
MR
Job
Client Mode
Resource Manager
RM Node
NodeManager
AppMaster JVM
NodeManager
MR Task
YARN child
task JVM
1. Run Job
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Application Submission in YARN
MR
Code
MR
Job
Client Mode
Resource Manager
RM Node
NodeManager
AppMaster JVM
NodeManager
MR Task
2. Submit Job
YARN child
task JVM
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Application Submission in YARN
MR
Code
MR
Job
Client Mode
Resource Manager
RM Node
NodeManager
AppMaster JVM
NodeManager
MR Task
3. Get application ID
YARN child
task JVM
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Application Submission in YARN
MR
Code
MR
Job
Client Mode
Resource Manager
RM Node
NodeManager
AppMaster JVM
NodeManager
MR Task
4.1. Start Container
4.2. Launch
AppMaster YARN child
task JVM
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Application Submission in YARN
MR
Code
MR
Job
Client Mode
Resource Manager
RM Node
NodeManager
AppMaster JVM
NodeManager
MR Task
YARN child
task JVM5. Allocate Resources
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Application Submission in YARN
MR
Code
MR
Job
Client Mode
Resource Manager
RM Node
NodeManager
AppMaster JVM
NodeManager
MR Task
6.1. Start Container
6.2. Launch
YARN child
task JVM
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Application Submission in YARN
MR
Code
MR
Job
Client Mode
Resource Manager
RM Node
NodeManager
AppMaster JVM
NodeManager
MR Task
YARN child
task JVM
7. Execute
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
1. Client submits an application
2. RM allocates a container to start AM
3. AM registers with RM
4. AM asks containers from RM
5. AM notifies NM to launch containers
6. Application code is executed in container
7. Client contacts RM/AM to monitor application’s status
8. AM unregisters with RM
6
3
2
4
5
4
5
1
RM NM AMClient
4
5
7
8
YARN Application Workflow
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Hadoop Cluster Architecture =
HDFS + YARN
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Hadoop Cluster Architecture
MasterSlave
HDFS YARN
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Hadoop Cluster Hardware Specification
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Hadoop Cluster Hardware Specification
▪ RAM: 64 GB,
▪ Hard disk: 1 TB
▪ Processor: Xenon with 8 Cores
▪ Ethernet: 3 x 10 GB/s
▪ OS: 64-bit CentOS / Linux
▪ Power: Redundant Power Supply
▪ RAM: 32 GB
▪ Hard disk: 1 TB
▪ Processor: Xenon with 4 Cores
▪ Ethernet: 3 x 10 GB/s
▪ OS: 64-bit CentOS / Linux
▪ Power: Redundant Power Supply
▪ RAM: 16GB
▪ Hard disk: 6 x 2TB
▪ Processor: Xenon with 2 cores
▪ Ethernet: 3 x 10 GB/s
▪ OS: 64-bit CentOS / Linux
NameNode
Secondary NameNode
DataNode
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Real Time Hadoop Cluster Deployment
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Hadoop Cluster : Facebook Use Case
21 PB of storage in a
single HDFS cluster
2000 Machines
Per Cluster
12 TB of Data
Per Machine
1200 machines with
8 cores each + 800
machines with 16
cores each
32 GB of RAM
per machine
15 map-
reduce tasks
per machine
That's a total of more than 21 PB of configured
storage capacity, this is larger than the
previously known Yahoo!'s cluster of 14 PB
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Hadoop Cluster : Spotify Use Case
Use Hadoop for generating
music recommendations
1650 node cluster
~ 65 PB storage
70 TB RAM
+25,000 daily Hadoop jobs
43,000 virtualised cores
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Introduction to
Hadoop & Spark HDFS
(Hadoop Storage)
YARN
(Hadoop Processing)
Apache Spark
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Spark Core Components
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Spark Core
Spark Core is the base engine for large-scale parallel and distributed data processing
It is responsible for:
▪ Memory management and fault recovery
▪ Scheduling, distributing and monitoring jobs on a cluster
▪ Interacting with storage systems
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Spark Architecture
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Spark SQL
• Spark SQL integrates relational processing with Spark’s functional programming
• Provides support for various data sources and makes it possible to weave SQL queries with code transformations
CSV JSON JDBC
Data Source API
Data Frame API
DataFrame DSL Spark SQL and HQL
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Start Spark Daemons
./sbin/start-all.sh
1
jps
2
./bin/spark-shell
3
Starts all the Spark daemons(Master & Worker)
Checks all the daemons running on you machines
Starts the Spark Shell
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Introduction to
Hadoop & Spark HDFS
(Hadoop Storage)
YARN
(Hadoop Processing)
Apache Spark
K-Means & Zeppelin
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
K-Means Clustering
The process by which objects are classified into a predefined number of groups so that they are as much dissimilar as possible
from one group to another group, but as much similar as possible within each group
▪ The objects in group 1 should be as similar as
possible
▪ But there should be much difference between an
object in group 1 and group 2
▪ The attributes of the objects are allowed to
determine which objects should be grouped
together
Total population
Group 1 Group 2 Group 3 Group 4
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
K-Means Clustering
▪ Consider a comparison on Income & Balance:
CurrentBalance
High
High
Medium
Medium
Low
Low
Gross Monthly Income
Example Cluster 1
High Balance
Low Income
Example Cluster 2
High Income
Low Balance
The objects in Cluster 1
have similar characteristics
(High Income and Low
balance)
Also the objects in Cluster 2
have the same
characteristic (High Balance
and Low Income)
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Example
▪ The plot of students in an area is as given below
I need to find specific
locations to build
schools in this area so
that the students
doesn’t have to travel
much
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Example
▪ Using k-means clustering we got output as:
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Apache Zeppelin
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
What is Zeppelin?
▪ A completely open web-based notebook that enables interactive data analytics
▪ Web-based notebook which brings data ingestion, data exploration, visualization, sharing and collaboration features to Hadoop & Spark
▪ The various languages are supported via Zeppelin language interpreters
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Introduction to
Hadoop & Spark HDFS
(Hadoop Storage)
YARN
(Hadoop Processing)
Apache Spark
K-Means & Zeppelin
Solution of
Use-case
1
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Analyzing Data using
SparkSQL
Storing Data in HDFS
Visualizing the data
US County Solution
HDFS
Zeppelin
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
US Election Solution Strategy
US Primary
Election Dataset
Storing Data
in HDFS
Processing Data Using
Spark Components
Transforming Data Using
Scala & Spark SQL
Clustering Data Using
Spark MLlib (K-Means)
Visualizing the Result
Using Zeppelin
1 2
3
456
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Introduction to
Hadoop & Spark HDFS
(Hadoop Storage)
YARN
(Hadoop Processing)
Apache Spark
K-Means & Zeppelin
Solution of
Use-case
12
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Market Analysis for US Cab Start-Ups Solution Strategy
Uber Pick-Up
Locations Dataset
Predictions
Lat
Lon
Transforming
DataSet
K-Means Clustering On
Latitude & Longitude
1
3
4
Storing Data
in HDFS
2
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Edureka LMS
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
LMS: Getting Started
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
LMS: Pre-Recorded Session
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
LMS: Course Content
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
LMS: Projects
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Hadoop Tutorial | Big Data Hadoop Tutorial For Beginners | Hadoop Certification Training | Edureka

More Related Content

What's hot

Building Big Data Applications using Spark, Hive, HBase and Kafka
Building Big Data Applications using Spark, Hive, HBase and KafkaBuilding Big Data Applications using Spark, Hive, HBase and Kafka
Building Big Data Applications using Spark, Hive, HBase and KafkaAshish Thapliyal
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoopjoelcrabb
 
Azure DataBricks for Data Engineering by Eugene Polonichko
Azure DataBricks for Data Engineering by Eugene PolonichkoAzure DataBricks for Data Engineering by Eugene Polonichko
Azure DataBricks for Data Engineering by Eugene PolonichkoDimko Zhluktenko
 
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...Simplilearn
 
PySpark dataframe
PySpark dataframePySpark dataframe
PySpark dataframeJaemun Jung
 
Using Databricks as an Analysis Platform
Using Databricks as an Analysis PlatformUsing Databricks as an Analysis Platform
Using Databricks as an Analysis PlatformDatabricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
Introduction to PySpark
Introduction to PySparkIntroduction to PySpark
Introduction to PySparkRussell Jurney
 
How To Become A Big Data Engineer? Edureka
How To Become A Big Data Engineer? EdurekaHow To Become A Big Data Engineer? Edureka
How To Become A Big Data Engineer? EdurekaEdureka!
 
Big Data Tutorial For Beginners | What Is Big Data | Big Data Tutorial | Hado...
Big Data Tutorial For Beginners | What Is Big Data | Big Data Tutorial | Hado...Big Data Tutorial For Beginners | What Is Big Data | Big Data Tutorial | Hado...
Big Data Tutorial For Beginners | What Is Big Data | Big Data Tutorial | Hado...Edureka!
 
Infrastructure Agnostic Machine Learning Workload Deployment
Infrastructure Agnostic Machine Learning Workload DeploymentInfrastructure Agnostic Machine Learning Workload Deployment
Infrastructure Agnostic Machine Learning Workload DeploymentDatabricks
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudNoritaka Sekiyama
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big DataHaluan Irsad
 
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...Hortonworks
 
Cassandra at eBay - Cassandra Summit 2012
Cassandra at eBay - Cassandra Summit 2012Cassandra at eBay - Cassandra Summit 2012
Cassandra at eBay - Cassandra Summit 2012Jay Patel
 

What's hot (20)

Building Big Data Applications using Spark, Hive, HBase and Kafka
Building Big Data Applications using Spark, Hive, HBase and KafkaBuilding Big Data Applications using Spark, Hive, HBase and Kafka
Building Big Data Applications using Spark, Hive, HBase and Kafka
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
SQOOP PPT
SQOOP PPTSQOOP PPT
SQOOP PPT
 
How to build a successful Data Lake
How to build a successful Data LakeHow to build a successful Data Lake
How to build a successful Data Lake
 
Azure DataBricks for Data Engineering by Eugene Polonichko
Azure DataBricks for Data Engineering by Eugene PolonichkoAzure DataBricks for Data Engineering by Eugene Polonichko
Azure DataBricks for Data Engineering by Eugene Polonichko
 
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
 
PPT on Hadoop
PPT on HadoopPPT on Hadoop
PPT on Hadoop
 
PySpark dataframe
PySpark dataframePySpark dataframe
PySpark dataframe
 
Using Databricks as an Analysis Platform
Using Databricks as an Analysis PlatformUsing Databricks as an Analysis Platform
Using Databricks as an Analysis Platform
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Introduction to PySpark
Introduction to PySparkIntroduction to PySpark
Introduction to PySpark
 
How To Become A Big Data Engineer? Edureka
How To Become A Big Data Engineer? EdurekaHow To Become A Big Data Engineer? Edureka
How To Become A Big Data Engineer? Edureka
 
Big Data Tutorial For Beginners | What Is Big Data | Big Data Tutorial | Hado...
Big Data Tutorial For Beginners | What Is Big Data | Big Data Tutorial | Hado...Big Data Tutorial For Beginners | What Is Big Data | Big Data Tutorial | Hado...
Big Data Tutorial For Beginners | What Is Big Data | Big Data Tutorial | Hado...
 
Infrastructure Agnostic Machine Learning Workload Deployment
Infrastructure Agnostic Machine Learning Workload DeploymentInfrastructure Agnostic Machine Learning Workload Deployment
Infrastructure Agnostic Machine Learning Workload Deployment
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
Introduction to ETL and Data Integration
Introduction to ETL and Data IntegrationIntroduction to ETL and Data Integration
Introduction to ETL and Data Integration
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
 
Cassandra at eBay - Cassandra Summit 2012
Cassandra at eBay - Cassandra Summit 2012Cassandra at eBay - Cassandra Summit 2012
Cassandra at eBay - Cassandra Summit 2012
 

Similar to Hadoop Tutorial | Big Data Hadoop Tutorial For Beginners | Hadoop Certification Training | Edureka

Big Data Use Cases | Hadoop Tutorial for Beginners | Hadoop Training | Edureka
Big Data Use Cases | Hadoop Tutorial for Beginners | Hadoop Training | EdurekaBig Data Use Cases | Hadoop Tutorial for Beginners | Hadoop Training | Edureka
Big Data Use Cases | Hadoop Tutorial for Beginners | Hadoop Training | EdurekaEdureka!
 
What is HDFS | Hadoop Distributed File System | Edureka
What is HDFS | Hadoop Distributed File System | EdurekaWhat is HDFS | Hadoop Distributed File System | Edureka
What is HDFS | Hadoop Distributed File System | EdurekaEdureka!
 
Processing Life Science Data at Scale - using Semantic Web Technologies
Processing Life Science Data at Scale - using Semantic Web TechnologiesProcessing Life Science Data at Scale - using Semantic Web Technologies
Processing Life Science Data at Scale - using Semantic Web TechnologiesSyed Muhammad Ali Hasnain
 
Talend Big Data Tutorial | Talend DI and Big Data Certification | Talend Onli...
Talend Big Data Tutorial | Talend DI and Big Data Certification | Talend Onli...Talend Big Data Tutorial | Talend DI and Big Data Certification | Talend Onli...
Talend Big Data Tutorial | Talend DI and Big Data Certification | Talend Onli...Edureka!
 
STG314-Case Study Learn How HERE Uses JFrog Artifactory w Amazon EFS Support ...
STG314-Case Study Learn How HERE Uses JFrog Artifactory w Amazon EFS Support ...STG314-Case Study Learn How HERE Uses JFrog Artifactory w Amazon EFS Support ...
STG314-Case Study Learn How HERE Uses JFrog Artifactory w Amazon EFS Support ...Amazon Web Services
 
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Simplilearn
 
Double Your Hadoop Hardware Performance with SmartSense
Double Your Hadoop Hardware Performance with SmartSenseDouble Your Hadoop Hardware Performance with SmartSense
Double Your Hadoop Hardware Performance with SmartSenseHortonworks
 
Big Data Engineer Skills and Job Description | Edureka
Big Data Engineer Skills and Job Description | EdurekaBig Data Engineer Skills and Job Description | Edureka
Big Data Engineer Skills and Job Description | EdurekaEdureka!
 
Customer-Product Analysis With Tableau | Tableau Training For Beginners | Tab...
Customer-Product Analysis With Tableau | Tableau Training For Beginners | Tab...Customer-Product Analysis With Tableau | Tableau Training For Beginners | Tab...
Customer-Product Analysis With Tableau | Tableau Training For Beginners | Tab...Edureka!
 
Hadoop vs Spark | Which One to Choose? | Hadoop Training | Spark Training | E...
Hadoop vs Spark | Which One to Choose? | Hadoop Training | Spark Training | E...Hadoop vs Spark | Which One to Choose? | Hadoop Training | Spark Training | E...
Hadoop vs Spark | Which One to Choose? | Hadoop Training | Spark Training | E...Edureka!
 
Modern REST APIs for Enterprise Databases - OData
Modern REST APIs for Enterprise Databases - ODataModern REST APIs for Enterprise Databases - OData
Modern REST APIs for Enterprise Databases - ODataNishanth Kadiyala
 
Apache hadoop: POSH Meetup Palo Alto, CA April 2014
Apache hadoop: POSH Meetup Palo Alto, CA April 2014Apache hadoop: POSH Meetup Palo Alto, CA April 2014
Apache hadoop: POSH Meetup Palo Alto, CA April 2014Kevin Crocker
 
2015 HortonWorks MDA Roadshow Presentation
2015 HortonWorks MDA Roadshow Presentation2015 HortonWorks MDA Roadshow Presentation
2015 HortonWorks MDA Roadshow PresentationFelix Liao
 
What Is SAS | SAS Tutorial For Beginners | SAS Training | SAS Programming | E...
What Is SAS | SAS Tutorial For Beginners | SAS Training | SAS Programming | E...What Is SAS | SAS Tutorial For Beginners | SAS Training | SAS Programming | E...
What Is SAS | SAS Tutorial For Beginners | SAS Training | SAS Programming | E...Edureka!
 
Big data Hadoop presentation
Big data  Hadoop  presentation Big data  Hadoop  presentation
Big data Hadoop presentation Shivanee garg
 
Introduction to Big Data and Hadoop
Introduction to Big Data and HadoopIntroduction to Big Data and Hadoop
Introduction to Big Data and HadoopSSandip Patil
 
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...Simplilearn
 
Hadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop SummitHadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop SummitDataWorks Summit
 

Similar to Hadoop Tutorial | Big Data Hadoop Tutorial For Beginners | Hadoop Certification Training | Edureka (20)

Big Data Use Cases | Hadoop Tutorial for Beginners | Hadoop Training | Edureka
Big Data Use Cases | Hadoop Tutorial for Beginners | Hadoop Training | EdurekaBig Data Use Cases | Hadoop Tutorial for Beginners | Hadoop Training | Edureka
Big Data Use Cases | Hadoop Tutorial for Beginners | Hadoop Training | Edureka
 
What is HDFS | Hadoop Distributed File System | Edureka
What is HDFS | Hadoop Distributed File System | EdurekaWhat is HDFS | Hadoop Distributed File System | Edureka
What is HDFS | Hadoop Distributed File System | Edureka
 
Processing Life Science Data at Scale - using Semantic Web Technologies
Processing Life Science Data at Scale - using Semantic Web TechnologiesProcessing Life Science Data at Scale - using Semantic Web Technologies
Processing Life Science Data at Scale - using Semantic Web Technologies
 
Talend Big Data Tutorial | Talend DI and Big Data Certification | Talend Onli...
Talend Big Data Tutorial | Talend DI and Big Data Certification | Talend Onli...Talend Big Data Tutorial | Talend DI and Big Data Certification | Talend Onli...
Talend Big Data Tutorial | Talend DI and Big Data Certification | Talend Onli...
 
STG314-Case Study Learn How HERE Uses JFrog Artifactory w Amazon EFS Support ...
STG314-Case Study Learn How HERE Uses JFrog Artifactory w Amazon EFS Support ...STG314-Case Study Learn How HERE Uses JFrog Artifactory w Amazon EFS Support ...
STG314-Case Study Learn How HERE Uses JFrog Artifactory w Amazon EFS Support ...
 
Big Data
Big DataBig Data
Big Data
 
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
 
Double Your Hadoop Hardware Performance with SmartSense
Double Your Hadoop Hardware Performance with SmartSenseDouble Your Hadoop Hardware Performance with SmartSense
Double Your Hadoop Hardware Performance with SmartSense
 
Big Data Engineer Skills and Job Description | Edureka
Big Data Engineer Skills and Job Description | EdurekaBig Data Engineer Skills and Job Description | Edureka
Big Data Engineer Skills and Job Description | Edureka
 
Customer-Product Analysis With Tableau | Tableau Training For Beginners | Tab...
Customer-Product Analysis With Tableau | Tableau Training For Beginners | Tab...Customer-Product Analysis With Tableau | Tableau Training For Beginners | Tab...
Customer-Product Analysis With Tableau | Tableau Training For Beginners | Tab...
 
Hadoop vs Spark | Which One to Choose? | Hadoop Training | Spark Training | E...
Hadoop vs Spark | Which One to Choose? | Hadoop Training | Spark Training | E...Hadoop vs Spark | Which One to Choose? | Hadoop Training | Spark Training | E...
Hadoop vs Spark | Which One to Choose? | Hadoop Training | Spark Training | E...
 
Modern REST APIs for Enterprise Databases - OData
Modern REST APIs for Enterprise Databases - ODataModern REST APIs for Enterprise Databases - OData
Modern REST APIs for Enterprise Databases - OData
 
Big data Presentation
Big data PresentationBig data Presentation
Big data Presentation
 
Apache hadoop: POSH Meetup Palo Alto, CA April 2014
Apache hadoop: POSH Meetup Palo Alto, CA April 2014Apache hadoop: POSH Meetup Palo Alto, CA April 2014
Apache hadoop: POSH Meetup Palo Alto, CA April 2014
 
2015 HortonWorks MDA Roadshow Presentation
2015 HortonWorks MDA Roadshow Presentation2015 HortonWorks MDA Roadshow Presentation
2015 HortonWorks MDA Roadshow Presentation
 
What Is SAS | SAS Tutorial For Beginners | SAS Training | SAS Programming | E...
What Is SAS | SAS Tutorial For Beginners | SAS Training | SAS Programming | E...What Is SAS | SAS Tutorial For Beginners | SAS Training | SAS Programming | E...
What Is SAS | SAS Tutorial For Beginners | SAS Training | SAS Programming | E...
 
Big data Hadoop presentation
Big data  Hadoop  presentation Big data  Hadoop  presentation
Big data Hadoop presentation
 
Introduction to Big Data and Hadoop
Introduction to Big Data and HadoopIntroduction to Big Data and Hadoop
Introduction to Big Data and Hadoop
 
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
 
Hadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop SummitHadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop Summit
 

More from Edureka!

What to learn during the 21 days Lockdown | Edureka
What to learn during the 21 days Lockdown | EdurekaWhat to learn during the 21 days Lockdown | Edureka
What to learn during the 21 days Lockdown | EdurekaEdureka!
 
Top 10 Dying Programming Languages in 2020 | Edureka
Top 10 Dying Programming Languages in 2020 | EdurekaTop 10 Dying Programming Languages in 2020 | Edureka
Top 10 Dying Programming Languages in 2020 | EdurekaEdureka!
 
Top 5 Trending Business Intelligence Tools | Edureka
Top 5 Trending Business Intelligence Tools | EdurekaTop 5 Trending Business Intelligence Tools | Edureka
Top 5 Trending Business Intelligence Tools | EdurekaEdureka!
 
Tableau Tutorial for Data Science | Edureka
Tableau Tutorial for Data Science | EdurekaTableau Tutorial for Data Science | Edureka
Tableau Tutorial for Data Science | EdurekaEdureka!
 
Python Programming Tutorial | Edureka
Python Programming Tutorial | EdurekaPython Programming Tutorial | Edureka
Python Programming Tutorial | EdurekaEdureka!
 
Top 5 PMP Certifications | Edureka
Top 5 PMP Certifications | EdurekaTop 5 PMP Certifications | Edureka
Top 5 PMP Certifications | EdurekaEdureka!
 
Top Maven Interview Questions in 2020 | Edureka
Top Maven Interview Questions in 2020 | EdurekaTop Maven Interview Questions in 2020 | Edureka
Top Maven Interview Questions in 2020 | EdurekaEdureka!
 
Linux Mint Tutorial | Edureka
Linux Mint Tutorial | EdurekaLinux Mint Tutorial | Edureka
Linux Mint Tutorial | EdurekaEdureka!
 
How to Deploy Java Web App in AWS| Edureka
How to Deploy Java Web App in AWS| EdurekaHow to Deploy Java Web App in AWS| Edureka
How to Deploy Java Web App in AWS| EdurekaEdureka!
 
Importance of Digital Marketing | Edureka
Importance of Digital Marketing | EdurekaImportance of Digital Marketing | Edureka
Importance of Digital Marketing | EdurekaEdureka!
 
RPA in 2020 | Edureka
RPA in 2020 | EdurekaRPA in 2020 | Edureka
RPA in 2020 | EdurekaEdureka!
 
Email Notifications in Jenkins | Edureka
Email Notifications in Jenkins | EdurekaEmail Notifications in Jenkins | Edureka
Email Notifications in Jenkins | EdurekaEdureka!
 
EA Algorithm in Machine Learning | Edureka
EA Algorithm in Machine Learning | EdurekaEA Algorithm in Machine Learning | Edureka
EA Algorithm in Machine Learning | EdurekaEdureka!
 
Cognitive AI Tutorial | Edureka
Cognitive AI Tutorial | EdurekaCognitive AI Tutorial | Edureka
Cognitive AI Tutorial | EdurekaEdureka!
 
AWS Cloud Practitioner Tutorial | Edureka
AWS Cloud Practitioner Tutorial | EdurekaAWS Cloud Practitioner Tutorial | Edureka
AWS Cloud Practitioner Tutorial | EdurekaEdureka!
 
Blue Prism Top Interview Questions | Edureka
Blue Prism Top Interview Questions | EdurekaBlue Prism Top Interview Questions | Edureka
Blue Prism Top Interview Questions | EdurekaEdureka!
 
Big Data on AWS Tutorial | Edureka
Big Data on AWS Tutorial | Edureka Big Data on AWS Tutorial | Edureka
Big Data on AWS Tutorial | Edureka Edureka!
 
A star algorithm | A* Algorithm in Artificial Intelligence | Edureka
A star algorithm | A* Algorithm in Artificial Intelligence | EdurekaA star algorithm | A* Algorithm in Artificial Intelligence | Edureka
A star algorithm | A* Algorithm in Artificial Intelligence | EdurekaEdureka!
 
Kubernetes Installation on Ubuntu | Edureka
Kubernetes Installation on Ubuntu | EdurekaKubernetes Installation on Ubuntu | Edureka
Kubernetes Installation on Ubuntu | EdurekaEdureka!
 
Introduction to DevOps | Edureka
Introduction to DevOps | EdurekaIntroduction to DevOps | Edureka
Introduction to DevOps | EdurekaEdureka!
 

More from Edureka! (20)

What to learn during the 21 days Lockdown | Edureka
What to learn during the 21 days Lockdown | EdurekaWhat to learn during the 21 days Lockdown | Edureka
What to learn during the 21 days Lockdown | Edureka
 
Top 10 Dying Programming Languages in 2020 | Edureka
Top 10 Dying Programming Languages in 2020 | EdurekaTop 10 Dying Programming Languages in 2020 | Edureka
Top 10 Dying Programming Languages in 2020 | Edureka
 
Top 5 Trending Business Intelligence Tools | Edureka
Top 5 Trending Business Intelligence Tools | EdurekaTop 5 Trending Business Intelligence Tools | Edureka
Top 5 Trending Business Intelligence Tools | Edureka
 
Tableau Tutorial for Data Science | Edureka
Tableau Tutorial for Data Science | EdurekaTableau Tutorial for Data Science | Edureka
Tableau Tutorial for Data Science | Edureka
 
Python Programming Tutorial | Edureka
Python Programming Tutorial | EdurekaPython Programming Tutorial | Edureka
Python Programming Tutorial | Edureka
 
Top 5 PMP Certifications | Edureka
Top 5 PMP Certifications | EdurekaTop 5 PMP Certifications | Edureka
Top 5 PMP Certifications | Edureka
 
Top Maven Interview Questions in 2020 | Edureka
Top Maven Interview Questions in 2020 | EdurekaTop Maven Interview Questions in 2020 | Edureka
Top Maven Interview Questions in 2020 | Edureka
 
Linux Mint Tutorial | Edureka
Linux Mint Tutorial | EdurekaLinux Mint Tutorial | Edureka
Linux Mint Tutorial | Edureka
 
How to Deploy Java Web App in AWS| Edureka
How to Deploy Java Web App in AWS| EdurekaHow to Deploy Java Web App in AWS| Edureka
How to Deploy Java Web App in AWS| Edureka
 
Importance of Digital Marketing | Edureka
Importance of Digital Marketing | EdurekaImportance of Digital Marketing | Edureka
Importance of Digital Marketing | Edureka
 
RPA in 2020 | Edureka
RPA in 2020 | EdurekaRPA in 2020 | Edureka
RPA in 2020 | Edureka
 
Email Notifications in Jenkins | Edureka
Email Notifications in Jenkins | EdurekaEmail Notifications in Jenkins | Edureka
Email Notifications in Jenkins | Edureka
 
EA Algorithm in Machine Learning | Edureka
EA Algorithm in Machine Learning | EdurekaEA Algorithm in Machine Learning | Edureka
EA Algorithm in Machine Learning | Edureka
 
Cognitive AI Tutorial | Edureka
Cognitive AI Tutorial | EdurekaCognitive AI Tutorial | Edureka
Cognitive AI Tutorial | Edureka
 
AWS Cloud Practitioner Tutorial | Edureka
AWS Cloud Practitioner Tutorial | EdurekaAWS Cloud Practitioner Tutorial | Edureka
AWS Cloud Practitioner Tutorial | Edureka
 
Blue Prism Top Interview Questions | Edureka
Blue Prism Top Interview Questions | EdurekaBlue Prism Top Interview Questions | Edureka
Blue Prism Top Interview Questions | Edureka
 
Big Data on AWS Tutorial | Edureka
Big Data on AWS Tutorial | Edureka Big Data on AWS Tutorial | Edureka
Big Data on AWS Tutorial | Edureka
 
A star algorithm | A* Algorithm in Artificial Intelligence | Edureka
A star algorithm | A* Algorithm in Artificial Intelligence | EdurekaA star algorithm | A* Algorithm in Artificial Intelligence | Edureka
A star algorithm | A* Algorithm in Artificial Intelligence | Edureka
 
Kubernetes Installation on Ubuntu | Edureka
Kubernetes Installation on Ubuntu | EdurekaKubernetes Installation on Ubuntu | Edureka
Kubernetes Installation on Ubuntu | Edureka
 
Introduction to DevOps | Edureka
Introduction to DevOps | EdurekaIntroduction to DevOps | Edureka
Introduction to DevOps | Edureka
 

Recently uploaded

Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Bhuvaneswari Subramani
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontologyjohnbeverley2021
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Zilliz
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 

Recently uploaded (20)

Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 

Hadoop Tutorial | Big Data Hadoop Tutorial For Beginners | Hadoop Certification Training | Edureka

  • 2. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Big Data Use Cases
  • 3. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Big Data Use-Cases US Primary Election Analysis Market Analysis for US Cab Startup 1 2
  • 4. Copyright © 2017, edureka and/or its affiliates. All rights reserved. US Primary Election Analysis 1 2
  • 5. Copyright © 2017, edureka and/or its affiliates. All rights reserved. US Election STEP 1: Primary & Caucuses
  • 6. Copyright © 2017, edureka and/or its affiliates. All rights reserved. US Election STEP 1: Primary & Caucuses STEP 2: National Conventions
  • 7. Copyright © 2017, edureka and/or its affiliates. All rights reserved. US Election STEP 1: Primary & Caucuses STEP 2: National Conventions STEP 3: General Elections
  • 8. Copyright © 2017, edureka and/or its affiliates. All rights reserved. US Election STEP 1: Primary & Caucuses STEP 2: National Conventions STEP 4: Electoral CollegeSTEP 3: General Elections
  • 9. Copyright © 2017, edureka and/or its affiliates. All rights reserved. US Primary Election PROBLEM STATEMENT: In the US Primary Election 2016, Hillary Clinton was nominated over Bernie Sanders from Democrats and on the other hand, Donald Trump was nominated from Republican Party to contest for the presidential position. As an analyst, you have been tasked to understand different factors that led to the winning of Hillary Clinton and Donald Trump in the primary elections based on demographic features to plan their next initiatives and campaigns. Republican Democrat
  • 10. Copyright © 2017, edureka and/or its affiliates. All rights reserved. US Primary Election Dataset US Primary Election Data Set US Demographic Features (County-wise) Data Set Now as a data analyst you have 2 datasets available :
  • 11. Copyright © 2017, edureka and/or its affiliates. All rights reserved. US Primary Election Dataset state: List of US states state_abbreviation: Abbreviation of each US state county: List of counties in each US states fips: FIPS county code is a Federal Information Processing Standards (FIPS) code which uniquely identifies counties party: Different parties in US (i.e. Republican & Democrat) candidate: candidates in US primary election from different parties votes: number of votes gained by a candidate fraction_votes: total number of votes gained by a candidate/ total votes gained by the party
  • 12. Copyright © 2017, edureka and/or its affiliates. All rights reserved. US County Demographic Features Dataset DETAILS: Population, 2014 estimate Population, 2010 (April 1) estimates base Population, percent change - April 1, 2010 to July 1, 2014 Population, 2010 Persons under 5 years, percent, 2014 Persons under 18 years, percent, 2014 Persons 65 years and over, percent, 2014 Female persons, percent, 2014 White alone, percent, 2014 …
  • 13. Copyright © 2017, edureka and/or its affiliates. All rights reserved. US Election Solution Strategy US Primary Election Dataset 1
  • 14. Copyright © 2017, edureka and/or its affiliates. All rights reserved. US Election Solution Strategy Storing Data in HDFS 2
  • 15. Copyright © 2017, edureka and/or its affiliates. All rights reserved. US Election Solution Strategy Processing Data Using Spark Components 3
  • 16. Copyright © 2017, edureka and/or its affiliates. All rights reserved. US Election Solution Strategy Transforming Data Using Spark SQL 4
  • 17. Copyright © 2017, edureka and/or its affiliates. All rights reserved. US Election Solution Strategy Clustering Data Using Spark MLlib (K-Means) 5
  • 18. Copyright © 2017, edureka and/or its affiliates. All rights reserved. US Election Solution Strategy Visualizing the Result Using Zeppelin 6
  • 19. Copyright © 2017, edureka and/or its affiliates. All rights reserved. US Election Solution Strategy US Primary Election Dataset Storing Data in HDFS Processing Data Using Spark Components Transforming Data Using Spark SQL Clustering Data Using Spark MLlib (K-Means) Visualizing the Result Using Zeppelin 1 2 3 456
  • 20. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Visualization of Result 1 2
  • 21. Copyright © 2017, edureka and/or its affiliates. All rights reserved. 2 Market Analysis for US Cab Start-Ups 1
  • 22. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Market Analysis for US Cab Start-Ups PROBLEM STATEMENT: A US cab service start-up wants to meet the demands in an optimum manner and maximize the profit. Thus, they hired you as a data analyst to interpret the available Uber’s data set and find out the beehive customer pick-up points & peak hours for meeting the demand in a profitable manner.
  • 23. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Uber Dataset • Date/Time – Pickup Date & Time • Lat – Latitude of Pickup • Lon – Longitude of Pickup • Base – TLC Base Code
  • 24. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Market Analysis for US Cab Start-Ups Solution Strategy Uber Pick-Up Locations Dataset Predictions 1 Lat Lon
  • 25. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Market Analysis for US Cab Start-Ups Solution Strategy Predictions Storing Data in HDFS 2 Lat Lon
  • 26. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Market Analysis for US Cab Start-Ups Solution Strategy Predictions Transforming Dataset 3 Lat Lon
  • 27. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Market Analysis for US Cab Start-Ups Solution Strategy Predictions K-Means Clustering On Latitude & Longitude 4 Lat Lon
  • 28. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Market Analysis for US Cab Start-Ups Solution Strategy Uber Pick-Up Locations Dataset Predictions Lat Lon Transforming DataSet K-Means Clustering On Latitude & Longitude 1 3 4 Storing Data in HDFS 2
  • 29. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Let Us Know What It Takes…
  • 30. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Fundamentals Road Map Introduction to Hadoop & Spark HDFS (Hadoop Storage) YARN (Hadoop Processing) Apache Spark K-Means & Zeppelin Solution of Use-Cases
  • 31. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Introduction to Hadoop & Spark
  • 32. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Hadoop Spark Introduction to Hadoop & Spark Hadoop is a framework that allows you to store and process large data sets in parallel and distributed fashion. Apache Spark is an open-source cluster-computing framework for real time processing ❖ Provides an interface for programming entire clusters with implicit data parallelism and fault-tolerance ❖ Built on top of YARN and it extends the YARN model to efficiently use more types of computations ❖ Hadoop has two core components: ▪ HDFS: Allows to dump any kind of data across the cluster ▪ YARN: Allows parallel processing of the data stored in HDFS
  • 33. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Spark Complementing Hadoop
  • 34. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Spark processes data 100 times faster than MapReduce Spark & Hadoop 1 Spark Applications can run on YARN leveraging Hadoop cluster2 Apache Spark can use HDFS as its storage 3 Faster Analytic Cost Optimization Avoid Duplication Challenges Addressed : Combining Spark’s ability, i.e. high processing speed, advance analytics and multiple integration support with Hadoop’s low cost operation on commodity hardware gives the best results
  • 35. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Big Data Use-Cases
  • 36. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Big Data Use-Cases
  • 37. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Big Data Use-Cases
  • 38. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Kafka Big Data Use-Cases Solution Architecture Apache HIVE MapReduce Apache Spark SolutionOptions Storing Big Data on HDFS Processing through YARN framework Tools used for processing
  • 39. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Introduction to Hadoop & Spark HDFS (Hadoop Storage)
  • 40. Copyright © 2017, edureka and/or its affiliates. All rights reserved. HDFS ❖ HDFS stands for Hadoop Distributed File System ❖ HDFS is the storage unit of Hadoop HDFS creates an abstraction layer over the distributed storage resources, from where we can see the whole HDFS as a single unit. DataNodeDataNode DataNode NameNode Secondary NameNode
  • 41. Copyright © 2017, edureka and/or its affiliates. All rights reserved. NameNode NameNode Secondary NameNode NameNode • Master daemon • Maintains and Manages DataNodes • Records metadata e.g. location of blocks stored, the size of the files, permissions, hierarchy, etc. • Receives heartbeat and block report from all the DataNodes
  • 42. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Secondary NameNode NameNode Secondary NameNode Secondary NameNode • Checkpointing is a process of combining edit logs with FsImage • Allows faster Failover as we have a back up of the metadata • Checkpointing happens periodically (default: 1 hour)
  • 43. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Secondary NameNode NameNode Secondary NameNode
  • 44. Copyright © 2017, edureka and/or its affiliates. All rights reserved. DataNode DataNode • Slave daemons • Stores actual data • Serves read and write requests DataNodeDataNode DataNode
  • 45. Copyright © 2017, edureka and/or its affiliates. All rights reserved. HDFS Architecture in Detail NameNode Metadata (Name, replicas, …): /hdfs/foo/data, 3, … Client Client Replication Block ops DataNodesDataNodes Read Write Metadata ops Rack 2Rack 1
  • 46. Copyright © 2017, edureka and/or its affiliates. All rights reserved. HDFS Block & Replication
  • 47. Copyright © 2017, edureka and/or its affiliates. All rights reserved. HDFS Data Block 380 MB How many blocks will be created if a file of size 500 MB is copied to HDFS? • Each file is stored on HDFS as block • The default size of each block is 128 MB • Let us say, I have a file example.txt of size 380 MB: 128 MB 128 MB 124 MB Block 1 Block 2 Block 3
  • 48. Copyright © 2017, edureka and/or its affiliates. All rights reserved. HDFS Data Block • Each file is stored on HDFS as block • The default size of each block is 128 MB • Let us say, I have a file example.txt of size 500 MB: How many blocks will be created if a file of size 500 MB is copied to HDFS? 128 MB 128 MB 128 MB 500 MB Block 1 Block 2 Block 3 116 MB Block 4 380 MB 128 MB 128 MB 124 MB Block 1 Block 2 Block 3
  • 49. Copyright © 2017, edureka and/or its affiliates. All rights reserved. HDFS Block Replication NameNode DataNodeDataNode DataNode Secondary NameNode 128 MB 120 MB 248 MB Block 1 Block 2 Each data blocks are replicated (thrice by default) and are distributed across different DataNodes Replication Factor = 3
  • 50. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Rack Awareness • Rack Awareness Algorithm reduces latency as well as provide fault tolerance by replicating data block • Rack Awareness Algorithm says that the first replica of a block will be stored on a local rack & the next two replicas will be stored on a different (remote) rack
  • 51. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Rack Awareness Rack - 1 Rack - 2 Rack - 3
  • 52. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Rack Awareness Rack - 1 Rack - 2 Rack - 3
  • 53. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Rack Awareness Rack - 1 Rack - 2 Rack - 3
  • 54. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Rack Awareness Rack - 1 Rack - 2 Rack - 3
  • 55. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Rack Awareness Rack - 1 Rack - 2 Rack - 3
  • 56. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Rack Awareness Rack - 1 Rack - 2 Rack - 3
  • 57. Copyright © 2017, edureka and/or its affiliates. All rights reserved. HDFS Fault Tolerance If a DataNode fails, the data blocks can be recovered and retrieved from the replicas stored on another DataNodes. Replication Factor = 3 NameNode DataNodeDataNode DataNode Secondary NameNode
  • 58. Copyright © 2017, edureka and/or its affiliates. All rights reserved. HDFS Fault Tolerance If a DataNode fails, the data blocks can be recovered and retrieved from the replicas stored on another DataNodes. Replication Factor = 3 NameNode DataNodeDataNode DataNode Secondary NameNode
  • 59. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Start Hadoop Daemons ./sbin/start-all.sh 1 ./sbin/stop-all.sh 2 jps 3 Starts all the Hadoop daemons(HDFS & YARN) Stops all the Hadoop daemons Checks all the daemons running on you machines
  • 60. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Writing & Deleting a File in Hadoop hdfs fs –put /test.txt / 1 hdfs dfs –ls / 2 hdfs fs –rm /test.txt / 3 Coping a file from local file system to HDFS Lists all the HDFS files/directories Deleting the file from HDFS
  • 61. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Introduction to Hadoop & Spark HDFS (Hadoop Storage) YARN (Hadoop Processing)
  • 62. Copyright © 2017, edureka and/or its affiliates. All rights reserved. What is YARN ? • Hadoop 2.0 came up with new framework YARN ( Yet Another Resource Negotiator ), which provides ability to run Non- MapReduce application. • It provides a paradigm for parallel processing over Hadoop. • YARN framework is responsible for integration of different tools with Hadoop like Spark, Hive, Pig.
  • 63. Copyright © 2017, edureka and/or its affiliates. All rights reserved. ResourceManager ResourceManager ResourceManager • Receives the processing requests • Passes the requests to corresponding NodeManagers
  • 64. Copyright © 2017, edureka and/or its affiliates. All rights reserved. NodeManager NodeManagerNodeManager NodeManager NodeManager • Installed on every DataNode • Responsible for execution of task on every single DataNode
  • 65. Copyright © 2017, edureka and/or its affiliates. All rights reserved. YARN Architecture in Detail
  • 66. Copyright © 2017, edureka and/or its affiliates. All rights reserved. YARN Workflow
  • 67. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Application Submission in YARN MR Code MR Job Client Mode Resource Manager RM Node NodeManager AppMaster JVM NodeManager MR Task YARN child task JVM 1. Run Job
  • 68. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Application Submission in YARN MR Code MR Job Client Mode Resource Manager RM Node NodeManager AppMaster JVM NodeManager MR Task 2. Submit Job YARN child task JVM
  • 69. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Application Submission in YARN MR Code MR Job Client Mode Resource Manager RM Node NodeManager AppMaster JVM NodeManager MR Task 3. Get application ID YARN child task JVM
  • 70. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Application Submission in YARN MR Code MR Job Client Mode Resource Manager RM Node NodeManager AppMaster JVM NodeManager MR Task 4.1. Start Container 4.2. Launch AppMaster YARN child task JVM
  • 71. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Application Submission in YARN MR Code MR Job Client Mode Resource Manager RM Node NodeManager AppMaster JVM NodeManager MR Task YARN child task JVM5. Allocate Resources
  • 72. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Application Submission in YARN MR Code MR Job Client Mode Resource Manager RM Node NodeManager AppMaster JVM NodeManager MR Task 6.1. Start Container 6.2. Launch YARN child task JVM
  • 73. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Application Submission in YARN MR Code MR Job Client Mode Resource Manager RM Node NodeManager AppMaster JVM NodeManager MR Task YARN child task JVM 7. Execute
  • 74. Copyright © 2017, edureka and/or its affiliates. All rights reserved. 1. Client submits an application 2. RM allocates a container to start AM 3. AM registers with RM 4. AM asks containers from RM 5. AM notifies NM to launch containers 6. Application code is executed in container 7. Client contacts RM/AM to monitor application’s status 8. AM unregisters with RM 6 3 2 4 5 4 5 1 RM NM AMClient 4 5 7 8 YARN Application Workflow
  • 75. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Hadoop Cluster Architecture = HDFS + YARN
  • 76. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Hadoop Cluster Architecture MasterSlave HDFS YARN
  • 77. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Hadoop Cluster Hardware Specification
  • 78. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Hadoop Cluster Hardware Specification ▪ RAM: 64 GB, ▪ Hard disk: 1 TB ▪ Processor: Xenon with 8 Cores ▪ Ethernet: 3 x 10 GB/s ▪ OS: 64-bit CentOS / Linux ▪ Power: Redundant Power Supply ▪ RAM: 32 GB ▪ Hard disk: 1 TB ▪ Processor: Xenon with 4 Cores ▪ Ethernet: 3 x 10 GB/s ▪ OS: 64-bit CentOS / Linux ▪ Power: Redundant Power Supply ▪ RAM: 16GB ▪ Hard disk: 6 x 2TB ▪ Processor: Xenon with 2 cores ▪ Ethernet: 3 x 10 GB/s ▪ OS: 64-bit CentOS / Linux NameNode Secondary NameNode DataNode
  • 79. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Real Time Hadoop Cluster Deployment
  • 80. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Hadoop Cluster : Facebook Use Case 21 PB of storage in a single HDFS cluster 2000 Machines Per Cluster 12 TB of Data Per Machine 1200 machines with 8 cores each + 800 machines with 16 cores each 32 GB of RAM per machine 15 map- reduce tasks per machine That's a total of more than 21 PB of configured storage capacity, this is larger than the previously known Yahoo!'s cluster of 14 PB
  • 81. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Hadoop Cluster : Spotify Use Case Use Hadoop for generating music recommendations 1650 node cluster ~ 65 PB storage 70 TB RAM +25,000 daily Hadoop jobs 43,000 virtualised cores
  • 82. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Introduction to Hadoop & Spark HDFS (Hadoop Storage) YARN (Hadoop Processing) Apache Spark
  • 83. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Spark Core Components
  • 84. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Spark Core Spark Core is the base engine for large-scale parallel and distributed data processing It is responsible for: ▪ Memory management and fault recovery ▪ Scheduling, distributing and monitoring jobs on a cluster ▪ Interacting with storage systems
  • 85. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Spark Architecture
  • 86. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Spark SQL • Spark SQL integrates relational processing with Spark’s functional programming • Provides support for various data sources and makes it possible to weave SQL queries with code transformations CSV JSON JDBC Data Source API Data Frame API DataFrame DSL Spark SQL and HQL
  • 87. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Start Spark Daemons ./sbin/start-all.sh 1 jps 2 ./bin/spark-shell 3 Starts all the Spark daemons(Master & Worker) Checks all the daemons running on you machines Starts the Spark Shell
  • 88. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Introduction to Hadoop & Spark HDFS (Hadoop Storage) YARN (Hadoop Processing) Apache Spark K-Means & Zeppelin
  • 89. Copyright © 2017, edureka and/or its affiliates. All rights reserved. K-Means Clustering The process by which objects are classified into a predefined number of groups so that they are as much dissimilar as possible from one group to another group, but as much similar as possible within each group ▪ The objects in group 1 should be as similar as possible ▪ But there should be much difference between an object in group 1 and group 2 ▪ The attributes of the objects are allowed to determine which objects should be grouped together Total population Group 1 Group 2 Group 3 Group 4
  • 90. Copyright © 2017, edureka and/or its affiliates. All rights reserved. K-Means Clustering ▪ Consider a comparison on Income & Balance: CurrentBalance High High Medium Medium Low Low Gross Monthly Income Example Cluster 1 High Balance Low Income Example Cluster 2 High Income Low Balance The objects in Cluster 1 have similar characteristics (High Income and Low balance) Also the objects in Cluster 2 have the same characteristic (High Balance and Low Income)
  • 91. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Example ▪ The plot of students in an area is as given below I need to find specific locations to build schools in this area so that the students doesn’t have to travel much
  • 92. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Example ▪ Using k-means clustering we got output as:
  • 93. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Apache Zeppelin
  • 94. Copyright © 2017, edureka and/or its affiliates. All rights reserved. What is Zeppelin? ▪ A completely open web-based notebook that enables interactive data analytics ▪ Web-based notebook which brings data ingestion, data exploration, visualization, sharing and collaboration features to Hadoop & Spark ▪ The various languages are supported via Zeppelin language interpreters
  • 95. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Introduction to Hadoop & Spark HDFS (Hadoop Storage) YARN (Hadoop Processing) Apache Spark K-Means & Zeppelin Solution of Use-case 1
  • 96. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Analyzing Data using SparkSQL Storing Data in HDFS Visualizing the data US County Solution HDFS Zeppelin
  • 97. Copyright © 2017, edureka and/or its affiliates. All rights reserved. US Election Solution Strategy US Primary Election Dataset Storing Data in HDFS Processing Data Using Spark Components Transforming Data Using Scala & Spark SQL Clustering Data Using Spark MLlib (K-Means) Visualizing the Result Using Zeppelin 1 2 3 456
  • 98. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Introduction to Hadoop & Spark HDFS (Hadoop Storage) YARN (Hadoop Processing) Apache Spark K-Means & Zeppelin Solution of Use-case 12
  • 99. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Market Analysis for US Cab Start-Ups Solution Strategy Uber Pick-Up Locations Dataset Predictions Lat Lon Transforming DataSet K-Means Clustering On Latitude & Longitude 1 3 4 Storing Data in HDFS 2
  • 100. Copyright © 2017, edureka and/or its affiliates. All rights reserved. Edureka LMS
  • 101. Copyright © 2017, edureka and/or its affiliates. All rights reserved. LMS: Getting Started
  • 102. Copyright © 2017, edureka and/or its affiliates. All rights reserved. LMS: Pre-Recorded Session
  • 103. Copyright © 2017, edureka and/or its affiliates. All rights reserved. LMS: Course Content
  • 104. Copyright © 2017, edureka and/or its affiliates. All rights reserved. LMS: Projects
  • 105. Copyright © 2017, edureka and/or its affiliates. All rights reserved.
  • 106. Copyright © 2017, edureka and/or its affiliates. All rights reserved.