Hadoop Tutorial
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Big Data Use Cases
1. US Primary Election Analysis
2. Market Analysis for US Cab Startup

US Primary Election Analysis

US Election
STEP 1: Primary & Caucuses
STEP 2: National Conventions
STEP 3: General Elections
STEP 4: Electoral College

US Primary Election
PROBLEM STATEMENT:
In the 2016 US primary elections, Hillary Clinton was nominated over Bernie Sanders by the Democrats, and Donald Trump was nominated by the Republican Party, to contest the presidency.
As an analyst, you have been tasked with understanding the different factors, based on demographic features, that led Hillary Clinton and Donald Trump to win the primary elections, in order to plan their next initiatives and campaigns.

US Primary Election Dataset
As a data analyst, you have two datasets available:
1. US Primary Election data set
2. US Demographic Features (county-wise) data set

US Primary Election Dataset
state: List of US states
state_abbreviation: Abbreviation of each US state
county: List of counties in each US state
fips: Federal Information Processing Standards (FIPS) county code, which uniquely identifies each county
party: Parties in the US primaries (i.e. Republican & Democrat)
candidate: Candidates in the US primary election from the different parties
votes: Number of votes gained by a candidate
fraction_votes: Number of votes gained by a candidate / total votes gained by the party

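The fraction_votes column can be recomputed from the votes column. The following is a minimal sketch in plain Python, not the deck's Spark code. It assumes the denominator is the party's total within one county (the slide does not state the grouping level), and the sample rows and candidate names are invented for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows shaped like the primary-election dataset;
# the vote counts and candidate names are made up.
rows = [
    {"fips": "01001", "party": "Democrat", "candidate": "Candidate A", "votes": 300},
    {"fips": "01001", "party": "Democrat", "candidate": "Candidate B", "votes": 100},
    {"fips": "01001", "party": "Republican", "candidate": "Candidate C", "votes": 500},
]


def add_fraction_votes(rows):
    """Return copies of rows with fraction_votes = votes / party total in the county."""
    totals = defaultdict(int)
    for r in rows:
        totals[(r["fips"], r["party"])] += r["votes"]
    result = []
    for r in rows:
        total = totals[(r["fips"], r["party"])]
        # Guard against counties where a party received no votes at all.
        fraction = r["votes"] / total if total else 0.0
        result.append({**r, "fraction_votes": fraction})
    return result


result = add_fraction_votes(rows)
for r in result:
    print(r["candidate"], r["fraction_votes"])
```

Grouping by (fips, party) keeps each county's parties separate, so the fractions within one party in one county add up to 1.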
US County Demographic Features Dataset
DETAILS:
Population, 2014 estimate
Population, 2010 (April 1) estimates base
Population, percent change - April 1, 2010 to July 1, 2014
Population, 2010
Persons under 5 years, percent, 2014
Persons under 18 years, percent, 2014
Persons 65 years and over, percent, 2014
Female persons, percent, 2014
White alone, percent, 2014 …

US Election Solution Strategy
1. US Primary Election Dataset
2. Storing Data in HDFS
3. Processing Data Using Spark Components
4. Transforming Data Using Spark SQL
5. Clustering Data Using Spark MLlib (K-Means)
6. Visualizing the Result Using Zeppelin

Visualization of Result

Market Analysis for US Cab Start-Ups
PROBLEM STATEMENT:
A US cab service start-up wants to meet demand in an optimal manner and maximize its profit.
They have therefore hired you as a data analyst to interpret the available Uber data set and find the busiest ("beehive") customer pick-up points and the peak hours, so that demand can be met profitably.

Uber Dataset
• Date/Time – Pickup Date & Time
• Lat – Latitude of Pickup
• Lon – Longitude of Pickup
• Base – TLC Base Code
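Finding peak hours from the Date/Time column comes down to counting pick-ups per hour of day. The sketch below uses plain Python rather than Spark. The rows are invented sample data, and the "month/day/year hour:minute:second" timestamp format is an assumption, since the slide does not show the raw values.

```python
from collections import Counter
from datetime import datetime

# Hypothetical rows shaped like the Uber dataset: (Date/Time, Lat, Lon, Base).
rows = [
    ("4/1/2014 0:11:00", 40.7690, -73.9549, "B02512"),
    ("4/1/2014 17:05:00", 40.7267, -74.0345, "B02512"),
    ("4/1/2014 17:45:00", 40.7316, -73.9873, "B02512"),
    ("4/2/2014 17:10:00", 40.7588, -73.9776, "B02598"),
    ("4/2/2014 8:30:00", 40.7594, -73.9722, "B02598"),
]


def pickups_per_hour(rows, fmt="%m/%d/%Y %H:%M:%S"):
    """Count pick-ups for each hour of the day (0-23)."""
    counts = Counter()
    for date_time, _lat, _lon, _base in rows:
        counts[datetime.strptime(date_time, fmt).hour] += 1
    return counts


hourly = pickups_per_hour(rows)
peak_hour, peak_count = hourly.most_common(1)[0]
print(peak_hour, peak_count)
```

On a real data set the same count would normally be done per weekday as well, since weekday and weekend peaks may differ.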

Market Analysis for US Cab Start-Ups Solution Strategy
1. Uber Pick-Up Locations Dataset
2. Storing Data in HDFS
3. Transforming Dataset
4. K-Means Clustering On Latitude & Longitude
Predictions (Lat, Lon)

Let Us Know What It Takes…

Fundamentals Road Map
▪ Introduction to Hadoop & Spark
▪ HDFS (Hadoop Storage)
▪ YARN (Hadoop Processing)
▪ Apache Spark
▪ K-Means & Zeppelin
▪ Solution of Use-Cases

Introduction to Hadoop & Spark
Hadoop is a framework that allows you to store and process large data sets in a parallel and distributed fashion.
❖ Hadoop has two core components:
▪ HDFS: Allows you to store any kind of data across the cluster
▪ YARN: Allows parallel processing of the data stored in HDFS
Apache Spark is an open-source cluster-computing framework for real-time processing.
❖ Provides an interface for programming entire clusters with implicit data parallelism and fault tolerance
❖ Can run on top of YARN, and extends the MapReduce model to efficiently support more types of computations

Spark Complementing Hadoop

Spark & Hadoop
1. Spark can process data up to 100 times faster than MapReduce for in-memory workloads
2. Spark applications can run on YARN, leveraging the Hadoop cluster
3. Apache Spark can use HDFS as its storage
Challenges Addressed:
▪ Faster Analytics
▪ Cost Optimization
▪ Avoid Duplication
Combining Spark's abilities (high processing speed, advanced analytics and multiple integration support) with Hadoop's low-cost operation on commodity hardware gives the best results.

Big Data Use-Cases
Big Data Use-Cases Solution Architecture
▪ Storing Big Data on HDFS
▪ Processing through YARN framework
▪ Tools used for processing (solution options): Kafka, Apache Hive, MapReduce, Apache Spark

HDFS
❖ HDFS stands for Hadoop Distributed File System
❖ HDFS is the storage unit of Hadoop
HDFS creates an abstraction layer over the distributed storage resources, so that the whole of HDFS can be seen as a single unit.
[Diagram: NameNode, Secondary NameNode and three DataNodes]

NameNode
• Master daemon
• Maintains and manages the DataNodes
• Records metadata, e.g. location of stored blocks, file sizes, permissions, hierarchy, etc.
• Receives heartbeats and block reports from all the DataNodes

Secondary NameNode
• Checkpointing is the process of combining the edit logs with the FsImage
• Allows faster failover, as we have a backup of the metadata
• Checkpointing happens periodically (default: 1 hour)

DataNode
• Slave daemons
• Store the actual data
• Serve read and write requests

HDFS Architecture in Detail
[Diagram: clients send metadata ops to the NameNode, which holds the metadata (name, replicas, …), e.g. /hdfs/foo/data, 3, …; clients read from and write to DataNodes on Rack 1 and Rack 2; the NameNode sends block ops to the DataNodes, and blocks are replicated between DataNodes]

HDFS Block & Replication

HDFS Data Block
• Each file is stored on HDFS as blocks
• The default size of each block is 128 MB
• Let us say I have a file example.txt of size 380 MB:
  Block 1: 128 MB, Block 2: 128 MB, Block 3: 124 MB
How many blocks will be created if a file of size 500 MB is copied to HDFS?
  Block 1: 128 MB, Block 2: 128 MB, Block 3: 128 MB, Block 4: 116 MB

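The block arithmetic above can be checked with a few lines of Python. This is only an illustrative calculation with whole-megabyte sizes; the function name is made up, and it does not call any HDFS API.

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the sizes (in MB) of the HDFS blocks a file would occupy."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        # The last block only holds the leftover data, not a full 128 MB.
        sizes.append(remainder)
    return sizes


print(split_into_blocks(380))  # [128, 128, 124]
print(split_into_blocks(500))  # [128, 128, 128, 116]
```

Note that the last block takes up only as much space as the data it holds, which is why the 380 MB file ends with a 124 MB block.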
HDFS Block Replication
Each data block is replicated (three times by default), and the replicas are distributed across different DataNodes.
Replication Factor = 3
Example: a 248 MB file is split into Block 1 (128 MB) and Block 2 (120 MB)
[Diagram: NameNode, Secondary NameNode and three DataNodes holding the block replicas]

Rack Awareness
• The Rack Awareness Algorithm reduces latency and provides fault tolerance by replicating data blocks across racks
• The algorithm says that the first replica of a block will be stored on the local rack, and the next two replicas will be stored on a different (remote) rack
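The placement rule can be sketched as a small function. This is a simplified illustration of the rule as stated on the slide, not the HDFS implementation: it always picks the first candidate nodes, whereas a real cluster would also consider load and choose among candidates. The rack and node names are hypothetical.

```python
def place_replicas(racks, local_rack):
    """Pick three DataNodes for one block, following the rule on the slide.

    racks: dict mapping a rack name to the list of DataNode names on it.
    """
    # First replica: a node on the local rack.
    first = racks[local_rack][0]
    # Next two replicas: two different nodes on one remote rack.
    for rack, nodes in racks.items():
        if rack != local_rack and len(nodes) >= 2:
            return [(local_rack, first), (rack, nodes[0]), (rack, nodes[1])]
    raise ValueError("no remote rack with at least two DataNodes")


cluster = {
    "rack-1": ["dn1", "dn2", "dn3"],
    "rack-2": ["dn4", "dn5", "dn6"],
    "rack-3": ["dn7", "dn8", "dn9"],
}
placement = place_replicas(cluster, "rack-1")
print(placement)
```

With this layout the block survives the loss of a whole rack, while two of the three copies share a rack so that replication traffic between racks stays low.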
HDFS Fault Tolerance
If a DataNode fails, its data blocks can be recovered and retrieved from the replicas stored on other DataNodes.
Replication Factor = 3

Start Hadoop Daemons
1. ./sbin/start-all.sh
   Starts all the Hadoop daemons (HDFS & YARN)
2. ./sbin/stop-all.sh
   Stops all the Hadoop daemons
3. jps
   Checks which daemons are running on your machine

Writing & Deleting a File in Hadoop
1. hdfs dfs -put /test.txt /
   Copies a file from the local file system to HDFS
2. hdfs dfs -ls /
   Lists the files and directories in the HDFS root directory
3. hdfs dfs -rm /test.txt
   Deletes the file from HDFS

What is YARN?
• Hadoop 2.0 introduced a new framework, YARN (Yet Another Resource Negotiator), which provides the ability to run non-MapReduce applications.
• It provides a paradigm for parallel processing over Hadoop.
• The YARN framework is responsible for integrating different tools, such as Spark, Hive and Pig, with Hadoop.

ResourceManager
• Receives the processing requests
• Passes the requests to the corresponding NodeManagers

NodeManager
• Installed on every DataNode
• Responsible for executing tasks on every single DataNode

YARN Architecture in Detail

YARN Workflow

Application Submission in YARN
[Diagram: MR Code and MR Job on the client side (Client Mode); Resource Manager on the RM Node; one NodeManager running the AppMaster JVM; one NodeManager running the MR Task in a YARN child task JVM]
1. Run Job
2. Submit Job
3. Get application ID
4.1. Start Container
4.2. Launch AppMaster
5. Allocate Resources
6.1. Start Container
6.2. Launch
7. Execute

YARN Application Workflow
(RM = ResourceManager, NM = NodeManager, AM = ApplicationMaster)
1. Client submits an application
2. RM allocates a container to start the AM
3. AM registers with the RM
4. AM requests containers from the RM
5. AM notifies the NM to launch containers
6. Application code is executed in the container
7. Client contacts the RM/AM to monitor the application's status
8. AM unregisters from the RM

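The eight workflow steps can be traced with a toy model. This is a deliberately simplified sketch and not the YARN API: the class, the function and the slot-counting scheme are all hypothetical, and a real ResourceManager schedules by memory and CPU rather than by counting slots.

```python
class ResourceManager:
    """Toy RM that tracks free container slots per NodeManager."""

    def __init__(self, free_slots):
        self.free_slots = dict(free_slots)  # NodeManager name -> free containers
        self.registered = set()             # application IDs with a live AM

    def allocate(self, count):
        """Grant `count` containers from NodeManagers that still have room."""
        if sum(self.free_slots.values()) < count:
            raise RuntimeError("not enough free containers")
        granted = []
        for nm in self.free_slots:
            while self.free_slots[nm] > 0 and len(granted) < count:
                self.free_slots[nm] -= 1
                granted.append(nm)
        return granted

    def release(self, nodes):
        for nm in nodes:
            self.free_slots[nm] += 1


def run_application(rm, app_id, tasks):
    """Walk through the workflow steps and return a log of what happened."""
    log = [f"1 client submits {app_id}"]
    am_node = rm.allocate(1)[0]
    log.append(f"2 RM allocates a container on {am_node} to start the AM")
    rm.registered.add(app_id)
    log.append("3 AM registers with RM")
    nodes = rm.allocate(len(tasks))
    log.append(f"4 AM is granted containers on {nodes}")
    for node, task in zip(nodes, tasks):
        log.append(f"5-6 {node} launches a container and runs {task}")
    log.append("7 client polls RM/AM for status")
    rm.registered.discard(app_id)
    rm.release([am_node] + nodes)
    log.append("8 AM unregisters from RM")
    return log


rm = ResourceManager({"nm1": 2, "nm2": 2})
log = run_application(rm, "app-1", ["map-0", "map-1"])
print("\n".join(log))
```

The point of the model is the division of labour: the RM only hands out containers, while the AM decides what runs in them and reports back when the application is done.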
Hadoop Cluster Architecture = HDFS + YARN
[Diagram: Hadoop cluster architecture showing master and slave nodes across the HDFS and YARN layers]

Hadoop Cluster Hardware Specification
NameNode
▪ RAM: 64 GB
▪ Hard disk: 1 TB
▪ Processor: Xeon with 8 cores
▪ Ethernet: 3 x 10 Gb/s
▪ OS: 64-bit CentOS / Linux
▪ Power: Redundant power supply
Secondary NameNode
▪ RAM: 32 GB
▪ Hard disk: 1 TB
▪ Processor: Xeon with 4 cores
▪ Ethernet: 3 x 10 Gb/s
▪ OS: 64-bit CentOS / Linux
▪ Power: Redundant power supply
DataNode
▪ RAM: 16 GB
▪ Hard disk: 6 x 2 TB
▪ Processor: Xeon with 2 cores
▪ Ethernet: 3 x 10 Gb/s
▪ OS: 64-bit CentOS / Linux

Real Time Hadoop Cluster Deployment

Hadoop Cluster: Facebook Use Case
▪ 21 PB of storage in a single HDFS cluster
▪ 2000 machines per cluster
▪ 12 TB of data per machine
▪ 1200 machines with 8 cores each + 800 machines with 16 cores each
▪ 32 GB of RAM per machine
▪ 15 map-reduce tasks per machine
That is a total of more than 21 PB of configured storage capacity, which is larger than the previously known Yahoo! cluster of 14 PB.

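The cluster-wide totals implied by these per-machine figures can be worked out directly. The sketch below assumes decimal units (1 PB = 1000 TB, 1 TB = 1000 GB) and treats the 12 TB per machine as raw disk; the slide does not say how the 21 PB of configured capacity relates to the raw total.

```python
machines = 2000
disk_per_machine_tb = 12

# Raw disk across the cluster, in PB (decimal units assumed).
raw_capacity_pb = machines * disk_per_machine_tb / 1000

# 1200 machines with 8 cores each plus 800 machines with 16 cores each.
total_cores = 1200 * 8 + 800 * 16

# 32 GB of RAM per machine, expressed in TB.
total_ram_tb = machines * 32 / 1000

# 15 map-reduce tasks per machine.
total_task_slots = machines * 15

print(raw_capacity_pb)   # 24.0
print(total_cores)       # 22400
print(total_ram_tb)      # 64.0
print(total_task_slots)  # 30000
```

The 24 PB raw figure is consistent with the slide's "more than 21 PB of configured storage capacity".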
Hadoop Cluster: Spotify Use Case
▪ Uses Hadoop for generating music recommendations
▪ 1650-node cluster
▪ ~65 PB storage
▪ 70 TB RAM
▪ 25,000+ daily Hadoop jobs
▪ 43,000 virtualised cores

Spark Core Components
Spark Core
Spark Core is the base engine for large-scale parallel and distributed data processing
It is responsible for:
▪ Memory management and fault recovery
▪ Scheduling, distributing and monitoring jobs on a cluster
▪ Interacting with storage systems

Spark Architecture

Spark SQL
• Spark SQL integrates relational processing with Spark’s functional programming
• Provides support for various data sources and makes it possible to weave SQL queries with code transformations
[Diagram: Spark SQL stack. Data sources (CSV, JSON, JDBC) feed the Data Source API, which sits below the DataFrame API; the DataFrame API is used through the DataFrame DSL and through Spark SQL and HQL]

Start Spark Daemons
1. ./sbin/start-all.sh
   Starts all the Spark daemons (Master & Worker)
2. jps
   Checks which daemons are running on your machine
3. ./bin/spark-shell
   Starts the Spark shell

K-Means Clustering
The process by which objects are classified into a predefined number of groups so that they are as dissimilar as possible from one group to another, but as similar as possible within each group.
▪ The objects in group 1 should be as similar as possible
▪ But there should be a large difference between an object in group 1 and an object in group 2
▪ The attributes of the objects determine which objects should be grouped together
[Diagram: total population divided into Group 1, Group 2, Group 3 and Group 4]

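The grouping idea can be made concrete with a plain-Python version of the standard k-means loop (Lloyd's algorithm). This is a stand-in for illustration, not the Spark MLlib implementation the deck uses. The sample points are invented, and the first k points are used as starting centroids purely to keep the run repeatable.

```python
import math


def kmeans(points, k, iterations=10):
    """Plain k-means (Lloyd's algorithm) on 2-D points.

    Uses the first k points as starting centroids so the run is repeatable;
    library implementations usually choose starting points more carefully.
    """
    centroids = [tuple(p) for p in points[:k]]
    labels = []
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centroids, labels


# Hypothetical sample: two visibly separate groups of points.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, labels = kmeans(points, k=2)
print(labels)
print(centroids)
```

Each centroid ends up at the mean of its group, which is the "as similar as possible within each group" criterion expressed as a distance.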
K-Means Clustering
▪ Consider a comparison of Income & Balance:
[Chart: Current Balance (Low, Medium, High) plotted against Gross Monthly Income (Low, Medium, High), with two example clusters marked, one in the high-balance, low-income region and one in the high-income, low-balance region]
The objects in Cluster 1 have similar characteristics (High Income and Low Balance).
The objects in Cluster 2 also share the same characteristic (High Balance and Low Income).

Example
▪ The plot of students in an area is as given below
[Figure: plot of student locations in the area]
I need to find specific locations to build schools in this area so that the students don't have to travel much.
▪ Using k-means clustering we got the output as:
[Figure: the student locations grouped into clusters]

Apache Zeppelin

What is Zeppelin?
▪ A completely open, web-based notebook that enables interactive data analytics
▪ Brings data ingestion, data exploration, visualization, sharing and collaboration features to Hadoop & Spark
▪ Various languages are supported via Zeppelin language interpreters

US County Solution
▪ Storing Data in HDFS
▪ Analyzing Data using Spark SQL
▪ Visualizing the data in Zeppelin

US Election Solution Strategy
1. US Primary Election Dataset
2. Storing Data in HDFS
3. Processing Data Using Spark Components
4. Transforming Data Using Scala & Spark SQL
5. Clustering Data Using Spark MLlib (K-Means)
6. Visualizing the Result Using Zeppelin

Edureka LMS
LMS: Getting Started
LMS: Pre-Recorded Session
LMS: Course Content
LMS: Projects
Hadoop Tutorial | Big Data Hadoop Tutorial For Beginners | Hadoop Certification Training | Edureka

More Related Content

What's hot

Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)
Prashant Gupta
 
Azure DataBricks for Data Engineering by Eugene Polonichko
Azure DataBricks for Data Engineering by Eugene PolonichkoAzure DataBricks for Data Engineering by Eugene Polonichko
Azure DataBricks for Data Engineering by Eugene Polonichko
Dimko Zhluktenko
 
What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...
What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...
What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...
Edureka!
 
Introduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics Meetup
Introduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics MeetupIntroduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics Meetup
Introduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics Meetup
iwrigley
 
Building Dynamic Data Pipelines in Azure Data Factory (Microsoft Ignite 2019)
Building Dynamic Data Pipelines in Azure Data Factory (Microsoft Ignite 2019)Building Dynamic Data Pipelines in Azure Data Factory (Microsoft Ignite 2019)
Building Dynamic Data Pipelines in Azure Data Factory (Microsoft Ignite 2019)
Cathrine Wilhelmsen
 
Greenplum Architecture
Greenplum ArchitectureGreenplum Architecture
Greenplum Architecture
Alexey Grishchenko
 
Hadoop YARN | Hadoop YARN Architecture | Hadoop YARN Tutorial | Hadoop Tutori...
Hadoop YARN | Hadoop YARN Architecture | Hadoop YARN Tutorial | Hadoop Tutori...Hadoop YARN | Hadoop YARN Architecture | Hadoop YARN Tutorial | Hadoop Tutori...
Hadoop YARN | Hadoop YARN Architecture | Hadoop YARN Tutorial | Hadoop Tutori...
Simplilearn
 
MapReduce Tutorial | What is MapReduce | Hadoop MapReduce Tutorial | Edureka
MapReduce Tutorial | What is MapReduce | Hadoop MapReduce Tutorial | EdurekaMapReduce Tutorial | What is MapReduce | Hadoop MapReduce Tutorial | Edureka
MapReduce Tutorial | What is MapReduce | Hadoop MapReduce Tutorial | Edureka
Edureka!
 
Apache Hadoop Security - Ranger
Apache Hadoop Security - RangerApache Hadoop Security - Ranger
Apache Hadoop Security - Ranger
Isheeta Sanghi
 
Powering Custom Apps at Facebook using Spark Script Transformation
Powering Custom Apps at Facebook using Spark Script TransformationPowering Custom Apps at Facebook using Spark Script Transformation
Powering Custom Apps at Facebook using Spark Script Transformation
Databricks
 
Clickstream & Social Media Analysis using Apache Spark
Clickstream & Social Media Analysis using Apache SparkClickstream & Social Media Analysis using Apache Spark
Hive(ppt)
Hive(ppt)Hive(ppt)
Hive(ppt)
Abhinav Tyagi
 
Introduction to Greenplum
Introduction to GreenplumIntroduction to Greenplum
Introduction to Greenplum
Dave Cramer
 
Amazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best Practices
Amazon Web Services
 
What’s New in the Upcoming Apache Spark 3.0
What’s New in the Upcoming Apache Spark 3.0What’s New in the Upcoming Apache Spark 3.0
What’s New in the Upcoming Apache Spark 3.0
Databricks
 
What Is Hadoop | Hadoop Tutorial For Beginners | Edureka
What Is Hadoop | Hadoop Tutorial For Beginners | EdurekaWhat Is Hadoop | Hadoop Tutorial For Beginners | Edureka
What Is Hadoop | Hadoop Tutorial For Beginners | Edureka
Edureka!
 
Architecting Modern Data Platforms
Architecting Modern Data PlatformsArchitecting Modern Data Platforms
Architecting Modern Data Platforms
Ankit Rathi
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component
rebeccatho
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveDataWorks Summit
 
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Edureka!
 

What's hot (20)

Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)
 
Azure DataBricks for Data Engineering by Eugene Polonichko
Azure DataBricks for Data Engineering by Eugene PolonichkoAzure DataBricks for Data Engineering by Eugene Polonichko
Azure DataBricks for Data Engineering by Eugene Polonichko
 
What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...
What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...
What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...
 
Introduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics Meetup
Introduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics MeetupIntroduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics Meetup
Introduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics Meetup
 
Building Dynamic Data Pipelines in Azure Data Factory (Microsoft Ignite 2019)
Building Dynamic Data Pipelines in Azure Data Factory (Microsoft Ignite 2019)Building Dynamic Data Pipelines in Azure Data Factory (Microsoft Ignite 2019)
Building Dynamic Data Pipelines in Azure Data Factory (Microsoft Ignite 2019)
 
Greenplum Architecture
Greenplum ArchitectureGreenplum Architecture
Greenplum Architecture
 
Hadoop YARN | Hadoop YARN Architecture | Hadoop YARN Tutorial | Hadoop Tutori...
Hadoop YARN | Hadoop YARN Architecture | Hadoop YARN Tutorial | Hadoop Tutori...Hadoop YARN | Hadoop YARN Architecture | Hadoop YARN Tutorial | Hadoop Tutori...
Hadoop YARN | Hadoop YARN Architecture | Hadoop YARN Tutorial | Hadoop Tutori...
 
MapReduce Tutorial | What is MapReduce | Hadoop MapReduce Tutorial | Edureka
MapReduce Tutorial | What is MapReduce | Hadoop MapReduce Tutorial | EdurekaMapReduce Tutorial | What is MapReduce | Hadoop MapReduce Tutorial | Edureka
MapReduce Tutorial | What is MapReduce | Hadoop MapReduce Tutorial | Edureka
 
Apache Hadoop Security - Ranger
Apache Hadoop Security - RangerApache Hadoop Security - Ranger
Apache Hadoop Security - Ranger
 
Powering Custom Apps at Facebook using Spark Script Transformation
Powering Custom Apps at Facebook using Spark Script TransformationPowering Custom Apps at Facebook using Spark Script Transformation
Powering Custom Apps at Facebook using Spark Script Transformation
 
Clickstream & Social Media Analysis using Apache Spark
Clickstream & Social Media Analysis using Apache SparkClickstream & Social Media Analysis using Apache Spark
Clickstream & Social Media Analysis using Apache Spark
 
Hive(ppt)
Hive(ppt)Hive(ppt)
Hive(ppt)
 
Introduction to Greenplum
Introduction to GreenplumIntroduction to Greenplum
Introduction to Greenplum
 
Amazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best Practices
 
What’s New in the Upcoming Apache Spark 3.0
What’s New in the Upcoming Apache Spark 3.0What’s New in the Upcoming Apache Spark 3.0
What’s New in the Upcoming Apache Spark 3.0
 
What Is Hadoop | Hadoop Tutorial For Beginners | Edureka
What Is Hadoop | Hadoop Tutorial For Beginners | EdurekaWhat Is Hadoop | Hadoop Tutorial For Beginners | Edureka
What Is Hadoop | Hadoop Tutorial For Beginners | Edureka
 
Architecting Modern Data Platforms
Architecting Modern Data PlatformsArchitecting Modern Data Platforms
Architecting Modern Data Platforms
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
 

Hadoop Tutorial | Big Data Hadoop Tutorial For Beginners | Hadoop Certification Training | Edureka

  • 2. Big Data Use Cases
  • 3. Big Data Use-Cases 1. US Primary Election Analysis 2. Market Analysis for US Cab Startup
  • 4. US Primary Election Analysis
  • 5. US Election STEP 1: Primary & Caucuses
  • 6. US Election STEP 1: Primary & Caucuses STEP 2: National Conventions
  • 7. US Election STEP 1: Primary & Caucuses STEP 2: National Conventions STEP 3: General Elections
  • 8. US Election STEP 1: Primary & Caucuses STEP 2: National Conventions STEP 3: General Elections STEP 4: Electoral College
  • 9. US Primary Election PROBLEM STATEMENT: In the 2016 US primary elections, the Democrats nominated Hillary Clinton over Bernie Sanders, and the Republican Party nominated Donald Trump, to contest the presidency. As an analyst, you have been tasked with understanding, based on demographic features, the factors that led Hillary Clinton and Donald Trump to win their primaries, so that their next initiatives and campaigns can be planned.
  • 10. US Primary Election Dataset As a data analyst, you have two datasets available: the US Primary Election data set and the US Demographic Features (county-wise) data set.
  • 11. US Primary Election Dataset state: name of the US state; state_abbreviation: abbreviation of the US state; county: name of the county within the state; fips: Federal Information Processing Standards (FIPS) county code, which uniquely identifies a county; party: the party (Republican or Democrat); candidate: a candidate from that party in the US primary election; votes: number of votes gained by the candidate; fraction_votes: votes gained by the candidate divided by the total votes gained by the party
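The column list on this slide can be read as a record type, and `fraction_votes` as a derived value. The following is a minimal plain-Python sketch, not the deck's Spark code. It assumes the party total is taken per county (per `fips`), which the slide does not state, and the sample rows are made-up placeholders rather than real results.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PrimaryResult:
    """One row of the primary election dataset, using the slide's column names."""
    state: str
    state_abbreviation: str
    county: str
    fips: str
    party: str
    candidate: str
    votes: int


def fraction_votes(rows):
    """Return {(fips, party, candidate): candidate votes / party votes in that county}.

    Assumption: the party total is computed per county (fips).
    """
    party_totals = defaultdict(int)
    for r in rows:
        party_totals[(r.fips, r.party)] += r.votes
    shares = {}
    for r in rows:
        total = party_totals[(r.fips, r.party)]
        shares[(r.fips, r.party, r.candidate)] = r.votes / total if total else 0.0
    return shares


# Hypothetical sample rows (placeholder names and numbers).
sample = [
    PrimaryResult("StateA", "SA", "County1", "00001", "Democrat", "Candidate X", 300),
    PrimaryResult("StateA", "SA", "County1", "00001", "Democrat", "Candidate Y", 100),
    PrimaryResult("StateA", "SA", "County1", "00001", "Republican", "Candidate Z", 250),
]
shares = fraction_votes(sample)
print(shares[("00001", "Democrat", "Candidate X")])
```

Keying the totals on `(fips, party)` keeps each party's primary separate, which matches the slide's wording that the denominator is the votes gained by the party.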
  • 12. US County Demographic Features Dataset DETAILS: Population, 2014 estimate; Population, 2010 (April 1) estimates base; Population, percent change - April 1, 2010 to July 1, 2014; Population, 2010; Persons under 5 years, percent, 2014; Persons under 18 years, percent, 2014; Persons 65 years and over, percent, 2014; Female persons, percent, 2014; White alone, percent, 2014; …
  • 13. US Election Solution Strategy Step 1: US Primary Election Dataset
  • 14. US Election Solution Strategy Step 2: Storing Data in HDFS
  • 15. US Election Solution Strategy Step 3: Processing Data Using Spark Components
  • 16. US Election Solution Strategy Step 4: Transforming Data Using Spark SQL
  • 17. US Election Solution Strategy Step 5: Clustering Data Using Spark MLlib (K-Means)
  • 18. US Election Solution Strategy Step 6: Visualizing the Result Using Zeppelin
  • 19. US Election Solution Strategy Step 1: US Primary Election Dataset; Step 2: Storing Data in HDFS; Step 3: Processing Data Using Spark Components; Step 4: Transforming Data Using Spark SQL; Step 5: Clustering Data Using Spark MLlib (K-Means); Step 6: Visualizing the Result Using Zeppelin
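The transform step in this strategy can be pictured as "find the winner per county and party, then attach that county's demographic features". The deck does this with Spark SQL; the sketch below is a plain-Python stand-in that only shows the shape of the operation. It assumes the demographic dataset is also keyed by `fips`, and the feature names and values are hypothetical.

```python
def county_winners(results):
    """Pick, per (fips, party), the row of the candidate with the most votes."""
    best = {}
    for row in results:
        key = (row["fips"], row["party"])
        if key not in best or row["votes"] > best[key]["votes"]:
            best[key] = row
    return best


def join_with_demographics(winners, demographics):
    """Attach each county's demographic features to its winning candidate."""
    joined = []
    for (fips, party), row in sorted(winners.items()):
        features = demographics.get(fips)
        if features is None:
            continue  # no demographic record for this county, so skip it
        joined.append({"fips": fips, "party": party,
                       "winner": row["candidate"], **features})
    return joined


# Hypothetical inputs (placeholder candidates, counties and feature values).
results = [
    {"fips": "00001", "party": "Democrat", "candidate": "Candidate X", "votes": 300},
    {"fips": "00001", "party": "Democrat", "candidate": "Candidate Y", "votes": 100},
    {"fips": "00002", "party": "Democrat", "candidate": "Candidate Y", "votes": 80},
]
demographics = {"00001": {"pct_over_65": 14.2, "pct_female": 51.0}}

table = join_with_demographics(county_winners(results), demographics)
print(table)
```

A table of this shape (one row per county and party, winner plus numeric features) is the kind of input a clustering step such as K-Means could then consume.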
  • 20. Visualization of Result
  • 21. Market Analysis for US Cab Start-Ups
  • 22. Market Analysis for US Cab Start-Ups PROBLEM STATEMENT: A US cab service start-up wants to meet demand optimally and maximize its profit. It has hired you as a data analyst to interpret the available Uber data set and find the busiest ("beehive") customer pick-up points and the peak hours, so that demand can be met profitably.
  • 23. Uber Dataset • Date/Time: Pickup Date & Time • Lat: Latitude of Pickup • Lon: Longitude of Pickup • Base: TLC Base Code
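The peak-hours part of the problem statement can be illustrated without a cluster. The following is a minimal Python sketch, not the deck's Spark solution: the sample rows, the timestamp format `%m/%d/%Y %H:%M:%S` and the function names `pickups_per_hour` and `peak_hours` are all assumptions made for illustration; only the four column names come from the slide.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample rows in the slide's column order: Date/Time, Lat, Lon, Base.
# The values and the timestamp format are assumptions, not data from the deck.
SAMPLE_ROWS = [
    ("4/1/2014 0:11:00", 40.7690, -73.9549, "B02512"),
    ("4/1/2014 17:05:00", 40.7267, -74.0345, "B02512"),
    ("4/1/2014 17:40:00", 40.7316, -73.9873, "B02512"),
    ("4/2/2014 17:15:00", 40.7588, -73.9776, "B02598"),
    ("4/2/2014 8:30:00", 40.7594, -73.9722, "B02598"),
]


def pickups_per_hour(rows, fmt="%m/%d/%Y %H:%M:%S"):
    """Count pickups for each hour of the day (0-23)."""
    counts = Counter()
    for date_time, _lat, _lon, _base in rows:
        counts[datetime.strptime(date_time, fmt).hour] += 1
    return counts


def peak_hours(rows, top_n=1):
    """Return the top_n busiest hours, busiest first."""
    return [hour for hour, _count in pickups_per_hour(rows).most_common(top_n)]


if __name__ == "__main__":
    print(peak_hours(SAMPLE_ROWS))
```

On a real data set the same grouping would typically be expressed as a Spark SQL aggregation over the Date/Time column; the logic is the same count-by-hour shown here.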
  • 24. Market Analysis for US Cab Start-Ups Solution Strategy, Step 1: Uber Pick-Up Locations Dataset (Lat, Lon)
  • 25. Market Analysis for US Cab Start-Ups Solution Strategy, Step 2: Storing Data in HDFS
  • 26. Market Analysis for US Cab Start-Ups Solution Strategy, Step 3: Transforming Dataset
  • 27. Market Analysis for US Cab Start-Ups Solution Strategy, Step 4: K-Means Clustering on Latitude & Longitude
  • 28. Market Analysis for US Cab Start-Ups Solution Strategy: 1. Uber Pick-Up Locations Dataset (Lat, Lon) 2. Storing Data in HDFS 3. Transforming Dataset 4. K-Means Clustering on Latitude & Longitude. Output: Predictions
  • 29. Let Us Know What It Takes…
  • 30. Fundamentals Road Map: Introduction to Hadoop & Spark; HDFS (Hadoop Storage); YARN (Hadoop Processing); Apache Spark; K-Means & Zeppelin; Solution of Use-Cases
  • 31. Introduction to Hadoop & Spark
  • 32. Introduction to Hadoop & Spark. Hadoop is a framework that lets you store and process large data sets in a parallel and distributed fashion. Hadoop has two core components: ▪ HDFS: stores any kind of data across the cluster ▪ YARN: enables parallel processing of the data stored in HDFS. Apache Spark is an open-source cluster-computing framework for real-time processing ❖ Provides an interface for programming entire clusters with implicit data parallelism and fault tolerance ❖ Can run on top of YARN and extends the MapReduce model to efficiently support more types of computation
  • 33. Spark Complementing Hadoop
  • 34. Spark & Hadoop: 1. Spark can process data up to 100 times faster than MapReduce (for in-memory workloads) 2. Spark applications can run on YARN, leveraging the Hadoop cluster 3. Apache Spark can use HDFS as its storage. Challenges addressed: Faster Analytics, Cost Optimization, Avoiding Duplication. Combining Spark's strengths (high processing speed, advanced analytics and multiple integration support) with Hadoop's low-cost operation on commodity hardware gives the best results.
  • 35. Big Data Use-Cases
  • 38. Big Data Use-Cases Solution Architecture: Storing Big Data on HDFS; Processing through the YARN framework; Tools used for processing (solution options): Kafka, Apache Hive, MapReduce, Apache Spark
  • 39. Road Map: Introduction to Hadoop & Spark; HDFS (Hadoop Storage)
  • 40. HDFS ❖ HDFS stands for Hadoop Distributed File System ❖ HDFS is the storage unit of Hadoop ❖ HDFS creates an abstraction layer over the distributed storage resources, so that the whole of HDFS can be seen as a single unit. (Diagram: NameNode, Secondary NameNode and DataNodes)
  • 41. NameNode • Master daemon • Maintains and manages the DataNodes • Records metadata, e.g. location of the stored blocks, size of the files, permissions, hierarchy, etc. • Receives heartbeats and block reports from all the DataNodes
  • 42. Secondary NameNode • Performs checkpointing, the process of combining the edit logs with the FsImage • Allows faster recovery after a NameNode failure, as a backup of the metadata is available • Checkpointing happens periodically (default: 1 hour)
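Checkpointing can be pictured as replaying a log of changes onto a snapshot. The sketch below is a toy Python model under stated assumptions: the dictionary layout of the FsImage, the `("create" | "delete", path, metadata)` edit format and the function name `checkpoint` are hypothetical and do not reflect Hadoop's on-disk formats.

```python
def checkpoint(fsimage, edit_log):
    """Return a new namespace snapshot with every logged edit applied.

    fsimage:  dict mapping path -> metadata (toy stand-in for the FsImage)
    edit_log: list of (op, path, metadata) tuples (toy stand-in for edit logs)
    """
    merged = dict(fsimage)  # copy, so the previous snapshot stays intact
    for op, path, meta in edit_log:
        if op == "create":
            merged[path] = meta
        elif op == "delete":
            merged.pop(path, None)
        else:
            raise ValueError(f"unknown edit op: {op}")
    return merged


# Hypothetical namespace state and pending edits.
old_fsimage = {"/data/a.csv": {"size_mb": 380, "replication": 3}}
pending_edits = [
    ("create", "/data/b.csv", {"size_mb": 500, "replication": 3}),
    ("delete", "/data/a.csv", None),
]

new_fsimage = checkpoint(old_fsimage, pending_edits)
```

After a checkpoint, only edits made since the new snapshot need to be replayed at start-up, which is why a recent checkpoint shortens recovery.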
  • 44. DataNode • Slave daemons • Store the actual data • Serve read and write requests
  • 45. HDFS Architecture in Detail (diagram: NameNode holding metadata (Name, replicas, …): /hdfs/foo/data, 3, …; Clients issuing metadata ops, reads and writes; DataNodes on Rack 1 and Rack 2 with block ops and replication)
  • 46. HDFS Block & Replication
  • 47. HDFS Data Block • Each file is stored on HDFS as blocks • The default size of each block is 128 MB • Let us say I have a file example.txt of size 380 MB: Block 1 = 128 MB, Block 2 = 128 MB, Block 3 = 124 MB • Question: How many blocks will be created if a file of size 500 MB is copied to HDFS?
  • 48. HDFS Data Block • Each file is stored on HDFS as blocks • The default size of each block is 128 MB • How many blocks will be created if a file of size 500 MB is copied to HDFS? Four: Block 1 = 128 MB, Block 2 = 128 MB, Block 3 = 128 MB, Block 4 = 116 MB • For comparison, the 380 MB file needs three: Block 1 = 128 MB, Block 2 = 128 MB, Block 3 = 124 MB
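The block arithmetic on these two slides can be checked with a few lines of Python. This is only a sketch of the calculation; the function name `split_into_blocks` is hypothetical, and sizes are handled in whole megabytes for simplicity.

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the block sizes (in MB) a file of the given size is split into.

    All blocks are full-sized except possibly the last one, which only
    holds the remainder.
    """
    if file_size_mb < 0 or block_size_mb <= 0:
        raise ValueError("sizes must be non-negative and block size positive")
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    blocks = [block_size_mb] * full_blocks
    if remainder:
        blocks.append(remainder)
    return blocks


if __name__ == "__main__":
    for size in (380, 500):
        blocks = split_into_blocks(size)
        print(f"{size} MB -> {len(blocks)} blocks: {blocks}")
```

Note that the last block only occupies as much space as its data needs: the 116 MB block of the 500 MB file does not consume a full 128 MB.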
  • 49. HDFS Block Replication • Each data block is replicated (three times by default) and the replicas are distributed across different DataNodes • Replication Factor = 3 • Example: a 248 MB file is stored as Block 1 = 128 MB and Block 2 = 120 MB
  • 50. Rack Awareness • The Rack Awareness Algorithm reduces latency and provides fault tolerance by replicating data blocks across racks • It states that the first replica of a block is stored on the local rack and the next two replicas are stored on a different (remote) rack
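The placement rule stated above (one replica on the local rack, two on a single remote rack) can be sketched as follows. This is a deterministic toy model for a replication factor of 3, not HDFS's actual placement code, which also weighs factors such as node load and chooses candidates randomly; the topology, node names and the function name `place_replicas` are assumptions.

```python
def place_replicas(topology, local_rack):
    """Pick three (rack, node) targets for one block.

    topology:   dict mapping rack name -> list of DataNode names
    local_rack: rack of the writer

    Replica 1 goes to the local rack; replicas 2 and 3 go to two
    different nodes on one remote rack.
    """
    local_nodes = topology[local_rack]
    if not local_nodes:
        raise ValueError("local rack has no DataNodes")
    placement = [(local_rack, local_nodes[0])]

    for rack, nodes in topology.items():
        if rack != local_rack and len(nodes) >= 2:
            placement.append((rack, nodes[0]))
            placement.append((rack, nodes[1]))
            return placement
    raise ValueError("no remote rack with at least two DataNodes")


# Hypothetical three-rack cluster.
CLUSTER = {
    "rack-1": ["dn1", "dn2"],
    "rack-2": ["dn3", "dn4"],
    "rack-3": ["dn5", "dn6"],
}

if __name__ == "__main__":
    print(place_replicas(CLUSTER, "rack-1"))
```

Keeping two replicas on the same remote rack limits cross-rack traffic during the write, while the local-rack replica still survives the loss of the remote rack, and vice versa.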
  • 51. Rack Awareness (diagram: Rack - 1, Rack - 2, Rack - 3)
  • 57. HDFS Fault Tolerance • If a DataNode fails, its data blocks can be recovered and retrieved from the replicas stored on other DataNodes • Replication Factor = 3
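The fault-tolerance claim can be made concrete with a small simulation: remove one DataNode from a block-to-nodes map and check that every block still has a live replica. The block IDs, node names and function names below are hypothetical; this is a sketch of the idea, not of the NameNode's re-replication logic.

```python
def fail_node(block_locations, failed_node):
    """Return the block -> nodes map with the failed DataNode removed."""
    return {
        block: [node for node in nodes if node != failed_node]
        for block, nodes in block_locations.items()
    }


def under_replicated(block_locations, replication=3):
    """List the blocks that now have fewer replicas than required."""
    return sorted(
        block for block, nodes in block_locations.items()
        if len(nodes) < replication
    )


# Hypothetical placement of two blocks with replication factor 3.
LOCATIONS = {
    "blk_1": ["dn1", "dn2", "dn3"],
    "blk_2": ["dn2", "dn3", "dn4"],
}

after_failure = fail_node(LOCATIONS, "dn1")
still_readable = all(after_failure.values())  # every block has a live replica
needs_copy = under_replicated(after_failure)
```

In this toy run the data stays readable, and the blocks reported by `under_replicated` are the ones that would need a new replica to restore the replication factor.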
  • 59. Start Hadoop Daemons: 1. ./sbin/start-all.sh starts all the Hadoop daemons (HDFS & YARN) 2. ./sbin/stop-all.sh stops all the Hadoop daemons 3. jps lists all the daemons running on your machine
  • 60. Writing & Deleting a File in Hadoop: 1. hdfs dfs -put /test.txt / copies a file from the local file system to HDFS 2. hdfs dfs -ls / lists all the HDFS files/directories 3. hdfs dfs -rm /test.txt deletes the file from HDFS
  • 61. Road Map: Introduction to Hadoop & Spark; HDFS (Hadoop Storage); YARN (Hadoop Processing)
  • 62. What is YARN? • Hadoop 2.0 introduced a new framework, YARN (Yet Another Resource Negotiator), which makes it possible to run non-MapReduce applications • It provides a paradigm for parallel processing over Hadoop • The YARN framework is responsible for integrating different tools, such as Spark, Hive and Pig, with Hadoop
  • 63. ResourceManager • Receives the processing requests • Passes the requests to the corresponding NodeManagers
  • 64. NodeManager • Installed on every DataNode • Responsible for the execution of tasks on every single DataNode
  • 65. YARN Architecture in Detail
  • 66. YARN Workflow
  • 67. Application Submission in YARN (diagram: MR Code and MR Job on the Client Node; ResourceManager on the RM Node; one NodeManager hosting the AppMaster JVM; one NodeManager hosting the YARN child task JVM that runs the MR Task). Step 1: Run Job
  • 68. Application Submission in YARN, Step 2: Submit Job
  • 69. Application Submission in YARN, Step 3: Get application ID
  • 70. Application Submission in YARN, Step 4.1: Start Container; Step 4.2: Launch AppMaster
  • 71. Application Submission in YARN, Step 5: Allocate Resources
  • 72. Application Submission in YARN, Step 6.1: Start Container; Step 6.2: Launch (YARN child task JVM)
  • 73. Application Submission in YARN, Step 7: Execute
  • 74. YARN Application Workflow (RM = ResourceManager, NM = NodeManager, AM = AppMaster): 1. Client submits an application 2. RM allocates a container to start the AM 3. AM registers with the RM 4. AM requests containers from the RM 5. AM notifies the NM to launch the containers 6. Application code is executed in the container 7. Client contacts the RM/AM to monitor the application's status 8. AM unregisters from the RM
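The eight workflow steps can be traced with a small simulation. The sketch below only produces an ordered event log; it does not call any YARN API, and the function name `run_application`, the application name and the container IDs are hypothetical. Steps 4 to 6 repeat once per container, which matches the repeated "4 5" arrows in the slide's diagram.

```python
def run_application(app_name, num_containers):
    """Return the ordered list of events for one application run."""
    log = [
        f"1. Client submits application {app_name}",
        "2. RM allocates a container and starts the AM",
        "3. AM registers with RM",
    ]
    for index in range(1, num_containers + 1):
        container = f"container_{index}"
        log.append(f"4. AM requests {container} from RM")
        log.append(f"5. AM notifies NM to launch {container}")
        log.append(f"6. Application code executes in {container}")
    log.append("7. Client contacts RM/AM for status: FINISHED")
    log.append("8. AM unregisters from RM")
    return log


if __name__ == "__main__":
    for event in run_application("wordcount", num_containers=2):
        print(event)
```

The point of the trace is the ordering: the AM is itself started in a container before it can negotiate further containers, and it unregisters only after the application's work is done.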
  • 75. Hadoop Cluster Architecture = HDFS + YARN
  • 76. Hadoop Cluster Architecture (diagram: Master and Slave nodes, each with HDFS and YARN components)
  • 77. Hadoop Cluster Hardware Specification
  • 78. Hadoop Cluster Hardware Specification. NameNode: ▪ RAM: 64 GB ▪ Hard disk: 1 TB ▪ Processor: Xeon with 8 cores ▪ Ethernet: 3 x 10 Gb/s ▪ OS: 64-bit CentOS / Linux ▪ Power: Redundant Power Supply. Secondary NameNode: ▪ RAM: 32 GB ▪ Hard disk: 1 TB ▪ Processor: Xeon with 4 cores ▪ Ethernet: 3 x 10 Gb/s ▪ OS: 64-bit CentOS / Linux ▪ Power: Redundant Power Supply. DataNode: ▪ RAM: 16 GB ▪ Hard disk: 6 x 2 TB ▪ Processor: Xeon with 2 cores ▪ Ethernet: 3 x 10 Gb/s ▪ OS: 64-bit CentOS / Linux
  • 79. Real Time Hadoop Cluster Deployment
  • 80. Hadoop Cluster: Facebook Use Case • 21 PB of storage in a single HDFS cluster • 2000 machines per cluster • 12 TB of data per machine • 1200 machines with 8 cores each + 800 machines with 16 cores each • 32 GB of RAM per machine • 15 MapReduce tasks per machine. That is a total of more than 21 PB of configured storage capacity, which is larger than the previously known Yahoo! cluster of 14 PB.
  • 81. Hadoop Cluster: Spotify Use Case • Uses Hadoop for generating music recommendations • 1650-node cluster • ~65 PB storage • 70 TB RAM • 25,000+ daily Hadoop jobs • 43,000 virtualised cores
  • 82. Road Map: Introduction to Hadoop & Spark; HDFS (Hadoop Storage); YARN (Hadoop Processing); Apache Spark
  • 83. Spark Core Components
  • 84. Spark Core • Spark Core is the base engine for large-scale parallel and distributed data processing • It is responsible for: ▪ Memory management and fault recovery ▪ Scheduling, distributing and monitoring jobs on a cluster ▪ Interacting with storage systems
  • 85. Spark Architecture
  • 86. Spark SQL • Spark SQL integrates relational processing with Spark’s functional programming • Provides support for various data sources and makes it possible to weave SQL queries with code transformations • (Diagram: Data Source API (CSV, JSON, JDBC); DataFrame API; DataFrame DSL; Spark SQL and HQL)
  • 87. Start Spark Daemons: 1. ./sbin/start-all.sh starts all the Spark daemons (Master & Worker) 2. jps lists all the daemons running on your machine 3. ./bin/spark-shell starts the Spark shell
  • 88. Road Map: Introduction to Hadoop & Spark; HDFS (Hadoop Storage); YARN (Hadoop Processing); Apache Spark; K-Means & Zeppelin
  • 89. K-Means Clustering • The process by which objects are classified into a predefined number of groups so that they are as dissimilar as possible from one group to another, but as similar as possible within each group ▪ The objects in group 1 should be as similar as possible ▪ But there should be a large difference between an object in group 1 and an object in group 2 ▪ The attributes of the objects determine which objects should be grouped together • (Diagram: the total population divided into Group 1, Group 2, Group 3 and Group 4)
  • 90. K-Means Clustering ▪ Consider a comparison of Income and Balance (plot: Current Balance against Gross Monthly Income, each ranging over Low, Medium and High) ▪ Example Cluster 1: High Balance, Low Income ▪ Example Cluster 2: High Income, Low Balance ▪ The objects in Cluster 1 have similar characteristics (High Balance and Low Income) ▪ Likewise, the objects in Cluster 2 share the same characteristics (High Income and Low Balance)
  • 91. Example ▪ Given a plot of the students in an area, I need to find specific locations to build schools in this area so that the students don’t have to travel far
  • 92. Example ▪ Output obtained using k-means clustering (plot of the resulting clusters)
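The school-placement example can be reproduced in miniature with a plain-Python version of k-means (Lloyd's algorithm). This is a teaching sketch, not Spark MLlib's implementation: it initialises the centroids from the first k points rather than randomly, and the student coordinates and the function name `kmeans` are made up for illustration.

```python
import math


def kmeans(points, k, max_iter=100):
    """Cluster 2-D points into k groups; return (centroids, clusters)."""
    centroids = [points[i] for i in range(k)]  # simple deterministic start
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda c: math.dist(point, centroids[c]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged: assignments are stable
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical student locations forming two neighbourhoods.
STUDENTS = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
            (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]

school_sites, groups = kmeans(STUDENTS, k=2)
```

Each returned centroid is a candidate school location: it is the mean position of the students assigned to it, which minimises the total squared travel distance within that group.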
  • 93. Apache Zeppelin
  • 94. What is Zeppelin? ▪ A completely open, web-based notebook that enables interactive data analytics ▪ Brings data ingestion, data exploration, visualization, sharing and collaboration features to Hadoop & Spark ▪ Various languages are supported via Zeppelin language interpreters
  • 95. Road Map: Introduction to Hadoop & Spark; HDFS (Hadoop Storage); YARN (Hadoop Processing); Apache Spark; K-Means & Zeppelin; Solution of Use-case 1
  • 96. US County Solution: Storing Data in HDFS; Analyzing Data using Spark SQL; Visualizing the data in Zeppelin
  • 97. US Election Solution Strategy: 1. US Primary Election Dataset 2. Storing Data in HDFS 3. Processing Data Using Spark Components 4. Transforming Data Using Scala & Spark SQL 5. Clustering Data Using Spark MLlib (K-Means) 6. Visualizing the Result Using Zeppelin
  • 98. Road Map: Introduction to Hadoop & Spark; HDFS (Hadoop Storage); YARN (Hadoop Processing); Apache Spark; K-Means & Zeppelin; Solution of Use-cases 1 & 2
  • 99. Market Analysis for US Cab Start-Ups Solution Strategy: 1. Uber Pick-Up Locations Dataset (Lat, Lon) 2. Storing Data in HDFS 3. Transforming Dataset 4. K-Means Clustering on Latitude & Longitude. Output: Predictions
  • 100. Edureka LMS
  • 101. LMS: Getting Started
  • 102. LMS: Pre-Recorded Session
  • 103. LMS: Course Content
  • 104. LMS: Projects