MCS 7106: Advanced Topics in Computer Science
Simon Alex and Nambaale
Hadoop
Simon Alex and Nambaale MCS 7106 October 27, 2019 1 / 29
Overview
1 Hadoop
Hadoop Overview
MapReduce
Future Hadoop
Pros and Cons
Pilot Implementation
What is Hadoop?
Hadoop is “an open source software platform for distributed storage
and distributed processing of very large data sets on computer
clusters built from commodity hardware” (Hortonworks).
The Hadoop platform addresses the three dimensions of data management
challenges, known as the 3 V’s: volume, velocity,
and variety.
Hadoop Origin
Google published the GFS and MapReduce papers in 2003 and 2004.
Doug Cutting and Mike Cafarella were building “Nutch”, an open source web search engine, at
the same time.
Hadoop was split out of Nutch in 2006, driven primarily by Doug Cutting, who joined Yahoo! that year.
Why Hadoop?
Disk seek times
Hardware failures
Processing times
World of Hadoop
HDFS
HDFS is based on Google’s GFS
Handles big files
HDFS breaks big files into blocks
Stored across several commodity computers
HDFS Architecture
HDFS comprises two important components: a NameNode (the
master) and a number of DataNodes (workers).
The NameNode serves all metadata operations on the file system, such as
creating, opening, closing, or renaming files and directories.
DataNodes store and retrieve blocks when they are told to (by clients
or the NameNode).
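The split between metadata and block storage can be sketched as a toy in-memory model. This is an illustration only, not HDFS code: the names `ToyNameNode` and `ToyDataNode`, the 4-byte block size, the replication factor of 2, the round-robin placement, and the file path are all assumptions made for the example. Real HDFS uses much larger blocks (128 MB by default in recent versions), rack-aware placement, and a write pipeline in which DataNodes forward replicas to each other rather than the client sending every copy.

```python
from itertools import cycle

BLOCK_SIZE = 4     # bytes; tiny on purpose (assumption for the demo)
REPLICATION = 2    # copies kept of each block (assumption)


class ToyDataNode:
    """Stores raw blocks by id; knows nothing about file names."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def store(self, block_id, data):
        self.blocks[block_id] = data

    def retrieve(self, block_id):
        return self.blocks[block_id]


class ToyNameNode:
    """Holds metadata only: path -> ordered list of (block id, DataNodes)."""

    def __init__(self, datanodes):
        self.files = {}
        self._next_block_id = 0
        self._placement = cycle(datanodes)  # round-robin, not rack-aware

    def allocate_block(self, path):
        block_id = self._next_block_id
        self._next_block_id += 1
        targets = [next(self._placement) for _ in range(REPLICATION)]
        self.files.setdefault(path, []).append((block_id, targets))
        return block_id, targets

    def block_locations(self, path):
        return self.files[path]


def write_file(namenode, path, data):
    """Client side: split data into blocks and send each to its DataNodes."""
    for start in range(0, len(data), BLOCK_SIZE):
        block_id, targets = namenode.allocate_block(path)
        for node in targets:
            node.store(block_id, data[start:start + BLOCK_SIZE])


def read_file(namenode, path):
    """Client side: ask the NameNode where blocks live, then fetch them."""
    chunks = []
    for block_id, holders in namenode.block_locations(path):
        chunks.append(holders[0].retrieve(block_id))  # any replica will do
    return b"".join(chunks)


nodes = [ToyDataNode(f"dn{i}") for i in range(3)]
namenode = ToyNameNode(nodes)
write_file(namenode, "/ratings/u.data", b"hello hadoop")  # hypothetical path
restored = read_file(namenode, "/ratings/u.data")
```

Note that the file contents never pass through the toy NameNode; it only answers "which blocks, on which nodes", which is the point of the master/worker split.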
Reading a File
Writing a File
NameNode Resilience
Backup metadata: the NameNode writes its metadata to the local disk and to an NFS mount
Secondary NameNode: maintains a copy of the namespace image merged with the edit log
HDFS Federation: each NameNode manages a specific namespace
HDFS High Availability: a hot standby NameNode using a shared edit log
Using HDFS
UI (Ambari)
Command-Line Interface
HTTP / HDFS Proxies
Java Interface
NFS Gateway
MapReduce
MapReduce is a programming model and implementation developed at
Google for processing and generating large datasets across a cluster of
computers.
MapReduce is a core component of Apache Hadoop, which distributes
processing on a cluster of computers.
MapReduce Programming Model
This programming model is inspired∗ by the map and reduce primitives
of functional programming languages such as Lisp.
map: takes as input a procedure and a sequence of values and applies
the procedure to each value in the sequence.
reduce: takes as input a sequence of values and combines all values
using a binary operator.
∗ but not equivalent!
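The two primitives can be tried directly in Python, used here in place of Lisp as an assumption of convenience. The sketch shows the functional-programming versions only; as the footnote says, MapReduce differs from them: its map function emits key-value pairs, and its reduce function is called once per key rather than once over the whole sequence.

```python
from functools import reduce
from operator import add

values = [1, 2, 3, 4]

# map: apply a procedure to every value in the sequence.
squares = list(map(lambda x: x * x, values))

# reduce: combine all values with a binary operator.
total = reduce(add, squares)

print(squares)  # [1, 4, 9, 16]
print(total)    # 30
```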
How Does MapReduce Work?
MapReduce works by breaking the processing into two phases: the map
phase and the reduce phase.
Each phase has key-value pairs as input and output, the types of which
may be chosen by the programmer. The programmer also specifies two
functions: the map function and the reduce function.
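The two phases, and the grouping step that sits between them, can be sketched in a few lines of plain Python. This is a single-process illustration, not how Hadoop executes jobs: `run_mapreduce`, `word_mapper`, and `count_reducer` are hypothetical names, and the word-count input is made up for the example.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Run map, shuffle and reduce in memory over an iterable of records."""
    grouped = defaultdict(list)
    # Map phase: every record may emit any number of (key, value) pairs.
    for record in records:
        for key, value in mapper(record):
            # Shuffle: collect the values that share a key.
            grouped[key].append(value)
    # Reduce phase: one call per key, in sorted key order.
    output = []
    for key in sorted(grouped):
        output.extend(reducer(key, grouped[key]))
    return output


def word_mapper(line):
    """Programmer-supplied map function: emit (word, 1) for each word."""
    for word in line.split():
        yield word.lower(), 1


def count_reducer(word, counts):
    """Programmer-supplied reduce function: total the counts for one word."""
    yield word, sum(counts)


lines = ["Hadoop stores data", "Hadoop processes data"]
word_counts = run_mapreduce(lines, word_mapper, count_reducer)
print(word_counts)
```

The programmer writes only the two small functions; the grouping by key in the middle is what the framework provides.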
MapReduce Example
Challenge
What is the highest CGPA ever recorded at Makerere for each year?
MapReduce Example
Figure: MapReduce logical data flow
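A minimal sketch of the challenge in plain Python, following the map, shuffle, reduce flow. The record format ("year,cgpa") and every value in `records` are invented for illustration and are not real Makerere data; the function names are hypothetical as well.

```python
from collections import defaultdict

# Hypothetical input lines in the form "year,cgpa"; not real Makerere data.
records = [
    "2016,4.21",
    "2016,4.68",
    "2017,4.90",
    "2017,3.75",
    "2018,4.40",
]


def map_year_cgpa(line):
    """Map: parse one record and emit (year, cgpa)."""
    year, cgpa = line.split(",")
    yield year, float(cgpa)


def reduce_max_cgpa(year, cgpas):
    """Reduce: keep the highest CGPA seen for the year."""
    yield year, max(cgpas)


# Shuffle: group mapped values by key, as the framework would.
grouped = defaultdict(list)
for line in records:
    for year, cgpa in map_year_cgpa(line):
        grouped[year].append(cgpa)

highest = dict(
    pair
    for year in sorted(grouped)
    for pair in reduce_max_cgpa(year, grouped[year])
)
print(highest)  # {'2016': 4.68, '2017': 4.9, '2018': 4.4}
```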
Recent Developments
TonY (TensorFlow on YARN)
Hadoop Encryption
HDFS High Availability Enhancement
Ozone
Strengths and Weaknesses
Strengths
Varied Data sources
Cost effective
Performance
Fault tolerant
High availability
Low network traffic
Scalable
Strengths and Weaknesses
Weaknesses
Issues with small files
Processing overhead
Supports only batch processing
Poor fit for iterative processing
Where is Hadoop used?
LinkedIn Assessment
Question Calibration
Pilot Implementation
UI (Ambari)
Installing the dataset into HDFS
Using Ambari
Installing the dataset into HDFS
Using Command Line Interface
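The command-line route typically comes down to three `hdfs dfs` calls: make a directory, copy the file in, and list the result. The sketch below builds those commands from Python and only prints them by default. The local file name and HDFS directory are hypothetical, and the helper names `hdfs_upload_commands` and `upload` are made up for this example; running with `dry_run=False` assumes the `hdfs` client is installed and on the PATH.

```python
import shlex
import subprocess


def hdfs_upload_commands(local_path, hdfs_dir):
    """Build the `hdfs dfs` commands that copy a local file into HDFS."""
    return [
        ["hdfs", "dfs", "-mkdir", "-p", hdfs_dir],      # create the target directory
        ["hdfs", "dfs", "-put", local_path, hdfs_dir],  # copy the local file in
        ["hdfs", "dfs", "-ls", hdfs_dir],               # list it to confirm
    ]


def upload(local_path, hdfs_dir, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in hdfs_upload_commands(local_path, hdfs_dir):
        print(shlex.join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


# Hypothetical paths; a dry run, so nothing is executed.
upload("ml-100k/u.data", "/user/student/ml-100k")
```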
MapReduce
Writing the Mapper
def mapper_get_ratings(self, _, line):
    # Each input line is tab-separated: userID, movieID, rating, timestamp.
    (userID, movieID, rating, timestamp) = line.split('\t')
    yield rating, 1
MapReduce
Writing the Reducer
def reducer_count_ratings(self, key, values):
    # Sum the 1s emitted for each rating to get its count.
    yield key, sum(values)
MapReduce
Putting it all Together
from mrjob.job import MRJob
from mrjob.step import MRStep


class RatingsBreakdown(MRJob):
    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_ratings,
                   reducer=self.reducer_count_ratings)
        ]

    def mapper_get_ratings(self, _, line):
        # Each input line is tab-separated: userID, movieID, rating, timestamp.
        (userID, movieID, rating, timestamp) = line.split('\t')
        yield rating, 1

    def reducer_count_ratings(self, key, values):
        # Sum the 1s emitted for each rating to get its count.
        yield key, sum(values)


if __name__ == '__main__':
    RatingsBreakdown.run()
MapReduce
Running in Hadoop
Questions?
