SlideShare a Scribd company logo
MCS 7106: Advanced Topics in Computer Science
Simon Alex and Nambaale
Hadoop
Simon Alex and Nambaale MCS 7106 October 27, 2019 1 / 29
Overview
1 Hadoop
Hadoop Overview
MapReduce
Future Hadoop
Pros and Cons
Pilot Implementation
Simon Alex and Nambaale MCS 7106 October 27, 2019 2 / 29
What is Hadoop?
Hadoop is “an open source software platform for distributed storage
and distributed processing of very large data sets on computer
clusters built from commodity hardware”-Hortonworks.
Hadoop software platform mitigates the three-dimensions (referred to
as 3V’s) of data management challenges including: volume, velocity
and variety.
Simon Alex and Nambaale MCS 7106 October 27, 2019 3 / 29
Hadoop Origin
Google published GFS and MapReduce papers in 2003-2004.
Yahoo! was building “Nutch”, an open source web search engine at
the same time.
Hadoop was primarily driven by Doug Cutting and Tom White in 2006.
Simon Alex and Nambaale MCS 7106 October 27, 2019 4 / 29
Why Hadoop?
Disk seek times
Hardware failures
Processing times
Simon Alex and Nambaale MCS 7106 October 27, 2019 5 / 29
World of Hadoop
Simon Alex and Nambaale MCS 7106 October 27, 2019 6 / 29
HDFS
HDFS is based on Google’s GFS
Handles big files
HDFS breaks big files into blocks
Stored across several commodity computers
Simon Alex and Nambaale MCS 7106 October 27, 2019 7 / 29
HDFS Architecture
HDFS comprises of two important components: a name-node (the
master) and a number of datanodes (workers).
The NameNode serves all metadata operations on the file system like
creating, opening, closing or renaming files and directories.
Datanodes store and retrieve blocks when they are told to (by clients
or the namenode).
Simon Alex and Nambaale MCS 7106 October 27, 2019 8 / 29
Reading a File
Simon Alex and Nambaale MCS 7106 October 27, 2019 9 / 29
Writing a File
Simon Alex and Nambaale MCS 7106 October 27, 2019 10 / 29
NameNode Resilience
Backup Metadata-name node writes to the local disk and NFS
Secondary Namenode-maintains merged copy of edit log
HDFS Federation-each namenode manages a specific namespace
HDFS High Availability-hot standby namenode using shared edit log
Simon Alex and Nambaale MCS 7106 October 27, 2019 11 / 29
Using HDFS
UI (Ambari)
Command-Line Interface
HTTP / HDFS Proxies
Java Interface
NFS Gateway
Simon Alex and Nambaale MCS 7106 October 27, 2019 12 / 29
MapReduce
MapReduce is a programming model and implementation developed at
Google for processing and generating large datasets across a cluster of
computers.
MapReduce is a core component of Apache Hadoop, which distributes
processing on a cluster of computers.
Simon Alex and Nambaale MCS 7106 October 27, 2019 13 / 29
MapReduce Programming Model
This programming model is inspired∗ by the map and reduce primitives
of functional programming languages such as Lisp.
map: takes as input a procedure and a sequence of values and applies
the procedure to each value in the sequence.
reduce: takes as input a sequence of values and combines all values
using binary operator.
∗
but not equivalent!
Simon Alex and Nambaale MCS 7106 October 27, 2019 14 / 29
How MapReduce Works?
MapReduce works by breaking the processing into two phases: the map
phase and the reduce phase.
Each phase has key-value pairs as input and output, the types of which
may be chosen by the programmer.The programmer also specifies two
functions: the map function and the reduce function.
Simon Alex and Nambaale MCS 7106 October 27, 2019 15 / 29
MapReduce Example
Challenge
What’s the highest ever recorded Makerere’s CGPA for each year?
Simon Alex and Nambaale MCS 7106 October 27, 2019 16 / 29
MapReduce Example
Figure: MapReduce logical data flow
Simon Alex and Nambaale MCS 7106 October 27, 2019 17 / 29
Recent Developments
TonY (TensorFlow on YARN)
Hadoop Encryption
HDFS High Availabilty Enhancement
Ozone
Simon Alex and Nambaale MCS 7106 October 27, 2019 18 / 29
Strengths and Weaknesses
Strengths
Varied Data sources
Cost effective
Performance
Fault tolerant
High availability
Low network traffic
Scalable
Simon Alex and Nambaale MCS 7106 October 27, 2019 19 / 29
Strengths and Weaknesses
Weaknesses
Issue with small files
Processing overhead
Supports only batch processing
Iterative processing
Simon Alex and Nambaale MCS 7106 October 27, 2019 20 / 29
Where is Hadoop used?
LinkedLn Assessment
Question Calibration
Simon Alex and Nambaale MCS 7106 October 27, 2019 21 / 29
Pilot Implementation
UI (Ambari)
Simon Alex and Nambaale MCS 7106 October 27, 2019 22 / 29
Installing the dataset into HDFS
Using Ambari
Simon Alex and Nambaale MCS 7106 October 27, 2019 23 / 29
Installing the dataset into HDFS
Using Command Line Interface
Simon Alex and Nambaale MCS 7106 October 27, 2019 24 / 29
MapReduce
Writing the Mapper
def mapper_get_ratings (self , _, line ):
(userID , movieID , rating , timestamp) = line.split(’t’)
yield rating , 1
Simon Alex and Nambaale MCS 7106 October 27, 2019 25 / 29
MapReduce
Writing the Reducer
def reducer_count_ratings (self , key , values ):
yield key , sum(values)
Simon Alex and Nambaale MCS 7106 October 27, 2019 26 / 29
MapReduce
Putting it all Together
from mrjob.job import MRJob
from mrjob.step import MRStep
class RatingsBreakdown (MRJob ):
def steps(self ):
return [
MRStep(mapper=self.mapper_get_ratings ,
reducer=self. reducer_count_ratings )
]
def mapper_get_ratings (self , _, line ):
(userID , movieID , rating , timestamp )= line.split(’t’)
yield rating , 1
def reducer_count_ratings (self , key , values ):
yield key , sum(values)
if __name__ == ’__main__ ’:
RatingsBreakdown .run()
Simon Alex and Nambaale MCS 7106 October 27, 2019 27 / 29
MapReduce
Running in Hadoop
Simon Alex and Nambaale MCS 7106 October 27, 2019 28 / 29
Questions?
Simon Alex and Nambaale MCS 7106 October 27, 2019 29 / 29

More Related Content

Similar to Hadoop presentation

IRJET- A Review on K-Means++ Clustering Algorithm and Cloud Computing wit...
IRJET-  	  A Review on K-Means++ Clustering Algorithm and Cloud Computing wit...IRJET-  	  A Review on K-Means++ Clustering Algorithm and Cloud Computing wit...
IRJET- A Review on K-Means++ Clustering Algorithm and Cloud Computing wit...
IRJET Journal
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
Xiao Qin
 
B04 06 0918
B04 06 0918B04 06 0918
Mod05lec23(map reduce tutorial)
Mod05lec23(map reduce tutorial)Mod05lec23(map reduce tutorial)
Mod05lec23(map reduce tutorial)
Ankit Gupta
 
Performance evaluation and estimation model using regression method for hadoo...
Performance evaluation and estimation model using regression method for hadoo...Performance evaluation and estimation model using regression method for hadoo...
Performance evaluation and estimation model using regression method for hadoo...
redpel dot com
 
Sawmill - Integrating R and Large Data Clouds
Sawmill - Integrating R and Large Data CloudsSawmill - Integrating R and Large Data Clouds
Sawmill - Integrating R and Large Data Clouds
Robert Grossman
 
Map Reduce along with Amazon EMR
Map Reduce along with Amazon EMRMap Reduce along with Amazon EMR
Map Reduce along with Amazon EMR
ABC Talks
 
H04502048051
H04502048051H04502048051
H04502048051
ijceronline
 
IEEE CLOUD \'11
IEEE CLOUD \'11IEEE CLOUD \'11
IEEE CLOUD \'11
David Ribeiro Alves
 
An introduction to Hadoop for large scale data analysis
An introduction to Hadoop for large scale data analysisAn introduction to Hadoop for large scale data analysis
An introduction to Hadoop for large scale data analysis
Abhijit Sharma
 
Build a Big Data solution using DB2 for z/OS
Build a Big Data solution using DB2 for z/OSBuild a Big Data solution using DB2 for z/OS
Build a Big Data solution using DB2 for z/OS
Jane Man
 
Cloud Programming Simplified: A Berkeley View on Serverless Computing
Cloud Programming Simplified: A Berkeley View on Serverless ComputingCloud Programming Simplified: A Berkeley View on Serverless Computing
Cloud Programming Simplified: A Berkeley View on Serverless Computing
mustafa sarac
 
Galvanise NYC - Scaling R with Hadoop & Spark. V1.0
Galvanise NYC - Scaling R with Hadoop & Spark. V1.0Galvanise NYC - Scaling R with Hadoop & Spark. V1.0
Galvanise NYC - Scaling R with Hadoop & Spark. V1.0
vithakur
 
Building a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystemBuilding a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystem
Gregg Barrett
 
Kubernetes - 7 lessons learned from 7 data centers in 7 months
Kubernetes - 7 lessons learned from 7 data centers in 7 monthsKubernetes - 7 lessons learned from 7 data centers in 7 months
Kubernetes - 7 lessons learned from 7 data centers in 7 months
Michael Tougeron
 
MAP-REDUCE IMPLEMENTATIONS: SURVEY AND PERFORMANCE COMPARISON
MAP-REDUCE IMPLEMENTATIONS: SURVEY AND PERFORMANCE COMPARISONMAP-REDUCE IMPLEMENTATIONS: SURVEY AND PERFORMANCE COMPARISON
MAP-REDUCE IMPLEMENTATIONS: SURVEY AND PERFORMANCE COMPARISON
ijcsit
 
Unite Copehagen 2019 - Unity Roadmap 2020
Unite Copehagen 2019 - Unity Roadmap 2020Unite Copehagen 2019 - Unity Roadmap 2020
Unite Copehagen 2019 - Unity Roadmap 2020
Unity Technologies
 
What to expect in 2020: Unity roadmap - Unite Copenhagen 2019
What to expect in 2020: Unity roadmap - Unite Copenhagen 2019What to expect in 2020: Unity roadmap - Unite Copenhagen 2019
What to expect in 2020: Unity roadmap - Unite Copenhagen 2019
Unity Technologies
 
Cloud-Native Application and Kubernetes
Cloud-Native Application and KubernetesCloud-Native Application and Kubernetes
Cloud-Native Application and Kubernetes
Alex Glikson
 
Scalability 09262012
Scalability 09262012Scalability 09262012
Scalability 09262012
Mike Miller
 

Similar to Hadoop presentation (20)

IRJET- A Review on K-Means++ Clustering Algorithm and Cloud Computing wit...
IRJET-  	  A Review on K-Means++ Clustering Algorithm and Cloud Computing wit...IRJET-  	  A Review on K-Means++ Clustering Algorithm and Cloud Computing wit...
IRJET- A Review on K-Means++ Clustering Algorithm and Cloud Computing wit...
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
 
B04 06 0918
B04 06 0918B04 06 0918
B04 06 0918
 
Mod05lec23(map reduce tutorial)
Mod05lec23(map reduce tutorial)Mod05lec23(map reduce tutorial)
Mod05lec23(map reduce tutorial)
 
Performance evaluation and estimation model using regression method for hadoo...
Performance evaluation and estimation model using regression method for hadoo...Performance evaluation and estimation model using regression method for hadoo...
Performance evaluation and estimation model using regression method for hadoo...
 
Sawmill - Integrating R and Large Data Clouds
Sawmill - Integrating R and Large Data CloudsSawmill - Integrating R and Large Data Clouds
Sawmill - Integrating R and Large Data Clouds
 
Map Reduce along with Amazon EMR
Map Reduce along with Amazon EMRMap Reduce along with Amazon EMR
Map Reduce along with Amazon EMR
 
H04502048051
H04502048051H04502048051
H04502048051
 
IEEE CLOUD \'11
IEEE CLOUD \'11IEEE CLOUD \'11
IEEE CLOUD \'11
 
An introduction to Hadoop for large scale data analysis
An introduction to Hadoop for large scale data analysisAn introduction to Hadoop for large scale data analysis
An introduction to Hadoop for large scale data analysis
 
Build a Big Data solution using DB2 for z/OS
Build a Big Data solution using DB2 for z/OSBuild a Big Data solution using DB2 for z/OS
Build a Big Data solution using DB2 for z/OS
 
Cloud Programming Simplified: A Berkeley View on Serverless Computing
Cloud Programming Simplified: A Berkeley View on Serverless ComputingCloud Programming Simplified: A Berkeley View on Serverless Computing
Cloud Programming Simplified: A Berkeley View on Serverless Computing
 
Galvanise NYC - Scaling R with Hadoop & Spark. V1.0
Galvanise NYC - Scaling R with Hadoop & Spark. V1.0Galvanise NYC - Scaling R with Hadoop & Spark. V1.0
Galvanise NYC - Scaling R with Hadoop & Spark. V1.0
 
Building a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystemBuilding a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystem
 
Kubernetes - 7 lessons learned from 7 data centers in 7 months
Kubernetes - 7 lessons learned from 7 data centers in 7 monthsKubernetes - 7 lessons learned from 7 data centers in 7 months
Kubernetes - 7 lessons learned from 7 data centers in 7 months
 
MAP-REDUCE IMPLEMENTATIONS: SURVEY AND PERFORMANCE COMPARISON
MAP-REDUCE IMPLEMENTATIONS: SURVEY AND PERFORMANCE COMPARISONMAP-REDUCE IMPLEMENTATIONS: SURVEY AND PERFORMANCE COMPARISON
MAP-REDUCE IMPLEMENTATIONS: SURVEY AND PERFORMANCE COMPARISON
 
Unite Copehagen 2019 - Unity Roadmap 2020
Unite Copehagen 2019 - Unity Roadmap 2020Unite Copehagen 2019 - Unity Roadmap 2020
Unite Copehagen 2019 - Unity Roadmap 2020
 
What to expect in 2020: Unity roadmap - Unite Copenhagen 2019
What to expect in 2020: Unity roadmap - Unite Copenhagen 2019What to expect in 2020: Unity roadmap - Unite Copenhagen 2019
What to expect in 2020: Unity roadmap - Unite Copenhagen 2019
 
Cloud-Native Application and Kubernetes
Cloud-Native Application and KubernetesCloud-Native Application and Kubernetes
Cloud-Native Application and Kubernetes
 
Scalability 09262012
Scalability 09262012Scalability 09262012
Scalability 09262012
 

Recently uploaded

Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Zilliz
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems S.M.S.A.
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
Kumud Singh
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Speck&Tech
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Malak Abu Hammad
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
SOFTTECHHUB
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
Matthew Sinclair
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
KAMESHS29
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
James Anderson
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
shyamraj55
 
Large Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial ApplicationsLarge Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial Applications
Rohit Gautam
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
Neo4j
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 

Recently uploaded (20)

Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
Large Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial ApplicationsLarge Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial Applications
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 

Hadoop presentation

  • 1. MCS 7106: Advanced Topics in Computer Science Simon Alex and Nambaale Hadoop Simon Alex and Nambaale MCS 7106 October 27, 2019 1 / 29
  • 2. Overview 1 Hadoop Hadoop Overview MapReduce Future Hadoop Pros and Cons Pilot Implementation Simon Alex and Nambaale MCS 7106 October 27, 2019 2 / 29
  • 3. What is Hadoop? Hadoop is “an open source software platform for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware”-Hortonworks. Hadoop software platform mitigates the three-dimensions (referred to as 3V’s) of data management challenges including: volume, velocity and variety. Simon Alex and Nambaale MCS 7106 October 27, 2019 3 / 29
  • 4. Hadoop Origin Google published GFS and MapReduce papers in 2003-2004. Yahoo! was building “Nutch”, an open source web search engine at the same time. Hadoop was primarily driven by Doug Cutting and Tom White in 2006. Simon Alex and Nambaale MCS 7106 October 27, 2019 4 / 29
  • 5. Why Hadoop? Disk seek times Hardware failures Processing times Simon Alex and Nambaale MCS 7106 October 27, 2019 5 / 29
  • 6. World of Hadoop Simon Alex and Nambaale MCS 7106 October 27, 2019 6 / 29
  • 7. HDFS HDFS is based on Google’s GFS Handles big files HDFS breaks big files into blocks Stored across several commodity computers Simon Alex and Nambaale MCS 7106 October 27, 2019 7 / 29
  • 8. HDFS Architecture HDFS comprises of two important components: a name-node (the master) and a number of datanodes (workers). The NameNode serves all metadata operations on the file system like creating, opening, closing or renaming files and directories. Datanodes store and retrieve blocks when they are told to (by clients or the namenode). Simon Alex and Nambaale MCS 7106 October 27, 2019 8 / 29
  • 9. Reading a File Simon Alex and Nambaale MCS 7106 October 27, 2019 9 / 29
  • 10. Writing a File Simon Alex and Nambaale MCS 7106 October 27, 2019 10 / 29
  • 11. NameNode Resilience Backup Metadata-name node writes to the local disk and NFS Secondary Namenode-maintains merged copy of edit log HDFS Federation-each namenode manages a specific namespace HDFS High Availability-hot standby namenode using shared edit log Simon Alex and Nambaale MCS 7106 October 27, 2019 11 / 29
  • 12. Using HDFS UI (Ambari) Command-Line Interface HTTP / HDFS Proxies Java Interface NFS Gateway Simon Alex and Nambaale MCS 7106 October 27, 2019 12 / 29
  • 13. MapReduce MapReduce is a programming model and implementation developed at Google for processing and generating large datasets across a cluster of computers. MapReduce is a core component of Apache Hadoop, which distributes processing on a cluster of computers. Simon Alex and Nambaale MCS 7106 October 27, 2019 13 / 29
  • 14. MapReduce Programming Model This programming model is inspired∗ by the map and reduce primitives of functional programming languages such as Lisp. map: takes as input a procedure and a sequence of values and applies the procedure to each value in the sequence. reduce: takes as input a sequence of values and combines all values using binary operator. ∗ but not equivalent! Simon Alex and Nambaale MCS 7106 October 27, 2019 14 / 29
  • 15. How MapReduce Works? MapReduce works by breaking the processing into two phases: the map phase and the reduce phase. Each phase has key-value pairs as input and output, the types of which may be chosen by the programmer.The programmer also specifies two functions: the map function and the reduce function. Simon Alex and Nambaale MCS 7106 October 27, 2019 15 / 29
  • 16. MapReduce Example Challenge What’s the highest ever recorded Makerere’s CGPA for each year? Simon Alex and Nambaale MCS 7106 October 27, 2019 16 / 29
  • 17. MapReduce Example Figure: MapReduce logical data flow Simon Alex and Nambaale MCS 7106 October 27, 2019 17 / 29
  • 18. Recent Developments TonY (TensorFlow on YARN) Hadoop Encryption HDFS High Availabilty Enhancement Ozone Simon Alex and Nambaale MCS 7106 October 27, 2019 18 / 29
  • 19. Strengths and Weaknesses Strengths Varied Data sources Cost effective Performance Fault tolerant High availability Low network traffic Scalable Simon Alex and Nambaale MCS 7106 October 27, 2019 19 / 29
  • 20. Strengths and Weaknesses Weaknesses Issue with small files Processing overhead Supports only batch processing Iterative processing Simon Alex and Nambaale MCS 7106 October 27, 2019 20 / 29
  • 21. Where is Hadoop used? LinkedLn Assessment Question Calibration Simon Alex and Nambaale MCS 7106 October 27, 2019 21 / 29
  • 22. Pilot Implementation UI (Ambari) Simon Alex and Nambaale MCS 7106 October 27, 2019 22 / 29
  • 23. Installing the dataset into HDFS Using Ambari Simon Alex and Nambaale MCS 7106 October 27, 2019 23 / 29
  • 24. Installing the dataset into HDFS Using Command Line Interface Simon Alex and Nambaale MCS 7106 October 27, 2019 24 / 29
  • 25. MapReduce Writing the Mapper def mapper_get_ratings (self , _, line ): (userID , movieID , rating , timestamp) = line.split(’t’) yield rating , 1 Simon Alex and Nambaale MCS 7106 October 27, 2019 25 / 29
  • 26. MapReduce Writing the Reducer def reducer_count_ratings (self , key , values ): yield key , sum(values) Simon Alex and Nambaale MCS 7106 October 27, 2019 26 / 29
  • 27. MapReduce Putting it all Together from mrjob.job import MRJob from mrjob.step import MRStep class RatingsBreakdown (MRJob ): def steps(self ): return [ MRStep(mapper=self.mapper_get_ratings , reducer=self. reducer_count_ratings ) ] def mapper_get_ratings (self , _, line ): (userID , movieID , rating , timestamp )= line.split(’t’) yield rating , 1 def reducer_count_ratings (self , key , values ): yield key , sum(values) if __name__ == ’__main__ ’: RatingsBreakdown .run() Simon Alex and Nambaale MCS 7106 October 27, 2019 27 / 29
  • 28. MapReduce Running in Hadoop Simon Alex and Nambaale MCS 7106 October 27, 2019 28 / 29
  • 29. Questions? Simon Alex and Nambaale MCS 7106 October 27, 2019 29 / 29