MCS 7106: Advanced Topics in Computer Science
Simon Alex and Nambaale
Hadoop
October 27, 2019
Overview
1 Hadoop
Hadoop Overview
MapReduce
Future Hadoop
Pros and Cons
Pilot Implementation
What is Hadoop?
Hadoop is “an open source software platform for distributed storage
and distributed processing of very large data sets on computer
clusters built from commodity hardware” (Hortonworks).
The Hadoop platform addresses the three dimensions of data
management challenges (the “3 Vs”): volume, velocity and variety.
Hadoop Origin
Google published GFS and MapReduce papers in 2003-2004.
Doug Cutting and Mike Cafarella were building “Nutch”, an open-source
web search engine, at the same time.
Hadoop was split out of Nutch as its own project in 2006, after Doug Cutting joined Yahoo!.
Why Hadoop?
Disk seek times
Hardware failures
Processing times
World of Hadoop
HDFS
HDFS is based on Google’s GFS
Handles big files
HDFS breaks big files into blocks
Stored across several commodity computers
HDFS Architecture
An HDFS cluster consists of two types of node: a NameNode (the
master) and a number of DataNodes (the workers).
The NameNode serves all metadata operations on the file system, such as
creating, opening, closing or renaming files and directories.
DataNodes store and retrieve blocks when they are told to (by clients
or the NameNode).
Reading a File
Writing a File
NameNode Resilience
Backup metadata: the NameNode writes its metadata to local disk and to NFS
Secondary NameNode: maintains a merged copy of the edit log
HDFS Federation: each NameNode manages a specific namespace
HDFS High Availability: a hot standby NameNode using a shared edit log
Using HDFS
UI (Ambari)
Command-Line Interface
HTTP / HDFS Proxies
Java Interface
NFS Gateway
MapReduce
MapReduce is a programming model and implementation developed at
Google for processing and generating large datasets across a cluster of
computers.
MapReduce is a core component of Apache Hadoop, which distributes
processing on a cluster of computers.
MapReduce Programming Model
This programming model is inspired∗ by the map and reduce primitives
of functional programming languages such as Lisp.
map: takes as input a procedure and a sequence of values, and applies
the procedure to each value in the sequence.
reduce: takes as input a sequence of values and combines all values
using a binary operator.
∗ but not equivalent!
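As a plain-Python illustration of these two primitives (the functional-programming analogy only, not Hadoop itself):

from functools import reduce

# map: apply a procedure to each value in a sequence
squares = list(map(lambda x: x * x, [1, 2, 3, 4]))   # [1, 4, 9, 16]

# reduce: combine all values with a binary operator
total = reduce(lambda a, b: a + b, squares)           # 30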
How Does MapReduce Work?
MapReduce works by breaking the processing into two phases: the map
phase and the reduce phase.
Each phase has key-value pairs as input and output, the types of which
may be chosen by the programmer. The programmer also specifies two
functions: the map function and the reduce function.
MapReduce Example
Challenge
What is the highest CGPA ever recorded at Makerere for each year?
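A minimal sketch of how this challenge could be expressed with mrjob (the library used in the pilot implementation below), assuming a hypothetical input file with one year<TAB>CGPA record per line:

from mrjob.job import MRJob

class MaxCGPAByYear(MRJob):
    # Map: emit (year, cgpa) for every record
    def mapper(self, _, line):
        (year, cgpa) = line.split('\t')
        yield year, float(cgpa)

    # Reduce: keep the maximum CGPA seen for each year
    def reducer(self, year, cgpas):
        yield year, max(cgpas)

if __name__ == '__main__':
    MaxCGPAByYear.run()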
MapReduce Example
Figure: MapReduce logical data flow
Recent Developments
TonY (TensorFlow on YARN)
Hadoop Encryption
HDFS High Availability Enhancement
Ozone
Strengths and Weaknesses
Strengths
Varied data sources
Cost effective
Performance
Fault tolerant
High availability
Low network traffic
Scalable
Strengths and Weaknesses
Weaknesses
Issues with small files
Processing overhead
Supports only batch processing
Inefficient for iterative processing
Where is Hadoop used?
LinkedIn Assessment
Question Calibration
Pilot Implementation
UI (Ambari)
Loading the dataset into HDFS
Using Ambari
Loading the dataset into HDFS
Using Command Line Interface
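A minimal sketch of this step, wrapped in Python to match the rest of the code in the deck; it simply shells out to the standard hdfs dfs commands, and the local file name and HDFS target directory are assumptions to adjust for your cluster:

import subprocess

LOCAL_FILE = 'u.data'                 # local copy of the ratings file (assumed name)
HDFS_DIR = '/user/maria_dev/ml-100k'  # assumed target directory in HDFS

# Create the target directory, copy the file in, then list it to verify
subprocess.run(['hdfs', 'dfs', '-mkdir', '-p', HDFS_DIR], check=True)
subprocess.run(['hdfs', 'dfs', '-put', LOCAL_FILE, HDFS_DIR], check=True)
subprocess.run(['hdfs', 'dfs', '-ls', HDFS_DIR], check=True)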
MapReduce
Writing the Mapper
def mapper_get_ratings(self, _, line):
    # each input line is tab-separated: userID, movieID, rating, timestamp
    (userID, movieID, rating, timestamp) = line.split('\t')
    # emit the rating as the key, with a count of 1
    yield rating, 1
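The fields unpacked here match the tab-separated MovieLens 100k ratings file (u.data): user ID, movie ID, rating and timestamp. Only the rating is kept, emitted as a key with a count of 1.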
MapReduce
Writing the Reducer
def reducer_count_ratings(self, key, values):
    # sum the counts for each rating value
    yield key, sum(values)
MapReduce
Putting it all Together
from mrjob.job import MRJob
from mrjob.step import MRStep

class RatingsBreakdown(MRJob):
    def steps(self):
        # a single map-reduce step: count how many times each rating appears
        return [
            MRStep(mapper=self.mapper_get_ratings,
                   reducer=self.reducer_count_ratings)
        ]

    def mapper_get_ratings(self, _, line):
        (userID, movieID, rating, timestamp) = line.split('\t')
        yield rating, 1

    def reducer_count_ratings(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    RatingsBreakdown.run()
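Having steps() return a list with a single MRStep keeps the job to one map-reduce pass; a multi-stage job would simply return several MRStep objects in order.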
MapReduce
Running in Hadoop
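With mrjob, the same script can be submitted to the cluster by selecting the Hadoop runner, e.g. python RatingsBreakdown.py -r hadoop followed by the input path in HDFS (the exact path depends on where the dataset was loaded in the earlier step).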
Questions?