MCS 7106: Advanced Topics in Computer Science
Simon Alex and Nambaale
Hadoop
October 27, 2019
Overview
1 Hadoop
Hadoop Overview
MapReduce
Future Hadoop
Pros and Cons
Pilot Implementation
What is Hadoop?
Hadoop is “an open source software platform for distributed storage
and distributed processing of very large data sets on computer
clusters built from commodity hardware” (Hortonworks).
The Hadoop platform addresses the three dimensions of data
management challenges (the “3 Vs”): volume, velocity and variety.
Hadoop Origin
Google published GFS and MapReduce papers in 2003-2004.
Doug Cutting and Mike Cafarella were building “Nutch”, an open-source
web search engine, at the same time.
Hadoop was split out of Nutch as its own project in 2006, after Doug Cutting joined Yahoo!.
Why Hadoop?
Disk seek times
Hardware failures
Processing times
World of Hadoop
HDFS
HDFS is based on Google’s GFS
Handles big files
HDFS breaks big files into blocks
Stored across several commodity computers
HDFS Architecture
An HDFS cluster consists of two types of node: a NameNode (the
master) and a number of DataNodes (the workers).
The NameNode serves all metadata operations on the file system, such as
creating, opening, closing or renaming files and directories.
DataNodes store and retrieve blocks when they are told to (by clients
or the NameNode).
Reading a File
Writing a File
NameNode Resilience
Backup metadata: the NameNode writes its metadata to local disk and to NFS
Secondary NameNode: maintains a merged copy of the edit log
HDFS Federation: each NameNode manages a specific namespace
HDFS High Availability: a hot standby NameNode using a shared edit log
Using HDFS
UI (Ambari)
Command-Line Interface
HTTP / HDFS Proxies
Java Interface
NFS Gateway
MapReduce
MapReduce is a programming model and implementation developed at
Google for processing and generating large datasets across a cluster of
computers.
MapReduce is a core component of Apache Hadoop, which distributes
processing on a cluster of computers.
MapReduce Programming Model
This programming model is inspired∗ by the map and reduce primitives
of functional programming languages such as Lisp.
map: takes as input a procedure and a sequence of values, and applies
the procedure to each value in the sequence.
reduce: takes as input a sequence of values and combines all values
using a binary operator.
∗ but not equivalent!
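As a plain-Python illustration of these two primitives (the functional-programming analogy only, not Hadoop itself):

from functools import reduce

# map: apply a procedure to each value in a sequence
squares = list(map(lambda x: x * x, [1, 2, 3, 4]))   # [1, 4, 9, 16]

# reduce: combine all values with a binary operator
total = reduce(lambda a, b: a + b, squares)           # 30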
How Does MapReduce Work?
MapReduce works by breaking the processing into two phases: the map
phase and the reduce phase.
Each phase has key-value pairs as input and output, the types of which
may be chosen by the programmer. The programmer also specifies two
functions: the map function and the reduce function.
MapReduce Example
Challenge
What is the highest CGPA ever recorded at Makerere for each year?
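A minimal sketch of how this challenge could be expressed with mrjob (the library used in the pilot implementation below), assuming a hypothetical input file with one year<TAB>CGPA record per line:

from mrjob.job import MRJob

class MaxCGPAByYear(MRJob):
    # Map: emit (year, cgpa) for every record
    def mapper(self, _, line):
        (year, cgpa) = line.split('\t')
        yield year, float(cgpa)

    # Reduce: keep the maximum CGPA seen for each year
    def reducer(self, year, cgpas):
        yield year, max(cgpas)

if __name__ == '__main__':
    MaxCGPAByYear.run()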
MapReduce Example
Figure: MapReduce logical data flow
Recent Developments
TonY (TensorFlow on YARN)
Hadoop Encryption
HDFS High Availability Enhancement
Ozone
Strengths and Weaknesses
Strengths
Varied data sources
Cost effective
Performance
Fault tolerant
High availability
Low network traffic
Scalable
Strengths and Weaknesses
Weaknesses
Issues with small files
Processing overhead
Supports only batch processing
Inefficient for iterative processing
Where is Hadoop used?
LinkedIn Assessment
Question Calibration
Pilot Implementation
UI (Ambari)
Loading the dataset into HDFS
Using Ambari
Loading the dataset into HDFS
Using Command Line Interface
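A minimal sketch of this step, wrapped in Python to match the rest of the code in the deck; it simply shells out to the standard hdfs dfs commands, and the local file name and HDFS target directory are assumptions to adjust for your cluster:

import subprocess

LOCAL_FILE = 'u.data'                 # local copy of the ratings file (assumed name)
HDFS_DIR = '/user/maria_dev/ml-100k'  # assumed target directory in HDFS

# Create the target directory, copy the file in, then list it to verify
subprocess.run(['hdfs', 'dfs', '-mkdir', '-p', HDFS_DIR], check=True)
subprocess.run(['hdfs', 'dfs', '-put', LOCAL_FILE, HDFS_DIR], check=True)
subprocess.run(['hdfs', 'dfs', '-ls', HDFS_DIR], check=True)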
MapReduce
Writing the Mapper
def mapper_get_ratings(self, _, line):
    # each input line is tab-separated: userID, movieID, rating, timestamp
    (userID, movieID, rating, timestamp) = line.split('\t')
    # emit the rating as the key, with a count of 1
    yield rating, 1
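The fields unpacked here match the tab-separated MovieLens 100k ratings file (u.data): user ID, movie ID, rating and timestamp. Only the rating is kept, emitted as a key with a count of 1.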
MapReduce
Writing the Reducer
def reducer_count_ratings(self, key, values):
    # sum the counts for each rating value
    yield key, sum(values)
MapReduce
Putting it all Together
from mrjob.job import MRJob
from mrjob.step import MRStep

class RatingsBreakdown(MRJob):
    def steps(self):
        # a single map-reduce step: count how many times each rating appears
        return [
            MRStep(mapper=self.mapper_get_ratings,
                   reducer=self.reducer_count_ratings)
        ]

    def mapper_get_ratings(self, _, line):
        (userID, movieID, rating, timestamp) = line.split('\t')
        yield rating, 1

    def reducer_count_ratings(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    RatingsBreakdown.run()
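Having steps() return a list with a single MRStep keeps the job to one map-reduce pass; a multi-stage job would simply return several MRStep objects in order.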
MapReduce
Running in Hadoop
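With mrjob, the same script can be submitted to the cluster by selecting the Hadoop runner, e.g. python RatingsBreakdown.py -r hadoop followed by the input path in HDFS (the exact path depends on where the dataset was loaded in the earlier step).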
Questions?