Outline
ØToday: MapReduce
§ Analytics programming interface
§ Word count example
§ Chaining
§ Reading/writing from/to HDFS
§ Dealing with failures
ØIn future: Spark
§ Motivation
§ Resilient Distributed Datasets (RDDs)
§ Programming interface
§ Transformations and actions
04/03/2023
Principles of Cloud Computing--S.A.Javadi 2
Large-scale Analytics
ØCompute the frequency of words in a corpus of documents.
ØCount how many times users have clicked on each of a (large) set of ads.
ØPageRank: Compute the “importance” of a web page based on the “importances” of the pages that link to it.
Option 1: SQL
Ø Before MapReduce, analytics were mostly done in SQL or manually.
Ø Example: Count word appearances in a corpus of documents.
Ø With SQL, the rough query would group words and count each group.
Ø Very expressive, convenient to program, but no one knew how to scale SQL execution!
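A word count like this hinges on SQL’s GROUP BY. A minimal sketch via Python’s sqlite3, using an illustrative `words` table with one row per word occurrence (the table name and schema are assumptions, not from the slide):

```python
# Hypothetical SQL word count: one row per word occurrence, then GROUP BY.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (doc_id INTEGER, word TEXT)")
rows = [(1, w) for w in "the teacher went to the store".split()]
conn.executemany("INSERT INTO words VALUES (?, ?)", rows)

# The GROUP BY does the aggregation the manual approach must build by hand.
counts = dict(conn.execute("SELECT word, COUNT(*) FROM words GROUP BY word"))
print(counts)  # e.g., {'store': 1, 'teacher': 1, 'the': 2, ...}
```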
Option 2: Manual
ØExample: Count word appearances in a corpus of documents.
Option 2: Manual
ØExample: Count word appearances in a corpus of documents.
ØPhase 1: Assign documents to different machines/nodes.
§ Each computes a dictionary: {word: local_freq}.
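Phase 1 can be sketched with `collections.Counter`; the document contents below are illustrative:

```python
# Phase 1 sketch: each node computes {word: local_freq} over its documents.
from collections import Counter

def local_word_freq(documents):
    counts = Counter()
    for doc in documents:
        counts.update(doc.split())  # tally each word in this document
    return dict(counts)

# One node's share of the corpus (made-up contents):
node1 = local_word_freq(["the store opens", "the store was closed"])
# node1 == {'the': 2, 'store': 2, 'opens': 1, 'was': 1, 'closed': 1}
```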
Option 2: Manual
ØExample: Count word appearances in a corpus of documents.
ØPhase 1: Assign documents to different machines/nodes.
§ Each computes a dictionary: {word: local_freq}.
ØPhase 2: Nodes exchange dictionaries (how?) to aggregate local_freq’s.
ØBut how to make this scale??
Option 2: Manual
ØPhase 2, Option a: Send all {word: local_freq} dictionaries to one node, who aggregates.
§ But what if it’s too much data for one node?
ØPhase 2, Option b: Each node sends (word, local_freq) to a designated node, e.g., the node with ID hash(word) % N.
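Option b can be sketched as follows. A deterministic byte-sum hash stands in for a real hash function here (Python’s built-in `hash` on strings is salted per process, so it can’t route consistently across nodes); `N = 4` is an illustrative cluster size:

```python
# Phase 2, Option b sketch: route each (word, local_freq) pair to the
# node with ID hash(word) % N, so all partial counts for a word meet
# at the same aggregator.
N = 4  # number of nodes (assumption)

def node_for(word):
    return sum(word.encode()) % N  # toy deterministic hash

local_freqs = {"the": 5, "store": 4, "opens": 2}
outboxes = {i: [] for i in range(N)}
for word, freq in local_freqs.items():
    outboxes[node_for(word)].append((word, freq))
# Every node routes "the" to the same destination, so one node sees
# every partial count for "the" and can sum them.
```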
We’ve roughly discovered an app-specific version of MapReduce!
Option 2: Challenges
ØHow to generalize to other applications?
ØHow to deal with failures?
ØHow to deal with slow nodes?
ØHow to deal with load balancing (some docs are very large, others small)?
ØHow to deal with skew (some words are very frequent, so nodes designated to aggregate them will be pounded)?
MapReduce
ØParallelizable programming model:
§ Applies to a broad class of analytics applications.
§ It isn’t as expressive as SQL, but it is easier to scale.
MapReduce
ØConsists of two phases, each intrinsically parallelizable:
§ Map: processes input elements independently to emit relevant (key, value) pairs from each.
• Transparently, the system groups all the values for each key together: (key, [list of values]).
§ Reduce: aggregates all the values for each key to emit a global value for each key.
How it works
ØInput: a collection of elements of (key, value) pair type.
ØProgrammer defines two functions:
§ Map(key, value) → 0, 1, or more (key’, value’) pairs
§ Reduce(key, value-list) → output
How it works
ØExecution:
§ Apply Map to each input key-value pair, in parallel for different keys.
§ Group and sort the emitted (key’, value’) pairs to produce (key’, value-list) pairs.
§ Apply Reduce to each (key’, value-list) pair, in parallel for different keys.
ØOutput is the union of all Reduce invocations’ outputs.
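The execution steps above can be condensed into a single-process sketch (a toy model of the model, not the distributed runtime):

```python
# Toy MapReduce engine: Map every input pair, group emitted pairs by
# key, then Reduce each (key, value-list).
from collections import defaultdict

def map_reduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)
    for key, value in inputs:
        for k2, v2 in map_fn(key, value):       # Map phase
            groups[k2].append(v2)               # group by emitted key
    return {k: reduce_fn(k, vs)                 # Reduce phase
            for k, vs in sorted(groups.items())}

# Word count expressed in this model:
def wc_map(doc_id, content):
    return [(w, 1) for w in content.split()]

def wc_reduce(word, counts):
    return sum(counts)

result = map_reduce([("D1", "the store opens the store")], wc_map, wc_reduce)
# result == {'opens': 1, 'store': 2, 'the': 2}
```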
Example: Word count
ØWe have a directory, which contains many documents.
ØThe documents contain words separated by whitespace and
punctuation.
ØGoal: Count the number of times each distinct word appears in the corpus.
Word count with MapReduce
Map(key, value):
// key: document ID; value: document content
FOR (each word w IN value)
emit(w, 1);

Reduce(key, value-list):
// key: a word; value-list: a list of integers
result = 0;
FOR (each integer v IN value-list)
result += v;
emit(key, result);
The emitted values are expected to all be 1’s, but “combiners” allow local summing of values with the same key before they are passed to reducers.
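The combiner optimization just mentioned can be sketched as follows: locally sum pairs with the same key before they cross the network, shrinking the shuffle.

```python
# Combiner sketch: (word, 1), (word, 1), ... -> (word, local_sum).
from collections import Counter

def combine(pairs):
    combined = Counter()
    for word, count in pairs:
        combined[word] += count  # sum locally, per key
    return list(combined.items())

mapper_output = [("the", 1), ("store", 1), ("the", 1), ("the", 1)]
print(combine(mapper_output))  # [('the', 3), ('store', 1)]
```

Four pairs shrink to two; the reducer’s logic is unchanged because it already sums whatever integers arrive.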
Map Phase
ØMapper is given key: document ID; value: document content, say:
(D1, “The teacher went to the store. The store was closed; the store opens in the morning. The store opens at 9am.”)
ØIt will emit the following pairs:
<The, 1> <teacher, 1> <went, 1> <to, 1> <the, 1> <store,1>
<the, 1> <store, 1> <was, 1> <closed, 1> <the, 1> <store,1>
<opens, 1> <in, 1> <the, 1> <morning, 1> <the, 1> <store, 1>
<opens, 1> <at, 1> <9am, 1>
Map Phase
ØNormally, there will be many documents, hence many Mappers emitting such pairs in parallel; but for simplicity, let’s say that these are all the pairs emitted from the Map phase.
Reduce Phase
ØFor each unique key emitted from the Map Phase, function Reduce(key, value-list) is invoked on Reducer 1 or Reducer 2.
ØAcross their invocations, these Reducers will emit one (word, total_count) pair per distinct word.
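Using exactly the pairs listed in the Map-phase example, the shuffle-then-reduce output can be sketched as:

```python
# Group the Map-phase example's pairs by key, then sum each group.
from collections import defaultdict

pairs = [("The", 1), ("teacher", 1), ("went", 1), ("to", 1), ("the", 1),
         ("store", 1), ("the", 1), ("store", 1), ("was", 1), ("closed", 1),
         ("the", 1), ("store", 1), ("opens", 1), ("in", 1), ("the", 1),
         ("morning", 1), ("the", 1), ("store", 1), ("opens", 1),
         ("at", 1), ("9am", 1)]

groups = defaultdict(list)
for word, one in pairs:
    groups[word].append(one)  # shuffle: (word, [1, 1, ...])

totals = {word: sum(ones) for word, ones in groups.items()}  # reduce
# totals['the'] == 5, totals['store'] == 4, totals['opens'] == 2
```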
Chaining MapReduce
ØThe programming model seems pretty restrictive.
ØBut quite a few analytics applications can be written with it,
especially with a technique called chaining.
ØIf the output of reducers is (key, value) pairs, then their output
can be passed onto other Map/Reduce processes.
ØThis chaining can support a variety of analytics (though certainly not all types; e.g., no iterative ML, because there are no loops).
Example: Word Frequency
ØSuppose instead of word count, we wanted to compute word
frequency: the probability that a word would appear in a
document.
ØThis means computing the fraction of times a word appears, out
of the total number of words in the corpus.
Solution: Chain two MapReduce’s
ØFirst Map/Reduce: Word Count (like before)
§ Map: process documents and output <word, 1> pairs.
§ Multiple Reducers: emit <word, word_count> for each word.
ØSecond MapReduce: ?
Solution: Chain two MapReduce’s
ØSecond MapReduce:
§ Map: processes <word, word_count> and outputs (1, (word, word_count)).
§ Reducer: does two passes:
• In the first pass, it sums up all word_count’s to calculate overall_count.
• In the second pass, it emits <word, word_count/overall_count> pairs.
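The second stage can be sketched as follows (toy single-process version):

```python
# Second MapReduce sketch: funnel all <word, word_count> pairs to the
# single key 1, so one reducer sees everything and can do two passes.
def freq_map(word, word_count):
    return [(1, (word, word_count))]

def freq_reduce(_key, pairs):
    overall = sum(count for _word, count in pairs)              # pass 1
    return [(word, count / overall) for word, count in pairs]   # pass 2

word_counts = [("the", 5), ("store", 4), ("opens", 1)]
values = [v for wc in word_counts for _k, v in freq_map(*wc)]
fractions = freq_reduce(1, values)
print(fractions)  # [('the', 0.5), ('store', 0.4), ('opens', 0.1)]
```

Note the design trade-off: routing everything to key 1 makes the lone reducer a bottleneck; it is tolerable here only because the <word, word_count> data is far smaller than the original corpus.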
Option 2: Challenges (recall)
ØHow to generalize to other applications?
§ See original MapReduce paper for more examples.
ØHow to deal with failures?
ØHow to deal with slow nodes?
ØHow to deal with load balancing?
ØHow to deal with skew?
ØThe MapReduce runtime tries to hide these challenges as much as
possible. We’ll talk about a few of the performance and fault
tolerance challenges here.
High-performance distributed file systems
and storage clouds
ØLustre
ØGoogle File System (GFS)
ØHadoop Distributed File System (HDFS)
ØAmazon Simple Storage Service (S3)
ØAnd many more
Hadoop Distributed File System (HDFS)
ØHDFS stores very large files (many terabytes and petabytes).
ØIt could store tens of millions of files.
ØIt can run on hundreds or thousands of commodity servers.
Hadoop Distributed File System (HDFS)
ØA general assumption in HDFS is that hardware failure is the norm, not the exception.
ØIt is well suited to big-data analytics:
§ It is optimized to write very big files once and to read them many times.
§ It is not suitable for random reads/writes.
Hadoop Distributed File System (HDFS)
ØAn HDFS file is a sequence of blocks stored across a cluster of multiple servers.
ØFault tolerance is at the block level.
ØHDFS blocks are big – 128 MB by default.
ØHDFS is designed for applications that stream through data sets sequentially; it is not suitable for small files or for random reads and writes.
ØHadoop moves the computation to the storage nodes:
§ This is the best approach when the computing programs are relatively small and the stored data is large.
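With 128 MB blocks, a file’s block count is just a ceiling division; a quick illustrative calculation:

```python
# How many 128 MB HDFS blocks does a file occupy?
BLOCK = 128 * 1024 * 1024  # default HDFS block size, in bytes

def num_blocks(file_size_bytes):
    return -(-file_size_bytes // BLOCK)  # ceiling division

print(num_blocks(1 * 1024**4))  # a 1 TB file -> 8192 blocks
```

This is why HDFS handles “very large files” well but “tens of millions of small files” poorly: each file costs at least one block’s worth of namenode metadata.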
HDFS Structure
ØAn HDFS cluster consists of namenode and datanode servers.
ØThe namenode holds the metadata: files, directories, file block locations, and so forth.
ØThe datanodes store the data blocks.
04/03/2023
Principles of Cloud Computing--S.A.Javadi 34
HDFS Architecture
[Figure: a Master Node runs the NameNode process holding the file-system metadata; Nodes 1..N each run a DataNode process that stores HDFS data blocks (B1, B2, …) on the local file system.]
Anatomy of File Read in HDFS
https://www.geeksforgeeks.org/anatomy-of-file-read-and-write-in-hdfs/
Anatomy of File Write in HDFS
https://www.geeksforgeeks.org/anatomy-of-file-read-and-write-in-hdfs/
HDFS Structure
ØThe client opens a file or directory using metadata from the namenode; after that, operations go directly to the datanodes.
ØReads access the datanodes directly, in sequential-access mode.
ØIf a read fails, the client falls back to another replica of the block.
ØWhen it has read all the data in a block, it picks a datanode holding a replica of the next block.
Data locality for performance
ØMaster scheduling policy:
§ Asks the DFS for the locations of the replicas of the input file’s blocks.
§ Schedules map tasks so that a replica of the input block is on the same machine, or at least on the same rack.
ØEffect: Thousands of machines read input at local-disk speed.
§ No need to transfer input data all over the cluster over the network: the network bottleneck is eliminated!
Data locality in Hadoop
ØData-local data locality in Hadoop
§ When the data is located on the same node as the mapper working on it, this is known as data-local data locality.
§ In this case, the computation runs right next to the data.
§ This is the most preferred scenario.
Ø…
https://data-flair.training/blogs/data-locality-in-hadoop-mapreduce/
Data locality in Hadoop
ØData-local data locality in Hadoop
ØIntra-rack data locality in Hadoop
§ It is not always possible to execute the mapper on the same datanode, due to resource constraints.
§ In such a case, it is preferred to run the mapper on a different node on the same rack.
Ø…
Data locality in Hadoop
ØData-local data locality in Hadoop
ØIntra-rack data locality in Hadoop
ØInter-rack data locality in Hadoop
§ Sometimes it is not possible to execute the mapper on a different node in the same rack, due to resource constraints.
§ In such a case, we execute the mapper on nodes on different racks.
§ This is the least preferred scenario.
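The three preference tiers above can be sketched as a scheduling function. The names and data structures are illustrative, not Hadoop’s actual scheduler API:

```python
# Three-tier placement preference: data-local > intra-rack > inter-rack.
def schedule(task_replicas, free_nodes, rack_of):
    replica_racks = {rack_of[n] for n in task_replicas}
    for node in free_nodes:
        if node in task_replicas:           # node holds a replica
            return node, "data-local"
    for node in free_nodes:
        if rack_of[node] in replica_racks:  # same rack as a replica
            return node, "intra-rack"
    return free_nodes[0], "inter-rack"      # last resort: any free node

rack_of = {"n1": "r1", "n2": "r1", "n3": "r2"}
# Replica lives on n1, but n1 is busy; n2 shares n1's rack:
print(schedule(["n1"], ["n2", "n3"], rack_of))  # ('n2', 'intra-rack')
```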
Heartbeats, replication for fault tolerance
ØFailures are the norm in data centers.
ØWorker failure:
§ The master detects failed workers by periodically pinging them (this is called a “heartbeat”).
§ It re-executes their in-progress map/reduce tasks.
Heartbeats, replication for fault tolerance
ØMaster failure:
§ Initially, the master was a single point of failure; recovery meant resuming from an execution log.
§ Subsequent versions used replication and consensus.
ØEffect: Google’s paper reports a run that once lost 1600 of 1800 machines, yet still finished fine.
Redundant execution for performance
ØSlow workers, or stragglers, significantly lengthen completion time; the slowest worker can determine the total latency!
§ This is why many systems measure 99th-percentile latency.
§ Causes: other jobs consuming resources on the machine, or bad disks with errors that transfer data very slowly.
ØSolution: spawn backup copies of tasks.
§ Whichever one finishes first “wins.”
§ I.e., treat slow executions as failed executions!
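The “backup copies, first one wins” idea can be sketched with `concurrent.futures`; the sleep times stand in for a fast worker and a straggler:

```python
# Speculative execution sketch: run the same task twice, keep whichever
# copy finishes first, and treat the slow copy as if it had failed.
import time
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

def task(delay, label):
    time.sleep(delay)  # simulated work; delay models worker speed
    return label

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(task, 0.01, "fast copy"),
               pool.submit(task, 1.0, "straggler copy")]
    done, _ = wait(futures, return_when=FIRST_COMPLETED)
    winner = done.pop().result()
    # Note: this toy still lets the straggler run to completion on pool
    # shutdown; a real runtime would cancel or ignore it.
print(winner)  # 'fast copy'
```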
Take-Aways
ØMapReduce is a programming model and runtime for scalable, fault-tolerant, and efficient analytics.
§ Well, some types of analytics; it is not very expressive.
ØIt hides common at-scale challenges under convenient
abstractions.
ØProgrammers need only implement Map and Reduce and the
system takes care of running those at scale on lots of data.
Take-Aways
ØFailures raise semantic/performance challenges for MapReduce,
which it typically handles through redundancy.
ØThere exist open-source implementations, including Hadoop and Flink.
ØSpark, a popular data analytics framework, supports MapReduce as well as a wider range of programming models.
Acknowledgement
ØSlides prepared based on:
§ Dr Geambasu’s lecture slides
• https://systems.cs.columbia.edu/ds1-class/
• https://systems.cs.columbia.edu/ds2-class/
§ Dr Mittal’s lecture slides
• https://courses.grainger.illinois.edu/ece428/sp2022/index.html
§ Dr Gupta’s lecture slides
• http://courses.engr.illinois.edu/cs425/