MapReduce Programming Model
OUTLINE
01 Introduction: Big Data and parallel processing
02 MapReduce: motivation, Map and Reduce, the sales example
03 The famous word-count example
04 MapReduce daemons in Hadoop: JobTracker and TaskTracker
05 Demo code: 1. ArraySum demo using Java; 2. word count in Hadoop using Python
06 Summary and conclusion
01 INTRODUCTION
Parallel Processing
A task is broken up into multiple parts by a software tool, and each part is distributed to a processor; each processor then performs its assigned part. Finally, the parts are reassembled to deliver the final solution or execute the task.
Reminder! Parallel processing is not multiprocessing.
Motivation

What is the proposed solution to deal with Big Data?
• Motivations
● Large-scale data processing on clusters
● Massively parallel (hundreds or thousands of CPUs)
● Reliable execution with easy data access
• Functions
● Fault-tolerance
● Status and monitoring tools
● A clean abstraction for programmers
Inspired by LISP: Functional Programming

Map and Reduce
Lisp map function
● Input parameters: a function and a set of values.
● The function is applied to each of the values.

Example:
(map 'length '(() (a) (ab) (abc)))
→ ((length '()) (length '(a)) (length '(ab)) (length '(abc)))
→ (0 1 2 3)

Lisp reduce function
● Given a binary function and a set of values,
● it combines all the values together using the binary function.

Example: use the + (add) function to reduce the list:
(reduce #'+ '(0 1 2 3))
→ 6
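For comparison, the same two steps in Python (an added sketch, not from the original deck; Python's built-in map and functools.reduce play the same roles as the Lisp functions):

from functools import reduce
import operator

# map: apply len to each value, like (map 'length ...) in Lisp
lengths = list(map(len, [(), ('a',), ('a', 'b'), ('a', 'b', 'c')]))
print(lengths)                        # [0, 1, 2, 3]

# reduce: combine the values with a binary function, like (reduce #'+ ...)
print(reduce(operator.add, lengths))  # 6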
02 MapReduce
Principle: instead of reading the file sequentially, it is divided into chunks that are read in parallel.

Example 1: calculate the total sales for the current year.

Solution
Instead of having one person go through the whole book, we hire several people: a first group, called the mappers, and a second group, called the reducers. We divide the book into several parts and give one part to each mapper.
[Diagram: each mapper emits intermediate (key, value) records; the shuffle & sort phase groups them into (key, values) pairs for the reducers, which produce the final results.]
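To make the diagram concrete, here is an added toy sketch in Python for the sales example (the record format, years, and amounts are invented for illustration):

from collections import defaultdict

# Each mapper gets one part of the "book" of sales records.
chunks = [
    [("2023", 100.0), ("2024", 250.0)],
    [("2024", 75.5), ("2023", 40.0)],
]

# Map: emit intermediate (key, value) = (year, amount) pairs.
intermediate = [pair for chunk in chunks for pair in chunk]

# Shuffle & sort: group the values by key, giving (key, values).
groups = defaultdict(list)
for year, amount in intermediate:
    groups[year].append(amount)

# Reduce: combine each group's values into the final results.
print({year: sum(amounts) for year, amounts in sorted(groups.items())})
# {'2023': 140.0, '2024': 325.5}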
03 The Famous Word-Count Example
Example 2: More Details
Input/output specification of the word-count MapReduce job

Input: a set of (key, value) pairs stored in files
● key: document ID
● value: a list of words, the content of each document

Output: a set of (key, value) pairs stored in files
● key: word ID
● value: the word's frequency across all documents

MapReduce function signatures:
map(String input_key, String input_value):
reduce(String output_key, Iterator intermediate_values):
Pseudo-code
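The pseudo-code appears on the original slide as an image; below is a runnable Python rendering of the same word-count logic (an added sketch: map_fn, reduce_fn, and the tiny driver are illustrative names, with the driver standing in for the framework's shuffle & sort):

from collections import defaultdict

def map_fn(document_id, document_text):
    # Emit an intermediate (word, 1) pair for every word in the document.
    return [(word, 1) for word in document_text.split()]

def reduce_fn(word, counts):
    # Sum the partial counts collected for one word.
    return (word, sum(counts))

# A tiny driver standing in for the framework: map, shuffle & sort, reduce.
documents = {"doc1": "the cat sat", "doc2": "the cat ran"}
groups = defaultdict(list)
for doc_id, text in documents.items():
    for word, one in map_fn(doc_id, text):
        groups[word].append(one)
print(sorted(reduce_fn(w, c) for w, c in groups.items()))
# [('cat', 2), ('ran', 1), ('sat', 1), ('the', 2)]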
04 MapReduce Daemons in Hadoop
“MapReduce has been implemented in many programming languages and frameworks, such as Apache Hadoop, Pig, Hive, etc.”
MapReduce daemons (a brief introduction for later use):
● JobTracker: divides the work among the mappers and reducers.
● TaskTracker: runs on each node to execute the actual MapReduce tasks.
05 Demo Code 1: Sum array elements using MapReduce
Map: split the array of 1,000 elements into 10 small chunks (each chunk holds 100 elements). Each chunk is processed by a separate thread concurrently: we get 10 threads, and each thread iterates over its 100 elements to produce the sum of those elements.

Reduce: takes the outputs of these 10 threads and sums them again to produce the final output.
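The demo itself is written in Java (see the project link below); as an added illustration, here is a minimal Python sketch of the same split/sum/combine idea:

from concurrent.futures import ThreadPoolExecutor

data = list(range(1, 1001))  # the array of 1000 elements
# Map side: split into 10 chunks of 100 elements each.
chunks = [data[i:i + 100] for i in range(0, len(data), 100)]

# Each chunk is summed by a separate thread from a pool of 10.
with ThreadPoolExecutor(max_workers=10) as pool:
    map_output = list(pool.map(sum, chunks))

# Reduce side: sum the 10 partial results to produce the final output.
print(sum(map_output))  # 500500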
Sum array elements using MapReduce with Java

Project structure: Main calls the map task and the reduce task to perform the MapReduce function.
Code walkthrough (map side):
● create a thread pool of 10
● split the array of 1,000 elements into chunks of 100 each
● save the task for each chunk in a queue
● save the map result of each chunk into mapOutput
Code walkthrough (reduce side):
● get the output of the map phase and aggregate the results
● for each element in mapOutput (the result of the previous map step), add it to the final sum

Source code: https://github.com/HabibaAbderrahim/thread_mapReduce
Demo Code 2: Word count using the Hadoop framework
Environment: pseudo-distributed mode (Hadoop 3.2.1)
PS: this is a pseudo environment that simulates a fully distributed environment, since we have one server / one PC.
● Java should be installed
● create a hadoop sudo user
● install Hadoop following the official website
● check that Hadoop is installed
Environment: pseudo-distributed mode, configuration files (Hadoop 3.2.1)
● set JAVA_HOME and HADOOP_HOME, and add the Java path
● HDFS (Hadoop Distributed File System) configuration: NameNode / DataNode / replication
● MapReduce configuration: MapReduce runs on YARN
● verify the Hadoop daemons
Environment: Hadoop 3.2.1, Python 3.5.1
We decided to work with Python just to test the Hadoop Streaming features.

Word count using MapReduce in Hadoop with Python
Environment: Hadoop 3.2.1, Python 3.5.1
Mapper and Reducer
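The mapper and reducer appear on the slides as screenshots; below is a minimal sketch of a typical Hadoop Streaming pair in Python (an added reconstruction, so the exact code on the slides may differ). The mapper reads lines from stdin and emits tab-separated (word, 1) pairs; the reducer relies on the shuffle & sort phase delivering all lines for a given word consecutively.

#!/usr/bin/env python
# mapper.py: emit (word, 1) for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print('%s\t%s' % (word, 1))

#!/usr/bin/env python
# reducer.py: sum the counts for each word; Hadoop's shuffle & sort
# guarantees all lines for a given word arrive consecutively.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.strip().split('\t', 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print('%s\t%s' % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print('%s\t%s' % (current_word, current_count))

A common way to test the pair locally before submitting the streaming job is the pipeline: cat data.txt | ./mapper.py | sort | ./reducer.py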
Environment: Hadoop 3.2.1, Python 3.5.1
Running the job:
● see what is inside our file data.txt
● word count on data.txt with MapReduce
● sort the results alphabetically
06 Conclusion
[References]
The ideas, concepts, and diagrams are taken from the following websites:
● http://www.metz.supelec.fr/metz/personnel/vialle/course/BigData-2A-CS/poly-pdf/Poly-chap6.pdf
● https://sites.cs.ucsb.edu/~tyang/class/240a17/slides/CS240TopicMapReduce.pdf
● https://fr.slideshare.net/LiliaSfaxi/bigdatachp2-hadoop-mapreduce
● https://algodaily.com/lessons/what-is-mapreduce-and-how-does-it-work
Thanks!
Do you have any questions?

Editor's Notes
● Reminder slide: parallel processing should not be confused with multiprocessing, where multiple processors or cores work on solving different tasks, instead of parts of the same task as in parallel processing.
● Motivation slide: before diving into detail, take a moment and ask yourself what the proposed solution might be.
● Map and Reduce slide: functional programming meets distributed computing; a batch data-processing system.
● ArraySum slide: in the traditional approach we would iterate over each element of the array and add it to produce the final sum.