Welcome to the MapReduce Session
MapReduce
TODAY’S CLASS
● Thinking in MapReduce
○ Word Frequency Problem
■ Solution 1 - Coding
■ Solution 2 - SQL
■ Solution 3 - Unix Pipes
■ Solution 4 - External Sort
● Map/Reduce Overview
● Visualisation
● Analogies to groupby
● Assignments
Understanding Sorting
MapReduce
BIG DATA PROBLEM - PROCESSING
Q: How fast can a 1 GHz processor sort 1 TB of data? This
data is made up of 10 billion 100-byte strings.
A: Around 6-10 hours
What's wrong with 6-10 hours? We need:
1. Faster sorting
2. Sorting of bigger data
3. Sorting more often
MapReduce
BIG DATA PROBLEM - PROCESSING
Google, 8 Sept, 2011:
Sorting 10PB took 6.5 hrs on 8000 computers
MapReduce
1. Every SQL query is impacted by sorting:
○ Where clause - index (sorting)
○ Group By - involves sorting
○ Joins - immensely enhanced by sorting
○ Distinct
○ Order By
2. Most algorithms depend on sorting
Why Sorting Is Such a Big Deal
MapReduce
• A programming paradigm
• Helps solve Big Data problems
• Specifically sorting-intensive or disk-read-intensive jobs
• You have to code two functions:
• Mapper - converts input into “key-value” pairs
• Reducer - aggregates all the values for a key
THINKING IN MAP / REDUCE
What is Map/Reduce?
MapReduce
• Also supported by many other systems such as
• MongoDB / CouchDB / Cassandra
• Apache Spark
• Mappers & reducers in Hadoop
• can be written in Java, Shell, Python or any binary
THINKING IN MAP / REDUCE
What is Map/Reduce?
MapReduce
MAP REDUCE
[Diagram: HDFS blocks feed Split 0, Split 1 and Split 2; each split goes through Map and Sort; the sorted outputs are copied and merged into Reduce, which writes Part 0 to HDFS.]
MapReduce
MAP REDUCE
[Illustration: CutIntoPieces() - the input data is cut into pieces for parallel processing.]
MapReduce
THINKING IN MAP / REDUCE
If you have a plain text file containing hundreds of text books (about 500 MB),
how would you find the frequencies of words?
MapReduce
THINKING IN MAP / REDUCE
If you have a plain text file of all the Lord of the Rings books, how
would you find the frequencies of words?
Approach 1 (Programmatic):
• Create a frequency hash table / dictionary
• For each word in the file
• Increase its frequency in the hash table
• When no more words are left in the file, print the hash table
Problems?
MapReduce
THINKING IN MAP / REDUCE
Problems?
[Flowchart: Start → initialize a dictionary/hashtable of (word, count) → read the next word from the file. If no word is left, print the words and counts and end. Otherwise, look the word up in the dictionary; if it does not exist, add it with count 0; then increase its count by 1 and read the next word.]
wordcount = {}
for word in file.read().split():
    if word not in wordcount:
        wordcount[word] = 0
    wordcount[word] += 1
for k, v in wordcount.items():
    print(k, v)
MapReduce
THINKING IN MAP / REDUCE
If you have a plain text file of all the Lord of the Rings books, how
would you find the frequencies of words?
Approach 1 (Programmatic):
• Create a frequency hash table / dictionary
• For each word in the file
• Increase its frequency in the hash table
• When no more words are left in the file, print the hash table
Problems?
Cannot process data beyond the RAM size.
MapReduce
THINKING IN MAP / REDUCE
If you have a plain text file of all the Lord of the Rings books, how
would you find the frequencies of words?
Approach 2 (SQL):
• Break the books into one word per line
• Insert one word per row in a database table
• Execute: select word, count(*) from table group by word
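Approach 2 can be sketched with an in-memory SQLite database in Python (the table name and sample text are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT)")

# one word per row, as if the book had been split one word per line
text = "sa re re sa ga"
conn.executemany("INSERT INTO words VALUES (?)",
                 [(w,) for w in text.split()])

# the GROUP BY does the counting, exactly as on the slide
for word, count in conn.execute(
        "SELECT word, COUNT(*) FROM words GROUP BY word ORDER BY word"):
    print(word, count)
```

The database engine handles the grouping, but as noted next, it still slows down once the data no longer fits in RAM.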
Understanding Unix Pipeline
MapReduce
Understanding Unix Pipeline
A program can take input from you.
MapReduce
Understanding Unix Pipeline
A program may also print some output
MapReduce
Understanding Unix Pipeline
command1 | command2
[Diagram: Command1 → pipe → Command2]
MapReduce
THINKING IN MAP / REDUCE
If you have the plain text file of all the Lord Of Rings books, how
would you find the frequencies of words?
Approach 3 (Unix):
• Replace every run of spaces/tabs with a newline
• Order the lines with the sort command
• Then find the frequencies using uniq, which
• scans from top to bottom and
• prints the count whenever the line value changes
cat myfile | sed -E 's/[\t ]+/\n/g' | sort -S 1g | uniq -c
MapReduce
THINKING IN MAP / REDUCE
Problems in Approach 2 (SQL) & Approach 3 (Unix)?
MapReduce
THINKING IN MAP / REDUCE
Problems in Approach 2 (SQL) & Approach 3 (Unix)?
The moment the data grows beyond RAM, the time taken
starts increasing. The following become bottlenecks:
• CPU
• Disk Speed
• Disk Space
MapReduce
THINKING IN MAP / REDUCE
Then?
Approach 4: Use an external sort.
• Split the file into pieces that fit in RAM
• Use the previous approaches (2 & 3) to find the frequencies of each piece
• Merge (sort -m) and sum up the frequencies
[Example: the Launcher splits the input across two machines. Machine 1 counts its split “sa, re, re” → re:2, sa:1. Machine 2 counts “ga, ga, re” → ga:2, re:1. The sorted counts are then merged: ga:2, re:3, sa:1.]
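The per-machine counting and the final merge-and-sum can be sketched in Python; `heapq.merge` plays the role of `sort -m` here, and the list contents follow the example above:

```python
import heapq
from itertools import groupby

# sorted (word, count) lists produced by the two machines
machine1 = [("re", 2), ("sa", 1)]
machine2 = [("ga", 2), ("re", 1)]

# merge the sorted lists (stays sorted by word), then sum counts per word
merged = heapq.merge(machine1, machine2)
totals = [(word, sum(c for _, c in group))
          for word, group in groupby(merged, key=lambda wc: wc[0])]
print(totals)  # [('ga', 2), ('re', 3), ('sa', 1)]
```

Because both inputs are already sorted, the merge is a single linear pass, which is exactly what makes this approach scale beyond RAM.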
MapReduce
• Takes O(n) time to merge sorted data
• Or the time is proportional to the number of
elements to be merged
THINKING IN MAP / REDUCE
Merging
MapReduce
Merging
Merge the two sorted queues to form another sorted queue:
• Compare the heads of the two queues
• Pick the smaller head and move it to the output
• If the heads are equal, pick both
• Compare the heads again and repeat
• When no one is left in the second queue, put the remaining items from the first
This merges the two queues into one.
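The walkthrough above can be written as a small two-pointer merge (the function name and sample lists are illustrative):

```python
def merge_two(a, b):
    """Merge two sorted lists into one sorted list."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        # compare the heads and pick the smaller;
        # if equal, this picks both, one from each queue, in two steps
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    # one queue is empty: put the remaining items from the other
    out.extend(a[i:])
    out.extend(b[j:])
    return out

print(merge_two([1, 3, 5, 9], [2, 3, 6]))  # [1, 2, 3, 3, 5, 6, 9]
```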
MapReduce
• For more than two lists
○ Use a min-heap of the current heads and repeatedly pop the smallest to the output
○ Or merge two lists at a time
Example: merging the sorted lists 1 4 6 | 9 10 12 | 6 7 8 | 8 9 9 | 3 5 7 | 5 10 17 - the smallest head, 1, goes to the output first.
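For the k-way case, Python's `heapq.merge` maintains exactly such a min-heap of the current heads; a sketch with the lists above:

```python
import heapq

lists = [[1, 4, 6], [9, 10, 12], [6, 7, 8],
         [8, 9, 9], [3, 5, 7], [5, 10, 17]]

# heapq.merge keeps a min-heap of the current heads internally,
# popping the smallest element to the output each step
merged = list(heapq.merge(*lists))
print(merged)
```

Each pop costs O(log k) for k lists, so merging n total elements costs O(n log k) instead of re-sorting everything.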
MapReduce
THINKING IN MAP / REDUCE
Problems with Approach 4?
[Same example as before: Machine 1 produces re:2, sa:1; Machine 2 produces ga:2, re:1; the launcher merges them into ga:2, re:3, sa:1.]
MapReduce
THINKING IN MAP / REDUCE
Problems with external sort?
Time is consumed in transporting the data.
+
For each requirement we would need to write a special-purpose, network-oriented program.
+
It would require a lot of engineering.
Solution?
Use Map/Reduce
MapReduce
• A programming paradigm
• Helps solve Big Data problems
• Specifically sorting-intensive or disk-read-intensive jobs
• You have to code two functions:
• Mapper - converts input into “key-value” pairs
• Reducer - aggregates all the values for a key
THINKING IN MAP / REDUCE
What is Map/Reduce?
MapReduce
• Also supported by many other systems such as
• MongoDB / CouchDB / Cassandra
• Apache Spark
• Mappers & reducers in Hadoop
• can be written in Java, Shell, Python or any binary
THINKING IN MAP / REDUCE
What is Map/Reduce?
MapReduce
EXAMPLE OF ONLY MAPPER
[Diagram: a directory of profile pictures in HDFS; Machine 1, Machine 2 and Machine 3 each run Function Mapper(Image), which converts an image to 100x100 pixels; the results go to an HDFS output directory of 100x100 px profile pictures.]
MapReduce
InputFormat
[Diagram: on a datanode, the InputFormat turns an HDFS block into an InputSplit of records (Record1, Record2, Record3). Map() is called once per record, and a single call may emit a key-value pair such as (key1, value1), several pairs, or nothing at all.]
MapReduce
With both mapper() & reducer() code
[Diagram: input read from HDFS, processed by map and reduce, output written back to HDFS]
MapReduce
MAP / REDUCE
Mapper/Reducer for the word frequency problem.

function map(line):
    foreach(word in line):
        print(word, 1)

Input (hdfs): "sa re re", "sa ga"
Mapper output: sa 1, re 1, re 1, sa 1, ga 1
MapReduce
MAP / REDUCE
Mapper/Reducer for the word frequency problem.

function map(line):
    foreach(word in line):
        print(word, 1)

function reduce(word, freqArray):
    return Array.sum(freqArray)

Input (hdfs): "sa re re", "sa ga"
Mapper output: sa 1, re 1, re 1, sa 1, ga 1
After sort/shuffle: ga [1], re [1, 1], sa [1, 1]
Reducer output: ga 1, re 2, sa 2
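This map → sort/shuffle → reduce flow can be simulated in plain Python (a local sketch of the idea, not how Hadoop actually executes it):

```python
from itertools import groupby

def map_fn(line):
    # emit a (word, 1) pair for every word in the line
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, freqs):
    # sum all the 1s collected for this word
    return sum(freqs)

lines = ["sa re re", "sa ga"]

# map phase
pairs = [kv for line in lines for kv in map_fn(line)]
# sort/shuffle phase: group the values by key
pairs.sort(key=lambda kv: kv[0])
result = {word: reduce_fn(word, [v for _, v in group])
          for word, group in groupby(pairs, key=lambda kv: kv[0])}
print(result)  # {'ga': 1, 're': 2, 'sa': 2}
```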
MapReduce
Mapper/Reducer for computing max temp

def mapp(line):
    (t, c, date) = line.split(",")
    print(c, t)

def reduce(key, values):
    return max(values)

Input (Temp, City, Date):
20, NYC, 2014-01-01
20, NYC, 2015-01-01
21, NYC, 2014-01-02
23, BLR, 2012-01-01
25, Seatle, 2016-01-01
21, CHICAGO, 2013-01-05
24, NYC, 2016-5-05

Mapper output: NYC 20, NYC 20, NYC 21, BLR 23, SEATLE 25, CHICAGO 21, NYC 24
After sort/shuffle: BLR [23], CHICAGO [21], NYC [20, 20, 21, 24], SEATLE [25]
Reducer output: BLR 23, CHICAGO 21, NYC 24, SEATLE 25
MapReduce
Mapper/Reducer for computing max temp along with its date

def mapp(line):
    (t, c, date) = line.split(",")
    print(c, (t, date))

def reduce(key, values):
    maxt = -19191919
    maxdate = ''
    for (t, d) in values:
        if t > maxt:
            maxt = t
            maxdate = d
    return (maxt, maxdate)

Input (Temp, City, Date):
20, NYC, 2014-01-01
20, NYC, 2015-01-01
21, NYC, 2014-01-02
23, BLR, 2012-01-01
25, Seatle, 2016-01-01
21, CHICAGO, 2013-01-05
24, NYC, 2016-5-05

Mapper output: NYC (20, 2014-01-01), NYC (20, 2015-01-01), NYC (21, 2014-01-02), BLR (23, 2012-01-01), SEATLE (25, 2016-01-01), CHICAGO (21, 2013-01-05), NYC (24, 2016-5-05)
After sort/shuffle: BLR [(23, 2012-01-01)], CHICAGO [(21, 2013-01-05)], NYC [(20, 2014-01-01), (20, 2015-01-01), (21, 2014-01-02), (24, 2016-5-05)], SEATLE [(25, 2016-01-01)]
Reducer output: BLR (23, 2012-01-01), CHICAGO (21, 2013-01-05), NYC (24, 2016-5-05), SEATLE (25, 2016-01-01)
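A runnable local sketch of this mapper/reducer, again simulating the shuffle in plain Python (temperatures are parsed as ints so the comparison is numeric):

```python
from collections import defaultdict

def map_fn(line):
    # emit (city, (temp, date)) for each input line
    t, c, d = (x.strip() for x in line.split(","))
    yield (c, (int(t), d))

def reduce_fn(city, values):
    # pick the (temp, date) pair with the highest temperature
    return max(values, key=lambda td: td[0])

lines = ["20, NYC, 2014-01-01", "21, NYC, 2014-01-02",
         "23, BLR, 2012-01-01", "24, NYC, 2016-5-05"]

groups = defaultdict(list)          # shuffle: group values by city
for line in lines:
    for city, value in map_fn(line):
        groups[city].append(value)

result = {city: reduce_fn(city, vals) for city, vals in groups.items()}
print(result)  # {'NYC': (24, '2016-5-05'), 'BLR': (23, '2012-01-01')}
```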
MapReduce
MAP / REDUCE
Analogous to Group By
function map(line):
    (temp, city, time) = line.split(",")
    print(city, temp)

function reduce(city, arr_temps):
    return max(arr_temps)

select city, max(temp)
from table
group by city
MapReduce
MAP / REDUCE
Analogous to Group By
function map(line):
    foreach(word in line):
        print(word, 1)

function reduce(word, freqArray):
    return Array.sum(freqArray)

select word, count(*)
from table
group by word
MapReduce
MAP REDUCE - Multiple Reducers
[Diagram: three HDFS blocks become Split 0, Split 1 and Split 2; each goes through Map and Sort. The sorted outputs are copied and merged into two reducers, which write Part 0 and Part 1 to HDFS. Keys such as Apple, Banana, Apricot and Carrots are partitioned across the two reducers.]
MapReduce
MAP REDUCE - Partitioning
Key k will go to this reducer: hashcode(k) % total_reducers
For example, with 4 reducers, keys whose hashcodes are 0, 4, 8, …, 3200 go to one reducer; 1, 5, 9, …, 3201 to another; 2, 6, 10, …, 3202 to the third; and 3, 7, 11, …, 3203 to the fourth.
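A sketch of hash partitioning in Python; `zlib.crc32` stands in for Hadoop's `hashCode`, since Python's built-in `hash` of strings is randomized per process:

```python
import zlib

def partition(key, total_reducers):
    # a deterministic hashcode(k) % total_reducers
    return zlib.crc32(key.encode()) % total_reducers

keys = ["Apple", "Banana", "Apricot", "Carrots"]
for k in keys:
    print(k, "-> reducer", partition(k, 4))
```

The important property is that the same key always lands on the same reducer, so all values for a key meet in one place.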
Thank you
MapReduce
Thank you.
Hadoop & Spark
support@knowbigdata.com
+1 419 665 3276 (US)
+91 803 959 1464 (IN)
Subscribe to our Youtube channel for latest videos -
https://www.youtube.com/channel/UCxugRFe5wETYA7nMH6VGyEA
MapReduce
MAP / REDUCE - RECAP
MapReduce
MAP / REDUCE
The data generated by the mapper is given to
reducer and then it is sorted / shuffled [Yes/No]?
MapReduce
MAP / REDUCE
The data generated by the mapper is given to
reducer and then it is sorted / shuffled [Yes/No]?
No. The mapper's output is first
shuffled/sorted and then given to the reducers.
MapReduce
MAP / REDUCE
The mapper can only generate a single key-value
pair for an input value [True/False]?
MapReduce
MAP / REDUCE
The mapper can only generate a single key-value
pair for an input value [True/False]?
False. A mapper can generate as many key-value pairs
as it wants for an input.
MapReduce
MAP / REDUCE
A mapper always has to generate at least one
key-value pair [Correct/Wrong]?
MapReduce
MAP / REDUCE
A mapper always generates at least one key-value
pair [Correct/Wrong]?
Wrong
MapReduce
MAP / REDUCE
By default there is only one reducer in case of
streaming job [Yes/No]?
MapReduce
MAP / REDUCE
By default there is only one reducer in case of
streaming job [Yes/No]?
Yes. By default there is a single reducer, but the
number of reducers can be changed with the option
mapred.reduce.tasks (mapreduce.job.reduces in newer Hadoop versions).
MapReduce
MAP / REDUCE
In hadoop 1.0, What is the role of job tracker?
A: Executing the Map/Reduce Logic
B: Delegating the Map/Reduce logic to the task
tracker.
MapReduce
MAP / REDUCE
What is the role of job tracker?
A: Executing the Map/Reduce Logic
B: Delegating the Map/Reduce logic to the task
tracker.
B.
MapReduce
MAP / REDUCE
Q: The Map logic is executed preferably on the
nodes that have the required data [Yes/No]?
MapReduce
MAP / REDUCE
Q: The Map logic is executed preferably on the
nodes that have the required data [Yes/No]?
Yes.
MapReduce
MAP / REDUCE
Q: The Map logic is always executed on the nodes
that have the required data [Correct/Wrong]?
MapReduce
MAP / REDUCE
Wrong
Q: The Map logic is always executed on the nodes
that have the required data [Correct/Wrong]?
MapReduce
MAP / REDUCE
Where does Hadoop Store the result of reducer?
In HDFS or Local File System?
MapReduce
MAP / REDUCE
In HDFS.
Where does Hadoop Store the result of reducer?
In HDFS or Local File System?
MapReduce
MAP / REDUCE
Where does Hadoop Store the intermediate data
such as output of Map Tasks?
In HDFS or Local File System or Memory?
MapReduce
MAP / REDUCE
First in memory, then spilled to the
local file system.
The mapper's output is saved directly in HDFS only if
there is no reduce phase.
Where does Hadoop store the intermediate data
such as the output of map tasks?
In HDFS, the local file system, or memory?
MapReduce
MAP / REDUCE Assignment For Tomorrow
1. Frequencies of letters [a-z] - Do you need Map/Reduce?
2. Find anagrams in a huge text. An anagram is basically a
different arrangement of the letters of a word. An anagram
need not have a meaning.
Input:
“the cat act in tic tac toe”
Output:
cat, tac, act
the
toe
in
tic
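One way to start thinking about assignment 2 in map/reduce terms (a hint, not the official solution): let the mapper emit a word's sorted letters as the key, so all anagrams meet at the same reducer. A local sketch:

```python
from itertools import groupby

def map_fn(word):
    # anagrams share the same sorted-letter signature, e.g. cat/act/tac -> "act"
    yield ("".join(sorted(word)), word)

words = "the cat act in tic tac toe".split()

# map, then shuffle by sorting on the signature key
pairs = sorted(kv for w in words for kv in map_fn(w))

# reduce: collect the distinct words behind each signature
groups = [sorted(set(w for _, w in g))
          for _, g in groupby(pairs, key=lambda kv: kv[0])]
print(groups)
```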
MapReduce
MAP / REDUCE
3a. A file contains the DNA sequences of people. Find all the
people who have the same DNA.
Input:
“User1 ACGT”
“User2 TGCA”
“User3 ACG”
“User4 ACGT”
“User5 ACG”
“User6 AGCT”
Output:
User1, User4
User2
User3, User5
User6
Assignment For Tomorrow
MapReduce
MAP / REDUCE Assignment For Tomorrow
3b. A file contains the DNA sequences of people. Find all the
people who have the same DNA or its mirror image (reverse).
Input:
“User1 ACGT”
“User2 TGCA”
“User3 ACG”
“User4 ACGT”
“User5 ACG”
“User6 ACCT”
Output:
User1, User2, User4
User3, User5
User6
MapReduce
MAP / REDUCE Assignment For Tomorrow
4. In an unusual democracy, everyone is not equal: a vote counts as
much as the worth of the voter, though anyone may vote for anyone.
For example, if A with a worth of 5 and B with a worth of 1 vote
for C, the vote count of C is 6.
You are given a list of people with the worth of their vote, and
another list describing who voted for whom. Find the vote count of everyone.

List1 (Voter → Votee):
A → C
B → C
C → F

List2 (Person: Worth):
A: 5
B: 1
C: 11

Result (Person: VoteCount):
A: 0
B: 0
C: 6
F: 11
MapReduce
JOB TRACKER
[Diagram slides: Job Tracker - detailed and continued views.]
MapReduce
QUICK - CLUSTER HANDS ON
MapReduce Command
The Example is available here
Remove old output directory
hadoop fs -rm -r /user/student/wordcount/output
Execute the MapReduce command:
hadoop jar /usr/hdp/2.3.4.0-3485/hadoop-mapreduce/hadoop-mapreduce-examples.jar
wordcount /data/mr/wordcount/input mrout

MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
