Welcome to MapReduce Session
MapReduce
TODAY’S CLASS
● Thinking in MapReduce
○ Word Frequency Problem
■ Solution 1 - Coding
■ Solution 2 - SQL
■ Solution 3 - Unix Pipes
■ Solution 4 - External Sort
● Map/Reduce Overview
● Visualisation
● Analogies to groupby
● Assignments
Understanding Sorting
MapReduce
BIG DATA PROBLEM - PROCESSING
Q: How fast can a 1 GHz processor sort 1 TB of data? This
data is made up of 10 billion strings of 100 bytes each.
A: Around 6-10 hours
What's wrong with 6-10 hours? We need:
1. Faster sorting
2. Sorting of bigger data
3. Sorting more often
MapReduce
BIG DATA PROBLEM - PROCESSING
Google, 8 Sept, 2011:
Sorting 10PB took 6.5 hrs on 8000 computers
MapReduce
Why sorting is such a big deal
1. Every SQL query is impacted by sorting:
○ Where clause - Index (Sorting)
○ Group By - Involves sorting
○ Joins - immensely enhanced by sorting
○ Distinct
○ Order By
2. Most algorithms depend on sorting
MapReduce
THINKING IN MAP / REDUCE
What is Map/Reduce?
• Programming paradigm
• To help solve Big Data problems
• Specifically sorting-intensive or disk-read-intensive jobs
• You would have to code two functions:
• Mapper - Converts input into “key - value” pairs
• Reducer - Aggregates all the values for a key
MapReduce
THINKING IN MAP / REDUCE
What is Map/Reduce?
• Also supported by many other systems such as
• MongoDB / CouchDB / Cassandra
• Apache Spark
• Mappers & Reducers in Hadoop
• can be written in Java, Shell, Python or any binary
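As a rough illustration of how a mapper and a reducer might look when written in Python, here is a minimal sketch. It assumes the common line-oriented convention of tab-separated "key<TAB>value" text and a reducer that receives its input already sorted by key. The function names and the sample sentences are hypothetical. In a real streaming job each function would read from standard input and print to standard output; here they take and return plain lists so the sketch can be run locally.

```python
from itertools import groupby

def mapper(lines):
    """Convert input lines into 'word<TAB>1' key-value lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Aggregate all the values for a key; input must be sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Local dry run: map, sort (standing in for the framework), then reduce.
mapped = list(mapper(["to be or", "not to be"]))
reduced = list(reducer(sorted(mapped)))
print(reduced)
```

The `sorted(mapped)` call stands in for the sort step that the framework performs between the two functions.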
MapReduce
MAP REDUCE
[Diagram: three HDFS blocks become Split 0, Split 1 and Split 2. Each split goes through Map and then Sort. The sorted map outputs are copied to the reducer and merged; Reduce writes Part 0 to HDFS.]
MapReduce
MAP REDUCE
CutIntoPieces()
MapReduce
THINKING IN MAP / REDUCE
If you have a plain text file containing hundreds of textbooks [500 MB],
how would you find the frequencies of words?
MapReduce
THINKING IN MAP / REDUCE
If you have a plain text file of all The Lord of the Rings books, how
would you find the frequencies of words?
Approach 1 (Programmatic):
• Create a frequency hash table / dictionary
• For each word in the files
• Increase its frequency in the hash table
• When no more words left in file, print the hash table
Problems?
MapReduce
THINKING IN MAP / REDUCE
Problems?
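Flowchart (with the matching code lines):
Start
1. Initialize a dictionary or hashtable (word, count) [code line 1]
2. Read the next word from the file [code line 2]
3. Is any word left? If not, go to step 7 [code line 2]
4. Find the word in the dictionary. Does the word exist in the dictionary? [code line 3]
5. If not, add the new word with count as 0 [code line 4]
6. Increase the count by 1, then go back to step 2 [code line 5]
7. Print the words and counts [code lines 6 & 7]
End
Code:
1. wordcount = {}
2. for word in open("myfile").read().split():
3.     if word not in wordcount:
4.         wordcount[word] = 0
5.     wordcount[word] += 1
6. for k, v in wordcount.items():
7.     print(k, v)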
MapReduce
THINKING IN MAP / REDUCE
If you have a plain text file of all The Lord of the Rings books, how
would you find the frequencies of words?
Approach 1 (Programmatic):
• Create a frequency hash table / dictionary
• For each word in the file
• Increase its frequency in the hash table
• When no more words left in file, print the hash table
Problems?
Cannot process data larger than the RAM size.
MapReduce
THINKING IN MAP / REDUCE
If you have a plain text file of all The Lord of the Rings books, how
would you find the frequencies of words?
Approach 2 (SQL):
• Break the books into one word per line
• Insert one word per row into a database table
• Execute: select word, count(*) from table group by word
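Approach 2 can be tried end to end with Python's built-in `sqlite3` module. This is only a minimal sketch: the table name `words`, the in-memory database and the sample text are assumptions chosen for illustration.

```python
import sqlite3

def word_frequencies_sql(text):
    """Count words with SQL GROUP BY, using an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE words (word TEXT)")
    # Break the text into one word per row.
    conn.executemany("INSERT INTO words (word) VALUES (?)",
                     [(w,) for w in text.split()])
    rows = conn.execute(
        "SELECT word, COUNT(*) FROM words GROUP BY word ORDER BY word"
    ).fetchall()
    conn.close()
    return rows

print(word_frequencies_sql("sa re re sa ga"))
```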
Understanding Unix Pipeline
MapReduce
Understanding Unix Pipeline
A program can take input from you.
MapReduce
Understanding Unix Pipeline
A program may also print some output
MapReduce
Understanding Unix Pipeline
command1 | command2
Command1 Command2Pipe
MapReduce
THINKING IN MAP / REDUCE
If you have a plain text file of all The Lord of the Rings books, how
would you find the frequencies of words?
Approach 3 (Unix):
• Replace space with a newline
• Order lines with a sort command
• Then find frequencies using uniq
• Scans from top to bottom
• prints the count when line value changes
cat myfile | sed -E 's/[\t ]+/\n/g' | sort -S 1g | uniq -c
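The sort-then-scan idea behind `sort | uniq -c` can be sketched in Python as well. This is an illustrative equivalent, not what the Unix tools do internally; the function name is hypothetical.

```python
from itertools import groupby

def sort_uniq_count(text):
    """Mimic `sort | uniq -c`: sort the words, then count each run of equal words."""
    words = sorted(text.split())  # the `sort` step
    # groupby scans from top to bottom and starts a new group whenever the
    # value changes, which is how adjacent duplicates get counted.
    return [(len(list(run)), word) for word, run in groupby(words)]

print(sort_uniq_count("sa re re sa ga"))
```

Without the sort step, equal words that are not adjacent would be counted as separate runs, which is why sorting comes first.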
MapReduce
THINKING IN MAP / REDUCE
Problems in Approach 2 (SQL) & Approach 3 (Unix)?
MapReduce
THINKING IN MAP / REDUCE
Problems in Approach 2 (SQL) & Approach 3 (Unix)?
The moment the data starts going beyond RAM the time taken
starts increasing. The following become bottlenecks:
• CPU
• Disk Speed
• Disk Space
MapReduce
THINKING IN MAP / REDUCE
Then?
Approach 4: Use an external sort.
• Split the files to a size that fits RAM
• Use the previous approaches (2&3) to find freq
• Merge (sort -m) and sum-up frequencies
Launcher
Machine 1: sa, re, re → re:2, sa:1
Machine 2: ga, ga, re → ga:2, re:1
merge → ga:2, re:3, sa:1
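The split, count and merge steps of Approach 4 can be sketched as follows. This is a single-process illustration under simplifying assumptions: each "machine" is just a function call, and the helper names `count_chunk` and `merge_counts` are hypothetical.

```python
import heapq
from collections import Counter

def count_chunk(words):
    """Count one chunk that fits in RAM; return (word, count) pairs sorted by word."""
    return sorted(Counter(words).items())

def merge_counts(sorted_partials):
    """Merge sorted (word, count) lists and sum up the counts of equal words."""
    merged = []
    for word, count in heapq.merge(*sorted_partials):
        if merged and merged[-1][0] == word:
            merged[-1] = (word, merged[-1][1] + count)
        else:
            merged.append((word, count))
    return merged

machine1 = count_chunk(["sa", "re", "re"])
machine2 = count_chunk(["ga", "ga", "re"])
print(merge_counts([machine1, machine2]))
```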
MapReduce
THINKING IN MAP / REDUCE
Merging
• Takes O(n) time to merge sorted data
• That is, the time is proportional to the number of
elements to be merged
MapReduce
Merging
Merge the two sorted queues to
form another sorted queue
MapReduce
Merging
Compare the heads
MapReduce
Merging
Pick shorter
MapReduce
Merging
Pick shorter
MapReduce
Merging
Compare the heads
again
MapReduce
Merging
Pick shorter
MapReduce
Merging
Compare the heads
again
MapReduce
Merging
Pick both if equal
MapReduce
Merging
Compare the heads
again
MapReduce
Merging
Pick shorter
MapReduce
Merging
Compare the heads
again
MapReduce
Merging
Pick shorter
MapReduce
Merging
Compare the heads
again
MapReduce
Merging
Pick shorter
MapReduce
Merging
Since no one is left on
second queue.
Put remaining from
first
MapReduce
Merging
This merges the two
queues into one
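The queue walkthrough above corresponds to the classic two-way merge. A minimal sketch, using plain Python lists in place of the queues:

```python
def merge_two(first, second):
    """Merge two already-sorted lists into one sorted list in O(n) time."""
    merged = []
    i = j = 0
    while i < len(first) and j < len(second):
        # Compare the heads and pick the smaller one.
        if first[i] <= second[j]:
            merged.append(first[i])
            i += 1
        else:
            merged.append(second[j])
            j += 1
    # One list is exhausted: put the remaining elements from the other.
    merged.extend(first[i:])
    merged.extend(second[j:])
    return merged

print(merge_two([1, 4, 6], [3, 5, 7]))
```

Each element is appended exactly once, so the running time is proportional to the total number of elements.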
MapReduce
THINKING IN MAP / REDUCE
Merging
• For more than two lists
○ Use min-heap
Sorted input lists:
1 4 6
9 10 12
6 7 8
8 9 9
3 5 7
5 10 17
To the output (first element): 1
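A min-heap based merge of several sorted lists can be sketched with Python's `heapq` module. The heap holds only the current head of each list, so the smallest remaining element is always on top. The function name `merge_k` is hypothetical; the input lists are the ones shown on the slide.

```python
import heapq

def merge_k(sorted_lists):
    """Merge any number of sorted lists using a min-heap of the current heads."""
    # Each heap entry is (value, list index, position in that list).
    heap = [(lst[0], idx, 0) for idx, lst in enumerate(sorted_lists) if lst]
    heapq.heapify(heap)
    out = []
    while heap:
        value, idx, pos = heapq.heappop(heap)
        out.append(value)
        if pos + 1 < len(sorted_lists[idx]):
            # Replace the popped head with the next element of the same list.
            heapq.heappush(heap, (sorted_lists[idx][pos + 1], idx, pos + 1))
    return out

lists = [[1, 4, 6], [9, 10, 12], [6, 7, 8], [8, 9, 9], [3, 5, 7], [5, 10, 17]]
print(merge_k(lists))
```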
MapReduce
THINKING IN MAP / REDUCE
Merging
• For more than two lists
○ Or merge two at a time
MapReduce
THINKING IN MAP / REDUCE
Problems with Approach 4?
Launcher
Machine 1: sa, re, re → re:2, sa:1
Machine 2: ga, ga, re → ga:2, re:1
merge → ga:2, re:3, sa:1
MapReduce
THINKING IN MAP / REDUCE
Problems with external sort?
Time is consumed in transporting data.
+
For each requirement we would need to write a
special-purpose, network-oriented program.
+
It would require a lot of engineering.
Solution?
Use Map/Reduce
MapReduce
THINKING IN MAP / REDUCE
What is Map/Reduce?
• Programming paradigm
• To help solve Big Data problems
• Specifically sorting-intensive or disk-read-intensive jobs
• You would have to code two functions:
• Mapper - Converts input into “key - value” pairs
• Reducer - Aggregates all the values for a key
MapReduce
THINKING IN MAP / REDUCE
What is Map/Reduce?
• Also supported by many other systems such as
• MongoDB / CouchDB / Cassandra
• Apache Spark
• Mappers & Reducers in Hadoop
• can be written in Java, Shell, Python or any binary
MapReduce
EXAMPLE OF ONLY MAPPER
Directory Of Profile Pictures in HDFS
Machine 1, Machine 2 and Machine 3 each run:
Function Mapper (Image):
Convert image
to 100x100 pixel
HDFS - Output Directory Of 100x100px Profile Pictures
MapReduce
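[Diagram: on a datanode, the InputFormat turns an HDFS block (Block1) into an input split and reads it as records (Record1, Record2, Record3). The mapper calls Map() once per record. A call may emit key-value pairs such as (key1, value1), (key2, value2), (key3, value3), or nothing.]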
MapReduce
With Both mapper() & Reducer() code
HDFS
Input
HDFS
MapReduce
MAP / REDUCE
Mapper/Reducer for word frequency problem.
function map(line):
    foreach(word in line) :
        print(word, 1);
Input (hdfs):
sa re re
sa ga
Mapper output:
sa 1
re 1
re 1
sa 1
ga 1
MapReduce
MAP / REDUCE
Mapper/Reducer for word frequency problem.
function map(line):
    foreach(word in line) :
        print(word, 1);
function reduce(word, freqArray):
    return Array.sum(freqArray);
Input (hdfs):
sa re re
sa ga
Mapper output:
sa 1
re 1
re 1
sa 1
ga 1
After sort/shuffle (input to reduce):
ga [1]
re [1, 1]
sa [1, 1]
Reducer output:
ga 1
re 2
sa 2
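The whole flow for the word frequency problem (map, then sort/shuffle, then reduce) can be simulated in a single Python process. This is a sketch for building intuition only: `run_job` is a hypothetical stand-in for what the framework does across machines.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(line):
    """Emit (word, 1) for every word in the line."""
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, freqs):
    """Sum all the counts collected for one word."""
    return (word, sum(freqs))

def run_job(lines):
    # Map phase: apply map_fn to every input line.
    pairs = [pair for line in lines for pair in map_fn(line)]
    # Sort/shuffle phase: bring equal keys next to each other.
    pairs.sort(key=itemgetter(0))
    # Reduce phase: call reduce_fn once per key with all of its values.
    return [reduce_fn(word, [value for _, value in group])
            for word, group in groupby(pairs, key=itemgetter(0))]

print(run_job(["sa re re", "sa ga"]))
```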
MapReduce
Mapper/Reducer for computing max temp
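def mapp(line):
    # Each input line is "temp, city, date"
    (t, c, time) = [x.strip() for x in line.split(",")]
    print(c.upper(), int(t))
def reduce(key, values):
    # values holds every temperature seen for this city
    return max(values)
Input (Temp, City, Date):
20, NYC, 2014-01-01
20, NYC, 2015-01-01
21, NYC, 2014-01-02
23, BLR, 2012-01-01
25, Seattle, 2016-01-01
21, CHICAGO, 2013-01-05
24, NYC, 2016-5-05
Mapper output:
NYC 20
NYC 20
NYC 21
BLR 23
SEATTLE 25
CHICAGO 21
NYC 24
After sort/shuffle (input to reduce):
BLR 23
CHICAGO 21
NYC 20,20,21,24
SEATTLE 25
Reducer output:
BLR 23
CHICAGO 21
NYC 24
SEATTLE 25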
MapReduce
Mapper/Reducer for computing max temp
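def mapp(line):
    (t, c, date) = [x.strip() for x in line.split(",")]
    print(c.upper(), (int(t), date))
def reduce(key, values):
    maxt = -19191919  # sentinel lower than any real temperature
    date = ''
    for i in values:
        T = i[0]
        if T > maxt:
            maxt, date = T, i[1]
    return (maxt, date)
Input (Temp, City, Date):
20, NYC, 2014-01-01
20, NYC, 2015-01-01
21, NYC, 2014-01-02
23, BLR, 2012-01-01
25, Seattle, 2016-01-01
21, CHICAGO, 2013-01-05
24, NYC, 2016-5-05
Mapper output:
NYC (20, 2014-01-01)
NYC (20, 2015-01-01)
NYC (21, 2014-01-02)
BLR (23, 2012-01-01)
SEATTLE (25, 2016-01-01)
CHICAGO (21, 2013-01-05)
NYC (24, 2016-5-05)
After sort/shuffle (input to reduce):
BLR (23, 2012-01-01)
CHICAGO (21, 2013-01-05)
NYC (20, 2014-01-01), (20, 2015-01-01), (21, 2014-01-02), (24, 2016-5-05)
SEATTLE (25, 2016-01-01)
Reducer output:
BLR (23, 2012-01-01)
CHICAGO (21, 2013-01-05)
NYC (24, 2016-5-05)
SEATTLE (25, 2016-01-01)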
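To check the max-temperature job by hand, the mapper and reducer can be run locally on the slide's records. This is a minimal sketch: `run_job` is a hypothetical driver that groups values by key in a dictionary instead of a distributed sort/shuffle, and the city name is upper-cased and whitespace is stripped as an assumption about how the keys are normalized.

```python
from collections import defaultdict

def map_fn(line):
    """Emit (city, (temp, date)) for one 'temp, city, date' record."""
    temp, city, date = [field.strip() for field in line.split(",")]
    yield (city.upper(), (int(temp), date))

def reduce_fn(city, values):
    """Keep the (temp, date) pair with the highest temperature."""
    return (city, max(values, key=lambda v: v[0]))

def run_job(lines):
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)  # stands in for sort/shuffle
    return [reduce_fn(key, groups[key]) for key in sorted(groups)]

records = [
    "20, NYC, 2014-01-01",
    "20, NYC, 2015-01-01",
    "21, NYC, 2014-01-02",
    "23, BLR, 2012-01-01",
    "25, Seattle, 2016-01-01",
    "21, CHICAGO, 2013-01-05",
    "24, NYC, 2016-5-05",
]
for city, (temp, date) in run_job(records):
    print(city, temp, date)
```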
MapReduce
MAP / REDUCE
Analogous to Group By
function map(line):
    (temp, city, time) = line.split(",")
    print(city, temp)
function reduce(city, arr_temps):
    return max(arr_temps);
select city, max(temp) from table group by city
MapReduce
MAP / REDUCE
Analogous to Group By
function map(input):
    foreach(word in input) :
        print(word, 1);
function reduce(word, freqArray):
    return Array.sum(freqArray);
select word, count(*) from table group by word
MapReduce
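MAP REDUCE - Multiple Reducers
[Diagram: three HDFS blocks become Split 0, Split 1 and Split 2. Each split goes through Map and then Sort. The sorted map outputs are copied to two reducers and merged; the reducers write Part 0 and Part 1 to HDFS. Example keys shown: Apple, Banana, Apricot, Carrots.]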
MapReduce
MAP REDUCE - Partitioning
Keys assigned to each reducer:
Reducer 0   Reducer 1   Reducer 2   Reducer 3
0           1           2           3
4           5           6           7
8           9           10          11
...         ...         ...         ...
3200        3201        3202        3203
Key k will go to this reducer: hashcode(k) % total_reducers
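The partitioning rule can be tried out with integer keys. In this sketch Python's built-in `hash` stands in for `hashcode`; for small non-negative integers it returns the integer itself, whereas string hashes can differ between runs. The helper names are hypothetical.

```python
def partition(key, total_reducers):
    """Return the index of the reducer that receives this key."""
    return hash(key) % total_reducers

def assign(keys, total_reducers):
    """Group keys into one bucket per reducer."""
    buckets = [[] for _ in range(total_reducers)]
    for key in keys:
        buckets[partition(key, total_reducers)].append(key)
    return buckets

buckets = assign(range(3204), 4)
for reducer_id, bucket in enumerate(buckets):
    print("Reducer", reducer_id, bucket[:3], "...", bucket[-1])
```

Because every occurrence of a key hashes to the same value, all values for that key end up at the same reducer.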
Thank you
MapReduce
Thank you.
Hadoop & Spark
support@knowbigdata.com
+1 419 665 3276 (US)
+91 803 959 1464 (IN)
Subscribe to our Youtube channel for latest videos -
https://www.youtube.com/channel/UCxugRFe5wETYA7nMH6VGyEA
MapReduce
MAP / REDUCE - RECAP
MapReduce
MAP / REDUCE
The data generated by the mapper is given to the
reducer and then it is sorted / shuffled [Yes/No]?
No. The output of the mapper is first
shuffled/sorted and then given to the reducers.
MapReduce
MAP / REDUCE
The mapper can only generate a single key-value
pair for an input value [True/False]?
False. A mapper can generate as many key-value pairs
as it wants for an input.
MapReduce
MAP / REDUCE
A mapper always has to generate at least one
key-value pair [Correct/Wrong]?
Wrong. A mapper may emit nothing for an input.
MapReduce
MAP / REDUCE
By default there is only one reducer in case of a
streaming job [Yes/No]?
Yes. By default there is a single reducer, but the work
can be split across more reducers by specifying the command-line option
mapred.reduce.tasks.
MapReduce
MAP / REDUCE
In Hadoop 1.0, what is the role of the job tracker?
A: Executing the Map/Reduce logic
B: Delegating the Map/Reduce logic to the task
trackers
Answer: B.
MapReduce
MAP / REDUCE
Q: The Map logic is executed preferably on the
nodes that have the required data [Yes/No]?
Yes.
MapReduce
MAP / REDUCE
Q: The Map logic is always executed on the nodes
that have the required data [Correct/Wrong]?
Wrong
MapReduce
MAP / REDUCE
Where does Hadoop store the result of the reducer?
In HDFS or the local file system?
In HDFS.
MapReduce
MAP / REDUCE
Where does Hadoop store the intermediate data
such as the output of Map tasks?
In HDFS, the local file system, or memory?
First in memory, and then spilled to the
local file system.
The output of the mapper is saved in HDFS directly only if
there is no reduce phase.
MapReduce
MAP / REDUCE Assignment For Tomorrow
1. Frequencies of letters [a-z] - Do you need Map/Reduce?
2. Find anagrams in a huge text. An anagram is basically a
different arrangement of the letters in a word. An anagram does not
need to have a meaning.
Input:
“the cat act in tic tac toe”
Output:
cat, tac, act
the
toe
in
tic
MapReduce
MAP / REDUCE Assignment For Tomorrow
3a. A file contains the DNA sequences of people. Find all the
people who have the same DNA.
Input:
“User1 ACGT”
“User2 TGCA”
“User3 ACG”
“User4 ACGT”
“User5 ACG”
“User6 AGCT”
Output:
User1, User4
User2
User3, User5
User6
MapReduce
MAP / REDUCE Assignment For Tomorrow
3b. A file contains the DNA sequences of people. Find all the
people who have the same DNA or its mirror image.
Input:
“User1 ACGT”
“User2 TGCA”
“User3 ACG”
“User4 ACGT”
“User5 ACG”
“User6 ACCT”
Output:
User1, User2, User4
User3, User5
User6
MapReduce
MAP / REDUCE Assignment For Tomorrow
4. In an unusual democracy, not everyone is equal. The vote count is a
function of the worth of the voter, and everyone votes for one another.
For example, if A with a worth of 5 and B with a worth of 1 both vote
for C, the vote count of C would be 6.
You are given a list of people with the worth of their vote. You are also given
another list describing who voted for whom.
Find the vote count of everyone.
List1
Voter   Votee
A       C
B       C
C       F
List2
Person  Worth
A       5
B       1
C       11
Result
Person  VoteCount
A       0
B       0
C       6
F       11
MapReduce
JOB TRACKER
MapReduce
JOB TRACKER (DETAILED)
MapReduce
JOB TRACKER (CONT.)
MapReduce
JOB TRACKER (CONT.)
MapReduce
MapReduce
QUICK - CLUSTER HANDS ON
MapReduce Command
The Example is available here
Remove old output directory
hadoop fs -rm -r /user/student/wordcount/output
Execute the mapReduce Command:
hadoop jar /usr/hdp/2.3.4.0-3485/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount /data/mr/wordcount/input mrout

More Related Content

What's hot

Hadoop MapReduce Streaming and Pipes
Hadoop MapReduce  Streaming and PipesHadoop MapReduce  Streaming and Pipes
Hadoop MapReduce Streaming and Pipes
Hanborq Inc.
 
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLabAdvanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce ParadigmDilip Reddy
 
Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013
Samir Bessalah
 
Mapreduce by examples
Mapreduce by examplesMapreduce by examples
Mapreduce by examples
Andrea Iacono
 
Introduction to Map-Reduce
Introduction to Map-ReduceIntroduction to Map-Reduce
Introduction to Map-Reduce
Brendan Tierney
 
MapReduce basic
MapReduce basicMapReduce basic
MapReduce basic
Chirag Ahuja
 
Hadoop eco system-first class
Hadoop eco system-first classHadoop eco system-first class
Hadoop eco system-first classalogarg
 
Apache PIG
Apache PIGApache PIG
Apache PIG
Prashant Gupta
 
Introduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopIntroduction to MapReduce and Hadoop
Introduction to MapReduce and Hadoop
Mohamed Elsaka
 
Big Data Analysis With RHadoop
Big Data Analysis With RHadoopBig Data Analysis With RHadoop
Big Data Analysis With RHadoop
David Chiu
 
PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)
PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)
PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)
PyData
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Datio Big Data
 
Hadoop introduction 2
Hadoop introduction 2Hadoop introduction 2
Hadoop introduction 2Tianwei Liu
 
Apache Spark with Scala
Apache Spark with ScalaApache Spark with Scala
Apache Spark with Scala
Fernando Rodriguez
 
Map reduce paradigm explained
Map reduce paradigm explainedMap reduce paradigm explained
Map reduce paradigm explainedDmytro Sandu
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
Anton Kirillov
 
Free Code Friday - Spark Streaming with HBase
Free Code Friday - Spark Streaming with HBaseFree Code Friday - Spark Streaming with HBase
Free Code Friday - Spark Streaming with HBase
MapR Technologies
 

What's hot (20)

Hadoop MapReduce Streaming and Pipes
Hadoop MapReduce  Streaming and PipesHadoop MapReduce  Streaming and Pipes
Hadoop MapReduce Streaming and Pipes
 
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLabAdvanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce Paradigm
 
Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013
 
Hadoop 2
Hadoop 2Hadoop 2
Hadoop 2
 
Mapreduce by examples
Mapreduce by examplesMapreduce by examples
Mapreduce by examples
 
Hadoop-Introduction
Hadoop-IntroductionHadoop-Introduction
Hadoop-Introduction
 
Introduction to Map-Reduce
Introduction to Map-ReduceIntroduction to Map-Reduce
Introduction to Map-Reduce
 
MapReduce basic
MapReduce basicMapReduce basic
MapReduce basic
 
Hadoop eco system-first class
Hadoop eco system-first classHadoop eco system-first class
Hadoop eco system-first class
 
Apache PIG
Apache PIGApache PIG
Apache PIG
 
Introduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopIntroduction to MapReduce and Hadoop
Introduction to MapReduce and Hadoop
 
Big Data Analysis With RHadoop
Big Data Analysis With RHadoopBig Data Analysis With RHadoop
Big Data Analysis With RHadoop
 
PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)
PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)
PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Hadoop introduction 2
Hadoop introduction 2Hadoop introduction 2
Hadoop introduction 2
 
Apache Spark with Scala
Apache Spark with ScalaApache Spark with Scala
Apache Spark with Scala
 
Map reduce paradigm explained
Map reduce paradigm explainedMap reduce paradigm explained
Map reduce paradigm explained
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
Free Code Friday - Spark Streaming with HBase
Free Code Friday - Spark Streaming with HBaseFree Code Friday - Spark Streaming with HBase
Free Code Friday - Spark Streaming with HBase
 

Similar to MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab

Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
Mohamed Ali Mahmoud khouder
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reducerantav
 
Distributed Computing with Apache Hadoop. Introduction to MapReduce.
Distributed Computing with Apache Hadoop. Introduction to MapReduce.Distributed Computing with Apache Hadoop. Introduction to MapReduce.
Distributed Computing with Apache Hadoop. Introduction to MapReduce.
Konstantin V. Shvachko
 
My mapreduce1 presentation
My mapreduce1 presentationMy mapreduce1 presentation
My mapreduce1 presentationNoha Elprince
 
Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015
Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015
Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015
Codemotion
 
Comparing Distributed Indexing To Mapreduce or Not?
Comparing Distributed Indexing To Mapreduce or Not?Comparing Distributed Indexing To Mapreduce or Not?
Comparing Distributed Indexing To Mapreduce or Not?
TerrierTeam
 
R for hadoopers
R for hadoopersR for hadoopers
R for hadoopers
Gwen (Chen) Shapira
 
An Introduction To Map-Reduce
An Introduction To Map-ReduceAn Introduction To Map-Reduce
An Introduction To Map-Reduce
Francisco Pérez-Sorrosal
 
MapReduce.pptx
MapReduce.pptxMapReduce.pptx
MapReduce.pptx
AtulYadav218546
 
Hadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User GroupHadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User GroupCsaba Toth
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
Apache Apex
 
Streaming Python on Hadoop
Streaming Python on HadoopStreaming Python on Hadoop
Streaming Python on Hadoop
Vivian S. Zhang
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2
Fabio Fumarola
 
iot.pptx
iot.pptxiot.pptx
iot.pptx
SabthamiS1
 
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdf
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdfmodule3part-1-bigdata-230301002404-3db4f2a4 (1).pdf
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdf
TSANKARARAO
 
Big Data.pptx
Big Data.pptxBig Data.pptx
Big Data.pptx
NelakurthyVasanthRed1
 
Sparse matrix computations in MapReduce
Sparse matrix computations in MapReduceSparse matrix computations in MapReduce
Sparse matrix computations in MapReduce
David Gleich
 
MapReduceAlgorithms.ppt
MapReduceAlgorithms.pptMapReduceAlgorithms.ppt
MapReduceAlgorithms.ppt
CheeWeiTan10
 
Stratosphere with big_data_analytics
Stratosphere with big_data_analyticsStratosphere with big_data_analytics
Stratosphere with big_data_analytics
Avinash Pandu
 

Similar to MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab (20)

Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reduce
 
Distributed Computing with Apache Hadoop. Introduction to MapReduce.
Distributed Computing with Apache Hadoop. Introduction to MapReduce.Distributed Computing with Apache Hadoop. Introduction to MapReduce.
Distributed Computing with Apache Hadoop. Introduction to MapReduce.
 
My mapreduce1 presentation
My mapreduce1 presentationMy mapreduce1 presentation
My mapreduce1 presentation
 
Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015
Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015
Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015
 
Comparing Distributed Indexing To Mapreduce or Not?
Comparing Distributed Indexing To Mapreduce or Not?Comparing Distributed Indexing To Mapreduce or Not?
Comparing Distributed Indexing To Mapreduce or Not?
 
R for hadoopers
R for hadoopersR for hadoopers
R for hadoopers
 
An Introduction To Map-Reduce
An Introduction To Map-ReduceAn Introduction To Map-Reduce
An Introduction To Map-Reduce
 
MapReduce.pptx
MapReduce.pptxMapReduce.pptx
MapReduce.pptx
 
Hadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User GroupHadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User Group
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Streaming Python on Hadoop
Streaming Python on HadoopStreaming Python on Hadoop
Streaming Python on Hadoop
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2
 
iot.pptx
iot.pptxiot.pptx
iot.pptx
 
Hadoop Map Reduce
Hadoop Map ReduceHadoop Map Reduce
Hadoop Map Reduce
 
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdf
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdfmodule3part-1-bigdata-230301002404-3db4f2a4 (1).pdf
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdf
 
Big Data.pptx
Big Data.pptxBig Data.pptx
Big Data.pptx
 
Sparse matrix computations in MapReduce
Sparse matrix computations in MapReduceSparse matrix computations in MapReduce
Sparse matrix computations in MapReduce
 
MapReduceAlgorithms.ppt
MapReduceAlgorithms.pptMapReduceAlgorithms.ppt
MapReduceAlgorithms.ppt
 
Stratosphere with big_data_analytics
Stratosphere with big_data_analyticsStratosphere with big_data_analytics
Stratosphere with big_data_analytics
 

More from CloudxLab

Understanding computer vision with Deep Learning
Understanding computer vision with Deep LearningUnderstanding computer vision with Deep Learning
Understanding computer vision with Deep Learning
CloudxLab
 
Deep Learning Overview
Deep Learning OverviewDeep Learning Overview
Deep Learning Overview
CloudxLab
 
Recurrent Neural Networks
Recurrent Neural NetworksRecurrent Neural Networks
Recurrent Neural Networks
CloudxLab
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
CloudxLab
 
Naive Bayes
Naive BayesNaive Bayes
Naive Bayes
CloudxLab
 
Autoencoders
AutoencodersAutoencoders
Autoencoders
CloudxLab
 
Training Deep Neural Nets
Training Deep Neural NetsTraining Deep Neural Nets
Training Deep Neural Nets
CloudxLab
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
CloudxLab
 
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...
CloudxLab
 
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
CloudxLab
 
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
CloudxLab
 
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
Introduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLab
Introduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLabIntroduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLab
Introduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLab
CloudxLab
 
Introduction to Deep Learning | CloudxLab
Introduction to Deep Learning | CloudxLabIntroduction to Deep Learning | CloudxLab
Introduction to Deep Learning | CloudxLab
CloudxLab
 
Dimensionality Reduction | Machine Learning | CloudxLab
Dimensionality Reduction | Machine Learning | CloudxLabDimensionality Reduction | Machine Learning | CloudxLab
Dimensionality Reduction | Machine Learning | CloudxLab
CloudxLab
 
Ensemble Learning and Random Forests
Ensemble Learning and Random ForestsEnsemble Learning and Random Forests
Ensemble Learning and Random Forests
CloudxLab
 
Decision Trees
Decision TreesDecision Trees
Decision Trees
CloudxLab
 
Support Vector Machines
Support Vector MachinesSupport Vector Machines
Support Vector Machines
CloudxLab
 
Introduction to Linux | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Linux | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Linux | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Linux | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 

More from CloudxLab (20)

Understanding computer vision with Deep Learning
Understanding computer vision with Deep LearningUnderstanding computer vision with Deep Learning
Understanding computer vision with Deep Learning
 
Deep Learning Overview
Deep Learning OverviewDeep Learning Overview
Deep Learning Overview
 
Recurrent Neural Networks
Recurrent Neural NetworksRecurrent Neural Networks
Recurrent Neural Networks
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Naive Bayes
Naive BayesNaive Bayes
Naive Bayes
 
Autoencoders
AutoencodersAutoencoders
Autoencoders
 
Training Deep Neural Nets
Training Deep Neural NetsTraining Deep Neural Nets
Training Deep Neural Nets
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning


MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab

  • 2. MapReduce TODAY’S CLASS ● Thinking in MapReduce ○ Word Frequency Problem ■ Solution 1 - Coding ■ Solution 2 - SQL ■ Solution 3 - Unix Pipes ■ Solution 4 - External Sort ● Map/Reduce Overview ● Visualisation ● Analogies to groupby ● Assignments
  • 4. MapReduce BIG DATA PROBLEM - PROCESSING Q: How fast can a 1 GHz processor sort 1 TB of data? This data is made up of 10 billion 100-byte strings. A: Around 6-10 hours. What's wrong with 6-10 hours? We need: 1. Faster sort 2. Bigger data sorting 3. More often
  • 5. MapReduce BIG DATA PROBLEM - PROCESSING Google, 8 Sept, 2011: Sorting 10PB took 6.5 hrs on 8000 computers
  • 6. MapReduce Why sorting is such a big deal 1. Every SQL query is impacted by sorting: ○ Where clause - Index (sorting) ○ Group By - involves sorting ○ Joins - immensely enhanced by sorting ○ Distinct ○ Order By 2. Most algorithms depend on sorting
  • 7. MapReduce • Programming Paradigm • To help solve Big Data problems • Specifically sorting intensive jobs or disc read intensive • You would have to code two functions: • Mapper - Converts Input into “key - value” pairs • Reducer - Aggregates all the values for a key THINKING IN MAP / REDUCE What is Map/Reduce?
  • 8. MapReduce • Also supported by many other systems such as • MongoDB / CouchDB / Cassandra • Apache Spark • Mapper & Reducers in hadoop • can be written in Java, Shell, Python or any binary THINKING IN MAP / REDUCE What is Map/Reduce?
  • 9. MapReduce MAP REDUCE [Diagram: three HDFS blocks feed Split 0, Split 1 and Split 2; each split goes through Map and Sort; the sorted outputs are copied and merged into a single Reduce, which writes Part 0 to HDFS]
  • 11. MapReduce THINKING IN MAP / REDUCE If you have a plain text file containing hundreds of textbooks [500 MB], how would you find the frequencies of words?
  • 12. MapReduce THINKING IN MAP / REDUCE If you have the plain text file of all the Lord Of Rings books, how would you find the frequencies of words? Approach 1 (Programmatic): • Create a frequency hash table / dictionary • For each word in the files • Increase its frequency in the hash table • When no more words left in file, print the hash table Problems?
  • 13. MapReduce THINKING IN MAP / REDUCE Problems?
    Flowchart (with the matching code line): Start -> Initialize a dictionary or hashtable of (word, count) [line 1] -> Read next word from file [line 2] -> Is any word left? [line 2] -> Find word in dictionary; does the word exist in dictionary? [line 3] -> If not, add new word with count as 0 [line 4] -> Increase the count by 1 [line 5] -> When no word is left, print the words and counts [lines 6 & 7] -> End
    1. wordcount = {}
    2. for word in file.read().split():
    3.     if word not in wordcount:
    4.         wordcount[word] = 0
    5.     wordcount[word] += 1
    6. for k, v in wordcount.items():
    7.     print(k, v)
  • 14. MapReduce THINKING IN MAP / REDUCE If you have the plain text file of all the Lord Of Rings books, how would you find the frequencies of words? Approach 1 (Programmatic): • Create a frequency hash table / dictionary • For each word in the file • Increase its frequency in the hash table • When no more words left in file, print the hash table Problems? Can not process the data beyond RAM size.
  • 15. MapReduce THINKING IN MAP / REDUCE If you have the plain text file of all the Lord Of Rings books, how would you find the frequencies of words? Approach 2 (SQL): • Break the books into one word per line • Insert one word per row into a database table • Execute: select word, count(*) from table group by word
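The SQL approach can be tried on a small scale with Python's built-in `sqlite3` module. This is only a sketch: the in-memory database, the table name `words` and the helper name `word_frequencies_sql` are assumptions made for illustration, not part of the slides.

```python
import sqlite3


def word_frequencies_sql(text):
    """Count words with GROUP BY using an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE words (word TEXT)")
    # One word per row, as described in Approach 2.
    conn.executemany(
        "INSERT INTO words (word) VALUES (?)",
        ((w,) for w in text.split()),
    )
    rows = conn.execute(
        "SELECT word, COUNT(*) FROM words GROUP BY word ORDER BY word"
    ).fetchall()
    conn.close()
    return rows


print(word_frequencies_sql("sa re re sa ga"))
# [('ga', 1), ('re', 2), ('sa', 2)]
```

The database does the grouping here, which is typically implemented by sorting or hashing the `word` column.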
  • 17. MapReduce Understanding Unix Pipeline A program can take input from you.
  • 18. MapReduce Understanding Unix Pipeline A program may also print some output
  • 19. MapReduce Understanding Unix Pipeline command1 | command2 [Diagram: Command1 -> Pipe -> Command2]
  • 20. MapReduce THINKING IN MAP / REDUCE If you have the plain text file of all the Lord Of Rings books, how would you find the frequencies of words? Approach 3 (Unix): • Replace spaces with newlines • Order lines with the sort command • Then find frequencies using uniq • Scans from top to bottom • Prints the count when the line value changes
    cat myfile | sed -E 's/[\t ]+/\n/g' | sort -S 1g | uniq -c
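The same sort-then-count idea can be sketched in Python. Sorting places equal words next to each other, and a single top-to-bottom scan then emits a count every time the word changes, which is what `uniq -c` does. The function name `sort_and_count` is hypothetical.

```python
import re
from itertools import groupby


def sort_and_count(text):
    """Mimic `sort | uniq -c`: sort the words, then count adjacent runs."""
    # Split on tabs, spaces and newlines, dropping empty strings.
    words = [w for w in re.split(r"[\t \n]+", text) if w]
    words.sort()
    # groupby only groups adjacent equal items, hence the sort above.
    return [(len(list(group)), word) for word, group in groupby(words)]


print(sort_and_count("sa re re\nga ga re"))
# [(2, 'ga'), (3, 're'), (1, 'sa')]
```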
  • 21. MapReduce THINKING IN MAP / REDUCE Problems in Approach 2 (SQL) & Approach 3 (Unix)?
  • 22. MapReduce THINKING IN MAP / REDUCE Problems in Approach 2 (SQL) & Approach 3 (Unix)? The moment the data starts going beyond RAM the time taken starts increasing. The following become bottlenecks: • CPU • Disk Speed • Disk Space
  • 23. MapReduce THINKING IN MAP / REDUCE Then? Approach 4: Use an external sort. • Split the files to a size that fits in RAM • Use the previous approaches (2 & 3) to find frequencies • Merge (sort -m) and sum up frequencies [Diagram: Machine 1 gets sa, re, re and produces re:2 sa:1; Machine 2 gets ga, ga, re and produces ga:2 re:1; the Launcher merges both into ga:2 re:3 sa:1]
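A minimal single-process sketch of Approach 4, assuming each chunk already fits in memory: every "machine" produces a sorted list of (word, count) pairs, and the launcher merges the sorted lists and sums the counts of equal words. The helper names `count_chunk` and `merge_counts` are made up for this example.

```python
import heapq
from collections import Counter
from itertools import groupby
from operator import itemgetter


def count_chunk(words):
    """Per-machine step: count one chunk and return sorted (word, count) pairs."""
    return sorted(Counter(words).items())


def merge_counts(*sorted_counts):
    """Launcher step: merge sorted partial counts and sum equal words."""
    merged = heapq.merge(*sorted_counts, key=itemgetter(0))
    return [
        (word, sum(count for _, count in group))
        for word, group in groupby(merged, key=itemgetter(0))
    ]


machine1 = count_chunk(["sa", "re", "re"])  # [('re', 2), ('sa', 1)]
machine2 = count_chunk(["ga", "ga", "re"])  # [('ga', 2), ('re', 1)]
print(merge_counts(machine1, machine2))
# [('ga', 2), ('re', 3), ('sa', 1)]
```

`heapq.merge` reads its inputs lazily, so the merge step itself never needs all partial results in memory at once.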
  • 24. MapReduce • Takes O(n) time to merge sorted data • Or the time is proportional to the number of elements to be merged THINKING IN MAP / REDUCE Merging
  • 25. MapReduce Merging Merge the two sorted queues to form another sorted queue
  • 39. MapReduce Merging Since no one is left on second queue. Put remaining from first
  • 40. MapReduce Merging This merges the two queues into one
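The two-queue merge walked through in the slides above can be written as follows. This is a sketch using Python lists in place of queues; `merge_sorted` is a hypothetical name.

```python
def merge_sorted(first, second):
    """Merge two sorted lists into one sorted list in O(n) time."""
    merged = []
    i = j = 0
    # Repeatedly take the smaller of the two front elements.
    while i < len(first) and j < len(second):
        if first[i] <= second[j]:
            merged.append(first[i])
            i += 1
        else:
            merged.append(second[j])
            j += 1
    # One list is exhausted; append whatever remains of the other.
    merged.extend(first[i:])
    merged.extend(second[j:])
    return merged


print(merge_sorted([1, 4, 6], [3, 5, 7]))
# [1, 3, 4, 5, 6, 7]
```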
  • 41. MapReduce • For more than two lists ○ Use min-heap THINKING IN MAP / REDUCE Merging 1 4 6 9 10 12 6 7 8 8 9 9 3 5 7 5 10 17 To the output
  • 42. MapReduce • For more than two lists ○ Use min-heap THINKING IN MAP / REDUCE Merging 1 4 6 9 10 12 6 7 8 8 9 9 3 5 7 5 10 17 1,
  • 43. MapReduce • For more than two lists ○ Or merge two at a time THINKING IN MAP / REDUCE Merging
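For more than two lists, the min-heap option might look like the sketch below. The heap holds one candidate per input list, so each output element costs O(log k) for k lists. The input lists in the usage line are illustrative and are not taken from the slide's diagram.

```python
import heapq


def merge_k_sorted(lists):
    """Merge any number of sorted lists using a min-heap of their heads."""
    # Each heap entry: (value, index of source list, position in that list).
    heap = [(lst[0], idx, 0) for idx, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    out = []
    while heap:
        value, idx, pos = heapq.heappop(heap)
        out.append(value)
        # Refill the heap from the list the popped value came from.
        if pos + 1 < len(lists[idx]):
            heapq.heappush(heap, (lists[idx][pos + 1], idx, pos + 1))
    return out


print(merge_k_sorted([[1, 4, 6], [9, 10, 12], [3, 5, 7]]))
# [1, 3, 4, 5, 6, 7, 9, 10, 12]
```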
  • 44. MapReduce THINKING IN MAP / REDUCE Problems with Approach 4? [Diagram: Machine 1 gets sa, re, re and produces re:2 sa:1; Machine 2 gets ga, ga, re and produces ga:2 re:1; the Launcher merges both into ga:2 re:3 sa:1]
  • 45. MapReduce THINKING IN MAP / REDUCE Problems with external sort? Time is consumed in transporting data. + For each requirement we would need a special-purpose, network-oriented program. + It would require a lot of engineering. Solution? Use Map/Reduce
  • 46. MapReduce • Programming Paradigm • To help solve Big Data problems • Specifically sorting intensive jobs or disc read intensive • You would have to code two functions: • Mapper - Convert Input into “key - value” pairs • Reducer - Aggregates all the values for a key THINKING IN MAP / REDUCE What is Map/Reduce?
  • 47. MapReduce • Also supported by many other systems such as • MongoDB / CouchDB / Cassandra • Apache Spark • Mapper & Reducers in hadoop • can be written in Java, Shell, Python or any binary THINKING IN MAP / REDUCE What is Map/Reduce?
  • 48. MapReduce EXAMPLE OF ONLY MAPPER [Diagram: Directory of profile pictures in HDFS -> Machine 1, Machine 2 and Machine 3 each run Function Mapper(Image): convert image to 100x100 pixels -> HDFS output directory of 100x100px profile pictures]
  • 50. MapReduce With Both mapper() & Reducer() code HDFS Input HDFS
  • 51. MapReduce MAP / REDUCE Mapper/Reducer for word frequency problem. function map(line): foreach(word in line) : print(word, 1); sa 1 re 1 re 1 sa 1 ga 1 hdfs sa re re sa ga
  • 52. MapReduce MAP / REDUCE Mapper/Reducer for word frequency problem. function map(line): foreach(word in line) : print(word, 1); sa re re sa ga function reduce(word, freqArray): return Array.sum(freqArray); sa 1 re 1 re 1 sa 1 ga 1 ga [1] re [1, 1] sa [1, 1] ga 1 re 2 sa 2 hdfs
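The word-frequency mapper and reducer above can be simulated in a single process to see the three stages (map, shuffle/sort, reduce) end to end. This is only an illustration of the data flow, not how Hadoop runs a job; `run_job`, `map_words` and `reduce_counts` are hypothetical names.

```python
from collections import defaultdict


def map_words(line):
    """Mapper: emit (word, 1) for every word in the line."""
    for word in line.split():
        yield word, 1


def reduce_counts(word, counts):
    """Reducer: sum all the values collected for one key."""
    return word, sum(counts)


def run_job(lines, mapper, reducer):
    """Simulate map -> shuffle/sort -> reduce in memory."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)  # shuffle: group values by key
    # Keys reach the reducer in sorted order.
    return [reducer(key, groups[key]) for key in sorted(groups)]


print(run_job(["sa re re", "sa ga"], map_words, reduce_counts))
# [('ga', 1), ('re', 2), ('sa', 2)]
```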
  • 53. MapReduce Mapper/Reducer for computing max temp
    def mapp(line):
        (t, c, time) = line.split(",")
        print(c, t)
    def reduce(key, values):
        return max(values)
    Input (Temp, City, Date): 20, NYC, 2014-01-01; 20, NYC, 2015-01-01; 21, NYC, 2014-01-02; 23, BLR, 2012-01-01; 25, Seattle, 2016-01-01; 21, CHICAGO, 2013-01-05; 24, NYC, 2016-5-05
    Mapper output: NYC 20; NYC 20; NYC 21; BLR 23; SEATTLE 25; CHICAGO 21; NYC 24
    After shuffle/sort: BLR 23; CHICAGO 21; NYC 20,20,21,24; SEATTLE 25
    Reducer output: BLR 23; CHICAGO 21; NYC 24; SEATTLE 25
  • 54. MapReduce Mapper/Reducer for computing max temp along with its date
    def mapp(line):
        (t, c, date) = line.split(",")
        print(c, (t, date))
    def reduce(key, values):
        maxt = -19191919; date = ''
        for i in values:
            T = int(i[0])  # temperatures arrive as text, so convert before comparing
            if T > maxt:
                maxt = T; date = i[1]
        return (maxt, date)
    Input (Temp, City, Date): 20, NYC, 2014-01-01; 20, NYC, 2015-01-01; 21, NYC, 2014-01-02; 23, BLR, 2012-01-01; 25, Seattle, 2016-01-01; 21, CHICAGO, 2013-01-05; 24, NYC, 2016-5-05
    Mapper output: NYC (20, 2014-01-01); NYC (20, 2015-01-01); NYC (21, 2014-01-02); BLR (23, 2012-01-01); SEATTLE (25, 2016-01-01); CHICAGO (21, 2013-01-05); NYC (24, 2016-5-05)
    Reducer output: BLR (23, 2012-01-01); CHICAGO (21, 2013-01-05); NYC (24, 2016-5-05); SEATTLE (25, 2016-01-01)
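A runnable sketch of the max-temperature job, using the sample rows from the slide. It assumes fields are separated by commas with optional spaces and that city names should be compared case-insensitively (the slide shows them upper-cased after the map step). The in-memory `run` driver and the function names are illustrative only.

```python
from collections import defaultdict


def map_reading(line):
    """Mapper: emit (city, (temperature, date)) for one input row."""
    temp, city, date = (field.strip() for field in line.split(","))
    yield city.upper(), (int(temp), date)


def reduce_max(city, readings):
    """Reducer: keep the reading with the highest temperature."""
    temp, date = max(readings, key=lambda reading: reading[0])
    return city, temp, date


def run(lines):
    """Group mapper output by city, then reduce each group."""
    groups = defaultdict(list)
    for line in lines:
        for city, reading in map_reading(line):
            groups[city].append(reading)
    return [reduce_max(city, groups[city]) for city in sorted(groups)]


rows = [
    "20, NYC, 2014-01-01",
    "20, NYC, 2015-01-01",
    "21, NYC, 2014-01-02",
    "23, BLR, 2012-01-01",
    "25, Seattle, 2016-01-01",
    "21, CHICAGO, 2013-01-05",
    "24, NYC, 2016-5-05",
]
for result in run(rows):
    print(result)
```

Converting the temperature to `int` in the mapper matters: comparing the raw strings would order "9" above "24".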
  • 55. MapReduce MAP / REDUCE Analogous to Group By function map(line): (temp, city, time) = line.split(",") print(city, temp) function reduce(city, arr_temps): return max(arr_temps); select city, max(temp) from table group by city.
  • 56. MapReduce MAP / REDUCE Analogous to Group By function map(): foreach(word in input) : print(word, 1); function reduce(word, freqArray): return Array.sum(freqArray); select word, count(*) from table group by word.
  • 57. MapReduce MAP REDUCE - Multiple Reducers [Diagram: three HDFS blocks feed Split 0, Split 1 and Split 2; each split goes through Map and Sort; the sorted outputs are copied and merged into two Reduce tasks, which write Part 0 and Part 1 to HDFS; example keys: Apple, Banana, Apricot, Carrots]
  • 58. MapReduce MAP REDUCE - Partitioning Key k will go to this reducer: hashcode(k) % total_reducers
    Keys per reducer: Reducer 0: 0, 4, 8, ..., 3200; Reducer 1: 1, 5, 9, ..., 3201; Reducer 2: 2, 6, 10, ..., 3202; Reducer 3: 3, 7, 11, ..., 3203
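The partitioning rule can be sketched in a few lines. The example uses small non-negative integer keys, for which Python's built-in `hash` returns the integer itself; string hashes in Python vary between runs, so a real job would need a stable hash function. `partition` and `assign` are hypothetical names.

```python
from collections import defaultdict


def partition(key, total_reducers):
    """Pick the reducer for a key: hashcode(k) % total_reducers."""
    return hash(key) % total_reducers


def assign(keys, total_reducers):
    """Show which keys each reducer would receive."""
    buckets = defaultdict(list)
    for key in keys:
        buckets[partition(key, total_reducers)].append(key)
    return dict(buckets)


print(assign(range(12), 4))
# {0: [0, 4, 8], 1: [1, 5, 9], 2: [2, 6, 10], 3: [3, 7, 11]}
```

Because the reducer is a pure function of the key, every value for a given key lands on the same reducer, which is what makes per-key aggregation possible.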
  • 60. MapReduce Thank you. Hadoop & Spark support@knowbigdata.com +1 419 665 3276 (US) +91 803 959 1464 (IN) Subscribe to our Youtube channel for latest videos - https://www.youtube.com/channel/UCxugRFe5wETYA7nMH6VGyEA
  • 62. MapReduce MAP / REDUCE The data generated by the mapper is given to reducer and then it is sorted / shuffled [Yes/No]?
  • 63. MapReduce MAP / REDUCE The data generated by the mapper is given to reducer and then it is sorted / shuffled [Yes/No]? No. The output of mapper is first shuffled/sorted and then given to reducers.
  • 64. MapReduce MAP / REDUCE The mapper can only generate a single key value pair for an input value [True/False]?
  • 65. MapReduce MAP / REDUCE The mapper can only generate a single key-value pair for an input value [True/False]? False. A mapper can generate as many key-value pairs as it wants for an input.
  • 66. MapReduce MAP / REDUCE A mapper always has to generate at least one key-value pair [Correct/Wrong]?
  • 67. MapReduce MAP / REDUCE A mapper always has to generate at least one key-value pair [Correct/Wrong]? Wrong.
  • 68. MapReduce MAP / REDUCE By default there is only one reducer in case of streaming job [Yes/No]?
  • 69. MapReduce MAP / REDUCE By default there is only one reducer in case of streaming job [Yes/No]? Yes. By default there is a single reducer, but the number of reducers can be changed with the command-line option mapred.reduce.tasks.
  • 70. MapReduce MAP / REDUCE In hadoop 1.0, What is the role of job tracker? A: Executing the Map/Reduce Logic B: Delegate the Map/Reduce Logic to task tracker.
  • 71. MapReduce MAP / REDUCE What is the role of job tracker? A: Executing the Map/Reduce Logic B: Delegate the Map/Reduce Logic to task tracker. B.
  • 72. MapReduce MAP / REDUCE Q: The Map logic is executed preferably on the nodes that have the required data [Yes/No]?
  • 73. MapReduce MAP / REDUCE Q: The Map logic is executed preferably on the nodes that have the required data [Yes/No]? Yes.
  • 74. MapReduce MAP / REDUCE Q: The Map logic is always executed on the nodes that have the required data [Correct/Wrong]?
  • 75. MapReduce MAP / REDUCE Q: The Map logic is always executed on the nodes that have the required data [Correct/Wrong]? Wrong.
  • 76. MapReduce MAP / REDUCE Where does Hadoop Store the result of reducer? In HDFS or Local File System?
  • 77. MapReduce MAP / REDUCE Where does Hadoop store the result of the reducer? In HDFS or the local file system? In HDFS.
  • 78. MapReduce MAP / REDUCE Where does Hadoop Store the intermediate data such as output of Map Tasks? In HDFS or Local File System or Memory?
  • 79. MapReduce MAP / REDUCE Where does Hadoop store intermediate data such as the output of map tasks? In HDFS, the local file system or memory? First in memory, and then spilled to the local file system. The output of the mapper is saved in HDFS directly only if there is no reduce phase.
  • 80. MapReduce MAP / REDUCE Assignment For Tomorrow 1. Frequencies of letters [a-z] - Do you need Map/Reduce? 2. Find anagrams in a huge text. An anagram is basically a different arrangement of the letters in a word. An anagram does not need to have a meaning. Input: “the cat act in tic tac toe” Output (one group per entry): cat, tac, act; the; toe; in; tic
  • 81. MapReduce MAP / REDUCE Assignment For Tomorrow 3a. A file contains the DNA sequences of people. Find all the people who have the same DNA. Input: “User1 ACGT” “User2 TGCA” “User3 ACG” “User4 ACGT” “User5 ACG” “User6 AGCT” Output (one group per entry): User1, User4; User2; User3, User5; User6
  • 82. MapReduce MAP / REDUCE Assignment For Tomorrow 3b. A file contains the DNA sequences of people. Find all the people who have the same DNA or its mirror image. Input: “User1 ACGT” “User2 TGCA” “User3 ACG” “User4 ACGT” “User5 ACG” “User6 ACCT” Output (one group per entry): User1, User2, User4; User3, User5; User6
  • 83. MapReduce MAP / REDUCE Assignment For Tomorrow 4. In an unusual democracy, not everyone is equal. The vote count is a function of the worth of the voter, though everyone is voting for each other. For example, if A with a worth of 5 and B with a worth of 1 are voting for C, the vote count of C would be 6. You are given a list of people with the worth of their vote. You are also given another list describing who voted for whom. Find the vote count of everyone.
    List1 (Voter -> Votee): A -> C; B -> C; C -> F
    List2 (Person: Worth): A: 5; B: 1; C: 11
    Result (Person: VoteCount): A: 0; B: 0; C: 6; F: 11
  • 89. MapReduce QUICK - CLUSTER HANDS ON MapReduce Command. The example is available here.
    Remove old output directory:
    hadoop fs -rm -r /user/student/wordcount/output
    Execute the MapReduce command:
    hadoop jar /usr/hdp/2.3.4.0-3485/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount /data/mr/wordcount/input mrout