Distributed Computing with
Apache Hadoop
Introduction to MapReduce
Konstantin V. Shvachko
Birmingham Big Data Science Group
October 19, 2011
Computing
• The history of computing started a long time ago
• Fascination with numbers
– Vast universe with simple strict rules
– Computing devices
– Crunch numbers
• The Internet
– Universe of words, fuzzy rules
– Different type of computing
– Understand meaning of things
– Human thinking
– Errors & deviations are a part of the study
Computer History Museum, San Jose
Words vs. Numbers
• In 1997 IBM built the Deep Blue supercomputer
– Played chess against the champion G. Kasparov
– The human race was defeated
– Strict rules for chess
– Fast, deep analysis of the current state
– Still numbers
• In 2011 IBM built the Watson computer to play Jeopardy
– Questions and hints in human terms
– Analysis of texts from libraries and the Internet
– Human champions defeated
Big Data
• Computations that need the power of many computers
– Large datasets: hundreds of TBs, PBs
– Or use of thousands of CPUs in parallel
– Or both
• Cluster as a computer
What is a PB?
1 KB = 1000 Bytes
1 MB = 1000 KB
1 GB = 1000 MB
1 TB = 1000 GB
1 PB = 1000 TB
???? = 1000 PB
Examples – Science
• Fundamental physics: Large Hadron Collider (LHC)
– Smashing high-energy protons moving at nearly the speed of light
– 1 PB of event data per sec, most filtered out
– 15 PB of data per year
– 150 computing centers around the World
– 160 PB of disk + 90 PB of tape storage
• Math: Big Numbers
– The 2-quadrillionth (10¹⁵-th) binary digit of π is 0
– pure CPU workload
– 12 days of cluster time
– 208 years of CPU-time on a cluster with 7600 CPU cores
• Big Data – Big Science
Examples – Web
• Search engine Webmap
– Map of the Internet
– 2008 @ Yahoo, 1500 nodes, 5 PB raw storage
• Internet Search Index
– Traditional application
• Social Network Analysis
– Intelligence
– Trends
The Sorting Problem
• Classic in-memory sorting
– Complexity: number of comparisons
• External sorting
– Cannot load all data in memory
– 16 GB RAM vs. 200 GB file
– Complexity: + disk IOs (bytes read or written)
• Distributed sorting
– Cannot load data on a single server
– 12 drives * 2 TB = 24 TB of disk space vs. a 200 TB data set
– Complexity: + network transfers
Algorithm      Worst         Average       Space
Bubble Sort    O(n²)         O(n²)         In-place
Quicksort      O(n²)         O(n log n)    In-place
Merge Sort     O(n log n)    O(n log n)    Double
What do we do?
• Need a lot of computers
• How do we make them work together?
Hadoop
• Apache Hadoop is an ecosystem of
tools for processing “Big Data”
• Started in 2005 by D. Cutting and M. Cafarella
• Consists of two main components, providing a unified view of the cluster:
1. HDFS – a distributed file system
– File system API connecting thousands of drives
2. MapReduce – a framework for distributed computations
– Splitting jobs into parts executable on one node
– Scheduling and monitoring of job execution
• Today used everywhere: becoming the standard for distributed computing
• Hadoop is an open source project
MapReduce
• MapReduce
– 2004 Jeffrey Dean, Sanjay Ghemawat. Google.
– “MapReduce: Simplified Data Processing on Large Clusters”
• Computational model
– What is a computational model?
• Turing machine, Java
– Split large input data into small enough pieces, process in parallel
• Execution framework
– Compilers, interpreters
– Scheduling, Processing, Coordination
– Failure recovery
Functional Programming
• Map is a higher-order function
– applies a given function to each element of a list
– returns the list of results
• Map( f(x), X[1:n] ) → [ f(X[1]), …, f(X[n]) ]
• Example. Map( x², [0,1,2,3,4,5] ) = [0,1,4,9,16,25]
Functional Programming: reduce
• Reduce / fold is a higher-order function
– Iterates a given function over a list of elements
– Applies the function to the previous result and the current element
– Returns a single result
• Example. Reduce( x + y, [0,1,2,3,4,5] ) = (((((0 + 1) + 2) + 3) + 4) + 5) = 15
• Quiz. Reduce( x * y, [0,1,2,3,4,5] ) = ?
• Answer: 0, because the list contains 0
Example: Sum of Squares
• Composition of
– a map followed by
– a reduce applied to the results of the map
• Example.
– Map( x², [1,2,3,4,5] ) = [1,4,9,16,25]
– Reduce( x + y, [1,4,9,16,25] ) = ((((1 + 4) + 9) + 16) + 25) = 55
• Map is easily parallelizable
– Compute x² for 1,2,3 on one node and for 4,5 on another
• Reduce is notoriously sequential
– All squares are needed at one node to compute the total sum
Square Pyramid Number: 1 + 4 + 9 + … + n² = n(n+1)(2n+1) / 6
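An aside, not from the original deck: the same map-then-reduce composition in plain Java (Java 8 streams), as a minimal runnable sketch.

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class SumOfSquares {
    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 2, 3, 4, 5);
        // Map: x -> x^2, applied independently to each element
        List<Integer> squares = input.stream()
                .map(x -> x * x)
                .collect(Collectors.toList());
        // Reduce: fold the list with +, starting from 0
        int sum = squares.stream().reduce(0, Integer::sum);
        System.out.println(sum);  // prints 55 = 5*6*11 / 6
    }
}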
Computational Model
• MapReduce is a Parallel Computational Model
• Map-Reduce algorithm = job
• Operates with key-value pairs: (k, V)
– Primitive types, Strings or more complex Structures
• Map-Reduce job input and output is a list of pairs {(k, V)}
• An MR job is defined by 2 functions
• map: (k1, v1) → {(k2, v2)}
• reduce: (k2, {v2}) → {(k3, v3)}
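In the Hadoop Java API these two functions become the Mapper and Reducer classes. A minimal identity sketch, not from the original deck (assumes the Hadoop jars on the classpath; the Text types are arbitrary placeholders):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class IdentitySketch {
    // map: (k1, v1) -> {(k2, v2)}; pairs are emitted via context.write()
    public static class IdMapper extends Mapper<Text, Text, Text, Text> {
        @Override
        protected void map(Text key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(key, value);   // may emit zero or more pairs
        }
    }
    // reduce: (k2, {v2}) -> {(k3, v3)}; all values for one key arrive together
    public static class IdReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text v : values)
                context.write(key, v);
        }
    }
}

Identity Mappers and Reducers like these reappear in the sorting slides later in the deck.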
Job Workflow
[Diagram] Input words: dogs, like, cats. The Map phase emits per-word consonant/vowel counts:
dogs → (C, 3), (V, 1); like → (C, 2), (V, 2); cats → (C, 3), (V, 1).
The Reduce phase sums each key: (C, 8), (V, 4).
The Algorithm
Map ( null, word)
nC = Consonants(word)
nV = Vowels(word)
Emit(“Consonants”, nC)
Emit(“Vowels”, nV)
Reduce(key, {n1, n2, …})
nRes = n1 + n2 + …
Emit(key, nRes)
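A hedged Java rendering of the pseudocode above (not from the original deck; the Vowels/Consonants helpers are inlined, and lower-case English text is assumed):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class VowelConsonantCount {
    public static class VCMapper extends Mapper<Object, Text, Text, LongWritable> {
        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String word = value.toString();
            long nV = 0;
            for (char c : word.toCharArray())
                if ("aeiou".indexOf(c) >= 0) nV++;      // Vowels(word)
            long nC = word.length() - nV;               // Consonants(word)
            context.write(new Text("Consonants"), new LongWritable(nC));
            context.write(new Text("Vowels"), new LongWritable(nV));
        }
    }
    public static class VCReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        public void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;                               // n1 + n2 + ...
            for (LongWritable v : values) sum += v.get();
            context.write(key, new LongWritable(sum));
        }
    }
}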
Computation Framework
• Two virtual clusters: HDFS and MapReduce
– Physically tightly coupled. Designed to work together
• Hadoop Distributed File System. View data as files and directories
• MapReduce is a Parallel Computation Framework
– Job scheduling and execution framework
HDFS Architecture Principles
• The name space is a hierarchy of files and directories
• Files are divided into blocks (typically 128 MB)
• Namespace (metadata) is decoupled from data
– Fast namespace operations, not slowed down by data streaming
• Single NameNode keeps the entire name space in RAM
• DataNodes store data blocks on local drives
• Blocks are replicated on 3 DataNodes for redundancy and availability
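What the file system API looks like to a client: a minimal sketch, not from the original deck (the path is a placeholder; a configured HDFS client is assumed):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        // The NameNode resolves the path; DataNodes stream the blocks.
        FileSystem fs = FileSystem.get(new Configuration());
        BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/user/demo/words.txt"))));
        for (String line; (line = in.readLine()) != null; )
            System.out.println(line);
        in.close();
    }
}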
MapReduce Framework
• Job Input is a file or a set of files in a distributed file system (HDFS)
– Input is split into blocks of roughly the same size
– Blocks are replicated to multiple nodes
– Block holds a list of key-value pairs
• Map task is scheduled to one of the nodes containing the block
– Map task input is node-local
– Map task result is node-local
• Map task results are grouped: one group per reducer
– Each group is sorted
• Reduce task is scheduled to a node
– Reduce task transfers the targeted groups from all mapper nodes
– Computes and stores results in a separate HDFS file
• Job output is a set of files in HDFS, with #files = #reducers
MapReduce Example: Mean
• Mean
• Input: large text file
• Output: average length of words in the file µ
• Example: µ({dogs, like, cats}) = 4
Formula: µ = (1/n) · ∑ xᵢ  (sum over all n word lengths xᵢ)
Mean Mapper
• Map input is the set of words {w} in the partition
– Key = null Value = w
• Map computes
– Number of words in the partition
– Total length of the words ∑length(w)
• Map output
– <“count”, #words>
– <“length”, #totalLength>
Map (null, w)
Emit(“count”, 1)
Emit(“length”, length(w))
Single Mean Reducer
• Reduce input
– {<key, {value}>}, where
– key = “count”, “length”
– value is an integer
• Reduce computes
– Total number of words: N = sum of all “count” values
– Total length of words: L = sum of all “length” values
• Reduce Output
– <“count”, N>
– <“length”, L>
• The result
– µ = L / N
Reduce(key, {n1, n2, …})
nRes = n1 + n2 + …
Emit(key, nRes)
Analyze ()
read(“part-r-00000”)
print(“mean = ” + L/N)
Mean: Mapper, Reducer
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordMean {
  private final static Text COUNT_KEY = new Text("count");
  private final static Text LENGTH_KEY = new Text("length");
  private final static LongWritable ONE = new LongWritable(1);

  public static class WordMeanMapper
      extends Mapper<Object, Text, Text, LongWritable> {
    public void map(Object key, Text value, Context context
        ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        String word = itr.nextToken();
        context.write(LENGTH_KEY, new LongWritable(word.length()));
        context.write(COUNT_KEY, ONE);
      } } }

  public static class WordMeanReducer
      extends Reducer<Text,LongWritable,Text,LongWritable> {
    public void reduce(Text key, Iterable<LongWritable> values,
        Context context) throws IOException, InterruptedException {
      long sum = 0;                 // long, to match the LongWritable values
      for (LongWritable val : values)
        sum += val.get();
      context.write(key, new LongWritable(sum));
    } }
. . . . . . . . . . . . . . . .
Mean: main()
. . . . . . . . . . . . . . . .
public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  String[] otherArgs = new GenericOptionsParser(
      conf, args).getRemainingArgs();
  if (otherArgs.length != 2) {
    System.err.println("Usage: wordmean <in> <out>");
    System.exit(2);
  }
  Job job = new Job(conf, "word mean");
  job.setJarByClass(WordMean.class);
  job.setMapperClass(WordMeanMapper.class);
  job.setReducerClass(WordMeanReducer.class);
  job.setCombinerClass(WordMeanReducer.class);  // sums are associative
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(LongWritable.class);
  job.setNumReduceTasks(1);
  FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
  Path outputpath = new Path(otherArgs[1]);
  FileOutputFormat.setOutputPath(job, outputpath);
  boolean result = job.waitForCompletion(true);
  analyzeResult(outputpath);
  System.exit(result ? 0 : 1);
}
. . . . . . . . . . . . . . . .
Mean: analyzeResult()
. . . . . . . . . . . . . . . .
private static void analyzeResult(Path outDir) throws IOException {
  FileSystem fs = FileSystem.get(new Configuration());
  Path reduceFile = new Path(outDir, "part-r-00000");
  if (!fs.exists(reduceFile)) return;
  long count = 0, length = 0;
  BufferedReader in =
      new BufferedReader(new InputStreamReader(fs.open(reduceFile)));
  while (in.ready()) {
    StringTokenizer st = new StringTokenizer(in.readLine());
    String key = st.nextToken();
    String value = st.nextToken();
    if (key.equals("count")) count = Long.parseLong(value);
    else if (key.equals("length")) length = Long.parseLong(value);
  }
  in.close();                       // release the HDFS stream
  double average = (double) length / count;
  System.out.println("The mean is: " + average);
}
} // end WordMean
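A typical way to run the job (the jar name and paths are placeholders):

hadoop jar wordmean.jar WordMean /user/demo/input /user/demo/output

The single reducer (setNumReduceTasks(1)) guarantees one output file, part-r-00000, which analyzeResult() then reads.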
MapReduce Implementation
• A single master JobTracker shepherds the distributed herd of TaskTrackers
1. Job scheduling and resource allocation
2. Job monitoring and job lifecycle coordination
3. Cluster health and resource tracking
• Job is defined
– Program: myJob.jar file
– Configuration: conf.xml
– Input, output paths
• JobClient submits the job to the JobTracker
– Calculates and creates splits based on the input
– Writes myJob.jar and conf.xml to HDFS
MapReduce Implementation
• JobTracker divides the job into tasks: one map task per split.
– Assigns a TaskTracker for each task, collocated with the split
• TaskTrackers execute tasks and report status to the JobTracker
– TaskTracker can run multiple map and reduce tasks
– Map and Reduce Slots
• Failed attempts reassigned to other TaskTrackers
• Job execution status and results reported back to the client
• Scheduler lets many jobs run in parallel
Example: Standard Deviation
• Standard deviation
• Input: large text file
• Output: standard deviation σ of word lengths
• Example: σ({dogs, like, cats}) = 0
• How many jobs?
Formula: σ = sqrt( (1/n) · ∑ (xᵢ − µ)² )
Standard Deviation: Hint
(1/n) ∑ (xᵢ − µ)² = (1/n) ∑ xᵢ² − (2µ/n) ∑ xᵢ + µ² = (1/n) ∑ xᵢ² − µ²
So a single job suffices: collect ∑ xᵢ and ∑ xᵢ² in the same pass.
Standard Deviation Mapper
• Map input is the set of words {w} in the partition
– Key = null Value = w
• Map computes
– Number of words in the partition
– Total length of the words ∑length(w)
– The sum of lengths squared ∑length(w)²
• Map output
– <“count”, #words>
– <“length”, #totalLength>
– <“squared”, #sumLengthSquared>
Map (null, w)
Emit(“count”, 1)
Emit(“length”, length(w))
Emit(“squared”, length(w)²)
Standard Deviation Reducer
• Reduce input
– {<key, {value}>}, where
– key = “count”, “length”, “squared”
– value is an integer
• Reduce computes
– Total number of words: N = sum of all “count” values
– Total length of words: L = sum of all “length” values
– Sum of length squares: S = sum of all “squared” values
• Reduce Output
– <“count”, N>
– <“length”, L>
– <“squared”, S>
• The result
– µ = L / N
– σ = sqrt(S / N - µ²)
Reduce(key, {n1, n2, …})
nRes = n1 + n2 + …
Emit(key, nRes)
Analyze ()
read(“part-r-00000”)
print(“mean = ” + L/N)
print(“std.dev = ” +
sqrt(S/N – (L*L)/(N*N)))
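The same final step in Java, with the parentheses made explicit (a sketch, not from the original deck; N, L, S as read from part-r-00000):

// N = number of words, L = total length, S = sum of squared lengths
static void printStats(long N, long L, long S) {
    double mean = (double) L / N;
    double sigma = Math.sqrt((double) S / N - mean * mean);
    System.out.println("mean = " + mean);
    System.out.println("std.dev = " + sigma);
}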
Combiner, Partitioner
• Combiners perform local aggregation before the shuffle & sort phase
– Optimization to reduce data transfers during shuffle
– In the Mean example it reduces the transfer from many key-value pairs to just two per map task
• Partitioners assign intermediate (map) key-value pairs to reducers
– Responsible for dividing up the intermediate key space
– Not used with single Reducer
[Diagram] Pipeline: Input → Map → Combiner → Partitioner → Shuffle & sort → Reduce → Output
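Not in the original deck: a minimal sketch of a custom Partitioner, shown for the key/value types of the Mean example. This is essentially what Hadoop's default HashPartitioner does.

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class ModPartitioner extends Partitioner<Text, LongWritable> {
    @Override
    public int getPartition(Text key, LongWritable value, int numPartitions) {
        // Mask off the sign bit, then map the hash onto [0, numPartitions)
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

It would be plugged in with job.setPartitionerClass(ModPartitioner.class).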
Distributed Sorting
• Sort a dataset that cannot be entirely stored on one node
• Input:
– A set of files of 100-byte records
– The first 10 bytes of each record are the key, the rest is the value
• Output:
– An ordered list of files: f1, … fN
– Each file fi is sorted, and
– if i < j, then k ≤ r for any keys k Є fi and r Є fj
– The concatenation of the files in the given order forms a completely sorted record set
Naïve MapReduce Sorting
• If the output could be stored on one node
• The input to any Reducer is always sorted by key
– Shuffle sorts Map outputs
• One identity Mapper and one identity Reducer would do the trick
– Identity: <k,v> → <k,v>
[Diagram] Input {dogs, like, cats} → one identity Map → shuffle sorts by key → one identity Reduce → output: cats, dogs, like
Naïve Sorting: Multiple Maps
• Multiple identity Mappers and one identity Reducer – same result
– Does not work for multiple Reducers
[Diagram] Three identity Mappers feed a single Reducer; the shuffle still delivers all keys in sorted order: cats, dogs, like
Sorting: Generalization
• Define a hash function, such that
– h: {k} → [1,N]
– Preserves the order: k ≤ s → h(k) ≤ h(s)
– h(k) is a fixed-size prefix of the string k (e.g., the first 2 bytes)
• Identity Mapper
• With a specialized Partitioner
– Computes the hash h(k) of the key and assigns <k,v> to reducer Rh(k) (see the sketch below)
• Identity Reducer
– Number of reducers is N: R1, …, RN
– The inputs of Ri are all pairs <k,v> with h(k) = i
– Ri is an identity reducer, which writes output to HDFS file fi
– Hash function choice guarantees that
keys from fi are less than keys from fj if i < j
• The algorithm was implemented to win Jim Gray’s TeraSort benchmark in 2008
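A sketch of such an order-preserving partitioner, not from the original deck. It assumes binary keys of at least 2 bytes and a reducer count that evenly divides the 2-byte prefix space (65536):

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.mapreduce.Partitioner;

public class PrefixPartitioner extends Partitioner<BytesWritable, BytesWritable> {
    @Override
    public int getPartition(BytesWritable key, BytesWritable value, int numPartitions) {
        byte[] b = key.getBytes();
        // h(k): the first 2 bytes of the key, read as an unsigned integer
        int prefix = ((b[0] & 0xFF) << 8) | (b[1] & 0xFF);
        // Monotone: k <= s implies h(k) <= h(s), so reducer outputs
        // concatenate into one globally sorted record set
        return prefix / (65536 / numPartitions);
    }
}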
Undirected Graphs
• “A Discipline of Programming” E. W. Dijkstra. Ch. 23.
– Good old classics
• Graph is defined by V = {v}, E = {<v,w> | v,w Є V}
• Undirected graph. E is symmetrical, that is <v,w> Є E ≡ <w,v> Є E
• Different representations of E
1. Set of pairs
2. <v, {direct neighbors}>
3. Adjacency matrix
• From representation 1 to representation 2 in one MR job (reducer sketch below)
– Identity Mapper
– Combiner = Reducer
– Reducer joins values for each vertex
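A sketch of that reducer, not from the original deck. It joins the direct neighbors of each vertex, and is also usable as the combiner, since partial neighbor lists concatenate:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class AdjacencyReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text vertex, Iterable<Text> neighbors, Context context)
            throws IOException, InterruptedException {
        // Join <v, w1>, <v, w2>, ... into <v, "w1,w2,...">
        StringBuilder sb = new StringBuilder();
        for (Text w : neighbors) {
            if (sb.length() > 0) sb.append(',');
            sb.append(w.toString());
        }
        context.write(vertex, new Text(sb.toString()));
    }
}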
Connected Components
• Partition the set of nodes V into disjoint subsets V1, …, VN
– V = V1 U … U VN
– No paths using E from Vi to Vj if i ≠ j
– Gi = <Vi, Ei >
• Representation of connected component
– key = min{Vi}
– value = Vi
• Chain of MR jobs
• Initial data representation
– E is partitioned into sets of records (blocks)
– <v,w> Є E → <min(v,w), {v,w}> = <k, C>
MR Connected Components
• Mapper / Reducer Input
– {<k, C>}, where C is a subset of V, k = min(C)
• Mapper
• Reducer
• Iterate. Stop when stabilized (driver sketch below)
Map {<k, C>}
For all <ki, Ci> and <kj, Cj>
if Ci ∩ Cj ≠ ∅ then
C = Ci U Cj
Emit(min(C), C)
Reduce(k, {C1, C2, …})
resC = C1 U C2 U …
Emit(k, resC)
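A hedged sketch of the iteration driver, not from the original deck. It assumes the reducer increments a user-defined counter ("CC", "MERGES") whenever two components are merged, so an iteration with zero merges means the partition has stabilized; the Mapper/Reducer classes are elided.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ConnectedComponentsDriver {
    public static void main(String[] args) throws Exception {
        Path in = new Path(args[0]);
        for (int i = 0; ; i++) {
            Path out = new Path(args[1] + "-iter" + i);
            Job job = new Job(new Configuration(), "connected-components-" + i);
            job.setJarByClass(ConnectedComponentsDriver.class);
            // job.setMapperClass(...); job.setReducerClass(...);  // as on this slide
            FileInputFormat.addInputPath(job, in);
            FileOutputFormat.setOutputPath(job, out);
            if (!job.waitForCompletion(true)) System.exit(1);
            long merges = job.getCounters().findCounter("CC", "MERGES").getValue();
            if (merges == 0) break;   // stabilized
            in = out;                 // this iteration's output feeds the next
        }
    }
}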
The End