First MR job - Inverted Index construction
Map Reduce - Introduction
•Parallel job processing framework
•Written in Java
•Close integration with HDFS
•Provides:
–Auto partitioning of a job into sub-tasks
–Auto retry on failures
–Linear scalability
–Locality of task execution
–Plugin-based framework for extensibility
Let's think scalability
•Let's go through an exercise of scaling a simple program to process a large data set.
•Problem: count the number of times each word occurs in a set of documents.
•Example: only one document with only one sentence – "Do as I say, not as I do."
•Pseudocode: a multiset is a set where each element also has a count
define wordCount as Multiset; (assume a hash table)
for each document in documentSet {
    T = tokenize(document);
    for each token in T {
        wordCount[token]++;
    }
}
display(wordCount);
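For reference, the same single-machine logic can be written in plain Java. This is a minimal sketch, assuming the document set is already in memory as a list of Strings and that tokenize() is just whitespace splitting:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LocalWordCount {
    // wordCount plays the role of the multiset: token -> number of occurrences
    public static Map<String, Integer> count(List<String> documentSet) {
        Map<String, Integer> wordCount = new HashMap<>();
        for (String document : documentSet) {
            for (String token : document.split("\\s+")) {   // naive tokenize()
                wordCount.merge(token, 1, Integer::sum);     // wordCount[token]++
            }
        }
        return wordCount;
    }
}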
How about a billion documents?
•Looping through all the documents using a single computer would be extremely time consuming.
•You can speed it up by rewriting the program so that it distributes the work over several machines.
•Each machine will process a distinct fraction of the documents. When all the machines have completed this, a second phase of processing will combine the results from all the machines.
define wordCount as Multiset;
for each document in documentSubset {
    T = tokenize(document);
    for each token in T {
        wordCount[token]++;
    }
}
sendToSecondPhase(wordCount);

define totalWordCount as Multiset;
for each wordCount received from firstPhase {
    multisetAdd(totalWordCount, wordCount);
}
Problems
•Where are documents stored?
–Having more machines for processing only helps up to a certain point - until the storage server can't keep up.
–You'll also need to split up the documents among the set of processing machines such that each machine processes only those documents that are stored on it.
•wordCount (and totalWordCount) are stored in memory
–When processing large document sets, the number of unique words can exceed the RAM of a single machine.
–Furthermore, phase two has only one machine, which will process the wordCount sent from all the machines in phase one. The single machine in phase two will become the bottleneck.
•Solution: divide based on expected output!
–Let's say we have 26 machines for phase two. We assign each machine to handle only the wordCount for words beginning with a particular letter of the alphabet, as sketched below.
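A minimal sketch of that routing rule in Java (an illustration only, not part of any framework): a word goes to one of the 26 phase-two machines according to its first letter; non-alphabetic words are lumped onto machine 0 here, which a real scheme would handle more carefully.

class PhaseTwoRouter {
    // route a word to one of 26 phase-two machines by its first letter
    static int phaseTwoMachine(String word) {
        char c = Character.toLowerCase(word.charAt(0));
        return (c >= 'a' && c <= 'z') ? (c - 'a') : 0;   // 0..25 = machine index
    }
}

This is exactly the idea the MapReduce partitioner generalizes later: decide, per key, which second-phase machine receives it.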
Map-Reduce
•MapReduce programs are executed in two main phases, called
–mapping and
–reducing.
•In the mapping phase, MapReduce takes the input data and feeds each data element to the mapper.
•In the reducing phase, the reducer processes all the outputs from the mapper and arrives at a final result.
•The mapper is meant to filter and transform the input into something that the reducer can aggregate over.
•MapReduce uses lists and (key/value) pairs as its main data primitives.
Map-Reduce
•Map-Reduce Program
–Based on two functions: Map and Reduce
–Every Map/Reduce program must specify a Mapper and optionally a Reducer
–Operate on key and value pairs
Map-Reduce works like a Unix pipeline:
cat input | grep | sort | uniq -c | cat > output
Input | Map | Shuffle & Sort | Reduce | Output
cat /var/log/auth.log* | grep "session opened" | cut -d' ' -f10 | sort | uniq -c > ~/userlist
Map function: takes a key/value pair and generates a set of intermediate key/value pairs: map(k1, v1) -> list(k2, v2)
Reduce function: takes the intermediate values associated with the same intermediate key and merges them: reduce(k2, list(v2)) -> list(k3, v3)
Map-Reduce on Hadoop
Putting things in context 
[Figure: input files (File 1 … File N) stored in HDFS are divided into splits (Split 1 … Split M); on each machine (Machine 1 … Machine M) a RecordReader turns its split into (key, value) pairs for a Map task (Map 1 … Map M); optional combiners (Combiner 1 … Combiner C) aggregate map output locally; a Partitioner divides the intermediate pairs into partitions (Partition 1 … Partition P), which are pulled by the Reducers (Reducer 1 … Reducer R) on the reducer machines and written as output files (File 1 … File O) back to HDFS.]
Some MapReduce Terminology
•Job – A "full program" - an execution of a Mapper and Reducer across a data set
•Task – An execution of a Mapper or a Reducer on a slice of data
–a.k.a. Task-In-Progress (TIP)
•Task Attempt – A particular instance of an attempt to execute a task on a machine
Terminology Example
•Running "Word Count" across 20 files is one job
•20 files to be mapped imply 20 map tasks + some number of reduce tasks
•At least 20 map task attempts will be performed… more if a machine crashes, due to speculative execution, etc.
Task Attempts 
•A particular task will be attempted at least once, possibly more times if it crashes 
–If the same input causes crashes over and over, that input will eventually be abandoned 
•Multiple attempts at one task may occur in parallel with speculative execution turned on 
–Task ID from TaskInProgress is not a unique identifier; don’t use it that way
MapReduce: High Level 
[Figure: a MapReduce job submitted by a client computer goes to the JobTracker on the master node; TaskTrackers on the slave nodes each run task instances.]
Node-to-Node Communication 
•Hadoop uses its own RPC protocol 
•All communication begins in slave nodes 
–Prevents circular-wait deadlock 
–Slaves periodically poll for “status” message 
•Classes must provide explicit serialization
Nodes, Trackers, Tasks 
•The master node runs a JobTracker instance, which accepts job requests from clients
•TaskTracker instances run on slave nodes
•The TaskTracker forks a separate Java process for each task instance
Job Distribution 
•MapReduce programs are contained in a Java “jar” file + an XML file containing serialized program configuration options 
•Running a MapReduce job places these files into HDFS and notifies TaskTrackers where to retrieve the relevant program code
•… Where’s the data distribution?
Data Distribution 
•Implicit in design of MapReduce! 
–All mappers are equivalent, so each maps whatever data is local to its node in HDFS
•If lots of data does happen to pile up on the same node, nearby nodes will map instead 
–Data transfer is handled implicitly by HDFS
Configuring With JobConf
•MR programs have many configurable options
•JobConf objects hold (key, value) components mapping a String key to a value
–e.g., "mapred.map.tasks" → 20
–The JobConf is serialized and distributed before running the job
•Objects implementing JobConfigurable can retrieve elements from a JobConf, as in the sketch below
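A minimal sketch (assuming the old org.apache.hadoop.mapred API used throughout these slides): the driver sets options on the JobConf, and a Mapper reads them back through the configure() hook that MapReduceBase provides. The key "my.marker" is hypothetical, purely for illustration.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class ConfiguredMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, NullWritable> {

  private String marker;

  @Override
  public void configure(JobConf job) {
    // Set in the driver with, e.g.:
    //   conf.set("my.marker", "demo");
    //   conf.setNumMapTasks(20);   // the "mapred.map.tasks" hint
    marker = job.get("my.marker", "default");
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, NullWritable> output,
                  Reporter reporter) throws IOException {
    output.collect(new Text(marker + ":" + value.toString()), NullWritable.get());
  }
}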
Job Launch Process: Client 
•The client program creates a JobConf
–Identifies classes implementing the Mapper and Reducer interfaces
•JobConf.setMapperClass(), setReducerClass()
–Specifies inputs and outputs
•JobConf.setInputPath(), setOutputPath()
–Optionally, other options too:
•JobConf.setNumReduceTasks(), JobConf.setOutputFormat()…
Job Launch Process: JobClient 
•Pass the JobConf to JobClient.runJob() or submitJob()
–runJob() blocks; submitJob() does not
•JobClient:
–Determines the proper division of input into InputSplits
–Sends job data to the master JobTracker server
Job Launch Process: JobTracker 
•JobTracker: 
–Inserts the jar and JobConf (serialized to XML) in a shared location
–Posts a JobInProgress to its run queue
Job Launch Process: TaskTracker 
•TaskTrackers running on slave nodes periodically query the JobTracker for work
•Retrieve job-specific jar and config 
•Launch task in separate instance of Java 
–main() is provided by Hadoop
Job Launch Process: Task 
•TaskTracker.Child.main():
–Sets up the child TaskInProgress attempt
–Reads the XML configuration
–Connects back to the necessary MapReduce components via RPC
–Uses TaskRunner to launch the user process
Job Launch Process: TaskRunner 
•TaskRunner, MapTaskRunner, and MapRunner work in a daisy-chain to launch your Mapper
–The task knows ahead of time which InputSplits it should be mapping
–Calls the Mapper once for each record retrieved from the InputSplit
•Running the Reducer is much the same
Creating the Mapper 
•You provide the instance of the Mapper
–It should extend MapReduceBase
•One instance of your Mapper is initialized by the MapTaskRunner for a TaskInProgress
–It exists in a separate process from all other instances of the Mapper – no data sharing!
Mapper 
•void map(WritableComparable key,
          Writable value,
          OutputCollector output,
          Reporter reporter)
What is Writable? 
•Hadoop defines its own classes for strings (Text), integers (IntWritable), etc.
•All values are instances of Writable 
•All keys are instances of WritableComparable
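A value class of your own therefore has to implement Writable's two serialization methods; a key class would implement WritableComparable (adding compareTo()) instead. A minimal, hypothetical sketch:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class PageStats implements Writable {   // hypothetical value type
  private int wordCount;
  private long byteCount;

  public void write(DataOutput out) throws IOException {
    out.writeInt(wordCount);
    out.writeLong(byteCount);
  }

  public void readFields(DataInput in) throws IOException {
    wordCount = in.readInt();
    byteCount = in.readLong();
  }
}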
Getting Data To The Mapper 
[Figure: the InputFormat divides the input files into InputSplits; one RecordReader per split feeds records to a Mapper, which produces intermediate output.]
Reading Data 
•Data sets are specified by InputFormats
–Defines the input data (e.g., a directory)
–Identifies the partitions of the data that form an InputSplit
–Acts as a factory for RecordReader objects that extract (k, v) records from the input source
FileInputFormat and Friends
•TextInputFormat – Treats each '\n'-terminated line of a file as a value
•KeyValueTextInputFormat – Maps '\n'-terminated text lines of "k SEP v"
•SequenceFileInputFormat – Binary file of (k, v) pairs with some additional metadata
•SequenceFileAsTextInputFormat – Same, but maps (k.toString(), v.toString())
Filtering File Inputs 
•FileInputFormat will read all files out of a specified directory and send them to the mapper
•It delegates filtering of this file list to a method subclasses may override
–e.g., create your own "xyzFileInputFormat" to read *.xyz from the directory list, as in the sketch below
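One way to do that kind of filtering (a sketch, assuming the old mapred API) is to register a PathFilter on FileInputFormat so only *.xyz files are handed to the mappers:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

public class XyzFilter implements PathFilter {
  public boolean accept(Path path) {
    return path.getName().endsWith(".xyz");   // keep only *.xyz inputs
  }
}

// In the driver:
//   FileInputFormat.setInputPathFilter(conf, XyzFilter.class);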
Record Readers 
•Each InputFormat provides its own RecordReader implementation
–Provides capability multiplexing
•LineRecordReader – Reads a line from a text file
•KeyValueRecordReader – Used by KeyValueTextInputFormat
Input Split Size 
•FileInputFormat will divide large files into chunks
–The exact size is controlled by mapred.min.split.size
•RecordReaders receive the file, offset, and length of the chunk
•Custom InputFormat implementations may override the split size – e.g., a "NeverChunkFile" format like the sketch below
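A minimal sketch of the "NeverChunkFile" idea (an illustrative name from the slide, not a real Hadoop class): subclass TextInputFormat and report every file as non-splittable, so each file becomes exactly one InputSplit regardless of mapred.min.split.size.

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

public class NeverChunkFileInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;   // one split per file
  }
}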
Sending Data To Reducers 
•The map function receives an OutputCollector object
–OutputCollector.collect() takes (k, v) elements
•Any (WritableComparable, Writable) pair can be used
WritableComparator 
•Compares WritableComparable data
–Will call WritableComparable.compareTo()
–Can provide a fast path for serialized data
•JobConf.setOutputValueGroupingComparator()
Sending Data To The Client 
•The Reporter object sent to the Mapper allows simple asynchronous feedback (see the sketch below)
–incrCounter(Enum key, long amount)
–setStatus(String msg)
•Allows self-identification of input
–InputSplit getInputSplit()
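A hedged sketch of both calls inside a word-count style Mapper; the counter enum EMPTY_LINES is hypothetical, named here only for illustration:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class CountingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  public enum MyCounters { EMPTY_LINES }   // counter group/name shown in the job UI

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    if (value.toString().trim().isEmpty()) {
      reporter.incrCounter(MyCounters.EMPTY_LINES, 1);   // asynchronous feedback
      return;
    }
    reporter.setStatus("processing offset " + key.get());
    output.collect(value, new IntWritable(1));
  }
}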
Partition And Shuffle 
[Figure: each Mapper's intermediate output passes through the Partitioner; shuffling moves each partition's intermediates to the Reducer responsible for it.]
Partitioner 
•int getPartition(key, val, numPartitions)
–Outputs the partition number for a given key
–One partition == values sent to one Reduce task
•HashPartitioner is used by default
–Uses key.hashCode() to return the partition number
•JobConf sets the Partitioner implementation (a minimal sketch follows)
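A minimal custom Partitioner sketch (old mapred API) that mirrors what the default HashPartitioner does: mask off the sign bit of hashCode() and take it modulo the number of reduce tasks.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class SimpleHashPartitioner implements Partitioner<Text, IntWritable> {
  public void configure(JobConf job) { }   // required by JobConfigurable; nothing to do

  public int getPartition(Text key, IntWritable value, int numPartitions) {
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}

// Registered in the driver with:
//   conf.setPartitionerClass(SimpleHashPartitioner.class);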
Reduction 
•reduce(WritableComparable key,
        Iterator values,
        OutputCollector output,
        Reporter reporter)
•Keys & values sent to one partition all go to the same reduce task 
•Calls are sorted by key –“earlier” keys are reduced and output before “later” keys
Finally: Writing The Output 
[Figure: each Reducer writes its results through a RecordWriter, supplied by the OutputFormat, to its own output file.]
WordCount M/R
map(String filename, String document) {
    List<String> T = tokenize(document);
    for each token in T {
        emit((String) token, (Integer) 1);
    }
}

reduce(String token, List<Integer> values) {
    Integer sum = 0;
    for each value in values {
        sum = sum + value;
    }
    emit((String) token, (Integer) sum);
}
Word Count: Java Mapper
public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      Text word = new Text(itr.nextToken());
      output.collect(word, new IntWritable(1));
    }
  }
}
Word Count: Java Reduce
public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key,
                     Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
Word Count: Java Driver
public void run(String inPath, String outPath)
    throws Exception {
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");
  conf.setMapperClass(MapClass.class);
  conf.setReducerClass(Reduce.class);
  FileInputFormat.addInputPath(conf, new Path(inPath));
  FileOutputFormat.setOutputPath(conf, new Path(outPath));
  JobClient.runJob(conf);
}
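The slide omits the entry point; a minimal main() wrapper, assuming the enclosing driver class is named WordCount, would look like the sketch below and could be launched with something like "hadoop jar wordcount.jar WordCount /input /output":

public static void main(String[] args) throws Exception {
  new WordCount().run(args[0], args[1]);
}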
WordCount with many mappers and one reducer
Job, Task, and Task Attempt IDs 
•The format of a job ID is composed of the time that the jobtracker (not the job) started and an incrementing counter maintained by the jobtracker to uniquely identify the job to that instance of the jobtracker.
•job_201206111011_0002:
–is the second (0002; job IDs are 1-based) job run by the jobtracker
–which started at 10:11 on June 11, 2012
•Tasks belong to a job, and their IDs are formed by replacing the job prefix of a job ID with a task prefix, and adding a suffix to identify the task within the job.
•task_201206111011_0002_m_000003:
–is the fourth (000003; task IDs are 0-based)
–map (m) task of the job with ID job_201206111011_0002.
–The task IDs are created for a job when it is initialized, so they do not necessarily dictate the order in which the tasks will be executed.
•Tasks may be executed more than once, due to failure or speculative execution, so to identify different instances of a task execution, task attempts are given unique IDs on the jobtracker.
•attempt_201206111011_0002_m_000003_0:
–is the first (0; attempt IDs are 0-based) attempt at running task task_201206111011_0002_m_000003.
Exercise - description
•The objectives for this exercise are:
–Become familiar with decomposing a problem into Map and Reduce stages.
–Get a sense for how MapReduce can be used in the real world.
•An inverted index is a mapping of words to their locations in a set of documents. Most modern search engines utilize some form of an inverted index to process user-submitted queries. In its most basic form, an inverted index is a simple hash table which maps words in the documents to some sort of document identifier.
For example, if given the following 2 documents:
Doc1: Buffalo buffalo buffalo.
Doc2: Buffalo are mammals.
we could construct the following inverted file index:
Buffalo -> Doc1, Doc2
buffalo -> Doc1
buffalo. -> Doc1
are -> Doc2
mammals. -> Doc2
Exercise - tasks
•Task 1: (30 min)
–Write pseudo-code for map and reduce to solve the inverted index problem
–What are your K1, V1, K2, V2, etc.?
–"Execute" your pseudo-code on the following example and explain what the Shuffle & Sort stage does with keys and values
•Task 2: (30 min)
–Use the distributed Python/Java code and execute it following the instructions
•Where were the input and output data stored, and in what format?
•What K1, V1, K2, V2 data types were used?
•Task 3: (45 min)
•Some words are so common that their presence in an inverted index is "noise" -- they can obfuscate the more interesting properties of that document. For example, the words "the", "a", "and", "of", "in", and "for" occur in almost every English document. How can you determine whether a word is "noisy"?
–Re-write your pseudo-code to determine (your algorithm) and remove "noisy" words using the map-reduce framework.
•Group / individual presentation (45 min)
Example: Inverted Index
•Input: (filename, text) records
•Output: list of files containing each word
•Map: for each word in text.split(): output(word, filename)
•Combine: unique filenames for each word
•Reduce: def reduce(word, filenames): output(word, sort(filenames))
Inverted Index
Input files:
hamlet.txt: to be or not to be
12th.txt: be not afraid of greatness
Map output (word, filename):
to, hamlet.txt
be, hamlet.txt
or, hamlet.txt
not, hamlet.txt
be, 12th.txt
not, 12th.txt
afraid, 12th.txt
of, 12th.txt
greatness, 12th.txt
Reduce output (word, list of files):
afraid, (12th.txt)
be, (12th.txt, hamlet.txt)
greatness, (12th.txt)
not, (12th.txt, hamlet.txt)
of, (12th.txt)
or, (hamlet.txt)
to, (hamlet.txt)
A better example
•Billions of crawled pages and links
•Generate an index of words linking to the web URLs in which they occur.
–Input is split into url -> page pairs (lines of pages)
–Map looks for words in the lines of a page and puts out word -> link pairs
–Group (k, v) pairs to generate word -> {list of links}
–Reduce puts out the pairs to output
Search Reverse Index
public static class MapClass extends MapReduceBase
    implements Mapper<Text, Text, Text, Text> {
  private Text word = new Text();
  public void map(Text url, Text pageText,
                  OutputCollector<Text, Text> output,
                  Reporter reporter) throws IOException {
    String line = pageText.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      // ignore unwanted and redundant words
      word.set(itr.nextToken());
      output.collect(word, url);
    }
  }
}
Search Reverse Index
public static class Reduce extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {
  public void reduce(Text word, Iterator<Text> urls,
                     OutputCollector<Text, Text> output,
                     Reporter reporter) throws IOException {
    StringBuilder list = new StringBuilder();
    while (urls.hasNext()) {
      if (list.length() > 0) list.append(", ");
      list.append(urls.next().toString());
    }
    output.collect(word, new Text(list.toString()));
  }
}
End of session
Day 1: First MR job - Inverted Index construction
