First MR job - Inverted Index construction
Map Reduce - Introduction 
•Parallel job processing framework 
•Written in Java 
•Close integration with HDFS 
•Provides: 
–Auto partitioning of a job into sub-tasks 
–Auto retry on failures 
–Linear scalability 
–Locality of task execution 
–Plugin-based framework for extensibility
Let's think scalability 
•Let's go through an exercise of scaling a simple program to process a large data set. 
•Problem: count the number of times each word occurs in a set of documents. 
•Example: only one document with only one sentence – "Do as I say, not as I do." 
•Pseudocode: a multiset is a set where each element also has a count (a runnable sketch follows the pseudocode) 
define wordCount as Multiset; (assume a hash table) 
for each document in documentSet { 
T = tokenize(document); 
for each token in T { 
wordCount[token]++; 
} 
} 
display(wordCount);
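A minimal single-machine sketch of the pseudocode above in Java (the class name, the hard-coded document list, and the whitespace tokenization rule are illustrative assumptions, not part of the original slide):

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LocalWordCount {
  public static void main(String[] args) {
    // documentSet is hard-coded here; a real run would read files instead
    List<String> documentSet = Arrays.asList("Do as I say, not as I do.");
    Map<String, Integer> wordCount = new HashMap<>();   // the multiset, backed by a hash table
    for (String document : documentSet) {
      for (String token : document.toLowerCase().split("\\W+")) {   // tokenize(document)
        if (token.isEmpty()) continue;
        wordCount.merge(token, 1, Integer::sum);                    // wordCount[token]++
      }
    }
    // display(wordCount)
    wordCount.forEach((word, count) -> System.out.println(word + "\t" + count));
  }
}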
How about a billion documents? 
•Looping through all the documents using a single computer will be extremely time consuming. 
•You speed it up by rewriting the program so that it distributes the work over several machines. 
•Each machine will process a distinct fraction of the documents. When all the machines have completed this, a second phase of processing will combine the results of all the machines. 
define wordCount as Multiset; 
for each document in documentSubset { 
T = tokenize(document); 
for each token in T { 
wordCount[token]++; 
} 
} 
sendToSecondPhase(wordCount); 
define totalWordCount as Multiset; 
for each wordCount received from firstPhase { 
multisetAdd(totalWordCount, wordCount); 
}
Problems 
•Where are documents stored? 
–Having more machines for processing only helps up to a certain point— until the storage server can’t keep up. 
–You’ll also need to split up the documents among the set of processing machines such that each machine will process only those documents that are stored in it. 
•wordCount (and totalWordCount) are stored in memory 
–When processing large document sets, the number of unique words can exceed the RAM storage of a machine. 
–Furthermore, phase two has only one machine, which will process the wordCount sent from all the machines in phase one. The single machine in phase two will become the bottleneck. 
•Solution: divide based on expected output! 
–Let's say we have 26 machines for phase two. We assign each machine to only handle wordCount for words beginning with a particular letter of the alphabet (see the sketch below).
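A small sketch of that "divide by expected output" rule, assuming the phase-two machines are numbered 0–25 (the helper name and the fallback for non-alphabetic words are assumptions):

// Hypothetical helper: choose which of the 26 phase-two machines receives a word.
static int phaseTwoMachine(String word) {
  char first = Character.toLowerCase(word.charAt(0));
  if (first < 'a' || first > 'z') {
    return 0;            // non a-z words fall back to machine 0
  }
  return first - 'a';    // 'a' -> machine 0, 'b' -> machine 1, ..., 'z' -> machine 25
}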
Map-Reduce 
•MapReduce programs are executed in two main phases, called 
–mapping and 
–reducing. 
•In the mapping phase, MapReduce takes the input data and feeds each data element to the mapper. 
•In the reducing phase, the reducer processes all the outputs from the mapper and arrives at a final result. 
•The mapper is meant to filter and transform the input into something that the reducer can aggregate over. 
•MapReduce uses lists and (key/value) pairs as its main data primitives.
Map-Reduce 
•Map-Reduce program 
–Based on two functions: Map and Reduce 
–Every Map/Reduce program must specify a Mapper and optionally a Reducer 
–Operates on key and value pairs 
Map-Reduce works like a Unix pipeline: 
cat input | grep | sort | uniq -c | cat > output 
Input | Map | Shuffle & Sort | Reduce | Output 
cat /var/log/auth.log* | grep "session opened" | cut -d' ' -f10 | sort | uniq -c > ~/userlist 
Map function: takes a key/value pair and generates a set of intermediate key/value pairs: map(k1, v1) -> list(k2, v2) 
Reduce function: takes all intermediate values associated with the same intermediate key and merges them: reduce(k2, list(v2)) -> list(k3, v3)
Map-Reduce on Hadoop
Putting things in context 
[Figure: end-to-end data flow. Input files (File 1 … File N) in HDFS are divided into splits (Split 1 … Split M). On each machine (Machine 1 … Machine M) a Record Reader turns a split into (Key, Value) pairs for a map task (Map 1 … Map M); map output passes through an optional combiner (Combiner 1 … Combiner C) and then a partitioner, which assigns each pair to one of P partitions (Partition 1 … Partition P). Reducers (Reducer 1 … Reducer R), each on some machine x, fetch their partitions and write the output files (File 1 … File O) back to HDFS.]
Some MapReduce Terminology 
•Job – A "full program": an execution of a Mapper and Reducer across a data set 
•Task – An execution of a Mapper or a Reducer on a slice of data 
–a.k.a. Task-In-Progress (TIP) 
•Task Attempt – A particular instance of an attempt to execute a task on a machine
Terminology Example 
•Running "Word Count" across 20 files is one job 
•20 files to be mapped imply 20 map tasks + some number of reduce tasks 
•At least 20 map task attempts will be performed… more if a machine crashes, due to speculative execution, etc.
Task Attempts 
•A particular task will be attempted at least once, possibly more times if it crashes 
–If the same input causes crashes over and over, that input will eventually be abandoned 
•Multiple attempts at one task may occur in parallel with speculative execution turned on 
–Task ID from TaskInProgress is not a unique identifier; don’t use it that way
MapReduce: High Level 
[Figure: a MapReduce job is submitted by a client computer to the JobTracker on the master node; each slave node runs a TaskTracker, which runs task instances.]
Node-to-Node Communication 
•Hadoop uses its own RPC protocol 
•All communication begins in slave nodes 
–Prevents circular-wait deadlock 
–Slaves periodically poll for “status” message 
•Classes must provide explicit serialization
Nodes, Trackers, Tasks 
•Master node runs a JobTracker instance, which accepts job requests from clients 
•TaskTracker instances run on slave nodes 
•TaskTracker forks a separate Java process for each task instance
Job Distribution 
•MapReduce programs are contained in a Java “jar” file + an XML file containing serialized program configuration options 
•Running a MapReduce job places these files into HDFS and notifies TaskTrackers where to retrieve the relevant program code 
•… Where’s the data distribution?
Data Distribution 
•Implicit in design of MapReduce! 
–All mappers are equivalent, so each maps whatever data is local to its node in HDFS 
•If lots of data does happen to pile up on the same node, nearby nodes will map instead 
–Data transfer is handled implicitly by HDFS
Configuring With JobConf 
•MR programs have many configurable options 
•JobConf objects hold (key, value) pairs mapping a String name to a value 
–e.g., "mapred.map.tasks" → 20 
–JobConf is serialized and distributed before running the job 
•Objects implementing JobConfigurable can retrieve elements from a JobConf (see the sketch below)
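A brief sketch of how these pieces fit together, using the old org.apache.hadoop.mapred API that the rest of this deck assumes (the class name and the particular key read back are illustrative):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobConfigurable;

public class MyConfigurable implements JobConfigurable {
  private int hintedMapTasks;

  // The framework calls configure() with the deserialized JobConf before the task runs,
  // so any (key, value) set in the driver is visible here.
  public void configure(JobConf conf) {
    hintedMapTasks = conf.getInt("mapred.map.tasks", 2);   // e.g. the key shown above
  }
}

// In the driver: conf.setInt("mapred.map.tasks", 20);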
Job Launch Process: Client 
•Client program creates a JobConf 
–Identify classes implementing the Mapper and Reducer interfaces 
•JobConf.setMapperClass(), setReducerClass() 
–Specify inputs, outputs 
•JobConf.setInputPath(), setOutputPath() 
–Optionally, other options too: 
•JobConf.setNumReduceTasks(), JobConf.setOutputFormat()…
Job Launch Process: JobClient 
•Pass the JobConf to JobClient.runJob() or submitJob() 
–runJob() blocks, submitJob() does not 
•JobClient: 
–Determines proper division of input into InputSplits 
–Sends job data to the master JobTracker server
Job Launch Process: JobTracker 
•JobTracker: 
–Inserts the jar and JobConf (serialized to XML) into a shared location 
–Posts a JobInProgress to its run queue
Job Launch Process: TaskTracker 
•TaskTrackers running on slave nodes periodically query the JobTracker for work 
•Retrieve job-specific jar and config 
•Launch task in separate instance of Java 
–main() is provided by Hadoop
Job Launch Process: Task 
•TaskTracker.Child.main(): 
–Sets up the child TaskInProgress attempt 
–Reads the XML configuration 
–Connects back to necessary MapReduce components via RPC 
–Uses TaskRunner to launch the user process
Job Launch Process: TaskRunner 
•TaskRunner, MapTaskRunner, and MapRunner work in a daisy-chain to launch your Mapper 
–The task knows ahead of time which InputSplits it should be mapping 
–Calls the Mapper once for each record retrieved from the InputSplit 
•Running the Reducer is much the same
Creating the Mapper 
•You provide the instance of Mapper 
–Should extend MapReduceBase 
•One instance of your Mapper is initialized by the MapTaskRunner for a TaskInProgress 
–It exists in a separate process from all other instances of Mapper – no data sharing!
Mapper 
•void map(WritableComparable key, 
Writable value, 
OutputCollector output, 
Reporter reporter)
What is Writable? 
•Hadoop defines its own classes for strings (Text), integers (IntWritable), etc. 
•All values are instances of Writable 
•All keys are instances of WritableComparable (a custom Writable is sketched below)
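A minimal custom Writable, sketched for illustration (the WordOccurrence type and its fields are hypothetical, not part of Hadoop):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical value type: a word together with the line number it came from.
public class WordOccurrence implements Writable {
  private String word;
  private long lineNumber;

  public void write(DataOutput out) throws IOException {    // serialize
    out.writeUTF(word);
    out.writeLong(lineNumber);
  }

  public void readFields(DataInput in) throws IOException { // deserialize, in the same order
    word = in.readUTF();
    lineNumber = in.readLong();
  }
}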
Getting Data To The Mapper 
[Figure: the InputFormat divides each input file into InputSplits; a RecordReader reads each split and feeds records to a Mapper, which produces intermediate output.]
Reading Data 
•Data sets are specified by InputFormats 
–Defines input data (e.g., a directory) 
–Identifies partitions of the data that form an InputSplit 
–Factory for RecordReader objects to extract (k, v) records from the input source
FileInputFormat and Friends 
•TextInputFormat – Treats each '\n'-terminated line of a file as a value 
•KeyValueTextInputFormat – Maps '\n'-terminated text lines of "k SEP v" 
•SequenceFileInputFormat – Binary file of (k, v) pairs with some additional metadata 
•SequenceFileAsTextInputFormat – Same, but maps (k.toString(), v.toString())
Filtering File Inputs 
•FileInputFormat will read all files out of a specified directory and send them to the mapper 
•It delegates filtering of this file list to a method subclasses may override 
–e.g., create your own "xyzFileInputFormat" to read *.xyz files from the directory list (see the sketch below)
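One way to get the same filtering effect without subclassing is an input PathFilter registered on FileInputFormat; a sketch under that assumption (the filter class name is made up):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Accept only files whose names end in ".xyz"; everything else is dropped from the input listing.
public class XyzPathFilter implements PathFilter {
  public boolean accept(Path path) {
    return path.getName().endsWith(".xyz");
  }
}

// In the driver: FileInputFormat.setInputPathFilter(conf, XyzPathFilter.class);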
Record Readers 
•Each InputFormat provides its own RecordReader implementation 
–Provides capability multiplexing 
•LineRecordReader – Reads a line from a text file 
•KeyValueRecordReader – Used by KeyValueTextInputFormat
Input Split Size 
•FileInputFormat will divide large files into chunks 
–Exact size controlled by mapred.min.split.size 
•RecordReaders receive the file, offset, and length of the chunk 
•Custom InputFormat implementations may override split size – e.g., a "NeverChunkFile" format (sketched below)
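A sketch of such a "NeverChunkFile" variant, built by extending the old-API TextInputFormat (the class name is made up):

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

// Each file becomes exactly one split, so one mapper always sees a whole file.
public class NeverChunkTextInputFormat extends TextInputFormat {
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;   // never break a file into smaller chunks
  }
}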
Sending Data To Reducers 
•The map function receives an OutputCollector object 
–OutputCollector.collect() takes (k, v) elements 
•Any (WritableComparable, Writable) pair can be used
WritableComparator 
•Compares WritableComparable data 
–By default falls back to WritableComparable.compareTo() 
–Can provide a fast path for serialized data 
•Registered via JobConf.setOutputValueGroupingComparator() (see the sketch below)
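A sketch of a grouping comparator under those APIs (case-insensitive grouping of Text keys is an invented example, not anything the deck prescribes):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Hypothetical comparator: group Text keys case-insensitively.
public class CaseInsensitiveTextComparator extends WritableComparator {
  public CaseInsensitiveTextComparator() {
    super(Text.class, true);   // true = deserialize keys so the object-based compare() is used
  }

  public int compare(WritableComparable a, WritableComparable b) {
    return a.toString().compareToIgnoreCase(b.toString());
  }
}

// In the driver: conf.setOutputValueGroupingComparator(CaseInsensitiveTextComparator.class);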
Sending Data To The Client 
•The Reporter object sent to the Mapper allows simple asynchronous feedback 
–incrCounter(Enum key, long amount) 
–setStatus(String msg) 
•Allows self-identification of input 
–InputSplit getInputSplit()
Partition And Shuffle 
[Figure: intermediate output from each Mapper passes through a Partitioner; during shuffling every Reducer fetches the intermediates that belong to its partition.]
Partitioner 
•int getPartition(key, val, numPartitions) 
–Returns the partition number for a given key 
–One partition == values sent to one reduce task 
•HashPartitioner is used by default 
–Uses key.hashCode() to compute the partition number 
•The Partitioner implementation is set on the JobConf (see the sketch below)
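A sketch tying this back to the earlier "26 machines" idea: a custom Partitioner that routes words by first letter instead of by hashCode() (the class name and the non-alphabetic fallback are assumptions):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class FirstLetterPartitioner implements Partitioner<Text, IntWritable> {
  public void configure(JobConf job) {
    // nothing to configure
  }

  public int getPartition(Text key, IntWritable value, int numPartitions) {
    char first = Character.toLowerCase(key.toString().charAt(0));
    int bucket = (first >= 'a' && first <= 'z') ? first - 'a' : 0;
    return bucket % numPartitions;   // stay within the configured number of reduce tasks
  }
}

// In the driver: conf.setPartitionerClass(FirstLetterPartitioner.class);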
Reduction 
•reduce(WritableComparable key, 
Iterator values, 
OutputCollector output, 
Reporter reporter) 
•Keys & values sent to one partition all go to the same reduce task 
•Calls are sorted by key – "earlier" keys are reduced and output before "later" keys
Finally: Writing The Output 
[Figure: each Reducer writes its output through a RecordWriter, supplied by the OutputFormat, into its own output file.]
WordCount M/R 
map(String filename, String document) 
{ 
List<String> T = tokenize(document); 
for each token in T { 
emit ((String)token, (Integer) 1); 
} 
} 
reduce(String token, List<Integer> values) 
{ 
Integer sum = 0; 
for each value in values { 
sum = sum + value; 
} 
emit ((String)token, (Integer) sum); 
}
Word Count: Java Mapper 
public static class MapClass extends MapReduceBase 
    implements Mapper<LongWritable, Text, Text, IntWritable> { 
  public void map(LongWritable key, Text value, 
                  OutputCollector<Text, IntWritable> output, 
                  Reporter reporter) throws IOException { 
    String line = value.toString(); 
    StringTokenizer itr = new StringTokenizer(line); 
    while (itr.hasMoreTokens()) { 
      Text word = new Text(itr.nextToken()); 
      output.collect(word, new IntWritable(1));   // emit (word, 1) 
    } 
  } 
}
Word Count: Java Reduce 
public static class Reduce extends MapReduceBase 
    implements Reducer<Text, IntWritable, Text, IntWritable> { 
  public void reduce(Text key, 
                     Iterator<IntWritable> values, 
                     OutputCollector<Text, IntWritable> output, 
                     Reporter reporter) throws IOException { 
    int sum = 0; 
    while (values.hasNext()) { 
      sum += values.next().get();   // add up all the 1s for this word 
    } 
    output.collect(key, new IntWritable(sum)); 
  } 
}
Word Count: Java Driver 
public void run(String inPath, String outPath) 
    throws Exception { 
  JobConf conf = new JobConf(WordCount.class); 
  conf.setJobName("wordcount"); 
  conf.setMapperClass(MapClass.class); 
  conf.setReducerClass(Reduce.class); 
  conf.setOutputKeyClass(Text.class);          // output key type emitted by map and reduce 
  conf.setOutputValueClass(IntWritable.class); // output value type emitted by map and reduce 
  FileInputFormat.addInputPath(conf, new Path(inPath)); 
  FileOutputFormat.setOutputPath(conf, new Path(outPath)); 
  JobClient.runJob(conf); 
}
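For completeness, a sketch of an entry point that wires the run() method above to command-line arguments (the usage message and the assumption that WordCount has a no-argument constructor are mine, not from the deck):

// Hypothetical entry point for the driver shown above.
public static void main(String[] args) throws Exception {
  if (args.length != 2) {
    System.err.println("usage: WordCount <input path> <output path>");
    System.exit(1);
  }
  new WordCount().run(args[0], args[1]);
}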
WordCount with many mappers and one reducer
Job, Task, and Task Attempt IDs 
•The format of a job ID is composed of the time that the jobtracker (not the job) started and an incrementing counter maintained by the jobtracker to uniquely identify the job to that instance of the jobtracker. 
•job_201206111011_0002: 
–is the second (0002; job IDs are 1-based) job run by the jobtracker 
–which started at 10:11 on June 11, 2012 
•Tasks belong to a job, and their IDs are formed by replacing the job prefix of a job ID with a task prefix, and adding a suffix to identify the task within the job. 
•task_201206111011_0002_m_000003: 
–is the fourth (000003; task IDs are 0-based) 
–map (m) task of the job with ID job_201206111011_0002. 
–Task IDs are created for a job when it is initialized, so they do not necessarily dictate the order in which the tasks will be executed. 
•Tasks may be executed more than once, due to failure or speculative execution, so to identify different instances of a task execution, task attempts are given unique IDs on the jobtracker. 
•attempt_201206111011_0002_m_000003_0: 
–is the first (0; attempt IDs are 0-based) attempt at running task task_201206111011_0002_m_000003.
Exercise -description 
•The objectives for this exercise are: 
–Become familiar with decomposing a problem into Map and Reduce stages. 
–Get a sense for how MapReducecan be used in the real world. 
•An inverted index is a mapping of words to their location in a set of documents. Most modern search engines utilize some form of an inverted index to process user-submitted queries. In its most basic form, an inverted index is a simple hash table which maps words in the documents to some sort of document identifier. 
For example, if given the following 2 documents: 
Doc1: Buffalo buffalo buffalo. 
Doc2: Buffalo are mammals. 
we could construct the following inverted file index: 
Buffalo -> Doc1, Doc2 
buffalo -> Doc1 
buffalo. -> Doc1 
are -> Doc2 
mammals. -> Doc2
Exercise -tasks 
•Task 1: (30 min) 
–Write pseudo-code for map and reduce to solve the inverted index problem 
–What are your K1, V1, K2, V2, etc.? 
–"Execute" your pseudo-code on the following example and explain what the Shuffle & Sort stage does with keys and values 
•Task 2: (30 min) 
–Use the distributed Python/Java code and execute it following the instructions 
•Where were the input and output data stored, and in what format? 
•What K1, V1, K2, V2 data types were used? 
•Task 3: (45 min) 
•Some words are so common that their presence in an inverted index is "noise" – they can obfuscate the more interesting properties of that document. For example, the words "the", "a", "and", "of", "in", and "for" occur in almost every English document. How can you determine whether a word is "noisy"? 
–Re-write your pseudo-code with determination (your algorithm) and removal of "noisy" words using the map-reduce framework. 
•Group / individual presentation (45 min)
Example: Inverted Index 
•Input: (filename, text) records 
•Output: list of files containing each word 
•Map: for each word in text.split(): output(word, filename) 
•Combine: unique filenames for each word 
•Reduce: def reduce(word, filenames): output(word, sort(filenames))
Inverted Index 
hamlet.txt: to be or not to be 
12th.txt: be not afraid of greatness 

Map output: 
to, hamlet.txt 
be, hamlet.txt 
or, hamlet.txt 
not, hamlet.txt 
be, 12th.txt 
not, 12th.txt 
afraid, 12th.txt 
of, 12th.txt 
greatness, 12th.txt 

Reduce output: 
afraid, (12th.txt) 
be, (12th.txt, hamlet.txt) 
greatness, (12th.txt) 
not, (12th.txt, hamlet.txt) 
of, (12th.txt) 
or, (hamlet.txt) 
to, (hamlet.txt)
A better example 
•Billions of crawled pages and links 
•Generate an index of words linking to the web URLs in which they occur. 
–Input is split into url -> page (lines of pages) 
–Map looks for words in the lines of a page and puts out (word -> link) pairs 
–Group (k, v) pairs to generate word -> {list of links} 
–Reduce puts out the pairs to output
Search Reverse Index 
public static class MapClass extends MapReduceBase 
    implements Mapper<Text, Text, Text, Text> { 
  private Text word = new Text(); 
  public void map(Text url, Text pageText, 
                  OutputCollector<Text, Text> output, 
                  Reporter reporter) throws IOException { 
    String line = pageText.toString(); 
    StringTokenizer itr = new StringTokenizer(line); 
    while (itr.hasMoreTokens()) { 
      // ignore unwanted and redundant words 
      word.set(itr.nextToken()); 
      output.collect(word, url);   // emit (word, url) 
    } 
  } 
}
Search Reverse Index 
public static class Reduce extends MapReduceBase 
    implements Reducer<Text, Text, Text, Text> { 
  public void reduce(Text word, Iterator<Text> urls, 
                     OutputCollector<Text, Text> output, 
                     Reporter reporter) throws IOException { 
    // concatenate the URLs for this word into a single value 
    StringBuilder links = new StringBuilder(); 
    while (urls.hasNext()) { 
      if (links.length() > 0) links.append(", "); 
      links.append(urls.next().toString()); 
    } 
    output.collect(word, new Text(links.toString())); 
  } 
}
End of session 
Day 1: First MR job - Inverted Index construction