Hadoop MapReduce
Introduction
Terminology
Google calls it: Hadoop equivalent:
MapReduce Hadoop MapReduce
GFS HDFS
Bigtable HBase
Chubby ZooKeeper
Some MapReduce Terminology
 Job – A “full program” - an execution of a Mapper and
Reducer across a data set
 Task – An execution of a Mapper or a Reducer on a slice
of data
 a.k.a. Task-In-Progress (TIP)
 Task Attempt – A particular instance of an attempt to
execute a task on a machine
Task Attempts
 A particular task will be attempted at least once,
possibly more times if it crashes
 If the same input causes crashes over and over, that input
will eventually be abandoned
 Multiple attempts at one task may occur in parallel
when speculative execution is turned on (see the sketch below)
 Task ID from TaskInProgress is not a unique identifier;
don’t use it that way
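Speculative execution is a per-job setting. A minimal sketch against the old org.apache.hadoop.mapred API (the driver class name here is illustrative):

import org.apache.hadoop.mapred.JobConf;

public class SpeculationConfig {
  public static void main(String[] args) {
    JobConf conf = new JobConf(SpeculationConfig.class); // illustrative driver class
    // Turn speculative execution off entirely: each task gets only one
    // attempt at a time (useful when tasks have side effects).
    conf.setSpeculativeExecution(false);
    // Or toggle the map and reduce sides independently:
    conf.setMapSpeculativeExecution(true);
    conf.setReduceSpeculativeExecution(false);
  }
}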
MapReduce: High Level
In our case: circe.rc.usf.edu
Nodes, Trackers, Tasks
 Master node runs JobTracker instance, which accepts
Job requests from clients
 TaskTracker instances run on slave nodes
 TaskTracker forks separate Java process for task
instances
Job Distribution
 MapReduce programs are contained in a Java “jar”
file + an XML file containing serialized program
configuration options
 Running a MapReduce job places these files into
the HDFS and notifies TaskTrackers where to
retrieve the relevant program code
 … Where’s the data distribution?
Data Distribution
 Implicit in design of MapReduce!
 All mappers are equivalent, so each node maps
whatever data is stored locally in HDFS
 If too much data piles up on one node,
nearby nodes map it instead
 Data transfer is handled implicitly by HDFS
What Happens In Hadoop?
Depth First
Job Launch Process: Client
 Client program creates a JobConf
 Identify classes implementing Mapper and
Reducer interfaces
 JobConf.setMapperClass(), setReducerClass()
 Specify inputs, outputs
 FileInputFormat.setInputPaths(),
 FileOutputFormat.setOutputPath()
 Optionally, other options too:
 JobConf.setNumReduceTasks(),
JobConf.setOutputFormat()…
Job Launch Process: JobClient
 Pass JobConf to JobClient.runJob() or
submitJob()
 runJob() blocks, submitJob() does not (see the sketch below)
 JobClient:
 Determines proper division of input into InputSplits
 Sends job data to master JobTracker server
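To make the blocking/non-blocking distinction concrete, here is a sketch of both submission styles, assuming a fully configured JobConf (the driver class name and polling interval are illustrative):

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class SubmitStyles {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SubmitStyles.class); // assumed configured elsewhere

    // Blocking style: returns only once the job succeeds (throws on failure).
    JobClient.runJob(conf);

    // Non-blocking style: returns immediately with a handle to poll.
    JobClient client = new JobClient(conf);
    RunningJob handle = client.submitJob(conf);
    while (!handle.isComplete()) {
      System.out.printf("map %.0f%% reduce %.0f%%%n",
          handle.mapProgress() * 100, handle.reduceProgress() * 100);
      Thread.sleep(5000);
    }
  }
}

(Running both styles back-to-back as above would submit the job twice; the sketch only juxtaposes them.)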
Job Launch Process: JobTracker
 JobTracker:
 Inserts jar and JobConf (serialized to XML) in
shared location
 Posts a JobInProgress to its run queue
Job Launch Process: TaskTracker
 TaskTrackers running on slave nodes
periodically query JobTracker for work
 Retrieve job-specific jar and config
 Launch task in separate instance of Java
 main() is provided by Hadoop
Job Launch Process: Task
 TaskTracker.Child.main():
 Sets up the child TaskInProgress attempt
 Reads XML configuration
 Connects back to necessary MapReduce
components via RPC
 Uses TaskRunner to launch user process
Job Launch Process: TaskRunner
 TaskRunner, MapTaskRunner, MapRunner
work in a daisy-chain to launch your Mapper
 Task knows ahead of time which InputSplits it
should be mapping
 Calls Mapper once for each record retrieved from
the InputSplit
 Running the Reducer is much the same
Creating the Mapper
 You provide the instance of Mapper
 Should extend MapReduceBase
 One instance of your Mapper is initialized by
the MapTaskRunner for a TaskInProgress
 Exists in separate process from all other instances
of Mapper – no data sharing!
Mapper
 void map(K1 key,
V1 value,
OutputCollector<K2, V2> output,
Reporter reporter)
 K types implement WritableComparable
 V types implement Writable
What is Writable?
 Hadoop defines its own “box” classes for
strings (Text), integers (IntWritable), etc.
 All values are instances of Writable
 All keys are instances of WritableComparable
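A short sketch of the box classes in use (the values are made up):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class BoxClasses {
  public static void main(String[] args) {
    Text word = new Text("hadoop");          // boxed String
    IntWritable count = new IntWritable(1);  // boxed int

    // Box classes are mutable, so Hadoop can reuse one instance per record
    // instead of allocating a new object each time.
    count.set(count.get() + 1);

    // Text is WritableComparable, so it can serve as a key:
    System.out.println(word.compareTo(new Text("hdfs")));
    System.out.println(word + "\t" + count);
  }
}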
Reading Data
 Data sets are specified by InputFormats
 Defines input data (e.g., a directory)
 Identifies partitions of the data that form an
InputSplit
 Factory for RecordReader objects to extract (k, v)
records from the input source
FileInputFormat and Friends
 TextInputFormat – Treats each ‘\n’-terminated line of a
file as a value
 KeyValueTextInputFormat – Maps ‘\n’-terminated text
lines of “k SEP v”
 SequenceFileInputFormat – Binary file of (k, v) pairs with
some add’l metadata
 SequenceFileAsTextInputFormat – Same, but maps
(k.toString(), v.toString())
Filtering File Inputs
 FileInputFormat will read all files out of a
specified directory and send them to the
mapper
 Delegates filtering this file list to a method
subclasses may override
 e.g., Create your own “xyzFileInputFormat” to
read *.xyz from directory list
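One way to do that filtering without a full subclass is a PathFilter, which FileInputFormat consults when listing input files; a sketch (the filter name and extension are illustrative, and overriding listStatus() in a FileInputFormat subclass is the heavier alternative the slide alludes to):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Accept only files whose names end in ".xyz".
public class XyzPathFilter implements PathFilter {
  public boolean accept(Path path) {
    return path.getName().endsWith(".xyz");
  }
}

// In the driver:
//   FileInputFormat.setInputPathFilter(conf, XyzPathFilter.class);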
Record Readers
 Each InputFormat provides its own
RecordReader implementation
 Provides (unused?) capability multiplexing
 LineRecordReader – Reads a line from a text
file
 KeyValueRecordReader – Used by
KeyValueTextInputFormat
Input Split Size
 FileInputFormat will divide large files into
chunks
 Exact size controlled by mapred.min.split.size
 RecordReaders receive file, offset, and
length of chunk
 Custom InputFormat implementations may
override split size – e.g., “NeverChunkFile”
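The minimum split size can also be set directly, e.g. conf.set("mapred.min.split.size", "134217728") for 128 MB chunks, and the “NeverChunkFile” idea reduces to one overridden method; a sketch, assuming the old mapred TextInputFormat as the base:

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

// Every input file becomes exactly one split, so a single mapper
// always sees the whole file.
public class NeverChunkFileInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false; // never divide files into chunks
  }
}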
Sending Data To Reducers
 Map function receives OutputCollector object
 OutputCollector.collect() takes (k, v) elements
 Any (WritableComparable, Writable) can be
used
 By default, mapper output type assumed to
be same as reducer output type
WritableComparator
 Compares WritableComparable data
 Will call WritableComparable.compareTo() by default
 Can provide fast path for serialized data
 JobConf.setOutputValueGroupingComparator()
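The fast path means comparing keys in their serialized form, without deserializing them first. A sketch for Text keys, mirroring what Text’s built-in raw comparator does (registering it via JobConf.setOutputKeyComparatorClass() is assumed):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.io.WritableUtils;

public class FastTextComparator extends WritableComparator {
  public FastTextComparator() {
    super(Text.class);
  }

  @Override
  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    // A serialized Text is a varint length followed by UTF-8 bytes;
    // skip each length prefix and compare the payloads directly.
    int n1 = WritableUtils.decodeVIntSize(b1[s1]);
    int n2 = WritableUtils.decodeVIntSize(b2[s2]);
    return compareBytes(b1, s1 + n1, l1 - n1, b2, s2 + n2, l2 - n2);
  }
}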
Sending Data To The Client
 Reporter object sent to Mapper allows simple
asynchronous feedback
 incrCounter(Enum key, long amount)
 setStatus(String msg)
 Allows self-identification of input
 InputSplit getInputSplit()
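A sketch of a mapper using both Reporter facilities (the counter names and the malformed-record test are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ReportingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  enum Records { GOOD, MALFORMED } // aggregated per enum value across tasks

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    if (value.toString().isEmpty()) {
      reporter.incrCounter(Records.MALFORMED, 1);
      return;
    }
    reporter.incrCounter(Records.GOOD, 1);
    reporter.setStatus("processing " + reporter.getInputSplit()); // shows in web UI
    output.collect(value, new IntWritable(1));
  }
}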
Partitioner
 int getPartition(key, val, numPartitions)
 Outputs the partition number for a given key
 One partition == values sent to one Reduce task
 HashPartitioner used by default
 Uses key.hashCode() to return partition num
 JobConf sets Partitioner implementation
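A sketch of a custom Partitioner for the old mapred API (the first-character policy is illustrative, not something Hadoop ships):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Route keys to reduce tasks by their first character, so each output
// partition holds keys that share a leading character.
public class FirstCharPartitioner implements Partitioner<Text, IntWritable> {

  public void configure(JobConf job) { } // no per-job setup needed

  public int getPartition(Text key, IntWritable value, int numPartitions) {
    String s = key.toString();
    int c = s.isEmpty() ? 0 : s.charAt(0);
    // Mask the sign bit, as HashPartitioner does, before the modulo.
    return (c & Integer.MAX_VALUE) % numPartitions;
  }
}

// Driver: conf.setPartitionerClass(FirstCharPartitioner.class);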
Reduction
 reduce( K2 key,
Iterator<V2> values,
OutputCollector<K3, V3> output,
Reporter reporter )
 Keys & values sent to one partition all go to
the same reduce task
 Calls are sorted by key – “earlier” keys are
reduced and output before “later” keys
Finally: Writing The Output
OutputFormat
 Analogous to InputFormat
 TextOutputFormat – Writes “key\tvalue\n” strings
to output file
 SequenceFileOutputFormat – Uses a binary
format to pack (k, v) pairs
 NullOutputFormat – Discards output
 Only useful if defining own output methods within
reduce()
Example Program - Wordcount
 map()
 Receives a chunk of text
 Outputs a set of word/count pairs
 reduce()
 Receives a key and all its associated values
 Outputs the key and the sum of the values
package org.myorg;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
public class WordCount {
Wordcount – main( )
public static void main(String[] args) throws Exception {
JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(Map.class);
conf.setReducerClass(Reduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
}
Wordcount – map( )
public static class Map extends MapReduceBase … {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value,
OutputCollector<Text, IntWritable> output, …) … {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
output.collect(word, one);
}
}
}
Wordcount – reduce( )
public static class Reduce extends MapReduceBase … {
public void reduce(Text key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> output, …) … {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
output.collect(key, new IntWritable(sum));
}
}
}
Hadoop Streaming
 Allows you to create and run map/reduce
jobs with any executable
 Similar to Unix pipes, e.g.:
 format is: Input | Mapper | Reducer
 echo "this sentence has five lines" | cat | wc
Hadoop Streaming
 Mapper and Reducer receive data from stdin and output to
stdout
 Hadoop takes care of the transmission of data between the
map/reduce tasks
 It is still the programmer’s responsibility to set the correct
key/value
 Default format: “key\tvalue\n”
 Let’s look at a Python example of a MapReduce word count
program…
Streaming_Mapper.py
import sys

# read in one line of input at a time from stdin
for line in sys.stdin:
    line = line.strip()    # string
    words = line.split()   # list of strings
    # write data on stdout
    for word in words:
        print '%s\t%i' % (word, 1)
Hadoop Streaming
 What are we outputting?
 Example output: “the 1”
 By default, “the” is the key, and “1” is the value
 Hadoop Streaming handles delivering this key/value pair to a
Reducer
 Able to send similar keys to the same Reducer or to an
intermediary Combiner
Streaming_Reducer.py
import sys

wordcount = {}  # empty dictionary

# read in one line of input at a time from stdin
for line in sys.stdin:
    line = line.strip()               # string
    key, value = line.split('\t', 1)  # split on the tab delimiter
    wordcount[key] = wordcount.get(key, 0) + int(value)  # value arrives as a string

# write data on stdout
for word, count in sorted(wordcount.items()):
    print '%s\t%i' % (word, count)
Hadoop Streaming Gotcha
 Streaming Reducer receives single lines
(which are key/value pairs) from stdin
 Regular Reducer receives a collection of all the
values for a particular key
 It is still the case that all the values for a particular
key will go to a single Reducer
Using Hadoop Distributed File System
(HDFS)
 Can access HDFS through various shell
commands (see Further Resources slide for
link to documentation)
 hadoop fs -put <localsrc> … <dst>
 hadoop fs -get <src> <localdst>
 hadoop fs -ls
 hadoop fs -rm <file>
Configuring Number of Tasks
 Normal method
 jobConf.setNumMapTasks(400)
 jobConf.setNumReduceTasks(4)
 Hadoop Streaming method
 -jobconf mapred.map.tasks=400
 -jobconf mapred.reduce.tasks=4
 Note: # of map tasks is only a hint to the framework. Actual
number depends on the number of InputSplits generated
Running a Hadoop Job
 Place input file into HDFS:
 hadoop fs -put ./input-file input-file
 Run either normal or streaming version:
 hadoop jar Wordcount.jar org.myorg.WordCount input-file
output-file
 hadoop jar hadoop-streaming.jar \
-input input-file \
-output output-file \
-file Streaming_Mapper.py \
-mapper "python Streaming_Mapper.py" \
-file Streaming_Reducer.py \
-reducer "python Streaming_Reducer.py"
Submitting to RC’s GridEngine
 Add appropriate modules
 module add apps/jdk/1.6.0_22.x86_64 apps/hadoop/0.20.2
 Use the submit script posted in the Further Resources slide
 Script calls internal functions hadoop_start and hadoop_end
 Adjust the lines for transferring the input file to HDFS and starting the
hadoop job using the commands on the previous slide
 Adjust the expected runtime (generally good practice to overshoot your
estimate)
 #$ -l h_rt=02:00:00
 NOTICE: “All jobs are required to have a hard run-time specification. Jobs
that do not have this specification will have a default run-time of 10 minutes
and will be stopped at that point.”
Output Parsing
 Output of the reduce tasks must be retrieved:
 hadoop fs -get output-file hadoop-output
 This creates a directory of output files, 1 per reduce task
 Output files numbered part-00000, part-00001, etc.
 Sample output of Wordcount
 head -n5 part-00000
“’tis 1
“come 2
“coming 1
“edwin 1
“found 1
Extra Output
 The stdout/stderr streams of Hadoop itself will be stored in an output file
(whichever one is named in the startup script)
 #$ -o output.$job_id
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = svc-3024-8-10.rc.usf.edu/10.250.4.205
…
11/03/02 18:28:47 INFO mapred.FileInputFormat: Total input paths to process : 1
11/03/02 18:28:47 INFO mapred.JobClient: Running job: job_local_0001
…
11/03/02 18:28:48 INFO mapred.MapTask: numReduceTasks: 1
…
11/03/02 18:28:48 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000000_0' done.
11/03/02 18:28:48 INFO mapred.Merger: Merging 1 sorted segments
11/03/02 18:28:48 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total
size: 43927 bytes
11/03/02 18:28:48 INFO mapred.JobClient: map 100% reduce 0%
…
Thank You !!!
For more information, follow us at:
http://vibranttechnologies.co.in/hadoop-classes-in-mumbai.html