Hadoop Technical Introduction
Presented By

Terminology

Google calls it:    Hadoop equivalent:
MapReduce           Hadoop
GFS                 HDFS
Bigtable            HBase
Chubby              ZooKeeper

Some MapReduce Terminology

• Job – A “full program”: an execution of a Mapper and Reducer across a data set
• Task – An execution of a Mapper or a Reducer on a slice of data
  ◦ a.k.a. Task-In-Progress (TIP)
• Task Attempt – A particular instance of an attempt to execute a task on a machine

Task Attempts

• A particular task will be attempted at least once, possibly more times if it crashes
  ◦ If the same input causes crashes over and over, that input will eventually be abandoned
• Multiple attempts at one task may occur in parallel with speculative execution turned on
  ◦ Task ID from TaskInProgress is not a unique identifier; don’t use it that way

MapReduce: High Level

[Diagram: a MapReduce job is submitted by a client computer to the JobTracker running on the master node; TaskTracker instances on the slave nodes each run the individual task instances. In our case the master node is circe.rc.usf.edu.]

Nodes, Trackers, Tasks

• Master node runs JobTracker instance, which accepts Job requests from clients
• TaskTracker instances run on slave nodes
• TaskTracker forks separate Java process for task instances

Job Distribution

• MapReduce programs are contained in a Java “jar” file + an XML file containing serialized program configuration options
• Running a MapReduce job places these files into HDFS and notifies TaskTrackers where to retrieve the relevant program code
• … Where’s the data distribution?

Data Distribution

• Implicit in the design of MapReduce!
• All mappers are equivalent, so each maps whatever data is local to its node in HDFS
• If lots of data happens to pile up on the same node, nearby nodes will map it instead
• Data transfer is handled implicitly by HDFS

What Happens In Hadoop?
Depth First

Job Launch Process: Client

• Client program creates a JobConf
• Identify classes implementing Mapper and Reducer interfaces: JobConf.setMapperClass(), setReducerClass()
• Specify inputs, outputs: FileInputFormat.setInputPath(), FileOutputFormat.setOutputPath()
• Optionally, other options too: JobConf.setNumReduceTasks(), JobConf.setOutputFormat(), …

Job Launch Process: JobClient

• Pass JobConf to JobClient.runJob() or submitJob()
  ◦ runJob() blocks, submitJob() does not (both are sketched below)
• JobClient:
  ◦ Determines proper division of input into InputSplits
  ◦ Sends job data to master JobTracker server

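For illustration, a minimal driver sketch of the two paths, assuming the classic org.apache.hadoop.mapred API (the class name SubmitExample and the 5-second polling interval are invented):

  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.RunningJob;

  public class SubmitExample {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(SubmitExample.class);
      // ... set mapper/reducer classes and input/output paths here ...

      // Blocking path: prints progress and returns only when the job finishes.
      // JobClient.runJob(conf);

      // Non-blocking path: submitJob() returns a RunningJob handle immediately.
      JobClient client = new JobClient(conf);
      RunningJob handle = client.submitJob(conf);
      while (!handle.isComplete()) {      // poll until the job finishes
        Thread.sleep(5000);
      }
      System.out.println("Job " + (handle.isSuccessful() ? "succeeded" : "failed"));
    }
  }
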
Job Launch Process: JobTracker

• JobTracker:
  ◦ Inserts jar and JobConf (serialized to XML) in shared location
  ◦ Posts a JobInProgress to its run queue

Job Launch Process: TaskTracker

• TaskTrackers running on slave nodes periodically query JobTracker for work
• Retrieve job-specific jar and config
• Launch task in separate instance of Java
  ◦ main() is provided by Hadoop

Job Launch Process: Task

• TaskTracker.Child.main():
  ◦ Sets up the child TaskInProgress attempt
  ◦ Reads XML configuration
  ◦ Connects back to necessary MapReduce components via RPC
  ◦ Uses TaskRunner to launch user process

Job Launch Process: TaskRunner

• TaskRunner, MapTaskRunner, MapRunner work in a daisy-chain to launch your Mapper
  ◦ Task knows ahead of time which InputSplits it should be mapping
  ◦ Calls Mapper once for each record retrieved from the InputSplit
• Running the Reducer is much the same

Creating the Mapper

• You provide the instance of Mapper
  ◦ Should extend MapReduceBase
• One instance of your Mapper is initialized by the MapTaskRunner for a TaskInProgress
• Exists in separate process from all other instances of Mapper – no data sharing!

Mapper

• void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter)
• K types implement WritableComparable
• V types implement Writable

What is Writable?

• Hadoop defines its own “box” classes for strings (Text), integers (IntWritable), etc. (a custom Writable is sketched below)
• All values are instances of Writable
• All keys are instances of WritableComparable

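As a sketch of what one of these box classes looks like from the inside, here is a hypothetical custom value type (the name PointWritable is invented; write() and readFields() are the actual Writable contract):

  import java.io.DataInput;
  import java.io.DataOutput;
  import java.io.IOException;
  import org.apache.hadoop.io.Writable;

  // A hypothetical 2-D point usable as a map/reduce *value*.
  // To serve as a key it would also need to implement WritableComparable.
  public class PointWritable implements Writable {
    private int x;
    private int y;

    public PointWritable() { }                  // no-arg constructor, needed for deserialization
    public PointWritable(int x, int y) { this.x = x; this.y = y; }

    public void write(DataOutput out) throws IOException {
      out.writeInt(x);                          // serialize fields in a fixed order
      out.writeInt(y);
    }

    public void readFields(DataInput in) throws IOException {
      x = in.readInt();                         // deserialize in the same order
      y = in.readInt();
    }
  }
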
Getting Data To The Mapper

[Diagram: the InputFormat divides each input file into InputSplits; a RecordReader reads each InputSplit and feeds (key, value) records to a Mapper, which produces intermediate output.]

Reading Data

• Data sets are specified by InputFormats
  ◦ Defines input data (e.g., a directory)
  ◦ Identifies partitions of the data that form an InputSplit
  ◦ Factory for RecordReader objects to extract (k, v) records from the input source

FileInputFormat and Friends

• TextInputFormat – Treats each ‘\n’-terminated line of a file as a value
• KeyValueTextInputFormat – Maps ‘\n’-terminated text lines of “k SEP v”
• SequenceFileInputFormat – Binary file of (k, v) pairs with some add’l metadata
• SequenceFileAsTextInputFormat – Same, but maps (k.toString(), v.toString())

Filtering File Inputs

• FileInputFormat will read all files out of a specified directory and send them to the mapper
• Delegates filtering this file list to a method subclasses may override
  ◦ e.g., create your own “xyzFileInputFormat” to read *.xyz from directory list (one approach is sketched below)

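One way to express such a filter, sketched against the classic mapred API (the class name XyzPathFilter is invented; this assumes FileInputFormat.setInputPathFilter() as the registration hook):

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.fs.PathFilter;

  // Accept only *.xyz files from the input directory listing.
  public class XyzPathFilter implements PathFilter {
    public boolean accept(Path path) {
      return path.getName().endsWith(".xyz");
    }
  }

  // Registered in the driver with something like:
  //   FileInputFormat.setInputPathFilter(conf, XyzPathFilter.class);
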
Record Readers

• Each InputFormat provides its own RecordReader implementation
  ◦ Provides (unused?) capability multiplexing
• LineRecordReader – Reads a line from a text file
• KeyValueRecordReader – Used by KeyValueTextInputFormat

Input Split Size

• FileInputFormat will divide large files into chunks
  ◦ Exact size controlled by mapred.min.split.size
• RecordReaders receive file, offset, and length of chunk
• Custom InputFormat implementations may override split size – e.g., “NeverChunkFile” (sketched below)

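A minimal sketch of that “NeverChunkFile” idea, assuming the classic mapred TextInputFormat as a base (the class name is invented):

  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.TextInputFormat;

  // An input format whose files are never split: each file becomes
  // exactly one InputSplit, and therefore one map task.
  public class NeverChunkFileInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
      return false;               // never divide a file into chunks
    }
  }
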
Sending Data To Reducers

• Map function receives OutputCollector object
  ◦ OutputCollector.collect() takes (k, v) elements
• Any (WritableComparable, Writable) can be used
• By default, mapper output type assumed to be same as reducer output type (overriding this is sketched below)

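When the types do differ, the intermediate (mapper) types must be declared explicitly on the JobConf; a minimal sketch (Text/IntWritable/LongWritable chosen purely for illustration):

  // Given a JobConf conf (classes from org.apache.hadoop.io):
  conf.setOutputKeyClass(Text.class);               // final (reducer) output types
  conf.setOutputValueClass(IntWritable.class);
  conf.setMapOutputKeyClass(Text.class);            // intermediate (mapper) output types,
  conf.setMapOutputValueClass(LongWritable.class);  // only needed when they differ
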
WritableComparator

• Compares WritableComparable data
  ◦ Will call WritableComparable.compareTo()
  ◦ Can provide fast path for serialized data
• JobConf.setOutputValueGroupingComparator() (a sketch follows below)

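As a rough sketch of a custom grouping comparator (the class name and the group-by-first-letter behavior are invented; this assumes WritableComparator’s (keyClass, createInstances) constructor):

  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.WritableComparable;
  import org.apache.hadoop.io.WritableComparator;

  // Groups reduce-input keys by first letter only, so all values whose
  // keys share a first letter arrive in a single reduce() call.
  public class FirstLetterComparator extends WritableComparator {
    public FirstLetterComparator() {
      super(Text.class, true);    // true = instantiate keys for deserialization
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
      // Assumes non-empty keys.
      return ((Text) a).toString().charAt(0) - ((Text) b).toString().charAt(0);
    }
  }

  // Registered with: conf.setOutputValueGroupingComparator(FirstLetterComparator.class);
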
Sending Data To The Client

• Reporter object sent to Mapper allows simple asynchronous feedback (see the sketch below)
  ◦ incrCounter(Enum key, long amount)
  ◦ setStatus(String msg)
• Allows self-identification of input
  ◦ InputSplit getInputSplit()

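For example, inside map() (a sketch; the MyCounters enum and the malformed-record scenario are invented):

  // An enum declared alongside the job to serve as a counter key:
  enum MyCounters { MALFORMED_RECORDS }

  // Inside map(), when a record fails to parse:
  reporter.incrCounter(MyCounters.MALFORMED_RECORDS, 1);
  reporter.setStatus("skipped a malformed record");
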
Partition And Shuffle

[Diagram: each Mapper’s intermediate output passes through a Partitioner; during shuffling, the intermediates for a given partition are gathered from every mapper and delivered to the corresponding Reducer.]

Partitioner

• int getPartition(key, val, numPartitions)
  ◦ Outputs the partition number for a given key
  ◦ One partition == values sent to one Reduce task
• HashPartitioner used by default
  ◦ Uses key.hashCode() to return partition num
• JobConf sets Partitioner implementation (a custom Partitioner is sketched below)

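A minimal custom Partitioner sketch against the classic mapred API (the class name and the first-letter routing are invented):

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.Partitioner;

  // Routes keys to reduce tasks by their first character, so words
  // starting with the same letter end up in the same output file.
  public class FirstLetterPartitioner implements Partitioner<Text, IntWritable> {
    public void configure(JobConf job) { }   // no per-job setup needed

    public int getPartition(Text key, IntWritable value, int numPartitions) {
      // Mask off the sign bit (as HashPartitioner does) before taking the modulus.
      return (key.toString().charAt(0) & Integer.MAX_VALUE) % numPartitions;
    }
  }

  // Registered with: conf.setPartitionerClass(FirstLetterPartitioner.class);
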
Reduction

• reduce( K2 key, Iterator<V2> values, OutputCollector<K3, V3> output, Reporter reporter )
• Keys & values sent to one partition all go to the same reduce task
• Calls are sorted by key – “earlier” keys are reduced and output before “later” keys

Finally: Writing The Output

[Diagram: each Reducer writes its results through a RecordWriter, supplied by the job’s OutputFormat, into its own output file.]

OutputFormat

• Analogous to InputFormat
• TextOutputFormat – Writes “key \t val \n” strings to output file
• SequenceFileOutputFormat – Uses a binary format to pack (k, v) pairs
• NullOutputFormat – Discards output
  ◦ Only useful if defining own output methods within reduce()

Example Program - Wordcount

• map()
  ◦ Receives a chunk of text
  ◦ Outputs a set of word/count pairs
• reduce()
  ◦ Receives a key and all its associated values
  ◦ Outputs the key and the sum of the values

  package org.myorg;

  import java.io.IOException;
  import java.util.*;

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.conf.*;
  import org.apache.hadoop.io.*;
  import org.apache.hadoop.mapred.*;
  import org.apache.hadoop.util.*;

  public class WordCount {

Wordcount – main( )

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }

Wordcount – map( )

  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

Wordcount – reduce( )

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }
  }

Hadoop Streaming

• Allows you to create and run map/reduce jobs with any executable
• Similar to unix pipes, e.g.:
  ◦ format is: Input | Mapper | Reducer
  ◦ echo "this sentence has five lines" | cat | wc

Hadoop Streaming

• Mapper and Reducer receive data from stdin and output to stdout
• Hadoop takes care of the transmission of data between the map/reduce tasks
  ◦ It is still the programmer’s responsibility to set the correct key/value
  ◦ Default format: “key \t value \n”
• Let’s look at a Python example of a MapReduce word count program…

Streaming_Mapper.py

  import sys

  # read in one line of input at a time from stdin
  for line in sys.stdin:
      line = line.strip()        # string
      words = line.split()       # list of strings
      # write data on stdout
      for word in words:
          print('%s\t%i' % (word, 1))

Hadoop Streaming

• What are we outputting?
  ◦ Example output: “the 1”
  ◦ By default, “the” is the key, and “1” is the value
• Hadoop Streaming handles delivering this key/value pair to a Reducer
  ◦ Able to send similar keys to the same Reducer or to an intermediary Combiner

Streaming_Reducer.py

  import sys

  wordcount = {}   # empty dictionary

  # read in one line of input at a time from stdin
  for line in sys.stdin:
      line = line.strip()            # string
      key, value = line.split()
      wordcount[key] = wordcount.get(key, 0) + int(value)   # value arrives as a string

  # write data on stdout
  for word, count in sorted(wordcount.items()):
      print('%s\t%i' % (word, count))

Hadoop Streaming Gotcha

• Streaming Reducer receives single lines (which are key/value pairs) from stdin
  ◦ Regular Reducer receives a collection of all the values for a particular key
  ◦ It is still the case that all the values for a particular key will go to a single Reducer

Using Hadoop Distributed File System (HDFS)

• Can access HDFS through various shell commands (see Further Resources slide for link to documentation)
  ◦ hadoop fs -put <localsrc> … <dst>
  ◦ hadoop fs -get <src> <localdst>
  ◦ hadoop fs -ls
  ◦ hadoop fs -rm file

Configuring Number of Tasks

• Normal method
  ◦ jobConf.setNumMapTasks(400)
  ◦ jobConf.setNumReduceTasks(4)
• Hadoop Streaming method
  ◦ -jobconf mapred.map.tasks=400
  ◦ -jobconf mapred.reduce.tasks=4
• Note: # of map tasks is only a hint to the framework. Actual number depends on the number of InputSplits generated

Running a Hadoop Job

• Place input file into HDFS:
  ◦ hadoop fs -put ./input-file input-file
• Run either normal or streaming version:
  ◦ hadoop jar Wordcount.jar org.myorg.WordCount input-file output-file
  ◦ hadoop jar hadoop-streaming.jar \
        -input input-file \
        -output output-file \
        -file Streaming_Mapper.py \
        -mapper "python Streaming_Mapper.py" \
        -file Streaming_Reducer.py \
        -reducer "python Streaming_Reducer.py"

Submitting to RC’s GridEngine

• Add appropriate modules
  ◦ module add apps/jdk/1.6.0_22.x86_64 apps/hadoop/0.20.2
• Use the submit script posted in the Further Resources slide
  ◦ Script calls internal functions hadoop_start and hadoop_end
• Adjust the lines for transferring the input file to HDFS and starting the hadoop job using the commands on the previous slide
• Adjust the expected runtime (generally good practice to overshoot your estimate)
  ◦ #$ -l h_rt=02:00:00
• NOTICE: “All jobs are required to have a hard run-time specification. Jobs that do not have this specification will have a default run-time of 10 minutes and will be stopped at that point.”

Output Parsing

• Output of the reduce tasks must be retrieved:
  ◦ hadoop fs -get output-file hadoop-output
• This creates a directory of output files, 1 per reduce task
  ◦ Output files numbered part-00000, part-00001, etc.
• Sample output of Wordcount:
  ◦ head -n5 part-00000

    “’tis 1
    “come 2
    “coming 1
    “edwin 1
    “found 1

Extra Output

• The stdout/stderr streams of Hadoop itself will be stored in an output file (whichever one is named in the startup script)
  ◦ #$ -o output.$job_id

  STARTUP_MSG: Starting NameNode
  STARTUP_MSG: host = svc-3024-8-10.rc.usf.edu/10.250.4.205
  …
  11/03/02 18:28:47 INFO mapred.FileInputFormat: Total input paths to process : 1
  11/03/02 18:28:47 INFO mapred.JobClient: Running job: job_local_0001
  …
  11/03/02 18:28:48 INFO mapred.MapTask: numReduceTasks: 1
  …
  11/03/02 18:28:48 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000000_0' done.
  11/03/02 18:28:48 INFO mapred.Merger: Merging 1 sorted segments
  11/03/02 18:28:48 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 43927 bytes
  11/03/02 18:28:48 INFO mapred.JobClient: map 100% reduce 0%
  …
  11/03/02 18:28:49 INFO mapred.TaskRunner: Task 'attempt_local_0001_r_000000_0' done.
  11/03/02 18:28:49 INFO mapred.JobClient: Job complete: job_local_0001

Thank You