MapReduce in operation
Can I choose how many maps to be executed?
• The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files.
• The right level of parallelism for maps seems to be around 10-100 maps per node, although it has been set as high as 300 maps for very CPU-light map tasks.
• Task setup takes a while, so it is best if the maps take at least a minute to execute.
• Thus, if you expect 10 TB of input data and have a block size of 128 MB, you'll end up with about 82,000 maps, unless setNumMapTasks(int) (which only provides a hint to the framework) is used to set it even higher.
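The arithmetic behind that last bullet can be sketched in plain Java (a hypothetical helper, not a Hadoop API — the framework derives this from the InputFormat):

```java
public class MapCountEstimator {
    // One map task per input block: ceil(totalInputBytes / blockSizeBytes).
    static long estimateMaps(long totalInputBytes, long blockSizeBytes) {
        return (totalInputBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        // 10 TB of input with a 128 MB block size -> 81,920 maps (~82,000).
        System.out.println(estimateMaps(10L << 40, 128L << 20));
    }
}
```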
Reducer
• Reducer reduces a set of intermediate values which share a key to a smaller set of values.
• The number of reduces for the job is set by the user via JobConf.setNumReduceTasks(int).
• Reducer implementations are passed the JobConf for the job via the JobConfigurable.configure(JobConf) method and can override that method to initialize themselves.
• The framework then calls the reduce(WritableComparable, Iterator, OutputCollector, Reporter) method for each <key, (list of values)> pair in the grouped inputs.
• Applications can then override the Closeable.close() method to perform any required cleanup.
• Reducer has 3 primary phases: shuffle, sort and reduce.
Reducer: Sort
• In this stage the framework groups Reducer inputs by key (since different mappers may have output the same key).
• The shuffle and sort phases occur simultaneously; while map outputs are being fetched they are merged.
• Secondary Sort: If the equivalence rules for grouping the intermediate keys are required to be different from those for grouping keys before reduction, one may specify a Comparator via JobConf.setOutputValueGroupingComparator(Class). Since JobConf.setOutputKeyComparatorClass(Class) can be used to control how intermediate keys are sorted, the two can be used in conjunction to simulate a secondary sort on values.
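The sort-vs-group distinction can be shown in pure Java (no Hadoop dependencies; Pair and the two comparators are illustrative stand-ins for the composite key and the two comparator classes): sort by the full composite key, but group by the natural key only, so each group's values arrive in a chosen order.

```java
import java.util.*;

public class SecondarySortSketch {
    record Pair(String key, int value) {}

    // Sort comparator (the role of setOutputKeyComparatorClass):
    // natural key ascending, then value descending.
    static final Comparator<Pair> SORT = Comparator
            .comparing(Pair::key)
            .thenComparing(Comparator.comparingInt(Pair::value).reversed());

    // Grouping (the role of setOutputValueGroupingComparator): natural key only,
    // so within each group the values arrive largest-first.
    static Map<String, List<Integer>> sortAndGroup(List<Pair> pairs) {
        List<Pair> sorted = new ArrayList<>(pairs);
        sorted.sort(SORT);
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        for (Pair p : sorted) {
            groups.computeIfAbsent(p.key(), k -> new ArrayList<>()).add(p.value());
        }
        return groups;
    }
}
```

With this arrangement a reducer can read, say, the maximum value per key from the first element of each group without scanning all values.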
Reducer: Reduce
• In this phase the reduce(WritableComparable, Iterator, OutputCollector, Reporter) method is called for each <key, (list of values)> pair in the grouped inputs.
• The output of the reduce task is typically written to the FileSystem via OutputCollector.collect(WritableComparable, Writable).
• Applications can use the Reporter to report progress, set application-level status messages and update Counters, or just indicate that they are alive.
• The output of the Reducer is not sorted.
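The reduce phase boils down to "one call per key with all of that key's values". A minimal pure-Java analogue of that contract (reduceSums is a hypothetical helper summing values, not a Hadoop API):

```java
import java.util.*;

public class ReduceSketch {
    // One "reduce" per <key, (list of values)> pair: here, sum the values.
    static Map<String, Integer> reduceSums(Map<String, List<Integer>> grouped) {
        Map<String, Integer> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;   // the body of reduce()
            out.put(e.getKey(), sum);              // the OutputCollector.collect() step
        }
        return out;
    }
}
```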
How many reducers?
• The right number of reduces seems to be between 0.95 and 1.75 multiplied by (<no. of nodes> * mapred.tasktracker.reduce.tasks.maximum).
– With 0.95, all of the reduces can launch immediately and start transferring map outputs as the maps finish.
– With 1.75, the faster nodes will finish their first round of reduces and launch a second wave of reduces, doing a much better job of load balancing.
• Increasing the number of reduces
– increases the framework overhead,
– but
• improves load balancing and
• lowers the cost of failures.
• The scaling factors above are slightly less than whole numbers to reserve a few reduce slots in the framework for speculative tasks and failed tasks.
• It is legal to set the number of reduce tasks to zero if no reduction is desired.
• In this case the outputs of the map tasks go directly to the FileSystem, into the output path set by setOutputPath(Path).
• Again, the framework does not sort the map outputs before writing them out to the FileSystem.
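The heuristic above is simple enough to express as a one-liner (a hypothetical helper; you would pass the result to JobConf.setNumReduceTasks(int)):

```java
public class ReduceCountHeuristic {
    // reduces ≈ factor * nodes * reduce slots per node,
    // with factor 0.95 (single wave) or 1.75 (two waves).
    static int reduces(double factor, int nodes, int reduceSlotsPerNode) {
        return (int) Math.round(factor * nodes * reduceSlotsPerNode);
    }

    public static void main(String[] args) {
        // 10 nodes with 2 reduce slots each: 19 reduces (one wave) or 35 (two waves).
        System.out.println(reduces(0.95, 10, 2) + " " + reduces(1.75, 10, 2));
    }
}
```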
Partitioner
• Partitioner controls the partitioning of the keys of the intermediate map outputs.
• The key (or a subset of the key) is used to derive the partition, typically by a hash function.
• The total number of partitions is the same as the number of reduce tasks for the job.
• Hence this controls which of the m reduce tasks the intermediate key (and hence the record) is sent to for reduction.
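The default hash partitioning can be sketched in plain Java (mirroring the formula used by Hadoop's HashPartitioner; the mask with Integer.MAX_VALUE keeps the result non-negative even for keys with negative hash codes):

```java
public class HashPartitionSketch {
    // partition = (hash & Integer.MAX_VALUE) % numReduceTasks
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

Because the partition depends only on the key, every record with a given key lands on the same reduce task.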
Reporter
• A facility for MapReduce applications to report progress, set application-level status messages and update Counters.
• Where the application takes a significant amount of time to process individual key/value pairs, reporting progress is crucial: otherwise the framework might assume that the task has timed out and kill it.
OutputCollector
• A generalization of the facility provided by the MapReduce framework to collect data output by the Mapper or the Reducer (either the intermediate outputs or the output of the job).
Job Configuration
• JobConf represents a MapReduce job configuration.
• JobConf is the primary interface for a user to describe a MapReduce job to the Hadoop framework for execution.
• The framework tries to faithfully execute the job as described by JobConf.
• JobConf is typically used to specify the
– Mapper,
– Combiner (if any),
– Partitioner,
– Reducer,
– InputFormat,
– OutputFormat,
– OutputCommitter,
– set of input files, and
– where the output files should be written.
• Optionally, JobConf is used to specify other advanced facets of the job, such as
– the Comparator to be used,
– files to be put in the DistributedCache,
– whether intermediate and/or job outputs are to be compressed (and how),
– debugging via user-provided scripts,
– whether job tasks can be executed in a speculative manner,
– the maximum number of attempts per task, and
– the percentage of task failures which can be tolerated by the job.
• Applications can also use set(String, String)/get(String, String) to set/get arbitrary parameters they need.
Job Input
• InputFormat describes the input-specification for a MapReduce job.
• The MapReduce framework relies on the InputFormat of the job to:
– Validate the input-specification of the job.
– Split up the input file(s) into logical InputSplit instances, each of which is then assigned to an individual Mapper.
– Provide the RecordReader implementation used to extract input records from the logical InputSplit for processing by the Mapper.
• TextInputFormat is the default InputFormat.
• If TextInputFormat is the InputFormat for a given job, the framework detects input files with the .gz extension and automatically decompresses them.
• InputSplit represents the data to be processed by an individual Mapper. FileSplit is the default InputSplit.
• RecordReader reads <key, value> pairs from an InputSplit.
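The logical-split computation can be sketched in pure Java (Split and computeSplits are illustrative stand-ins, not the FileSplit API): each Mapper is handed an offset/length window into the input file.

```java
import java.util.*;

public class SplitSketch {
    record Split(long start, long length) {}

    // Divide a fileLen-byte input into logical splits of at most splitSize bytes,
    // mirroring how block-aligned FileSplit boundaries are laid out.
    static List<Split> computeSplits(long fileLen, long splitSize) {
        List<Split> splits = new ArrayList<>();
        for (long start = 0; start < fileLen; start += splitSize) {
            splits.add(new Split(start, Math.min(splitSize, fileLen - start)));
        }
        return splits;
    }
}
```

The RecordReader then takes care of record boundaries that straddle two splits, so each Mapper sees only whole records.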
Job Output
• OutputFormat describes the output-specification for a MapReduce job.
• The MapReduce framework relies on the OutputFormat of the job to:
– Validate the output-specification of the job; for example, check that the output directory doesn't already exist.
– Provide the RecordWriter implementation used to write the output files of the job. Output files are stored in a FileSystem.
• TextOutputFormat is the default OutputFormat.
IsolationRunner
• IsolationRunner is a utility to help debug MapReduce programs.
• To use the IsolationRunner, first set keep.failed.task.files to true (also see keep.task.files.pattern).
• Next, go to the node on which the failed task ran, go to the TaskTracker's local directory, and run the IsolationRunner:
$ cd <local path>/taskTracker/${taskid}/work
$ bin/hadoop org.apache.hadoop.mapred.IsolationRunner ../job.xml
• IsolationRunner will run the failed task in a single JVM, which can be in the debugger, over precisely the same input.
• IsolationRunner will only re-run map tasks.
Debugging
• The MapReduce framework provides a facility to run user-provided scripts for debugging.
• The script is given access to the task's stdout and stderr outputs, syslog and jobconf.
• The output from the debug script's stdout and stderr is displayed on the console diagnostics and also as part of the job UI.
• The user needs to use DistributedCache to distribute and symlink the script file.
Counters
• There are often things you would like to know about the data you are analyzing that are peripheral to the analysis you are performing.
• For example, you might count invalid records and discover that the proportion of invalid records is very high; perhaps there is a bug in the part of the program that detects invalid records?
• Counters are a useful channel for gathering statistics about the job:
– for quality control,
– or for application-level statistics.
• It is often better to use a counter, rather than a log message, to record that a particular condition occurred.
• Hadoop maintains some built-in counters for every job, which report various metrics for your job.
• For example, there are counters for the number of bytes and records processed, which allows you to confirm that the expected amount of input was consumed and the expected amount of output was produced.
• Example use case: controlling the number of recursions.
User-Defined Java Counters
• MapReduce allows user code to define a set of counters, which are then incremented as desired in the mapper or reducer.
• Counters are defined by a Java enum, which serves to group related counters.
• A job may define an arbitrary number of enums, each with an arbitrary number of fields.
• The name of the enum is the group name, and the enum's fields are the counter names.
• Counters are global: the MapReduce framework aggregates them across all maps and reduces to produce a grand total at the end of the job.
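That global aggregation can be mimicked in plain Java (aggregate is a hypothetical helper; in a real job the framework performs this merging for you): each task keeps its own per-enum counts, and the grand total is the sum of the per-task maps.

```java
import java.util.*;

public class CounterAggregation {
    enum Temperature { MISSING, MALFORMED }

    // Merge each task's counter map into a single grand total,
    // the way the framework totals counters across all maps and reduces.
    static Map<Temperature, Long> aggregate(List<Map<Temperature, Long>> perTask) {
        Map<Temperature, Long> total = new EnumMap<>(Temperature.class);
        for (Map<Temperature, Long> task : perTask) {
            for (Map.Entry<Temperature, Long> e : task.entrySet()) {
                total.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        return total;
    }
}
```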
Example of user-defined counters
public class MaxTemperatureWithCounters extends Configured implements Tool {
  enum Temperature {
    MISSING,
    MALFORMED
  }
  static class MaxTemperatureMapperWithCounters extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    // parser field (declaration not shown on the original slide)
    private NcdcRecordParser parser = new NcdcRecordParser();
    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      parser.parse(value);
      if (parser.isValidTemperature()) {
        // do something
      } else if (parser.isMalformedTemperature()) {
        reporter.incrCounter(Temperature.MALFORMED, 1);
      } else if (parser.isMissingTemperature()) {
        reporter.incrCounter(Temperature.MISSING, 1);
      }
      // dynamic counter: group and counter names given as strings at runtime
      reporter.incrCounter("TemperatureQuality", parser.getQuality(), 1);
      ...
Getting built-in counters
Counters counters = job.getCounters();
for (CounterGroup group : counters) {
  System.out.println("* Counter Group: " + group.getDisplayName() + " (" + group.getName() + ")");
  System.out.println("  number of counters in this group: " + group.size());
  for (Counter counter : group) {
    System.out.println("  - " + counter.getDisplayName() + ": " + counter.getName() + ": " + counter.getValue());
  }
}
End of session
Day 2: MapReduce in operation

More Related Content

What's hot (20)

Mapreduce advanced
Mapreduce advancedMapreduce advanced
Mapreduce advanced
 
Hadoop Map Reduce Arch
Hadoop Map Reduce ArchHadoop Map Reduce Arch
Hadoop Map Reduce Arch
 
Introduction to map reduce
Introduction to map reduceIntroduction to map reduce
Introduction to map reduce
 
06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clustering06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clustering
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Hadoop performance optimization tips
Hadoop performance optimization tipsHadoop performance optimization tips
Hadoop performance optimization tips
 
Map reduce in Hadoop
Map reduce in HadoopMap reduce in Hadoop
Map reduce in Hadoop
 
Introduction to MapReduce
Introduction to MapReduceIntroduction to MapReduce
Introduction to MapReduce
 
Relational Algebra and MapReduce
Relational Algebra and MapReduceRelational Algebra and MapReduce
Relational Algebra and MapReduce
 
Hadoop Map Reduce
Hadoop Map ReduceHadoop Map Reduce
Hadoop Map Reduce
 
Hadoop exercise
Hadoop exerciseHadoop exercise
Hadoop exercise
 
MapReduce
MapReduceMapReduce
MapReduce
 
Topic 6: MapReduce Applications
Topic 6: MapReduce ApplicationsTopic 6: MapReduce Applications
Topic 6: MapReduce Applications
 
Run time administration
Run time administrationRun time administration
Run time administration
 
Mapreduce by examples
Mapreduce by examplesMapreduce by examples
Mapreduce by examples
 
Anatomy of classic map reduce in hadoop
Anatomy of classic map reduce in hadoop Anatomy of classic map reduce in hadoop
Anatomy of classic map reduce in hadoop
 
The google MapReduce
The google MapReduceThe google MapReduce
The google MapReduce
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Large Scale Data Analysis with Map/Reduce, part I
Large Scale Data Analysis with Map/Reduce, part ILarge Scale Data Analysis with Map/Reduce, part I
Large Scale Data Analysis with Map/Reduce, part I
 
Chapter 7 Run Time Environment
Chapter 7   Run Time EnvironmentChapter 7   Run Time Environment
Chapter 7 Run Time Environment
 

Viewers also liked

Hadoop Distriubted File System (HDFS) presentation 27- 5-2015
Hadoop Distriubted File System (HDFS) presentation 27- 5-2015Hadoop Distriubted File System (HDFS) presentation 27- 5-2015
Hadoop Distriubted File System (HDFS) presentation 27- 5-2015Abdul Nasir
 
Introduction hadoop adminisrtation
Introduction hadoop adminisrtationIntroduction hadoop adminisrtation
Introduction hadoop adminisrtationAnjalli Pushpa
 
02 Hadoop deployment and configuration
02 Hadoop deployment and configuration02 Hadoop deployment and configuration
02 Hadoop deployment and configurationSubhas Kumar Ghosh
 
Apache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsApache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsHortonworks
 
Configuring Your First Hadoop Cluster On EC2
Configuring Your First Hadoop Cluster On EC2Configuring Your First Hadoop Cluster On EC2
Configuring Your First Hadoop Cluster On EC2benjaminwootton
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop TutorialEdureka!
 

Viewers also liked (8)

Hadoop Distriubted File System (HDFS) presentation 27- 5-2015
Hadoop Distriubted File System (HDFS) presentation 27- 5-2015Hadoop Distriubted File System (HDFS) presentation 27- 5-2015
Hadoop Distriubted File System (HDFS) presentation 27- 5-2015
 
Introduction hadoop adminisrtation
Introduction hadoop adminisrtationIntroduction hadoop adminisrtation
Introduction hadoop adminisrtation
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
02 Hadoop deployment and configuration
02 Hadoop deployment and configuration02 Hadoop deployment and configuration
02 Hadoop deployment and configuration
 
Hadoop 2.0 handout 5.0
Hadoop 2.0 handout 5.0Hadoop 2.0 handout 5.0
Hadoop 2.0 handout 5.0
 
Apache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsApache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data Applications
 
Configuring Your First Hadoop Cluster On EC2
Configuring Your First Hadoop Cluster On EC2Configuring Your First Hadoop Cluster On EC2
Configuring Your First Hadoop Cluster On EC2
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop Tutorial
 

Similar to Hadoop map reduce in operation

Hadoop first mr job - inverted index construction
Hadoop first mr job - inverted index constructionHadoop first mr job - inverted index construction
Hadoop first mr job - inverted index constructionSubhas Kumar Ghosh
 
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdf
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdfmodule3part-1-bigdata-230301002404-3db4f2a4 (1).pdf
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdfTSANKARARAO
 
Mapreduce script
Mapreduce scriptMapreduce script
Mapreduce scriptHaripritha
 
Map reduce in Hadoop BIG DATA ANALYTICS
Map reduce in Hadoop BIG DATA ANALYTICSMap reduce in Hadoop BIG DATA ANALYTICS
Map reduce in Hadoop BIG DATA ANALYTICSArchana Gopinath
 
Hadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.comHadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.comsoftwarequery
 
MapReduce.pptx
MapReduce.pptxMapReduce.pptx
MapReduce.pptxSheba41
 
An Introduction to TensorFlow architecture
An Introduction to TensorFlow architectureAn Introduction to TensorFlow architecture
An Introduction to TensorFlow architectureMani Goswami
 
Apache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce TutorialApache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce TutorialFarzad Nozarian
 
Optimizing Performance - Clojure Remote - Nikola Peric
Optimizing Performance - Clojure Remote - Nikola PericOptimizing Performance - Clojure Remote - Nikola Peric
Optimizing Performance - Clojure Remote - Nikola PericNik Peric
 
Big data unit iv and v lecture notes qb model exam
Big data unit iv and v lecture notes   qb model examBig data unit iv and v lecture notes   qb model exam
Big data unit iv and v lecture notes qb model examIndhujeni
 
Problem-solving and design 1.pptx
Problem-solving and design 1.pptxProblem-solving and design 1.pptx
Problem-solving and design 1.pptxTadiwaMawere
 
Hadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectHadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectMao Geng
 
Hadoop MapReduce framework - Module 3
Hadoop MapReduce framework - Module 3Hadoop MapReduce framework - Module 3
Hadoop MapReduce framework - Module 3Rohit Agrawal
 

Similar to Hadoop map reduce in operation (20)

Map reduce
Map reduceMap reduce
Map reduce
 
Map reduce prashant
Map reduce prashantMap reduce prashant
Map reduce prashant
 
Hadoop first mr job - inverted index construction
Hadoop first mr job - inverted index constructionHadoop first mr job - inverted index construction
Hadoop first mr job - inverted index construction
 
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdf
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdfmodule3part-1-bigdata-230301002404-3db4f2a4 (1).pdf
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdf
 
Big Data.pptx
Big Data.pptxBig Data.pptx
Big Data.pptx
 
Map reduce
Map reduceMap reduce
Map reduce
 
Mapreduce script
Mapreduce scriptMapreduce script
Mapreduce script
 
Map reduce in Hadoop BIG DATA ANALYTICS
Map reduce in Hadoop BIG DATA ANALYTICSMap reduce in Hadoop BIG DATA ANALYTICS
Map reduce in Hadoop BIG DATA ANALYTICS
 
Hadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.comHadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.com
 
MapReduce.pptx
MapReduce.pptxMapReduce.pptx
MapReduce.pptx
 
An Introduction to TensorFlow architecture
An Introduction to TensorFlow architectureAn Introduction to TensorFlow architecture
An Introduction to TensorFlow architecture
 
MapReduce basics
MapReduce basicsMapReduce basics
MapReduce basics
 
Apache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce TutorialApache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce Tutorial
 
Optimizing Performance - Clojure Remote - Nikola Peric
Optimizing Performance - Clojure Remote - Nikola PericOptimizing Performance - Clojure Remote - Nikola Peric
Optimizing Performance - Clojure Remote - Nikola Peric
 
Big data unit iv and v lecture notes qb model exam
Big data unit iv and v lecture notes   qb model examBig data unit iv and v lecture notes   qb model exam
Big data unit iv and v lecture notes qb model exam
 
Problem-solving and design 1.pptx
Problem-solving and design 1.pptxProblem-solving and design 1.pptx
Problem-solving and design 1.pptx
 
Hadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectHadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log project
 
MapReduce
MapReduceMapReduce
MapReduce
 
Unit 2
Unit 2Unit 2
Unit 2
 
Hadoop MapReduce framework - Module 3
Hadoop MapReduce framework - Module 3Hadoop MapReduce framework - Module 3
Hadoop MapReduce framework - Module 3
 

More from Subhas Kumar Ghosh

More from Subhas Kumar Ghosh (14)

07 logistic regression and stochastic gradient descent
07 logistic regression and stochastic gradient descent07 logistic regression and stochastic gradient descent
07 logistic regression and stochastic gradient descent
 
05 k-means clustering
05 k-means clustering05 k-means clustering
05 k-means clustering
 
03 hive query language (hql)
03 hive query language (hql)03 hive query language (hql)
03 hive query language (hql)
 
02 data warehouse applications with hive
02 data warehouse applications with hive02 data warehouse applications with hive
02 data warehouse applications with hive
 
01 hbase
01 hbase01 hbase
01 hbase
 
06 pig etl features
06 pig etl features06 pig etl features
06 pig etl features
 
05 pig user defined functions (udfs)
05 pig user defined functions (udfs)05 pig user defined functions (udfs)
05 pig user defined functions (udfs)
 
03 pig intro
03 pig intro03 pig intro
03 pig intro
 
02 naive bays classifier and sentiment analysis
02 naive bays classifier and sentiment analysis02 naive bays classifier and sentiment analysis
02 naive bays classifier and sentiment analysis
 
Hadoop Day 3
Hadoop Day 3Hadoop Day 3
Hadoop Day 3
 
Hadoop availability
Hadoop availabilityHadoop availability
Hadoop availability
 
Hadoop scheduler
Hadoop schedulerHadoop scheduler
Hadoop scheduler
 
Hadoop data management
Hadoop data managementHadoop data management
Hadoop data management
 
Greedy embedding problem
Greedy embedding problemGreedy embedding problem
Greedy embedding problem
 

Recently uploaded

VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxolyaivanovalion
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramMoniSankarHazra
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxolyaivanovalion
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service OnlineCALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Onlineanilsa9823
 
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girlCall Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girlkumarajju5765
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 

Recently uploaded (20)

VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptx
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service OnlineCALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
 
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girlCall Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 

Hadoop map reduce in operation

  • 2. Can I choose how many maps to be executed? •The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files. •The right level of parallelism for maps seems to be around 10-100 maps per-node, although it has been set up to 300 maps for very cpu-light map tasks. •Task setup takes awhile, so it is best if the maps take at least a minute to execute. •Thus, if you expect 10TB of input data and have a blocksizeof128MB, you'll end up with 82,000 maps, unlesssetNumMapTasks(int) (which only provides a hint to the framework) is used to set it even higher.
  • 3. Reducer •Reducerreduces a set of intermediate values which share a key to a smaller set of values. •The number of reduces for the job is set by the user viaJobConf.setNumReduceTasks(int). •Overall,Reducerimplementations are passed theJobConffor the job via theJobConfigurable.configure(JobConf)method and can override it to initialize themselves. •The framework then callsreduce(WritableComparable, Iterator, OutputCollector, Reporter)method for each<key, (list of values)>pair in the grouped inputs. •Applications can then override theCloseable.close()method to perform any required cleanup. •Reducerhas 3 primary phases: shuffle, sort and reduce.
  • 4. Reducer: Sort •The framework groupsReducerinputs by keys (since different mappersmay have output the same key) in this stage. •The shuffle and sort phases occur simultaneously; while map-outputs are being fetched they are merged. •Secondary Sort: If equivalence rules for grouping the intermediate keys are required to be different from those for grouping keys before reduction, then one may specify aComparatorvia JobConf.setOutputValueGroupingComparator(Class). SinceJobConf.setOutputKeyComparatorClass(Class)can be used to control how intermediate keys are grouped, these can be used in conjunction to simulatesecondary sort on values.
  • 5. Reducer: Reduce •In this phase thereduce(WritableComparable, Iterator, OutputCollector, Reporter)method is called for each<key, (list of values)>pair in the grouped inputs. •The output of the reduce task is typically written to theFileSystemviaOutputCollector.collect(WritableComparable, Writable). •Applications can use theReporterto report progress, set application-level status messages and updateCounters, or just indicate that they are alive. •The output of theReducerisnot sorted.
  • 6. How many reducer? •The right number of reduces seems to be betweeen0.95and1.75multiplied by (<no. of nodes> *mapred.tasktracker.reduce.tasks.maximum). –With0.95all of the reduces can launch immediately and start transferring map outputs as the maps finish. –With1.75the faster nodes will finish their first round of reduces and launch a second wave of reduces doing a much better job of load balancing. •Increasing the number of reduces –Increases the framework overhead, –But •Increases load balancing and •lowers the cost of failures. •The scaling factors above are slightly less than whole numbers to reserve a few reduce slots in the framework for speculative-tasks and failed tasks. •It is legal to set the number of reduce-tasks tozeroif no reduction is desired. •In this case the outputs of the map-tasks go directly to theFileSystem, into the output path set bysetOutputPath(Path). •Again, the framework does not sort the map-outputs before writing them out to theFileSystem.
• 7. Partitioner
•Partitioner controls the partitioning of the keys of the intermediate map outputs.
•The key (or a subset of the key) is used to derive the partition, typically by a hash function.
•The total number of partitions is the same as the number of reduce tasks for the job.
•Hence this controls which of the m reduce tasks the intermediate key (and hence the record) is sent to for reduction.
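The default hash partitioning can be sketched in plain Java. This mirrors the logic of Hadoop's HashPartitioner: mask off the sign bit (since hashCode() may be negative) and take the remainder modulo the number of reduce tasks.

```java
public class HashPartitionDemo {

    // Derive the partition for a key, as Hadoop's default HashPartitioner does.
    // Masking with Integer.MAX_VALUE keeps the result non-negative.
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Every key maps deterministically to one of the m reduce tasks:
        for (String k : new String[]{"apple", "banana", "cherry"}) {
            System.out.println(k + " -> reducer " + getPartition(k, 4));
        }
    }
}
```

Because the partition is a pure function of the key, all records sharing a key land on the same reducer, which is what makes grouped reduction possible.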
• 8. Reporter
•A facility for MapReduce applications to report progress, set application-level status messages and update Counters.
•In scenarios where the application takes a significant amount of time to process individual key/value pairs, this is crucial since the framework might otherwise assume that the task has timed out and kill it.
• 9. OutputCollector
•A generalization of the facility provided by the MapReduce framework to collect data output by the Mapper or the Reducer (either the intermediate outputs or the output of the job).
• 10. Job Configuration
•JobConf represents a MapReduce job configuration.
•JobConf is the primary interface for a user to describe a MapReduce job to the Hadoop framework for execution.
•The framework tries to faithfully execute the job as described by JobConf.
•JobConf is typically used to specify the
–Mapper,
–Combiner (if any),
–Partitioner,
–Reducer,
–InputFormat,
–OutputFormat,
–OutputCommitter,
–set of input files, and
–where the output files should be written.
•Optionally, JobConf is used to specify other advanced facets of the job, such as
–the Comparator to be used,
–files to be put in the DistributedCache,
–whether intermediate and/or job outputs are to be compressed (and how),
–debugging via user-provided scripts,
–whether job tasks can be executed in a speculative manner,
–the maximum number of attempts per task, and
–the percentage of task failures which can be tolerated by the job.
•Applications can use set(String, String)/get(String, String) to set/get arbitrary parameters they need.
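A typical JobConf setup in the old org.apache.hadoop.mapred API looks roughly like the fragment below. This is a sketch, not a complete driver: MyJob, MyMapper, MyReducer, the job name, and the input/output paths are all placeholders.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// MyJob, MyMapper, MyReducer and the paths below are hypothetical placeholders.
JobConf conf = new JobConf(MyJob.class);
conf.setJobName("wordcount");
conf.setMapperClass(MyMapper.class);
conf.setCombinerClass(MyReducer.class);   // combiner is optional
conf.setReducerClass(MyReducer.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path("in"));
FileOutputFormat.setOutputPath(conf, new Path("out"));
conf.set("my.custom.param", "value");     // arbitrary application parameter
JobClient.runJob(conf);
```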
• 11. Job Input
•InputFormat describes the input specification for a MapReduce job.
•The MapReduce framework relies on the InputFormat of the job to:
–Validate the input specification of the job.
–Split up the input file(s) into logical InputSplit instances, each of which is then assigned to an individual Mapper.
–Provide the RecordReader implementation used to extract input records from the logical InputSplit for processing by the Mapper.
•TextInputFormat is the default InputFormat.
•If TextInputFormat is the InputFormat for a given job, the framework detects input files with the .gz extension and automatically decompresses them.
•InputSplit represents the data to be processed by an individual Mapper. FileSplit is the default InputSplit.
•RecordReader reads <key, value> pairs from an InputSplit.
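What TextInputFormat's record reader hands to the Mapper can be simulated in plain Java (illustrative names, not the Hadoop API): one <byte offset, line text> record per line of the split, which in the real API arrive as LongWritable/Text pairs.

```java
import java.util.*;

public class LineRecordReaderDemo {

    // Produce <offset, line> records for the contents of one split.
    static LinkedHashMap<Long, String> readRecords(String splitContents) {
        LinkedHashMap<Long, String> records = new LinkedHashMap<>();
        long offset = 0;
        for (String line : splitContents.split("\n")) {
            records.put(offset, line);
            offset += line.length() + 1; // +1 for the newline terminator
        }
        return records;
    }

    public static void main(String[] args) {
        System.out.println(readRecords("hello world\ngoodbye world\n"));
        // {0=hello world, 12=goodbye world}
    }
}
```

The key is the offset of the line within the split, which is why map keys in a text job are byte positions rather than line numbers.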
• 12. Job Output
•OutputFormat describes the output specification for a MapReduce job.
•The MapReduce framework relies on the OutputFormat of the job to:
–Validate the output specification of the job; for example, check that the output directory doesn't already exist.
–Provide the RecordWriter implementation used to write the output files of the job. Output files are stored in a FileSystem.
•TextOutputFormat is the default OutputFormat.
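Both OutputFormat duties listed above can be sketched in plain Java (illustrative names, not the Hadoop API): validate the output specification up front, and write records in TextOutputFormat's key, tab, value line format.

```java
import java.io.File;

public class OutputFormatDemo {

    // Validation step: fail early if the output directory already exists.
    static void checkOutputSpec(File outputDir) {
        if (outputDir.exists()) {
            throw new IllegalStateException(
                "Output directory " + outputDir + " already exists");
        }
    }

    // TextOutputFormat writes each record as key <TAB> value, one per line.
    static String formatRecord(String key, String value) {
        return key + "\t" + value;
    }

    public static void main(String[] args) {
        checkOutputSpec(new File("demo-job-output")); // passes while the directory is absent
        System.out.println(formatRecord("hadoop", "3"));
    }
}
```

Failing validation before any task runs is the point: it prevents a long job from silently clobbering an earlier job's output.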
• 13. IsolationRunner
•IsolationRunner is a utility to help debug MapReduce programs.
•To use the IsolationRunner, first set keep.failed.task.files to true (also see keep.task.files.pattern).
•Next, go to the node on which the failed task ran, go to the TaskTracker's local directory, and run the IsolationRunner:
$ cd <local path>/taskTracker/${taskid}/work
$ bin/hadoop org.apache.hadoop.mapred.IsolationRunner ../job.xml
•IsolationRunner will run the failed task in a single JVM, which can be in the debugger, over precisely the same input.
•IsolationRunner will only re-run map tasks.
• 14. Debugging
•The MapReduce framework provides a facility to run user-provided scripts for debugging.
•The script is given access to the task's stdout and stderr outputs, syslog and jobconf.
•The output from the debug script's stdout and stderr is displayed on the console diagnostics and also as part of the job UI.
•The user needs to use DistributedCache to distribute and symlink the script file.
• 15. Counters
•There are often things you would like to know about the data you are analyzing that are peripheral to the analysis you are performing.
•E.g., when counting invalid records, you may discover that the proportion of invalid records is very high; perhaps there is a bug in the part of the program that detects invalid records?
•Counters are a useful channel for gathering statistics about the job:
–for quality control,
–or for application-level statistics.
•It is often better to use a counter, rather than log output, to record that a particular condition occurred.
•Hadoop maintains some built-in counters for every job, which report various metrics for your job.
•For example, there are counters for the number of bytes and records processed, which allows you to confirm that the expected amount of input was consumed and the expected amount of output was produced.
•Example use case: controlling the number of recursions.
• 16. User-Defined Java Counters
•MapReduce allows user code to define a set of counters, which are then incremented as desired in the mapper or reducer.
•Counters are defined by a Java enum, which serves to group related counters.
•A job may define an arbitrary number of enums, each with an arbitrary number of fields.
•The name of the enum is the group name, and the enum's fields are the counter names.
•Counters are global: the MapReduce framework aggregates them across all maps and reduces to produce a grand total at the end of the job.
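The "global" aggregation of enum-keyed counters can be simulated in plain Java (illustrative names; in Hadoop the framework does this across tasks for you): each task keeps its own local counts, and the job total is the sum across all tasks.

```java
import java.util.*;

public class CounterAggregationDemo {

    // The enum is the counter group; its fields are the counter names.
    enum Temperature { MISSING, MALFORMED }

    // Framework's role: merge every task's local counters into a grand total.
    static Map<Temperature, Long> aggregate(List<Map<Temperature, Long>> perTaskCounters) {
        Map<Temperature, Long> total = new EnumMap<>(Temperature.class);
        for (Map<Temperature, Long> task : perTaskCounters) {
            task.forEach((counter, n) -> total.merge(counter, n, Long::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        Map<Temperature, Long> task1 = new EnumMap<>(Temperature.class);
        task1.put(Temperature.MISSING, 3L);
        Map<Temperature, Long> task2 = new EnumMap<>(Temperature.class);
        task2.put(Temperature.MISSING, 2L);
        task2.put(Temperature.MALFORMED, 1L);
        System.out.println(aggregate(Arrays.asList(task1, task2)));
        // {MISSING=5, MALFORMED=1}
    }
}
```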
• 17. Example of User-Defined Counters
public class MaxTemperatureWithCounters extends Configured implements Tool {

  enum Temperature { MISSING, MALFORMED }

  static class MaxTemperatureMapperWithCounters extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {

    private NcdcRecordParser parser = new NcdcRecordParser(); // parser field, omitted on the original slide

    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      parser.parse(value);
      if (parser.isValidTemperature()) {
        // do something
      } else if (parser.isMalformedTemperature()) {
        reporter.incrCounter(Temperature.MALFORMED, 1);
      } else if (parser.isMissingTemperature()) {
        reporter.incrCounter(Temperature.MISSING, 1);
      }
      // dynamic counter
      reporter.incrCounter("TemperatureQuality", parser.getQuality(), 1);
      ...
// in the driver, after the job completes:
Counters counters = job.getCounters();
• 18. Getting Built-In Counters
for (CounterGroup group : counters) {
  System.out.println("* Counter Group: " + group.getDisplayName()
      + " (" + group.getName() + ")");
  System.out.println("  number of counters in this group: " + group.size());
  for (Counter counter : group) {
    System.out.println("  - " + counter.getDisplayName() + ": "
        + counter.getName() + ": " + counter.getValue());
  }
}
• 19. End of Session
Day 2: MapReduce in operation