Running MapReduce
Programs in Clouds
-Anshul Aggarwal
Cisco Systems
Cloud Computing … MapReduce … Hadoop
What is MapReduce?
• Simple data-parallel programming model designed for
scalability and fault-tolerance
• Pioneered by Google
• Processes 20 petabytes of data per day
• Popularized by open-source Hadoop project
• Used at Yahoo!, Facebook, Amazon, …
Why MapReduce Optimization?
Outline
• Cloud And MapReduce
• MapReduce architecture
• Example applications
• Getting started with Hadoop
• Tuning MapReduce
Cloud Computing
• The emergence of cloud computing
has made a tremendous impact on
the Information Technology (IT) industry
• Cloud computing moves work away from personal computers and
individual enterprise application servers to services provided by a
cloud of computers
• Resources such as CPU and storage are provided as general utilities
to users, on demand, over the Internet
• Cloud computing is still in its early stages, with many issues yet to
be addressed
Cloud Computing Services
Outline
• Cloud And MapReduce
• MapReduce architecture
• Example applications
• Getting started with Hadoop
• Tuning MapReduce
MapReduce Framework
MapReduce History
• Historically, data processing was done almost entirely with database
technologies. Most of the data had a well-defined structure and was
often stored in relational databases
• Data volumes soon reached terabytes and then petabytes
• Google developed a new programming model called MapReduce to handle
large-scale data analysis, and later introduced the model through their
seminal paper MapReduce: Simplified Data Processing on Large Clusters.
What the paper says
Example: Facebook Lexicon
www.facebook.com/lexicon
What is MapReduce used for?
• At Google:
• Index construction for Google Search
• Article clustering for Google News
• Statistical machine translation
• At Yahoo!:
• “Web map” powering Yahoo! Search
• Spam detection for Yahoo! Mail
• At Facebook:
• Data mining
• Ad optimization
• Spam detection
MapReduce Framework
• computing paradigm for processing data that resides on hundreds of
computers
• popularized recently by Google, Hadoop, and many others
• more of a framework than a single tool
• makes some problems easier to solve and others harder
• forces you to think about inter-cluster network utilization and the
performance of a job that will be distributed
• published by Google without any actual source code
MapReduce Terminology
Outline
• Cloud And MapReduce
• MapReduce Basics
• Example applications
• Getting started with Hadoop
• Tuning MapReduce
Word Count -"Hello World" of
MapReduce world.
• The word count job accepts an input directory, a mapper
function, and a reducer function as inputs.
• We use the mapper function to process the data in parallel,
and we use the reducer function to collect results of the
mapper and produce the final results.
• Mapper sends its results to reducer using a key-value based
model.
• $bin/hadoop -cp hadoop-microbook.jar
microbook.wordcount. WordCount amazon-meta.txt
wordcount-output1
Workflow
Example: Word Count
[Diagram: map tasks feeding reduce tasks]
• Job: count the occurrences of each word in a data set
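A minimal sketch of the mapper and reducer behind such a job, using the newer org.apache.hadoop.mapreduce API; the class names here are illustrative, not necessarily the ones shipped in hadoop-microbook.jar:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Mapper: emit (word, 1) for every token in the input line.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sum the counts emitted for each word by all mappers.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
}

A driver wires these together in the same way the MaxTemperatureDriver shown later in this deck does, only with the word-count classes.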
Outline
• Cloud And MapReduce
• MapReduce Basics
• Example applications
• MapReduce architecture
• Getting started with Hadoop
• Tuning MapReduce
How MapReduce Works
At the highest level, there are four independent entities:
• The client, which submits the MapReduce job.
• The jobtracker, which coordinates the job run. The jobtracker
is a Java application whose main class is JobTracker.
• The tasktrackers, which run the tasks that the job has been
split into.
• The distributed filesystem (normally HDFS), which is used
for sharing job files between the other entities.
Anatomy of a MapReduce Job
Developing a MapReduce Application
• The Configuration API (a retrieval sketch follows this list)
Configuration conf = new Configuration();
conf.addResource("configuration-1.xml");
conf.addResource("configuration-2.xml");
• GenericOptionsParser, Tool, and ToolRunner
• Writing a Unit Test
• Testing the Driver
• Launching a Job
% hadoop jar hadoop-examples.jar v3.MaxTemperatureDriver -conf conf/hadoop-cluster.xml Input/ncdc/all max-temp
• Retrieving the Results
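A minimal sketch of reading values back through the Configuration API, assuming the mapred.* property names used in the configuration file later in this deck; the default values here are illustrative:

import org.apache.hadoop.conf.Configuration;

public class ConfigDemo {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.addResource("configuration-1.xml");
    conf.addResource("configuration-2.xml");   // resources added later override earlier ones (unless marked final)

    // get() returns null for unset properties; the typed getters take a default value.
    String jobTracker = conf.get("mapred.job.tracker");
    int maxMapTasks = conf.getInt("mapred.tasktracker.map.tasks.maximum", 2);

    System.out.println("jobtracker = " + jobTracker + ", max map tasks = " + maxMapTasks);
  }
}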
This is where the Magic Happens
public class MaxTemperatureDriver extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    // Build and configure the job from the Tool-supplied configuration.
    Job job = new Job(getConf(), "Max temperature");
    job.setJarByClass(getClass());
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapperClass(MaxTemperatureMapper.class);
    job.setCombinerClass(MaxTemperatureReducer.class);
    job.setReducerClass(MaxTemperatureReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new MaxTemperatureDriver(), args);
    System.exit(exitCode);
  }
}
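The driver wires in MaxTemperatureMapper and MaxTemperatureReducer without showing them. A minimal sketch of the mapper side, assuming NCDC-style fixed-width weather records; the field offsets are assumptions for illustration only:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int MISSING = 9999;   // sentinel for absent readings (assumed)

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    // The offsets below assume a fixed-width record layout; adjust for the real format.
    String year = line.substring(15, 19);
    int airTemperature = Integer.parseInt(line.substring(87, 92).trim());
    if (airTemperature != MISSING) {
      // Emit (year, temperature); the reducer keeps the maximum per year.
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}

The reducer mirrors the word-count reducer shown earlier, except that it takes the maximum of the values for each key rather than their sum, which is also why the driver can safely reuse it as a combiner.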
Configuring MapReduce Parameters
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>MASTER_NODE:9001</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>HADOOP_DATA_DIR/local</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>8</value>
  </property>
</configuration>
$ bin/hadoop -cp hadoop-microbook.jar microbook.wordcount.WordCount amazon-meta.txt wordcount-output1
Q & A
Outline
• Cloud And MapReduce
• MapReduce architecture
• Example applications
• Getting started with Hadoop
• Tuning MapReduce
Hadoop Clusters
In pioneer days they used oxen for heavy pulling, and when one ox
couldn’t budge a log, they didn’t try to grow a larger ox. We shouldn’t
be trying for bigger computers, but for more systems of computers.
- Grace Hopper
Why is Hadoop able to compete?
Hadoop vs. a traditional database:
• Hadoop: scalability (petabytes of data, thousands of machines)
• Hadoop: flexibility in accepting all data formats (no schema)
• Hadoop: commodity, inexpensive hardware
• Hadoop: efficient and simple fault-tolerance mechanism
• Database: performance (tons of indexing, tuning, and data-organization techniques)
• Database: features such as provenance tracking, annotation management, …
What is Hadoop
• Hadoop is a software framework for distributed processing of large
datasets across large clusters of computers
• Large datasets → terabytes or petabytes of data
• Large clusters → hundreds or thousands of nodes
• Hadoop is an open-source implementation of Google's MapReduce
• HDFS is a filesystem designed for storing very large files with
streaming data access patterns, running on clusters of commodity
hardware
What is Hadoop (Cont’d)
• The Hadoop framework consists of two main layers
• Distributed file system (HDFS)
• Execution engine (MapReduce)
• Hadoop is designed as a master-slave shared-nothing architecture
Design Principles of Hadoop
• Automatic parallelization & distribution
• computation is distributed across thousands of nodes and hidden from the end user
• Fault tolerance and automatic recovery
• Nodes/tasks will fail and will recover automatically
• Clean and simple programming abstraction
• Users only provide two functions “map” and “reduce”
• Need to process big data
• Commodity hardware
• Large number of low-end cheap machines working in parallel to solve a
computing problem
Hardware Specs
• Memory: RAM sized for the total number of tasks per node
• No RAID required
• No blade servers
• Dedicated switch
• Dedicated 1 Gb network line
Who Uses MapReduce/Hadoop
• Google: inventors of the MapReduce computing paradigm
• Yahoo!: developed Hadoop, the open-source implementation of MapReduce
• IBM, Microsoft, Oracle
• Facebook, Amazon, AOL, Netflix
• Many others + universities and research labs
• Many enterprises are turning to Hadoop
• Especially applications generating big data
• Web applications, social networks, scientific applications
Hadoop: How it Works
• Hadoop implements Google’s MapReduce, using HDFS
• MapReduce divides applications into many small blocks of work.
• HDFS creates multiple replicas of data blocks for reliability, placing them
on compute nodes around the cluster.
• MapReduce can then process the data where it is located.
• Hadoop's target is to run on clusters on the order of 10,000 nodes.
Workflow
Hadoop: Assumptions
It is written with large clusters of computers in mind and is built
around the following assumptions:
• Hardware will fail.
• Processing will be run in batches.
• Applications that run on HDFS have large data sets.
• It should provide high aggregate data bandwidth
• Applications need a write-once-read-many access model.
• Moving Computation is Cheaper than Moving Data.
• Portability is important.
Complete Overview
Hadoop Distributed File System (HDFS)
• Centralized namenode: maintains metadata about files
• Many datanodes (thousands): store the actual data
  - Files are divided into blocks (64 MB)
  - Each block is replicated N times (default N = 3)
[Diagram: file F split into blocks 1-5, replicated across datanodes]
Main Properties of HDFS
• Large: an HDFS instance may consist of thousands of server
machines, each storing part of the file system's data
• Replication: Each data block is replicated many times
(default is 3)
• Failure: Failure is the norm rather than exception
• Fault Tolerance: Detection of faults and quick, automatic
recovery from them is a core architectural goal of HDFS
• The namenode constantly monitors the datanodes
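To make the block and replication picture concrete, a minimal sketch of reading a file back through the HDFS FileSystem API; the path is hypothetical:

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // The cluster configuration points this at the namenode.
    FileSystem fs = FileSystem.get(conf);

    InputStream in = null;
    try {
      // The client asks the namenode for block locations, then streams from datanodes.
      in = fs.open(new Path("/wordcount-output1/part-r-00000"));   // hypothetical path
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}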
Outline
• Cloud And MapReduce
• MapReduce architecture
• Example applications
• Getting started with Hadoop
• Tuning MapReduce
Tuning Parameters
Mapping Workers to Processors
• The input data (on HDFS) is stored on the local disks of the machines
in the cluster. HDFS divides each file into 64 MB blocks, and stores
several copies of each block (typically 3 copies) on different
machines.
• The MapReduce master takes the location information of the input
files into account and attempts to schedule a map task on a machine
that contains a replica of the corresponding input data. Failing that, it
attempts to schedule a map task near a replica of that task's input
data. When running large MapReduce operations on a significant
fraction of the workers in a cluster, most input data is read locally and
consumes no network bandwidth.
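The block-to-host mapping that the master consults is also visible to clients; a minimal sketch using the Hadoop FileSystem API, with a hypothetical input file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/amazon-meta.txt");          // hypothetical input file
    FileStatus status = fs.getFileStatus(file);

    // The same block-to-host mapping the scheduler consults for data-local map tasks.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println(block.getOffset() + " -> " + String.join(",", block.getHosts()));
    }
  }
}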
Task Granularity
• The map phase has M pieces and the reduce phase has R pieces.
• M and R should be much larger than the number of worker
machines.
• Having each worker perform many different tasks improves dynamic
load balancing, and also speeds up recovery when a worker fails.
• The larger M and R are, the more scheduling decisions the master must make
• R is often constrained by users because the output of each reduce task
ends up in a separate output file.
• Typically, (at Google), M = 200,000 and R = 5,000, using 2,000
worker machines.
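M normally falls out of the input size divided by the split (block) size, while R is set explicitly by the user; a small sketch, assuming the same new-API Job class used earlier in this deck:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class GranularityDemo {
  public static void main(String[] args) throws Exception {
    // With 64 MB blocks, a 12.8 GB input is split into roughly 12,800 MB / 64 MB = 200 map tasks (M).
    Job job = new Job(new Configuration(), "granularity-demo");
    // R is chosen by the user; each reduce task writes its own part-r-* output file.
    job.setNumReduceTasks(5);
    System.out.println("Reduce tasks requested: " + job.getNumReduceTasks());
  }
}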
Speculative Execution – One Approach
• Tasks may be slow for various reasons, including hardware degradation
or software misconfiguration, but the causes may be hard to detect
because the tasks still complete successfully, albeit later than
expected.
• Hadoop does not try to diagnose and fix slow-running tasks; instead,
it tries to detect when a task is running slower than expected and
launches another, equivalent task as a backup.
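Speculative execution can be switched on or off per job. A minimal sketch using the old mapred.* property names consistent with the rest of this deck; newer Hadoop releases rename these properties, so treat the names as assumptions about the version in use:

import org.apache.hadoop.conf.Configuration;

public class SpeculationConfig {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Disable backup (speculative) attempts for both map and reduce tasks of a job.
    conf.setBoolean("mapred.map.tasks.speculative.execution", false);
    conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
    System.out.println(conf.get("mapred.map.tasks.speculative.execution"));
  }
}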
Problem Statement
The problem at hand is defining a resource-provisioning framework for
MapReduce jobs running in a cloud, keeping in mind performance goals
such as resource utilization, with:
- an optimal number of map and reduce slots
- improvements in execution time
- a highly scalable solution
References
[1] E. Bortnikov, A. Frank, E. Hillel, and S. Rao, “Predicting Execution Bottlenecks in MapReduce Clusters,” in Proc. of the 4th USENIX Conference on Hot Topics in Cloud Computing, 2012.
[2] R. Buyya, S. K. Garg, and R. N. Calheiros, “SLA-Oriented Resource Provisioning for Cloud Computing: Challenges, Architecture, and Solutions,” in International Conference on Cloud and Service Computing, 2011.
[3] S. Chaisiri, Bu-Sung Lee, and D. Niyato, “Optimization of Resource Provisioning Cost in Cloud Computing,” IEEE Transactions on Services Computing, Vol. 5, No. 2, April-June 2012.
[4] L. Cherkasova and R. H. Campbell, “Resource Provisioning Framework for MapReduce Jobs with Performance Goals,” in Middleware 2011, LNCS 7049, pp. 165–186, 2011.
[5] J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” Communications of the ACM, Jan 2008.
[6] Y. Hu, J. Wong, G. Iszlai, and M. Litoiu, “Resource Provisioning for Cloud Computing,” in Proc. of the 2009 Conference of the Center for Advanced Studies on Collaborative Research, 2009.
[7] K. Kambatla, A. Pathak, and H. Pucha, “Towards Optimizing Hadoop Provisioning in the Cloud,” in Proc. of the First Workshop on Hot Topics in Cloud Computing, 2009.
[8] S. O. Kuyoro, F. Ibikunle, and O. Awodele, “Cloud Computing Security Issues and Challenges,” International Journal of Computer Networks (IJCN), Vol. 3, Issue 5, 2011.

Editor's Notes

  • #19 When you run the MapReduce job, Hadoop first reads the input files from the input directory line by line. Then Hadoop invokes the mapper once for each line, passing the line as the argument. Each mapper parses the line and extracts the words it contains. After processing, the mapper sends the word counts to the reducer by emitting the word and its count as key-value pairs.
  • #24 Writing a program in MapReduce has a certain flow to it. You start by writing your map and reduce functions, ideally with unit tests to make sure they do what you expect. Then you write a driver program to run a job, which can run from your IDE using a small subset of the data to check that it is working. If it fails, then you can use your IDE’s debugger to find the source of the problem. With this information, you can expand your unit tests to cover this case and improve your mapper or reducer as appropriate to handle such input correctly. When the program runs as expected against the small dataset, you are ready to unleash it on a cluster. Running against the full dataset is likely to expose some more issues, which you can fix as before, by expanding your tests and mapper or reducer to handle the new cases. Debugging failing programs in the cluster is a challenge, so we look at some common techniques to make it easier.
  • #35 We solve problems involving large datasets by using many computers to process the dataset in parallel. However, writing a program that processes a dataset in a distributed setup is a heavy undertaking. Although it is possible to write such a program, it is wasteful to write such programs again and again. MapReduce-based frameworks like Hadoop let users write only the map and reduce logic, while the framework handles the rest of the distributed processing.