Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015

Running MapReduce Programs in Clouds
Anshul Aggarwal, Cisco Systems
Cloud Computing … MapReduce … Hadoop …
What is MapReduce?
• Simple data-parallel programming model designed for
scalability and fault-tolerance
• Pioneered by Google
• Processes 20 petabytes of data per day
• Popularized by open-source Hadoop project
• Used at Yahoo!, Facebook, Amazon, …
Why Optimize MapReduce?
Outline
• Cloud And MapReduce
• MapReduce architecture
• Example applications
• Getting started with Hadoop
• Tuning MapReduce
Cloud Computing
• The emergence of cloud computing has made a tremendous impact on the
Information Technology (IT) industry
• Cloud computing moved away from personal computers and the individual
enterprise application server to services provided by a cloud of computers
• Resources like CPU and storage are provided to users as general utilities,
on demand, over the Internet
• Cloud computing is still in its initial stages, with many issues left to
be addressed.
CLOUD COMPUTING SERVICES
Outline
• Cloud And MapReduce
• MapReduce architecture
• Example applications
• Getting started with Hadoop
• Tuning MapReduce
MapReduce Framework
MapReduce History
• Historically, data processing was done entirely with database
technologies. Most data had a well-defined structure and was often
stored in relational databases
• Data volumes soon reached terabytes and then petabytes
• Google developed a new programming model called MapReduce to handle
large-scale data analysis, and later introduced the model in their
seminal paper MapReduce: Simplified Data Processing on Large Clusters.
What the paper says
Example: Facebook Lexicon
www.facebook.com/lexicon
What is MapReduce used for?
• At Google:
• Index construction for Google Search
• Article clustering for Google News
• Statistical machine translation
• At Yahoo!:
• “Web map” powering Yahoo! Search
• Spam detection for Yahoo! Mail
• At Facebook:
• Data mining
• Ad optimization
• Spam detection
MapReduce Framework
• computing paradigm for processing data that resides on hundreds of
computers
• popularized recently by Google, Hadoop, and many others
• more of a framework
• makes problem solving easier and harder
• inter-cluster network utilization
• performance of a job that will be distributed
• published by Google without any actual source code
MapReduce Terminology
Outline
• Cloud And MapReduce
• MapReduce Basics
• Example applications
• Getting started with Hadoop
• Tuning MapReduce
Word Count - the "Hello World" of the MapReduce world
• The word count job accepts an input directory, a mapper function, and
a reducer function as inputs.
• We use the mapper function to process the data in parallel, and we use
the reducer function to collect the mapper's results and produce the
final output.
• The mapper sends its results to the reducer using a key-value model.
• $ bin/hadoop jar hadoop-microbook.jar microbook.wordcount.WordCount
amazon-meta.txt wordcount-output1
Workflow
Example : Word Count
Map Tasks → Reduce Tasks
• Job: Count the occurrences of each word in a data set
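The map/reduce flow for word count can be simulated in plain Java, with no Hadoop dependency; this is a sketch of the logic only, not the Hadoop API (class and method names here are illustrative):

```java
import java.util.*;

public class WordCountLocal {
    // "map": emit a (word, 1) pair for every word in one input line
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : line.toLowerCase().split("\\s+"))
            if (!w.isEmpty()) out.add(new AbstractMap.SimpleEntry<>(w, 1));
        return out;
    }

    // "reduce": sum all counts emitted for one key
    static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return sum;
    }

    public static void main(String[] args) {
        String[] input = {"hello world", "hello hadoop"};
        // "shuffle": group the emitted pairs by key before reducing
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : input)
            for (Map.Entry<String, Integer> kv : map(line))
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                       .add(kv.getValue());
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet())
            System.out.println(e.getKey() + "\t" + reduce(e.getValue()));
    }
}
```

Running this prints each word with its count; in real Hadoop the shuffle step is performed by the framework between the map and reduce phases.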
Outline
• Cloud And MapReduce
• MapReduce Basics
• Example applications
• Mapreduce Architecture
• Getting started with Hadoop
• Tuning MapReduce
How MapReduce Works
At the highest level, there are four independent entities:
• The client, which submits the MapReduce job.
• The jobtracker, which coordinates the job run. The jobtracker
is a Java application whose main class is JobTracker.
• The tasktrackers, which run the tasks that the job has been
split into.
• The distributed filesystem (normally HDFS), which is used
for sharing job files between the other entities.
Anatomy of a MapReduce Job
Developing a MapReduce Application
• The Configuration API
Configuration conf = new Configuration();
conf.addResource("configuration-1.xml");
conf.addResource("configuration-2.xml");
• GenericOptionsParser, Tool, and ToolRunner
• Writing a Unit Test
• Testing the Driver
• Launching a Job
% hadoop jar hadoop-examples.jar v3.MaxTemperatureDriver -conf conf/hadoop-cluster.xml Input/ncdc/all max-temp
• Retrieving the Results
This is where the Magic Happens
public class MaxTemperatureDriver extends Configured implements Tool {
  @Override
  public int run(String[] args) throws Exception {
    Job job = new Job(getConf(), "Max temperature");
    job.setJarByClass(getClass());
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapperClass(MaxTemperatureMapper.class);
    job.setCombinerClass(MaxTemperatureReducer.class);
    job.setReducerClass(MaxTemperatureReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new MaxTemperatureDriver(), args);
    System.exit(exitCode);
  }
}
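The driver sets the reducer class as the combiner too. That is safe only because taking a maximum is commutative and associative, so partial maxima computed on each mapper can be merged into the same final answer. A standalone sketch of that property (plain Java, no Hadoop dependency; the partitioning below is illustrative):

```java
import java.util.*;

public class CombinerDemo {
    // Reducing everything at once must equal merging per-partition maxima;
    // this holds because max is commutative and associative.
    static int maxOf(List<Integer> values) {
        return Collections.max(values);
    }

    public static void main(String[] args) {
        List<Integer> temps = Arrays.asList(12, 25, 7, 31, 19);
        int direct = maxOf(temps);                    // reduce all values at once
        int partial1 = maxOf(temps.subList(0, 2));    // "combiner" on mapper 1
        int partial2 = maxOf(temps.subList(2, 5));    // "combiner" on mapper 2
        int combined = Math.max(partial1, partial2);  // final reduce of partials
        System.out.println(direct + " == " + combined);
    }
}
```

A sum or count combiner is safe for the same reason; an average is not, which is why averages must be combined as (sum, count) pairs instead.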
Configuring MapReduce Parameters
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>MASTER_NODE:9001</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>HADOOP_DATA_DIR/local</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>8</value>
  </property>
</configuration>
• $ bin/hadoop jar hadoop-microbook.jar microbook.wordcount.WordCount
amazon-meta.txt wordcount-output1
Q & A
Outline
• Cloud And MapReduce
• MapReduce architecture
• Example applications
• Getting started with Hadoop
• Tuning MapReduce
Hadoop Clusters
In pioneer days they used oxen for heavy pulling, and when one ox
couldn't budge a log, they didn't try to grow a larger ox. We shouldn't
be trying for bigger computers, but for more systems of computers.
—Grace Hopper
Why is Hadoop able to compete?

Hadoop:
• Scalability (petabytes of data, thousands of machines)
• Flexibility in accepting all data formats (no schema)
• Commodity, inexpensive hardware
• Efficient and simple fault-tolerance mechanism

vs. Database:
• Performance (extensive indexing, tuning, and data-organization techniques)
• Features: provenance tracking, annotation management, …
What is Hadoop
• Hadoop is a software framework for distributed processing of large
datasets across large clusters of computers
• Large datasets: terabytes or petabytes of data
• Large clusters: hundreds or thousands of nodes
• Hadoop is an open-source implementation of Google MapReduce
• HDFS is a filesystem designed for storing very large files with
streaming data access patterns, running on clusters of commodity
hardware
What is Hadoop (Cont’d)
• The Hadoop framework consists of two main layers
• Distributed file system (HDFS)
• Execution engine (MapReduce)
• Hadoop is designed as a master-slave, shared-nothing architecture
Design Principles of Hadoop
• Automatic parallelization & distribution
• Computation is spread across thousands of nodes, hidden from the end user
• Fault tolerance and automatic recovery
• Nodes and tasks will fail, and will recover automatically
• Clean and simple programming abstraction
• Users provide only two functions, "map" and "reduce"
• The need to process big data
• Commodity hardware
• A large number of low-end, cheap machines working in parallel to solve
a computing problem
Hardware Specs
• Memory: enough RAM to cover the total number of concurrent tasks
• No RAID required
• No blade servers
• Dedicated switch
• Dedicated 1 Gb network line
Who Uses MapReduce/Hadoop
• Google: inventors of the MapReduce computing paradigm
• Yahoo!: developers of Hadoop, the open-source implementation of MapReduce
• IBM, Microsoft, Oracle
• Facebook, Amazon, AOL, Netflix
• Many others, plus universities and research labs
• Many enterprises are turning to Hadoop
• Especially for applications generating big data
• Web applications, social networks, scientific applications
Hadoop: How it Works
• Hadoop implements Google's MapReduce, using HDFS
• MapReduce divides applications into many small blocks of work
• HDFS creates multiple replicas of data blocks for reliability, placing
them on compute nodes around the cluster
• MapReduce can then process the data where it is located
• Hadoop's target is to run on clusters on the order of 10,000 nodes
Sathya Sai University, Prashanti Nilayam
Workflow
Hadoop: Assumptions
It is written with large clusters of computers in mind and is built
around the following assumptions:
• Hardware will fail.
• Processing will be run in batches.
• Applications that run on HDFS have large data sets.
• It should provide high aggregate data bandwidth
• Applications need a write-once-read-many access model.
• Moving Computation is Cheaper than Moving Data.
• Portability is important.
Complete Overview
Hadoop Distributed File System (HDFS)
• Centralized namenode
  - Maintains metadata info about files
• Many datanodes (1000s)
  - Store the actual data
  - Files are divided into blocks (64 MB)
  - Each block is replicated N times (default N = 3)
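Block size and replication factor determine how a file is laid out and how much raw storage it consumes. A back-of-the-envelope calculation, using the defaults described above (the 300 MB file size is a hypothetical example):

```java
public class HdfsBlocks {
    public static void main(String[] args) {
        long blockSize   = 64L * 1024 * 1024;   // default HDFS block size (64 MB)
        long fileSize    = 300L * 1024 * 1024;  // hypothetical 300 MB file
        int  replication = 3;                   // default replication factor

        // Number of blocks is the file size divided by the block size,
        // rounded up (the last block may be partial).
        long blocks = (fileSize + blockSize - 1) / blockSize;
        System.out.println("blocks: " + blocks);
        System.out.println("block replicas in cluster: " + blocks * replication);
        // HDFS stores only the actual bytes (no padding of the last block),
        // so raw storage consumed is simply fileSize * replication.
        System.out.println("raw bytes stored: " + fileSize * replication);
    }
}
```

So a 300 MB file becomes 5 blocks and 15 block replicas spread across the datanodes, consuming 900 MB of raw disk.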
Main Properties of HDFS
• Large: an HDFS instance may consist of thousands of server machines,
each storing part of the file system's data
• Replication: each data block is replicated many times (default is 3)
• Failure: failure is the norm rather than the exception
• Fault tolerance: detection of faults and quick, automatic recovery
from them is a core architectural goal of HDFS
• The namenode constantly checks on the datanodes
Outline
• Cloud And MapReduce
• MapReduce architecture
• Example applications
• Getting started with Hadoop
• Tuning MapReduce
Tuning Parameters
Mapping Workers to Processors
• The input data (on HDFS) is stored on the local disks of the machines
in the cluster. HDFS divides each file into 64 MB blocks, and stores
several copies of each block (typically 3 copies) on different
machines.
• The MapReduce master takes the location information of the input
files into account and attempts to schedule a map task on a machine
that contains a replica of the corresponding input data. Failing that, it
attempts to schedule a map task near a replica of that task's input
data. When running large MapReduce operations on a significant
fraction of the workers in a cluster, most input data is read locally and
consumes no network bandwidth.
Task Granularity
• The map phase has M pieces and the reduce phase has R pieces.
• M and R should be much larger than the number of worker machines.
• Having each worker perform many different tasks improves dynamic load
balancing, and also speeds up recovery when a worker fails.
• The larger M and R are, the more decisions the master must make.
• R is often constrained by users because the output of each reduce task
ends up in a separate output file.
• Typically (at Google), M = 200,000 and R = 5,000, using 2,000 worker
machines.
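The Google figures quoted above imply that each worker handles many tasks over the life of the job, which is exactly what enables dynamic load balancing: a slow or failed worker's share can be redistributed in small pieces. The arithmetic:

```java
public class TaskGranularity {
    public static void main(String[] args) {
        int M = 200_000;     // map tasks (typical Google figure above)
        int R = 5_000;       // reduce tasks
        int workers = 2_000; // worker machines

        // Each worker executes roughly M/workers map tasks in sequence,
        // so losing one worker only requires re-running a small slice.
        System.out.println("map tasks per worker: " + M / workers);
        System.out.println("reduce tasks per worker: " + (double) R / workers);
    }
}
```

That is about 100 map tasks and 2.5 reduce tasks per worker; with M equal to the worker count, one straggler would stall an entire map wave.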
Speculative Execution - One Approach
• Tasks may be slow for various reasons, including hardware degradation
or software misconfiguration, but the causes may be hard to detect
since the tasks still complete successfully, albeit after a longer
time than expected.
• Hadoop doesn't try to diagnose and fix slow-running tasks; instead, it
tries to detect when a task is running slower than expected and
launches another, equivalent task as a backup.
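A minimal sketch of the detection idea, as plain Java: flag any task whose progress lags the mean by more than a slack threshold, then schedule a backup for it. This is an illustrative heuristic only, not Hadoop's actual scheduler logic (which also considers how long a task has run and caps the number of speculative copies):

```java
import java.util.*;

public class StragglerDetector {
    // Return the indices of tasks whose progress (0.0-1.0) lags the
    // average by more than `slack`; candidates for a backup task.
    static List<Integer> findStragglers(double[] progress, double slack) {
        double mean = Arrays.stream(progress).average().orElse(0.0);
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < progress.length; i++)
            if (progress[i] < mean - slack) out.add(i);
        return out;
    }

    public static void main(String[] args) {
        double[] progress = {0.9, 0.85, 0.2, 0.95}; // task 2 is lagging badly
        System.out.println(findStragglers(progress, 0.3));
    }
}
```

Whichever copy of a flagged task finishes first wins; the other is killed, trading some duplicated work for a shorter job tail.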
Problem Statement
The problem at hand is defining a resource-provisioning framework for
MapReduce jobs running in a cloud, keeping in mind performance goals
such as resource utilization, with:
- an optimal number of map and reduce slots
- improvements in execution time
- a highly scalable solution
References
[1] E. Bortnikov, A. Frank, E. Hillel, and S. Rao, "Predicting Execution Bottlenecks in Map-Reduce Clusters", in Proc. of the 4th USENIX Conference on Hot Topics in Cloud Computing, 2012.
[2] R. Buyya, S. K. Garg, and R. N. Calheiros, "SLA-Oriented Resource Provisioning for Cloud Computing: Challenges, Architecture, and Solutions", in International Conference on Cloud and Service Computing, 2011.
[3] S. Chaisiri, Bu-Sung Lee, and D. Niyato, "Optimization of Resource Provisioning Cost in Cloud Computing", IEEE Transactions on Services Computing, Vol. 5, No. 2, April-June 2012.
[4] L. Cherkasova and R. H. Campbell, "Resource Provisioning Framework for MapReduce Jobs with Performance Goals", in Middleware 2011, LNCS 7049, pp. 165-186, 2011.
[5] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", Communications of the ACM, Jan 2008.
[6] Y. Hu, J. Wong, G. Iszlai, and M. Litoiu, "Resource Provisioning for Cloud Computing", in Proc. of the 2009 Conference of the Center for Advanced Studies on Collaborative Research, 2009.
[7] K. Kambatla, A. Pathak, and H. Pucha, "Towards Optimizing Hadoop Provisioning in the Cloud", in Proc. of the First Workshop on Hot Topics in Cloud Computing, 2009.
[8] S. O. Kuyoro, F. Ibikunle, and O. Awodele, "Cloud Computing Security Issues and Challenges", International Journal of Computer Networks (IJCN), Vol. 3, Issue 5, 2011.
ShapeBlue31 views
"Surviving highload with Node.js", Andrii Shumada by Fwdays
"Surviving highload with Node.js", Andrii Shumada "Surviving highload with Node.js", Andrii Shumada
"Surviving highload with Node.js", Andrii Shumada
Fwdays33 views
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue by ShapeBlue
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlueCloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue
ShapeBlue25 views
What’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlue by ShapeBlue
What’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlueWhat’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlue
What’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlue
ShapeBlue89 views
State of the Union - Rohit Yadav - Apache CloudStack by ShapeBlue
State of the Union - Rohit Yadav - Apache CloudStackState of the Union - Rohit Yadav - Apache CloudStack
State of the Union - Rohit Yadav - Apache CloudStack
ShapeBlue106 views
【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院 by IttrainingIttraining
【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院
【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院
NTGapps NTG LowCode Platform by Mustafa Kuğu
NTGapps NTG LowCode Platform NTGapps NTG LowCode Platform
NTGapps NTG LowCode Platform
Mustafa Kuğu28 views
Backroll, News and Demo - Pierre Charton, Matthias Dhellin, Ousmane Diarra - ... by ShapeBlue
Backroll, News and Demo - Pierre Charton, Matthias Dhellin, Ousmane Diarra - ...Backroll, News and Demo - Pierre Charton, Matthias Dhellin, Ousmane Diarra - ...
Backroll, News and Demo - Pierre Charton, Matthias Dhellin, Ousmane Diarra - ...
ShapeBlue61 views
Keynote Talk: Open Source is Not Dead - Charles Schulz - Vates by ShapeBlue
Keynote Talk: Open Source is Not Dead - Charles Schulz - VatesKeynote Talk: Open Source is Not Dead - Charles Schulz - Vates
Keynote Talk: Open Source is Not Dead - Charles Schulz - Vates
ShapeBlue84 views
KVM Security Groups Under the Hood - Wido den Hollander - Your.Online by ShapeBlue
KVM Security Groups Under the Hood - Wido den Hollander - Your.OnlineKVM Security Groups Under the Hood - Wido den Hollander - Your.Online
KVM Security Groups Under the Hood - Wido den Hollander - Your.Online
ShapeBlue75 views
Don’t Make A Human Do A Robot’s Job! : 6 Reasons Why AI Will Save Us & Not De... by Moses Kemibaro
Don’t Make A Human Do A Robot’s Job! : 6 Reasons Why AI Will Save Us & Not De...Don’t Make A Human Do A Robot’s Job! : 6 Reasons Why AI Will Save Us & Not De...
Don’t Make A Human Do A Robot’s Job! : 6 Reasons Why AI Will Save Us & Not De...
Moses Kemibaro27 views
VNF Integration and Support in CloudStack - Wei Zhou - ShapeBlue by ShapeBlue
VNF Integration and Support in CloudStack - Wei Zhou - ShapeBlueVNF Integration and Support in CloudStack - Wei Zhou - ShapeBlue
VNF Integration and Support in CloudStack - Wei Zhou - ShapeBlue
ShapeBlue62 views
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas... by Bernd Ruecker
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...
Bernd Ruecker48 views

Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015

Running MapReduce Programs in Clouds
Anshul Aggarwal, Cisco Systems
What is MapReduce?
• Simple data-parallel programming model designed for scalability and fault-tolerance
• Pioneered by Google
• Processes 20 petabytes of data per day
• Popularized by the open-source Hadoop project
• Used at Yahoo!, Facebook, Amazon, …
Outline
• Cloud And MapReduce
• MapReduce architecture
• Example applications
• Getting started with Hadoop
• Tuning MapReduce
Cloud Computing
• The emergence of cloud computing has made a tremendous impact on the Information Technology (IT) industry
• Cloud computing moved away from personal computers and individual enterprise application servers to services provided by a cloud of computers
• Resources such as CPU and storage are provided as general utilities to users, on demand, over the Internet
• Cloud computing is still in its early stages, with many issues yet to be addressed
Outline
• Cloud And MapReduce
• MapReduce architecture
• Example applications
• Getting started with Hadoop
• Tuning MapReduce
MapReduce History
• Historically, data processing was done almost entirely with database technologies: most data had a well-defined structure and was often stored in relational databases
• Data volumes soon reached terabytes and then petabytes
• Google developed a new programming model called MapReduce to handle large-scale data analysis, and later introduced the model in their seminal paper MapReduce: Simplified Data Processing on Large Clusters
What is MapReduce used for?
• At Google:
  • Index construction for Google Search
  • Article clustering for Google News
  • Statistical machine translation
• At Yahoo!:
  • “Web map” powering Yahoo! Search
  • Spam detection for Yahoo! Mail
• At Facebook:
  • Data mining
  • Ad optimization
  • Spam detection
MapReduce Framework
• A computing paradigm for processing data that resides on hundreds of computers
• Popularized recently by Google, Hadoop, and many others
• More of a framework than a single tool
• Makes some problems easier to solve, and others harder
• Key concerns: inter-cluster network utilization and the performance of a job that will be distributed
• Published by Google without any actual source code
Outline
• Cloud And MapReduce
• MapReduce Basics
• Example applications
• Getting started with Hadoop
• Tuning MapReduce
Word Count: the "Hello World" of the MapReduce world
• The word count job accepts an input directory, a mapper function, and a reducer function as inputs
• The mapper function processes the data in parallel; the reducer function collects the mapper's results and produces the final output
• The mapper sends its results to the reducer using a key-value model
• $ bin/hadoop -cp hadoop-microbook.jar microbook.wordcount.WordCount amazon-meta.txt wordcount-output1
Example: Word Count
• Job: count the occurrences of each word in a data set
(diagram: map tasks feeding reduce tasks)
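The word-count data flow above can be sketched in plain Java. This is an illustrative stand-in, not the Hadoop API: map emits a (word, 1) pair per word, the framework groups pairs by key (the "shuffle"), and reduce sums each group.

```java
import java.util.*;

// Toy word count: map emits (word, 1) pairs, reduce sums counts per word.
// This only mimics the data flow; real Hadoop runs these phases distributed.
public class WordCountSketch {
    // Map phase: one (word, 1) pair per word in the input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) pairs.add(Map.entry(word, 1));
        }
        return pairs;
    }

    // Shuffle + reduce: group values by key, then sum each group.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : new String[] {"hello world", "hello mapreduce"}) {
            pairs.addAll(map(line));
        }
        System.out.println(reduce(pairs)); // {hello=2, mapreduce=1, world=1}
    }
}
```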
Outline
• Cloud And MapReduce
• MapReduce Basics
• Example applications
• MapReduce Architecture
• Getting started with Hadoop
• Tuning MapReduce
How MapReduce Works
At the highest level, there are four independent entities:
• The client, which submits the MapReduce job
• The jobtracker, which coordinates the job run; the jobtracker is a Java application whose main class is JobTracker
• The tasktrackers, which run the tasks that the job has been split into
• The distributed filesystem (normally HDFS), which is used for sharing job files between the other entities
Anatomy of a MapReduce Job
Developing a MapReduce Application
• The Configuration API:
  Configuration conf = new Configuration();
  conf.addResource("configuration-1.xml");
  conf.addResource("configuration-2.xml");
• GenericOptionsParser, Tool, and ToolRunner
• Writing a unit test
• Testing the driver
• Launching a job:
  % hadoop jar hadoop-examples.jar v3.MaxTemperatureDriver -conf conf/hadoop-cluster.xml input/ncdc/all max-temp
• Retrieving the results
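The key behavior of the Configuration API is that resources are layered: a property set in a later resource overrides the same property from an earlier one. A minimal stand-in (not Hadoop's class; Hadoop also supports "final" properties, which this sketch ignores) makes that ordering concrete:

```java
import java.util.Properties;

// Hypothetical stand-in for Hadoop's layered Configuration: resources are
// applied in the order added, and a later resource overrides an earlier one.
public class LayeredConfig {
    private final Properties props = new Properties();

    // Analogous to conf.addResource(...): merge one resource's key/values in.
    public void addResource(Properties resource) {
        props.putAll(resource);
    }

    public String get(String name) {
        return props.getProperty(name);
    }

    public static void main(String[] args) {
        Properties defaults = new Properties();   // like configuration-1.xml
        defaults.setProperty("color", "yellow");
        defaults.setProperty("size", "10");
        Properties overrides = new Properties();  // like configuration-2.xml
        overrides.setProperty("size", "12");

        LayeredConfig conf = new LayeredConfig();
        conf.addResource(defaults);
        conf.addResource(overrides);
        System.out.println(conf.get("size"));  // 12 (later resource wins)
        System.out.println(conf.get("color")); // yellow (unchanged)
    }
}
```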
This is where the Magic Happens

public class MaxTemperatureDriver extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    Job job = new Job(getConf(), "Max temperature");
    job.setJarByClass(getClass());
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapperClass(MaxTemperatureMapper.class);
    job.setCombinerClass(MaxTemperatureReducer.class);
    job.setReducerClass(MaxTemperatureReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new MaxTemperatureDriver(), args);
    System.exit(exitCode);
  }
}
Configuring MapReduce params

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>MASTER_NODE:9001</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>HADOOP_DATA_DIR/local</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>8</value>
  </property>
</configuration>

$ bin/hadoop -cp hadoop-microbook.jar microbook.wordcount.WordCount amazon-meta.txt wordcount-output1
  • 26. Q & A
Outline
• Cloud And MapReduce
• MapReduce architecture
• Example applications
• Getting started with Hadoop
• Tuning MapReduce
Why is Hadoop able to compete?
Hadoop:
• Scalability (petabytes of data, thousands of machines)
• Flexibility in accepting all data formats (no schema)
• Commodity, inexpensive hardware
• Efficient and simple fault-tolerance mechanism
Databases:
• Performance (extensive indexing, tuning, and data-organization techniques)
• Features such as provenance tracking, annotation management, …
What is Hadoop?
• Hadoop is a software framework for distributed processing of large datasets across large clusters of computers
• Large datasets  terabytes or petabytes of data
• Large clusters  hundreds or thousands of nodes
• Hadoop is an open-source implementation of Google's MapReduce
• HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware
What is Hadoop? (Cont'd)
• The Hadoop framework consists of two main layers:
  • Distributed file system (HDFS)
  • Execution engine (MapReduce)
• Hadoop is designed as a master-slave, shared-nothing architecture
Design Principles of Hadoop
• Automatic parallelization & distribution: computation is spread across thousands of nodes and hidden from the end user
• Fault tolerance and automatic recovery: nodes/tasks will fail and will recover automatically
• Clean and simple programming abstraction: users provide only two functions, "map" and "reduce"
• Need to process big data
• Commodity hardware: large numbers of low-end, cheap machines working in parallel to solve a computing problem
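The "users only provide map and reduce" principle can be made concrete with a toy local runner. Everything inside runLocal stands in for what the framework hides (distribution, shuffling, fault tolerance happen across a cluster in the real system); the user supplies just the two functions passed as arguments.

```java
import java.util.*;
import java.util.function.BiFunction;
import java.util.function.Function;

// Sketch of the two-function abstraction: the framework owns grouping and
// scheduling; users plug in only a map function and a reduce function.
public class MiniFramework {
    static <K, V> Map<K, V> runLocal(
            List<String> input,
            Function<String, List<Map.Entry<K, V>>> mapFn,
            BiFunction<K, List<V>, V> reduceFn) {
        // "Shuffle": group intermediate values by key.
        Map<K, List<V>> groups = new LinkedHashMap<>();
        for (String record : input) {
            for (Map.Entry<K, V> pair : mapFn.apply(record)) {
                groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                      .add(pair.getValue());
            }
        }
        // "Reduce": one call per distinct key.
        Map<K, V> out = new LinkedHashMap<>();
        groups.forEach((k, vs) -> out.put(k, reduceFn.apply(k, vs)));
        return out;
    }

    public static void main(String[] args) {
        // The user writes only these two functions; the framework does the rest.
        Map<String, Integer> counts = runLocal(
                List.of("ant bee", "ant"),
                line -> {
                    List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
                    for (String w : line.split(" ")) pairs.add(Map.entry(w, 1));
                    return pairs;
                },
                (word, ones) -> ones.stream().mapToInt(Integer::intValue).sum());
        System.out.println(counts); // {ant=2, bee=1}
    }
}
```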
Hardware Specs
• Memory
• RAM
• Total tasks
• No RAID required
• No blade servers
• Dedicated switch
• Dedicated 1 Gb line
Who Uses MapReduce/Hadoop?
• Google: inventors of the MapReduce computing paradigm
• Yahoo!: develops Hadoop, the open-source implementation of MapReduce
• IBM, Microsoft, Oracle
• Facebook, Amazon, AOL, Netflix
• Many others, plus universities and research labs
• Many enterprises are turning to Hadoop, especially for applications generating big data: web applications, social networks, scientific applications
Hadoop: How it Works
• Hadoop implements Google's MapReduce, using HDFS
• MapReduce divides applications into many small blocks of work
• HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster
• MapReduce can then process the data where it is located
• Hadoop's target is to run on clusters on the order of 10,000 nodes
Sathya Sai University, Prashanti Nilayam
Hadoop: Assumptions
It is written with large clusters of computers in mind and is built around the following assumptions:
• Hardware will fail
• Processing will be run in batches
• Applications that run on HDFS have large data sets
• It should provide high aggregate data bandwidth
• Applications need a write-once-read-many access model
• Moving computation is cheaper than moving data
• Portability is important
Hadoop Distributed File System (HDFS)
• Centralized namenode: maintains metadata about files
• Many datanodes (1000s): store the actual data; files are divided into blocks (64 MB), and each block is replicated N times (default N = 3)
(diagram: file F split into blocks 1–5, distributed across datanodes)
Main Properties of HDFS
• Large: an HDFS instance may consist of thousands of server machines, each storing part of the file system's data
• Replication: each data block is replicated many times (default is 3)
• Failure: failure is the norm rather than the exception
• Fault tolerance: detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS; the namenode constantly checks the datanodes
Outline
• Cloud And MapReduce
• MapReduce architecture
• Example applications
• Getting started with Hadoop
• Tuning MapReduce
Mapping Workers to Processors
• The input data (on HDFS) is stored on the local disks of the machines in the cluster. HDFS divides each file into 64 MB blocks and stores several copies of each block (typically 3) on different machines.
• The MapReduce master takes the location information of the input files into account and attempts to schedule a map task on a machine that contains a replica of the corresponding input data. Failing that, it attempts to schedule the map task near a replica of its input data. When running large MapReduce operations on a significant fraction of the workers in a cluster, most input data is read locally and consumes no network bandwidth.
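The locality preference described above reduces to a simple rule. The sketch below is a hypothetical simplification (node names and the two-tier fallback are illustrative; a real scheduler also considers rack distance between the worker and the replica):

```java
import java.util.*;

// Toy locality-aware placement: prefer a free worker that already holds a
// replica of the input split; otherwise fall back to any free worker,
// which forces a remote read over the network.
public class LocalityScheduler {
    static String pickWorker(Set<String> replicaHosts, List<String> freeWorkers) {
        for (String worker : freeWorkers) {
            if (replicaHosts.contains(worker)) return worker; // data-local task
        }
        // No free worker holds the data: accept a non-local assignment.
        return freeWorkers.isEmpty() ? null : freeWorkers.get(0);
    }

    public static void main(String[] args) {
        Set<String> replicas = Set.of("node2", "node5", "node7"); // 3 replicas
        System.out.println(pickWorker(replicas, List.of("node1", "node5"))); // node5
        System.out.println(pickWorker(replicas, List.of("node3", "node4"))); // node3
    }
}
```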
Task Granularity
• The map phase has M pieces and the reduce phase has R pieces
• M and R should be much larger than the number of worker machines
• Having each worker perform many different tasks improves dynamic load balancing and speeds up recovery when a worker fails
• The larger M and R are, the more decisions the master must make
• R is often constrained by users because the output of each reduce task ends up in a separate output file
• Typically (at Google), M = 200,000 and R = 5,000, using 2,000 worker machines
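The master's overhead in the bullets above can be quantified: per the original MapReduce paper, the master makes O(M + R) scheduling decisions and keeps O(M × R) pieces of state (one per map-output/reduce-task pair). Plugging in the slide's typical values:

```java
// Master overhead as a function of task granularity, following the
// O(M + R) scheduling and O(M * R) state bounds from the MapReduce paper.
public class Granularity {
    static long schedulingDecisions(long m, long r) { return m + r; }
    static long masterStateEntries(long m, long r) { return m * r; }

    public static void main(String[] args) {
        long m = 200_000, r = 5_000; // the slide's typical Google values
        System.out.println(schedulingDecisions(m, r));  // 205000
        System.out.println(masterStateEntries(m, r));   // 1000000000
    }
}
```

This is why M and R cannot grow without bound: the state table alone reaches a billion entries at the quoted granularity.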
Speculative Execution: One Approach
• Tasks may be slow for various reasons, including hardware degradation or software misconfiguration, but the causes may be hard to detect since the tasks still complete successfully, albeit after a longer time than expected
• Hadoop doesn't try to diagnose and fix slow-running tasks; instead, it tries to detect when a task is running slower than expected and launches another, equivalent task as a backup
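A toy version of this detection logic is shown below. The progress-gap threshold is an arbitrary illustrative choice, and Hadoop's actual heuristic is more involved (it also weighs how long tasks have been running), but the shape is the same: flag tasks whose progress falls well behind their peers as candidates for a speculative backup.

```java
import java.util.*;

// Illustrative straggler detection: a task whose progress is far below the
// average of its peers becomes a candidate for a speculative backup task.
public class Speculator {
    static List<Integer> backupCandidates(double[] progress, double gap) {
        double avg = Arrays.stream(progress).average().orElse(0);
        List<Integer> slow = new ArrayList<>();
        for (int i = 0; i < progress.length; i++) {
            if (progress[i] < avg - gap) slow.add(i); // lagging task index
        }
        return slow;
    }

    public static void main(String[] args) {
        double[] progress = {0.9, 0.85, 0.2, 0.95}; // fraction complete per task
        System.out.println(backupCandidates(progress, 0.2)); // [2]
    }
}
```

Whichever attempt (original or backup) finishes first wins; the other is killed, trading some duplicated work for a shorter job tail.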
Problem Statement
The problem at hand is defining a resource-provisioning framework for MapReduce jobs running in a cloud, keeping in mind performance goals such as resource utilization, with:
• an optimal number of map and reduce slots
• improvements in execution time
• a highly scalable solution
References
[1] E. Bortnikov, A. Frank, E. Hillel, and S. Rao, "Predicting execution bottlenecks in map-reduce clusters," in Proc. of the 4th USENIX Conference on Hot Topics in Cloud Computing, 2012.
[2] R. Buyya, S. K. Garg, and R. N. Calheiros, "SLA-oriented resource provisioning for cloud computing: Challenges, architecture, and solutions," in International Conference on Cloud and Service Computing, 2011.
[3] S. Chaisiri, Bu-Sung Lee, and D. Niyato, "Optimization of resource provisioning cost in cloud computing," IEEE Transactions on Services Computing, vol. 5, no. 2, April–June 2012.
[4] L. Cherkasova and R. H. Campbell, "Resource provisioning framework for MapReduce jobs with performance goals," in Middleware 2011, LNCS 7049, pp. 165–186, 2011.
[5] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Communications of the ACM, Jan. 2008.
[6] Y. Hu, J. Wong, G. Iszlai, and M. Litoiu, "Resource provisioning for cloud computing," in Proc. of the 2009 Conference of the Center for Advanced Studies on Collaborative Research, 2009.
[7] K. Kambatla, A. Pathak, and H. Pucha, "Towards optimizing Hadoop provisioning in the cloud," in Proc. of the First Workshop on Hot Topics in Cloud Computing, 2009.
[8] Kuyoro S. O., Ibikunle F., and Awodele O., "Cloud computing security issues and challenges," International Journal of Computer Networks (IJCN), vol. 3, issue 5, 2011.

Editor's Notes

  1. When you run the MapReduce job, Hadoop first reads the input files from the input directory line by line. Then Hadoop invokes the mapper once for each line, passing the line as the argument. Subsequently, each mapper parses the line and extracts the words it contains. After processing, the mapper sends the word count to the reducer by emitting the word and word count as key-value pairs.
  2. Writing a program in MapReduce has a certain flow to it. You start by writing your map and reduce functions, ideally with unit tests to make sure they do what you expect. Then you write a driver program to run a job, which can run from your IDE using a small subset of the data to check that it is working. If it fails, then you can use your IDE’s debugger to find the source of the problem. With this information, you can expand your unit tests to cover this case and improve your mapper or reducer as appropriate to handle such input correctly. When the program runs as expected against the small dataset, you are ready to unleash it on a cluster. Running against the full dataset is likely to expose some more issues, which you can fix as before, by expanding your tests and mapper or reducer to handle the new cases. Debugging failing programs in the cluster is a challenge, so we look at some common techniques to make it easier.
  3. We solve problems involving large datasets using many computers, processing the dataset in parallel across those machines. However, writing a program that processes a dataset in a distributed setup is a heavy undertaking, and while it is possible to write such a program, it is wasteful to write such programs again and again. MapReduce-based frameworks like Hadoop let users write only the map and reduce logic, while the framework takes care of the distribution details.