Profiling the Network Performance of Hadoop Jobs
Team : Pramod Biligiri & Sayed Asad Ali
Talk Outline
Introduction to the problem
What is Hadoop?
Hadoop’s MapReduce Framework
Shuffle as a Bottleneck
Experimental Setup
Choice of Benchmarks
Terasort Discussion
Ranked Inverted Index Discussion
Summary and Future Work
Introduction to the problem
Reproduce existing results which show that the
“Network” is the bottleneck in shuffle-intensive
Hadoop jobs.
What is Hadoop?
A framework for distributed processing of large data sets across
clusters of computers using simple programming models based on
Google’s MapReduce.
Distinct Features:
● Designed for Commodity Hardware
● Highly Fault-tolerant
● Horizontally Scalable
● Push computation to data
MapReduce
● MapReduce is a programming model for processing large data sets
with a parallel, distributed algorithm on a cluster
● Programming Model
○ For each input record, generate (key, value) pairs
○ Apply the reduce operation to all values that share the same key
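The programming model above can be sketched in a few lines of plain Python. This is not Hadoop code: word count is assumed as the example job, and `map_fn`, `reduce_fn` and `run_mapreduce` are illustrative names that do not come from the slides or from any Hadoop API.

```python
from collections import defaultdict

def map_fn(record):
    """Emit one (key, value) pair per word in an input record."""
    for word in record.split():
        yield word, 1

def reduce_fn(key, values):
    """Combine all values that share a key."""
    return key, sum(values)

def run_mapreduce(records):
    groups = defaultdict(list)
    for record in records:                 # map phase
        for key, value in map_fn(record):
            groups[key].append(value)      # group values by key
    # reduce phase: one call per distinct key
    return dict(reduce_fn(k, v) for k, v in groups.items())

counts = run_mapreduce(["a b a", "b c"])
```

In a real cluster the grouping step is the part that crosses machine boundaries, which is why the rest of the talk focuses on it.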
Hadoop’s MapReduce Framework
1. Prepare the Map() input
2. Run the user-provided Map() code
3. "Shuffle" the Map output to the Reduce processors
4. Run the user-provided Reduce() code
5. Produce the final output
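Step 3 is the one that moves data between machines. A minimal sketch of it, assuming keys are routed by a hash modulo the number of reducers (Hadoop's default partitioner works along these lines); `zlib.crc32` is only a stand-in for the key's hash, and the function names are made up for illustration.

```python
import zlib
from collections import defaultdict

def partition(key, num_reducers):
    """Pick the reducer for a key (crc32 stands in for the key's hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle(map_outputs, num_reducers):
    """Route every (key, value) pair from every map task to its reducer."""
    reducer_inputs = [defaultdict(list) for _ in range(num_reducers)]
    for task_output in map_outputs:
        for key, value in task_output:
            reducer_inputs[partition(key, num_reducers)][key].append(value)
    return reducer_inputs

# two map tasks, each with its own output
map_outputs = [[("a", 1), ("b", 1)], [("a", 1), ("c", 1)]]
reducer_inputs = shuffle(map_outputs, num_reducers=2)
```

Every map task may hold pairs for every reducer, so in general each reducer has to fetch data from each map node; that all-to-all pattern is what loads the network.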
MapReduce Flow
[Figure: MapReduce flow diagram, with the callout "Shuffle!"]
Shuffle as a Bottleneck?
“On average, the shuffle phase accounts for 33% of the running
time in these jobs. In addition, in 26% of the jobs with reduce tasks,
shuffles account for more than 50% of the running time, and in 16% of
jobs, they account for more than 70% of the running time. This
confirms widely reported results that the network is a bottleneck
in MapReduce”
source : Managing Data Transfers in Computer Clusters with Orchestra - Mosharaf Chowdhury et al.
Chosen Benchmarks
● Terasort
● Ranked Inverted Index
Experimental Setups
         | Instance type | Memory | CPU                                              | Elastic Compute Units | Disk       | Network performance
Config 1 | m1.large      | 7.5 GB | 64-bit                                           | 4                     | 2 x 420 GB | Moderate
Config 2 | m1.xlarge     | 15 GB  | 64-bit                                           | 8                     | 4 x 420 GB | High
SDSC     | custom        | 8 GB   | 64-bit / Intel Xeon CPU 5140 @ 2.33 GHz, 4 cores | -                     | 2 x 1.5 TB | 1 Gb/s
Network Performance of EMR
Conflicting Values!
Source 1 : with AppNeta pathtest
average : 753 Mb/s
http://www.appneta.com/resources/pathtest-download.html
Source 2 : “The available bandwidth is still 1 Gb/s, confirming
anecdotal evidence that EC2 has full bisection bandwidth."
Opening Up Black Box Networks with CloudTalk, by Costin Raiciu et al
Source 3 : “The median TCP/UDP throughput of medium
instances are both close to 760 Mb/s."
The Impact of Virtualization on Network Performance of Amazon EC2 Data Center, by Guohui Wang et al
Why Terasort?
● Popular benchmark for Hadoop
● Shipped with most Hadoop distributions
● Utilizes all aspects of the cluster - CPU, network, disk and memory
● Large amount of data to shuffle (240 GB)
● Representative of real world workloads
“This data shuffle pattern arises in large scale sorts, merges and join
operations in the data center. We chose this test because, in our
interactions with application developers, we learned that many use such
operations with caution, because the operations are highly expensive in
today’s data center network.”
source : VL2: A Scalable and Flexible Data Center Network - A. Greenberg et al.
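To get a feel for what a 240 GB shuffle means at the link speeds quoted earlier (753 Mb/s measured, 1 Gb/s claimed), here is a rough lower-bound calculation. It is a sketch under loud assumptions: the data is spread evenly over the nodes, all links run at full speed in parallel, and the cluster size `NUM_NODES` is a hypothetical value, since the slides do not state how many nodes were used.

```python
def transfer_minutes(data_gb, link_mbps, num_nodes):
    """Lower bound on transfer time if data is spread evenly over num_nodes links.

    Uses decimal units (1 GB = 8000 Mb) and ignores protocol overhead,
    node-local transfers, and overlap with the map phase.
    """
    megabits = data_gb * 8000
    seconds = megabits / (link_mbps * num_nodes)
    return seconds / 60

NUM_NODES = 10  # hypothetical cluster size; not stated in the slides
for mbps in (753, 1000):
    print(mbps, "Mb/s:", round(transfer_minutes(240, mbps, NUM_NODES), 1), "min")
```

Measured shuffle times are expected to be longer than such a bound, because shuffle also involves disk reads, merging, and waiting for map tasks to finish.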
Terasort - How it works:
● Sorts 1 terabyte of data.
● Each data item is 100 bytes in size.
● The first 10 bytes of a data item constitute its sort key.
● Format of input data:
○ <key 10 bytes><rowid 10 bytes><filler 78 bytes>\r\n
■ key : random characters from ASCII 32-126
■ rowid : an integer
■ filler : random characters from the set A-Z
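The record layout above adds up to exactly 100 bytes (10 + 10 + 78 + 2 for the line ending). A small sketch that builds one such record follows; it is not the generator shipped with Hadoop, and the way the row id is padded to 10 bytes is an assumption, as is reading "1 terabyte" as 10^12 bytes.

```python
import random

KEY_CHARS = [chr(c) for c in range(32, 127)]                      # ASCII 32-126
FILLER_CHARS = [chr(c) for c in range(ord("A"), ord("Z") + 1)]    # A-Z

def make_record(row_id, rng):
    """Build one 100-byte record: key, row id, filler, CR LF."""
    key = "".join(rng.choice(KEY_CHARS) for _ in range(10))
    rowid = str(row_id).rjust(10)          # padding style is an assumption
    filler = "".join(rng.choice(FILLER_CHARS) for _ in range(78))
    return (key + rowid + filler + "\r\n").encode("ascii")

rng = random.Random(0)
record = make_record(42, rng)

# number of 100-byte records in one (decimal) terabyte
records_per_terabyte = 10**12 // 100
```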
Terasort - How it works:
Map
Partition input keys into different buckets
<Leverage Hadoop’s default sorting of Map output>
Reduce
Collect outputs from different maps
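One way to read the Map step is as range partitioning: if every key in bucket i sorts before every key in bucket i+1, sorting each bucket independently and concatenating the results gives a globally sorted output. The slides do not say how the bucket boundaries are chosen; the sketch below assumes they are taken from a random sample of the keys, and all names in it are illustrative.

```python
import bisect
import random

def choose_split_points(sample_keys, num_reducers):
    """Pick num_reducers - 1 boundaries from a sorted sample of keys."""
    ordered = sorted(sample_keys)
    step = len(ordered) / num_reducers
    return [ordered[int(i * step)] for i in range(1, num_reducers)]

def bucket_for(key, split_points):
    """Index of the reducer whose key range contains this key."""
    return bisect.bisect_right(split_points, key)

rng = random.Random(1)
keys = ["".join(rng.choice("abcdefghij") for _ in range(10)) for _ in range(1000)]
splits = choose_split_points(rng.sample(keys, 100), num_reducers=4)

buckets = [[] for _ in range(4)]
for key in keys:                          # map side: assign each key to a bucket
    buckets[bucket_for(key, splits)].append(key)

# reduce side: each bucket is sorted on its own, then outputs are concatenated
output = [key for bucket in buckets for key in sorted(bucket)]
```

Sampling keeps the buckets roughly equal in size, which matters because the slowest reducer determines when the job finishes.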
Results
Comparison of Terasort on different configurations
Instance types:
● Config 1 : m1.large (RAM 7.5 GB)
● Config 2 : m1.xlarge (RAM 15 GB)
● SDSC : Custom (RAM 8 GB)

         | Total job Time (min) | Map Time (min) | Reduce Time (min) | Shuffle Average Time (min) | Shuffle Time %
Config 1 | 205                  | 84             | 205               | 60                         | 29.3
SDSC     | 166                  | 60             | 90                | 36                         | 21.7
Config 2 | 86                   | 40             | 75                | 22                         | 25.5
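The Shuffle Time % figures appear to be the average shuffle time divided by the total job time; that reading is inferred from the numbers rather than stated on the slide. A quick check using the rounded minutes from the results above:

```python
# (total job time, average shuffle time) in minutes, from the Terasort results
terasort_runs = {
    "Config 1": (205, 60),
    "SDSC": (166, 36),
    "Config 2": (86, 22),
}

def shuffle_share(total_min, shuffle_min):
    """Shuffle time as a percentage of total job time."""
    return 100.0 * shuffle_min / total_min

shares = {
    name: round(shuffle_share(total, shuffle), 1)
    for name, (total, shuffle) in terasort_runs.items()
}
```

Config 1 and SDSC reproduce the slide's values. Config 2 comes out one tenth of a point higher than the reported 25.5, presumably because the slide's percentage was computed from unrounded timings.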
CDF of data transferred over the network during the lifetime of the job
[Figure: CDF plot. Annotations: Map ends; Shuffle starts; Shuffle ends; Reduce nearly done; Sorting of Map outputs (local to the node). Marked values: 5100, 6900]
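A CDF like this one can be derived from per-interval byte counts, for example periodic samples of a network counter. The slides do not describe how the plot was produced, so the following is only a sketch of the general technique, with made-up sample values.

```python
from itertools import accumulate

def transfer_cdf(bytes_per_interval):
    """Fraction of total bytes transferred by the end of each interval."""
    running = list(accumulate(bytes_per_interval))
    total = running[-1]
    return [value / total for value in running]

# made-up samples: little traffic during map, a burst during shuffle
samples = [1, 1, 2, 8, 8, 0]
cdf = transfer_cdf(samples)
```

Steep segments of the curve correspond to intervals with heavy network traffic; flat segments correspond to node-local work such as sorting.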
Network Transfer Rate on nodes
[Figure: network transfer rate per node. Annotation: Network Link Saturated]
Disk I/O
[Figure: disk I/O over time. Blue : Read, Red : Write. Annotation: Sorting of map outputs]
CPU Utilisation
Memory Statistics
Why Ranked Inverted Index?
● For a given text corpus, for each word it generates a list of
documents containing the word in decreasing order of frequency
word -> (count1 | file1), (count2 | file2), ...
count1 > count2 > …
● A ranked inverted index is used often in text processing and
information retrieval tasks
● Mentioned in the Tarazu paper as a Shuffle heavy workload
Tarazu: Optimizing MapReduce On Heterogeneous Clusters, Faraz Ahmad et al.
Ranked Inverted Index - How it works:
Map input: (word | filename) -> count
Map output: word -> (filename, count)
Reduce output: word -> (count1 | file1), (count2 | file2) ...
It involves a sort of the values on the reduce side
(Note that the Map input is the output of another MapReduce job called
sequence-count)
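The three stages above can be sketched in plain Python. The input below is made up and only shaped like the output of the preceding counting job; the function names are illustrative and do not come from the benchmark's source code.

```python
from collections import defaultdict

def map_fn(word_file, count):
    """(word | filename) -> count  becomes  word -> (filename, count)."""
    word, filename = word_file
    return word, (filename, count)

def reduce_fn(word, postings):
    """Sort a word's postings by decreasing count."""
    ranked = sorted(postings, key=lambda fc: fc[1], reverse=True)
    return word, [(count, filename) for filename, count in ranked]

# made-up input: (word, filename) -> count
counts = {("hadoop", "a.txt"): 3, ("hadoop", "b.txt"): 7, ("shuffle", "a.txt"): 1}

grouped = defaultdict(list)
for key, count in counts.items():
    word, posting = map_fn(key, count)
    grouped[word].append(posting)          # stands in for the shuffle

index = dict(reduce_fn(word, postings) for word, postings in grouped.items())
```

The map step only re-keys each record, so nearly all of the map output has to cross the network, which is consistent with the benchmark being described as shuffle heavy.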
Experimental Results of Ranked Inverted Index
Instance type: Config 1 : m1.large (RAM 7.5 GB)

         | Total job Time (min) | Map Time (min) | Reduce Time (min) | Shuffle Average Time (min) | Shuffle Time %
Config 1 | 12                   | 5.5            | 11.5              | 3.5                        | 27.14
Input Data Set : 40 GB ftp://ftp.ecn.purdue.edu/fahmad/rankedinvindex_40GB.tar.bz2
CDF of data transferred over the network during the lifetime of the job
[Figure: CDF plot. Annotations: Map ends; Shuffle starts; Shuffle ends; Reduce nearly done; Replicating results to 3 Nodes]
Network Transfer Rate on nodes
[Figure: network transfer rate per node. Annotation: Network Link Saturated]
Disk I/O
[Figure: disk I/O over time. Blue : Read, Red : Write]
CPU Utilisation
Memory Statistics
Summary
- Shuffle can account for a significant share of total job runtime (about 22% to 29% in the runs reported here)
- Good network connectivity is worth investing in for a compute cluster
Stuff that doesn’t add up!
● Why does the peak network bandwidth for Ranked Inverted Index (RII) overshoot the 1 Gb/s mark?
● Why is the sort phase of RII so short?
Future Work
● How does changing the various parameters make a difference? e.g. io.sort.mb, io.sort.factor, fs.inmemory.size.mb
● Effect of Combiners?
● Varying the number of Map Tasks and Reduce Tasks
● How many Map tasks are rack local or machine local?
● Investigate the unresolved issues
● Obtain precise information about “topology” and “network bandwidth” for EMR clusters, which is currently lacking
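On the combiner question in the list above: a combiner pre-aggregates each map task's output locally, so fewer records have to be shuffled. The sketch below illustrates the idea with word count, which is an assumed example and not one of the benchmarks in this talk; the function names are illustrative.

```python
from collections import Counter

def map_task(records):
    """Emit (word, 1) for every word in this task's input split."""
    return [(word, 1) for record in records for word in record.split()]

def combine(pairs):
    """Pre-aggregate one map task's output before it is shuffled."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return list(totals.items())

split = ["to be or not to be", "to do is to be"]
raw = map_task(split)          # what would be shuffled without a combiner
combined = combine(raw)        # what would be shuffled with a combiner
```

The saving depends on how often keys repeat within a single map task. A job whose keys are mostly distinct, such as a sort, would presumably gain little.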
Q & A
Thank you!
Standard Test Results
                      | Input Size | Run Time on Hadoop (min) | Shuffle Volume | Critical Path
tera-sort             | 300        | 2353                     | 200            | Shuffle
ranked-inverted-index | 205        | 2322                     | 219            | Shuffle
