MSR 2009

  • 1. MapReduce as a General Framework to Support Research in Mining Software Repositories (MSR)
      Weiyi Shang, Zhen Ming Jiang, Bram Adams, Ahmed Hassan
      Software Analysis and Intelligence Lab (SAIL), School of Computing, Queen’s University
  • 2. As an MSR researcher, have you ever been in such a situation?
      • Analyzing gigabytes of data?
      • Waiting hours for experimental results?
      • Experiments fail with “out of memory” exceptions?
  • 3. To overcome these problems, you could …
      … buy more powerful machines
      … spend weeks making your tools more efficient
  • 4. However!
      • The data will keep on growing
      • Spend time on research, not on speeding up experiments
      [Figure caption: Debian doubles in size approximately every two years]
  • 5. Idle computing power is available in every lab. We can bundle these computers together. A distributed framework can help us do so.
  • 6. General requirements for a distributed framework:
      1. Efficiency: speed up the process significantly
      2. Scalability: scale with data size and computing power
      3. Adaptability: require only minimal programming effort
      4. Flexibility: run in various environments
  • 7. Google’s MapReduce is an idea for distributed computation
  • 8. Google’s MapReduce is an idea for distributed computation
      • Open-source MapReduce implementation
      • Well documented, with many examples available
      • Well supported by a large user base and newsgroups
      • Straightforward API
  • 9. Example: counting the frequency of word lengths
      Input words: dog, cat, fish, good, hello, night, happy, school
      # Words   Length
      2         3
      2         4
      3         5
      1         6
  • 10. Example: counting the frequency of word length
      1. Deploy data into a distributed file system
      Data: dog, cat, fish, hello, good, night, happy, school
      [Figure: the data is deployed into a network computing environment]
  • 11. Example: counting the frequency of word length
      2. Read data as records
      Data: dog, cat, fish, hello, good, night, happy, school
  • 12. Example: counting the frequency of word length
      3. Generate keys of each record by Mappers
      Data: dog, cat, fish, hello, good, night, happy, school
      Value    Key
      dog      3
      cat      3
      fish     4
      hello    5
      good     4
      night    5
      happy    5
      school   6
      [Figure: four Mappers emit these value/key pairs]
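
The map step above can be sketched in a few lines of Python. This is a minimal, single-process illustration and not the API of any particular MapReduce implementation; the function name `map_word` is hypothetical.

```python
from typing import Iterator, Tuple


def map_word(word: str) -> Iterator[Tuple[int, str]]:
    """Emit one (key, value) pair per record: the key is the word length."""
    yield (len(word), word)


# The records from the slide, in the order they are read.
records = ["dog", "cat", "fish", "hello", "good", "night", "happy", "school"]

# Apply the mapper to every record and collect the emitted pairs.
pairs = [pair for word in records for pair in map_word(word)]
print(pairs)
```

Because each record is mapped independently of the others, the records could be handed to any number of mappers without changing the emitted pairs.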
  • 13. Example: counting the frequency of word length
      4. Group and sort records by keys
      Before (value key): dog 3, cat 3, fish 4, hello 5, good 4, night 5, happy 5, school 6
      After (value key): dog 3, cat 3, fish 4, good 4, hello 5, night 5, happy 5, school 6
  • 14. Example: counting the frequency of word length
      5. Send records with the same key to one reducer
      Reducer (key 3): dog, cat
      Reducer (key 4): fish, good
      Reducer (key 5): hello, night, happy
      Reducer (key 6): school
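
One common way to make sure that all records with the same key reach the same reducer is to derive the reducer index from the key. The sketch below assumes a simple modulo partitioner and four reducers; the slides do not say how the framework assigns keys, so both the `partition` function and `NUM_REDUCERS` are illustrative assumptions.

```python
from collections import defaultdict


def partition(key: int, num_reducers: int) -> int:
    """Pick a reducer index from the key (assumed scheme: key modulo reducer count)."""
    return key % num_reducers


# Grouped and sorted (key, value) pairs from step 4.
sorted_pairs = [
    (3, "dog"), (3, "cat"),
    (4, "fish"), (4, "good"),
    (5, "hello"), (5, "night"), (5, "happy"),
    (6, "school"),
]

NUM_REDUCERS = 4  # hypothetical cluster size, matching the four reducers on the slide

# Route every pair to the reducer chosen by its key.
reducer_inputs = defaultdict(list)
for key, word in sorted_pairs:
    reducer_inputs[partition(key, NUM_REDUCERS)].append((key, word))

for reducer_id in sorted(reducer_inputs):
    print(reducer_id, reducer_inputs[reducer_id])
```

Since the reducer index depends only on the key, no key is ever split across two reducers, which is what lets each reducer compute its result locally.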
  • 15. Example: counting the frequency of word length
      6. Generate outputs by Reducers
      Reducer (key 3): dog, cat
      Reducer (key 4): fish, good
      Reducer (key 5): hello, night, happy
      Reducer (key 6): school
      Output:
      Value   Key
      2       3
      2       4
      3       5
      1       6
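
Steps 2 to 6 of the word-length example can be simulated end to end in one process. This is a sketch under the assumption that sorting plus `itertools.groupby` stands in for the framework's shuffle phase; the function names are hypothetical and nothing here is distributed.

```python
from itertools import groupby
from operator import itemgetter


def map_word(word):
    """Map: emit (word length, word)."""
    return (len(word), word)


def reduce_count(length, words):
    """Reduce: count how many words share this length."""
    return (length, len(list(words)))


def word_length_frequency(records):
    # Map every record, then sort by key so equal keys become adjacent.
    pairs = sorted((map_word(w) for w in records), key=itemgetter(0))
    # Group by key and hand each group to the reducer.
    return [
        reduce_count(key, (word for _, word in group))
        for key, group in groupby(pairs, key=itemgetter(0))
    ]


records = ["dog", "cat", "fish", "hello", "good", "night", "happy", "school"]
result = word_length_frequency(records)
print(result)  # [(3, 2), (4, 2), (5, 3), (6, 1)]
```

The result lists (length, number of words) and matches the output table on the slide: two words of length 3, two of length 4, three of length 5, and one of length 6.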
  • 16. A typical MSR analysis
      Extract all versions of all files
      Analyze each version
      Compare versions to each other
      We implement MapReduce on a typical MSR tool
      [Figure label: Repository]
  • 17. Applying MapReduce to typical MSR tools
      Repository
      Data: a0.java, a1.java, b0.java, a2.java, b1.java
  • 18. Applying MapReduce to typical MSR tools
      Data: a0.java, a1.java, b0.java, a2.java, b1.java
      Value     Key
      a0.java   a.java
      a1.java   a.java
      b0.java   b.java
      a2.java   a.java
      b1.java   b.java
      Mapper: a0.java, a1.java
      Mapper: b0.java, a2.java
      Mapper: b1.java
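
For the repository example, the mapper only has to derive the file name (the key) from each revision (the value). The sketch below assumes the naming scheme used on the slide, where a revision number sits directly before the extension, as in `a0.java`. Real repositories name revisions differently, so the pattern and the name `map_revision` are purely illustrative.

```python
import re

# Assumed slide-style naming: <stem><revision digits><extension>, e.g. "a0.java".
_REVISION = re.compile(r"^(?P<stem>.+?)(?P<rev>\d+)(?P<ext>\.\w+)$")


def map_revision(revision_name):
    """Emit (file name, revision name), e.g. ("a.java", "a0.java")."""
    match = _REVISION.match(revision_name)
    if match is None:
        raise ValueError(f"not a slide-style revision name: {revision_name!r}")
    return (match["stem"] + match["ext"], revision_name)


data = ["a0.java", "a1.java", "b0.java", "a2.java", "b1.java"]
mapped = [map_revision(name) for name in data]
print(mapped)
```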
  • 19. Applying MapReduce to typical MSR tools
      Before grouping and sorting:
      Value     Key
      a0.java   a.java
      a1.java   a.java
      b0.java   b.java
      a2.java   a.java
      b1.java   b.java
      After grouping and sorting:
      Value     Key
      a0.java   a.java
      a1.java   a.java
      a2.java   a.java
      b0.java   b.java
      b1.java   b.java
  • 20. Applying MapReduce to typical MSR tools
      Reducer (key a.java): a0.java, a1.java, a2.java
      Reducer (key b.java): b0.java, b1.java
  • 21. Applying MapReduce to typical MSR tools
      Reducer (key a.java): a0.java, a1.java, a2.java
      Reducer (key b.java): b0.java, b1.java
      Output:
      Value      Key
      a.output   a.java
      b.output   b.java
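
The reduce side of the repository example can be sketched as follows. Each reducer receives every revision of one file and produces one output per file. The slides do not show what the output contains, so pairing up consecutive revisions is a hypothetical stand-in for the real comparison.

```python
from collections import defaultdict


def reduce_versions(file_name, versions):
    """Pair each revision with its successor (stand-in for the real comparison)."""
    # Lexicographic order is enough for the slide's single-digit revisions;
    # a real tool would order revisions by number or date.
    ordered = sorted(versions)
    return file_name, list(zip(ordered, ordered[1:]))


# (key, value) pairs as emitted by the mappers on the earlier slide.
mapped = [
    ("a.java", "a0.java"),
    ("a.java", "a1.java"),
    ("b.java", "b0.java"),
    ("a.java", "a2.java"),
    ("b.java", "b1.java"),
]

# Shuffle: collect all revisions that share a file name.
grouped = defaultdict(list)
for file_name, revision in mapped:
    grouped[file_name].append(revision)

# Reduce: one output per file name.
outputs = dict(reduce_versions(name, revs) for name, revs in sorted(grouped.items()))
print(outputs)
```

Grouping by file name means a file's whole history arrives at a single reducer, so the comparison needs no communication between reducers.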
  • 22. Case study: J-REX
      Extraction phase: extract snapshots from the CVS repository (CVS → Snapshot extractor → Snapshot 1 … Snapshot n)
      Parsing phase: use Eclipse JDT to parse source code to XML files (JDT → XML output 1 … XML output n)
      Analysis phase: compare each XML file to generate evolution information (Evolution Analyzer → Evolutionary Change Data)
  • 23. Case study: data
      Repository   Size    # Source code files   Length of history   # Revisions
      Datatools    394MB   10,552                2 years             2,398
      BIRT         810MB   13,002                4 years             19,583
      Eclipse      4.2GB   56,851                8 years             82,682
  • 24. Case study: experimental setup
                CPU type                           # CPU   Memory size   Disk type
      Desktop   Intel Quad Core Q6600 @ 2.40 GHz   4       2GB           SATA
      Server    Intel Quad Core Q6600 @ 2.40 GHz   4       8GB           RAID5
      Server    Intel Core i7 920 @ 2.67 GHz       8       6GB           SSD
  • 25. Efficiency: significant reduction of running time by using MapReduce
      [Chart: running time (hours); labels: Desktop, Server (SSD), With MapReduce; annotations: 70% less, 64% less, 59% less; arrow: faster]
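
A "percent less running time" figure can be converted into a speedup factor, which is sometimes easier to compare. The sketch below uses only the three percentages printed on the chart; the function name is hypothetical, and the slide text does not say which machine each percentage belongs to.

```python
def speedup_from_reduction(percent_less):
    """Convert 'N% less running time' into a speedup factor (old time / new time)."""
    if not 0 <= percent_less < 100:
        raise ValueError("percent_less must be in [0, 100)")
    # New time is (100 - N)% of the old time, so the speedup is 100 / (100 - N).
    return 100.0 / (100.0 - percent_less)


for percent in (70, 64, 59):
    print(f"{percent}% less -> {speedup_from_reduction(percent):.2f}x faster")
```

For example, a run that takes 70% less time needs 30% of the original time, which is roughly a 3.33x speedup.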
  • 26. Scalability: drastic reduction of run time by adding machines
      • When adding machines
          – Time to deploy data increases
          – Time to process decreases
      [Chart: 2 nodes, 3 nodes, 4 nodes; arrow: faster]
  • 27. Adaptability: little effort to apply MapReduce to MSR tool
      • J-REX logic unchanged
      • Only 300-400 LOC to implement Map and Reduce
      • Typical MapReduce examples available
      • Less than one hour for deployment
  • 28. Flexibility: run on various environments
  • 29. Conclusions
      • Distributed frameworks are needed to
          – deal with growing data
          – make best use of available computing resources
      • A MapReduce solution for a typical MSR analysis is:
          – straightforward
          – scalable
          – efficient