Dominoes: Exploratory Data Analysis of Software Repositories Through GPU Processing
Jose Ricardo
Esteban Clua
Leonardo Murta
Anita Sarma
Introduction
•  Identifying expertise is important
•  Normally, expertise is identified through informal processes:
    •  Social network
    •  Implicit knowledge of work dependencies
•  Even more challenging in globally distributed development
Introduction
•  Software development leaves behind activity logs that can be mined for relationships
    •  Commits in a version control system
    •  Tasks in an issue tracker
    •  Communication
•  Finding these relationships is not a trivial task
    •  There is an extensive amount of data to be analyzed
    •  Data is typically stored across different repositories
    •  Scalability problems arise depending on the project size
•  Processing the history of large repositories at fine grain for exploratory analysis at interactive rates
Related Work
•  EEL scopes the analysis to 1,000 project elements
    •  Restricts the history to a small chunk of data
•  Cataldo analyzes data at a coarse grain
    •  A developer is treated as an expert on the whole artifact
•  Boa allows fine-grained analysis by using a CPU cluster
    •  Normally requires a time slice on the cluster
    •  Requires data submission for processing
Solving the Problem
[Figure: operations used for data analysis implemented on the GPU (performance) + user interface (usability) = happy user / researcher (deeper research)]
Solving the Problem
•  Efficient large-scale repository analysis
•  Enable users to explore relationships across different levels of granularity
•  No requirement for a specialized infrastructure
Dominoes
•  Infrastructure that enables interactive exploratory data analysis at varying levels of granularity using the GPU
•  Organizes data from software repositories into multiple matrices
•  Each matrix is treated as a Dominoes tile
•  Tiles can be combined through operations to generate derived tiles
    •  Transposition, multiplication, addition, …
Dominoes UI
•  Dominoes' tiles resemble the pieces of a dominoes game, which the user can play with to build new relationships
Basic Building Tiles
•  [developer|commit]
•  [package|file]
•  [commit|file]
•  [class|method]
•  [issue|commit]
•  [commit|method]
•  [file|class]
Examples of Derived Building Tiles
•  [method|method] ($MM = CM^{T} \times CM$): represents method dependencies
•  [class|class] ($ClCl = ClM \times MM \times ClM^{T}$): represents class dependencies
•  [issue|method] ($IM = IC \times CM$): represents the methods that were changed to implement/fix an issue
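As a rough illustration of how a derived tile comes out of a basic one, the sketch below computes [method|method] as the transpose of [commit|method] times itself. It assumes a small dense, row-major 0/1 matrix purely for readability; the deck itself uses sparse matrices on the GPU, and the function name `derive_method_method` is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: out = cm^T * cm for a dense row-major matrix cm
 * with `commits` rows and `methods` columns.
 * out must hold methods * methods ints. */
void derive_method_method(const int *cm, size_t commits, size_t methods,
                          int *out)
{
    for (size_t i = 0; i < methods; ++i) {
        for (size_t j = 0; j < methods; ++j) {
            int acc = 0;
            /* Count the commits in which methods i and j both appear. */
            for (size_t c = 0; c < commits; ++c) {
                acc += cm[c * methods + i] * cm[c * methods + j];
            }
            out[i * methods + j] = acc;
        }
    }
}
```

For a 0/1 input, entry [i][i] of the result counts the commits that changed method i, and entry [i][j] counts the commits that changed both i and j.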
Dominoes Architecture
•  The extractor module gathers information from the repository and saves it to a database
•  The basic block builder generates the building-block relationships from the database
•  Operations are performed on the GPU through a Java Native Interface (JNI) call
•  Derived and basic building blocks stay in memory for future use
[Figure: Dominoes architecture diagram. Labels: Dominoes, CUDA Kernels, Database, Data Mining, Linear Transformations, Statistics, Extractor, 2D Tile, 3D Tile, Derived Tile, Serialize, Unserialize, Basic Tile Builder, Memory, Client Analysis Request]
Data Structure
•  Matrices are very sparse for some relationships
    •  e.g. Developer × Commit
•  The Java side maintains a pointer to the sparse matrix allocated on the C side
•  The matrices are stored in CRS (compressed row storage) format
•  Matrix operations are performed in C through a JNI interface
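Java
…
long m1, m2, res;  // handles (native pointers) to matrices on the C side
createObj(res);
multiplication(m1, m2, res);
…
C / CUDA
void multiplication(JNIEnv *env, jclass obj,
                    jlong m1, jlong m2, jlong res)
{
    /* Recover the native matrices from the handles passed by Java. */
    Matrix *_m1 = (Matrix *) m1;
    Matrix *_m2 = (Matrix *) m2;
    Matrix *_res = (Matrix *) res;
    /* Multiply on the GPU and store the product in the result matrix. */
    GPUMul(_m1, _m2, _res);
}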
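To make the CRS layout concrete, here is a minimal sketch of a compressed-row matrix and an element lookup. The slides do not show the definition of `Matrix`, so the struct `CrsMatrix`, its field names, and `crs_get` are assumptions made for illustration only.

```c
#include <stddef.h>

/* Hypothetical CRS view of a sparse matrix. row_ptr has rows + 1 entries;
 * the stored values of row r occupy positions
 * row_ptr[r] .. row_ptr[r + 1] - 1 of col_idx and values. */
typedef struct {
    size_t rows;
    const size_t *row_ptr;
    const size_t *col_idx;
    const float *values;
} CrsMatrix;

/* Returns the element at (r, c), or 0 when it is not stored. */
float crs_get(const CrsMatrix *m, size_t r, size_t c)
{
    for (size_t k = m->row_ptr[r]; k < m->row_ptr[r + 1]; ++k) {
        if (m->col_idx[k] == c) {
            return m->values[k];
        }
    }
    return 0.0f;
}
```

Only the nonzero entries and their column indices are kept, which is what makes very sparse tiles such as Developer × Commit cheap to hold in memory.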
Operations in GPU
•  Linear Transformation: Addition, Multiplication, Transposition
•  Data Mining: Confidence, Lift, Support
•  Statistics: Mean, Standard Deviation, Z-Score
Linear Transforms
•  Allow connecting Dominoes pieces by changing their edges
•  Allow extracting further relationships from the data by combining different types of data
•  Uses the cusp library to perform linear transforms
Reduction
•  Normally used to count relationships
    •  Total number of classes modified by a developer
    •  How many bugs a developer has inserted in a method Y
•  Uses the Thrust library to compute it on the GPU
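A reduction of this kind can be sketched as a per-row sum over a CRS matrix. The code below is a sequential CPU stand-in, not the Thrust call used in the deck; the function name `crs_row_sums` and the array layout are assumptions.

```c
#include <stddef.h>

/* out[r] = sum of the stored values of row r of a CRS matrix.
 * Sequential CPU stand-in for a per-row GPU reduction. */
void crs_row_sums(size_t rows, const size_t *row_ptr,
                  const float *values, float *out)
{
    for (size_t r = 0; r < rows; ++r) {
        float acc = 0.0f;
        for (size_t k = row_ptr[r]; k < row_ptr[r + 1]; ++k) {
            acc += values[k];
        }
        out[r] = acc;
    }
}
```

Applied to a 0/1 [developer|class] tile, each output entry would be the total number of classes modified by that developer.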
Confidence
•  Used to detect the direction of a relationship
•  Each GPU thread computes the confidence of one element
$$M_{conf}[i, j] = \frac{M_{SUP}[i, j]}{M_{SUP}[i, i]}$$
[Figure: Class A and Class B connected by two directed "Depends" edges, one labeled 98% and the other 0.8%]
Confidence
•  Because the row and column of each element must be known, they are computed and stored in vectors.
•  Given a sparse M × M matrix with t nonzero values:
Value: V1 V2 V3 … Vt
Row: R1 R2 R3 … Rt
Col: C1 C2 C3 … Ct
Diagonal: D1 D2 … DM
For each of the t GPU threads:
    diagIdx = row[idx]
    conf[idx] = value[idx] / diagonal[diagIdx]
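The per-thread step above can be mirrored on the CPU as a plain loop, with one iteration standing in for one GPU thread. This is a sketch under the assumption that every diagonal entry is nonzero; the function name `confidence` is hypothetical.

```c
#include <stddef.h>

/* conf[idx] = value[idx] / diagonal[row[idx]] for each of the t stored
 * elements. Assumes diagonal entries are nonzero. */
void confidence(size_t t, const float *value, const size_t *row,
                const float *diagonal, float *conf)
{
    for (size_t idx = 0; idx < t; ++idx) {
        size_t diagIdx = row[idx];
        conf[idx] = value[idx] / diagonal[diagIdx];
    }
}
```

For a support matrix [[4, 2], [2, 8]], the two off-diagonal confidences come out as 2/4 and 2/8, which illustrates how a symmetric support value yields an asymmetric, directed confidence.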
Z-Score
•  Converts an absolute value into a score relative to the mean
•  Requires a sequence of steps:
    •  Calculating the mean per column
    •  Calculating the standard deviation per column
    •  Finally, calculating the z-score
$$z = \frac{x - \mu}{\sigma}$$
where x = absolute score, µ = mean, σ = standard deviation
Z-Score
•  Calculating the mean per column
•  Given an M × N matrix containing t nonzero values, the GPU sums all values of each column and divides by the per-column count, producing a vector of size N for the mean.
Value: V1 V2 V3 … Vt
Col: C1 C2 C3 … Ct
Kernel 1
For each of the t GPU threads:
    colIdx = col[idx]
    atomicAdd(&sum[colIdx], value[idx])
    atomicAdd(&count[colIdx], 1)
Kernel 2
For each of the N GPU threads:
    mean[idx] = sum[idx] / count[idx]
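The two kernels can be sketched sequentially as below, where the plain `+=` replaces the atomic additions that concurrent GPU threads would need. The function name `column_means` is hypothetical, and returning a mean of 0 for a column with no stored value is an assumption, since the slides do not cover that case.

```c
#include <stddef.h>

/* Per-column mean over the t stored values of an M x N sparse matrix.
 * sum, count, and mean each hold n entries. */
void column_means(size_t t, size_t n, const float *value, const size_t *col,
                  float *sum, unsigned *count, float *mean)
{
    for (size_t c = 0; c < n; ++c) {
        sum[c] = 0.0f;
        count[c] = 0;
    }
    /* Kernel 1 equivalent: accumulate sums and counts per column. */
    for (size_t idx = 0; idx < t; ++idx) {
        size_t colIdx = col[idx];
        sum[colIdx] += value[idx];
        count[colIdx] += 1;
    }
    /* Kernel 2 equivalent: one division per column.
     * Empty columns get 0 (assumption). */
    for (size_t c = 0; c < n; ++c) {
        mean[c] = count[c] ? sum[c] / (float) count[c] : 0.0f;
    }
}
```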
Z-Score
•  Calculating the standard deviation per column
•  Given an M × N matrix, the GPU sums the squared deviations from the mean for each column, producing a vector of size N for the standard deviation
Value: V1 V2 V3 … Vt
Col: C1 C2 C3 … Ct
Mean: M1 M2 M3 … MN
Kernel 1
For each of the t GPU threads:
    colIdx = col[idx]
    colMean = mean[colIdx]
    deviate = value[idx] - colMean
    deviatePower2 = deviate * deviate
    atomicAdd(&variance[colIdx], deviatePower2)
Kernel 2
For each of the N GPU threads:
    colVariance = variance[idx]
    colVarianceSqrt = sqrt(colVariance / M)
    deviation[idx] = colVarianceSqrt
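Read literally, the two standard-deviation kernels appear to compute the quantity below for each column. This is a transcription of the pseudocode rather than a formula stated in the slides; $S_j$ is a symbol introduced here for the set of rows that have a stored (nonzero) value in column $j$, $x_{ij}$ is a stored value, and $\mu_j$ is the column mean.

```latex
\sigma_j = \sqrt{\frac{1}{M} \sum_{i \in S_j} \left(x_{ij} - \mu_j\right)^2}
```

The sum runs only over the stored values, while the divisor is the number of rows M, so zero entries add nothing to the numerator.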
Z-Score
•  Calculating the standard score
•  Given an M × N matrix with t nonzero elements, the GPU produces the z-score of each element
Value: V1 V2 V3 … Vt
Col: C1 C2 C3 … Ct
Mean: M1 M2 M3 … MN
SD: S1 S2 S3 … SN
For each of the t GPU threads:
    colIdx = col[idx]
    colMean = mean[colIdx]
    standardDev = sd[colIdx]
    z = (value[idx] - colMean) / standardDev
    zscore[idx] = z
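The final step can be sketched as a loop in which each iteration plays the role of one GPU thread, taking the precomputed mean and standard-deviation vectors as inputs. The function name `zscores` is hypothetical, and returning 0 when a column's standard deviation is 0 is an assumption added to avoid a division by zero; the slides do not say how that case is handled.

```c
#include <stddef.h>

/* zscore[idx] = (value[idx] - mean[col[idx]]) / sd[col[idx]]
 * for each of the t stored elements. */
void zscores(size_t t, const float *value, const size_t *col,
             const float *mean, const float *sd, float *zscore)
{
    for (size_t idx = 0; idx < t; ++idx) {
        size_t colIdx = col[idx];
        float colMean = mean[colIdx];
        float standardDev = sd[colIdx];
        /* Guard against a zero standard deviation (assumption). */
        zscore[idx] = (standardDev != 0.0f)
            ? (value[idx] - colMean) / standardDev
            : 0.0f;
    }
}
```

Because each element only reads its own value and its column's mean and standard deviation, the iterations are independent, which is what lets the work be spread over t threads.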
Applicability
•  Dependency identification
•  Expertise identification
•  Expertise breadth identification
Results
•  Evaluation time (support and confidence)
    •  [file|commit] (34,335 × 7,578)
        •  CPU: 696 minutes | GPU: 0.7 minutes | Speed-up: 994×
    •  [method|commit] (305,551 × 7,578)
        •  CPU: N/A | GPU: 5 minutes | Speed-up: -
* Measured on an Intel Core 2 Quad Q6600 2.40 GHz PC with 4 GB RAM and an NVIDIA GeForce GTX 580 graphics card.
Results
EBD computation time in seconds, considering files and considering methods

            Files                            Methods
            Mean & SD   Z-Score   Total      Mean & SD   Z-Score    Total
CPU         2.19        301.23    303.42     424.71      1,573.60   1,998.31
GPU         0.10        19.49     19.59      8.55        203.46     212.01
Speed-up    21.90       15.45     15.48      49.67       7.73       9.42

•  [Developer|File|Time]: 114 layers of 36 × 3,400 (13,953,600 elements)
•  [Developer|Method|Time]: 114 layers of 36 × 43,788 (179,705,952 elements)
* EBD = Expertise Breadth of a Developer.
Results
•  J. R. da Silva, E. Clua, L. Murta, and A. Sarma. Niche vs. breadth: Calculating expertise over time through a fine-grained analysis. In 2015 IEEE 22nd International Conference on Software Analysis, Evolution and Reengineering (SANER), pages 409–418, Mar. 2015.
•  J. R. da Silva, E. Clua, L. Murta, and A. Sarma. Multi-Perspective Exploratory Analysis of Software Development Data. International Journal of Software Engineering and Knowledge Engineering, 25(01):51–68, 2015.
•  J. R. da Silva Junior, E. Clua, L. Murta, and A. Sarma. Exploratory Data Analysis of Software Repositories via GPU Processing. 26th SEKE, 2014.
Conclusions
•  The main contribution is using the GPU to solve software engineering problems
•  Using the GPU allows seamless relationship manipulations at interactive rates
    •  Matrices are used underneath to represent the building blocks
•  Dominoes opens a new realm of exploratory software analysis, as endless combinations of Dominoes' pieces can be tried out in an exploratory fashion
•  Thanks to the GPU, users can run their analyses on their own machines
Questions
jricardo@ic.uff.br
http://www.josericardojunior.com
https://br.linkedin.com/in/jose-ricardo-da-silva-junior-7299987
https://twitter.com/jricardojunior
