SlideShare a Scribd company logo
1 of 52
Deduplication of large
amounts of code
Romain Keramitas
FOSDEM 2019
Clones
def foo(name: str):
print('Hello World, my name is ' + name)
def bar(name: str):
print('Hello World, my name is {}'.format(name))
def baz(name: str):
print('Hello World, my name is %s' % name)
I am so happy to speak at FOSDEM this year
since it's a really awesome conference !
Natural language clones
I will be speaking at FOSDEM this year, it makes me so happy
because that conference is really awesome !
I am delighted to speak at FOSDEM this year
since it's a really amazing conference !
Natural language clones
I will be speaking at FOSDEM this year, it makes me so happy
because that conference is really awesome !
Clone Taxonomy (Kapser; Roy and Cordy )
Type I : exactly the same
Type II : structurally the same, syntactical differences
Type III : combination of type I & II and minor structural changes
Type IV : semantically the same, different structure and syntax
Type IV clone
def foo(n: int):
k = 0
for i in range(n):
k += i
return k
def bar(m: int):
counter, l = 0, 0
while counter < m:
counter += 1
l+= counter
return l
Déjà Vu approach (Lopes et al.)
def foo(n: int):
k = 0
for i in range(n):
k += i
return k
[(def, 1), (foo, 1), (n,
2), (int, 1), (k, 3),
(0, 1), (for, 1), (i, 2),
(in, 1),
(range, 1), (return, 1)]
File hash
7d02b25e38eadb33e9f96d771e1844a6
Token hash
f2ea1bb5a6208b5d4f2c930a0d7042d6
SourcererCC
Gemini approach
1. Extract syntactical and structural features from each snippet
Features Matrix M:
N x M values
Mi,j = importance of feature j
for snippet i (0 if none)
Dataset
N snippets
Gemini approach
2. Create a pairwise similarity graph between snippets,
by hashing the feature matrix
Pairwise Similarity Graph
Graph with N nodes
nodes = snippets
edge = similarity
Dataset
N snippets
Features
NxM values
Gemini approach
3. Extract connected components from the similarity graph
Connected Components
Graphs where each pair of nodes
is connected by paths
Dataset
N snippets
Features
NxM values
Similarity
Graph
N nodes
Gemini approach
4. Perform community detection on each components to obtain
clone communities
Dataset
N snippets
Clone
Communities
Parts of components
where each node is a
clone
Features
NxM values
Similarity
Graph
N nodes
Connected
Components
Feature Extraction Step
Dataset
N snippets
Features
NxM values
Similarity
Graph
N nodes
Connected
Components
Clone
Communities
Abstract Syntax Trees
def foo(x):
return x + 42
def
foo x
x
+
return
42
Abstract Syntax Trees
def(decl)
foo(id) x(id)
x(id)
+(op)
return(stat
)
42(lit)
def foo(x):
return x + 42
Identifiers and Literals
def foo(n: int):
k = 0
for i in range(n):
k += i
return k + 42
Identifiers:
[(foo, 1),(n, 2),(k, 3),(i, 2)]
Literals:
[(0, 1), (42, 1)]
Graphlets and Children
Graphlets:
[(decl,[id,id,statement]),
(id, []),(id, []),
(statement,[op.]),(op,[id,lit])
(id,[]),(lit,[])]
Children:
[(decl,3), (id, 0),(id, 0),
(statement,1),(op,2)(id,0),
(lit,0)]
def(decl)
foo(id) x(id)
x(id)
+(op)
return(stat
)
42(lit)
Graphlets and Children
Graphlets:
[((decl,[id,id,statement]),1),
((id, []),3),
((statement,[op.]),1)
((op,[id,lit]),1),((lit,[]),1)]
Children:
[((decl,3),1),((id, 0),3),
((statement,1),1),((op,2),1),
((lit,0),1)]
def
foo x
x
+
return
42
TF-IDF
wx,y = tfx,y × log ( N / dfx )
tfx,y = frequency of feature x in snippet y
dfx = number of snippets containing feature x
N = total number of snippets
Feature extraction step
1. Convert snippets to UAST using Babelfish
2. Extract weighted bags of features from each UAST
3. Perform TF-IDF to reduce the amount of features
4. Find weights for each feature type (hyperparameters)
Hashing Step
Dataset
N snippets
Features
NxM values
Similarity
Graph
N nodes
Connected
Components
Clone
Communities
Similarity between weighted bags ?
Problem: Computing similarity between each pair of snippet
is not viable
1 million snippets 499,998,500,001 similarities
428 million snippets 91,591,999,358,000,000 similarities
Weighted Jaccard Similarity
S = set of all features
A , B = subsets of S
Minhashing
1. Create perm(A) and perm(B)
2. Hash all elements in perm(A) and perm(B)
3. Select the smallest hash for A and B
4. Probability it's the same value equals J(A, B) !
Minhash signatures
We take the k smallest hash values for each snippet:
For each pair of snippets, the probability they have the same value in
any row is still equal to their Jaccard similarity !
1 0 0 0 2
3 2 1 2 3
5 6 3 6 4
... ...
... ... ... k rows
... ......
Locality-Sensitive Hashing
We divide the signature matrix into b bands of r rows
candidate pair = any snippets which have the same values in at
least one band.
1 0 0 0 2
3 2 1 2 3
5 6 3 6 4
... ...
... ... ...
r rows
... ......
r × b = k rows
Locality-Sensitive Hashing
s = similarity between the two snippets
1. Probability the signatures are the same in one band: sr
Locality-Sensitive Hashing
s = similarity between the two snippets
1. Probability the signatures are the same in one band: sr
2. Probability the signatures are different in one band: 1 - sr
Locality-Sensitive Hashing
s = similarity between the two snippets
1. Probability the signatures are the same in one band: sr
2. Probability the signatures are different in one band: 1 - sr
3. Probability the signatures are different in all bands: (1 - sr)b
Locality-Sensitive Hashing
s = similarity between the two snippets
1. Probability the signatures are the same in one band: sr
2. Probability the signatures are different in one band: 1 - sr
3. Probability the signatures are different in all bands: (1 - sr)b
4. Probability the signatures are the same in at least one band,
i.e. the snippets are a candidate pair: 1 - (1 - sr)b
Locality-Sensitive Hashing
Regardless of r and b, we get an S-curve:
Locality-Sensitive Hashing
If we choose r and b well, we can get this:
Computing the similarity graph
1. Create the signatures for each snippet
2. Select the threshold from which we decide that snippets are
clones, deduce r and b, then for each band hash
sub-signatures into buckets
3. Snippets that land in at least one common bucket are the
candidates
Connected Components Step
Dataset
N snippets
Features
NxM values
Similarity
Graph
N nodes
Connected
Components
Clone
Communities
Extracting connected components
Community Detection Step
Dataset
N snippets
Features
NxM values
Similarity
Graph
N nodes
Connected
Components
Clone
Communities
Applying community detection
Public Git Archive
182,014 projects (repos on GitHub with over 50 stars)
3 TB of code
Commit history included
PGA HEAD = ~54.5 million files
With Babelfish driver = ~14.1 million files
Post-feature extraction
7,869,917 million files processed
102,114 project
6,194,874 distinct features
Features analysis
Feature distribution:
• identifiers: 9.93 %
• literals: 65.7 %
• graphlets: 11.84 %
• children: 0.02 %
• uast2seq (DFS) : 10.49%
• node2vec (random walk) : 2.02 %
Features analysis
Average number of features per file:
• 60 identifiers
• 38 literals
• 116 graphlets
• 37 children
• 336 uast2seq (DFS)
• 460 node2vec (random walk)
Connected components analysis
95 % threshold:
• 4,270,967 CCs
• 52.4 % of files in CCs with > 1 file
• 7.83 file per CC with > 1 file
80 % threshold:
• 3,551,648 CCs
• 61.9 % of files in CCs with > 1 file
• 8.79 file per CC with > 1 file
Connected components analysis
Connected components analysis
3,551,648
Communities analysis
Detection done with Walktrap algorithm
95 % threshold:
• 526,715 CCs with > 1 file
• In those, we detected 666,692 communities
80 % threshold:
• 553,997 CCs with > 1 file
• In those, we detected 918,333 communities
text_parser.go
422 files - 327 projects - 1 filename
2344 files - 1058 projects - 584 filename
filenames
filenames
Microsoft Azure SDK
4803 files - 3 projects - 2954 filename
Thank you !
For more:
blog.sourced.tech/post/deduplicating_pga_with_apollo
github.com/src-d/gemini
pga.sourced.tech/
r.keramitas@gmail.com

More Related Content

What's hot

Hacking Go Compiler Internals / GoCon 2014 Autumn
Hacking Go Compiler Internals / GoCon 2014 AutumnHacking Go Compiler Internals / GoCon 2014 Autumn
Hacking Go Compiler Internals / GoCon 2014 AutumnMoriyoshi Koizumi
 
Workshop on command line tools - day 1
Workshop on command line tools - day 1Workshop on command line tools - day 1
Workshop on command line tools - day 1Leandro Lima
 
Python strings presentation
Python strings presentationPython strings presentation
Python strings presentationVedaGayathri1
 
Babar: Knowledge Recognition, Extraction and Representation
Babar: Knowledge Recognition, Extraction and RepresentationBabar: Knowledge Recognition, Extraction and Representation
Babar: Knowledge Recognition, Extraction and RepresentationPierre de Lacaze
 
The Next Great Functional Programming Language
The Next Great Functional Programming LanguageThe Next Great Functional Programming Language
The Next Great Functional Programming LanguageJohn De Goes
 
Learn python in 20 minutes
Learn python in 20 minutesLearn python in 20 minutes
Learn python in 20 minutesSidharth Nadhan
 
SHA-1 backdooring & exploitation
SHA-1 backdooring & exploitationSHA-1 backdooring & exploitation
SHA-1 backdooring & exploitationAnge Albertini
 
String in python use of split method
String in python use of split methodString in python use of split method
String in python use of split methodvikram mahendra
 
Aspects of software naturalness through the generation of IdentifierNames
Aspects of software naturalness through the generation of IdentifierNamesAspects of software naturalness through the generation of IdentifierNames
Aspects of software naturalness through the generation of IdentifierNamesOleksandr Zaitsev
 

What's hot (15)

LEX & YACC TOOL
LEX & YACC TOOLLEX & YACC TOOL
LEX & YACC TOOL
 
Huffman
HuffmanHuffman
Huffman
 
Hacking Go Compiler Internals / GoCon 2014 Autumn
Hacking Go Compiler Internals / GoCon 2014 AutumnHacking Go Compiler Internals / GoCon 2014 Autumn
Hacking Go Compiler Internals / GoCon 2014 Autumn
 
Workshop on command line tools - day 1
Workshop on command line tools - day 1Workshop on command line tools - day 1
Workshop on command line tools - day 1
 
Python strings presentation
Python strings presentationPython strings presentation
Python strings presentation
 
Locality sensitive hashing
Locality sensitive hashingLocality sensitive hashing
Locality sensitive hashing
 
Babar: Knowledge Recognition, Extraction and Representation
Babar: Knowledge Recognition, Extraction and RepresentationBabar: Knowledge Recognition, Extraction and Representation
Babar: Knowledge Recognition, Extraction and Representation
 
The Next Great Functional Programming Language
The Next Great Functional Programming LanguageThe Next Great Functional Programming Language
The Next Great Functional Programming Language
 
M3 search
M3 searchM3 search
M3 search
 
Learning python
Learning pythonLearning python
Learning python
 
Learn python in 20 minutes
Learn python in 20 minutesLearn python in 20 minutes
Learn python in 20 minutes
 
SHA-1 backdooring & exploitation
SHA-1 backdooring & exploitationSHA-1 backdooring & exploitation
SHA-1 backdooring & exploitation
 
pairs
pairspairs
pairs
 
String in python use of split method
String in python use of split methodString in python use of split method
String in python use of split method
 
Aspects of software naturalness through the generation of IdentifierNames
Aspects of software naturalness through the generation of IdentifierNamesAspects of software naturalness through the generation of IdentifierNames
Aspects of software naturalness through the generation of IdentifierNames
 

Similar to Deduplication on large amounts of code

Finding similar items in high dimensional spaces locality sensitive hashing
Finding similar items in high dimensional spaces  locality sensitive hashingFinding similar items in high dimensional spaces  locality sensitive hashing
Finding similar items in high dimensional spaces locality sensitive hashingDmitriy Selivanov
 
Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: L...
Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: L...Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: L...
Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: L...Mail.ru Group
 
Text classification using Text kernels
Text classification using Text kernelsText classification using Text kernels
Text classification using Text kernelsDev Nath
 
Distributed Coordination
Distributed CoordinationDistributed Coordination
Distributed CoordinationLuis Galárraga
 
Pharo, an innovative and open-source Smalltalk
Pharo, an innovative and open-source SmalltalkPharo, an innovative and open-source Smalltalk
Pharo, an innovative and open-source SmalltalkSerge Stinckwich
 
C Interview Questions for Fresher
C Interview Questions for FresherC Interview Questions for Fresher
C Interview Questions for FresherJaved Ahmad
 
C interview-questions-techpreparation
C interview-questions-techpreparationC interview-questions-techpreparation
C interview-questions-techpreparationsonu sharma
 
C interview-questions-techpreparation
C interview-questions-techpreparationC interview-questions-techpreparation
C interview-questions-techpreparationKgr Sushmitha
 
Efficient Parallel Set-Similarity Joins Using Hadoop__HadoopSummit2010
Efficient Parallel Set-Similarity Joins Using Hadoop__HadoopSummit2010Efficient Parallel Set-Similarity Joins Using Hadoop__HadoopSummit2010
Efficient Parallel Set-Similarity Joins Using Hadoop__HadoopSummit2010Yahoo Developer Network
 
Manipulating string data with a pattern in R
Manipulating string data with  a pattern in RManipulating string data with  a pattern in R
Manipulating string data with a pattern in RLun-Hsien Chang
 
The Curious Clojurist - Neal Ford (Thoughtworks)
The Curious Clojurist - Neal Ford (Thoughtworks)The Curious Clojurist - Neal Ford (Thoughtworks)
The Curious Clojurist - Neal Ford (Thoughtworks)jaxLondonConference
 
Declare Your Language (at DLS)
Declare Your Language (at DLS)Declare Your Language (at DLS)
Declare Your Language (at DLS)Eelco Visser
 
Ctcompare: Comparing Multiple Code Trees for Similarity
Ctcompare: Comparing Multiple Code Trees for SimilarityCtcompare: Comparing Multiple Code Trees for Similarity
Ctcompare: Comparing Multiple Code Trees for SimilarityDoctorWkt
 
cryptography and network security chap 3
cryptography and network security chap 3cryptography and network security chap 3
cryptography and network security chap 3Debanjan Bhattacharya
 
Declarative Thinking, Declarative Practice
Declarative Thinking, Declarative PracticeDeclarative Thinking, Declarative Practice
Declarative Thinking, Declarative PracticeKevlin Henney
 

Similar to Deduplication on large amounts of code (20)

Finding similar items in high dimensional spaces locality sensitive hashing
Finding similar items in high dimensional spaces  locality sensitive hashingFinding similar items in high dimensional spaces  locality sensitive hashing
Finding similar items in high dimensional spaces locality sensitive hashing
 
Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: L...
Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: L...Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: L...
Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: L...
 
Text classification using Text kernels
Text classification using Text kernelsText classification using Text kernels
Text classification using Text kernels
 
Distributed Coordination
Distributed CoordinationDistributed Coordination
Distributed Coordination
 
Pharo, an innovative and open-source Smalltalk
Pharo, an innovative and open-source SmalltalkPharo, an innovative and open-source Smalltalk
Pharo, an innovative and open-source Smalltalk
 
C Interview Questions for Fresher
C Interview Questions for FresherC Interview Questions for Fresher
C Interview Questions for Fresher
 
C interview-questions-techpreparation
C interview-questions-techpreparationC interview-questions-techpreparation
C interview-questions-techpreparation
 
C interview-questions-techpreparation
C interview-questions-techpreparationC interview-questions-techpreparation
C interview-questions-techpreparation
 
C interview Question and Answer
C interview Question and AnswerC interview Question and Answer
C interview Question and Answer
 
Efficient Parallel Set-Similarity Joins Using Hadoop__HadoopSummit2010
Efficient Parallel Set-Similarity Joins Using Hadoop__HadoopSummit2010Efficient Parallel Set-Similarity Joins Using Hadoop__HadoopSummit2010
Efficient Parallel Set-Similarity Joins Using Hadoop__HadoopSummit2010
 
I1
I1I1
I1
 
Nn
NnNn
Nn
 
Manipulating string data with a pattern in R
Manipulating string data with  a pattern in RManipulating string data with  a pattern in R
Manipulating string data with a pattern in R
 
Pune Clojure Course Outline
Pune Clojure Course OutlinePune Clojure Course Outline
Pune Clojure Course Outline
 
The Curious Clojurist - Neal Ford (Thoughtworks)
The Curious Clojurist - Neal Ford (Thoughtworks)The Curious Clojurist - Neal Ford (Thoughtworks)
The Curious Clojurist - Neal Ford (Thoughtworks)
 
Declare Your Language (at DLS)
Declare Your Language (at DLS)Declare Your Language (at DLS)
Declare Your Language (at DLS)
 
Ctcompare: Comparing Multiple Code Trees for Similarity
Ctcompare: Comparing Multiple Code Trees for SimilarityCtcompare: Comparing Multiple Code Trees for Similarity
Ctcompare: Comparing Multiple Code Trees for Similarity
 
cryptography and network security chap 3
cryptography and network security chap 3cryptography and network security chap 3
cryptography and network security chap 3
 
Cryptography and Network Security William Stallings Lawrie Brown
Cryptography and Network Security William Stallings Lawrie BrownCryptography and Network Security William Stallings Lawrie Brown
Cryptography and Network Security William Stallings Lawrie Brown
 
Declarative Thinking, Declarative Practice
Declarative Thinking, Declarative PracticeDeclarative Thinking, Declarative Practice
Declarative Thinking, Declarative Practice
 

More from source{d}

Overton, Apple Flavored ML
Overton, Apple Flavored MLOverton, Apple Flavored ML
Overton, Apple Flavored MLsource{d}
 
Unlocking Engineering Observability with advanced IT analytics
Unlocking Engineering Observability with advanced IT analyticsUnlocking Engineering Observability with advanced IT analytics
Unlocking Engineering Observability with advanced IT analyticssource{d}
 
What's new in the latest source{d} releases!
What's new in the latest source{d} releases!What's new in the latest source{d} releases!
What's new in the latest source{d} releases!source{d}
 
Code as Data workshop: Using source{d} Engine to extract insights from git re...
Code as Data workshop: Using source{d} Engine to extract insights from git re...Code as Data workshop: Using source{d} Engine to extract insights from git re...
Code as Data workshop: Using source{d} Engine to extract insights from git re...source{d}
 
Gitbase, SQL interface to Git repositories
Gitbase, SQL interface to Git repositoriesGitbase, SQL interface to Git repositories
Gitbase, SQL interface to Git repositoriessource{d}
 
Assisted code review with source{d} lookout
Assisted code review with source{d} lookoutAssisted code review with source{d} lookout
Assisted code review with source{d} lookoutsource{d}
 
Machine Learning on Code - SF meetup
Machine Learning on Code - SF meetupMachine Learning on Code - SF meetup
Machine Learning on Code - SF meetupsource{d}
 
Inextricably linked reproducibility and productivity in data science and ai ...
Inextricably linked reproducibility and productivity in data science and ai  ...Inextricably linked reproducibility and productivity in data science and ai  ...
Inextricably linked reproducibility and productivity in data science and ai ...source{d}
 
source{d} Engine - your code as data
source{d} Engine - your code as datasource{d} Engine - your code as data
source{d} Engine - your code as datasource{d}
 
Introduction to the source{d} Stack
Introduction to the source{d} Stack Introduction to the source{d} Stack
Introduction to the source{d} Stack source{d}
 
source{d} Engine: Exploring git repos with SQL
source{d} Engine: Exploring git repos with SQLsource{d} Engine: Exploring git repos with SQL
source{d} Engine: Exploring git repos with SQLsource{d}
 
Introduction to source{d} Engine and source{d} Lookout
Introduction to source{d} Engine and source{d} Lookout Introduction to source{d} Engine and source{d} Lookout
Introduction to source{d} Engine and source{d} Lookout source{d}
 
Machine learning on Go Code
Machine learning on Go CodeMachine learning on Go Code
Machine learning on Go Codesource{d}
 
Improving go-git performance
Improving go-git performanceImproving go-git performance
Improving go-git performancesource{d}
 
Machine learning on source code
Machine learning on source codeMachine learning on source code
Machine learning on source codesource{d}
 

More from source{d} (15)

Overton, Apple Flavored ML
Overton, Apple Flavored MLOverton, Apple Flavored ML
Overton, Apple Flavored ML
 
Unlocking Engineering Observability with advanced IT analytics
Unlocking Engineering Observability with advanced IT analyticsUnlocking Engineering Observability with advanced IT analytics
Unlocking Engineering Observability with advanced IT analytics
 
What's new in the latest source{d} releases!
What's new in the latest source{d} releases!What's new in the latest source{d} releases!
What's new in the latest source{d} releases!
 
Code as Data workshop: Using source{d} Engine to extract insights from git re...
Code as Data workshop: Using source{d} Engine to extract insights from git re...Code as Data workshop: Using source{d} Engine to extract insights from git re...
Code as Data workshop: Using source{d} Engine to extract insights from git re...
 
Gitbase, SQL interface to Git repositories
Gitbase, SQL interface to Git repositoriesGitbase, SQL interface to Git repositories
Gitbase, SQL interface to Git repositories
 
Assisted code review with source{d} lookout
Assisted code review with source{d} lookoutAssisted code review with source{d} lookout
Assisted code review with source{d} lookout
 
Machine Learning on Code - SF meetup
Machine Learning on Code - SF meetupMachine Learning on Code - SF meetup
Machine Learning on Code - SF meetup
 
Inextricably linked reproducibility and productivity in data science and ai ...
Inextricably linked reproducibility and productivity in data science and ai  ...Inextricably linked reproducibility and productivity in data science and ai  ...
Inextricably linked reproducibility and productivity in data science and ai ...
 
source{d} Engine - your code as data
source{d} Engine - your code as datasource{d} Engine - your code as data
source{d} Engine - your code as data
 
Introduction to the source{d} Stack
Introduction to the source{d} Stack Introduction to the source{d} Stack
Introduction to the source{d} Stack
 
source{d} Engine: Exploring git repos with SQL
source{d} Engine: Exploring git repos with SQLsource{d} Engine: Exploring git repos with SQL
source{d} Engine: Exploring git repos with SQL
 
Introduction to source{d} Engine and source{d} Lookout
Introduction to source{d} Engine and source{d} Lookout Introduction to source{d} Engine and source{d} Lookout
Introduction to source{d} Engine and source{d} Lookout
 
Machine learning on Go Code
Machine learning on Go CodeMachine learning on Go Code
Machine learning on Go Code
 
Improving go-git performance
Improving go-git performanceImproving go-git performance
Improving go-git performance
 
Machine learning on source code
Machine learning on source codeMachine learning on source code
Machine learning on source code
 

Recently uploaded

Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetEnjoy Anytime
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 

Recently uploaded (20)

Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 

Deduplication on large amounts of code

  • 1. Deduplication of large amounts of code Romain Keramitas FOSDEM 2019
  • 2. Clones def foo(name: str): print('Hello World, my name is ' + name) def bar(name: str): print('Hello World, my name is {}'.format(name)) def baz(name: str): print('Hello World, my name is %s' % name)
  • 3. I am so happy to speak at FOSDEM this year since it's a really awesome conference ! Natural language clones I will be speaking at FOSDEM this year, it makes me so happy because that conference is really awesome !
  • 4. I am delighted to speak at FOSDEM this year since it's a really amazing conference ! Natural language clones I will be speaking at FOSDEM this year, it makes me so happy because that conference is really awesome !
  • 5. Clone Taxonomy (Kapser; Roy and Cordy ) Type I : exactly the same Type II : structurally the same, syntactical differences Type III : combination of type I & II and minor structural changes Type IV : semantically the same, different structure and syntax
  • 6. Type IV clone def foo(n: int): k = 0 for i in range(n): k += i return k def bar(m: int): counter, l = 0, 0 while counter < m: counter += 1 l+= counter return l
  • 7. Déjà Vu approach (Lopes et al.) def foo(n: int): k = 0 for i in range(n): k += i return k [(def, 1), (foo, 1), (n, 2), (int, 1), (k, 3), (0, 1), (for, 1), (i, 2), (in, 1), (range, 1), (return, 1)] File hash 7d02b25e38eadb33e9f96d771e1844a6 Token hash f2ea1bb5a6208b5d4f2c930a0d7042d6 SourcererCC
  • 8. Gemini approach 1. Extract syntactical and structural features from each snippet Features Matrix M: N x M values Mi,j = importance of feature j for snippet i (0 if none) Dataset N snippets
  • 9. Gemini approach 2. Create a pairwise similarity graph between snippets, by hashing the feature matrix Pairwise Similarity Graph Graph with N nodes nodes = snippets edge = similarity Dataset N snippets Features NxM values
  • 10. Gemini approach 3. Extract connected components from the similarity graph Connected Components Graphs where each pair of nodes is connected by paths Dataset N snippets Features NxM values Similarity Graph N nodes
  • 11. Gemini approach 4. Perform community detection on each components to obtain clone communities Dataset N snippets Clone Communities Parts of components where each node is a clone Features NxM values Similarity Graph N nodes Connected Components
  • 12. Feature Extraction Step Dataset N snippets Features NxM values Similarity Graph N nodes Connected Components Clone Communities
  • 13. Abstract Syntax Trees def foo(x): return x + 42 def foo x x + return 42
  • 14. Abstract Syntax Trees def(decl) foo(id) x(id) x(id) +(op) return(stat ) 42(lit) def foo(x): return x + 42
  • 15. Identifiers and Literals def foo(n: int): k = 0 for i in range(n): k += i return k + 42 Identifiers: [(foo, 1),(n, 2),(k, 3),(i, 2)] Literals: [(0, 1), (42, 1)]
  • 16. Graphlets and Children Graphlets: [(decl,[id,id,statement]), (id, []),(id, []), (statement,[op.]),(op,[id,lit]) (id,[]),(lit,[])] Children: [(decl,3), (id, 0),(id, 0), (statement,1),(op,2)(id,0), (lit,0)] def(decl) foo(id) x(id) x(id) +(op) return(stat ) 42(lit)
  • 17. Graphlets and Children Graphlets: [((decl,[id,id,statement]),1), ((id, []),3), ((statement,[op.]),1) ((op,[id,lit]),1),((lit,[]),1)] Children: [((decl,3),1),((id, 0),3), ((statement,1),1),((op,2),1), ((lit,0),1)] def foo x x + return 42
  • 18. TF-IDF wx,y = tfx,y × log ( N / dfx ) tfx,y = frequency of feature x in snippet y dfx = number of snippets containing feature x N = total number of snippets
  • 19. Feature extraction step 1. Convert snippets to UAST using Babelfish 2. Extract weighted bags of features from each UAST 3. Perform TF-IDF to reduce the amount of features 4. Find weights for each feature type (hyperparameters)
  • 20. Hashing Step Dataset N snippets Features NxM values Similarity Graph N nodes Connected Components Clone Communities
  • 21.
  • 22. Similarity between weighted bags ? Problem: Computing similarity between each pair of snippet is not viable 1 million snippets 499,998,500,001 similarities 428 million snippets 91,591,999,358,000,000 similarities
  • 23. Weighted Jaccard Similarity S = set of all features A , B = subsets of S
  • 24. Minhashing 1. Create perm(A) and perm(B) 2. Hash all elements in perm(A) and perm(B) 3. Select the smallest hash for A and B 4. Probability it's the same value equals J(A, B) !
  • 25. Minhash signatures We take the k smallest hash values for each snippet: For each pair of snippets, the probability they have the same value in any row is still equal to their Jaccard similarity ! 1 0 0 0 2 3 2 1 2 3 5 6 3 6 4 ... ... ... ... ... k rows ... ......
  • 26. Locality-Sensitive Hashing We divide the signature matrix into b bands of r rows candidate pair = any snippets which have the same values in at least one band. 1 0 0 0 2 3 2 1 2 3 5 6 3 6 4 ... ... ... ... ... r rows ... ...... r × b = k rows
  • 27. Locality-Sensitive Hashing s = similarity between the two snippets 1. Probability the signatures are the same in one band: sr
  • 28. Locality-Sensitive Hashing s = similarity between the two snippets 1. Probability the signatures are the same in one band: sr 2. Probability the signatures are different in one band: 1 - sr
  • 29. Locality-Sensitive Hashing s = similarity between the two snippets 1. Probability the signatures are the same in one band: sr 2. Probability the signatures are different in one band: 1 - sr 3. Probability the signatures are different in all bands: (1 - sr)b
  • 30. Locality-Sensitive Hashing s = similarity between the two snippets 1. Probability the signatures are the same in one band: sr 2. Probability the signatures are different in one band: 1 - sr 3. Probability the signatures are different in all bands: (1 - sr)b 4. Probability the signatures are the same in at least one band, i.e. the snippets are a candidate pair: 1 - (1 - sr)b
  • 31. Locality-Sensitive Hashing Regardless of r and b, we get an S-curve:
  • 32. Locality-Sensitive Hashing If we choose r and b well, we can get this:
  • 33. Computing the similarity graph 1. Create the signatures for each snippet 2. Select the threshold from which we decide that snippets are clones, deduce r and b, then for each band hash sub-signatures into buckets 3. Snippets that land in at least one common bucket are the candidates
  • 34. Connected Components Step Dataset N snippets Features NxM values Similarity Graph N nodes Connected Components Clone Communities
  • 36. Community Detection Step Dataset N snippets Features NxM values Similarity Graph N nodes Connected Components Clone Communities
  • 38. Public Git Archive 182,014 projects (repos on GitHub with over 50 stars) 3 TB of code Commit history included PGA HEAD = ~54.5 million files With Babelfish driver = ~14.1 million files
  • 39.
  • 40.
  • 41. Post-feature extraction 7,869,917 million files processed 102,114 project 6,194,874 distinct features
  • 42. Features analysis Feature distribution: • identifiers: 9.93 % • literals: 65.7 % • graphlets: 11.84 % • children: 0.02 % • uast2seq (DFS) : 10.49% • node2vec (random walk) : 2.02 %
  • 43. Features analysis Average number of features per file: • 60 identifiers • 38 literals • 116 graphlets • 37 children • 336 uast2seq (DFS) • 460 node2vec (random walk)
  • 44. Connected components analysis 95 % threshold: • 4,270,967 CCs • 52.4 % of files in CCs with > 1 file • 7.83 file per CC with > 1 file 80 % threshold: • 3,551,648 CCs • 61.9 % of files in CCs with > 1 file • 8.79 file per CC with > 1 file
  • 47. Communities analysis Detection done with Walktrap algorithm 95 % threshold: • 526,715 CCs with > 1 file • In those, we detected 666,692 communities 80 % threshold: • 553,997 CCs with > 1 file • In those, we detected 918,333 communities
  • 48. text_parser.go 422 files - 327 projects - 1 filename
  • 49. 2344 files - 1058 projects - 584 filename filenames
  • 51. Microsoft Azure SDK 4803 files - 3 projects - 2954 filename
  • 52. Thank you ! For more: blog.sourced.tech/post/deduplicating_pga_with_apollo github.com/src-d/gemini pga.sourced.tech/ r.keramitas@gmail.com