Finding Similar Projects in GitHub using Word2Vec and WMD
MD MASUDUR RAHMAN
DEPARTMENT OF COMPUTER SCIENCE
UNIVERSITY OF VIRGINIA
Introduction
Given project details (description and source code), the aim is to find functionally similar projects
Finding functionally similar projects is important for:
Application/project recommendation
Code re-use, rapid prototyping
Discovering code plagiarism
CS@UVa
How do developers search for similar projects?
General-Purpose Search (Google)
Query: android browser
Tries to find applications relevant to the query
Not intended to search for source code
GitHub Search: android browser
Mostly keyword-based search on textual content (project name, description, etc.)
Open and analyze jar, class, apk, etc.
Might rank irrelevant projects at the top
Projects often have little textual content
Use source code content
- Augment the content with method, class, and API names
Model Workflow
GitHub Projects
Data Preprocessing (per feature): Tokenization, Normalization, Stemming, Stopword Removal, TF-IDF score-based word filtering
Feature Extraction: Description, Readme, Method & Class Name, API Package Name, API Class Name
Document Generation (all features combined)
Search Interface
Candidate Project Documents
Query Project Documents
Document Similarity Computation (Word2Vec, WMD)
Search Result (ranked list of similar projects)
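The preprocessing and document-generation steps of the workflow could be sketched roughly as follows. This is a minimal illustration under assumptions, not the authors' implementation: the stopword list, the camelCase splitting rule, the `keep` cutoff, and the feature names in the usage note are all hypothetical, and stemming is left out to keep the sketch free of third-party libraries.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "the", "for", "to", "of", "and", "is", "in", "on"}

def tokenize(text):
    """Split camelCase identifiers, lowercase, and drop stopwords."""
    text = re.sub(r"([a-z])([A-Z])", r"\1 \2", text)  # "openFile" -> "open File"
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def tfidf_filter(docs, keep=3):
    """Keep the `keep` highest-scoring TF-IDF tokens of each tokenized document."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))  # document frequency
    filtered = []
    for doc in docs:
        tf = Counter(doc)
        score = {t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf}
        top = set(sorted(score, key=lambda t: (-score[t], t))[:keep])
        filtered.append([t for t in doc if t in top])
    return filtered

def build_document(features):
    """Concatenate the tokens of all features of one project."""
    return [tok for text in features.values() for tok in tokenize(text)]
```

As a usage example, `build_document({"description": "Android photo viewer", "class_names": "PhotoViewer GalleryActivity"})` yields one token list per project, and `tfidf_filter` can then be applied to the token lists of all projects to drop words that carry little weight.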
How to measure document similarity?
Keyword-based cosine similarity with Bag of Words (BOW)
Document 1: image gallery app for Lollipop
Document 2: android photo viewer
No common keyword!
Cosine similarity = 0
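The zero score can be reproduced with a small bag-of-words sketch. The function name `bow_cosine` and the whitespace tokenization are assumptions made for illustration only.

```python
import math
from collections import Counter

def bow_cosine(doc_a, doc_b):
    """Cosine similarity between the word-count vectors of two texts."""
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())  # shared words only
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

d1 = "image gallery app for Lollipop"
d2 = "android photo viewer"
print(bow_cosine(d1, d2))  # 0.0, because the documents share no keyword
```

Because the dot product only counts words that appear in both documents, any pair of documents with disjoint vocabularies scores 0, however close they are in meaning.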
How to measure document similarity?
Document 1: image gallery app for Lollipop
Document 2: android photo viewer
Word Embedding: weights w1, w2, w3, w4 on the links between related words of the two documents
Word Embedding
“You shall know a word by the company it keeps” (J. R. Firth, 1957)
Open source upgrade path for Odoo/OpenERP
Plugin to check for obvious upgrade points on the path to 3.0
Codes related to upgrade project
Demo app to demonstrate how to upgrade from Angular 1 to Angular 2
- Learn the word vector for "upgrade" from its surrounding words
- Word2Vec
upgrade: [0.286, 0.792, -0.171, -0.105, 0.544, 0.351, -0.653, 0.274]
Word2Vec
Input: Text corpus
Training: Word2Vec Model
Output: Word vectors (word embedding)
upgrade: [0.286, 0.792, -0.171, -0.105, 0.544, 0.351, -0.653, 0.274]
Word2Vec Model
Document: image gallery app for android
Skip-gram
Words: image, gallery, app, for, android
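Skip-gram trains on (center word, context word) pairs drawn from a sliding window over the text. The sketch below only generates those pairs for the example document; the function name and the window size are assumptions, since the slide does not state the window used, and the neural training step itself is not shown.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

doc = "image gallery app for android".split()
for center, context in skipgram_pairs(doc, window=1):
    print(center, "->", context)
```

With `window=2`, the center word "app" is paired with "image", "gallery", "for", and "android", which is the sense in which a word's vector is learned from the company it keeps.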
Example Word Embedding
In the embedding space, words with similar meanings cluster together
image, photo, picture, figure
sample, example, demo, illustration
upgrade, update, modify, change
install, setup, launch
change, dimension, size, height, length, range
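Closeness in the embedding space is commonly measured with cosine similarity between word vectors. The sketch below uses hypothetical 3-dimensional vectors invented for illustration; they are not taken from a trained model, and real Word2Vec vectors usually have far more dimensions.

```python
import math

# Hypothetical vectors chosen for illustration only, not trained values.
EMBEDDING = {
    "image":   [0.90, 0.10, 0.05],
    "photo":   [0.85, 0.15, 0.10],
    "upgrade": [0.10, 0.90, 0.20],
    "update":  [0.15, 0.85, 0.25],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def most_similar(word, embedding):
    """Return the other vocabulary word with the highest cosine similarity."""
    return max((w for w in embedding if w != word),
               key=lambda w: cosine(embedding[word], embedding[w]))

print(most_similar("image", EMBEDDING))    # photo
print(most_similar("upgrade", EMBEDDING))  # update
```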
We now have an embedding for each word
How to get document/sentence-level similarity?
- Word Mover’s Distance (WMD)
Word Mover’s Distance (WMD)
D1: image gallery app Lollipop
D2: android photo viewer
Animation steps, each showing three word-to-word distances between D1 and D2:
Step 1: 0.1, 0.5, 0.7
Step 2: 0.35, 0.2, 0.6
Step 3: 0.35, 0.15, 0.2
Step 4: 0.4, 0.3, 0.1
Distances kept in the final figure: 0.15, 0.2, 0.1, 0.1
Similarity Score(D1, D2) = 0.1 + 0.2 + 0.15 + 0.1 = 0.55
Smaller score means more similar
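The arithmetic on the slides can be reproduced by matching every word of D1 to its closest word in D2 and summing those distances. The sketch below does exactly that and nothing more: the complete WMD formulation weights words by normalized frequency and solves an optimal transport problem, which is not implemented here. The rows reuse the three distances shown on each animation slide; which word each row or column belongs to is an assumption, as are the function names and the project scores in the test of `rank_candidates`.

```python
def nearest_word_score(distances):
    """Sum, over the words of D1, of the distance to the closest word of D2."""
    return sum(min(row) for row in distances)

# Rows: the four words of D1; columns: the three words of D2.
# The word-to-row and word-to-column assignment is assumed for illustration.
D1_TO_D2 = [
    [0.10, 0.50, 0.70],
    [0.35, 0.20, 0.60],
    [0.35, 0.15, 0.20],
    [0.40, 0.30, 0.10],
]

score = nearest_word_score(D1_TO_D2)
print(round(score, 2))  # 0.55

def rank_candidates(scores):
    """Order candidate projects by ascending score (smaller means more similar)."""
    return sorted(scores, key=scores.get)
```

Given a score for each candidate project against the query project, `rank_candidates` returns the ranked list that the search result step presents, most similar first.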
Preliminary Results
Query/Rank | Project Name | Description | Project Type
Query | android_browser | Customize android webclient (source code with readme file) | Lightning based android browser
1 | Myfacebook | MyFacebook source code | Lightning based android browser
2 | Speed-Browser-4G-Plus | Speed Browser 4G Plus is based on Lightning Browser, and licensed under the Mozilla Public License, v. 2.0. | Lightning based android browser
3 | Web-browser | Web browser is based on Lightning Browser, and licensed under the Mozilla Public License, v. 2.0. | Lightning based android browser
4 | JumpGo | JumpGo Web Browser for Android | JumpGo Android Browser
5 | VChrome | Build an test browser for Viettel in job interview | Android Browser
Summary
We proposed a model for finding functionally similar projects in GitHub
Used textual and source code content to construct documents
Measured similarity between documents using Word Mover’s Distance
Leveraged Word2Vec word embeddings
Reference
Word2Vec: Gensim Python library, https://radimrehurek.com/gensim/models/word2vec.html
WMD: https://github.com/mkusner/wmd
Wikipedia dump: https://dumps.wikimedia.org/enwiki/
GitHub projects data: the GHTorrent project, http://ghtorrent.org/
Questions?



Editor's Notes

  1. Hello everyone, I am Masudur Rahman, a PhD student in the Department of Computer Science at the University of Virginia. I will present our work on finding similar projects in GitHub, where we used Word Mover's Distance and Word2Vec word embeddings.
  2. Finding functionally similar projects is very important for app recommendation, code re-use, rapid prototyping, and plagiarism checking.
  3. There is no convenient way to search for similar projects using all the project information (source code, description, readme, etc.). No surprise! Google tries to find applications relevant to the query; it is not intended for project-level search for source code. We might augment the query to get more meaningful results for the developer, but the intent of a general-purpose search engine remains the same: it will try to find applications, not the source code that a developer might want to use.
  4. There is no convenient way to search for similar projects using all the project information (source code, description, readme, etc.). We will see how we incorporated method, class, and API names to augment the textual information.
  5. Consider these two documents. They share no keyword, so keyword-based cosine similarity gives 0, which would mean they are totally dissimilar. But they are not; they even convey the same meaning. In project documentation, developers often use different words to describe the same thing. Although these two documents are similar in meaning, plain keyword-based similarity cannot capture this.
  6. If we look closely, android and lollipop are similar in meaning. The same holds for the other keywords. Now, instead of matching words exactly, can we assign a value to a pair of words that indicates how similar they are in meaning? Yes, we can: learn a weight w, where a higher weight means strongly similar and a lower weight means less similar.
  7. Intuition: similar words appear with the same context words. One of the most effective ways of learning this is Word2Vec.
  8. How can we convert this word-level information into sentence-level similarity? WMD leverages the word embedding to compute sentence-level similarity.