Finding Similar Projects in GitHub using Word2Vec and WMD
MD MASUDUR RAHMAN
DEPARTMENT OF COMPUTER SCIENCE
UNIVERSITY OF VIRGINIA

Introduction
Given project details (description and source code), the aim is to find functionally similar projects.
Finding functionally similar projects is important for:
- Application/project recommendation
- Code re-use, rapid prototyping
- Discovering code plagiarism
How do developers search for similar projects?
CS@UVa

General-Purpose Search (Google)
Query: android browser
- Tries to find applications relevant to the query
- Not intended for searching source code

GitHub Search: android browser
Mostly keyword-based search on textual content
- Project name, description, etc.
- Open and analyze jar, class, apk, etc.
Might rank irrelevant projects at the top
- Less textual content
Use source code content
- Augment content with method, class, and API names

Model Workflow
Workflow diagram components:
- GitHub Projects
- Data Preprocessing, per feature (Tokenization, Normalization, Stemming, Stopword Removal, TF-IDF score-based word filtering)
- Feature Extraction (Description, Readme, Method & Class Name, API Package Name, API Class Name)
- Document Generation (all features combined)
- Search Interface
- Candidate Project Documents
- Query Project Documents
- Document Similarity Computation (Word2Vec, WMD)
- Search Result (ranked list of similar projects)
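
The preprocessing named in the workflow could look roughly like the sketch below. It is only an illustration, not the authors' implementation: the stop-word list, the identifier-splitting rule, and the direction of the TF-IDF filter (dropping low-scoring words) are all assumptions, and stemming is left out because it would need an external library.

```python
import math
import re
from collections import Counter

# Hypothetical, tiny stop-word list; a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "the", "for", "to", "of", "and", "in", "on", "is"}

def tokenize(text):
    """Split free text and identifiers (camelCase, dotted names) into lowercase tokens."""
    # Insert a space at lower-to-upper boundaries: "getImageView" -> "get Image View".
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", text)
    # Keep alphabetic runs only; this also splits on "_", ".", digits and punctuation.
    return [tok.lower() for tok in re.findall(r"[A-Za-z]+", spaced)]

def preprocess(text):
    """Tokenize, normalize to lowercase, and remove stop words."""
    return [tok for tok in tokenize(text) if tok not in STOPWORDS]

def tfidf_filter(docs, min_score):
    """Drop tokens whose TF-IDF score within a document is below min_score (assumed rule)."""
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    filtered = []
    for doc in docs:
        counts = Counter(doc)
        kept = []
        for tok in doc:
            tf = counts[tok] / len(doc)
            idf = math.log(n_docs / doc_freq[tok])
            if tf * idf >= min_score:
                kept.append(tok)
        filtered.append(kept)
    return filtered

print(preprocess("getImageView for the android.webkit.WebView"))
```

Splitting identifiers this way lets method, class, and API names contribute ordinary words to the project document.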

How to measure document similarity?
Document 1: image gallery app for Lollipop
Document 2: android photo viewer
Keyword-based cosine similarity, Bag of Words (BOW)
- No common keyword!
- Cosine similarity = 0
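
The zero score can be checked with a few lines of Python. This is a minimal sketch that assumes whitespace tokenization and lowercasing; the function name is made up for the illustration.

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Cosine similarity between two documents under a bag-of-words model."""
    bag_a = Counter(doc_a.lower().split())
    bag_b = Counter(doc_b.lower().split())
    # Only words present in both documents contribute to the dot product.
    dot = sum(bag_a[w] * bag_b[w] for w in bag_a.keys() & bag_b.keys())
    norm_a = math.sqrt(sum(c * c for c in bag_a.values()))
    norm_b = math.sqrt(sum(c * c for c in bag_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

doc1 = "image gallery app for Lollipop"
doc2 = "android photo viewer"
print(cosine_similarity(doc1, doc2))  # 0.0, the documents share no token
```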

How to measure document similarity?
Document 1: image gallery app for Lollipop
Document 2: android photo viewer
Word Embedding: weights w1, w2, w3, w4 relate words of Document 1 to words of Document 2

Word Embedding
"You shall know a word by the company it keeps" (J. R. Firth, 1957)
Example contexts for the word "upgrade":
- Open source upgrade path for Odoo/OpenERP
- Plugin to check for obvious upgrade points on the path to 3.0
- Codes related to upgrade project
- Demo app to demonstrate how to upgrade from Angular 1 to Angular 2
Learn the word vector for "upgrade" from its surrounding words
- Word2Vec
upgrade = [0.286, 0.792, -0.171, -0.105, 0.544, 0.351, -0.653, 0.274]

Word2Vec
Input: text corpus
Training: Word2Vec model
Output: word vectors (word embedding), e.g. upgrade = [0.286, 0.792, -0.171, -0.105, 0.544, 0.351, -0.653, 0.274]
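
Once word vectors are available, closeness between two words is commonly measured with the cosine of the angle between their vectors. The sketch below uses hand-picked 2-D vectors that are purely hypothetical (real Word2Vec vectors are learned from a corpus and have many more dimensions); the function names are also made up for the illustration.

```python
import math

# Hypothetical 2-D vectors chosen by hand for illustration only.
TOY_VECTORS = {
    "image": (0.9, 0.1),
    "photo": (0.8, 0.2),
    "picture": (0.85, 0.15),
    "upgrade": (0.1, 0.9),
    "update": (0.2, 0.8),
}

def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def most_similar(word, vectors):
    """Return the other word whose vector has the highest cosine similarity."""
    others = (w for w in vectors if w != word)
    return max(others, key=lambda w: cosine(vectors[word], vectors[w]))

print(most_similar("upgrade", TOY_VECTORS))
```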

Word2Vec Model
Document: image gallery app for android
Skip-gram over the tokens: image, gallery, app, for, android
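
Skip-gram training is driven by (center word, context word) pairs taken from a sliding window. The following sketch only generates those pairs for the example document; the window size of 2 is an assumption, and the neural network that learns from the pairs is not shown.

```python
def skipgram_pairs(tokens, window=2):
    """List every (center, context) pair whose positions differ by at most `window`."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "image gallery app for android".split()
for center, context in skipgram_pairs(tokens, window=2):
    print(center, "->", context)
```

With this window, "app" is paired with all four other words, while "image" is paired only with "gallery" and "app".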

Example Word Embedding
In the embedded space, words with similar meanings cluster together:
- image, photo, picture, figure
- sample, example, demo, illustration
- upgrade, update, modify, change
- install, setup, launch
- change, dimension, size, height, length, range
Embedding for each word
How to get document/sentence-level similarity?
- Word Mover's Distance (WMD)

Word Mover's Distance (WMD)
D1: image, gallery, app, Lollipop
D2: android, photo, viewer
Distances shown step by step, one slide per D1 word, from that word to the three D2 words:
- 0.1, 0.5, 0.7
- 0.35, 0.2, 0.6
- 0.35, 0.15, 0.2
- 0.4, 0.3, 0.1
Smallest distance kept at each step: 0.1, 0.2, 0.15, 0.1
Similarity Score(D1, D2) = 0.1 + 0.2 + 0.15 + 0.1 = 0.55
Smaller score means more similar
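
The arithmetic on the Word Mover's Distance slides can be reproduced with the sketch below. It assumes that each of the four animation steps shows the distances from one D1 word to the three D2 words and that the smallest one is kept; the function names and the candidate scores in the ranking example are hypothetical. The full WMD additionally weights each word by its normalized frequency and solves a transport problem, so this nearest-word sum should be read as a simplification of it.

```python
# Distances copied from the slides: one row per D1 word, one entry per D2 word.
# Which D2 word each entry belongs to is not recoverable from the slides, and it
# does not matter for the minimum.
DISTANCE_ROWS = [
    [0.1, 0.5, 0.7],
    [0.35, 0.2, 0.6],
    [0.35, 0.15, 0.2],
    [0.4, 0.3, 0.1],
]

def nearest_word_score(distance_rows):
    """Sum, over the words of D1, of the distance to the closest word of D2."""
    return sum(min(row) for row in distance_rows)

def rank_by_score(scores):
    """Order candidate names by ascending score (smaller means more similar)."""
    return sorted(scores, key=scores.get)

score = nearest_word_score(DISTANCE_ROWS)
print(round(score, 2))  # 0.55

# Hypothetical scores for two candidate projects, only to show the ranking step.
print(rank_by_score({"candidate_a": 0.90, "candidate_b": score}))
```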

Preliminary Results
Query/Rank | Project Name | Description | Project Type
Query | android_browser | Customize android webclient (source code with readme file) | Lightning based android browser
1 | Myfacebook | MyFacebook source code | Lightning based android browser
2 | Speed-Browser-4G-Plus | Speed Browser 4G Plus is based on Lightning Browser, and licensed under the Mozilla Public License, v. 2.0. | Lightning based android browser
3 | Web-browser | Web browser is based on Lightning Browser, and licensed under the Mozilla Public License, v. 2.0. | Lightning based android browser
4 | JumpGo | JumpGo Web Browser for Android | JumpGo Android Browser
5 | VChrome | Build an test browser for Viettel in job interview | Android Browser

Summary
- We proposed a model for finding functionally similar projects in GitHub
- Used textual and source code content to construct documents
- Measured similarity between documents using Word Mover's Distance
- Leveraged Word2Vec word embeddings

Reference
- Word2Vec: Gensim Python library, https://radimrehurek.com/gensim/models/word2vec.html
- WMD: https://github.com/mkusner/wmd
- Wikipedia dump: https://dumps.wikimedia.org/enwiki/
- GitHub projects data: the GHTorrent project, http://ghtorrent.org/

Questions?



Editor's Notes

  1. Hello everyone, I am Masudur Rahman, a PhD student in the Department of Computer Science at the University of Virginia. I will present our work on finding similar projects in GitHub, where we used Word Mover's Distance and Word2Vec word embeddings.
  2. Finding functionally similar projects is very important for app recommendation, code re-use, rapid prototyping, and plagiarism checking.
  3. There is no convenient way to search for similar projects using all the project information (source code, description, readme, etc.). No surprise! Google tries to find applications based on the query, and it is not intended for project-level search for source code. We might augment the query to get some meaningful results for the developer, but the intent of these general-purpose search engines remains the same: they will try to find applications, not the source code that a developer might want to use.
  4. There is no convenient way to search for similar projects using all the project information (source code, description, readme, etc.). We will see how we incorporated method, class, and API names to augment the textual information.
  5. Let's consider these two documents. They share no common keyword, so keyword-based cosine similarity gives 0, which would mean they are totally dissimilar. But they are not; they even carry the same meaning. In project documentation, developers often use different words for the same thing. Although these two documents are similar in meaning, plain keyword-based similarity cannot capture that.
  6. If we look closely, android and lollipop are similar in meaning, and the same holds for the other keywords. Now, instead of matching words exactly, can we assign a value to a pair of words that indicates how similar they are in meaning? Yes, we can: learn a weight w, where a higher weight means strongly similar and a lower weight means less similar.
  7. Intuition: similar words have similar context words. One of the most effective ways of learning this is Word2Vec.
  8. How can we convert this word-level information into sentence-level similarity? WMD leverages the word embedding to get sentence-level similarity.