Finding Similar Projects in GitHub using Word2Vec and WMD

NL+SE Workshop talk at FSE 2016

Published in: Technology

  1. 1. Finding Similar Projects in GitHub using Word2Vec and WMD. Md Masudur Rahman, Department of Computer Science, University of Virginia
  2. 2. Introduction: Given project details (description and source code), the aim is to find functionally similar projects. Finding functionally similar projects is important for: application/project recommendation; code re-use and rapid prototyping; discovering code plagiarism. How do developers search for similar projects? CS@UVa
  3. 3. General-Purpose Search (Google). Query: android browser. Tries to find applications relevant to the query; not intended for searching source code.
  4. 4. GitHub Search. Query: android browser. Mostly keyword-based search on textual content (project name, description, etc.). Open and analyze jar, class, apk, etc. Might rank irrelevant projects at the top. Less textual content. Use source code content: augment the content with method, class, and API names.
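A minimal sketch of how a short description might be augmented with words taken from method, class, and API names, assuming identifiers are split on camelCase boundaries and punctuation. The function names (`split_identifier`, `build_document`) and the sample project are hypothetical and are not taken from the talk.

```python
import re

def split_identifier(name):
    """Split a camelCase, PascalCase, snake_case, or dotted name into lowercase words."""
    words = []
    for part in re.split(r"[^A-Za-z0-9]+", name):
        # Order matters: acronym runs first ("HTTP" in "HTTPClient"),
        # then capitalized or lowercase words, then leftover capitals and digits.
        words += re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|[0-9]+", part)
    return [w.lower() for w in words]

def build_document(description, identifiers):
    """Combine description words with words taken from source-code identifiers."""
    tokens = re.findall(r"[a-z0-9]+", description.lower())
    for name in identifiers:
        tokens += split_identifier(name)
    return tokens

# Hypothetical project: the description and identifiers are made up for illustration.
doc = build_document(
    "Lightweight android browser",
    ["BrowserActivity", "loadUrl", "android.webkit.WebView"],
)
print(doc)
```

A project with a three-word description ends up with eleven tokens here, which gives a similarity measure more to work with when the textual content is sparse.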
  5. 5. Model Workflow. GitHub Projects; Data Preprocessing, per feature (Tokenization, Normalization, Stemming, Stopword Removal, TF-IDF score-based word filtering); Feature Extraction (Description, Readme, Method & Class Name, API Package Name, API Class Name); Document Generation (all features combined); Search Interface; Candidate Project Documents; Query Project Documents; Document Similarity Computation (Word2Vec, WMD); Search Result (ranked list of similar projects).
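The slides do not say how the TF-IDF score-based word filtering is applied. The sketch below assumes one plausible reading: within each document, keep only the distinct words with the highest TF-IDF scores. The function name `tfidf_filter`, the `keep_ratio` parameter, and the sample documents are all hypothetical.

```python
import math
from collections import Counter

def tfidf_filter(docs, keep_ratio=0.5):
    """Keep, per document, the fraction of distinct words with the highest TF-IDF.

    Assumption: filtering keeps top-scoring words; the talk does not specify this.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    filtered = []
    for doc in docs:
        tf = Counter(doc)
        # Plain TF-IDF with no smoothing: (count / length) * log(N / df).
        scores = {w: (tf[w] / len(doc)) * math.log(n_docs / df[w]) for w in tf}
        n_keep = max(1, round(len(scores) * keep_ratio))
        # Sort by descending score, breaking ties alphabetically for determinism.
        keep = set(sorted(scores, key=lambda w: (-scores[w], w))[:n_keep])
        filtered.append([w for w in doc if w in keep])
    return filtered

# Made-up token lists standing in for preprocessed project documents.
docs = [
    ["android", "browser", "app", "lightning"],
    ["android", "photo", "viewer", "app"],
    ["android", "music", "player", "app"],
]
print(tfidf_filter(docs))
# [['browser', 'lightning'], ['photo', 'viewer'], ['music', 'player']]
```

Words that occur in every document ("android", "app") score zero and are dropped, leaving the words that distinguish one project from another.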
  7. 7. How to measure document similarity? Document 1: image gallery app for Lollipop. Document 2: android photo viewer. Keyword-based approach: cosine similarity over Bag of Words (BOW). No common keyword, so cosine similarity = 0.
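The zero score on this slide can be reproduced with a short sketch, assuming whitespace tokenization and raw term counts as the bag-of-words weights. The function name `cosine_bow` is hypothetical.

```python
import math
from collections import Counter

def cosine_bow(doc_a, doc_b):
    """Cosine similarity between two texts using raw bag-of-words counts."""
    a = Counter(doc_a.lower().split())
    b = Counter(doc_b.lower().split())
    # Only words present in both documents contribute to the dot product.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

doc1 = "image gallery app for Lollipop"
doc2 = "android photo viewer"
print(cosine_bow(doc1, doc2))  # 0.0
```

The two documents describe the same kind of application but share no token, so the dot product is empty and the similarity is exactly zero.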
  8. 8. How to measure document similarity? Document 1: image gallery app for Lollipop. Document 2: android photo viewer. Word Embedding (figure: word vectors w1, w2, w3, w4).
  9. 9. Word Embedding. “You shall know a word by the company it keeps” (J. R. Firth, 1957). Example contexts: "Open source upgrade path for Odoo/OpenERP"; "Plugin to check for obvious upgrade points on the path to 3.0"; "Codes related to upgrade project"; "Demo app to demonstrate how to upgrade from Angular 1 to Angular 2". Learn the word vector for "upgrade" from its surrounding words (Word2Vec): upgrade = [0.286, 0.792, -0.171, -0.105, 0.544, 0.351, -0.653, 0.274].
  10. 10. Word2Vec. Input: text corpus. Training: Word2Vec model. Output: word vectors (word embedding), e.g. upgrade = [0.286, 0.792, -0.171, -0.105, 0.544, 0.351, -0.653, 0.274].
  11. 11. Word2Vec Model. Document: image gallery app for android. Skip-gram over the words: image, gallery, app, for, android.
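Skip-gram training is driven by (center word, context word) pairs drawn from a sliding window. The sketch below only generates those pairs for the slide's document; it does not train a model. The window size of 2 is an assumption, since the slide does not state one, and `skipgram_pairs` is a hypothetical helper name.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        pairs += [(center, tokens[j]) for j in range(lo, hi) if j != i]
    return pairs

tokens = "image gallery app for android".split()
pairs = skipgram_pairs(tokens, window=2)

# Context words the model would be asked to predict from the center word "app".
print([context for center, context in pairs if center == "app"])
# ['image', 'gallery', 'for', 'android']
```

A skip-gram model learns vectors such that each center word is a good predictor of its context words, which is why words used in similar contexts end up with similar vectors.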
  12. 12. Example Word Embedding. In the embedded space, words with similar meaning cluster together. Words shown: image, photo, picture, figure, sample, example, demo, illustration, upgrade, update, modify, change, install, setup, launch, change, dimension, size, height, length, range. This gives an embedding for each word. How do we get document/sentence-level similarity? Word Mover’s Distance (WMD).
  13. 13. Word Mover’s Distance (WMD). D1: image, gallery, app, Lollipop. D2: android, photo, viewer. Values shown: 0.1, 0.5, 0.7.
  14. 14. Word Mover’s Distance. D1: image, gallery, app, Lollipop. D2: android, photo, viewer. Values shown: 0.1, 0.5, 0.7.
  15. 15. Word Mover’s Distance. D1: image, gallery, app, Lollipop. D2: android, photo, viewer. Values shown: 0.35, 0.2, 0.6.
  16. 16. Word Mover’s Distance. D1: image, gallery, app, Lollipop. D2: android, photo, viewer. Values shown: 0.35, 0.15, 0.2.
  17. 17. Word Mover’s Distance. D1: image, gallery, app, Lollipop. D2: android, photo, viewer. Values shown: 0.4, 0.3, 0.1.
  18. 18. Word Mover’s Distance. D1: image, gallery, app, Lollipop. D2: android, photo, viewer. Similarity Score(D1, D2) = 0.1 + 0.2 + 0.15 + 0.1 = 0.55. A smaller score means the documents are more similar.
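Exact WMD requires solving an optimal transport problem. The sketch below swaps in the simpler relaxed variant, a lower bound on WMD in which every word sends all of its weight to its nearest word in the other document, much like the nearest-word sums illustrated on these slides. The 2-D vectors, the third document, and the candidate names are made up for illustration; real Word2Vec vectors are learned from a corpus.

```python
import math

# Made-up 2-D vectors for illustration only (not learned embeddings).
TOY_VECTORS = {
    "image": (1.0, 0.0), "photo": (0.9, 0.1), "gallery": (0.8, 0.3),
    "viewer": (0.7, 0.4), "app": (0.2, 0.9), "android": (0.1, 1.0),
    "lollipop": (0.0, 0.9), "music": (-0.9, 0.2), "player": (-0.8, 0.4),
}

def relaxed_wmd(doc_a, doc_b, vectors):
    """Relaxed Word Mover's Distance (a lower bound on exact WMD)."""
    def one_way(src, dst):
        # Each source word carries weight 1/len(src) and moves it all to
        # the closest destination word (Euclidean distance).
        return sum(min(math.dist(vectors[s], vectors[t]) for t in dst) for s in src) / len(src)
    # Taking the larger of the two directions gives the tighter bound.
    return max(one_way(doc_a, doc_b), one_way(doc_b, doc_a))

query = ["image", "gallery", "app", "lollipop"]
candidates = {
    "photo-viewer": ["android", "photo", "viewer"],   # hypothetical candidate
    "music-player": ["android", "music", "player"],   # hypothetical candidate
}

# Smaller distance means more similar, so sort in ascending order.
ranked = sorted(candidates, key=lambda name: relaxed_wmd(query, candidates[name], TOY_VECTORS))
print(ranked)  # ['photo-viewer', 'music-player']
```

Unlike the bag-of-words cosine, this score separates the two candidates even though neither shares a word with the query.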
  19. 19. Preliminary Results (table columns: Project Name, Description, Project Type, Query/Rank): android_browser Customize android webclient (source code with readme file) Lightning based android browser 1 Myfacebook MyFacebook source code Lightning based android browser 2 Speed-Browser-4G-Plus Speed Browser 4G Plus is based on Lightning Browser, and licensed under the Mozilla Public License, v. 2.0. Lightning based android browser 3 Web-browser Web browser is based on Lightning Browser, and licensed under the Mozilla Public License, v. 2.0. Lightning based android browser 4 JumpGo JumpGo Web Browser for Android JumpGo Android Browser 5 VChrome Build an test browser for Viettel in job interview Android Browser
  20. 20. Summary: We proposed a model for finding functionally similar projects in GitHub. We used textual and source code content to construct documents, measured similarity between documents using Word Mover’s Distance, and leveraged Word2Vec word embeddings.
  21. 21. References. Word2Vec: Gensim Python library, https://radimrehurek.com/gensim/models/word2vec.html; WMD: https://github.com/mkusner/wmd; Wikipedia dump: https://dumps.wikimedia.org/enwiki/; GitHub projects data: the GHTorrent project, http://ghtorrent.org/
  22. 22. Questions?
