
"Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Software Engineer - Machine Learning Team at Source {d}


Watch more from Data Natives Berlin 2016 here: http://bit.ly/2fE1sEo
Visit the conference website to learn more: www.datanatives.io

Follow Data Natives:
https://www.facebook.com/DataNatives
https://twitter.com/DataNativesConf

Stay Connected to Data Natives by Email: Subscribe to our newsletter to get the news first about Data Natives 2016: http://bit.ly/1WMJAqS

About the Author:
Vadim is currently a Senior Machine Learning Engineer at source{d}, where he works on deep neural networks that aim to understand all of the world's developers through their code. While at Samsung, Vadim was one of the creators of the distributed deep learning platform Veles (https://velesnet.ml). Afterwards he was responsible for the machine learning efforts to fight email spam at Mail.Ru. In the past he was also a visiting associate professor at the Moscow Institute of Physics and Technology, teaching courses on new technologies and running ACM-style internal coding competitions. Vadim is a big fan of GitHub (vmarkovtsev) and HackerRank (markhor), and likes to write technical articles for a number of websites.


  1. Source code abstracts classification using CNN. Vadim Markovtsev, source{d}
  2. goo.gl/sd7wsm (view this on your device) Plan: 1. Motivation 2. Source code feature engineering 3. The Network 4. Results 5. Other work
  3. Motivation. “Everything is better with clusters.”
  4. Motivation. Customers buy goods; software developers write code.
  5. Motivation. So to understand the latter, we need to understand what they do and how they do it. Feature origins: • Social networks • Version control statistics (history, style) • Source code (algorithms, dependency graph, style)
  6. Motivation
  7. Motivation. Let's check how deep we can drill with source code style ML. Toy task: binary classification between two projects using only features that originate in code style.
  8. Feature engineering. Requirements: 1. Ignore text files, Markdown, etc. 2. Ignore autogenerated files 3. Support many languages with minimal effort 4. Include as much information about the source code as possible
  9. Feature engineering. (1) and (2) are solved by github/linguist and source{d}'s own tool • Used by GitHub for language bars • Supports 400+ languages
  10. Feature engineering. (3) and (4) are solved by Pygments • Highlights source code (tokenizer) • Supports 400+ languages (though only 50% intersect with github/linguist) • ≈90 token types (not all are used for every language)
  11. Feature engineering. Pygments example:

      # prints "Hello, World!"
      if True:
          print("Hello, World!")
  12. Feature engineering. The resulting token stream:

      Token.Comment.Single '# prints "Hello, World!"'
      Token.Text '\n'
      Token.Keyword 'if'
      Token.Text ' '
      Token.Name.Builtin.Pseudo 'True'
      Token.Punctuation ':'
      Token.Text '\n'
      Token.Text '    '
      Token.Keyword 'print'
      Token.Punctuation '('
      Token.Literal.String.Double '"'
      Token.Literal.String.Double 'Hello, World!'
      Token.Literal.String.Double '"'
      Token.Punctuation ')'
      Token.Text '\n'
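A token stream like the one above can be reproduced with Pygments directly. A minimal sketch; the choice of `Python2Lexer` is an assumption, made because the slide shows `print` lexed as a keyword, which matches Python 2 rules:

```python
from pygments import lex
from pygments.lexers import Python2Lexer

code = '''# prints "Hello, World!"
if True:
    print("Hello, World!")
'''

# lex() yields (token_type, value) pairs, e.g. (Token.Keyword, 'if')
tokens = list(lex(code, Python2Lexer()))
for ttype, value in tokens:
    print(ttype, repr(value))
```

The token types, not the raw text, are what feeds the classifier, which is what makes the scheme largely language-agnostic.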
  13. Feature engineering
  14. Feature engineering • Split the token stream into lines; each line contains ≤40 tokens • Merge indents • "One against all" with the value length • Some tokens occupy more than one dimension, e.g. Token.Name reflects the naming style • About 200 dimensions overall • 8000 features per line, most of them zeros • Mean-dispersion normalization
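The per-line encoding can be sketched as follows. The tiny token inventory and the exact layout here are assumptions for illustration; the deck only states roughly 200 dimensions per token, at most 40 tokens per line, and mean-dispersion normalization:

```python
import numpy as np

# Hypothetical token vocabulary; the real pipeline uses ~200 dimensions
# covering ~90 Pygments token types plus extras such as naming style.
TOKEN_TYPES = ["Keyword", "Name", "Punctuation", "Literal.String", "Comment", "Text"]
DIMS = len(TOKEN_TYPES)
MAX_TOKENS = 40  # each line holds at most 40 tokens

def encode_line(tokens):
    """"One against all" with the value length: for each token slot,
    write the token value's length into the dimension of its type."""
    vec = np.zeros((MAX_TOKENS, DIMS), dtype=np.float32)
    for i, (ttype, value) in enumerate(tokens[:MAX_TOKENS]):
        vec[i, TOKEN_TYPES.index(ttype)] = len(value)
    return vec.ravel()  # flattened per-line feature vector, mostly zeros

def normalize(X):
    """Mean-dispersion normalization over a feature matrix."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / np.where(std == 0, 1, std)

line = [("Keyword", "if"), ("Text", " "), ("Name", "True"), ("Punctuation", ":")]
features = encode_line(line)
```

With the real ~200 dimensions and 40 slots this gives the 8000 features per line quoted on the slide.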
  15. Feature engineering. Though extracted, names as words are not used in this scheme. We've explored two approaches to using this extra information: 1. LSTM sequence modelling (link to presentation) 2. ARTM topic modelling (article in our blog)
  16. Feature engineering
  17. The Network

      layer          kernel   pooling   number
      convolutional  4x1      2x1       250
      convolutional  8x2      2x2       200
      convolutional  5x6      2x2       150
      convolutional  2x10     2x2       100
      all2all                           512
      all2all                           64
      all2all                           output
  18. The Network

      Activation             ReLU
      Optimizer              GD with momentum (0.5)
      Learning rate          0.002
      Weight decay           0.955
      Regularization         L2, 0.0005
      Weight initialization  σ = 0.1
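The update rule implied by these settings can be sketched in NumPy. This is a stand-in assuming the classical momentum formulation; the deck lists only the hyperparameters, not the exact equations:

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.002, momentum=0.5, l2=0.0005):
    """One GD-with-momentum step with L2 regularization,
    using the hyperparameter values from the slide."""
    g = grad + l2 * w                        # L2 term adds l2 * w to the gradient
    velocity = momentum * velocity - lr * g  # accumulate a decaying velocity
    return w + velocity, velocity

w = np.array([1.0, -2.0])
v = np.zeros_like(w)
w, v = sgd_momentum_step(w, np.array([0.1, 0.2]), v)
```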
  19. The Network • Merge all project files together and feed 50 LOT (lines of tokens) as a single sample. • Does not converge without randomly shuffling the files (sample borders are of course fixed). • Batch size is 50. • Truncate projects to the smallest LOT count. • Fragile to small metaparameter deviations.
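The sample construction described above can be sketched like this; the toy "files" and the line encoder are placeholders, and only the 50-lines-per-sample windowing and the file shuffling come from the slide:

```python
import random

LINES_PER_SAMPLE = 50  # 50 LOT (lines of tokens) form one sample

def build_samples(project_files, seed=0):
    """Shuffle the files, concatenate their token lines, and cut fixed
    windows of 50 lines; a tail shorter than 50 lines is dropped."""
    files = list(project_files)
    random.Random(seed).shuffle(files)  # per the slide, training diverges without this
    lines = [line for f in files for line in f]
    return [lines[i:i + LINES_PER_SAMPLE]
            for i in range(0, len(lines) - LINES_PER_SAMPLE + 1, LINES_PER_SAMPLE)]

# toy "files": each is a list of already-encoded token lines
files = [[f"f{j}l{i}" for i in range(40)] for j in range(3)]
samples = build_samples(files)
```

Shuffling whole files while keeping sample borders fixed means each 50-line window still comes from contiguous code, but the order in which projects' files appear varies between epochs.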
  20. The Network • Python 3 / TensorFlow / NVIDIA GPU • Preprocessing is done on Dataproc (Spark) • The database of features is stored in Cloud Storage • Sparse matrices ⇒ normalization on the fly
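Keeping the features sparse and normalizing only the current batch can be sketched with SciPy. This is a sketch, not the deck's code; the precomputed dataset-wide statistics are an assumption:

```python
import numpy as np
from scipy import sparse

# The feature matrix is mostly zeros, so it is stored sparse (CSR).
X = sparse.random(1000, 200, density=0.02, format="csr", random_state=0)

# Dataset-wide statistics for mean-dispersion normalization, computed once.
mean = np.asarray(X.mean(axis=0)).ravel()
var = np.asarray(X.multiply(X).mean(axis=0)).ravel() - mean ** 2
std = np.sqrt(np.maximum(var, 0))
std[std == 0] = 1.0

def batch(i, size=50):
    """Densify and normalize one batch on the fly; the full dense
    matrix is never materialized."""
    dense = X[i:i + size].toarray()
    return (dense - mean) / std

b = batch(0)
```

Normalization cannot be applied to the stored matrix itself, since subtracting the mean would destroy its sparsity, hence the on-the-fly step.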
  21. Results

      projects              description                 size              accuracy
      Django vs Twisted     Web frameworks, Python      800ktok each      84%
      Matplotlib vs Bokeh   Plotting libraries, Python  1Mtok vs 250ktok  60%
      Matplotlib vs Django  Plotting libraries, Python  1Mtok vs 800ktok  76%
      Django vs Guava       Python vs Java              800ktok           >99%
      Hibernate vs Guava    Java libraries              3Mtok vs 800ktok  96%
  22. Results. Conclusion: the network most likely extracts the internal similarity within each project and relies on it, just like humans do. If the languages are different, distinguishing the projects is very easy (at least because of the unique token types).
  23. Results
  24. Results. Problem: how do we get this for a source code network?
  25. Other work. GitHub has ≈6M active users (3M after reasonable filtering). If we are able to extract various features for each of them, we can cluster them. Vision: 1. Run k-means with K=45000 (using src-d/kmcuda) 2. Run t-SNE to visualize the landscape. BTW, kmcuda implements Yinyang k-means.
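The clustering step can be illustrated with a minimal Lloyd-style k-means in NumPy. This is a toy stand-in for src-d/kmcuda, which implements the much faster Yinyang variant; the data and sizes here are made up:

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Plain Lloyd k-means: assign each point to its nearest centroid,
    then move every centroid to the mean of its assigned points."""
    centroids = X[:k].copy()  # deterministic init from the first k points
    for _ in range(iters):
        # squared distances, shape (n_points, k)
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            pts = X[labels == j]
            if len(pts):
                centroids[j] = pts.mean(axis=0)
    return centroids, labels

# two well-separated toy blobs instead of 6M developer feature vectors
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
centroids, labels = kmeans(X, 2)
```

At K=45000 over millions of points this naive O(n·K) assignment is impractical, which is exactly why kmcuda's Yinyang pruning and GPU execution matter.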
  26. Other work. Article. [t-SNE map of GitHub projects; the legend lists the 200+ detected languages, from ASP and Ada through YAML and xBase, plus the indentation classes spaces / tabs / mixed. © source{d} CC-BY-SA 4.0]
  27. Other work. Article.
  28. Other work. Before: / After:
  29. Thank you. We are hiring!
