Source code abstracts
classification using CNN
Vadim Markovtsev, source{d}
goo.gl/sd7wsm
(view this on your device)
Plan
1. Motivation
2. Source code feature engineering
3. The Network
4. Results
5. Other work
Motivation
“Everything is better with clusters.”
Motivation
Customers buy goods, and software developers write code.
Motivation
So to understand the latter, we need to understand what they do and how
they do it. Feature origins:
• Social networks
• Version control statistics
  ◦ History
  ◦ Style
• Source code
  ◦ Algorithms
  ◦ Dependency graph
  ◦ Style
Motivation
Let's check how deep we can drill with ML on source code style.
Toy task: binary classification between two projects using only features
originating in code style.
Feature engineering
Requirements:
1. Ignore text files, Markdown, etc.
2. Ignore autogenerated files
3. Support many languages with minimal effort
4. Include as much information about the source code as possible
Feature engineering
(1) and (2) are solved by github/linguist and source{d}'s own tool
• Used by GitHub for language bars
• Supports 400+ languages
Feature engineering
(3) and (4) are solved by Pygments
• Highlights source code (tokenizer)
• Supports 400+ languages (though only 50% intersects with github/linguist)
• ≈90 token types (not all are used for every language)
Feature engineering
Pygments example:
# prints "Hello, World!"
if True:
    print("Hello, World!")
Feature engineering
Token.Comment.Single '# prints "Hello, World!"'
Token.Text '\n'
Token.Keyword 'if'
Token.Text ' '
Token.Name.Builtin.Pseudo 'True'
Token.Punctuation ':'
Token.Text '\n'
Token.Text ' '
Token.Keyword 'print'
Token.Punctuation '('
Token.Literal.String.Double '"'
Token.Literal.String.Double 'Hello, World!'
Token.Literal.String.Double '"'
Token.Punctuation ')'
Token.Text '\n'
Feature engineering
• Split the token stream into lines; each line contains ≤40 tokens
• Merge indents
• "One against all" (one-hot) encoding, with the token value's length as the magnitude
• Some tokens occupy more than 1 dimension, e.g. Token.Name reflects
naming style
• About 200 dimensions overall
• 8000 features per line, most are zeros
• Mean-dispersion normalization (zero mean, unit standard deviation)
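The encoding above can be sketched in NumPy. This is a toy-scale illustration, not the talk's code: the token-type vocabulary here has 5 entries instead of ~200, so a line yields 40 × 5 = 200 features instead of 8000; `encode_line` and `normalize` are hypothetical names.

```python
import numpy as np

# Hypothetical, tiny token-type vocabulary; the real scheme has ~200 dimensions.
TOKEN_TYPES = ["Comment.Single", "Keyword", "Name", "Punctuation", "Text"]
MAX_TOKENS = 40  # each line is capped at 40 tokens

def encode_line(tokens):
    """Encode one line of (token_type, value) pairs: a one-hot slot per
    token position, holding the token value's length ("one against all")."""
    vec = np.zeros((MAX_TOKENS, len(TOKEN_TYPES)), dtype=np.float32)
    for i, (ttype, value) in enumerate(tokens[:MAX_TOKENS]):
        vec[i, TOKEN_TYPES.index(ttype)] = len(value)
    return vec.ravel()

def normalize(matrix):
    """Mean-dispersion normalization: zero mean, unit std per feature."""
    mean = matrix.mean(axis=0)
    std = matrix.std(axis=0)
    std[std == 0] = 1.0  # leave all-zero columns untouched
    return (matrix - mean) / std
```

Most slots stay zero, which is why sparse matrices are used downstream.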
Feature engineering
Though extracted, names as words may not be used in this scheme.
We've explored two approaches to using this extra information:
1. LSTM sequence modelling (link to presentation)
2. ARTM topic modelling (article in our blog)
The Network
layer          kernel  pooling  number
convolutional  4x1     2x1      250
convolutional  8x2     2x2      200
convolutional  5x6     2x2      150
convolutional  2x10    2x2      100
all2all        -       -        512
all2all        -       -        64
all2all        -       -        output
The Network
Activation             ReLU
Optimizer              GD with momentum (0.5)
Learning rate          0.002
Weight decay           0.955
Regularization         L2, 0.0005
Weight initialization  σ = 0.1
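One optimizer step with these hyperparameters can be sketched in NumPy. Two assumptions here: the 0.955 "weight decay" is read as a per-epoch multiplier on the learning rate, and the L2 penalty is folded into the gradient; `sgd_momentum_step` is an illustrative name, not the talk's code.

```python
import numpy as np

MOMENTUM = 0.5       # momentum coefficient from the slide
LEARNING_RATE = 0.002
DECAY = 0.955        # assumed: multiplies the learning rate every epoch
L2 = 0.0005          # L2 regularization strength

def sgd_momentum_step(weights, grad, velocity, lr):
    """One GD-with-momentum update, L2 penalty folded into the gradient."""
    grad = grad + L2 * weights
    velocity = MOMENTUM * velocity - lr * grad
    return weights + velocity, velocity

# Weights initialized from N(0, sigma=0.1), as on the slide.
rng = np.random.RandomState(0)
w = rng.normal(0.0, 0.1, size=3)
v = np.zeros_like(w)
lr = LEARNING_RATE
for epoch in range(2):
    w, v = sgd_momentum_step(w, np.ones_like(w), v, lr)
    lr *= DECAY  # shrink the step size after every epoch (assumed schedule)
```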
The Network
• Merge all project files together; feed 50 LOT (lines of tokens) as a single
sample.
• Does not converge without randomly shuffling the files (sample borders are,
of course, fixed).
• Batch size is 50.
• Truncate projects to the smaller LOT count.
• Fragile to small hyperparameter deviations.
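The sample-building steps above can be sketched in plain Python. This is a minimal illustration of the scheme as described on the slide; `make_samples` and `truncate_projects` are hypothetical names.

```python
import random

SAMPLE_LINES = 50  # lines of tokens (LOT) per sample

def make_samples(files, seed=0):
    """Shuffle files, concatenate their token lines, and cut the stream
    into fixed 50-line samples (the trailing remainder is dropped)."""
    files = list(files)
    random.Random(seed).shuffle(files)  # training diverges without shuffling
    lines = [line for f in files for line in f]
    n = len(lines) // SAMPLE_LINES
    return [lines[i * SAMPLE_LINES:(i + 1) * SAMPLE_LINES] for i in range(n)]

def truncate_projects(a, b):
    """Truncate both projects to the smaller line count to balance classes."""
    n = min(len(a), len(b))
    return a[:n], b[:n]
```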
The Network
• Python 3 / TensorFlow / NVIDIA GPU
• Preprocessing is done on Dataproc (Spark)
• The database of features is stored in Cloud Storage
• Sparse matrices ⇒ normalization on the fly
Results
projects              description                        size                accuracy
Django vs Twisted     Web frameworks, Python             800 ktok each       84%
Matplotlib vs Bokeh   Plotting libraries, Python         1 Mtok vs 250 ktok  60%
Matplotlib vs Django  Plotting vs web framework, Python  1 Mtok vs 800 ktok  76%
Django vs Guava       Python vs Java                     800 ktok            >99%
Hibernate vs Guava    Java libraries                     3 Mtok vs 800 ktok  96%
Results
Conclusion: the network likely extracts the internal similarity within each
project and uses it. Just like humans do.
If the languages differ, distinguishing the projects is very easy (at least
because of token types unique to each language).
Results
Problem: how to get this for a source code network?
Other work
GitHub has ≈6M active users (3M after reasonable filtering). If we can
extract various features for each, we can cluster them. Vision:
1. Run K-means with K=45000 (using src-d/kmcuda)
2. Run t-SNE to visualize the landscape
BTW, kmcuda implements Yinyang k-means.
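For intuition, here is plain Lloyd's k-means in NumPy. This is a small stand-in, not kmcuda: at K=45000 over millions of users, kmcuda's GPU Yinyang variant is used instead, which produces the same kind of result while pruning redundant distance computations.

```python
import numpy as np

def kmeans(samples, k, iterations=50, seed=0):
    """Plain Lloyd's k-means: alternate nearest-centroid assignment and
    centroid recomputation. Yinyang k-means accelerates exactly this loop."""
    rng = np.random.RandomState(seed)
    centroids = samples[rng.choice(len(samples), k, replace=False)]
    for _ in range(iterations):
        # Assign every sample to its nearest centroid.
        dists = ((samples[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = samples[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids, labels
```

The resulting centroids (or assignments) are then fed to t-SNE for the 2D landscape.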
Other work
Article.
[Figure: t-SNE landscape; the legend lists 200+ programming languages
(ASP … xBase) plus indentation styles (spaces / tabs / mixed).
© source{d} CC-BY-SA 4.0]
© source{d} CC-BY-SA 4.0
Other work
Article.
Other work
Before:
After:
Thank you
We are hiring!

"Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Software Engineer, Machine Learning Team at source{d}
