A Microservices Framework for Real-Time Model Scoring Using Structured Stream... - Databricks
Open-source technologies allow developers to build microservices frameworks for myriad real-time applications; one such application is real-time model scoring. In this session, we will showcase how to architect a microservices framework and, in particular, how to use it to build a low-latency, real-time model scoring system. At the core of the architecture lies Apache Spark’s Structured Streaming, which delivers low-latency predictions, coupled with Docker and Flask as additional open-source tools for model serving. In this session, you will walk away with:
* Knowledge of enterprise-grade model as a service
* Streaming architecture design principles enabling real-time machine learning
* Key concepts and building blocks for real-time model scoring
* Real-time and production use cases across industries, such as IIoT, predictive maintenance, fraud detection, and sepsis detection
Best Practices for Engineering Production-Ready Software with Apache Spark - Databricks
Notebooks are a great tool for Big Data. They have drastically changed the way scientists and engineers develop and share ideas. However, most world-class Spark products cannot be easily engineered, tested and deployed just by modifying or combining notebooks. Taking a prototype to production with high quality typically involves proper software engineering.
Moving a Fraud-Fighting Random Forest from scikit-learn to Spark with MLlib, ... - Databricks
This talk describes migrating a large random forest classifier from scikit-learn to Spark's MLlib. We cut training time from 2 days to 2 hours, reduced failed runs, and now track experiments better with MLflow. Kount provides certainty in digital interactions like online credit card transactions. One of our scores uses a random forest classifier with 250 trees and 100,000 nodes per tree. We used scikit-learn to train on 60 million samples that each contained over 150 features. The in-memory requirements exceeded 750 GB, training took 2 days, and the process was not robust to disruptions in our database or training execution.
To migrate the workflow to Spark, we built a 6-node cluster with HDFS, providing 1.35 TB of RAM and 484 cores. Using MLlib and parallelization, the training time for our random forests is now less than 2 hours. Training data stays in our production environment, which used to require a deploy cycle to move locally-developed code onto our training server. The new implementation uses Jupyter notebooks for remote development with server-side execution. MLflow tracks all input parameters, code, and the git revision number, while the performance metrics and the model itself are retained as experiment artifacts.
The new workflow is also robust to service disruption. Our training pipeline begins by pulling from a Vertica database. Originally, this single connection took over 8 hours to complete, with any problem causing a restart. Using sqoop and multiple connections, we now pull the data in 45 minutes. The old technique used volatile storage and required pulling the data anew for each experiment; now, we pull the data from Vertica one time and then reload it much faster from HDFS. While a significant undertaking, moving to the Spark ecosystem converted an ad hoc and hands-on training process into a fully repeatable pipeline that meets regulatory and business goals for traceability and speed.
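To make the new training job concrete, here is a minimal sketch, not Kount's actual code, of fitting a Spark MLlib random forest while tracking the run with MLflow; the HDFS path, column names, and the maxDepth setting are illustrative assumptions.

import mlflow
import mlflow.spark
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rf-migration-sketch").getOrCreate()

# Reload training data from HDFS after the one-time pull from the database.
df = spark.read.parquet("hdfs:///training/fraud_samples.parquet")  # assumed path
feature_cols = [c for c in df.columns if c != "label"]
assembled = VectorAssembler(inputCols=feature_cols, outputCol="features").transform(df)

with mlflow.start_run():
    rf = RandomForestClassifier(labelCol="label", featuresCol="features",
                                numTrees=250, maxDepth=20, seed=42)
    mlflow.log_params({"numTrees": 250, "maxDepth": 20})  # reproducible inputs
    model = rf.fit(assembled)
    mlflow.spark.log_model(model, "model")  # retain the model as a run artifact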
Speaker: Josh Johnston
SparkML: Easy ML Productization for Real-Time Bidding - Databricks
dataxu bids on ads in real-time on behalf of its customers at the rate of 3 million requests a second and trains on past bids to optimize future bids. Our system trains thousands of advertiser-specific models and runs over multi-terabyte datasets. In this presentation we will share the lessons learned from our transition towards a fully automated Spark-based machine learning system and how this has drastically reduced the time to get a research idea into production. We'll also share how we:
- continually ship models to production
- train models in an unattended fashion with auto-tuning capabilities
- tune and overbook cluster resources for maximum performance
- ported our previous ML solution into Spark
- evaluate the performance of high-rate bidding models
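As a sketch of what unattended auto-tuning looks like in Spark ML, the snippet below uses CrossValidator over a small parameter grid; the column names, label, grid values, and train_df DataFrame are assumptions for illustration, not dataxu's actual pipeline.

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Assumed feature and label columns for a per-advertiser win-rate model.
assembler = VectorAssembler(inputCols=["bid_price", "ctr", "hour"], outputCol="features")
lr = LogisticRegression(labelCol="won", featuresCol="features")
pipeline = Pipeline(stages=[assembler, lr])

grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.01, 0.1])
        .addGrid(lr.elasticNetParam, [0.0, 0.5])
        .build())

cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(labelCol="won"),
                    numFolds=3)
# best_model = cv.fit(train_df).bestModel  # train_df: an assumed DataFrame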
Speakers: Maximo Gurmendez, Javier Buquet
Sanjeev Satheesh, Research Scientist, Baidu at The AI Conference 2017 - MLconf
Sanjeev Satheesh leads the Deep Speech team at Baidu’s Silicon Valley AI Lab. Baidu SVAIL is focused on developing hard AI technologies to impact hundreds of millions of people.
The Story of End to End Models in Deep Learning
The past few years have seen the explosive entrance of end-to-end deep learning models in computer vision, speech recognition, machine translation, text-to-speech, and other fields. In this talk, we look at this trend to identify what has worked well, and try to make some predictions for the future based on the next set of unsolved problems.
Looking into the Future: Using Google's Prediction API - Justin Grammens
We all would like to predict the future at some point in our lives. Well, thanks to Google, we can now be one step closer! This talk will give an overview of what the Google Prediction API is and how you can use it to analyze data sets, discuss its strengths and weaknesses, and run open data sets through the system, covering both regression and categorization models.
Common Problems in Hyperparameter Optimization - SigOpt
Originally given at MLConf NYC 2017.
All large machine learning pipelines have tunable parameters, commonly referred to as hyperparameters. Hyperparameter optimization is the process by which we find the values for these parameters that cause our system to perform the best. SigOpt provides a Bayesian optimization platform that is commonly used for hyperparameter optimization, and I’m going to share some of the common problems we’ve seen when integrating into machine learning pipelines.
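To make the problem concrete, here is a minimal sketch of the search loop that hyperparameter optimization automates, using scikit-learn's randomized search as a generic stand-in; SigOpt's own Bayesian optimization API is not shown, and the model and parameter ranges are arbitrary illustrative choices.

from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions={
        "n_estimators": randint(50, 300),    # the tunable hyperparameters
        "learning_rate": uniform(0.01, 0.3),
        "max_depth": randint(2, 6),
    },
    n_iter=20, cv=3, random_state=0)
search.fit(X, y)  # a Bayesian optimizer would choose these trials adaptively
print(search.best_params_, search.best_score_)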
Bootstrapping of PySpark Models for Factorial A/B Tests - Databricks
A/B testing, i.e., measuring the impact of proposed variants of, for example, e-commerce websites, is fundamental for increasing conversion rates and other key business metrics.
We have developed a solution that makes it possible to run dozens of simultaneous A/B tests, obtain conclusive results sooner, and get results that are more interpretable than bare statistical significance: the probability that a change has a positive effect, how much revenue is at risk, and so on.
To compute those metrics, we need to estimate the posterior distributions of the metrics, which are computed using Generalized Linear Models (GLMs). Since we process gigabytes of data, we use a PySpark implementation, which, however, does not provide standard errors for the coefficients. We therefore use bootstrapping to estimate the distributions.
In this talk, I’ll describe how we’ve implemented parallelization of an already parallelized GLM computation to be able to scale this computation horizontally over a large cluster in Databricks and describe various tweaks and how they’ve improved the performance.
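A minimal sketch of the bootstrapping idea, assuming a Gaussian GLM and invented column names rather than the production implementation: each resample is refit with Spark's own parallelism, and the collected coefficients approximate their sampling distribution.

import numpy as np
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import GeneralizedLinearRegression

assembler = VectorAssembler(inputCols=["variant_a", "variant_b"], outputCol="features")
glm = GeneralizedLinearRegression(labelCol="revenue", featuresCol="features",
                                  family="gaussian", link="identity")

def bootstrap_coefficients(df, n_resamples=100):
    """Refit the GLM on resampled data; each fit is parallelized by Spark."""
    coefs = []
    for i in range(n_resamples):
        resampled = df.sample(withReplacement=True, fraction=1.0, seed=i)
        coefs.append(glm.fit(assembler.transform(resampled)).coefficients.toArray())
    return np.array(coefs)

# coefs = bootstrap_coefficients(ab_test_df)  # ab_test_df: an assumed DataFrame
# np.percentile(coefs, [2.5, 97.5], axis=0) gives a 95% interval per coefficient.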
Detecting Financial Fraud at Scale with Machine Learning - Databricks
Detecting fraudulent patterns at scale is a challenge given the massive amounts of data to sift through, the complexity of constantly evolving techniques, and the very small number of actual examples of fraudulent behavior. In finance, added security concerns and the importance of explaining how fraudulent behavior was identified further increase the difficulty of the task. Legacy systems rely on rule-based detection that is difficult to implement and run at scale; the resulting code is complex and brittle, making it difficult to update to keep up with new threats.
In this talk, we will go over how to convert a rule-based financial fraud detection program to use machine learning on Spark as part of a scalable, modular solution. We will examine how to identify appropriate features and labels and how to create a feedback loop that will allow the model to evolve and improve over time. We will also look at how MLflow may be leveraged throughout this effort for experiment tracking and model deployment.
Specifically, we will discuss:
- How to create a fraud-detection data pipeline
- How to leverage a framework for building features from large datasets
- How to create modular code to re-use and maintain new machine learning models
- How to choose appropriate models and algorithms for a given fraud-detection problem
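As a sketch of the modular pipeline shape this implies, with invented feature names and an assumed labeled DataFrame (not the session's actual code): feature assembly, a tree-based classifier, and MLflow experiment tracking around the fit.

import mlflow
import mlflow.spark
from pyspark.ml import Pipeline
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.feature import VectorAssembler

feature_cols = ["amount", "merchant_risk", "txn_per_hour"]  # invented features
pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=feature_cols, outputCol="features"),
    GBTClassifier(labelCol="is_fraud", featuresCol="features", maxIter=50),
])

with mlflow.start_run(run_name="fraud-gbt"):
    mlflow.log_param("maxIter", 50)
    model = pipeline.fit(labeled_txns)      # labeled_txns: an assumed DataFrame
    mlflow.spark.log_model(model, "model")  # artifact for later deployment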
Augmenting Machine Learning with Databricks Labs AutoML Toolkit - Databricks
Instead of better understanding and optimizing their machine learning models, data scientists spend a majority of their time training and iterating through different models, even in cases where the data is reliable and clean. Important aspects of creating an ML model include (but are not limited to) data preparation, feature engineering, identifying the correct models, training (and continuing to train), and optimizing those models. This process can be (and often is) laborious and time-consuming.
In this session, we will explore this process and then show how the AutoML Toolkit (from Databricks Labs) can significantly simplify and optimize machine learning. We will demonstrate all of this on financial loan risk data, with code snippets and notebooks that will be free to download.
Using MLOps to Bring ML to Production/The Promise of MLOps - Weaveworks
In this final Weave Online User Group of 2019, David Aronchick asks: have you ever struggled with having different environments to build, train, and serve ML models, and with orchestrating between them? While DevOps and GitOps have gained huge traction in recent years, many customers struggle to apply these practices to ML workloads. This talk will focus on the ways MLOps has helped to effectively infuse AI into production-grade applications by establishing practices around model reproducibility, validation, versioning/tracking, and safe/compliant deployment. We will also talk about the direction of MLOps as an industry, and how we can use it to move faster, with more stability, than ever before.
The recording of this session is on our YouTube Channel here: https://youtu.be/twsxcwgB0ZQ
Speaker: David Aronchick, Head of Open Source ML Strategy, Microsoft
Bio: David leads Open Source Machine Learning Strategy at Azure. This means he spends most of his time helping humans to convince machines to be smarter. He is only moderately successful at this. Previously, David led product management for Kubernetes at Google, launched GKE, and co-founded the Kubeflow project. David has also worked at Microsoft, Amazon and Chef and co-founded three startups.
Sign up for a free Machine Learning Ops Workshop: http://bit.ly/MLOps_Workshop_List
Weaveworks will cover concepts such as GitOps (operations by pull request), Progressive Delivery (canary, A/B, blue-green), and how to apply those approaches to your machine learning operations to mitigate risk.
Automated Hyperparameter Tuning, Scaling and Tracking - Databricks
Automated Machine Learning (AutoML) has received significant interest recently. We believe that the right automation would bring significant value and dramatically shorten time-to-value for data science teams. Databricks is automating the Data Science and Machine Learning process through a combination of product offerings, partnerships, and custom solutions. This talk will focus on how Databricks can help automate hyperparameter tuning.
For both traditional Machine Learning and modern Deep Learning, tuning hyperparameters can dramatically increase model performance and improve training times. However, tuning can be a complex and expensive process. In this talk, we'll start with a brief survey of the most popular techniques for hyperparameter tuning (e.g., grid search, random search, and Bayesian optimization). We will then discuss open source tools that implement each of these techniques, helping to automate the search over hyperparameters.
Finally, we will discuss and demo improvements we built for these tools in Databricks, including integration with MLflow:
- Apache PySpark MLlib integration with MLflow for automatically tracking tuning
- Hyperopt integration with Apache Spark to distribute tuning, and with MLflow for automatic tracking
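A minimal sketch of the second integration, assuming a small scikit-learn dataset for illustration: Hyperopt proposes hyperparameters, SparkTrials fans the evaluations out over the cluster, and MLflow records the outcome.

import mlflow
from hyperopt import SparkTrials, fmin, hp, tpe
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

def objective(params):
    clf = RandomForestClassifier(n_estimators=int(params["n_estimators"]),
                                 max_depth=int(params["max_depth"]),
                                 random_state=0)
    # Hyperopt minimizes, so return the negated cross-validated accuracy.
    return -cross_val_score(clf, X, y, cv=3).mean()

space = {"n_estimators": hp.quniform("n_estimators", 50, 300, 25),
         "max_depth": hp.quniform("max_depth", 2, 12, 1)}

with mlflow.start_run():
    best = fmin(fn=objective, space=space, algo=tpe.suggest,
                max_evals=32, trials=SparkTrials(parallelism=4))
    mlflow.log_params(best)  # record the winning configuration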
Recording and notebooks will be provided after the webinar so that you can practice at your own pace.
Presenters
Joseph Bradley, Software Engineer, Databricks
Joseph Bradley is a Software Engineer and Apache Spark PMC member working on Machine Learning at Databricks. Previously, he was a postdoc at UC Berkeley after receiving his Ph.D. in Machine Learning from Carnegie Mellon in 2013.
Yifan Cao, Senior Product Manager, Databricks
Yifan Cao is a Senior Product Manager at Databricks. His product area spans ML/DL algorithms and the Databricks Runtime for Machine Learning. Prior to Databricks, Yifan worked on two machine learning products, applying NLP to find metadata and applying machine learning to predict equipment failures. He helped build the products from the ground up to multi-million dollars in ARR. Yifan started his career as a researcher in quantum computing. He received his B.S. from UC Berkeley and his Master's from MIT.
MLflow: Infrastructure for a Complete Machine Learning Life Cycle - Databricks
ML development brings many new complexities beyond the traditional software development lifecycle. Unlike in traditional software development, ML developers want to try multiple algorithms, tools and parameters to get the best results, and they need to track this information to reproduce work. In addition, developers need to use many distinct systems to productionize models. To address these problems, many companies are building custom “ML platforms” that automate this lifecycle, but these platforms are limited to each company’s internal infrastructure.
In this talk, we will present MLflow, a new open source project from Databricks that aims to design an open ML platform where organizations can use any ML library and development tool of their choice to reliably build and share ML applications. MLflow introduces simple abstractions to package reproducible projects, track results, and encapsulate models that can be used with many existing tools, accelerating the ML lifecycle for organizations of any size.
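The tracking abstraction amounts to a few lines of code; here is a minimal sketch using the MLflow tracking API, with an arbitrary scikit-learn model standing in for any library of your choice.

import mlflow
import mlflow.sklearn
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge

X, y = load_diabetes(return_X_y=True)

with mlflow.start_run():
    alpha = 0.5
    model = Ridge(alpha=alpha).fit(X, y)
    mlflow.log_param("alpha", alpha)                  # reproduce the run later
    mlflow.log_metric("train_r2", model.score(X, y))  # compare across runs
    mlflow.sklearn.log_model(model, "model")          # reusable model artifact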
ML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning Infrastructure - Fei Chen
ML Platform meetups are quarterly meetups where we discuss and share advanced technology for machine learning infrastructure. Companies involved include Airbnb, Databricks, Facebook, Google, LinkedIn, Netflix, Pinterest, Twitter, and Uber.
How to Utilize MLflow and Kubernetes to Build an Enterprise ML Platform - Databricks
In large enterprises, large solutions are sometimes required to tackle even the smallest tasks, and ML is no different. At Comcast we are building a comprehensive, configuration-based, continuously integrated and deployed platform for data pipeline transformations, model development, and deployment. This is accomplished using a range of tools and frameworks such as Databricks, MLflow, Apache Spark, and others. With a Databricks environment used by hundreds of researchers and petabytes of data, scale is critical to Comcast, so making it all work together in a frictionless experience is a high priority.
The platform consists of a number of components: an abstraction for data pipelines and transformations to allow our data scientists the freedom to combine the most appropriate algorithms from different frameworks, experiment tracking, project and model packaging using MLflow, and model serving via the Kubeflow environment on Kubernetes. The architecture, progress, and current state of the platform will be discussed, as well as the challenges we had to overcome to make this platform work at Comcast scale. As a machine learning practitioner, you will gain knowledge in: an example of data pipeline abstraction; ways to package and track your ML projects and experiments at scale; and how Comcast uses Kubeflow on Kubernetes to bring everything together.
Keynote: Artificial Intelligence Methods for Time Series Forecasting and Classification of Real-Time IoT Sensor Data Streams, Romeo Kienzler, Chief Data Scientist - IBM Watson IoT WW, IBM Academy of Technology
GPT-2: Language Models are Unsupervised Multitask Learners - Young Seok Kim
A review of the paper "Language Models are Unsupervised Multitask Learners" (GPT-2) by Alec Radford et al.
Paper link: https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
YouTube presentation: https://youtu.be/f5zULULWUwM
(Slides are written in English, but the presentation is done in Korean)
Creating a new language to support open innovation - Mike Hucka
Presentation given on 19 August 2013 at a BioBriefings meeting of the BioMelbourne Network (http://www.biomelbourne.org/events/view/289) in Melbourne, Australia.
Automated Construction of Node Software Using Attributes in a Ubiquitous Sens... - JM code group
Sensors 2010, 10(9), 8663-8682; doi:10.3390/s100908663
Article
Automated Construction of Node Software Using Attributes in a Ubiquitous Sensor Network Environment
Woojin Lee, Juil Kim and JangMook Kang*
SCI-grade journal (computer and network fields)
http://www.mdpi.com/1424-8220/10/9/8663
Performance Comparison between PyTorch and MindSpore - ijdms
Deep learning is now well used in many fields. However, training neural networks involves large amounts of data, which has led to many deep learning frameworks that aim to serve practitioners with services that are more convenient to use and perform better. MindSpore and PyTorch are both deep learning frameworks: MindSpore is owned by HUAWEI, while PyTorch is owned by Facebook. Some people think that HUAWEI's MindSpore performs better than Facebook's PyTorch, which leaves deep learning practitioners confused about the choice between the two. In this paper, we perform analytical and experimental analysis to compare the training speed of MindSpore and PyTorch on a single GPU. To ensure that our survey is as comprehensive as possible, we carefully selected neural networks in 2 main domains, covering computer vision and natural language processing (NLP). The contribution of this work is twofold. First, we conduct detailed benchmarking experiments on MindSpore and PyTorch to analyze the reasons for their performance differences. Second, this work provides guidance for end users choosing between these two frameworks.
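A minimal sketch of the kind of single-GPU training-speed measurement the paper describes, shown for the PyTorch side with an arbitrary model, batch size, and iteration count; the MindSpore side would mirror it.

import time
import torch
import torch.nn as nn
import torchvision.models as models

device = "cuda" if torch.cuda.is_available() else "cpu"
model = models.resnet18(num_classes=10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
x = torch.randn(64, 3, 224, 224, device=device)  # synthetic batch
y = torch.randint(0, 10, (64,), device=device)

def train_step():
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    optimizer.step()

for _ in range(5):               # warm-up steps, excluded from timing
    train_step()
if device == "cuda":
    torch.cuda.synchronize()     # GPU kernels run asynchronously
start = time.perf_counter()
for _ in range(50):
    train_step()
if device == "cuda":
    torch.cuda.synchronize()
print(f"{(time.perf_counter() - start) / 50 * 1000:.1f} ms per training step")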
Intro to Deep Learning with Keras - using TensorFlow backend - Amin Golnari
An overview of deep learning; installing Keras on Windows and how to use it; creating a sequential (multilayer) network and training it on the MNIST handwritten digits data; and visualization and optimization in Keras, with examples.
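A minimal sketch of the deck's running example, a sequential network trained on the MNIST digits, written here against the tf.keras API.

from tensorflow import keras
from tensorflow.keras import layers

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0  # scale pixels to [0, 1]

model = keras.Sequential([
    layers.Flatten(input_shape=(28, 28)),
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),  # one output per digit class
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, validation_data=(x_test, y_test))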
Similar to Neel Sundaresan - Teaching a machine to code (20)
Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments... - MLconf
Understanding Human Impact: Social and Equity Assessments for AI Technologies
Social and Equity Impact Assessments have broad applications, but they can be a useful tool to explore and mitigate machine learning fairness issues. They can be applied to product-specific questions as a way to generate insights and learnings about users, as well as about broader impacts on society, resulting from the deployment of new and emerging technologies.
In this presentation, my goal is to advocate for and highlight the need for community and external stakeholder engagement in order to develop a new knowledge base and understanding of the human and social consequences of algorithmic decision making, and to introduce principles, methods, and processes for these types of impact assessments.
Ted Willke - The Brain’s Guide to Dealing with Context in Language Understanding - MLconf
The Brain’s Guide to Dealing with Context in Language Understanding
Like the visual cortex, the regions of the brain involved in understanding language represent information hierarchically. But whereas the visual cortex organizes things into a spatial hierarchy, the language regions encode information into a hierarchy of timescales. This organization is key to our uniquely human ability to integrate semantic information across narratives. More and more, deep learning-based approaches to natural language understanding embrace models that incorporate contextual information at varying timescales. This has not only led to state-of-the-art performance on many difficult natural language tasks, but also to breakthroughs in our understanding of brain activity.
In this talk, we will discuss the important connection between language understanding and context at different timescales. We will explore how different deep learning architectures capture timescales in language and how closely their encodings mimic the brain. Along the way, we will uncover some surprising discoveries about what depth does and doesn’t buy you in deep recurrent neural networks. And we’ll describe a new, more flexible way to think about these architectures and ease design space exploration. Finally, we’ll discuss some of the exciting applications made possible by these breakthroughs.
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re... - MLconf
Applying Computer Vision to Reduce Contamination in the Recycling Stream
With China’s recent refusal of most foreign recyclables, North American waste haulers are scrambling to figure out how to make on-shore recycling cost-effective in order to continue providing recycling services. Recyclables that were once being shipped to China for manual sorting are now primarily being redirected to landfills or incinerators. Without a solution, a nearly $5 billion annual recycling market could come to a halt.
Purity in the recycling stream is key to this effort as contaminants in the stream can increase the cost of operations, damage equipment and reduce the ability to create pure commodities suitable for creating recycled goods. This market disruption as a result of China’s new regulations, however, provides us the chance to re-examine and improve our current disposal & collection habits with modern monitoring & artificial intelligence technology.
Using images from our in-dumpster cameras, Compology has developed an ML-based process that helps identify, measure, and alert on contaminants in recycling containers before they are picked up, helping keep the recycling stream clean.
Our convolutional neural network flags potential instances of contamination inside a dumpster, enabling garbage haulers to know which containers have the wrong type of material inside. This allows them to provide targeted, timely education, and when appropriate, assess fines, to improve recycling compliance at the businesses and residences they serve, helping keep recycling services financially viable.
In this presentation, we will walk through our ML-based contamination measurement and scoring process by showing how Waste Management, a national waste hauler, has achieved a 57% contamination reduction across nearly 2,000 containers over six months. This progress shows significant strides towards financially viable recycling services.
Igor Markov - Quantum Computing: a Treasure Hunt, not a Gold Rush - MLconf
Quantum Computing: a Treasure Hunt, not a Gold Rush
Quantum computers promise a significant step up in computational power over conventional computers, but also suffer a number of counterintuitive limitations, both in their computational model and in leading lab implementations. In this talk, we review how quantum computers compete with conventional computers and how conventional computers try to hold their ground. Then we outline what stands in the way of successful quantum ML applications.
Josh Wills - Data Labeling as Religious Experience - MLconf
Data Labeling as Religious Experience
One of the most common places to deploy a production machine learning system is as a replacement for a legacy rules-based system that is having a hard time keeping up with new edge cases and requirements. I'll walk through the process and tooling we used to design, train, and deploy a model to replace a set of static rules we had for handling invite spam at Slack, talk about what we learned, and discuss some problems to solve in order to make these migrations easier for everyone.
Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai... - MLconf
Project GaitNet: Ushering in the ImageNet moment for human Gait kinematics
The emergence of the upright human bipedal gait can be traced back 4 to 2.8 million years, to the now extinct hominin Australopithecus afarensis. Fine-grained analysis of gait using the modern MEMS sensors found on all smartphones not only reveals a lot about a person’s orthopedic and neuromuscular health status, but also carries enough idiosyncratic clues to be harnessed as a passive biometric. While many siloed attempts to model bipedal gait sensor data have been made by the machine learning community, these were done with small datasets, often collected in restricted academic environs. In this talk, we will introduce the ImageNet moment for human gait analysis by presenting 'Project GaitNet', the largest planet-scale motion-sensor-based human bipedal gait dataset ever curated. We’ll also present the associated state-of-the-art results in classifying humans using novel deep neural architectures, and the related success stories we have enjoyed in transfer-learning into disparate domains of human kinematics analysis.
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea... - MLconf
Machine Learning Methods in Detecting Alzheimer’s Disease from Speech and Language
Alzheimer's disease affects millions of people worldwide, and it is important to predict the disease as early and as accurately as possible. In this talk, I will discuss the development of novel ML models that help classify healthy people versus those who develop Alzheimer's, using short samples of human speech. As input to the model, features of different modalities are extracted from speech audio samples and transcriptions: (1) syntactic measures, such as production rules extracted from syntactic parse trees; (2) lexical measures, such as features of lexical richness and complexity and lexical norms; and (3) acoustic measures, such as standard Mel-frequency cepstral coefficients. I will present an ML model that detects cognitive impairment by reaching agreement among modalities. The resulting model is able to achieve state-of-the-art performance in both a supervised and a semi-supervised manner, using manual transcripts of human speech. Additionally, I will discuss potential limitations of any fully-automated speech-based Alzheimer's disease detection model, focusing mostly on the analysis of the impact of a not-so-accurate automatic speech recognition (ASR) system on classification performance. To illustrate this, I will present experiments with controlled amounts of artificially generated ASR errors and explain how deletion errors affect Alzheimer's detection performance the most, due to their impact on the features of syntactic and lexical complexity.
Meghana Ravikumar - Optimized Image Classification on the Cheap - MLconf
Optimized Image Classification on the Cheap
In this talk, we anchor on building an image classifier trained on the Stanford Cars dataset to evaluate two approaches to transfer learning, fine-tuning and feature extraction, and the impact of hyperparameter optimization on these techniques. Once we define the most performant transfer learning technique for Stanford Cars, we will double the size of the dataset through image augmentation to boost the classifier’s performance. We will use Bayesian optimization to learn the hyperparameters associated with image transformations, using the downstream image classifier’s performance as the guide. In conjunction with model performance, we will also focus on the features of these augmented images and the downstream implications for our image classifier.
To both maximize model performance on a budget and explore the impact of optimization on these methods, we apply a particularly efficient implementation of Bayesian optimization to each of these architectures in this comparison. Our goal is to draw on a rigorous set of experimental results that can help us answer the question: how can resource-constrained teams make trade-offs between efficiency and effectiveness using pre-trained models?
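For concreteness, here is a minimal sketch, not the talk's actual code, of the two transfer-learning variants being compared, using a torchvision ResNet; 196 is the number of classes in Stanford Cars.

import torch.nn as nn
import torchvision.models as models

def feature_extraction_model(num_classes=196):
    """Freeze the pretrained backbone; only the new head is trained."""
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    for p in model.parameters():
        p.requires_grad = False
    model.fc = nn.Linear(model.fc.in_features, num_classes)  # stays trainable
    return model

def fine_tuning_model(num_classes=196):
    """Replace the head but leave every layer trainable end to end."""
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model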
Noam Finkelstein - The Importance of Modeling Data Collection - MLconf
The Importance of Modeling Data Collection
Data sets used in machine learning are often collected in a systematically biased way - certain data points are more likely to be collected than others. We call this "observation bias". For example, in health care, we are more likely to see lab tests when the patient is feeling unwell than otherwise. Failing to account for observation bias can, of course, result in poor predictions on new data. By contrast, properly accounting for this bias allows us to make better use of the data we do have.
In this presentation, we discuss practical and theoretical approaches to dealing with observation bias. When the nature of the bias is known, there are simple adjustments we can make to nonparametric function estimation techniques, such as Gaussian Process models. We also discuss the scenario where the data collection model is unknown. In this case, there are steps we can take to estimate it from observed data. Finally, we demonstrate that having a small subset of data points that are known to be collected at random - that is, in an unbiased way - can vastly improve our ability to account for observation bias in the rest of the data set.
My hope is that attendees of this presentation will be aware of the perils of observation bias in their own work, and be equipped with tools to address it.
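As one simple illustration of such an adjustment, here is a generic inverse-probability-weighting sketch under an assumed, known collection model; this is a stand-in for the kinds of corrections discussed, not necessarily the speaker's method.

import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 5000)
y = np.sin(x) + 0.1 * rng.normal(size=x.size)

# Assumed, known collection model: points with larger y are likelier to be
# observed (as when lab tests are ordered mainly when patients feel unwell).
p_observe = 1.0 / (1.0 + np.exp(-3.0 * y))
observed = rng.uniform(size=x.size) < p_observe

# Weight each observed point by the inverse of its observation probability,
# so under-collected regions of the input space regain their influence.
weights = 1.0 / p_observe[observed]
model = KernelRidge(kernel="rbf", gamma=0.5).fit(
    x[observed].reshape(-1, 1), y[observed], sample_weight=weights)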
The Uncanny Valley of ML
Every so often, the conundrum of the Uncanny Valley re-emerges as advanced technologies evolve from clearly experimental products to refined accepted technologies. We have seen its effects in robotics, computer graphics, and page load times. The debate of how to handle the new technology detracts from its benefits. When machine learning is added to human decision systems a similar effect can be measured in increased response time and decreased accuracy. These systems include radiology, judicial assignments, bus schedules, housing prices, power grids and a growing variety of applications. Unfortunately, the Uncanny Valley of ML can be hard to detect in these systems and can lead to degraded system performance when ML is introduced, at great expense. Here, we'll introduce key design principles for introducing ML into human decision systems to navigate around the Uncanny Valley and avoid its pitfalls.
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks - MLconf
Deep Learning Architectures for Semantic Relation Detection Tasks
Recognizing and distinguishing specific semantic relations from other types of semantic relations is an essential part of language understanding systems. Identifying expressions with similar and contrasting meanings is valuable for NLP systems which go beyond recognizing semantic relatedness and need to identify specific semantic relations. In this talk, I will first present novel techniques for creating the labelled datasets required for training deep learning models to classify semantic relations between phrases. I will then present various neural network architectures that integrate morphological features into integrated path-based and distributional relation detection algorithms, and demonstrate that this model outperforms state-of-the-art models in distinguishing semantic relations and is capable of efficiently handling multi-word expressions.
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D... - MLconf
Building an Incrementally Trained, Local Taste Aware, Global Deep Learned Recommender System Model
At Netflix, our main goal is to maximize our members’ enjoyment of the selected show by minimizing the amount of time it takes for them to find it. We try to achieve this goal by personalizing almost all the aspects of our product -- from what shows to recommend, to how to present these shows and construct their home-pages to what images to select per show, among many other things. Everything is recommendations for us and as an applied Machine Learning group, we spend our time building models for personalization that will eventually increase the joy and satisfaction of our members. In this talk we will primarily focus our attention on a) making a global deep learned recommender model that is regional tastes and popularity aware and b) adapting this model to changing taste preferences as well as dynamic catalog availability.
We will first go through some standard recommender system models that use Matrix Factorization and Topic Models, and then compare and contrast them with more powerful, higher-capacity deep learning based models such as sequence models that use recurrent neural networks. We will show what it entails to build a global model that is aware of regional taste preferences and catalog availability. We will show how models built on the simple Maximum Likelihood principle fail to do that. We will then describe one solution that we have employed to enable global deep learned models to focus their attention on capturing regional taste preferences and a changing catalog.
In the latter half of the talk, we will discuss how we do incremental learning of deep learned recommender system models. Why do we need to do that? Everything changes with time. Users’ tastes change with time. What’s available on Netflix and what’s popular also change over time. Therefore, updating or improving recommendation systems over time is necessary to bring more joy to users. In addition to how we apply incremental learning, we will discuss some of the challenges we face involving large-scale data preparation, infrastructure setup for incremental model training, and pipeline scheduling. Incremental training enables us to serve fresher models trained on fresher and larger amounts of data. This helps our recommender system adapt nicely and quickly to catalog and taste changes, and improves overall performance.
Vito Ostuni - The Voice: New Challenges in a Zero UI World - MLconf
Vito Ostuni - The Voice: New Challenges in a Zero UI World
The adoption of voice-enabled devices has seen an explosive growth in the last few years and music consumption is among the most popular use cases. Music personalization and recommendation plays a major role at Pandora in providing a daily delightful listening experience for millions of users. In turn, providing the same perfectly tailored listening experience through these novel voice interfaces brings new interesting challenges and exciting opportunities. In this talk we will describe how we apply personalization and recommendation techniques in three common voice scenarios which can be defined in terms of request types: known-item, thematic, and broad open-ended. We will describe how we use deep learning slot filling techniques and query classification to interpret the user intent and identify the main concepts in the query.
We will also present the differences and challenges regarding evaluation of voice powered recommendation systems. Since pure voice interfaces do not contain visual UI elements, relevance labels need to be inferred through implicit actions such as play time, query reformulations or other types of session level information. Another difference is that while the typical recommendation task corresponds to recommending a ranked list of items, a voice play request translates into a single item play action. Thus, some considerations about closed feedback loops need to be made. In summary, improving the quality of voice interactions in music services is a relatively new challenge and many exciting opportunities for breakthroughs still remain. There are many new aspects of recommendation system interfaces to address to bring a delightful and effortless experience for voice users. We will share a few open challenges to solve for the future.
"Impact of front-end architecture on development cost", Viktor TurskyiFwdays
I have heard many times that architecture is not important for the front-end. Also, many times I have seen developers implement features on the front-end just by following the standard rules of a framework, thinking that this is enough to successfully launch the project, and then the project fails. How can this be prevented, and which approach should you choose? I have launched dozens of complex projects, and during the talk we will analyze which approaches have worked for me and which have not.
Epistemic Interaction - tuning interfaces to provide information for AI support - Alan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Connector Corner: Automate dynamic content and events by pushing a button - DianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
UiPath Test Automation using UiPath Test Suite series, part 4 - DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimizing testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti... - Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Search and Society: Reimagining Information Access for Radical FuturesBhaskar Mitra
The field of information retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs, while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build, inspired by diverse, explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies need to be explicitly articulated, and we need to develop theories of change in the context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation introduction
UI automation sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio, using data from Sectrio's cyber threat intelligence farming facilities spread across more than 85 cities around the world. In addition, Sectrio runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Neel Sundaresan - Teaching a machine to code
1. Teaching Machines to Code
Neel Sundaresan
Microsoft Corp.
MLConf 2019 NY
2. It's all about Data
• ~19M software developers in the world (source: TechRepublic, Ranger 2013)
  • 2/3 professionals, the rest hobbyists
• 29 million IT/ICT professionals
• Growing OSS data through GitHub, StackOverflow, etc.
• 10 years of GitHub:
  • 10M users
  • 26M projects
  • 400M commits
  • ~7M committers
  • ~1M active users and ~250K monthly new users
  • ~800K new projects per month
4. New Opportunities
• Take advantage of large-scale data, advances in AI algorithms, and the availability of distributed systems, cloud, and powerful compute (GPUs) to revolutionize developer productivity
5. Let's first start with Data…
• D. E. Knuth (1971) analyzed about 800 Fortran programs and found that:
  • 95% of loops increment the index by 1
  • 85% of loops had 5 statements or fewer
  • 53% of loops were singly nested
• More recent analysis (Allamanis et al.) of 25 MLOC showed the following stats:
  • 90% of loops have < 15 lines; 90% have no nesting; very simple control structures
  • 50 classes of loop idioms cover 50% of concrete loops
• Benefits:
  • Data-driven frameworks for code refactoring
  • Program optimization opportunities
  • Language design opportunities
6. Statistical model of code
• Lexical/code generative models (tokenizers)
  • E.g. sequence-based models (n-gram models from NLP), sequence-to-sequence character models in RNNs/LSTMs, sparse pointer-based neural models for Python
• Neural models are superior to n-gram models:
  • more expensive to train and execute, and need a lot more data
  • perform much better because one can model long-range declare-use scenarios
  • can catch patterns across contexts better than n-grams (sequences of code that are similar but with changed variables – the "sentiment" of the code)
• Word2Vec; for code more recently: Code2Vec, Code2Seq
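To ground the contrast above, here is a minimal sketch (illustrative only, not the system described in this talk) of an n-gram sequence model over code tokens:

from collections import Counter, defaultdict

def train_trigram(token_streams):
    """Count (t1, t2) -> next-token frequencies over tokenized files."""
    counts = defaultdict(Counter)
    for tokens in token_streams:
        padded = ["<s>", "<s>"] + tokens
        for a, b, c in zip(padded, padded[1:], padded[2:]):
            counts[(a, b)][c] += 1
    return counts

def predict(counts, prev2, prev1, k=5):
    """Top-k most likely next tokens given the last two tokens."""
    return [tok for tok, _ in counts[(prev2, prev1)].most_common(k)]

model = train_trigram([["x", "=", "list", "(", ")", ";"],
                       ["y", "=", "list", "(", ")", ";"]])
print(predict(model, "=", "list"))   # ['(']

A neural model replaces the count table with learned embeddings and a recurrent state, which is what lets it carry long-range declare-use information that a fixed-window n-gram cannot.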
7. Statistical model of code
• Representational models (abstract syntax trees)
  • These models represent code better than sequence models, but are more expensive
  • There is work on running LSTMs over such representations (for limited program synthesis applications)
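As a concrete illustration of the representational view, the sketch below uses Python's standard ast module on a toy snippet to expose the tree structure that such models consume instead of a flat token sequence:

import ast

snippet = "total = 0\nfor x in items:\n    total += x\n"
tree = ast.parse(snippet)
for node in ast.walk(tree):
    print(type(node).__name__)   # Module, Assign, For, AugAssign, ...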
8. Statistical model of code
• Latent models
  • Looking for hidden design patterns, programming idioms, standardized APIs, summaries, anomalies, etc.
  • Need unsupervised learning: challenging!
  • Previous research used tree substitution grammars to identify similar grammar productions (program tree fragments)
  • Graph-based representations used to identify common API usage
9. Application of code models
• Recommenders. Example: code completion in IDEs
  • Instead of using alphabetical or default orders, statistical learning can rank completions by likelihood
  • Early work by Bruch et al.
  • Bayesian graphical models using structure for predicting the next call, by Proksch, integrated into the Eclipse IDE
• How to evaluate the recommender systems?
  • Keystrokes saved? Overall productivity? Engagement models? Reduced bugs?
10. Inferring coding conventions
• Coding conventions for better maintenance
  • How to format code
  • Variable and class naming conventions (Allamanis et al.)
  • An alternative to linter rules…
11. Inferring bugs
• Identifying buggy code is like anomaly detection
  • Buggy code has unusual patterns whose probabilities differ markedly from normal code
• N-gram language-model-based complexity measures have shown results comparable to tools like FindBugs
  • Even syntax-error reporting closest to where the error occurs
• Since problematic code is rare (like anomalies, by definition), false positives are likely and high precision is hard to achieve
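A hedged sketch of this idea: score code by its per-token cross-entropy under a simple bigram model, so that statistically "surprising" code stands out. The corpus and snippets below are toy values, not data from the talk:

import math
from collections import Counter, defaultdict

def train_bigram(streams):
    counts = defaultdict(Counter)
    for toks in streams:
        for a, b in zip(["<s>"] + toks, toks):
            counts[a][b] += 1
    return counts

def cross_entropy(counts, toks, alpha=1.0, vocab=1000):
    """Average negative log2 probability with add-alpha smoothing."""
    total = 0.0
    for a, b in zip(["<s>"] + toks, toks):
        p = (counts[a][b] + alpha) / (sum(counts[a].values()) + alpha * vocab)
        total -= math.log2(p)
    return total / len(toks)

corpus = [["if", "(", "x", "==", "0", ")"]] * 50
model = train_bigram(corpus)
print(cross_entropy(model, ["if", "(", "x", "==", "0", ")"]))   # low score
print(cross_entropy(model, ["if", "(", "x", "=", "0", ")"]))    # higher: suspicious '='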
12. Program Synthesis
• Autogenerating programs from specifications
  • With a vast number of program examples and associated metadata, attempts to match the specs to the metadata and extract matching code
• SCIgen (automatic paper generator from MIT)
  • "SCIgen is a program that generates random Computer Science research papers, including graphs, figures, and citations. It uses a hand-written context-free grammar to form all elements of the papers. Our aim here is to maximize amusement, rather than coherence." They use it to detect bogus conferences!
• Airbnb Sketch2Code (design to code)
  • A UX web design mockup to HTML using deep learning (Pix2Code)
• DeepCoder (MSR / U of Cambridge)
  • Uses inductive program synthesis: given a set of inputs/outputs, searches a space of candidate programs and finds the one that matches
  • Works for DSLs (domain-specific languages) with limited constructs, not for languages like C++
• Automatically finding patches (MIT Prophet/Genesis)
• Bayou system from Rice U.
13. A Case Study: IntelliSense (Code Completion)
20. Data Source
Number of C# repos: 2000+
Number of repos we were able to build and parse to form our dataset: 700+
Number of .cs documents in the dataset: 200K+
21. What questions can we ask of this dataset?
• How is C# used?
  1. Which are the most frequently used classes?
  2. Are there patterns in how methods of one class are used?
• How to make recommendations? Which features are useful?
  1. Will the same model and parameters work for all classes?
  2. Do we have enough data?
  3. Would the previous usage of methods from other classes help with prediction?
• When making a prediction:
  1. Which pieces of information provided by code analyzers would be helpful?
  2. What is the reasonable segment of code to look at – the entire document/function or the most recent calls?
22. How often is each class used?
Top n classes    Coverage
100              28%
300              37.5%
1,088            50%
5,986            70%
13,203           80%
30,668           90%
[Figure: Number of Invocations per Class for the Top 50 Classes (total invocations in the dataset, 0 to 45,000). The most invoked classes include string, System.Windows.Forms.Control, System.Collections.Generic.List, System.Linq.Enumerable, System.Array, System.Text.StringBuilder, and other common framework classes; popularity drops off quickly with a long tail.]
23. How often do we face the cold start problem?
Invocation composition in different class groups:
Invocation        Top 100 classes   Top 100–200 classes   Top 200–300 classes
First             14.34%            25.42%                36.41%
Second            9.63%             15.31%                17.85%
Third             6.65%             9.28%                 9.88%
Fourth            5.30%             6.81%                 6.89%
Fifth or after    64.07%            43.18%                28.98%
24. Sequence Model
• A second-order Markov chain: the probability of the current invocation depends on the two previous invocations
• Very fast to train
• Performed quite well in both offline and online testing
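A minimal sketch of such a second-order Markov chain for method completion (illustrative only: the call sequences are invented, and the frequency fallback mirrors the cold-start handling discussed on the previous slide):

from collections import Counter, defaultdict

class SecondOrderMarkov:
    def __init__(self):
        self.table = defaultdict(Counter)
        self.fallback = Counter()            # frequency model for cold start

    def fit(self, call_sequences):
        for calls in call_sequences:
            self.fallback.update(calls)
            padded = ["<s>", "<s>"] + calls
            for a, b, c in zip(padded, padded[1:], padded[2:]):
                self.table[(a, b)][c] += 1

    def suggest(self, prev2, prev1, k=5):
        ranked = self.table[(prev2, prev1)] or self.fallback
        return [m for m, _ in ranked.most_common(k)]

model = SecondOrderMarkov()
model.fit([["Open", "Read", "Close"], ["Open", "Read", "Read", "Close"]])
print(model.suggest("Open", "Read"))   # e.g. ['Close', 'Read']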
25. Modeling Method Calls: Summary
• The Sequence model performs better in both offline and online testing
• 1. Frequency Model: 1 MB model size, 38% top-1 accuracy
• 3. Sequence Model: 5.3 MB model size, 58% top-1 accuracy
[Figure: Percentage of Invocations per string method (0–40%): string.Format, string.Equals, string.IsNullOrEmpty, string.Replace, string.Trim, string.Substring, string.IndexOf, string.Contains, string.IsNullOrWhiteSpace, string.EndsWith, string.ToUpper, string.Compare…, string.LastIndexOf, string.ToCharArray, string.PadLeft, string.IndexOfAny.]
26. Our IntelliSense system
• Languages supported: C#, Python, C++, Java, XAML, TypeScript
• Platforms: VS Code, Visual Studio
• Check out this blog:
27. A Deep learning approach
• The deep learning model consumes ASTs corresponding to code snippets as input for training
• AST tokens are mapped to numeric embedding vectors, which are learned via backpropagation using Word2Vec
• The method-call receiver token is substituted with its inferred type, when available
• Local variables are optionally normalized to <var:variable type>
Pipeline: (1) source code snippet → (2) extract training sequences with an AST parser → (3) vectorize → (4) embed, building the code embedding matrix.
Example: the tokens
…. "loss", "=", "tf", ".", "reduce_sum", "(", "tf", ".", "square", "(", "linear_model", "-", "y", ")", ")", "\n", "optimizer", "=", "tf", ".", "tensorflow.train", "."
become an id array such as
array([11, 9, 4, 12, 11, 9, 8, 13, 14, 15, 16, 17, 18, 19, 20, 21, 14, 22, 16, 11, 9, 4, 12, 11, 9, 8, 13, 14, 23, 16, 11, 9, 3, 12, 11, 9, 5, 12, 15, 24, 22, 13, 13, 14, 25, 16, 11, 9, 7, 9, 6], dtype=int32)
and then a float embedding tensor
array([[[-0.00179027, 0.01935565, -0.00102201, ..., -0.11528983, 0.02137219, 0.08332191], ..., [-0.04104977, 0.04417963, -0.01034168, ..., 0.04209893, 0.00140189, -0.10478071]]], dtype=float32)
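A toy sketch of steps 2–4 of this pipeline, assuming PyTorch and a tiny vocabulary (the token list echoes the slide's example; ids and dimensions are illustrative, not the production values):

import torch
import torch.nn as nn

tokens = ["loss", "=", "tf", ".", "reduce_sum", "("]
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}

ids = torch.tensor([[vocab[t] for t in tokens]])        # shape (1, seq_len)
embed = nn.Embedding(num_embeddings=len(vocab), embedding_dim=150)
vectors = embed(ids)                                    # shape (1, seq_len, 150)
print(vectors.shape)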
28. Neural network architecture
The task of method completion is predicting a token m* conditioned on a sequence of input tokens c_t, t = 0, …, T, corresponding to the terminal nodes of the AST for a code snippet ending in a terminal ".".
x_t = L c_t, where L is the word embedding matrix of size d_x × |V|, d_x is the word embedding dimension, and V is the vocabulary.
h_t = f(x_t, h_{t-1}), where f is the stacked LSTM taking the previous hidden state and current input and producing the next hidden state.
P(m|C) = y_T = softmax(W h_T + b), where W is the output projection matrix and b is the bias.
m* = argmax_m P(m|C)
The LSTM has 2 layers with 100 hidden units each, with recurrent dropout and L2 regularization applied.
[Figure: code snippets → code embedding → stacked LSTM → linear layer → softmax prediction over y_0 … y_|V|.]
Ref: Svyatkovskiy, Fu, Sundaresan, Zhao
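A sketch of this architecture under stated assumptions (PyTorch, toy vocabulary size; not the production model):

import torch
import torch.nn as nn

class CompletionLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim=150, hidden=100, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)        # x_t = L c_t
        self.lstm = nn.LSTM(embed_dim, hidden, num_layers=layers,
                            dropout=0.2, batch_first=True)      # h_t = f(x_t, h_{t-1})
        self.out = nn.Linear(hidden, vocab_size)                # W h_T + b

    def forward(self, ids):
        x = self.embed(ids)
        h, _ = self.lstm(x)
        logits = self.out(h[:, -1, :])          # last hidden state h_T
        return logits.softmax(dim=-1)           # P(m | C)

model = CompletionLSTM(vocab_size=1000)
probs = model(torch.randint(0, 1000, (1, 20)))
print(probs.argmax(dim=-1))                     # m* = argmax P(m | C)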
29. Hyperparameter tuning
Our model has several tunable hyperparameters, determined by random-search optimization: rerunning model training until convergence (via early stopping) and selecting the best-performing combination by validation accuracy.
Hyperparameter                          Best value
Base learning rate                      0.002
Learning rate decay per epoch           0.97
Num. recurrent neural network layers    2
Num. hidden units in LSTM, per layer    100
Type of RNN                             LSTM
Batch size                              256
Type of loss function                   Categorical cross-entropy
Num. lookback tokens                    200+
Num. timesteps for backpropagation      100
Embedded vector dimension               150
Stochastic optimization scheme          Adam
Weight regularization of all layers     10
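An illustrative random-search loop in the spirit of this slide; train_and_validate is a hypothetical stand-in for a real training run, and the search space values are examples, not the talk's:

import random

space = {
    "base_lr": [0.01, 0.002, 0.001],
    "lr_decay": [0.99, 0.97, 0.95],
    "hidden_units": [50, 100, 200],
    "batch_size": [128, 256, 512],
}

def train_and_validate(cfg):
    # Hypothetical stand-in: a real run would train the LSTM to convergence
    # with early stopping and return validation accuracy.
    return random.random()

best = (None, 0.0)
for _ in range(20):
    cfg = {k: random.choice(v) for k, v in space.items()}
    val_acc = train_and_validate(cfg)
    if val_acc > best[1]:
        best = (cfg, val_acc)
print(best)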
30. Offline model evaluation (top-5 accuracy)
Offline precision across all classes lifted by almost 20%.
Category                                     Number of classes
Improved with DL                             8,014
Approximately the same                       1,488
Declined with DL                             235
Completion available with DL but not MC      263
• Most completion classes improve with the deep learning approach
• 2.5% of classes decline – mostly belonging to Python web microframeworks like Flask and Tornado
• For some classes, type information for the receiver token is not available; DL is still able to provide completions in that case
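For clarity, the top-5 metric used here can be computed as below (toy predictions, not the talk's data): a completion counts as correct if the true method appears among the first five suggestions.

def top_k_accuracy(suggestions, truths, k=5):
    """Fraction of cases where the true token is in the top-k suggestions."""
    hits = sum(t in s[:k] for s, t in zip(suggestions, truths))
    return hits / len(truths)

preds = [["Read", "Close", "Seek"], ["Write", "Flush"]]
truth = ["Close", "Dispose"]
print(top_k_accuracy(preds, truth))   # 0.5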
32. Why use deep learning?
• The deep learning model achieves better accuracy
• It is suitable for more advanced completion scenarios (not just methods)
• There is an opportunity to predict out-of-vocabulary tokens
• Why not?
  • Poor interpretability
  • Model sizes are bigger, and serving performance is an issue
33. Deployment challenges
• Need to reduce model size on disk
  • Change the neural network architecture to reduce the number of trainable parameters
  • Reuse the input word embedding matrix as the output classification matrix, removing the large fully connected layer (model size reduction from 202 to 152 MB, with no accuracy loss)
• Model compression
  • Apply post-training neural network quantization to store weight matrices in 8-bit integer format (further model size reduction from 152 to 38 MB, 3% accuracy loss)
• Serving speed
  • Current serving speeds on the edge are 5x slower than a cheap model
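A hedged PyTorch sketch of the two reductions described above: weight tying (reusing the embedding matrix as the output classifier) and post-training dynamic quantization to int8. The MB figures on the slide come from the talk's model, not this toy one:

import torch
import torch.nn as nn

class TiedLSTM(nn.Module):
    def __init__(self, vocab=1000, dim=150, hidden=150):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.lstm = nn.LSTM(dim, hidden, num_layers=2, batch_first=True)

    def forward(self, ids):
        h, _ = self.lstm(self.embed(ids))
        # Weight tying: reuse the embedding matrix as the output projection,
        # removing the separate fully connected classification layer.
        return h[:, -1, :] @ self.embed.weight.t()

model = TiedLSTM()
# Post-training dynamic quantization: store LSTM/Linear weights as int8.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.LSTM, nn.Linear}, dtype=torch.qint8)
print(quantized(torch.randint(0, 1000, (1, 10))).shape)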
34. Can we teach machines to review code?
• What does the data tell us? Open-source Python pull requests:
[Figure: Distribution of review types in open-source peer reviews of Python pull requests (percentage of reviews, 0–50%): affirmative reviews, stylistic reviews, docstring reviews, Python-version related, code duplication, test related, error/exception related, string-manipulation related, regular-expression related, print/debug/logging related, import related.]
• ~43% of reviews are basic/stylistic reviews
• ~15% of reviews are related to comments
Gupta, Sundaresan (KDD 2018)
35. Architecture
Training phase: crawl Git repositories for historical code reviews → code and review preprocessing → (code, review) pairs → training-data generation (relevant and non-relevant pairs) → train a multi-encoder deep learning model.
Testing phase: new pull request → review candidate selection from a repository of common reviews (built by review clustering) → vectorized (code, candidate review) pairs scored by the multi-encoder model → emit the review with maximum model confidence.
The multi-encoder model runs separate LSTMs over the code, the candidate review, and the code context, and feeds them into DNNs that output a relevance score.
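A loose sketch of the multi-encoder scorer (assumptions: PyTorch, toy vocabulary and dimensions; the paper's exact architecture may differ):

import torch
import torch.nn as nn

class MultiEncoderRelevance(nn.Module):
    def __init__(self, vocab=1000, dim=64, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.enc_code = nn.LSTM(dim, hidden, batch_first=True)
        self.enc_review = nn.LSTM(dim, hidden, batch_first=True)
        self.enc_context = nn.LSTM(dim, hidden, batch_first=True)
        self.scorer = nn.Sequential(            # DNN over the three encodings
            nn.Linear(3 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid())  # relevance score in [0, 1]

    def encode(self, enc, ids):
        _, (h, _) = enc(self.embed(ids))
        return h[-1]                             # final hidden state

    def forward(self, code, review, context):
        feats = torch.cat([self.encode(self.enc_code, code),
                           self.encode(self.enc_review, review),
                           self.encode(self.enc_context, context)], dim=-1)
        return self.scorer(feats)

m = MultiEncoderRelevance()
score = m(torch.randint(0, 1000, (1, 30)),
          torch.randint(0, 1000, (1, 12)),
          torch.randint(0, 1000, (1, 30)))
print(score)   # model confidence that the candidate review fits the code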
36. Opportunities
• OSS gives us lots and lots of data about code and coders
• The cloud gives us the opportunity to process lots and lots of data
• Recent and rapid advances in ML and AI
• Take advantage of newer advances (Transformer networks / GPT-x)
• But… challenges remain
  • While computer languages are synthetic, unlike natural languages, the systems are programmed by humans
  • Scale, sparsity, speed
• We have barely scratched the surface… a lot more to come…
37. Thank you!
• We have a number of initiatives in the area of applying AI at scale to software engineering
• We are hiring! Email neels@Microsoft.com
In order to create an IntelliSense that suggests the right method when you need it, we need lots of examples of realistic usage of the various classes in the .NET framework and other common libraries.
So we crawled all the public C# repos on GitHub with more than 100 stars.
There were 2300 of those. We could automatically restore and build one third of these.
This gave us 200,000 .cs files.
--------------
When there are multiple solutions in one repo, we only parse the first one to avoid duplication.
The ones that could not be parsed either did not contain a .sln file, or we could not load/open the first .sln file within 60 seconds, or we could not get a compilation of the code within 60 seconds.
Each solution was given 2 minutes to restore its NuGet packages.
One of the issues that JoC raised is that many popular repos on GitHub are libraries, and we suspected that the coding patterns employed could be different from normal application solutions. JoC pointed us to a repo from MSIT.
What is in the data? Different approaches we take. Talk about the data – rich information in the data. Jumping into sequence data too fast? Make a slide for questions? Are certain calls different from others?
Data driven approach
Mention collaboration
Here before we move on to the modeling part, let’s take a detour and think about what this dataset enables us to answer, and look at some data that justify our approach.
Generally, we want to understand which classes are used most often so we can focus our effort.
Are there patterns in how methods are used that we can take advantage of?
In relation to making useful recommendations,
we would want to know what are the most informative features,
and also how local should our context information be - the entire document, the current function or the last few calls?
Once we develop a model, we’d like to know if one model works for all, whether we have sufficient training data.
Here’s one question that’s readily answerable. This graph shows the number of invocations for the top 50 classes in our dataset.
We can see that the most popular class is string, followed by WinForms Control and List.
We can see that the popularity drops off very quickly, and it has a very long tail.
The top 300 classes cover nearly 40%.
(For the precision results you’ll see later, everything is reported for the top 300 classes.)
As is the case with all recommender systems, we also face the cold start problem, meaning that there’s no contextual information to base our recommendation on.
This happens when the current invocation is the first time this class is called in the current document, as we have little idea on what the developer is trying to write.
Let’s focus on the light blue part of each bar. This is the portion of invocations that are first of its class in the document.
We see that for the top 100 classes, only about 15% of the time would we need to make a recommendation with no context. This gets more severe as the class becomes more rarely used.
We have experimented with three types of models. The Frequency model is a simple popularity ranking which we use for the cold start scenarios.
We then implemented variants of the Clustering model because it is a popular approach for API recommenders in the literature.
The precision of the Clustering model was modest, and so we implemented the Sequence model, which we thought was a better model of the coding process. It turned out to have the highest precision and is the model we use in production today.