Teaching Machines to Code
Neel Sundaresan
Microsoft Corp.
Neel Sundaresan / MLConf 2019 NY
It's all about Data
• ~19M software developers in the world (Source: Tech Republic,
Ranger2013)
• 2/3 professionals, rest hobbyists
• 29 million IT/ICT Professionals
• Growing OSS data through GitHub, Stack Overflow, etc.
• 10 years of GitHub
• 10M users
• 26M projects
• 400M commits
• ~7M committers
• ~1M active users and ~250K monthly new users
• ~800K new projects per month
AI Opportunities in Software Development
New Opportunities
• Take advantage of large-scale data, advances in AI algorithms,
availability of distributed systems and cloud, and powerful compute
(GPUs) to revolutionize developer productivity
Let's start with Data…
• D. E. Knuth (1971) analyzed about 800 Fortran programs and found that
• 95% of the loops increment the index by 1.
• 85% of the loops had 5 statements or fewer.
• 53% of the loops were singly nested.
• A more recent analysis (Allamanis et al.) of 25 MLOC showed the following stats:
• 90% of loops have < 15 lines, 90% have no nesting, and most have very simple control structures.
• 50 classes of loop idioms cover 50% of concrete loops.
• Benefits
• Data driven frameworks for code refactoring
• Opportunities for program optimization
• Language design opportunities
Statistical model of code
• Lexical/code generative models (tokenizers)
• E.g. sequence-based models (n-gram models from NLP), sequence-to-sequence
character models with RNNs and LSTMs, and a sparse pointer-based neural model
for Python
• Neural models are superior to n-gram models
• They are more expensive to train and execute, and need a lot more data.
• They perform much better because they can model long-range declare-use scenarios.
• They catch patterns across contexts better than n-grams (sequences of code that are
similar but with changed variables: the "sentiment" of the code).
• Word2Vec; more recently, for code: Code2Vec, Code2Seq
Statistical model of code
• Representational model (Abstract Syntax trees)
• These models represent code better than sequence models but
are more expensive.
• There is work on using LSTMs over such representations (for limited program
synthesis applications)
Statistical model of code
• Latent model
• Looking for hidden design patterns, programming idioms, standardized APIs,
summaries, anomalies, etc.
• Requires unsupervised learning: challenging!
• Previous research has used tree substitution grammars to identify similar
grammar productions (program tree fragments)
• Graph-based representations have been used to identify common API usage
Application of code models
• Recommenders. Example: code completion in IDEs
• Instead of using alphabetical or default orders, statistical learning could
be used to rank completions
• Early work by Bruch et al.
• Bayesian graphical models that use code structure to predict the next call, by Proksch et al.,
integrated into the Eclipse IDE.
• How to evaluate the recommender systems?
• Keystrokes saved? Overall productivity? Engagement models? Reduced bugs?
Inferring coding conventions
• Coding conventions for better maintenance
• How to format code
• Variable and class naming conventions (Allamanis et al.)
• Alternative for linter rules…
Inferring bugs
• Identifying buggy code is like anomaly detection
• Buggy code has unusual patterns whose probabilities are quite
different from those of normal code
• Complexity measures based on n-gram language models have shown
good results, comparable to tools like FindBugs
• They can even report a syntax error closest to where the error occurs
• Since problematic code is rare (like anomalies, by definition), false
positives are likely and high precision is hard to achieve
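A minimal sketch of the n-gram idea above, assuming a bigram model with add-one smoothing and whitespace-tokenized lines (the corpus, the smoothing choice, and all function names are illustrative assumptions, not the tooling referenced in the talk). Lines whose tokens are improbable under the model receive a higher cross-entropy and can be flagged for inspection.

```python
import math
from collections import Counter

def train_bigram(token_lines):
    """Count bigram contexts and bigrams over tokenized lines, with a start marker."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for tokens in token_lines:
        padded = ["<s>"] + tokens
        vocab.update(tokens)
        contexts.update(padded[:-1])             # every token that has a successor
        bigrams.update(zip(padded, padded[1:]))
    return contexts, bigrams, vocab

def cross_entropy(tokens, contexts, bigrams, vocab):
    """Average negative log2-probability per token, using add-one smoothing."""
    vocab_size = len(vocab) + 1                  # +1 reserves mass for unseen tokens
    padded = ["<s>"] + tokens
    total = 0.0
    for prev, cur in zip(padded, padded[1:]):
        prob = (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size)
        total -= math.log2(prob)
    return total / len(tokens)

# Tiny made-up training corpus of "normal" code lines.
corpus = [
    "for i in range ( n ) :".split(),
    "for j in range ( m ) :".split(),
    "for k in range ( n ) :".split(),
]
contexts, bigrams, vocab = train_bigram(corpus)

typical = cross_entropy("for i in range ( m ) :".split(), contexts, bigrams, vocab)
unusual = cross_entropy("for in i ) range ( :".split(), contexts, bigrams, vocab)
print(f"typical line: {typical:.2f} bits/token, scrambled line: {unusual:.2f} bits/token")
```

A threshold on this score trades recall against the false positives mentioned above; where to set it is a tuning decision this sketch does not address.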
Program Synthesis
• Autogenerating programs from specifications
• With vast amounts of program examples and associated metadata, one approach attempts to
match the specs to the metadata and extract matching code
• SCIgen (automatic paper generator from MIT)
“SCIgen is a program that generates random Computer Science research papers, including
graphs, figures, and citations. It uses a hand-written context-free grammar to form all elements
of the papers. Our aim here is to maximize amusement, rather than coherence.” They use it to
detect bogus conferences!
• Airbnb Sketch2Code (design to code)
• A UX web design mockup to HTML using deep learning (Pix2Code)
• DeepCoder (MSR/U of Cambridge)
• Uses inductive program synthesis: given a set of input/output examples, it searches a space of
candidate programs and finds one that matches them.
• Works for DSLs (domain-specific languages) with limited constructs, but not for languages like C++
• Automatically finding patches (MIT Prophet/Genesis)
• Bayou system from Rice U.
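The inductive synthesis bullet above can be illustrated with a brute-force enumerator. This is only a sketch under stated assumptions: the four-primitive list DSL, the function names, and the exhaustive search are invented for illustration and are not DeepCoder's DSL or its learned search guidance.

```python
from itertools import product

# A tiny, hypothetical DSL of list-to-list primitives (an assumption for this sketch).
PRIMITIVES = {
    "sort": sorted,
    "reverse": lambda xs: list(reversed(xs)),
    "double": lambda xs: [2 * x for x in xs],
    "drop_negatives": lambda xs: [x for x in xs if x >= 0],
}

def run(program, xs):
    """Apply a pipeline of primitive names to an input list."""
    for name in program:
        xs = PRIMITIVES[name](xs)
    return list(xs)

def synthesize(examples, max_length=3):
    """Return the first shortest pipeline consistent with every example, or None."""
    for length in range(1, max_length + 1):
        for program in product(PRIMITIVES, repeat=length):
            if all(run(program, inp) == out for inp, out in examples):
                return program
    return None

# Input/output specification: keep non-negatives, double them, ascending order.
examples = [([3, -1, 2], [4, 6]), ([0, 5, -7], [0, 10])]
program = synthesize(examples)
print(program)
```

The search space grows exponentially with program length, which is one reason this style of synthesis is limited to small DSLs.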
A Case Study: Intellisense (Code Completion)
The learning dilemma!
Deep Learning
Vs
Cheap Learning
Intellisense
Data Source
Number of C# repos: 2000+
Number of repos we were able to build and parse to form our dataset: 700+
Number of .cs documents in the dataset: 200K+
What questions can we ask of this dataset?
How is C# used?
1. Which are the most frequently used classes?
2. Are there patterns in how methods of one class are used?
How to make recommendations?
1. Will the same model and parameters work for all classes?
2. Do we have enough data?
3. Would the previous usage of methods from other classes help with prediction?
Which features are useful? When making a prediction:
1. Which pieces of information provided by code analyzers would be helpful?
2. What is the reasonable segment of code to look at: the entire document/function or
the most recent calls?
How often is each class used?
Top n classes    Coverage
100              28%
300              37.5%
1,088            50%
5,986            70%
13,203           80%
30,668           90%
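A coverage figure like those in the table is the share of all method invocations that fall on the n most-used classes. The sketch below shows that calculation on made-up counts; the class names and numbers are placeholders, not the dataset from the talk.

```python
from collections import Counter

def top_n_coverage(invocation_counts, n):
    """Fraction of all invocations accounted for by the n most-used classes."""
    total = sum(invocation_counts.values())
    top = sum(count for _, count in invocation_counts.most_common(n))
    return top / total

# Made-up invocation counts, for illustration only.
counts = Counter({"string": 50, "List": 25, "Math": 15, "Guid": 10})
coverage = {n: top_n_coverage(counts, n) for n in (1, 2, 4)}
print(coverage)
```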
[Figure: Number of Invocations per Class for the Top 50 Classes. Value axis: total invocations in dataset (0 to 45,000). Category axis: class names, beginning with string, System.Windows.Forms.Control, System.Collections.Generic.List, System.Linq.Enumerable, System.Array, System.Text.StringBuilder, System.Diagnostics.Debug, System.DateTime, System.Collections.Generic.Dictionary, System.Type, …]
How often do we face the cold start problem?
Invocation Composition in Different Class Groups
Invocation                   Top 100 Classes   Top 100 - 200 Classes   Top 200 - 300 Classes
First Invocation             14.34%            25.42%                  36.41%
Second Invocation            9.63%             15.31%                  17.85%
Third Invocation             6.65%             9.28%                   9.88%
Fourth Invocation            5.30%             6.81%                   6.89%
Fifth Invocation or After    64.07%            43.18%                  28.98%
Sequence Model
• A second-order Markov chain: the
probability of the current invocation
depends on the two previous invocations
• Very fast to train
• Performed quite well in both offline and
online testing
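A second-order Markov chain of this kind can be sketched with nested counters. The class name, the padding token, and the example call sequences below are assumptions for illustration; the production model's features and smoothing are not described in the slides.

```python
from collections import Counter, defaultdict

class SecondOrderMarkov:
    """Predict the next method from the two previous invocations."""

    def __init__(self):
        self.transitions = defaultdict(Counter)

    def fit(self, sequences):
        for seq in sequences:
            # Pad so the first and second calls also have a two-call context.
            padded = ["<start>", "<start>"] + list(seq)
            for a, b, c in zip(padded, padded[1:], padded[2:]):
                self.transitions[(a, b)][c] += 1
        return self

    def recommend(self, previous_calls, k=3):
        padded = ["<start>", "<start>"] + list(previous_calls)
        state = (padded[-2], padded[-1])
        counts = self.transitions.get(state, Counter())
        return [method for method, _ in counts.most_common(k)]

# Hypothetical invocation sequences on a StringBuilder-like receiver.
sessions = [
    ["Append", "Append", "ToString"],
    ["Append", "AppendLine", "ToString"],
    ["Append", "Append", "Append", "ToString"],
    ["Append", "Append", "ToString"],
]
model = SecondOrderMarkov().fit(sessions)
print(model.recommend(["Append", "Append"]))
print(model.recommend([]))
```

Training is a single counting pass over the data, which is consistent with the "very fast to train" point above.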
Modeling Method Calls: Summary
1. Frequency Model: 1 MB, Top-1 Accuracy: 38%
3. Sequence Model: 5.3 MB, Top-1 Accuracy: 58%
Sequence model performs better in both offline and online testing
[Figure: Percentage of Invocations per string method (axis 0% to 40%): string.Format, string.Equals, string.IsNullOrE…, string.Replace, string.Trim, string.Substring, string.IndexOf, string.Contains, string.IsNullOr…, string.EndsWith, string.ToUpper, string.Compare…, string.LastIndex…, string.ToCharAr…, string.PadLeft, string.IndexOfA…]
Our Intellisense system
• Languages supported
• C#, Python, C++, Java, XAML, TypeScript
• Platforms
• VSCode, Visual Studio
• Check out this blog:
A Deep learning approach
Suppose recommendation is requested here.
• The deep learning model consumes ASTs corresponding to code snippets as an input for training
• AST tokens are mapped to numeric embedding vectors, which are learned via backpropagation using Word2Vec
• Substitute method call receiver token with its inferred type, when available
• Optionally normalize local variables according to <var:variable type>
Pipeline:
1. Source code snippet
2. Extract training sequences (AST parser)
3. Vectorize
4. Embed (build code embedding matrix)
Example training sequence (step 2):
…. "loss", "=", "tf", ".", "reduce_sum", "(", "tf", ".",
"square", "(", "linear_model", "-",
"y", ")", ")", "n", "optimizer", "=",
"tf", ".", "tensorflow.train", "."
Vectorized sequence (step 3):
array([11, 9, 4, 12, 11, 9, 8, 13, 14, 15, 16, 17,
18, 19, 20, 21, 14, 22, 16, 11, 9, 4, 12, 11, 9, 8,
13, 14, 23, 16, 11, 9, 3, 12, 11, 9, 5, 12, 15, 24,
22, 13, 13, 14, 25, 16, 11, 9, 7, 9, 6], dtype=int32)
Embedded sequence (step 4):
array([[[-0.00179027, 0.01935565, -0.00102201, ...,
-0.11528983, 0.02137219, 0.08332191],
...,
[-0.04104977, 0.04417963, -0.01034168, ...,
0.04209893, 0.00140189, -0.10478071]]], dtype=float32)
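The vectorize and embed steps can be sketched as a vocabulary lookup followed by row selection from an embedding matrix. This is a simplified illustration: the helper names, the `<unk>` convention, and the 4-dimensional random embedding are assumptions, and in the talk the embeddings are learned rather than random.

```python
import numpy as np

def build_vocab(sequences):
    """Assign each distinct token an integer id; id 0 is reserved for unknown tokens."""
    vocab = {"<unk>": 0}
    for seq in sequences:
        for tok in seq:
            vocab.setdefault(tok, len(vocab))
    return vocab

def vectorize(tokens, vocab):
    """Map tokens to ids, falling back to the <unk> id for unseen tokens."""
    return np.array([vocab.get(t, 0) for t in tokens], dtype=np.int32)

tokens = ["loss", "=", "tf", ".", "reduce_sum", "(", "tf", ".", "square", "("]
vocab = build_vocab([tokens])
ids = vectorize(tokens, vocab)

embedding_dim = 4  # illustrative only; the talk reports a 150-dimensional embedding
rng = np.random.default_rng(0)
# Random stand-in for a learned embedding matrix, one row per vocabulary entry.
embedding_matrix = rng.normal(scale=0.1, size=(len(vocab), embedding_dim)).astype(np.float32)
embedded = embedding_matrix[ids]  # shape: (sequence length, embedding_dim)

print(ids)
print(embedded.shape)
```

Repeated tokens such as "tf" map to the same id and therefore to the same embedding row.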
Neural network architecture
[Figure: network diagram. Code snippet tokens c_0, c_1, …, c_T → code embedding x_0, x_1, …, x_T → stacked LSTM → final hidden state h_T (components h_T0 … h_Tdh) → linear layer → predicted embedding vector (components l_0 … l_dx) → softmax prediction y_0, y_1, …, y_|V|]
The task of method completion is to predict a token m* conditional on a sequence of input
tokens c_t, t = 0, …, T, corresponding to the terminal nodes of the AST for code snippet C, ending in a terminal “.”
x_t = L c_t, where L is the d_x × |V| word embedding matrix, d_x is the word embedding dimension, and V
is the vocabulary
h_t = f(x_t, h_{t-1}), where f is the stacked LSTM, which takes the current input and the previous hidden state and
produces the next hidden state
P(m | C) = y_t = softmax(W h_t + b), where W is the output projection matrix and b is the bias
m* = argmax_m P(m | C)
Ref: Svyatkovskiy, Fu, Sundaresan, Zhao
The LSTM has 2 layers, 100 hidden units each
with application of recurrent dropout
and L2 regularization
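The equations above can be traced in a small NumPy forward pass. This is a sketch under simplifying assumptions: a single LSTM layer instead of the two stacked layers, random untrained weights, tiny dimensions, and no dropout or regularization. All variable names other than L, W, b, x_t, and h_t are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

def lstm_step(x, h, c, W_in, U, b):
    """One LSTM step; gate pre-activations are stacked as [input, forget, output, candidate]."""
    d = h.shape[0]
    z = W_in @ x + U @ h + b
    i, f, o = sigmoid(z[:d]), sigmoid(z[d:2 * d]), sigmoid(z[2 * d:3 * d])
    g = np.tanh(z[3 * d:])
    c_next = f * c + i * g
    h_next = o * np.tanh(c_next)
    return h_next, c_next

def predict(token_ids, L, W_in, U, b_lstm, W, b):
    """Return (m*, P(m | C)) for a sequence of token ids."""
    d_h = U.shape[1]
    h, c = np.zeros(d_h), np.zeros(d_h)
    for t in token_ids:
        x = L[:, t]                      # x_t = L c_t, with c_t a one-hot vector
        h, c = lstm_step(x, h, c, W_in, U, b_lstm)
    probs = softmax(W @ h + b)           # P(m | C) = softmax(W h_T + b)
    return int(np.argmax(probs)), probs  # m* = argmax_m P(m | C)

rng = np.random.default_rng(0)
V, d_x, d_h = 12, 8, 16                  # toy sizes, not the values from the talk
L = rng.normal(scale=0.1, size=(d_x, V))
W_in = rng.normal(scale=0.1, size=(4 * d_h, d_x))
U = rng.normal(scale=0.1, size=(4 * d_h, d_h))
b_lstm = np.zeros(4 * d_h)
W = rng.normal(scale=0.1, size=(V, d_h))
b = np.zeros(V)

m_star, probs = predict([3, 7, 1, 4], L, W_in, U, b_lstm, W, b)
print(m_star, probs.sum())
```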
Hyperparameter tuning
Our model has several tunable hyperparameters, determined by random search optimization:
model training is rerun until convergence (via early stopping), and the best-performing
combination is selected by comparing accuracy on the validation set.
Hyperparameter                            Best value
Base learning rate                        0.002
Learning rate decay per epoch             0.97
Num. recurrent neural network layers      2
Num. hidden units in LSTM, per layer      100
Type of RNN                               LSTM
Batch size                                256
Type of loss function                     Categorical cross-entropy
Num. lookback tokens                      200+
Num. timesteps for backpropagation        100
Embedded vector dimension                 150
Stochastic optimization scheme            Adam
Weight regularization of all layers       10
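Random search over hyperparameters can be sketched as repeated sampling and scoring. In this illustration the search space, the candidate values, and the scoring function are all assumptions; in practice `evaluate` would train the model with early stopping and return validation accuracy.

```python
import random

# Hypothetical candidate values; the real search space is not given in the slides.
SEARCH_SPACE = {
    "learning_rate": [0.0005, 0.001, 0.002, 0.005],
    "num_layers": [1, 2, 3],
    "hidden_units": [50, 100, 200],
    "batch_size": [64, 128, 256],
}

def sample_config(rng):
    """Draw one value per hyperparameter, uniformly at random."""
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}

def random_search(evaluate, num_trials, seed=0):
    """Evaluate num_trials random configurations and keep the best-scoring one."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(num_trials):
        config = sample_config(rng)
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

def toy_validation_accuracy(config):
    """Stand-in for 'train with early stopping, then return validation accuracy'."""
    penalty = abs(config["learning_rate"] - 0.002) * 100 + abs(config["num_layers"] - 2) * 0.1
    return 1.0 - penalty

best_config, best_score = random_search(toy_validation_accuracy, num_trials=50)
print(best_config, best_score)
```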
Offline model evaluation (top-5 accuracy)
Offline precision for all classes lifted by almost 20%
Category                                                     Number of classes
Improved with DL                                             8014
Approximately the same                                       1488
Declined with DL                                             235
Completion available with DL but not MC (Markov chain)       263
• Most of the completion classes improve
with the deep learning approach
• About 2.5% of classes decline, mostly those belonging
to Python web microframeworks like Flask and
Tornado
• For some classes, type information of the
receiver token is not available; DL is still able to
provide completions in that case
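Top-k accuracy, the metric named in the slide title, counts a prediction as correct when the method the developer actually chose appears among the first k suggestions. A minimal sketch, with made-up suggestion lists and targets:

```python
def top_k_accuracy(ranked_predictions, targets, k=5):
    """Fraction of cases where the true method is among the first k suggestions."""
    hits = sum(target in ranked[:k] for ranked, target in zip(ranked_predictions, targets))
    return hits / len(targets)

# Made-up ranked suggestions and the methods that were actually called.
ranked = [
    ["Append", "ToString", "Clear"],
    ["Format", "Join", "Concat"],
    ["Add", "Remove", "Clear"],
]
actual = ["ToString", "Concat", "Contains"]

top1 = top_k_accuracy(ranked, actual, k=1)
top3 = top_k_accuracy(ranked, actual, k=3)
print(top1, top3)
```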
Some Numbers
Why use deep learning?
• Deep learning models achieve better accuracy
• Can be suitable for more advanced completion scenarios (not just methods)
• Opportunity to predict out-of-vocabulary tokens
• Why not?
• Poor interpretability
• Model sizes are bigger; performance is an issue
Ref: Svyatkovskiy, Fu, Sundaresan, Zhao
Deployment challenges
• Need to reduce model size on disk
• Change the neural network architecture to reduce the number of trainable
parameters
• Reuse the input word embedding matrix as the output classification
matrix, removing the large fully connected layer (model size reduced from
202 MB to 152 MB, with no accuracy loss)
• Model compression
• Apply post-training neural network quantization to store weight
matrices in 8-bit integer format (model size further reduced from 152 MB
to 38 MB, with 3% accuracy loss)
• Serving speed
• Current serving speeds on the edge are 5x slower than a cheap model
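Post-training quantization to 8-bit integers can be sketched as follows. The symmetric per-tensor scheme, the function names, and the random weight matrix are assumptions chosen for illustration; the slides do not say which quantization scheme was used.

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor quantization: float32 -> (int8 values, float scale)."""
    max_abs = float(np.abs(weights).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float32 weights from the int8 representation."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
# Random stand-in for a trained weight matrix.
weights = rng.normal(scale=0.1, size=(1000, 150)).astype(np.float32)

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

size_ratio = weights.nbytes / q.nbytes
max_error = float(np.abs(weights - restored).max())
print(f"size ratio: {size_ratio:.1f}x, max abs error: {max_error:.6f}")
```

Storing one byte per weight instead of four gives a 4x reduction, which is consistent with the 152 MB to 38 MB figure above; the rounding error per weight is bounded by half the scale.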
Can we teach machines to review code?
• What does data tell us?
• Open source Python pull requests
[Figure: Distribution of type in open source peer reviews of Python pull requests. Category axis: type of peer review; value axis: % of reviews (0 to 50). Review types shown: affirmative, stylistic, docstring, Python version related, code duplication, test related, error/exception related, string manipulation related, regular expression related, print/debug/logging related, import related.]
~43% of reviews are basic/stylistic reviews
~15% of reviews are related to comments
Gupta, Sundaresan (KDD 2018)
Architecture
Training phase: Git repositories → crawl historical code reviews → code and review
preprocessing → (code, review) pairs → training data generation → relevant and
non-relevant (code, review) pairs → vectorized (code, review) → multi-encoder deep
learning model → model training
Testing phase: new pull request → code → review candidate selection from a repository of
common reviews (built by review clustering) → (code, candidate review) pairs →
multi-encoder deep learning model → review with maximum model confidence
Multi-encoder model: LSTM encoders (LSTM1, LSTM2, LSTM3, LSTM) over code, code context,
and review → DNNs → relevance score
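The testing phase above reduces to scoring each (code, candidate review) pair and returning the highest-scoring review. The sketch below shows only that selection step; the token-overlap scorer is a hypothetical stand-in for the trained multi-encoder model, and the candidate reviews are invented.

```python
def tokenize(text):
    """Lowercase, treat parentheses as separators, and split on whitespace."""
    return set(text.lower().replace("(", " ").replace(")", " ").split())

def relevance_score(code, review):
    """Hypothetical stand-in for the learned model: Jaccard overlap of token sets."""
    a, b = tokenize(code), tokenize(review)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def best_review(code, candidate_reviews, score=relevance_score):
    """Return (score, review) for the candidate with the maximum relevance score."""
    scored = [(score(code, review), review) for review in candidate_reviews]
    return max(scored, key=lambda pair: pair[0])

# Invented repository of common reviews.
candidates = [
    "Please remove the debug print before merging",
    "Add a docstring to this function",
    "Use a raw string for this regular expression",
]
confidence, review = best_review("print ( result )", candidates)
print(confidence, review)
```

Passing a different `score` callable is how a trained relevance model would be plugged into this step.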
Opportunities
• OSS gives us lots and lots of data about code and coders
• Cloud gives us the opportunity to process lots and lots of data
• Recent and rapid advances in ML and AI
• Take advantage of newer advances (Transformer networks / GPT-x)
• But…
• Challenges remain
• Although programming languages are synthetic, unlike natural languages, programs are
still written by humans.
• Scale, sparsity, speed
• We have barely scratched the surface… a lot more to come…
Thank you!
• We have a number of initiatives in the area of applying AI at scale to
Software engineering
• We are hiring! email neels@Microsoft.com
Addendum
Neel Sundaresan / MLConf 2019 NY

More Related Content

What's hot

Production ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ wazeProduction ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ waze
Ido Shilon
 
Machine Learning In Production
Machine Learning In ProductionMachine Learning In Production
Machine Learning In Production
Samir Bessalah
 
Looking into the Future: Using Google's Prediction API
Looking into the Future: Using Google's Prediction APILooking into the Future: Using Google's Prediction API
Looking into the Future: Using Google's Prediction API
Justin Grammens
 
Automate your Machine Learning
Automate your Machine LearningAutomate your Machine Learning
Automate your Machine Learning
Ajit Ananthram
 
Common Problems in Hyperparameter Optimization
Common Problems in Hyperparameter OptimizationCommon Problems in Hyperparameter Optimization
Common Problems in Hyperparameter Optimization
SigOpt
 
Bootstrapping of PySpark Models for Factorial A/B Tests
Bootstrapping of PySpark Models for Factorial A/B TestsBootstrapping of PySpark Models for Factorial A/B Tests
Bootstrapping of PySpark Models for Factorial A/B Tests
Databricks
 
Detecting Financial Fraud at Scale with Machine Learning
Detecting Financial Fraud at Scale with Machine LearningDetecting Financial Fraud at Scale with Machine Learning
Detecting Financial Fraud at Scale with Machine Learning
Databricks
 
“Houston, we have a model...” Introduction to MLOps
“Houston, we have a model...” Introduction to MLOps“Houston, we have a model...” Introduction to MLOps
“Houston, we have a model...” Introduction to MLOps
Rui Quintino
 
Augmenting Machine Learning with Databricks Labs AutoML Toolkit
Augmenting Machine Learning with Databricks Labs AutoML ToolkitAugmenting Machine Learning with Databricks Labs AutoML Toolkit
Augmenting Machine Learning with Databricks Labs AutoML Toolkit
Databricks
 
Machine learning model to production
Machine learning model to productionMachine learning model to production
Machine learning model to production
Georg Heiler
 
From Noob to Tech CEO - june 29th, 2011
From Noob to Tech CEO - june 29th, 2011From Noob to Tech CEO - june 29th, 2011
From Noob to Tech CEO - june 29th, 2011Kareem Amin
 
Using MLOps to Bring ML to Production/The Promise of MLOps
Using MLOps to Bring ML to Production/The Promise of MLOpsUsing MLOps to Bring ML to Production/The Promise of MLOps
Using MLOps to Bring ML to Production/The Promise of MLOps
Weaveworks
 
Automated Hyperparameter Tuning, Scaling and Tracking
Automated Hyperparameter Tuning, Scaling and TrackingAutomated Hyperparameter Tuning, Scaling and Tracking
Automated Hyperparameter Tuning, Scaling and Tracking
Databricks
 
Building Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemMLBuilding Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemML
Jen Aman
 
GDG PDX - An Intro to Google Cloud AutoML Vision
GDG PDX - An Intro to Google Cloud AutoML VisionGDG PDX - An Intro to Google Cloud AutoML Vision
GDG PDX - An Intro to Google Cloud AutoML Vision
jerryhargrove
 
MLflow: Infrastructure for a Complete Machine Learning Life Cycle
MLflow: Infrastructure for a Complete Machine Learning Life CycleMLflow: Infrastructure for a Complete Machine Learning Life Cycle
MLflow: Infrastructure for a Complete Machine Learning Life Cycle
Databricks
 
ML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning Infrastructure
ML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning InfrastructureML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning Infrastructure
ML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning Infrastructure
Fei Chen
 
How to Utilize MLflow and Kubernetes to Build an Enterprise ML Platform
How to Utilize MLflow and Kubernetes to Build an Enterprise ML PlatformHow to Utilize MLflow and Kubernetes to Build an Enterprise ML Platform
How to Utilize MLflow and Kubernetes to Build an Enterprise ML Platform
Databricks
 
Geo Python16 keynote
Geo Python16 keynoteGeo Python16 keynote
Geo Python16 keynote
Romeo Kienzler
 
Machine Learning Projects Using MATLAB Research Help
Machine Learning Projects Using MATLAB Research HelpMachine Learning Projects Using MATLAB Research Help
Machine Learning Projects Using MATLAB Research Help
Matlab Simulation
 

What's hot (20)

Production ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ wazeProduction ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ waze
 
Machine Learning In Production
Machine Learning In ProductionMachine Learning In Production
Machine Learning In Production
 
Looking into the Future: Using Google's Prediction API
Looking into the Future: Using Google's Prediction APILooking into the Future: Using Google's Prediction API
Looking into the Future: Using Google's Prediction API
 
Automate your Machine Learning
Automate your Machine LearningAutomate your Machine Learning
Automate your Machine Learning
 
Common Problems in Hyperparameter Optimization
Common Problems in Hyperparameter OptimizationCommon Problems in Hyperparameter Optimization
Common Problems in Hyperparameter Optimization
 
Bootstrapping of PySpark Models for Factorial A/B Tests
Bootstrapping of PySpark Models for Factorial A/B TestsBootstrapping of PySpark Models for Factorial A/B Tests
Bootstrapping of PySpark Models for Factorial A/B Tests
 
Detecting Financial Fraud at Scale with Machine Learning
Detecting Financial Fraud at Scale with Machine LearningDetecting Financial Fraud at Scale with Machine Learning
Detecting Financial Fraud at Scale with Machine Learning
 
“Houston, we have a model...” Introduction to MLOps
“Houston, we have a model...” Introduction to MLOps“Houston, we have a model...” Introduction to MLOps
“Houston, we have a model...” Introduction to MLOps
 
Augmenting Machine Learning with Databricks Labs AutoML Toolkit
Augmenting Machine Learning with Databricks Labs AutoML ToolkitAugmenting Machine Learning with Databricks Labs AutoML Toolkit
Augmenting Machine Learning with Databricks Labs AutoML Toolkit
 
Machine learning model to production
Machine learning model to productionMachine learning model to production
Machine learning model to production
 
From Noob to Tech CEO - june 29th, 2011
From Noob to Tech CEO - june 29th, 2011From Noob to Tech CEO - june 29th, 2011
From Noob to Tech CEO - june 29th, 2011
 
Using MLOps to Bring ML to Production/The Promise of MLOps
Using MLOps to Bring ML to Production/The Promise of MLOpsUsing MLOps to Bring ML to Production/The Promise of MLOps
Using MLOps to Bring ML to Production/The Promise of MLOps
 
Automated Hyperparameter Tuning, Scaling and Tracking
Automated Hyperparameter Tuning, Scaling and TrackingAutomated Hyperparameter Tuning, Scaling and Tracking
Automated Hyperparameter Tuning, Scaling and Tracking
 
Building Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemMLBuilding Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemML
 
GDG PDX - An Intro to Google Cloud AutoML Vision
GDG PDX - An Intro to Google Cloud AutoML VisionGDG PDX - An Intro to Google Cloud AutoML Vision
GDG PDX - An Intro to Google Cloud AutoML Vision
 
MLflow: Infrastructure for a Complete Machine Learning Life Cycle
MLflow: Infrastructure for a Complete Machine Learning Life CycleMLflow: Infrastructure for a Complete Machine Learning Life Cycle
MLflow: Infrastructure for a Complete Machine Learning Life Cycle
 
ML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning Infrastructure
ML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning InfrastructureML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning Infrastructure
ML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning Infrastructure
 
How to Utilize MLflow and Kubernetes to Build an Enterprise ML Platform
How to Utilize MLflow and Kubernetes to Build an Enterprise ML PlatformHow to Utilize MLflow and Kubernetes to Build an Enterprise ML Platform
How to Utilize MLflow and Kubernetes to Build an Enterprise ML Platform
 
Geo Python16 keynote
Geo Python16 keynoteGeo Python16 keynote
Geo Python16 keynote
 
Machine Learning Projects Using MATLAB Research Help
Machine Learning Projects Using MATLAB Research HelpMachine Learning Projects Using MATLAB Research Help
Machine Learning Projects Using MATLAB Research Help
 

Similar to Neel Sundaresan - Teaching a machine to code

OWF14 - Big Data : The State of Machine Learning in 2014
OWF14 - Big Data : The State of Machine  Learning in 2014OWF14 - Big Data : The State of Machine  Learning in 2014
OWF14 - Big Data : The State of Machine Learning in 2014
Paris Open Source Summit
 
IRJET- Hand Sign Recognition using Convolutional Neural Network
IRJET- Hand Sign Recognition using Convolutional Neural NetworkIRJET- Hand Sign Recognition using Convolutional Neural Network
IRJET- Hand Sign Recognition using Convolutional Neural Network
IRJET Journal
 
GPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask LearnersGPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask Learners
Young Seok Kim
 
NLP and Deep Learning for non_experts
NLP and Deep Learning for non_expertsNLP and Deep Learning for non_experts
NLP and Deep Learning for non_experts
Sanghamitra Deb
 
Understanding Jupyter notebooks using bioinformatics examples
Understanding Jupyter notebooks using bioinformatics examplesUnderstanding Jupyter notebooks using bioinformatics examples
Understanding Jupyter notebooks using bioinformatics examples
Lynn Langit
 
Creating a new language to support open innovation
Creating a new language to support open innovationCreating a new language to support open innovation
Creating a new language to support open innovation
Mike Hucka
 
A Survey of Machine Learning Methods Applied to Computer ...
A Survey of Machine Learning Methods Applied to Computer ...A Survey of Machine Learning Methods Applied to Computer ...
A Survey of Machine Learning Methods Applied to Computer ...butest
 
Handwritten Digit Recognition Using CNN
Handwritten Digit Recognition Using CNNHandwritten Digit Recognition Using CNN
Handwritten Digit Recognition Using CNN
IRJET Journal
 
Aspects of NLP Practice
Aspects of NLP PracticeAspects of NLP Practice
Aspects of NLP Practice
Vsevolod Dyomkin
 
Automated Construction of Node Software Using Attributes in a Ubiquitous Sens...
Automated Construction of Node Software Using Attributes in a Ubiquitous Sens...Automated Construction of Node Software Using Attributes in a Ubiquitous Sens...
Automated Construction of Node Software Using Attributes in a Ubiquitous Sens...
JM code group
 
Performance Comparison between Pytorch and Mindspore
Performance Comparison between Pytorch and MindsporePerformance Comparison between Pytorch and Mindspore
Performance Comparison between Pytorch and Mindspore
ijdms
 
Everything you need to know about AutoML
Everything you need to know about AutoMLEverything you need to know about AutoML
Everything you need to know about AutoML
Arpitha Gurumurthy
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Herman Wu
 
Deep dive time series anomaly detection with different Azure Data Services
Deep dive time series anomaly detection with different Azure Data ServicesDeep dive time series anomaly detection with different Azure Data Services
Deep dive time series anomaly detection with different Azure Data Services
Marco Parenzan
 
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
VMware Tanzu
 
Deep Dive Time Series Anomaly Detection in Azure with dotnet
Deep Dive Time Series Anomaly Detection in Azure with dotnetDeep Dive Time Series Anomaly Detection in Azure with dotnet
Deep Dive Time Series Anomaly Detection in Azure with dotnet
Marco Parenzan
 
Beyond data and model parallelism for deep neural networks
Beyond data and model parallelism for deep neural networksBeyond data and model parallelism for deep neural networks
Beyond data and model parallelism for deep neural networks
JunKudo2
 
Navodaya R Experience resume
Navodaya R Experience resumeNavodaya R Experience resume
Navodaya R Experience resumeNavodaya R
 
Intro to Deep Learning with Keras - using TensorFlow backend
Intro to Deep Learning with Keras - using TensorFlow backendIntro to Deep Learning with Keras - using TensorFlow backend
Intro to Deep Learning with Keras - using TensorFlow backend
Amin Golnari
 

Similar to Neel Sundaresan - Teaching a machine to code (20)

OWF14 - Big Data : The State of Machine Learning in 2014
OWF14 - Big Data : The State of Machine  Learning in 2014OWF14 - Big Data : The State of Machine  Learning in 2014
OWF14 - Big Data : The State of Machine Learning in 2014
 
IRJET- Hand Sign Recognition using Convolutional Neural Network
IRJET- Hand Sign Recognition using Convolutional Neural NetworkIRJET- Hand Sign Recognition using Convolutional Neural Network
IRJET- Hand Sign Recognition using Convolutional Neural Network
 
GPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask LearnersGPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask Learners
 
NLP and Deep Learning for non_experts
NLP and Deep Learning for non_expertsNLP and Deep Learning for non_experts
NLP and Deep Learning for non_experts
 
Understanding Jupyter notebooks using bioinformatics examples
Understanding Jupyter notebooks using bioinformatics examplesUnderstanding Jupyter notebooks using bioinformatics examples
Understanding Jupyter notebooks using bioinformatics examples
 
Creating a new language to support open innovation
Creating a new language to support open innovationCreating a new language to support open innovation

Neel Sundaresan - Teaching a machine to code

  • 1. Teaching Machines to Code
      Neel Sundaresan, Microsoft Corp.
      Neel Sundaresan / MLConf 2019 NY
  • 2. It's all about data
      • ~19M software developers in the world (Source: Tech Republic, Ranger2013)
      • 2/3 professionals, rest hobbyists
      • 29 million IT/ICT professionals
      • Growing OSS data through GitHub, Stack Overflow, etc.
      • 10 years of GitHub
      • 10M users
      • 26M projects
      • 400M commits
      • ~7M committers
      • ~1M active users and ~250K monthly new users
      • ~800K new projects per month
  • 3. AI Opportunities in Software Development
  • 4. New Opportunities
      • Take advantage of large-scale data, advances in AI algorithms, the availability of distributed systems and the cloud, and powerful compute (GPUs) to revolutionize developer productivity
  • 5. Let's first start with data…
      • D. E. Knuth (1971) analyzed about 800 Fortran programs and found that:
      • 95% of the loops increment the index by 1
      • 85% of the loops had 5 statements or fewer
      • 53% of the loops were singly nested
      • A more recent analysis (Allamanis et al.) of 25 MLOC showed the following stats:
      • 90% have < 15 lines; 90% have no nesting; and very simple control structures
      • 50 classes of loop idioms cover 50% of concrete loops
      • Benefits
      • Data-driven frameworks for code refactoring
      • Opportunities for program opportunities
      • Language design opportunities
  • 6. Statistical model of code
      • Lexical/code generative models (tokenizers)
      • E.g. sequence-based models (n-gram models in NLP), sequence-to-sequence character models with RNNs and LSTMs, sparse pointer-based neural model for Python
      • Neural models are superior to n-gram models
      • They are more expensive to train and execute, and need a lot more data
      • They perform much better because one can model long-range declare-use scenarios
      • They can catch patterns across contexts better than n-grams (sequences of code that are similar but with changed variables: the sentiment of the code)
      • Word2Vec; more recently for code: Code2Vec, Code2Seq
  • 7. Statistical model of code
      • Representational model (abstract syntax trees)
      • These models are a better representation of code than sequence models but are more expensive
      • There is work on using LSTMs over such representations (for limited program synthesis applications)
  • 8. Statistical model of code
      • Latent model
      • Looking for hidden design patterns, programming idioms, standardized APIs, summaries, anomalies, etc.
      • Needs unsupervised learning: challenging!
      • Previous research has used tree substitution grammars to identify similar grammar productions (program tree fragments)
      • Graph-based representations are used to identify common API usage
  • 9. Application of code models
      • Recommenders. Example: code completion in IDEs
      • Instead of using alphabetical or default orders, statistical learning could
      • Early work by Bruch et al.
      • Bayesian graphical models using structures for predicting the next call, by Proksch, integrated into the Eclipse IDE
      • How to evaluate recommender systems?
      • Keystrokes saved? Overall productivity? Engagement models? Reduced bugs?
  • 10. Inferring coding conventions
      • Coding conventions for better maintenance
      • How to format code
      • Variable and class naming conventions (Allamanis et al.)
      • Alternative to linter rules…
  • 11. Inferring bugs
      • Buggy code identification is like anomaly detection
      • Buggy code has unusual patterns, and their probabilities are quite different from those of normal code
      • N-gram language model based complexity measures have shown good results, comparable to tools like FindBugs
      • Even syntax error reporting closest to where the error occurs
      • Since problematic code is rare (like anomalies, by definition), false positives are likely and high precision is hard to achieve
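
The idea that unusual code has low probability under a language model can be sketched with a tiny bigram model. This is only an illustration, not the measure used by the work the slide refers to: the corpus, the add-one smoothing, and the names `train_bigram` and `cross_entropy` are assumptions made for this example.

```python
import math
from collections import Counter

def train_bigram(sequences):
    """Count unigrams and bigrams over tokenized training snippets."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sequences:
        padded = ["<s>"] + list(tokens)
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def cross_entropy(tokens, unigrams, bigrams):
    """Average negative log2-probability per token, with add-one smoothing."""
    vocab_size = len(unigrams) + 1  # +1 reserves probability mass for unseen tokens
    padded = ["<s>"] + list(tokens)
    total = 0.0
    for prev, cur in zip(padded, padded[1:]):
        prob = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        total -= math.log2(prob)
    return total / len(tokens)

# Toy training corpus (made up for illustration).
corpus = [
    "for i in range ( n ) :".split(),
    "for j in range ( m ) :".split(),
    "for k in range ( n ) :".split(),
]
unigrams, bigrams = train_bigram(corpus)

typical = "for i in range ( n ) :".split()
unusual = "for i range in ) n ( :".split()

typical_score = cross_entropy(typical, unigrams, bigrams)
unusual_score = cross_entropy(unusual, unigrams, bigrams)
print(f"typical: {typical_score:.2f} bits/token, unusual: {unusual_score:.2f} bits/token")
```

A snippet whose score is far above the scores seen on normal code would be flagged for review. Because such snippets are rare, any threshold chosen this way will tend to produce false positives, which is the precision problem the slide mentions.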
  • 12. Program Synthesis
      • Autogenerating programs from specifications
      • With a vast amount of program examples and associated metadata, attempts to match the specs to the metadata and extract matching code
      • SCIgen (automatic paper generator from MIT): “SCIgen is a program that generates random Computer Science research papers, including graphs, figures, and citations. It uses a hand-written context-free grammar to form all elements of the papers. Our aim here is to maximize amusement, rather than coherence.” They use it to detect bogus conferences!
      • Airbnb Sketch2Code (design to code)
      • A UX web design mockup to HTML using deep learning (Pix2Code)
      • DeepCoder (MSR/U of Cambridge)
      • Uses inductive program synthesis: given a set of inputs/outputs, searches a space of candidate programs and finds one that matches
      • Works for DSLs (domain-specific languages) with limited constructs, not for languages like C++
      • Automatically finding patches (MIT Prophet/Genesis)
      • Bayou system from Rice U.
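
Inductive program synthesis over a small DSL can be sketched as a brute-force search for a program consistent with the input/output examples. This is not DeepCoder itself, which uses a learned model to guide the search; the four-primitive list DSL and the names `PRIMITIVES`, `run`, and `synthesize` are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical toy DSL: each primitive maps a list of ints to a list of ints.
PRIMITIVES = {
    "sort": lambda xs: sorted(xs),
    "reverse": lambda xs: list(reversed(xs)),
    "double": lambda xs: [2 * x for x in xs],
    "drop_negatives": lambda xs: [x for x in xs if x >= 0],
}

def run(program, xs):
    """Apply the named primitives to xs, left to right."""
    for name in program:
        xs = PRIMITIVES[name](xs)
    return xs

def synthesize(examples, max_length=3):
    """Return a shortest primitive sequence consistent with all examples, or None."""
    for length in range(1, max_length + 1):
        for program in product(PRIMITIVES, repeat=length):
            if all(run(program, inp) == out for inp, out in examples):
                return program
    return None

# Input/output specification: sort descending, then double every element.
examples = [
    ([3, 1, 2], [6, 4, 2]),
    ([0, 5, 1], [10, 2, 0]),
]
program = synthesize(examples)
print(program)
```

The search space grows exponentially with program length (here 4^k candidates of length k), which is one reason the approach is practical for DSLs with limited constructs and not for general-purpose languages.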
  • 13. A Case Study: IntelliSense (Code Completion)
  • 14. The learning dilemma! Deep learning vs. cheap learning
  • 19. IntelliSense
  • 20. Data Source
      • Number of C# repos: 2000+
      • Number of repos we were able to build and parse to form our dataset: 700+
      • Number of .cs documents in the dataset: 200K+
  • 21. What questions can we ask of this dataset?
      • How is C# used?
          1. Which are the most frequently used classes?
          2. Are there patterns in how methods of one class are used?
      • How to make recommendations?
          1. Will the same model and parameters work for all classes?
          2. Do we have enough data?
          3. Would the previous usage of methods from other classes help with prediction?
      • Which features are useful when making a prediction?
          1. Which pieces of information provided by code analyzers would be helpful?
          2. What is the reasonable segment of code to look at: the entire document/function or the most recent calls?
  • 22. How often is each class used?
      Top n classes | Coverage
      100           | 28%
      300           | 37.5%
      1,088         | 50%
      5,986         | 70%
      13,203        | 80%
      30,668        | 90%
      [Bar chart: Number of Invocations per Class for the Top 50 Classes; y-axis: total invocations in dataset, 0 to 45,000. Classes in chart order: string, System.Windows.Forms.Control, System.Collections.Generic.List, System.Linq.Enumerable, System.Array, System.Text.StringBuilder, System.Diagnostics.Debug, System.DateTime, System.Collections.Generic.Dictionary, System.Type, object, System.Math, System.IO.Path, double, System.IO.BinaryWriter, System.IO.File, System.Windows.Forms.Form, System.Exception, System.Reflection.MemberInfo, System.Convert, System.IO.BinaryReader, System.StringComparison, System.IO.Stream, System.Text.Encoding, System.IO.TextWriter, System.Collections.Generic.HashSet, System.Windows.Forms.AnchorStyles, …ntModel.ComponentResourceManager, System.Enum, System.Environment, …tem.Windows.Forms.TableLayoutPanel, Org.BouncyCastle.Math.BigInteger, System.Windows.Forms.TextBox, System.Linq.Expressions.Expression, …em.Collections.ObjectModel.Collection, System.Xml.XmlNode, System.Windows.Forms.ComboBox, System.Xml.XmlWriter, System.Linq.Queryable, System.Guid, System.Reflection.Assembly, System.Tuple, OpenGL.Gl.Delegates, System.Collections.Generic.IDictionary, System.Drawing.Graphics, System.TimeSpan, System.Reflection.BindingFlags, System.Uri, System.Drawing.Size]
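
A coverage table like the one on this slide can be derived from per-class invocation counts. The sketch below assumes the counts are already available as a `Counter`; the numbers and the function names `coverage_of_top_n` and `classes_needed` are made up for illustration and are not the talk's data or tooling.

```python
from collections import Counter

def coverage_of_top_n(invocation_counts, n):
    """Fraction of all invocations that belong to the n most-invoked classes."""
    total = sum(invocation_counts.values())
    top = sum(count for _, count in invocation_counts.most_common(n))
    return top / total

def classes_needed(invocation_counts, target):
    """Smallest number of top classes whose invocations reach the target coverage."""
    total = sum(invocation_counts.values())
    running = 0
    for rank, (_, count) in enumerate(invocation_counts.most_common(), start=1):
        running += count
        if running / total >= target:
            return rank
    return len(invocation_counts)

# Made-up invocation counts per class, for illustration only.
counts = Counter({"string": 50, "List": 20, "Enumerable": 15, "Array": 10, "Guid": 5})

print(coverage_of_top_n(counts, 2))   # share of invocations covered by the top 2 classes
print(classes_needed(counts, 0.8))    # classes needed to cover 80% of invocations
```

The long tail in the slide's table (about 1,088 classes for 50% coverage but 30,668 for 90%) is what makes per-class modeling choices and data sufficiency a real concern.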
  • 23. How often do we face the cold start problem?
      Invocation Composition in Different Class Groups
      Invocation                | Top 100 classes | Top 100-200 classes | Top 200-300 classes
      First invocation          | 14.34%          | 25.42%              | 36.41%
      Second invocation         | 9.63%           | 15.31%              | 17.85%
      Third invocation          | 6.65%           | 9.28%               | 9.88%
      Fourth invocation         | 5.30%           | 6.81%               | 6.89%
      Fifth invocation or after | 64.07%          | 43.18%              | 28.98%
  • 24. Sequence Model • A second-order Markov chain: the probability of the current invocation depends on the two previous invocations • Very fast to train • Performed quite well in both offline and online testing
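To make the second-order Markov chain concrete, here is a minimal sketch under stated assumptions. It is not the production model: the class name `SecondOrderMarkovModel`, the toy `sequences`, and the fallback to overall method frequency when no two-call context has been seen are illustrative choices (the editor's notes mention a frequency model for cold-start cases, but the exact fallback logic is an assumption here).

```python
from collections import Counter, defaultdict


class SecondOrderMarkovModel:
    """Predicts the next method call from the two previous calls on the same class."""

    def __init__(self):
        # (second-to-last call, last call) -> counts of the call that followed
        self.transitions = defaultdict(Counter)
        # Overall popularity, used when there is no usable context (cold start).
        self.frequencies = Counter()

    def fit(self, sequences):
        for seq in sequences:
            self.frequencies.update(seq)
            for i in range(2, len(seq)):
                self.transitions[(seq[i - 2], seq[i - 1])][seq[i]] += 1
        return self

    def recommend(self, previous_calls, k=5):
        context = tuple(previous_calls[-2:])
        counts = self.transitions.get(context) if len(context) == 2 else None
        if not counts:
            counts = self.frequencies  # assumed fallback: frequency ranking
        return [method for method, _ in counts.most_common(k)]


# Hypothetical training data: ordered method calls on `string` per document.
sequences = [
    ["IndexOf", "Substring", "Trim"],
    ["IndexOf", "Substring", "Trim"],
    ["IndexOf", "Substring", "ToUpper"],
    ["Trim", "Trim"],
]
model = SecondOrderMarkovModel().fit(sequences)
print(model.recommend(["IndexOf", "Substring"], k=2))
print(model.recommend([], k=1))
```

Training is a single counting pass over the call sequences, which is consistent with the slide's remark that the model is very fast to train.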
  • 25. Modeling Method Calls: Summary
    Sequence model performs better both in offline and online testing
    1. Frequency Model: Top-1 Accuracy: 38%
    3. Sequence Model: Top-1 Accuracy: 58%
    Model sizes shown: 5.3 MB, 1 MB
    [Figure: bar chart of percentage of invocations (0% to 40%) per string method: string.Format, string.Equals, string.IsNullOrE…, string.Replace, string.Trim, string.Substring, string.IndexOf, string.Contains, string.IsNullOr…, string.EndsWith, string.ToUpper, string.Compare…, string.LastIndex…, string.ToCharAr…, string.PadLeft, string.IndexOfA…]
  • 26. Our IntelliSense system • Languages supported • C#, Python, C++, Java, XAML, TypeScript • Platforms • VSCode, Visual Studio • Check out this blog:
  • 27. A Deep learning approach
    • The deep learning model consumes ASTs corresponding to code snippets as input for training
    • AST tokens are mapped to numeric embedding vectors, which are learned via backpropagation using Word2Vec
    • Substitute the method call receiver token with its inferred type, when available
    • Optionally normalize local variables according to <var:variable type>
    Pipeline steps: 1. Source code snippet; 2. Extract training sequences; 3. Vectorize; 4. Embed. Other labels in the figure: AST parser, Build matrix, Code embedding.
    Example (the callout "Suppose recommendation is requested here." marks the completion point in the snippet):
    2. Extracted training sequence:
       …. "loss", "=", "tf", ".", "reduce_sum", "(", "tf", ".", "square", "(", "linear_model", "-", "y", ")", ")", "\n", "optimizer", "=", "tf", ".", "tensorflow.train", "."
    3. Vectorized:
       array([11, 9, 4, 12, 11, 9, 8, 13, 14, 15, 16, 17, 18, 19, 20, 21, 14, 22, 16, 11, 9, 4, 12, 11, 9, 8, 13, 14, 23, 16, 11, 9, 3, 12, 11, 9, 5, 12, 15, 24, 22, 13, 13, 14, 25, 16, 11, 9, 7, 9, 6], dtype=int32)
    4. Embedded:
       array([[[-0.00179027, 0.01935565, -0.00102201, ..., -0.11528983, 0.02137219, 0.08332191], ..., [-0.04104977, 0.04417963, -0.01034168, ..., 0.04209893, 0.00140189, -0.10478071]]], dtype=float32)
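The preprocessing steps on this slide (receiver-type substitution, vectorization, embedding lookup) can be sketched in a few lines. This is an illustration under assumptions, not the authors' pipeline: all function names are hypothetical, the `inferred_types` mapping stands in for what a real type inference pass would supply, the embedding matrix is randomly initialized rather than learned with Word2Vec, and the embedding dimension is kept tiny for readability.

```python
import random


def substitute_receiver_types(tokens, inferred_types):
    """Replace a receiver token (the token right before ".") by its inferred type, when known."""
    out = list(tokens)
    for i in range(len(out) - 1):
        if out[i + 1] == "." and out[i] in inferred_types:
            out[i] = inferred_types[out[i]]
    return out


def build_vocabulary(token_sequences):
    """Assign each distinct token an integer id; id 0 is reserved for unknown tokens."""
    vocab = {"<unk>": 0}
    for seq in token_sequences:
        for tok in seq:
            vocab.setdefault(tok, len(vocab))
    return vocab


def vectorize(tokens, vocab):
    """Map tokens to integer ids, falling back to the unknown-token id."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]


def init_embeddings(vocab_size, dim, seed=0):
    """Random stand-in for a learned embedding matrix (one row per token id)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(vocab_size)]


# Hypothetical example: the receiver `train` is assumed to resolve to `tensorflow.train`.
raw_tokens = ["optimizer", "=", "tf", ".", "train", "."]
tokens = substitute_receiver_types(raw_tokens, {"train": "tensorflow.train"})
vocab = build_vocabulary([tokens])
ids = vectorize(tokens, vocab)
embeddings = init_embeddings(len(vocab), dim=4)
embedded = [embeddings[i] for i in ids]
print(tokens)
print(ids)
```

Substituting the inferred type for the receiver lets snippets that use different variable names for the same kind of object share one vocabulary entry.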
  • 28. Neural network architecture
    [Figure: code snippet tokens $c_0, c_1, \dots, c_T$ (callout: "Suppose recommendation is requested here.") → code embedding $x_0, x_1, \dots, x_T$ → stacked LSTM → hidden state $h_T$ with components $h_{T,0}, \dots, h_{T,d_h}$ → linear layer $l_0, \dots, l_{d_x}$ (predicted embedding vectors) → softmax prediction $y_0, y_1, y_2, \dots, y_{|V|}$]
    The task of method completion is basically predicting a token $m^*$ conditional on a sequence of input tokens $c_t$, $t = 0, \dots, T$, corresponding to the terminal nodes of the AST of a code snippet ending in a terminal ".".
    $x_t = L c_t$, where $L$ is the $d_x \times |V|$ word embedding matrix, $d_x$ is the word embedding dimension, and $V$ is the vocabulary.
    $h_t = f(x_t, h_{t-1})$, where $f$ is the stacked LSTM taking the previous hidden state and the current input and producing the next hidden state.
    $P(m \mid C) = y_t = \mathrm{softmax}(W h_t + b)$, where $W$ is the output projection matrix and $b$ is the bias.
    $m^* = \arg\max_m P(m \mid C)$
    The LSTM has 2 layers of 100 hidden units each, with recurrent dropout and L2 regularization applied.
    Ref: Svyatkovskyy,Fu,Sundaresan,Zhao
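The equations on this slide can be traced step by step in code. The sketch below is a deliberately small stand-in, not the model described in the talk: it uses a single LSTM layer instead of two, random untrained weights, tiny dimensions, no dropout or regularization, and hypothetical helper names (`lstm_step`, `predict_method`). It only shows how an embedding lookup, a recurrent update, and a softmax over the vocabulary fit together.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def lstm_step(x, h, c, params):
    """One LSTM update, i.e. h_t = f(x_t, h_{t-1}), with cell state c carried along."""
    xh = x + h  # concatenate the current input and the previous hidden state
    gates = {}
    for name in ("i", "f", "o", "g"):
        pre = [p + b for p, b in zip(matvec(params["W" + name], xh), params["b" + name])]
        activation = math.tanh if name == "g" else sigmoid
        gates[name] = [activation(p) for p in pre]
    c_new = [f * cp + i * g for f, cp, i, g in zip(gates["f"], c, gates["i"], gates["g"])]
    h_new = [o * math.tanh(cn) for o, cn in zip(gates["o"], c_new)]
    return h_new, c_new


def predict_method(token_ids, embeddings, params, W_out, b_out):
    """Return (argmax token id, probabilities) following x_t -> h_t -> softmax(W h_T + b)."""
    d_h = len(params["bi"])
    h, c = [0.0] * d_h, [0.0] * d_h
    for token_id in token_ids:
        h, c = lstm_step(embeddings[token_id], h, c, params)  # x_t is row c_t of the embedding table
    logits = [s + b for s, b in zip(matvec(W_out, h), b_out)]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


# Hypothetical toy sizes; the slide reports 2 layers of 100 units and 150-dimensional embeddings.
rng = random.Random(0)
vocab_size, d_x, d_h = 6, 4, 3
embeddings = random_matrix(vocab_size, d_x, rng)
params = {}
for gate in "ifog":
    params["W" + gate] = random_matrix(d_h, d_x + d_h, rng)
    params["b" + gate] = [0.0] * d_h
W_out = random_matrix(vocab_size, d_h, rng)
b_out = [0.0] * vocab_size

best, probs = predict_method([1, 2, 3, 4, 5, 4], embeddings, params, W_out, b_out)
print(best, probs)
```

With untrained weights the prediction is meaningless; the point is only the data flow from token ids to a probability distribution over the vocabulary.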
  • 29. Hyperparameter tuning
    Our model has several tunable hyperparameters, determined by random search optimization: model training is rerun until convergence (via early stopping) for each sampled combination, and the best-performing combination is selected by comparing validation accuracy.
    Hyperparameter                           Best value
    Base learning rate                       0.002
    Learning rate decay per epoch            0.97
    Num. recurrent neural network layers     2
    Num. hidden units in LSTM, per layer     100
    Type of RNN                              LSTM
    Batch size                               256
    Type of loss function                    Categorical cross-entropy
    Num. lookback tokens                     200+
    Num. timesteps for backpropagation       100
    Embedded vector dimension                150
    Stochastic optimization scheme           Adam
    Weight regularization of all layers      10
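Random search with early stopping, as described on this slide, follows a simple loop. The sketch below is an assumption-laden illustration rather than the authors' tuning code: the candidate values in `SEARCH_SPACE`, the patience of 3 epochs, and the function `toy_validation_accuracy` (a made-up curve standing in for real training and validation) are all hypothetical.

```python
import random

# Hypothetical candidate values; only the reported best values come from the slide.
SEARCH_SPACE = {
    "learning_rate": [0.0005, 0.001, 0.002, 0.005],
    "num_layers": [1, 2, 3],
    "hidden_units": [50, 100, 200],
    "batch_size": [64, 128, 256],
}


def sample_config(rng):
    """Draw one hyperparameter combination uniformly at random."""
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}


def train_with_early_stopping(validation_accuracy, config, max_epochs=50, patience=3):
    """Train until validation accuracy stops improving for `patience` epochs in a row."""
    best, stale = 0.0, 0
    for epoch in range(max_epochs):
        accuracy = validation_accuracy(config, epoch)
        if accuracy > best:
            best, stale = accuracy, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best


def random_search(validation_accuracy, trials=20, seed=0):
    """Return the sampled configuration with the highest validation accuracy."""
    rng = random.Random(seed)
    best_config, best_accuracy = None, -1.0
    for _ in range(trials):
        config = sample_config(rng)
        accuracy = train_with_early_stopping(validation_accuracy, config)
        if accuracy > best_accuracy:
            best_config, best_accuracy = config, accuracy
    return best_config, best_accuracy


def toy_validation_accuracy(config, epoch):
    """Made-up stand-in for one epoch of training followed by validation."""
    ceiling = 0.5
    ceiling += 0.2 if config["learning_rate"] == 0.002 else 0.0
    ceiling += 0.1 if config["num_layers"] == 2 else 0.0
    return min(ceiling, 0.2 * (epoch + 1))


best_config, best_accuracy = random_search(toy_validation_accuracy)
print(best_config, best_accuracy)
```

In a real setup, `validation_accuracy` would train the network for one more epoch and evaluate it on held-out data; everything else in the loop stays the same.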
  • 30. Offline model evaluation (top-5 accuracy)
    Offline precision for all classes lifted by almost 20%
    Category                                   Number of classes
    Improved with DL                           8014
    Approximately the same                     1488
    Declined with DL                           235
    Completion available with DL but not MC    263
    • Most of the completion classes improve with the deep learning approach
    • 2.5% of classes declined, mostly belonging to Python web microframeworks like Flask and Tornado
    • For some classes, type information of the receiver token is not available; DL is still able to provide completion in that case
  • 31. Some Numbers Ref: Svyatkovskyy,Fu,Sundaresan,Zhao
  • 32. Why use deep learning? • Deep learning models achieve better accuracy • Can be suitable for more advanced completion scenarios (not just methods) • Opportunity to predict out-of-vocabulary tokens • Why not? • Bad interpretability • Model sizes are bigger, performance is an issue
  • 33. Deployment challenges
    • Need to reduce model size on disk
      • Change the neural network architecture to reduce the number of trainable parameters
      • Reuse the input word embedding matrix as the output classification matrix, removing the large fully connected layer (model size reduction from 202 to 152 MB, with no accuracy loss)
    • Model compression
      • Apply post-training neural network quantization to store weight matrices in 8-bit integer format (further model size reduction from 152 to 38 MB, 3% accuracy loss)
    • Serving speed
      • Current serving speed on the edge is 5x slower than a cheap model
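The reported reduction from 152 to 38 MB is the 4x one would expect from storing each weight in 1 byte instead of 4. The sketch below illustrates the idea with symmetric per-tensor quantization; this particular scheme is an assumption, since the slide does not say which quantization method was used, and the function names and sample weights are hypothetical.

```python
from array import array


def quantize_int8(weights):
    """Symmetric post-training quantization: map floats to integers in [-127, 127]."""
    largest = max(abs(w) for w in weights)
    scale = largest / 127.0 if largest > 0 else 1.0
    quantized = array("b", (round(w / scale) for w in weights))  # "b" = signed 8-bit
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights at inference time."""
    return [q * scale for q in quantized]


# Hypothetical weights; a real model would quantize each weight matrix separately.
weights = [0.083, -0.115, 0.021, -0.001, 0.044, -0.104]

float_bytes = len(array("f", weights).tobytes())  # 32-bit floats
quantized, scale = quantize_int8(weights)
int8_bytes = len(quantized.tobytes())
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(float_bytes, int8_bytes, max_error)
```

The rounding error per weight is bounded by half the scale, which is the source of the small accuracy loss that typically accompanies this kind of compression.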
  • 34. Can we teach machines to review code?
    • What does data tell us?
    • Open source Python pull requests
    [Figure: bar chart, "Distribution of type in open source peer reviews of Python pull requests". X-axis: % (0 to 50). Y-axis: type of peer review: Affirmative reviews, Stylistic reviews, Docstring reviews, Python version related, Code duplication, Test related, Error/exception related, String manipulation related, Regular expression related, Print/debug/logging related, Import related]
    • ~43% of reviews are basic/stylistic reviews
    • ~15% of reviews are related to comments
    Gupta, Sundaresan (KDD 2018)
  • 35. Architecture
    [Figure: system architecture for automated code review.
    Training phase: Git repositories → crawl → historical code reviews → code and review preprocessing → (code, review) pairs → training data generation → relevant and non-relevant (code, review) pairs → vectorized (code, review) → multi-encoder deep learning model → model training.
    Testing phase: new pull request → code → review candidate selection from a repository of common reviews (built by review clustering) → (code, candidate review) pairs → multi-encoder deep learning model → review with maximum model confidence.
    Model: LSTM encoders (LSTM1, LSTM2, LSTM3, LSTM) over code, review, and code context → DNNs → relevance score.]
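The last step of the testing phase, picking the review with maximum model confidence, amounts to scoring every (code, candidate review) pair and taking the best one. The sketch below shows only that selection step under heavy assumptions: `toy_relevance_model` is a made-up token-overlap score standing in for the multi-encoder deep learning model, and the `threshold` parameter (returning no review when confidence is low) is an illustrative addition not mentioned on the slide.

```python
def select_review(code, candidate_reviews, relevance_model, threshold=0.5):
    """Return the candidate review with the highest relevance score, or None if below threshold."""
    scored = [(relevance_model(code, review), review) for review in candidate_reviews]
    if not scored:
        return None
    score, review = max(scored, key=lambda pair: pair[0])
    return review if score >= threshold else None


def toy_relevance_model(code, review):
    """Made-up stand-in for the trained model: share of review words that appear in the code."""
    code_tokens = set(code.replace("(", " ").replace(")", " ").split())
    review_tokens = set(review.lower().split())
    if not review_tokens:
        return 0.0
    return len(code_tokens & review_tokens) / len(review_tokens)


# Hypothetical repository of common reviews.
common_reviews = [
    "Use logging instead of print",
    "Add a docstring",
    "Looks good",
]

print(select_review("print(value)", common_reviews, toy_relevance_model, threshold=0.1))
print(select_review("print(value)", common_reviews, toy_relevance_model))
```

Restricting candidates to a clustered repository of common reviews keeps the number of pairs to score small for each new pull request.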
  • 36. Opportunities • OSS gives us lots and lots of data about code and coders • Cloud gives us the opportunity to process lots and lots of data • Recent and rapid advances in ML and AI • Take advantage of newer advances (Transformer networks / GPT-x) • But… • Challenges remain • While computer languages are synthetic, unlike natural languages and systems, they are programmed by humans • Scale, sparsity, speed • We have barely scratched the surface… a lot more to come…
  • 37. Thank you! • We have a number of initiatives in the area of applying AI at scale to Software engineering • We are hiring! email neels@Microsoft.com
  • 38. Addendum

Editor's Notes

  1. In order to create an IntelliSense that suggests the right method when you need it, we need lots of examples of realistic usage of the various classes in the .NET Framework and other common libraries. So we crawled all the public C# repos on GitHub with more than 100 stars. There were 2300 of those. We could automatically restore and build one third of these. This gave us 200,000 .cs files.
     When there are multiple solutions in one repo, we only parse the first one to avoid duplication. The ones that could not be parsed either did not contain a .sln file, or we could not load/open the first .sln file within 60 seconds, or we could not get a compilation of the code within 60 seconds. Each solution was given 2 minutes to "nuget restore" its NuGet packages.
     One of the issues that JoC raised is that many popular repos on GitHub are libraries, and we suspected that the coding patterns employed could be different from normal application solutions. JoC pointed us to a repo from MSIT.
     What is in the data? Different approaches we take. Talk a lot about data, rich information in the data. Jumping into sequence data too fast? Make a slide for questions? Are certain calls different from different. Data driven approach. Mention collaboration.
  2. Before we move on to the modeling part, let's take a detour and think about what this dataset enables us to answer, and look at some data that justify our approach. Generally, we want to understand which classes are used most often so we can focus our effort. Are there patterns in how methods are used that we can take advantage of? In relation to making useful recommendations, we would want to know what the most informative features are, and also how local our context information should be: the entire document, the current function, or the last few calls? Once we develop a model, we'd like to know whether one model works for all classes and whether we have sufficient training data.
  3. Here’s one question that’s readily answerable. This graph shows the number of invocations for the top 50 classes in our dataset. We can see that the most popular class is string, followed by WinForms Control and List. We can see that the popularity drops off very quickly, and it has a very long tail. The top 300 classes cover nearly 40% of invocations. (For the precision results you’ll see later, everything is reported for the top 300 classes.)
  4. As is the case with all recommender systems, we also face the cold start problem, meaning that there’s no contextual information to base our recommendation on. This happens when the current invocation is the first time this class is called in the current document, as we have little idea of what the developer is trying to write. Let’s focus on the light blue part of each bar. This is the portion of invocations that are the first of their class in the document. We see that for the top 100 classes, only about 15% of the time would we need to make a recommendation with no context. This gets more severe as the class becomes more rarely used.
  5. We have experimented with three types of models. The Frequency model is a simple popularity ranking which we use for the cold start scenarios. We then implemented variants of the Clustering model because it is a popular approach for API recommenders in the literature.  The precision of the Clustering model was modest, and so we implemented the Sequence model, which we thought was a better model of the coding process. It turned out to have the highest precision and is the model we use in production today. 
  6. Move this slide after representing snippets