
Neel Sundaresan - Teaching a machine to code


  1. Teaching Machines to Code. Neel Sundaresan, Microsoft Corp.
  2. It's all about Data
     • ~19M software developers in the world (source: TechRepublic, Ranger 2013)
       • 2/3 professionals, the rest hobbyists
     • 29 million IT/ICT professionals
     • Growing OSS data through GitHub, Stack Overflow, etc.
     • 10 years of GitHub:
       • 10M users
       • 26M projects
       • 400M commits
       • ~7M committers
       • ~1M active users and ~250K new users monthly
       • ~800K new projects per month
  3. AI Opportunities in Software Development
  4. New Opportunities
     • Take advantage of large-scale data, advances in AI algorithms, and the availability of distributed systems, the cloud, and powerful compute (GPUs) to revolutionize developer productivity
  5. Let's first start with Data…
     • D. E. Knuth (1971) analyzed about 800 Fortran programs and found that:
       • 95% of the loops increment the index by 1
       • 85% of loops had 5 statements or fewer
       • 53% of the loops were singly nested
     • More recent analysis (Allamanis et al.) of 25 MLOC showed the following stats:
       • 90% of loops have < 15 lines, 90% have no nesting, and very simple control structures
       • 50 classes of loop idioms cover 50% of concrete loops
     • Benefits:
       • Data-driven frameworks for code refactoring
       • Program optimization opportunities
       • Language design opportunities
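The same kind of corpus statistics can be gathered for Python with nothing beyond the standard library. Below is a minimal sketch over a toy source string; the cited studies analyzed Fortran programs and a 25 MLOC corpus, so this illustrates only the method, not the findings.

```python
import ast

# Toy source; the studies above analyzed far larger corpora.
SOURCE = """
for i in range(10):
    total = 0
    for j in range(i):
        total += j
"""

def loop_stats(tree):
    stats = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.For, ast.While)):
            # Does any statement in the loop body contain another loop?
            nested = any(isinstance(n, (ast.For, ast.While))
                         for child in node.body for n in ast.walk(child))
            stats.append({"body_statements": len(node.body),
                          "contains_nested_loop": nested})
    return stats

print(loop_stats(ast.parse(SOURCE)))
# [{'body_statements': 2, 'contains_nested_loop': True},
#  {'body_statements': 1, 'contains_nested_loop': False}]
```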
  6. Statistical model of code
     • Lexical/code generative models (tokenizers)
       • E.g. sequence-based models (n-gram models in NLP), sequence-to-sequence character models in RNNs/LSTMs, a sparse pointer-based neural model for Python
     • Neural models are superior to n-gram models:
       • more expensive to train and execute, and they need a lot more data
       • perform much better because one can model long-range declare-use scenarios
       • can catch patterns across contexts better than n-grams (sequences of code that are similar but with changed variables, i.e. the "sentiment" of the code)
     • Word2Vec; for code, more recently: Code2Vec, Code2Seq (see the sketch below)
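To make the embedding idea concrete, here is a minimal sketch of training Word2Vec-style vectors over tokenized code, so that tokens used in similar contexts end up with similar vectors. It assumes the gensim package and a hypothetical three-snippet corpus; Code2Vec and Code2Seq use richer, path-based representations than this.

```python
from gensim.models import Word2Vec  # assumed dependency

# Hypothetical tokenized snippets standing in for a mined corpus.
token_corpus = [
    ["for", "i", "in", "range", "(", "n", ")", ":"],
    ["for", "j", "in", "range", "(", "m", ")", ":"],
    ["while", "i", "<", "n", ":"],
]

model = Word2Vec(token_corpus, vector_size=32, window=3, min_count=1, epochs=50)
# Tokens that occur in similar contexts ("i" and "j") get similar vectors.
print(model.wv.most_similar("i", topn=3))
```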
  7. Statistical model of code
     • Representational model (abstract syntax trees)
       • These models represent code better than sequence models but are more expensive
       • There is work on using LSTMs over such representations (for limited program synthesis applications)
  8. Statistical model of code
     • Latent model
       • Looking for hidden design patterns, programming idioms, standardized APIs, summaries, anomalies, etc.
       • Needs unsupervised learning: challenging!
       • Previous research has used tree substitution grammars to identify similar grammar productions (program tree fragments)
       • Graph-based representations have been used to identify common API usage
  9. Application of code models
     • Recommenders. Example: code completion in IDEs
       • Instead of alphabetical or default ordering, statistical learning can rank completions by likelihood
       • Early work by Bruch et al.
       • Bayesian graphical models using structure to predict the next call, by Proksch, integrated into the Eclipse IDE
     • How do we evaluate recommender systems? Keystrokes saved? Overall productivity? Engagement models? Reduced bugs? (One concrete metric is sketched below.)
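As an example of one such metric, here is a hedged sketch of a "keystrokes saved" computation: it assumes a completion is accepted as soon as the true method ranks first, and the ranker shown is a hypothetical frequency-ordered prefix filter, not the production system.

```python
def keystrokes_saved(target, ranked_suggestions_for):
    """Characters the user skips, assuming they accept the completion as
    soon as the true method is ranked first for the typed prefix."""
    for typed in range(len(target) + 1):
        suggestions = ranked_suggestions_for(target[:typed])
        if suggestions and suggestions[0] == target:
            return len(target) - typed
    return 0

# Hypothetical ranker: a fixed, frequency-ordered vocabulary filtered by prefix.
vocab = ["IsNullOrEmpty", "IndexOf", "Insert"]
ranker = lambda prefix: [m for m in vocab if m.startswith(prefix)]
print(keystrokes_saved("IndexOf", ranker))  # 5: ranked first after typing "In"
```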
  10. Inferring coding conventions
     • Coding conventions for better maintenance
       • How to format code
       • Variable and class naming conventions (Allamanis et al.)
     • An alternative to linter rules…
  11. Inferring bugs
     • Buggy-code identification is like anomaly detection
     • Buggy code has unusual patterns whose probabilities differ markedly from normal code
     • N-gram language-model-based complexity measures have shown good results, comparable to tools like FindBugs
     • They can even report syntax errors closest to where the error occurs
     • Since problematic code is rare (like anomalies, by definition), false positives are likely and high precision is hard to achieve
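A minimal sketch of the "unnaturalness" signal behind these results: score each token sequence by its average negative log-probability under a smoothed bigram model trained on a (here, toy) corpus, and treat unusually improbable sequences as suspect. The corpus and smoothing constant are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict

# Toy "normal" corpus; real systems train on millions of lines.
train = [["if", "x", "is", "None", ":"], ["if", "y", "is", "None", ":"]]
bigrams = defaultdict(Counter)
unigrams = Counter()
for toks in train:
    unigrams.update(toks)
    for a, b in zip(toks, toks[1:]):
        bigrams[a][b] += 1

def cross_entropy(tokens, alpha=0.1):
    # Additive smoothing so unseen bigrams still get nonzero probability.
    vocab_size = len(unigrams) + 1
    bits = 0.0
    for a, b in zip(tokens, tokens[1:]):
        p = (bigrams[a][b] + alpha) / (sum(bigrams[a].values()) + alpha * vocab_size)
        bits -= math.log2(p)
    return bits / max(len(tokens) - 1, 1)

print(cross_entropy(["if", "x", "is", "None", ":"]))  # low: familiar pattern
print(cross_entropy(["if", "None", "x", "is", ":"]))  # high: flagged as unusual
```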
  12. Program Synthesis
     • Auto-generating programs from specifications
     • With vast amounts of program examples and associated metadata, attempts to match the specs to the metadata and extract matching code
     • SCIgen (automatic paper generator from MIT): "SCIgen is a program that generates random Computer Science research papers, including graphs, figures, and citations. It uses a hand-written context-free grammar to form all elements of the papers. Our aim here is to maximize amusement, rather than coherence." They use it to detect bogus conferences!
     • Airbnb Sketch2Code (design to code)
     • A UX web-design mockup to HTML using deep learning (Pix2Code)
     • DeepCoder (MSR / University of Cambridge)
       • Uses inductive program synthesis: given a set of input/output examples, it searches a space of candidate programs and finds one that matches (see the sketch below)
       • Works for DSLs (domain-specific languages) with limited constructs, not for languages like C++
     • Automatically finding patches (MIT Prophet/Genesis)
     • Bayou system from Rice University
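To show the inductive-synthesis loop in miniature, the sketch below enumerates short compositions of primitives from a made-up three-operator list DSL and returns the first program consistent with all input/output examples. DeepCoder's actual DSL and its learned search guidance are far richer than this.

```python
from itertools import product

# Made-up three-operator list DSL (not DeepCoder's DSL).
DSL = {
    "reverse": lambda xs: xs[::-1],
    "sort":    lambda xs: sorted(xs),
    "drop1":   lambda xs: xs[1:],
}

def synthesize(examples, max_len=3):
    # Enumerate programs from shortest to longest; return the first match.
    for length in range(1, max_len + 1):
        for ops in product(DSL, repeat=length):
            def run(xs, ops=ops):
                for op in ops:
                    xs = DSL[op](xs)
                return xs
            if all(run(inp) == out for inp, out in examples):
                return ops
    return None

# Two input/output examples; the search finds ('sort', 'drop1', 'reverse').
print(synthesize([([3, 1, 2], [3, 2]), ([5, 4, 6], [6, 5])]))
```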
  13. A Case Study: IntelliSense (Code Completion)
  14. The learning dilemma! Deep Learning vs. Cheap Learning
  15. IntelliSense
  16. IntelliSense
  17. IntelliSense
  18. IntelliSense
  19. IntelliSense
  20. Data Source
     • Number of C# repos: 2000+
     • Number of repos we were able to build and parse to form our dataset: 700+
     • Number of .cs documents in the dataset: 200K+
  21. What questions can we ask of this dataset?
     • Which features are useful? How is C# used?
       1. Which are the most frequently used classes?
       2. Are there patterns in how methods of one class are used?
     • How do we make recommendations?
       1. Will the same model and parameters work for all classes?
       2. Do we have enough data?
       3. Would the previous usage of methods from other classes help with prediction?
     • When making a prediction:
       1. Which pieces of information provided by code analyzers would be helpful?
       2. What is the reasonable segment of code to look at: the entire document/function, or the most recent calls?
  22. How often is each class used?
       Top n classes | Coverage
       100           | 28%
       300           | 37.5%
       1,088         | 50%
       5,986         | 70%
       13,203        | 80%
       30,668        | 90%
     • [Chart: number of invocations per class for the top 50 classes (y-axis: total invocations in dataset); string, System.Windows.Forms.Control, System.Collections.Generic.List, and System.Linq.Enumerable lead the list]
  23. How often do we face the cold start problem?
     • Invocation composition in different class groups:
       Invocation position       | Top 100 classes | Top 100-200 classes | Top 200-300 classes
       First invocation          | 14.34%          | 25.42%              | 36.41%
       Second invocation         | 9.63%           | 15.31%              | 17.85%
       Third invocation          | 6.65%           | 9.28%               | 9.88%
       Fourth invocation         | 5.30%           | 6.81%               | 6.89%
       Fifth invocation or after | 64.07%          | 43.18%              | 28.98%
  24. Sequence Model
     • A second-order Markov chain: the probability of the current invocation depends on the two previous invocations (a minimal sketch follows)
     • Very fast to train
     • Performed quite well in both offline and online testing
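A minimal sketch of such a second-order Markov model, with hypothetical method-call sequences as training data: the next call is ranked by how often it followed the two preceding calls.

```python
from collections import Counter, defaultdict

# Hypothetical method-call sequences mined from code.
call_sequences = [
    ["Open", "Read", "Close"],
    ["Open", "Read", "Read", "Close"],
    ["Open", "Write", "Close"],
]

transitions = defaultdict(Counter)
for seq in call_sequences:
    padded = ["<s>", "<s>"] + seq  # pad so the first calls also get context
    for a, b, c in zip(padded, padded[1:], padded[2:]):
        transitions[(a, b)][c] += 1

def complete(prev2, prev1, top_k=3):
    # Rank candidate next calls by how often they followed (prev2, prev1).
    return [call for call, _ in transitions[(prev2, prev1)].most_common(top_k)]

print(complete("Open", "Read"))  # e.g. ['Close', 'Read']
```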
  25. Modeling Method Calls: Summary
     • The sequence model performs better in both offline and online testing:
       Model           | Size   | Top-1 accuracy
       Frequency model | 1 MB   | 38%
       Sequence model  | 5.3 MB | 58%
     • [Chart: percentage of invocations for the top string methods, led by string.Format, string.Equals, and string.IsNullOrEmpty]
  26. Our IntelliSense system
     • Languages supported: C#, Python, C++, Java, XAML, TypeScript
     • Platforms: VSCode, Visual Studio
     • Check out this blog:
  27. A Deep Learning approach
     Suppose a recommendation is requested at a given point in the code.
     • The deep learning model consumes ASTs corresponding to code snippets as input for training
     • AST tokens are mapped to numeric embedding vectors, which are learned via backpropagation as in Word2Vec
     • The method-call receiver token is substituted with its inferred type, when available
     • Local variables are optionally normalized to <var:variable type>
     • Pipeline: (1) source code snippet → (2) extract training sequences with an AST parser → (3) vectorize → (4) embed into the code-embedding matrix
     • Example token sequence: "loss", "=", "tf", ".", "reduce_sum", "(", "tf", ".", "square", "(", "linear_model", "-", "y", ")", ")", "\n", "optimizer", "=", "tf", ".", "tensorflow.train", "."
     • Vectorized: array([11, 9, 4, 12, 11, 9, 8, 13, 14, 15, 16, 17, 18, 19, 20, 21, 14, 22, 16, 11, 9, 4, 12, 11, 9, 8, 13, 14, 23, 16, 11, 9, 3, 12, 11, 9, 5, 12, 15, 24, 22, 13, 13, 14, 25, 16, 11, 9, 7, 9, 6], dtype=int32)
     • Embedded: array([[[-0.00179027, 0.01935565, -0.00102201, ..., -0.11528983, 0.02137219, 0.08332191], ..., [-0.04104977, 0.04417963, -0.01034168, ..., 0.04209893, 0.00140189, -0.10478071]]], dtype=float32)
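Steps (3) and (4) of this pipeline in miniature: the sketch below maps tokens to integer ids and looks them up in an embedding matrix. The matrix is randomly initialized here, whereas in the real system it is learned via backpropagation; the 150-dimensional size follows the hyperparameter slide later in the talk.

```python
import numpy as np

tokens = ["loss", "=", "tf", ".", "reduce_sum", "(", "tf", ".", "square", "("]

# Step 3 (vectorize): assign each distinct token an integer id.
vocab = {tok: i for i, tok in enumerate(dict.fromkeys(tokens))}
token_ids = np.array([vocab[t] for t in tokens], dtype=np.int32)

# Step 4 (embed): row lookup in the embedding matrix (random here, learned in practice).
embedding_dim = 150
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(vocab), embedding_dim)).astype(np.float32)
embedded = embedding_matrix[token_ids]

print(token_ids)       # e.g. [0 1 2 3 4 5 2 3 6 5]
print(embedded.shape)  # (10, 150)
```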
  28. Neural network architecture
     • Architecture: code snippets → code embedding → stacked LSTM → linear layer → softmax prediction over the vocabulary
     • The task of method completion is to predict a token m* conditional on a sequence of input tokens c_t, t = 0, ..., T, corresponding to the terminal nodes of the AST of a code snippet C ending in the terminal "."
     • x_t = L c_t, where L is the d_x × |V| word-embedding matrix, d_x is the word-embedding dimension, and V is the vocabulary
     • h_t = f(x_t, h_{t-1}), where f is the stacked LSTM, taking the previous hidden state and the current input and producing the next hidden state
     • P(m|C) = y_T = Softmax(W h_T + b), where W is the output projection matrix and b is the bias
     • m* = argmax_m P(m|C)
     • The LSTM has 2 layers of 100 hidden units each, with recurrent dropout and L2 regularization
     Ref: Svyatkovskiy, Fu, Sundaresan, Zhao
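The equations above translate fairly directly into a Keras-style model. This is a sketch under stated assumptions, not the production network: vocabulary size, dropout rate, and sequence length are placeholders, while the embedding dimension, hidden units, layer count, optimizer, learning rate, and loss follow the hyperparameter slide.

```python
import tensorflow as tf

VOCAB_SIZE = 10000  # |V|: placeholder
EMBED_DIM = 150     # d_x, per the hyperparameter slide
HIDDEN = 100        # hidden units per LSTM layer
TIMESTEPS = 100     # timesteps for backpropagation

model = tf.keras.Sequential([
    tf.keras.Input(shape=(TIMESTEPS,)),
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),         # x_t = L c_t
    tf.keras.layers.LSTM(HIDDEN, return_sequences=True,
                         recurrent_dropout=0.2),              # stacked LSTM, layer 1
    tf.keras.layers.LSTM(HIDDEN, recurrent_dropout=0.2),      # layer 2 -> h_T
    tf.keras.layers.Dense(VOCAB_SIZE, activation="softmax"),  # Softmax(W h_T + b)
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.002),
              loss="categorical_crossentropy")
model.summary()
```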
  29. Hyperparameter tuning
     • Our model has several tunable hyperparameters, determined by random-search optimization: rerun model training to convergence (with early stopping) and select the best-performing combination by validation accuracy (a minimal search loop is sketched below)
       Hyperparameter                       | Best value
       Base learning rate                   | 0.002
       Learning rate decay per epoch        | 0.97
       Num. recurrent neural network layers | 2
       Num. hidden units in LSTM, per layer | 100
       Type of RNN                          | LSTM
       Batch size                           | 256
       Type of loss function                | Categorical cross-entropy
       Num. lookback tokens                 | 200+
       Num. timesteps for backpropagation   | 100
       Embedded vector dimension            | 150
       Stochastic optimization scheme       | Adam
       Weight regularization of all layers  | 10
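A minimal random-search loop matching the procedure described above. `train_and_validate` is a hypothetical stand-in for a full training run with early stopping that returns validation accuracy, and the search-space values are illustrative.

```python
import random

search_space = {
    "base_learning_rate": [1e-4, 5e-4, 1e-3, 2e-3, 5e-3],
    "lr_decay_per_epoch": [0.90, 0.95, 0.97, 1.0],
    "num_lstm_layers":    [1, 2, 3],
    "hidden_units":       [50, 100, 200],
    "batch_size":         [64, 128, 256],
}

def random_search(train_and_validate, n_trials=20, seed=0):
    rng = random.Random(seed)
    best_score, best_config = float("-inf"), None
    for _ in range(n_trials):
        config = {k: rng.choice(v) for k, v in search_space.items()}
        score = train_and_validate(config)  # full training run w/ early stopping
        if score > best_score:
            best_score, best_config = score, config
    return best_config, best_score

# Toy stand-in objective for demonstration (real use: validation accuracy).
demo = lambda cfg: -abs(cfg["base_learning_rate"] - 2e-3) - abs(cfg["hidden_units"] - 100)
print(random_search(demo, n_trials=50))
```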
  30. Offline model evaluation (top-5 accuracy)
     • Offline precision across all classes lifted by almost 20%
       Category                                | Number of classes
       Improved with DL                        | 8014
       Approximately the same                  | 1488
       Declined with DL                        | 235
       Completion available with DL but not MC | 263
     • Most completion classes improve with the deep learning approach
     • 2.5% of classes decline, mostly belonging to Python web microframeworks like Flask and Tornado
     • For some classes, type information of the receiver token is not available; DL is still able to provide completions in that case
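For clarity, this is how a top-5 accuracy figure like the one above can be computed: a completion counts as a hit if the true method appears anywhere in the model's five highest-ranked suggestions. The suggestion lists below are toy data.

```python
def top_k_accuracy(predictions, targets, k=5):
    """predictions: one ranked suggestion list per completion site;
    targets: the method actually invoked at each site."""
    hits = sum(1 for ranked, truth in zip(predictions, targets) if truth in ranked[:k])
    return hits / len(targets)

preds = [["Format", "Equals", "Replace", "Trim", "Substring"],
         ["IndexOf", "Contains", "EndsWith", "ToUpper", "PadLeft"]]
print(top_k_accuracy(preds, ["Trim", "Split"]))  # 0.5: one hit out of two
```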
  31. Some Numbers (Ref: Svyatkovskiy, Fu, Sundaresan, Zhao)
  32. Why use deep learning?
     • The deep learning model achieves better accuracy
     • It can suit more advanced completion scenarios (not just methods)
     • Opportunity to predict out-of-vocabulary tokens
     • Why not?
       • Poor interpretability
       • Model sizes are bigger, and serving performance is an issue
  33. Deployment challenges
     • Need to reduce model size on disk
     • Change the neural network architecture to reduce the number of trainable parameters
       • Reuse the input word-embedding matrix as the output classification matrix, removing the large fully connected layer (model size reduction from 202 MB to 152 MB, with no accuracy loss; see the sketch below)
     • Model compression
       • Apply post-training neural network quantization to store weight matrices in 8-bit integer format (further model size reduction from 152 MB to 38 MB, with 3% accuracy loss)
     • Serving speed
       • Current serving speeds on the edge are 5x slower than the cheap model
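A sketch of the first reduction, weight tying, in PyTorch: the output classification matrix is the same tensor as the input embedding matrix, so the model stores one vocabulary-by-dimension matrix instead of two. All dimensions here are illustrative, and tying requires the LSTM output size to equal the embedding size.

```python
import torch
import torch.nn as nn

class TiedCompletionModel(nn.Module):
    def __init__(self, vocab_size=10000, dim=150):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, num_layers=2, batch_first=True)
        self.out = nn.Linear(dim, vocab_size, bias=False)
        self.out.weight = self.embed.weight  # tie output projection to embedding

    def forward(self, token_ids):
        h, _ = self.lstm(self.embed(token_ids))
        return self.out(h[:, -1, :])  # next-token logits

model = TiedCompletionModel()
logits = model(torch.randint(0, 10000, (1, 20)))
print(logits.shape)  # torch.Size([1, 10000])
# parameters() deduplicates shared tensors, so the tied matrix is counted once.
print(sum(p.numel() for p in model.parameters()))
```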
  34. Can we teach machines to review code?
     • What does the data tell us? Open-source Python pull requests:
     • [Chart: distribution of review types in open-source peer reviews of Python pull requests. Categories: affirmative, stylistic, docstring, Python-version-related, code duplication, test-related, error/exception-related, string-manipulation-related, regular-expression-related, print/debug/logging-related, and import-related reviews]
     • ~43% of reviews are basic/stylistic reviews
     • ~15% of reviews are related to comments
     Gupta, Sundaresan (KDD 2018)
  35. Architecture
     • Training phase: crawl historical code reviews from Git repositories → code and review preprocessing → (code, review) pairs → training-data generation → relevant/non-relevant (code, review) pairs → vectorize → train the multi-encoder deep learning model (LSTM encoders over the code, the review, and the code context, feeding DNNs that output a relevance score)
     • Testing phase: for a new pull request, select candidate reviews from a repository of common reviews built by review clustering → score each (code, candidate review) pair with the multi-encoder model → return the review with maximum model confidence
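A rough sketch of the relevance-scoring core of this pipeline: separate LSTM encoders embed the code and a candidate review, and a small feed-forward head scores the pair. Sizes and structure are assumptions for illustration; the actual multi-encoder model of Gupta and Sundaresan (KDD 2018) also encodes code context and differs in detail.

```python
import torch
import torch.nn as nn

class RelevanceScorer(nn.Module):
    def __init__(self, vocab_size=5000, dim=64):
        super().__init__()
        self.code_embed = nn.Embedding(vocab_size, dim)
        self.review_embed = nn.Embedding(vocab_size, dim)
        self.code_enc = nn.LSTM(dim, dim, batch_first=True)
        self.review_enc = nn.LSTM(dim, dim, batch_first=True)
        self.score = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                   nn.Linear(dim, 1))

    def forward(self, code_ids, review_ids):
        # Final hidden states summarize the code hunk and the candidate review.
        _, (code_h, _) = self.code_enc(self.code_embed(code_ids))
        _, (rev_h, _) = self.review_enc(self.review_embed(review_ids))
        pair = torch.cat([code_h[-1], rev_h[-1]], dim=-1)
        return torch.sigmoid(self.score(pair))  # relevance in (0, 1)

scorer = RelevanceScorer()
code = torch.randint(0, 5000, (1, 30))    # toy token ids for a code hunk
review = torch.randint(0, 5000, (1, 12))  # toy token ids for a candidate review
print(scorer(code, review))
```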
  36. Opportunities
     • OSS gives us lots and lots of data about code and coders
     • The cloud gives us the opportunity to process lots and lots of data
     • Recent and rapid advances in ML and AI
     • Take advantage of newer advances (Transformer networks / GPT-x)
     • But… challenges remain
       • While computer languages are synthetic, unlike natural languages, the systems are still programmed by humans
       • Scale, sparsity, speed
     • We have barely scratched the surface… a lot more to come
  37. Thank you!
     • We have a number of initiatives in the area of applying AI at scale to software engineering
     • We are hiring! Email neels@Microsoft.com
  38. Addendum
