
Replicating Parser Behavior using Neural Machine Translation

More than other machine learning techniques, neural networks have been shown to excel at tasks where humans traditionally outperform computers: recognizing objects in images, distinguishing spoken words from background noise, or playing Go. These are hard problems where hand-crafting a solution is rarely feasible due to their inherent complexity. Higher-level program comprehension is not dissimilar in nature: while a compiler or program analysis tool can quickly extract certain facts from (correctly written) code, it has no intrinsic ‘understanding’ of the data, and for the majority of real-world problems in program comprehension a human developer is needed, for example to find and fix a bug or to summarize the behavior of a method.
We perform a pilot study to determine the suitability of neural networks for processing plain-text source code. We find that, on the one hand, neural machine translation is too fragile to accurately tokenize code, while on the other hand, it can precisely recognize different types of tokens and make accurate guesses regarding their relative position in the local syntax tree. Our results suggest that neural machine translation may be exploited for annotating and enriching out-of-context code snippets to support automated tooling for code comprehension problems. We also identify several challenges in applying neural networks to learning from source code and determine key differences between applying existing neural network models to source code rather than natural language.


Replicating Parser Behavior using Neural Machine Translation

  1. 1. Carol V. Alexandru, Sebastiano Panichella, Harald C. Gall {alexandru,panichella,gall}@ifi.uzh.ch 23. May 2017
  2. 2. 1 Strange new worlds Nouveaux mondes étranges
  3. 3. 2 Strange new worlds Nouveaux mondes étranges
  6. 6. public int sum(int[] numbers) { int s = 0; for (int n : numbers) { s = s - n; } return s; }
  8. 8. Difficult to codify, hard to detect and fix
  9. 9. The human black-box brain can easily spot this
  10. 10. Where to begin? print("Hello World")
  11. 11. Where to begin? print("Hello World") Can we teach a machine to "read" code?
  12. 12. Neural Machine Translation: source sequences map to target sequences, e.g. "Space: the final frontier" → "Espace: frontière de l'infini"
  13. 13. Both sides are tokenized: Space : the final frontier / Espace : frontière de l' infini
  14. 14. The tokens are vectorized and a vocabulary is built: 808 41 5 241 1020 / 701 624 12 9 -174
  15. 15. The vocabulary has a maximum size; uncommon words may not be included and will be represented as a special "unknown word"
  16. 16. An RNN (LSTM/GRU) learns the mapping between the vectorized source and target sequences
  17. 17. Data Gathering and Preparation: clone the 1000 most-starred Java repositories (language:java sort:stars), then parse (ANTLR) and preprocess
  19. 19. Plain text source, one character per word: p r i n t l n ( " H e l l o ▯ W o r l d ! " ) ; real spaces are replaced with an unassigned Unicode character
  21. 21. Lexing instructions, one per character: 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 where 0 = continue or start a token, 1 = end a token, and a blank space = ignore the character
  23. 23. Why not translate to actual tokens? The target vocabulary would not contain all possible tokens (although there are ways around that...)
  24. 24. Tokens, one token per word: println ( "Hello▯World!" ) ; spaces inside tokens are replaced with an unassigned Unicode character
  26. 26. Node type & AST depth annotations: Expression│12 Expression│13 Literal│17 Expression│13 Statement│11 (only AST node types correlating to literal tokens are included)
  29. 29. The data creation tool is open source; define your own extractions and translations and apply them easily to 1000s of repos: https://bitbucket.org/sealuzh/parsenn. Four (2x2) datasets are created for two translation steps: plaintext → tokens and tokens → annotations
  30. 30. Results: Tokenization. Source: plain text source code, vocab size 2185, train 25M samples (1.7 GB), validation 2M samples (140 MB). Target: lexing instructions, vocab size 3, train 25M samples (1.7 GB), validation 2M samples (140 MB). Training pairs consist of character-separated Java source (e.g. i m p o r t ❘ a n d r o i d . g r a p h i c s . B i t m a p ;) and the corresponding 0/1 instruction streams
  31. 31. NMT Bi-RNN, 7 epochs, 7 days of training. Perplexity: 1.11
  32. 32. What is perplexity? In the context of NMT, perplexity describes how "confused" a probability model is on a given test data set. Lower perplexity is better; a perfect model has perplexity 1. The meaning of a perplexity value depends on the target vocab size
  34. 34. Failed translation example: @Test(expected = NullPointerException.class) → 10001100000001 1 000000000000000000000000011
  35. 35. Results: Token Annotation. Source: tokens, vocab size 50000, train 25M samples (904 MB), validation 2M samples (76 MB). Target: type/depth annotations, vocab size 87│4459, train 25M samples (3 GB), validation 2M samples (251 MB). Training pairs consist of token streams (e.g. import android . graphics . Bitmap ;) and annotation streams (e.g. ImportDeclaration│2 QualifiedName│3 QualifiedName│3 ...)
  36. 36. NMT RNN, 11 epochs, 2 days of training. Perplexity: 1.28
  38. 38. A successful example: List<Throwable> errors = TestHelper.trackPluginErrors(); with lexing instructions 000110000000011 000001 1 0000000001100000000000000001111 is annotated as [ClassOrInterfaceType|14] [TypeArguments|15] [ClassOrInterfaceType|18] [TypeArguments|15] [VariableDeclaratorId|15] [VariableDeclarator|14] [Primary|19] [Expression|17] [Expression|17] [Expression|16] [Expression|16] [LocalVariableDeclarationStatement|11]
  40. 40. Take-home messages: • NN can learn to "read" code (tokens / syntactic elements) → What else could we teach? Type resolution? Calls & attribute access? Inheritance? → Could we follow the "human path" of learning to program to teach an AI? • "If only we had good data" → Bug reports, commit messages etc. are still unstructured. This needs to change if we want to leverage deep learning in SE and PC. 16
  41. 41. Carol V. Alexandru, Sebastiano Panichella, Harald C. Gall {alexandru,panichella,gall}@ifi.uzh.ch 23. May 2017 Data creation tool: t.uzh.ch/Hb Paper: t.uzh.ch/Hc
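The character-level encoding described in the data-preparation slides (one character per "word", real spaces replaced by an unassigned Unicode character, one lexing instruction per character) can be sketched as follows. This is a minimal illustration, not the actual parsenn preprocessor; the `encode` helper and its token-span input format are hypothetical:

```python
# Sketch of the character-level target encoding (hypothetical helper).
# Each source character becomes one "word"; the target holds one lexing
# instruction per character:
#   0 = continue or start a token, 1 = end a token, blank = ignore character.
PLACEHOLDER = "\u25af"  # stands in for a real space inside the char stream

def encode(source, token_spans):
    """token_spans: (start, end) character offsets of the lexer's tokens."""
    chars = [PLACEHOLDER if c == " " else c for c in source]
    instructions = [" "] * len(source)   # default: ignore the character
    for start, end in token_spans:
        for i in range(start, end):
            instructions[i] = "0"        # inside a token
        instructions[end - 1] = "1"      # last character ends the token
    return " ".join(chars), " ".join(instructions)

# "int s;" lexes into the tokens "int", "s" and ";":
chars, instr = encode("int s;", [(0, 3), (4, 5), (5, 6)])
```

Note how the space between "int" and "s" is kept in the character stream (as the placeholder) but marked as ignorable in the instruction stream.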
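Decoding predicted lexing instructions back into a one-token-per-word stream is mechanical; a small sketch (hypothetical `apply_instructions` helper, assuming the instruction stream is perfectly aligned with the characters):

```python
def apply_instructions(chars, instructions):
    """Decode per-character lexing instructions back into tokens.
    chars/instructions are equal-length sequences; '0' continues the
    current token, '1' ends it, ' ' skips the character entirely."""
    tokens, current = [], []
    for ch, ins in zip(chars, instructions):
        if ins == " ":          # ignored character (e.g. whitespace)
            continue
        current.append(ch)
        if ins == "1":          # token boundary reached
            tokens.append("".join(current))
            current = []
    return tokens

tokens = apply_instructions("int s;", ["0", "0", "1", " ", "1", "1"])
```

This also shows why the approach is fragile: a single mispredicted instruction, as in the failed @Test example, shifts or merges every subsequent token.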
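The "vectorize and build vocabulary" step, including the capped vocabulary with a special "unknown word", might look like this in outline (hypothetical helpers, not the deck's actual tooling):

```python
from collections import Counter

UNK = "<unk>"  # the special "unknown word" for out-of-vocabulary tokens

def build_vocab(tokens, max_size):
    """Keep the (max_size - 1) most common tokens plus UNK."""
    kept = [tok for tok, _ in Counter(tokens).most_common(max_size - 1)]
    return {tok: i for i, tok in enumerate([UNK] + kept)}

def vectorize(tokens, vocab):
    """Map each token to its id, falling back to the UNK id."""
    return [vocab.get(tok, vocab[UNK]) for tok in tokens]

corpus = "the final frontier the final the".split()
vocab = build_vocab(corpus, max_size=3)  # room for UNK + 2 real tokens
ids = vectorize("the final frontier".split(), vocab)
```

With the vocabulary capped at 3, "frontier" falls outside the kept tokens and is mapped to the UNK id, which is exactly the problem slide 23 raises for translating to actual tokens.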
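The perplexity values reported for the two models (1.11 and 1.28) can be grounded with a toy computation: perplexity is the exponential of the average negative log-probability the model assigned to the correct target tokens, so a perfect model scores exactly 1 and uniform guessing scores the vocabulary size. Illustrative probabilities only, not the paper's measurements:

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability assigned to each
    correct target token; a perfect model scores exactly 1."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

assert perplexity([1.0, 1.0, 1.0]) == 1.0   # perfect model
uniform = round(perplexity([0.25] * 4), 6)  # uniform guessing over 4 words
```

This is why the slide stresses that the meaning of a perplexity value depends on the target vocabulary size: 1.11 over a 3-word target vocabulary is a much weaker result than 1.28 over thousands of annotation labels.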
