A Syntactic Neural Model
for General-Purpose Code
Generation
Pengcheng Yin, Graham Neubig
ACL 2017
M1 Tomoya Ogata
1
Abstract
• Input
• natural language descriptions
• Output
• source code written in a general-purpose programming
language
• Existing data-driven methods treat this problem as a
language generation task without considering the
underlying syntax of the target programming language
• Proposes a novel neural architecture powered by a
grammar model that explicitly captures the target syntax
as prior knowledge
2
The Code Generation Problem
Given an NL description x, the task is to generate the code
snippet c in a modern PL based on the intent of x.
We define a probabilistic grammar model that generates an
AST y given x.
y is then deterministically converted to the corresponding
surface code.
An AST is generated by applying a sequence of production
rules, each composed of a head node and multiple child nodes
(illustrated in the sketch after this slide).
3
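A minimal illustration of the AST/surface-code relationship using Python's built-in ast module (not the paper's implementation): a snippet parses into a tree of typed nodes built from production rules, and the tree converts deterministically back to surface code.

```python
import ast

# A tiny code snippet c; in the paper, c is generated from an NL description x.
code = "if x > 0:\n    y = foo(x)"

tree = ast.parse(code)            # surface code -> AST (typed nodes: If, Compare, Assign, Call, ...)
print(ast.dump(tree, indent=2))   # each node is a production rule head with child fields

print(ast.unparse(tree))          # AST -> surface code, deterministic (Python 3.9+)
```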
Grammar Model
• APPLYRULE Actions
• GENTOKEN Actions
4
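A hypothetical sketch (names are illustrative, not the authors' code) of the two action types the grammar model emits; an AST derivation is simply a sequence of these actions.

```python
from dataclasses import dataclass
from typing import Union, List

@dataclass
class ApplyRule:
    rule: str           # a production rule, e.g. "Expr -> Call(expr func, expr* args)"

@dataclass
class GenToken:
    token: str          # a terminal token, e.g. "foo", or a stop marker such as "</n>"

Action = Union[ApplyRule, GenToken]

# A toy action sequence deriving (part of) an AST:
actions: List[Action] = [
    ApplyRule("Module -> stmt* body"),
    ApplyRule("stmt -> Expr(expr value)"),
    ApplyRule("expr -> Name(identifier id)"),
    GenToken("foo"),
    GenToken("</n>"),
]
```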
APPLYRULE Actions
• APPLYRULE chooses a production rule r from the subset of
rules whose head matches the type of the frontier node n_ft
• the selected rule r is used to expand n_ft by appending all
child nodes specified by the production (see the sketch after
this slide)
• When a variable terminal node is added to the derivation
and becomes the frontier node, the grammar model switches
to GENTOKEN actions to populate the variable terminal with
tokens
5
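A toy sketch of APPLYRULE under an assumed miniature grammar (a dict from head type to candidate productions); the real rules come from Python's abstract grammar.

```python
class Node:
    """A derivation-tree node; field names are illustrative."""
    def __init__(self, type_):
        self.type = type_
        self.children = []

# Miniature grammar: head node type -> candidate productions (lists of child node types).
GRAMMAR = {
    "stmt": [["If"], ["Assign"]],
    "If":   [["expr", "stmt*", "stmt*"]],   # test, body, orelse
}

def apply_rule(frontier, production):
    """Expand the frontier node n_ft by appending the child nodes the production specifies."""
    assert production in GRAMMAR[frontier.type], "the rule's head must match the type of n_ft"
    for child_type in production:
        frontier.children.append(Node(child_type))
    return frontier.children

root = Node("stmt")
apply_rule(root, ["If"])                                   # APPLYRULE: stmt -> If
apply_rule(root.children[0], ["expr", "stmt*", "stmt*"])   # APPLYRULE: expand the If node
```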
GENTOKEN Actions
Once we reach a frontier node n_ft that corresponds to a
variable type, GENTOKEN actions are used to fill this node
with values.
At each time step, GENTOKEN appends one terminal token to
the current frontier variable node (sketched after this slide).
6
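A minimal sketch of GENTOKEN: the variable terminal is filled one token per time step until a special stop token closes it (the marker name below is illustrative).

```python
def fill_variable_terminal(predicted_tokens, stop="</n>"):
    """Append one terminal token per GENTOKEN action until the stop token is emitted."""
    value = []
    for tok in predicted_tokens:   # one GENTOKEN action per time step
        if tok == stop:
            break
        value.append(tok)
    return " ".join(value)

# e.g. filling a string-literal node token by token:
print(fill_variable_terminal(["hello", "world", "</n>"]))   # -> "hello world"
```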
Tracking Generation States
Action Embedding: a_t
Context Vector: c_t
Parent Feeding: p_t
(how these feed each decoder step is sketched after this slide)
7
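A rough PyTorch sketch (dimensions and wiring are assumptions, not the authors' code) of how the three vectors listed above feed one decoder step: the previous action embedding a_{t-1}, the attentional context vector c_t, and the parent feeding vector p_t are concatenated into the LSTM input, and the resulting state s_t is used to score the next action.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions loosely following the configuration slide.
EMB, HID = 128, 256
P = HID + EMB   # parent feeding = [s_{p_t}; a_{p_t}] (next slide)

cell = nn.LSTMCell(input_size=EMB + HID + P, hidden_size=HID)

def decoder_step(a_prev, c_t, p_t, state):
    """One decoder step over the action sequence."""
    x_t = torch.cat([a_prev, c_t, p_t], dim=-1)   # concatenate a_{t-1}, c_t, p_t
    s_t, mem = cell(x_t, state)
    return s_t, (s_t, mem)

# usage sketch
state = (torch.zeros(1, HID), torch.zeros(1, HID))
s_t, state = decoder_step(torch.zeros(1, EMB), torch.zeros(1, HID), torch.zeros(1, P), state)
```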
Parent Feeding
• parent information p_t comes from two sources
• (1) the hidden state of the parent action, s_pt
• (2) the embedding of the parent action, a_pt
• The parent feeding scheme enables the model to utilize
information from parent code segments to make more
confident predictions (a small sketch follows this slide)
8
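A small sketch of forming p_t from its two sources; concatenation is shown here as one simple way to combine them.

```python
import torch

def parent_feeding(hidden_states, action_embeddings, parent_step):
    """p_t built from the parent action's hidden state s_{p_t} and its embedding a_{p_t}."""
    s_pt = hidden_states[parent_step]        # (1) hidden state of the parent action
    a_pt = action_embeddings[parent_step]    # (2) embedding of the parent action
    return torch.cat([s_pt, a_pt], dim=-1)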
Calculating Action Probabilities
g(s_t): tanh(W · s_t + b)
e(r): one-hot vector for rule r
p(gen|·), p(copy|·): tanh(W · s_t + b)
9
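A hedged sketch of the APPLYRULE probability built from the slide's ingredients: g(s_t) = tanh(W · s_t + b) feeds a softmax over all production rules, and picking rule r amounts to a dot product with its one-hot vector e(r). Weight names and sizes are placeholders, not the paper's parameters.

```python
import torch
import torch.nn.functional as F

HID, NUM_RULES = 256, 100
W_g, b_g = torch.randn(HID, HID), torch.zeros(HID)   # parameters of g(.)
W_R = torch.randn(NUM_RULES, HID)                    # one score row per production rule

def rule_probabilities(s_t):
    """Probability of each APPLYRULE action given the decoder state s_t."""
    g = torch.tanh(s_t @ W_g.T + b_g)                # g(s_t) = tanh(W · s_t + b)
    return F.softmax(g @ W_R.T, dim=-1)

probs = rule_probabilities(torch.randn(HID))
e_r = F.one_hot(torch.tensor(7), NUM_RULES).float()  # e(r) for rule index 7
p_apply_r = probs @ e_r                              # == probs[7]
```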
Training
• Given a dataset of pairs of NL descriptions x_i and
code snippets c_i
• we parse c_i into its AST y_i
• decompose y_i into a sequence of oracle actions
• The model is then optimized by maximizing the log-
likelihood of the oracle action sequence
10
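A minimal sketch of the training objective described above, using Python's ast module as a stand-in parser; the decomposition into oracle actions is only hinted at, and the loss is the negative log-likelihood of the oracle action sequence.

```python
import ast
import torch

def oracle_nll(action_log_probs):
    """Negative log-likelihood of the oracle action sequence:
    minimizing this maximizes sum_t log p(a_t | x, a_<t)."""
    return -torch.stack(action_log_probs).sum()

# Parse a training snippet c_i into its AST y_i (the oracle APPLYRULE / GENTOKEN
# decomposition of y_i is omitted here).
y_i = ast.parse("y = foo(x)")

# Toy usage: three oracle actions with the model's predicted probabilities.
loss = oracle_nll([torch.log(torch.tensor(p)) for p in (0.9, 0.7, 0.8)])
print(float(loss))
```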
Experimental Evaluation
• Datasets
• HEARTHSTONE (HS) dataset
• a collection of Python classes that implement cards for the card
game HearthStone
• DJANGO dataset
• a collection of lines of code from the Django web framework, each
with a manually annotated NL description
• IFTTT dataset
• a domain-specific benchmark that provides an interesting side
comparison
• Metrics
• accuracy
• BLEU-4 (both metrics sketched after this slide)
11
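A sketch of the two metrics as they are commonly computed; the paper's exact tokenization of code for BLEU-4 may differ.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def exact_match_accuracy(references, hypotheses):
    """Fraction of examples whose generated code exactly matches the reference."""
    return sum(r == h for r, h in zip(references, hypotheses)) / len(references)

def bleu4(reference_tokens, hypothesis_tokens):
    """Sentence-level BLEU-4 over tokenized reference and generated code."""
    smooth = SmoothingFunction().method3
    return sentence_bleu([reference_tokens], hypothesis_tokens,
                         weights=(0.25, 0.25, 0.25, 0.25),
                         smoothing_function=smooth)

print(bleu4("y = foo ( x )".split(), "y = foo ( x )".split()))   # identical code -> 1.0
```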
Experimental Evaluation
• Preprocessing
• Inputs are tokenized using NLTK
• quoted strings in the inputs are replaced with placeholders
(a sketch follows this slide)
• unary closures whose frequency exceeds a threshold are
extracted
• Configuration
• node type embeddings: 64
• All other embeddings: 128
• RNN states: 256
• hidden layer size: 50
12
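A hypothetical sketch of the preprocessing step (the placeholder naming and regex are illustrative): quoted strings are masked before NLTK tokenization so they can later be restored or copied.

```python
import re
from nltk.tokenize import word_tokenize   # requires the NLTK 'punkt' data to be downloaded

def preprocess(nl_description):
    """Replace quoted strings in the input with placeholders, then tokenize with NLTK."""
    placeholders = {}

    def mask(match):
        key = f"_STR_{len(placeholders)}_"
        placeholders[key] = match.group(0)
        return key

    masked = re.sub(r"'[^']*'|\"[^\"]*\"", mask, nl_description)
    return word_tokenize(masked), placeholders

tokens, table = preprocess("call foo with the string 'hello world' as the argument")
# tokens -> ['call', 'foo', ..., '_STR_0_', ...];  table -> {'_STR_0_': "'hello world'"}
```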
Baseline Models
• Latent Predictor Network (LPN), a state-of-the-art
sequence-to-sequence code generation model
• SEQ2TREE, a neural semantic parsing model
• NMT system using a standard encoder-decoder
architecture with attention and unknown word
replacement
13
Results (HS, DJANGO)
14
Output Example
15
Results (IFTTT)
16
Error Analysis
• randomly sampled and labeled 100 and 50 failed
examples from DJANGO and HS, respectively
• DJANGO
• 30%: the pointer network failed
• 25%: the generated code only partially implemented the
required functionality
• 10%: malformed English inputs
• 5%: preprocessing errors
• 30%: could not be easily categorized into the above
• HS
• mostly partial implementation errors, e.g.
• using different parameter names when defining a function
• omitting (or adding) default values of parameters in function
calls
17
Conclusion
• This paper proposes a syntax-driven neural code
generation approach that generates an abstract
syntax tree by sequentially applying actions from a
grammar model
18
