SANN: Programming Code Representation Using Attention Neural Network with Optimized Subtree Extraction

Muntasir Hoq, Sushanth Reddy Chilla, Melika Ahmadi Ranjbar,
Peter Brusilovsky, Bita Akram
32nd ACM International Conference on Information and Knowledge Management (CIKM) 2023

Motivation
• Programming code representation supports various intelligent functionalities
• Code classification [White et al. 2016; Bui et al. 2018]
• Bug detection [Elmishali et al. 2019; Shi et al. 2021]
• Code summarization [Jiang et al. 2017; Abdelaziz et al. 2022]
• Automated programming code analysis tools can help in CS education
• Understanding student knowledge (what are the key concepts/skills/competencies)
• Tracking student learning (student modeling, knowledge tracing)
• Identifying misconceptions
• Providing personalized guidance
• This study aims to develop a representation model by
• Capturing task-relevant information dynamically
• Dealing with sparse solution space and larger programs
• Ensuring interpretability

Student Modeling with
Concept-level Code Representation
1. Extract concepts (unigrams) from problem or student code
• Hosseini, R. and Brusilovsky, P. (2013) JavaParser: A Fine-Grain Concept Indexing Tool for Java Problems.
In: Proceedings of The First Workshop on AI-supported Education for Computer Science (AIEDCS) at the
16th Annual Conference on Artificial Intelligence in Education, Memphis, TN, USA, pp. 60-63.
2. Use student modeling approaches to maintain mastery levels for
concepts as students work with problems (in any order)
• Barria-Pineda, J., Guerra, J., Huang, Y., and Brusilovsky, P. (2017) Concept-Level Knowledge Visualization
for Supporting Self-Regulated Learning. In: Proceedings of Companion of the 22nd International
Conference on Intelligent User Interfaces (IUI '17), Limassol, Cyprus, ACM, pp. 141-144

Personalized Learning with
Concept-level Code Representation
3. Use the current state of the student model to recommend best
problems to work with (and explain why they are the best)
• Barria-Pineda, J., Akhuseyinoglu, K., Želem-Ćelap, S., Brusilovsky, P., Klasnja Milicevic, A., and Ivanovic, M. (2021)
Explainable Recommendations in a Personalized Programming Practice System. In: 22nd International Conference on
Artificial Intelligence in Education, AIED 2021, Springer, pp. 64-76.

Need Better Code Representation
• Concepts are not enough – it is how they are combined in the code
that matters as well
• Huang, Y., Guerra Hollstein, J., Barria Pineda, J., and Brusilovsky, P. (2017) Learner Modeling for
Integration Skills. In: Proceedings of the 25th Conference on User Modeling, Adaptation and
Personalization, Bratislava, Slovakia, pp. 85-93.
• Efficient concept combinations that represent student competencies
could be learned from data
• Akram, B., Azizsoltani, H., Min, W., Wiebe, E., Mott, B., Navied, A., Boyer, K. E., and Lester, J. (2020)
Automated Assessment of Computer Science Competencies from Student Programs with Gaussian
Process Regression. In: A. N. Rafferty, J. Whitehill, V. Cavalli-Sforza and C. Romero (eds.) Proceedings of
13th International Conference on Educational Data Mining, July 10-13, 2020, pp. 555-559.
• Need a semantic-level, structure-based code representation that
captures code patterns

Code Representation with Deep-learning
• AST-based:
• code2vec (exploiting context paths using an attention mechanism, Alon et al. 2019), code2seq (merging context path information using LSTM, Alon et al. 2019),
TBCNN (using convolutional kernels, Mou et al. 2016), ASTNN (encoding statement trees using GRUs, Zhang et al. 2019), ast2vec (merging
subtree information recursively, Paaßen et al. 2021)
• Graph-based:
• Gated graph neural network-based approaches based on different graphs: CFG, DFG, read-write
(Li et al. 2016, Zhou et al. 2019, MVG: Long et al. 2022)
• Pre-trained Transformer-based:
• GPT (fine-tuning GPT-2 on programming analysis task, Lajko et al. 2022), Llama, CodeBERT (code sequence based NL-PL approach, Feng et al. 2020),
GraphCodeBERT (Data flow graph-based, Guo et al. 2021), UniXcoder (AST-based, Guo et al. 2022)

Research Gaps
• Dynamic splitting of subtrees
• Capture task-relevant syntactic and semantic information from code.
• Preserve long term dependencies.
• Extracting both node and structural information
• Assist in a variety of tasks.
• Provide deeper insights into the local semantics of ASTs.
• Interpretability
• Important subtrees and substructures should receive more weight in the vector representation.
• Understand which structures of the code are responsible for the predicted outputs to
enhance the interpretability of the model.

What is Missing?
• Structure information available in larger ASTs might not be fully
captured
• Current models fail to capture syntactic and semantic information
dynamically based on the prediction task; most use static AST
splitting
• Structure is important, but concepts (nodes) are important as well
• Current models are not interpretable (except code2vec & code2seq, but
these do not address the first point)

Bridging Gaps with SANN
• Optimized sequential subtree extraction
• To effectively capture information by splitting program ASTs into subtrees of task-relevant size and
preserve the sequence of subtrees
• Our model aims to be effective both for small student datasets representing a sparse programming
solution space and for larger programs (with larger ASTs).
• Two-way embedding
• To capture both node-based and subtree-based information
• Attention mechanism for interpretability
• To emphasize the important parts of the code when building the code vector and to interpret
the model predictions
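One plausible reading of the two-way embedding bullet is that every extracted subtree is looked up twice: once as a whole (subtree-based) and once node by node (node-based). The sketch below only illustrates that idea with two vocabularies. The subtree tuples, the node labels, and the helper names `build_vocab`, `subtree_token`, and `node_tokens` are all invented for illustration; the slides do not specify SANN's actual encoding.

```python
def build_vocab(tokens):
    """Map each distinct token to an integer id; id 0 is reserved for unknown tokens."""
    vocab = {"<unk>": 0}
    for token in tokens:
        vocab.setdefault(token, len(vocab))
    return vocab

# Hypothetical subtrees written as (root label, child labels).
subtrees = [
    ("MethodDeclaration", ["Modifier", "VoidType", "BlockStmt"]),
    ("BlockStmt", ["ExpressionStmt"]),
    ("ExpressionStmt", ["MethodCallExpr"]),
]

def subtree_token(subtree):
    """Subtree-based view: the whole subtree becomes a single token."""
    root, children = subtree
    return root + "(" + ",".join(children) + ")"

def node_tokens(subtree):
    """Node-based view: the subtree becomes a list of its node labels."""
    root, children = subtree
    return [root] + children

subtree_vocab = build_vocab(subtree_token(s) for s in subtrees)
node_vocab = build_vocab(t for s in subtrees for t in node_tokens(s))

# Each subtree keeps both views; an embedding layer would be indexed by these ids.
encoded = [
    {"subtree_id": subtree_vocab[subtree_token(s)],
     "node_ids": [node_vocab[t] for t in node_tokens(s)]}
    for s in subtrees
]
print(encoded)
```

In this toy setup, node ids are shared across subtrees (for example `BlockStmt` keeps the same id wherever it occurs), while each distinct subtree shape gets its own id.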

Splitting Abstract Syntax Tree (AST)
public void print() {
    System.out.println("Hello World!");
}
[Figure: the AST of this method split into subtrees of depth 2]
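A minimal sketch of depth-bounded subtree splitting follows. It makes two assumptions that are not from the slides: it parses a Python function analogous to the Java method above with Python's built-in `ast` module (the standard library has no Java parser), and it defines a "subtree of depth 2" as an internal node together with its descendants at most two levels below it, emitted in preorder so the sequence of subtrees is preserved. The function names are hypothetical.

```python
import ast

def truncated(node, depth):
    """Return (label, children) for `node`, keeping at most `depth` levels below it."""
    label = type(node).__name__
    if depth == 0:
        return (label, [])
    return (label, [truncated(child, depth - 1) for child in ast.iter_child_nodes(node)])

def extract_subtrees(root, depth):
    """Walk the AST in preorder and emit one depth-bounded subtree per internal node."""
    subtrees = []

    def visit(node):
        children = list(ast.iter_child_nodes(node))
        if children:  # leaves carry no structure, so they are skipped
            subtrees.append(truncated(node, depth))
        for child in children:
            visit(child)

    visit(root)
    return subtrees

def depth_of(tree):
    """Number of levels below the root of a (label, children) tuple."""
    _, kids = tree
    return 0 if not kids else 1 + max(depth_of(k) for k in kids)

def to_string(tree):
    """Render a (label, children) tuple as Label(child,child,...)."""
    label, kids = tree
    return label if not kids else label + "(" + ",".join(to_string(k) for k in kids) + ")"

source = 'def print_hello():\n    print("Hello World!")\n'
subtrees = extract_subtrees(ast.parse(source), depth=2)
for subtree in subtrees:
    print(to_string(subtree))
```

Changing `depth` changes how much context each subtree carries, which is the quantity the slides describe as being tuned per task.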

Task-based optimization of subtree sizes
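The slides state that subtree sizes are optimized with a Genetic Algorithm (GA) but do not give its details, so the following is only a generic GA sketch. Everything in it is an assumption: the two genes (subtree size and maximum number of subtrees), their bounds, the operators (tournament selection, uniform crossover, small mutations, elitism), and above all `toy_fitness`, which stands in for the validation score that a real run would obtain by training and evaluating the model.

```python
import random

SIZE_RANGE = (1, 8)       # hypothetical bounds for the subtree size
COUNT_RANGE = (10, 200)   # hypothetical bounds for the maximum number of subtrees

def toy_fitness(individual):
    """Stand-in for a validation score; the peak location is arbitrary."""
    size, count = individual
    return 1.0 - 0.02 * abs(size - 3) - 0.001 * abs(count - 100)

def random_individual(rng):
    return (rng.randint(*SIZE_RANGE), rng.randint(*COUNT_RANGE))

def clamp(value, bounds):
    return min(max(value, bounds[0]), bounds[1])

def mutate(individual, rng, rate=0.3):
    """Nudge each gene with probability `rate`, staying inside the bounds."""
    size, count = individual
    if rng.random() < rate:
        size = clamp(size + rng.choice((-1, 1)), SIZE_RANGE)
    if rng.random() < rate:
        count = clamp(count + rng.randint(-10, 10), COUNT_RANGE)
    return (size, count)

def crossover(a, b, rng):
    """Uniform crossover: each gene is inherited from one of the two parents."""
    return (rng.choice((a[0], b[0])), rng.choice((a[1], b[1])))

def tournament(population, fitness, rng, k=3):
    """Pick the fittest of k randomly sampled individuals."""
    return max(rng.sample(population, k), key=fitness)

def genetic_search(fitness, generations=30, pop_size=20, seed=0):
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(pop_size)]
    for _ in range(generations):
        children = [max(population, key=fitness)]  # elitism: keep the current best
        while len(children) < pop_size:
            child = crossover(tournament(population, fitness, rng),
                              tournament(population, fitness, rng), rng)
            children.append(mutate(child, rng))
        population = children
    return max(population, key=fitness)

best = genetic_search(toy_fitness)
print(best, round(toy_fitness(best), 3))
```

Because the best individual is copied unchanged into every new generation, the best fitness never decreases. With a real model in place of `toy_fitness`, each fitness call is a full train-and-validate cycle, which is consistent with the higher training time listed under Limitations.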

Code-Embedding Process
[Figure: code-embedding process, showing embeddings e1, e2, ..., ei and a time-distributed layer]
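To make the role of attention concrete, here is a small sketch of attention-weighted pooling over a sequence of embeddings e1, e2, ..., ei. It follows the generic global-attention recipe (score each embedding against one attention vector, apply softmax, take the weighted sum); the slides do not spell out SANN's exact layers, so this is an illustration rather than the model's definition. The embedding values and the fixed `attention_vector` are made up; in a trained model both would be learned.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(embeddings, attention_vector):
    """Return (code_vector, weights): the attention-weighted sum and the weights used."""
    # Score each embedding by its dot product with the attention vector.
    scores = [sum(x * a for x, a in zip(e, attention_vector)) for e in embeddings]
    weights = softmax(scores)
    dim = len(attention_vector)
    code_vector = [sum(w * e[j] for w, e in zip(weights, embeddings)) for j in range(dim)]
    return code_vector, weights

embeddings = [[0.1, 0.3], [2.0, 1.0], [0.0, -0.5]]  # e1, e2, e3 (made-up values)
attention_vector = [1.0, 1.0]                        # learned in practice, fixed here
code_vector, weights = attention_pool(embeddings, attention_vector)

for index, weight in enumerate(weights, start=1):
    print(f"e{index}: attention weight {weight:.2f}")
```

The weights sum to 1, so each one can be read as the share of attention an embedding receives, which is what makes this kind of pooling useful for interpretation.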

Experiments
                              Task 1                               Task 2
Dataset                       CodeWorkout                          OJ
Programming language          Java                                 C
Compilable programs           9403                                 52000
Classes                       2 (correct/incorrect, 8 problems)    104 (algorithm classes)
Avg. data points per class    4000                                 500
Max AST depth                 22                                   76
Avg. AST depth                8.2                                  13.4
Max AST nodes                 464                                  7027
Avg. AST nodes                76                                   190

Task 1 - Program correctness prediction
[Figure: experimental setup for correctness prediction on the CodeWorkout dataset. Compared models: Tf-idf with SVM, KNN, and XGBoost; code2vec (2019, context paths); ASTNN (2019, statement trees); SANN (GA-based subtree extraction). The figure also lists 10-fold cross validation, hyperparameter tuning, a 20% validation set, and a 20% held-out test set.]
• SVM
• kernel = rbf, C = 10
• KNN
• n_neighbors = 10, p = 2 (Euclidean distance)
• XGBoost
• gamma = 1, max_depth = 6
• code2vec, ASTNN
• embedding size = 128
• SANN
• embedding size = 128, (using GA) subtree size = 3, maximum length & number of subtrees = 100
• GPT-2
• 1.5B parameters, fine-tuned
• CodeBERT
• Pre-trained on 6.4M code functions, fine-tuned
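The 20% held-out test set and 10-fold cross validation named for Task 1 can be sketched with the standard library alone. This is an assumption-laden illustration: the split below is a plain random split (the slides do not say whether splits were stratified or grouped by problem), the seed is arbitrary, and `holdout_and_folds` is a hypothetical helper. The sample count 9403 is the number of compilable CodeWorkout programs from the Experiments table.

```python
import random

def holdout_and_folds(n_samples, test_fraction=0.2, n_folds=10, seed=0):
    """Shuffle indices, set aside a test set, and split the rest into folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(n_samples * test_fraction)
    test, rest = indices[:n_test], indices[n_test:]
    # Striding by n_folds keeps fold sizes within one sample of each other.
    folds = [rest[i::n_folds] for i in range(n_folds)]
    return test, folds

test_idx, folds = holdout_and_folds(9403)

# In each round, one fold is used for validation and the other nine for training.
for k, validation_fold in enumerate(folds):
    train = [i for j, fold in enumerate(folds) if j != k for i in fold]
    print(f"fold {k}: train={len(train)} validation={len(validation_fold)}")
print("held-out test:", len(test_idx))
```

The test indices never enter any fold, so hyperparameters chosen by cross validation are evaluated on data the tuning step has not seen.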

Results
Model                               Accuracy  Precision  Recall  F1-score
SVM                                 0.74      0.71       0.70    0.70
KNN                                 0.75      0.72       0.70    0.71
XGBoost                             0.77      0.75       0.74    0.74
code2vec                            0.79      0.76       0.76    0.76
ASTNN                               0.82      0.81       0.79    0.80
GPT-2 (pre-trained model)           0.77      0.76       0.74    0.75
CodeBERT (pre-trained model) [1]    0.88      0.85       0.87    0.86
SANN [1]                            0.87      0.86       0.83    0.85
[1] The difference between these two models is statistically insignificant (p-value > 0.05)
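As a quick sanity check on the Task 1 table, the F1-score can be recomputed from each row's precision and recall with F1 = 2PR / (P + R). The numbers below are copied from the table. Since the table values are rounded to two decimals, and the slides do not say how the scores were averaged across classes or problems, only approximate agreement should be expected.

```python
# (precision, recall, reported F1) per model, copied from the Task 1 results table.
reported = {
    "SVM": (0.71, 0.70, 0.70),
    "KNN": (0.72, 0.70, 0.71),
    "XGBoost": (0.75, 0.74, 0.74),
    "code2vec": (0.76, 0.76, 0.76),
    "ASTNN": (0.81, 0.79, 0.80),
    "GPT-2": (0.76, 0.74, 0.75),
    "CodeBERT": (0.85, 0.87, 0.86),
    "SANN": (0.86, 0.83, 0.85),
}

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

for model, (p, r, f) in reported.items():
    print(f"{model:9s} reported F1 = {f:.2f}, recomputed = {f1(p, r):.3f}")

# Largest disagreement between the reported and recomputed values.
max_gap = max(abs(f1(p, r) - f) for p, r, f in reported.values())
print(f"largest gap: {max_gap:.4f}")
```

Every recomputed value lands within 0.01 of the reported one, which is the precision the two-decimal table allows.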

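Task 2 - Algorithm Detection
[Figure: experimental setup for algorithm detection (accuracy) on the OJ dataset, with a 20% validation set and a 20% held-out test set. Compared models: MVG (2022; CFG, DFG, and R/WG graphs), TBCNN (2016; convolutional kernel), code2vec (2019; context paths), ASTNN (2019; statement trees), SANN (GA-based subtree extraction), and fine-tuned GPT-2 and CodeBERT (2020). Accuracies printed in the figure: 0.94, 0.94, 0.90, 0.97, 0.96, and, next to the fine-tuned GPT-2 and CodeBERT models, 0.82 and 0.95.]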
• MVG
• TBCNN
• Convolutional layer dimension = 600
• code2vec, ASTNN
• embedding size = 128
• SANN
• (using GA) subtree size = 2, maximum length & number of subtrees = 90
• GPT-2
• 1.5B parameters, fine-tuned
• CodeBERT
• Pre-trained on 6.4M code functions, fine-tuned

In-depth Comparison of SANN and ASTNN
[Figure: ASTNN and SANN compared on the OJ dataset regrouped into 4 superclasses. Values printed in the figure: 0.98 and 0.99.]
• Superclass 1:
• Algorithms that involve string comparison.
• Superclass 2:
• Algorithms that involve string replacement.
• Superclass 3:
• Algorithms that involve sorting.
• Superclass 4:
• Algorithms that involve reversing order in a data structure.
                              Task 1         Task 2    Task 3
Dataset                       CodeWorkout    OJ        OJ'
Programming language          Java           C         C
Classes                       2              104       4 (merging 24 classes)
Avg. data points per class    4000           500       3000

Optimization Effectiveness
SANN accuracy by subtree extraction strategy:
Subtree extraction    Task 1 (CodeWorkout)    Task 2 (OJ)
Complete              0.86                    0.87
GA-based              0.87                    0.96

Two-way Embedding Effectiveness
[Figure: results of SANN with GA-based subtree extraction on Task 1 and Task 2 for three variants: subtree embedding + node embedding, node embedding only, and subtree embedding only. Values printed in the figure: 0.96, 0.87, 0.84, 0.80, 0.89, 0.75.]

Sequential Subtree Extraction Effectiveness
SANN accuracy by subtree ordering:
Subtree extraction    Task 1 (CodeWorkout)    Task 2 (OJ)
Randomized            0.84                    0.90
Sequential            0.87                    0.96

Interpretability Case Study
[Figure: an incorrect student solution for the problem caughtSpeeding, annotated with "92% attention". Correct statement: speed -= 5]

Interpretability Case Study-2
[Figure: an incorrect student solution for the problem caughtSpeeding, annotated with "92% attention". Correct statement: speed -= 5]
• Incorrect statements are given 5x higher attention

Conclusion
• The study proposed a novel interpretable model for programming code
representation using Subtree-based Attention Neural Network (SANN) with
optimized subtree extraction using Genetic Algorithm.
• The study demonstrated the effectiveness of the model in analyzing sparse
solution spaces and larger ASTs in two tasks: program correctness prediction and
algorithm detection using student programs.
• Competitive performance, interpretability, and effectiveness on small classroom
datasets make SANN an ideal tool for analyzing student programs.
• In the future, the model can be a valuable tool in computer science education by
providing insight into student learning of programming and helping educators
adapt their teaching methods to support their students.

Limitations
• Higher training time for the optimization step.
• Once trained, the model can be used in offline educational settings and across different programming courses with similar scope, scale, and
course nature.
• Fixed sizes for embedding vectors.
• Vector size might be related to the size of ASTs.
Future Work
• Develop a multi-task classifier.
• Investigate the dynamic adaptation of vector sizes based on optimized subtree sizes.
• Build a pre-trained SANN model to ensure the highest accuracy and interpretability simultaneously.
• Explore the interpretable model to understand student programs, their learning, and their mistakes at a more
granular level.

Want to Read More about it?
• Guerra Hollstein, J., Barria Pineda, J., Schunn, C., Bull, S., and Brusilovsky, P. (2017) Fine-Grained Open
Learner Models: Complexity Versus Support. In: Proceedings of 25th Conference on User Modeling, Adaptation
and Personalization, Bratislava, Slovakia, ACM, pp. 41-49.
• Barria-Pineda, J., Akhuseyinoglu, K., and Brusilovsky, P. (2023) Adaptive Navigational Support and Explainable
Recommendations in a Personalized Programming Practice System. ACM, 1-9.
• Huang, Y., Brusilovsky, P., Guerra, J., Koedinger, K., and Schunn, C. (2023) Supporting skill integration in an
intelligent tutoring system for code tracing. Journal of Computer Assisted Learning 39 (2), 477-500.
• Akram, B., Azizsoltani, H., Min, W., Wiebe, E., Mott, B., Navied, A., Boyer, K. E., and Lester, J. (2020)
Automated Assessment of Computer Science Competencies from Student Programs with Gaussian Process
Regression. In: A. N. Rafferty, J. Whitehill, V. Cavalli-Sforza and C. Romero (eds.) Proceedings of 13th
International Conference on Educational Data Mining, July 10-13, 2020, pp. 555-559.
• Yoder, S., Hoq, M., Brusilovsky, P., and Akram, B. (2022) Exploring Sequential Code Embeddings for Predicting
Student Success in an Introductory Programming Course. In: Proceedings of 6th Educational Data Mining in
Computer Science Education (CSEDM) Workshop at EDM2022, Durham, UK, July 27, 2022, Zenodo.
• Hoq, M., Brusilovsky, P., and Akram, B. (2023) Analysis of an Explainable Student Performance Prediction
Model in an Introductory Programming Course. In: Proceedings of the 16th International Conference on
Educational Data Mining (EDM 2023), Bengaluru, India, July 11-14, 2023, pp. 79-90.
