CAST: Enhancing Code Summarization
with Hierarchical Splitting and
Reconstruction of Abstract Syntax Trees
NLP team: 박희수, 신동진
Task: Code Summarization
- The task of describing source code in concise natural language
- Plays an important role in software maintenance and program comprehension
- But it is labor- and time-intensive: developers must write good summaries for their code by hand
- Methods
  - Traditional: treat source code as plain text, ignoring its complex syntax and structure, and apply rule-based or IR-based approaches
  - Recent: introduce ASTs (abstract syntax trees)
Limitations
- Encoding the AST with a tree-based neural network → training takes too long
  - due to the high complexity and size of programs
  - e.g. HybridDrl: transforms the AST into a binary tree
    ⇒ produces an even deeper tree, causing information loss
- Serializing the AST → loses the AST's hierarchical information
- ASTNN: splits the AST into statement trees ⇒ eases learning on large trees
  - but each subtree can contain only a single statement
  - each subtree is serialized as input ⇒ hierarchical structure information is lost
Solution: Hierarchical splitting and reconstruction
1. Split the full AST into suitable parts
2. Encode each sub-tree with a tree-based model
3. Recombine the sub-tree representations
⇒ a representation of the full AST
Model Structure
AST Encoder
- Build the AST and traverse it in preorder
- On visiting a composite structure (if, while)
  - insert a placeholder node
  - create a subtree → its semantics are placed at the placeholder
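The splitting-with-placeholders step can be sketched as below. This is a minimal illustration, not the authors' implementation; the `Node` class, labels, and placeholder format are hypothetical:

```python
# Minimal sketch of hierarchical AST splitting with placeholder nodes
# (hypothetical node structure, not the CAST codebase).

class Node:
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

COMPOSITE = {"if", "while", "for"}  # structures that trigger a split

def split_ast(root, subtrees):
    """Preorder traversal: on meeting a composite structure, split it out
    as a separate subtree and leave a placeholder node in its position."""
    new_children = []
    for child in root.children:
        if child.label in COMPOSITE:
            body = split_ast(child, subtrees)  # split nested composites first
            subtrees.append(body)
            # the placeholder marks where the subtree's semantics belong
            new_children.append(Node(f"<placeholder:{len(subtrees) - 1}>"))
        else:
            new_children.append(split_ast(child, subtrees))
    return Node(root.label, new_children)

# Usage: a method body containing an if-statement
ast = Node("method", [Node("assign"), Node("if", [Node("call")])])
subtrees = []
top = split_ast(ast, subtrees)
# top.children labels: ["assign", "<placeholder:0>"]; subtrees holds the if-subtree
```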
AST Encoding
1. Encode each subtree: tree-based RvNN (Recursive Neural Network) + max pooling
2. Hierarchical relationship: aggregation → RvNN
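Step 1 can be sketched as a bottom-up pass over the subtree followed by max pooling over all node vectors. The `combine` function below stands in for the learned RvNN cell (here a simple elementwise mean, chosen only so the sketch runs without trained weights):

```python
# Toy sketch of RvNN subtree encoding + max pooling (no learned weights).

def combine(node_emb, child_vecs):
    """Stand-in for the learned RvNN cell: elementwise mean of the node
    embedding and its children's vectors."""
    vecs = [node_emb] + child_vecs
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(len(node_emb))]

def encode(tree, embed):
    """tree = (label, [children]); returns (root_vec, all_node_vecs)."""
    label, children = tree
    child_results = [encode(c, embed) for c in children]
    child_vecs = [r[0] for r in child_results]
    vec = combine(embed[label], child_vecs)
    all_vecs = [vec] + [v for _, vs in child_results for v in vs]
    return vec, all_vecs

def max_pool(vecs):
    """Elementwise max over every node vector in the subtree."""
    return [max(v[i] for v in vecs) for i in range(len(vecs[0]))]

embed = {"if": [1.0, 0.0], "call": [0.0, 1.0]}  # toy node embeddings
_, node_vecs = encode(("if", [("call", [])]), embed)
subtree_repr = max_pool(node_vecs)  # one fixed-size vector per subtree
```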
Code Token Encoder
- Transformer
- multi-head self-attention + relative position embedding module
- Code token → output vector
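The relative position embedding module indexes embeddings by the clipped offset between token positions rather than by absolute position. A minimal sketch of the index matrix (in the style of Shaw et al.'s relative position representations; the clipping distance `k` here is arbitrary):

```python
# Sketch of the clipped relative-offset index matrix behind relative
# position embeddings; k is the maximum relative distance considered.

def relative_positions(seq_len, k):
    """Entry [i][j] is an embedding index for the offset j - i,
    clipped to [-k, k] and shifted into the range [0, 2k]."""
    return [[max(-k, min(k, j - i)) + k for j in range(seq_len)]
            for i in range(seq_len)]

rel = relative_positions(4, 2)
# rel[0] == [2, 3, 4, 4]: offsets 0, 1, 2, 3 (clipped to 2), shifted by k
```

Each index would select a learned embedding added into the attention score, so attention depends on how far apart two tokens are, not where they sit.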
Decoder w/ Copy Mechanism
- Two encoding sources: AST encoder + code token encoder
  ⇒ serial strategy
- Copy mechanism: copies tokens from the input code
  - copy probability learned with an attention layer
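The copy mechanism can be sketched as mixing the decoder's vocabulary distribution with the attention distribution over source tokens: p(w) = p_gen · p_vocab(w) + (1 − p_gen) · Σ attention on source positions holding w. The numbers below are made up for illustration:

```python
# Toy sketch of a copy mechanism's final word distribution.

def final_dist(p_gen, p_vocab, attention, source_tokens):
    """Mix generation and copying: generated mass is scaled by p_gen,
    attention mass on each source token is scaled by (1 - p_gen)."""
    out = {w: p_gen * p for w, p in p_vocab.items()}
    for a, tok in zip(attention, source_tokens):
        out[tok] = out.get(tok, 0.0) + (1 - p_gen) * a
    return out

p_vocab = {"return": 0.7, "the": 0.3}  # decoder's vocabulary distribution
attention = [0.9, 0.1]                 # attention over input code tokens
source = ["maxValue", "the"]           # "maxValue" is out-of-vocabulary
dist = final_dist(0.5, p_vocab, attention, source)
# the OOV identifier "maxValue" can now be produced by copying
```

This is why copying helps code summarization: rare identifiers from the input code can appear in the summary even when they are outside the decoder's vocabulary.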
Experiment - Setup
- Dataset (Java)
- TL-CodeSum (83,661)
- Funcom (2,111,230)
- Vocab size
- AST: 10k
- Code: 30k
- Summary: 50k
- Metrics:
- BLEU-CN ([0%, 100%])
- Meteor ([0%, 100%])
- Rouge-L ([0%, 100%])
- Cider ([0, 10])
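The core of the BLEU family of metrics is modified n-gram precision between the candidate and reference summaries. A rough sketch (not the exact smoothed BLEU-CN variant used in the paper):

```python
# Rough sketch of modified n-gram precision, the building block of BLEU.
from collections import Counter

def ngram_precision(cand, ref, n):
    """Fraction of candidate n-grams that also occur in the reference,
    with counts clipped to the reference's counts."""
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
    total = sum(cand_ngrams.values())
    return overlap / total if total else 0.0

cand = "returns the max value".split()
ref = "return the maximum value".split()
p1 = ngram_precision(cand, ref, 1)  # 2 of 4 unigrams match → 0.5
```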
Result - Metrics
Result - Human Evaluation
Thank You
