Our presentation slides at the 13th IEEE International Conference on Knowledge and Systems Engineering (KSE 2021).
In this paper, we present the systems we submitted to three Vietnamese legal text processing tasks at the Automated Legal Question Answering Competition (ALQAC 2021). Our systems combine traditional information retrieval methods (BM25), pre-trained masked language models (BERT), and legal domain knowledge, which helps to overcome the shortage of training data. In particular, for the legal textual entailment task, we propose a novel data augmentation method based on legal domain knowledge. Evaluation results show the effectiveness of our proposed methods.
3. Overview of Our Approaches
■ Used traditional Information Retrieval models, pre-trained language models, and legal domain knowledge
■ Proposed a data augmentation method and a text matching method for Task 2 (Legal Textual Entailment)
◻ Based on analysing the structural characteristics of legal documents
■ First place in Task 2 (72.16% accuracy), second place in Task 1 (80.61% F2) and Task 3 (64.77% accuracy)
4. Main Findings
■ Task 1 - Legal document retrieval:
◻ Combining the lexical matching model with a supporting model (BERT + CNN, Domain Invariant) improves document retrieval performance
■ Task 2 - Legal textual entailment:
◻ Augmenting the training data with samples generated from law articles helps tackle the data shortage problem
◻ Using the part of an article that is most relevant to the input query improves the accuracy of legal textual entailment
5. Task 1: Legal Document Retrieval
■ Task components:
◻ Questions
◻ Set of law articles
■ Objective: automatically retrieve the law articles that are relevant to the input question
■ Following Nguyen et al., we combine two models:
◻ Lexical matching model (BM25)
◻ Supporting model
6. Proposed Approaches
■ Goal: models that complement hard lexical matching (BM25)
◻ The supporting model captures features that are distinct from those captured by lexical matching
■ Two proposed supporting models:
◻ Domain Invariant
◻ Deep CNN
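One way to picture how a supporting model can complement BM25 is a weighted sum of normalized scores. The sketch below is only an illustration under stated assumptions: the slides do not say how the two scores are combined, so `combine`, the weight `alpha`, the min-max normalization, the toy articles, and the placeholder `support` scores are all hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Okapi BM25 score of every tokenized document for one tokenized query."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def min_max(values):
    """Rescale scores to [0, 1] so that both models are comparable."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def combine(lexical, support, alpha=0.7):
    """Hypothetical weighted sum; alpha is an assumed mixing weight."""
    pairs = zip(min_max(lexical), min_max(support))
    return [alpha * lex + (1 - alpha) * sup for lex, sup in pairs]


# Toy English articles and question (not taken from the competition data).
articles = [
    "civil transaction is a contract or unilateral legal act".split(),
    "prohibited acts in educational institutions include smoking".split(),
    "a contract is an agreement between parties".split(),
]
question = "is smoking prohibited in educational institutions".split()

lexical = bm25_scores(question, articles)
support = [0.1, 0.8, 0.2]  # placeholder outputs of a supporting model
final = combine(lexical, support)
best = max(range(len(articles)), key=final.__getitem__)
```

Here `best` is the index of the article with the highest combined score. In a real system the `support` list would come from the Domain Invariant or Deep CNN model rather than fixed numbers.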
7. Domain Invariant
■ Three main components:
◻ Feature Extractor
◻ Domain Classifier (ID of the law)
◻ Classifier (relevant or not)
■ Training objective:
◻ The extracted features carry no discriminative information about the domain
◻ The extracted features keep the information that is meaningful for the classification task
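These two goals are commonly written as a single adversarial objective, as in domain-adversarial training. The formulation below is a sketch under that assumption; the slides do not give the loss, and the symbols are introduced here for illustration: $\theta_f$, $\theta_y$, $\theta_d$ are the parameters of the feature extractor, the relevance classifier, and the domain classifier, $\mathcal{L}_y$ and $\mathcal{L}_d$ are their classification losses, and $\lambda$ is a trade-off weight.

```latex
\begin{aligned}
E(\theta_f, \theta_y, \theta_d) &= \mathcal{L}_y(\theta_f, \theta_y) - \lambda\, \mathcal{L}_d(\theta_f, \theta_d) \\
(\hat{\theta}_f, \hat{\theta}_y) &= \arg\min_{\theta_f, \theta_y} E(\theta_f, \theta_y, \hat{\theta}_d) \\
\hat{\theta}_d &= \arg\max_{\theta_d} E(\hat{\theta}_f, \hat{\theta}_y, \theta_d)
\end{aligned}
```

Under this reading, the feature extractor is pushed to make the domain classifier fail (no domain information in the features) while still minimizing the relevance loss (meaningful information is kept).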
8. Deep CNN
■ Use BERT to encode the candidate article and the question
■ Use several CNN layers to extract higher-level representations
■ Concatenate the final representations of the article and the question
9. Task 2: Legal Textual Entailment
■ Input: question/statement & its relevant articles
■ Output: Yes/No
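■ Example:
Statement: Chỉ những hành vi pháp lý đơn phương làm thay đổi quyền, nghĩa vụ dân sự mới được coi là giao dịch dân sự. (Only unilateral legal acts that change civil rights and obligations are considered civil transactions.)
Relevant articles:
Giao dịch dân sự (Civil transactions)
Giao dịch dân sự là hợp đồng hoặc hành vi pháp lý đơn phương làm phát sinh, thay đổi hoặc chấm dứt quyền, nghĩa vụ dân sự. (A civil transaction is a contract or a unilateral legal act that gives rise to, changes, or terminates civil rights and obligations.)
⇒ No (the statement is false according to the content of the legal articles)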
11. Data Augmentation
■ Generate positive instances by exploiting the structure of a Vietnamese law article:
◻ Concatenate the consequence part of each clause with every condition that follows it
◻ Rewrite clauses that do not contain any points
■ Generate negative samples using the BM25 model from Task 1
⇒ In total, we obtain 4237 training samples.
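The positive-instance step can be sketched roughly as follows. This is an illustration under assumptions, not the authors' implementation: the `Clause` container, the `positive_instances` function, the "Yes" label string, and the toy English clauses are hypothetical. The slides do not specify how point-free clauses are rewritten, so the sketch keeps them unchanged, and the BM25-based negative sampling is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class Clause:
    """Hypothetical container: a clause's lead-in text plus its points."""
    lead: str
    points: list = field(default_factory=list)


def positive_instances(article_id, clauses):
    """Build (statement, article_id, label) triples labelled as entailed."""
    samples = []
    for clause in clauses:
        if clause.points:
            # Pair the consequence (lead-in) with each condition listed under it.
            for point in clause.points:
                statement = f"{clause.lead.rstrip(':')} {point}"
                samples.append((statement, article_id, "Yes"))
        else:
            # Clause without points: kept as-is here, since the rewriting
            # rule is not given in the slides.
            samples.append((clause.lead, article_id, "Yes"))
    return samples


# Made-up English clauses for illustration only (not real legal text).
clauses = [
    Clause(
        "A contract is void when:",
        ["it violates a legal prohibition", "it is entered into under duress"],
    ),
    Clause("A civil transaction is a contract or a unilateral legal act."),
]
samples = positive_instances("article-1", clauses)
```

With this toy input, the first clause yields one statement per point and the second clause yields a single statement, giving three positive samples for the article.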
13. Text Matching
Query: Hút thuốc là hành vi bị nghiêm cấm trong cơ sở giáo dục. (Smoking is a prohibited act in educational institutions.)
Article: Các hành vi bị nghiêm cấm trong cơ sở giáo dục (Prohibited acts in educational institutions)
1. Xúc phạm nhân phẩm, danh dự, xâm phạm thân thể nhà giáo, cán bộ, người lao động của cơ sở giáo dục và người học. (1. Infringing on the dignity and honor, or violating the body, of teachers, officials, and employees of educational institutions and of learners.)
2. Xuyên tạc nội dung giáo dục. (2. Misrepresenting educational content.)
3. Gian lận trong học tập, kiểm tra, thi, tuyển sinh. (3. Cheating in study, tests, exams, and enrollment.)
4. Hút thuốc; uống rượu, bia; gây rối an ninh, trật tự. (4. Smoking; drinking alcohol or beer; disrupting security and order.)
...
Matching scores shown in the figure: 0.3, 0.2, 0.6
Matched part: Các hành vi bị nghiêm cấm trong cơ sở giáo dục (Prohibited acts in educational institutions) 4. Hút thuốc; uống rượu, bia; gây rối an ninh, trật tự. (4. Smoking; drinking alcohol or beer; disrupting security and order.)
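The selection of the best-matching part can be sketched as below. The slides do not specify the matcher, so this assumes TF-IDF vectors with cosine similarity (both are listed among the traditional approaches in the conclusion). Each candidate is the article title plus one clause, and an unsmoothed IDF of log(N/df) is used so that title words shared by all candidates get zero weight. The function names and the whitespace-level tokenizer are hypothetical, and the example uses the English glosses for readability. The sketch does not reproduce the scores shown in the figure.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Toy tokenizer: lowercase and keep runs of word characters."""
    return re.findall(r"\w+", text.lower())


def tfidf_vectors(token_lists):
    """Return sparse TF-IDF vectors and the IDF table, with idf = log(N / df)."""
    n = len(token_lists)
    df = Counter(t for tokens in token_lists for t in set(tokens))
    idf = {t: math.log(n / df[t]) for t in df}
    vectors = [
        {t: count * idf[t] for t, count in Counter(tokens).items()}
        for tokens in token_lists
    ]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def best_part(query, title, parts):
    """Score every 'title + part' candidate against the query; return the best."""
    candidates = [f"{title} {part}" for part in parts]
    vectors, idf = tfidf_vectors([tokenize(c) for c in candidates])
    # Query terms that never occur in the article are ignored.
    q_counts = Counter(t for t in tokenize(query) if t in idf)
    q_vec = {t: count * idf[t] for t, count in q_counts.items()}
    scores = [cosine(q_vec, v) for v in vectors]
    best = max(range(len(parts)), key=scores.__getitem__)
    return candidates[best], scores


title = "Prohibited acts in educational institutions"
parts = [
    "Infringing on dignity and honor, infringing upon the body of teachers, "
    "officials and employees of educational institutions and learners.",
    "Misrepresenting educational content.",
    "Cheating in study, test, exam, enrollment.",
    "Smoking; drinking alcohol or beer; disrupting security and order.",
]
query = "Smoking is a prohibited act in educational institutions."
selected, scores = best_part(query, title, parts)
```

Because the title words carry zero weight, only the clause that mentions smoking receives a non-zero score in this toy run, so `selected` is the title followed by the fourth clause. A real system would need proper Vietnamese word segmentation instead of the toy tokenizer.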
15. Fine-tuning BERT
Legal entailment as sentence pair classification
■ Pair the question with the matched clauses
■ Insert [CLS] and [SEP]
■ Concatenate the vectors from the last 4 hidden states ⇒ embedding vector of the sequence pair
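The input layout and the concatenation step can be illustrated without any deep learning library. This is a sketch under assumptions: `build_pair_input` and `concat_last_four` are hypothetical helpers, truncating only the clause is an assumed policy (it presumes the question alone fits within the limit), and taking the vector at the [CLS] position is an assumption, since the slide does not say which token's vectors are concatenated.

```python
def build_pair_input(question_tokens, clause_tokens, max_len=512):
    """Return (tokens, segment_ids) for a BERT-style sentence pair."""
    # Reserve three positions for [CLS] and the two [SEP] markers.
    budget = max_len - 3 - len(question_tokens)
    clause_tokens = clause_tokens[:max(budget, 0)]
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + clause_tokens + ["[SEP]"]
    # Segment 0 covers [CLS], the question, and the first [SEP];
    # segment 1 covers the clause and the final [SEP].
    segment_ids = [0] * (len(question_tokens) + 2) + [1] * (len(clause_tokens) + 1)
    return tokens, segment_ids


def concat_last_four(hidden_states, position=0):
    """Concatenate the vectors at one token position from the last 4 layers.

    hidden_states[layer][position] is a list of floats; position 0 is [CLS].
    """
    return [x for layer in hidden_states[-4:] for x in layer[position]]


tokens, segment_ids = build_pair_input(
    ["is", "smoking", "prohibited"],
    ["smoking", "is", "prohibited", "here"],
    max_len=8,
)

# Toy encoder output: 5 layers, 2 token positions, hidden size 2.
hidden_states = [[[float(l), float(l) + 0.5], [0.0, 0.0]] for l in range(5)]
pair_vector = concat_last_four(hidden_states)
```

The concatenated vector is four times the hidden size; for a hidden size of 768 this would give 3072 dimensions, which a classification layer could then map to Yes/No.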
16. Task 3: Legal Question Answering
■ Input: question/statement
■ Output: Yes/No
■ Example:
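Statement: Chỉ những hành vi pháp lý đơn phương làm thay đổi quyền, nghĩa vụ dân sự mới được coi là giao dịch dân sự. (Only unilateral legal acts that change civil rights and obligations are considered civil transactions.)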
⇒ No
17. Our Approach
■ Combine the Task 1 and Task 2 systems, with a slight modification to the legal textual entailment model
■ Pipeline (from the figure): Legal Query + Law Article Data → Legal Document Retrieval → Relevant Articles; Relevant Articles + Legal Query → Legal Textual Entailment
■ Decision rule: if at least one relevant article entails the legal query, then the legal query is TRUE
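The decision rule amounts to a logical OR over the retrieved articles. A minimal sketch, in which `answer_question`, the `top_k` cut-off, and the two toy components are hypothetical stand-ins for the Task 1 retriever and the Task 2 entailment model:

```python
def answer_question(query, retrieve, entails, top_k=3):
    """Return True if at least one retrieved article entails the query."""
    articles = retrieve(query, top_k)
    return any(entails(query, article) for article in articles)


def toy_retrieve(query, top_k):
    """Stand-in for the Task 1 retriever: returns the first top_k articles."""
    corpus = ["article A", "article B", "article C"]
    return corpus[:top_k]


def toy_entails(query, article):
    """Stand-in for the Task 2 entailment model."""
    return article == "article B"


answer = answer_question("some legal statement", toy_retrieve, toy_entails)
```

If retrieval returns no article, `any` yields False, so the query is answered "No" by default in this sketch.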
18. Experiments and Results: Task 1
Run               F2       Rank
(1) Only BM25     78.42%   #7
(2) BM25+DANN     80.61%   #2
(3) BM25+CNN      80.61%   #2
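Task 1 is reported in F2 (see the overview slide), the F-measure that weights recall four times as much as precision: F2 = 5PR / (4P + R). The sketch below computes it for a single question; the function name and the example article IDs are hypothetical, and how the competition aggregates the measure across questions is not stated in the slides.

```python
def f_beta(retrieved, relevant, beta=2.0):
    """F-beta of one retrieved set against the gold set (beta=2 gives F2)."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    if hits == 0:
        return 0.0
    precision = hits / len(retrieved)
    recall = hits / len(relevant)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Two articles returned, one of which is the single gold article:
# precision = 0.5, recall = 1.0, so F2 = 2.5 / 3.
score = f_beta({"a1", "a2"}, {"a1"})
```

Because recall dominates the measure, returning one extra wrong article costs less than missing a relevant one.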
19. Experiments and Results: Task 2
■ Split the augmented data into training and development subsets
◻ 3813 samples for training, 424 samples for validation
■ Extra experiment: used all of the data for training
◻ Obtained 72.16% accuracy on the private test set
Run                          Accuracy   Rank
(1) BERT, lr = 2e-5          68.89%     #1
(2) BERT, lr = 1e-4          67.61%     #3
(3) Domain Invariant model   43%        #8
20. Experiments and Results: Task 3
■ Train the model on 80% of the original training data
■ Maximum sequence length: 512 during training, 256 at inference
Run                                 Accuracy   Rank
(1) BM25 + Text Matching            64.77%     #2
(2) BM25 + Domain Invariant model   61.36%     #4
(3) BM25 + Deep CNN                 61.36%     #4
21. Conclusion
■ Our systems are based on:
◻ Traditional approaches (BM25, cosine similarity, tf-idf)
◻ Deep learning models (pre-trained language models)
◻ Data augmentation techniques based on legal domain knowledge
■ Our proposed data augmentation and text matching methods can be applied to other legal text processing tasks and to languages other than Vietnamese.