A Feature-Based Model for Nested Named-Entity Recognition at VLSP-2018 NER Evaluation Campaign
Phạm Quang Nhật Minh
Alt Vietnam Co., Ltd
pham.minh@alt.ai
March 23, 2018
Table of Contents
1 Introduction
2 System Description
3 Evaluation
4 Conclusion
Pham Quang Nhat Minh VLSP 2018 NER shared task 2/17
1 Introduction
Named Entity Recognition
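INPUT: Vụ đắm tàu xảy ra ngoài khơi thị trấn Sabratha, phía Bắc
thủ đô Tripoli, vốn là điểm tập kết và khởi hành của những người
tìm cách di cư trái phép sang châu Âu.
OUTPUT: Vụ đắm tàu xảy ra ngoài khơi [Location thị trấn
Sabratha], phía Bắc [Location thủ đô Tripoli], vốn là điểm tập kết
và khởi hành của những người tìm cách di cư trái phép sang
[Location châu Âu].
(English: "The shipwreck occurred off the coast of the town of
Sabratha, north of the capital Tripoli, a gathering and departure
point for people trying to migrate illegally to Europe.")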
VLSP 2018 NER Evaluation Campaign
VLSP 2018 | VLSP 2016
Only raw text with tagged entities | Entity tags, word segmentation, PoS, chunks (in 1st dataset)
Data is categorized into domains | No domain information
Includes nested entities | Includes nested entities
Considers entities at all levels (in evaluation) | Very few systems tackled nested entities
Table: VLSP 2018 vs. VLSP 2016 NER evaluation campaigns
Definition of entity levels
Level-1 entities are entities that do not contain other entities
inside them.
E.g., <ENAMEX TYPE="LOC">Hà Nội</ENAMEX>.
Level-2 entities are entities that contain only level-1 entities
inside them.
E.g., <ENAMEX TYPE="ORG">UBND thành phố <ENAMEX TYPE="LOC">Hà Nội</ENAMEX></ENAMEX>.
Level-3 entities are entities that contain at least one level-2
entity and may also contain level-1 entities.
E.g., <ENAMEX TYPE="ORG">Khoa Toán, <ENAMEX TYPE="ORG">ĐHQG <ENAMEX TYPE="LOC">Hà Nội</ENAMEX></ENAMEX></ENAMEX>.
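The level definitions above can be checked mechanically. The following is a minimal sketch, not the campaign's official tooling: it assumes well-formed ENAMEX markup with straight quotes, and the function name `entity_levels` is hypothetical. An entity's level is taken as one more than the deepest entity nested inside it.

```python
import re

# Matches an opening <ENAMEX TYPE="..."> tag or a closing </ENAMEX> tag.
TAG_RE = re.compile(r'<ENAMEX TYPE="([A-Z]+)">|</ENAMEX>')

def entity_levels(text):
    """Return (type, surface text, level) for every entity, innermost first."""
    stack = []   # open entities: [type, text parts, deepest child level]
    found = []
    pos = 0
    for match in TAG_RE.finditer(text):
        chunk = text[pos:match.start()]
        for entity in stack:          # plain text belongs to every open entity
            entity[1].append(chunk)
        pos = match.end()
        if match.group(0).startswith("</"):
            etype, parts, child_level = stack.pop()
            level = child_level + 1   # level-1 entities have no children
            found.append((etype, "".join(parts), level))
            if stack:
                stack[-1][2] = max(stack[-1][2], level)
        else:
            stack.append([match.group(1), [], 0])
    return found

example = ('<ENAMEX TYPE="ORG">Khoa Toán, <ENAMEX TYPE="ORG">ĐHQG '
           '<ENAMEX TYPE="LOC">Hà Nội</ENAMEX></ENAMEX></ENAMEX>')
levels = entity_levels(example)
```

On the level-3 example, `levels` lists the LOC entity at level 1, the inner ORG at level 2, and the outer ORG at level 3.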
Our approach
Formalize the NER task as a sequence labeling problem
Use B-I-O encoding scheme
Investigate two approaches to tackling nested entities:
train separate models for each entity level vs. train a single
joint model for all levels
Word          Level-1 Tag   Level-2 Tag   Joint Tag
ông           O             O             O+O
Ngô_Văn_Quý   B-PER         O             B-PER+O
-             O             O             O+O
Phó           O             O             O+O
Chủ_tịch      O             O             O+O
UBND          O             B-ORG         O+B-ORG
TP            B-LOC         I-ORG         B-LOC+I-ORG
Hà_Nội        I-LOC         I-ORG         I-LOC+I-ORG
Table: Generating joint tags by combining the entity tags at all levels
of a token
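The joint-tag construction in the table can be sketched in a few lines. This is an illustration only: the function names are hypothetical, and the sketch assumes exactly two levels joined by "+", as in the table.

```python
def to_joint_tags(level1_tags, level2_tags):
    """Combine per-token level-1 and level-2 BIO tags into one joint tag."""
    if len(level1_tags) != len(level2_tags):
        raise ValueError("both tag sequences must cover the same tokens")
    return [f"{l1}+{l2}" for l1, l2 in zip(level1_tags, level2_tags)]

def from_joint_tags(joint_tags):
    """Split joint tags back into (level-1 tags, level-2 tags)."""
    pairs = [tag.split("+", 1) for tag in joint_tags]
    return [p[0] for p in pairs], [p[1] for p in pairs]

# Tag columns copied from the table above.
level1 = ["O", "B-PER", "O", "O", "O", "O", "B-LOC", "I-LOC"]
level2 = ["O", "O", "O", "O", "O", "B-ORG", "I-ORG", "I-ORG"]
joint = to_joint_tags(level1, level2)
```

A single sequence labeler can then be trained on `joint`, and its predictions split back into per-level tags with `from_joint_tags`.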
2 System Description
System architecture
Conditional Random Fields
Directly model the conditional probability of a tag sequence
$y = (y_1, y_2, \ldots, y_m)$ given a word sequence
$x = (x_1, x_2, \ldots, x_m)$:
\[ P(y \mid x) = \frac{\exp(w \cdot F(y, x))}{\sum_{y' \in \mathcal{Y}} \exp(w \cdot F(y', x))} \]
Global feature vector $F(y, x)$, with components
\[ F_j(y, x) = \sum_{i=1}^{m} f_j(y_{i-1}, y_i, x, i) \]
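To make the CRF definition concrete, here is a brute-force sketch that computes P(y|x) by enumerating every tag sequence. It only works for toy inputs (real CRF toolkits use dynamic programming). The tag set, the local features, and the weights are hypothetical and chosen only for illustration.

```python
import itertools
import math

TAGS = ["O", "B-LOC", "I-LOC"]  # toy tag set (assumption)

def local_features(prev_tag, tag, x, i):
    """Hypothetical local features f_j(y_{i-1}, y_i, x, i), as a sparse dict."""
    return {
        f"word={x[i]}|tag={tag}": 1.0,
        f"cap={x[i][0].isupper()}|tag={tag}": 1.0,
        f"prev={prev_tag}|tag={tag}": 1.0,
    }

def global_features(y, x):
    """F(y, x): sum of the local features over all positions i = 1..m."""
    total = {}
    prev = "<s>"  # start symbol standing in for y_0
    for i, tag in enumerate(y):
        for name, value in local_features(prev, tag, x, i).items():
            total[name] = total.get(name, 0.0) + value
        prev = tag
    return total

def score(w, y, x):
    """Dot product w . F(y, x) with sparse weights."""
    return sum(w.get(n, 0.0) * v for n, v in global_features(y, x).items())

def probability(w, y, x):
    """P(y | x): exp(score) normalized over all |TAGS|^m tag sequences."""
    z = sum(math.exp(score(w, cand, x))
            for cand in itertools.product(TAGS, repeat=len(x)))
    return math.exp(score(w, y, x)) / z

x = ["thủ", "đô", "Tripoli"]
w = {"cap=True|tag=B-LOC": 2.0, "prev=O|tag=I-LOC": -3.0}  # made-up weights
p = probability(w, ("O", "O", "B-LOC"), x)
```

With all weights at zero every sequence is equally likely. The made-up weights above raise the probability of tagging the capitalized token as B-LOC.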
Preprocessing
Sentence segmentation
Rule-based method: split at a period "." that is followed by
whitespace and a word starting with an uppercase character.
Exceptions: Mr. Minh, Bs. Tiến
Word segmentation: RDRsegmenter (Nguyen et al., 2018),
http://bit.ly/2FUiNKX
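The sentence-segmentation rule above could look roughly like the sketch below. It is not the author's implementation: the abbreviation set contains only the two examples named on the slide, and the function name is hypothetical.

```python
import re

# Abbreviations after which a period does not end a sentence.
# Only the two from the slide; a real list would be longer (assumption).
ABBREVIATIONS = {"Mr.", "Bs."}

def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split at '.' + whitespace + uppercase letter, except after abbreviations."""
    sentences = []
    start = 0
    for match in re.finditer(r"\.\s+", text):
        next_char = text[match.end():match.end() + 1]
        if not next_char.isupper():
            continue                  # next word is not capitalized
        prefix = text[start:match.start() + 1]
        if prefix.split()[-1] in abbreviations:
            continue                  # e.g. "Mr. Minh" stays in one sentence
        sentences.append(prefix.strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

parts = split_sentences("Tôi gặp Mr. Minh hôm qua. Anh ấy khỏe.")
```

Here `parts` holds two sentences: the period after "Mr" is skipped, while the period after "qua" is treated as a boundary.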
Features: word and word-shapes
Word features
Word-shape features (Minh et al., 2018)
Feature   Example
shape     “Đồng” → “ULLL”
shaped    “Đồng” → “UL”
type      “1234” → “AllDigit”
fregex    See (Le-Hong, 2016)
mix       “iPhone”
...       ...
Table: Word-shape features
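The word-shape features in the table can be approximated as follows. This is a sketch under assumptions: the exact definitions are in (Minh et al., 2018), the function names are hypothetical, and only the "AllDigit" label comes from the slide.

```python
import unicodedata
from itertools import groupby

def shape(word):
    """Map uppercase letters to U, lowercase to L, digits to D; keep the rest."""
    word = unicodedata.normalize("NFC", word)  # one code point per accented letter
    out = []
    for ch in word:
        if ch.isupper():
            out.append("U")
        elif ch.islower():
            out.append("L")
        elif ch.isdigit():
            out.append("D")
        else:
            out.append(ch)
    return "".join(out)

def short_shape(word):
    """The 'shaped' feature: the shape with repeated symbols collapsed."""
    return "".join(symbol for symbol, _ in groupby(shape(word)))

def word_type(word):
    """Only 'AllDigit' is from the slide; 'Other' is a placeholder label."""
    return "AllDigit" if word.isdigit() else "Other"

def is_mixed_case(word):
    """The 'mix' feature: an uppercase letter after the first character."""
    return any(c.islower() for c in word) and any(c.isupper() for c in word[1:])
```

For example, `shape("Đồng")` gives "ULLL", `short_shape("Đồng")` gives "UL", and `is_mixed_case("iPhone")` is true.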
Word-representation features
Brown-cluster features
Train Brown clusters on 7 GB of text data, with 5120 clusters
Use whole bitstrings and prefixes of length 2, 4, 6, 8, 10, 12, 16,
20.
E.g.,
111111101111000 diễn_viên → {“11”, “1111”, “111111”, ...,
“111111101111000”}
11111110111101111 người_mẫu → {“11”, “1111”,
“111111”, ..., “11111110111101111”}
Word-embedding features
Same as (Le-Hong and Pham, 2017)
Train word embeddings on 7 GB of text data using GloVe with
dimension 25
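The Brown-cluster prefix features described above can be generated as in this sketch. It assumes that prefix lengths not shorter than the bitstring are skipped, which the slide does not state, and the function name is hypothetical.

```python
PREFIX_LENGTHS = (2, 4, 6, 8, 10, 12, 16, 20)  # lengths listed on the slide

def brown_features(bitstring, lengths=PREFIX_LENGTHS):
    """Return the bitstring prefixes followed by the whole bitstring."""
    # Assumption: a prefix is used only if it is shorter than the bitstring.
    features = [bitstring[:n] for n in lengths if n < len(bitstring)]
    features.append(bitstring)
    return features

actor = brown_features("111111101111000")      # cluster of diễn_viên
model = brown_features("11111110111101111")    # cluster of người_mẫu
```

The two words share their first 13 bits, so their prefix features agree up to length 12 and they differ only in the longer prefix and the whole bitstring.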
3 Evaluation
Dataset
Type    Train                 Dev                   Test
        L1      L2     L3     L1      L2     L3     L1      L2     L3
LOC     8831    7      0      3043    2      0      2525    2      0
ORG     3471    1655   63     1203    690    14     1616    557    22
PER     6427    0      0      2168    0      0      3518    1      0
MISC    805     1      0      179     1      0      296     0      0
Total   19534   1663   63     6593    694    14     7955    561    22
Table: Number of entities of each type at each level in the train,
development, and test sets (L1: level-1, L2: level-2, L3: level-3)
Evaluation on development set
Model           Precision   Recall   F1
Level-1 Model   91.04       84.41    87.60
Joint Model     90.42       84.72    87.47
Table: Evaluation results on the dev set for recognizing level-1 entities
Model           Precision   Recall   F1
Level-2 Model   85.81       72.44    78.56
Joint Model     84.36       77.06    80.54
Table: Evaluation results on the dev set for recognizing level-2 entities
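The F1 column in these tables is the harmonic mean of precision and recall. The short sketch below reproduces two of the reported values from their precision and recall; the function name is hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

level1_f1 = round(f1_score(91.04, 84.41), 2)  # level-1 model, level-1 entities
level2_f1 = round(f1_score(85.81, 72.44), 2)  # level-2 model, level-2 entities
```

Recomputing from the rounded precision and recall can differ from a reported F1 in the last digit.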
Submitted runs
Separated: level-1 model + level-2 model.
Joint: joint model.
Hybrid: level-1 model to recognize level-1 entities, joint model
to recognize level-2 entities.
Run     Method      Sentence segmentation
Run-1   Hybrid      YES
Run-2   Hybrid      NO
Run-3   Joint       YES
Run-4   Joint       NO
Run-5   Separated   YES
Run-6   Separated   NO
Table: Six submitted runs
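One plausible reading of the Hybrid method is sketched below: level-1 tags come from the level-1 model, and level-2 tags are read off the second half of the joint model's tags. The slides do not spell out how the two outputs were merged, so the merging step and the function name are assumptions.

```python
def hybrid_tags(level1_model_tags, joint_model_tags):
    """Pair level-1 tags from the level-1 model with level-2 tags from the joint model."""
    if len(level1_model_tags) != len(joint_model_tags):
        raise ValueError("both models must tag the same tokens")
    merged = []
    for level1_tag, joint_tag in zip(level1_model_tags, joint_model_tags):
        _, level2_tag = joint_tag.split("+", 1)  # drop the joint model's level-1 half
        merged.append((level1_tag, level2_tag))
    return merged

# Made-up predictions for the tokens "UBND TP Hà_Nội".
from_level1_model = ["O", "B-LOC", "I-LOC"]
from_joint_model = ["O+B-ORG", "O+I-ORG", "I-LOC+I-ORG"]
merged = hybrid_tags(from_level1_model, from_joint_model)
```

In this made-up case the joint model misses the start of the location, and the hybrid output takes the level-1 tags from the level-1 model instead.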
Results
Run     Precision   Recall   F1
Run-1   76.08       70.68    73.28
Run-2   76.75       70.37    73.42
Run-3   76.32       70.25    73.16
Run-4   76.16       70.98    73.48
Run-5   75.70       70.28    72.89
Run-6   76.26       69.90    72.94
Table: Official evaluation results on the test set, which consider
entities at all levels. Models were trained on the combination of the
train and development datasets.
4 Conclusion
Conclusions
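The joint model, which combines the tags of all levels, improved
level-2 recognition on the development set (F1 80.54 vs. 78.56 for
the separate level-2 model), with level-1 F1 nearly unchanged
(87.47 vs. 87.60).
There is a large gap between accuracy on the development set and
on the test set (F1 around 73 on the test set, all levels).
Domain adaptation techniques are needed
A dictionary may help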
Context/common-sense in NER
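INPUT: Ông nói: "Tôi đang dắt Bobby đi trên đường thì hai con
chó tiến đến. Con chó giống Rottweiler tấn công Bobby từ phía
sau. Bobby đã cố gắng chống trả, nhưng con chó trắng lớn kia
không tha cho nó. Nó há miệng và ngoạm chặt lấy cổ chú chó của
tôi, khiến nó không thể làm gì được nữa."
SYSTEM OUTPUT: Ông nói: "Tôi đang dắt <ENAMEX
TYPE="PERSON">Bobby</ENAMEX> đi trên đường thì hai
con chó tiến đến. Con chó giống Rottweiler tấn công Bobby từ
phía sau. <ENAMEX TYPE="PERSON">Bobby</ENAMEX>
đã cố gắng chống trả, nhưng con chó trắng lớn kia không tha cho
nó. Nó há miệng và ngoạm chặt lấy cổ chú chó của tôi, khiến nó
không thể làm gì được nữa."
(English: He said: "I was walking Bobby down the street when two
dogs approached. The Rottweiler-type dog attacked Bobby from
behind. Bobby tried to fight back, but the big white dog would not
let him go. It opened its mouth and clamped down on my dog's neck,
so that he could do nothing more.")
Bobby is a dog here, but the system tags two of the three mentions
as a person.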
