A Feature-Based Model for Nested Named-Entity Recognition at VLSP-2018 NER Evaluation Campaign

Phạm Quang Nhật Minh
Alt Vietnam Co., Ltd
pham.minh@alt.ai
March 23, 2018

Table of Contents
1 Introduction
2 System Description
3 Evaluation
4 Conclusion

1 Introduction

Named Entity Recognition
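INPUT: Vụ đắm tàu xảy ra ngoài khơi thị trấn Sabratha, phía Bắc thủ đô Tripoli, vốn là điểm tập kết và khởi hành của những người tìm cách di cư trái phép sang châu Âu.
(English: The shipwreck occurred off the coast of the town of Sabratha, north of the capital Tripoli, which is a gathering and departure point for people trying to migrate illegally to Europe.)
OUTPUT: Vụ đắm tàu xảy ra ngoài khơi [Location thị trấn Sabratha], phía Bắc [Location thủ đô Tripoli], vốn là điểm tập kết và khởi hành của những người tìm cách di cư trái phép sang [Location châu Âu].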

VLSP 2018 NER Evaluation Campaign
VLSP 2018                                          VLSP 2016
Only raw text with tagged entities                 Entity tags, word segmentation, PoS, chunks (in 1st dataset)
Data is categorized into domains                   No domain information
Includes nested entities                           Includes nested entities
Considers entities at all levels (in evaluation)   Very few systems tackled nested entities
Table: VLSP 2018 NER Evaluation compared with VLSP 2016

Definition of entity levels
Level-1 entities are entities that do not contain other entities inside them.
E.g., <ENAMEX TYPE="LOC">Hà Nội</ENAMEX>.
Level-2 entities are entities that contain only level-1 entities inside them.
E.g., <ENAMEX TYPE="ORG">UBND thành phố <ENAMEX TYPE="LOC">Hà Nội</ENAMEX></ENAMEX>.
Level-3 entities are entities that contain at least one level-2 entity and may also contain level-1 entities.
E.g., <ENAMEX TYPE="ORG">Khoa Toán, <ENAMEX TYPE="ORG">ĐHQG <ENAMEX TYPE="LOC">Hà Nội</ENAMEX></ENAMEX></ENAMEX>.
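The level definitions above amount to "one more than the deepest entity nested inside". A minimal sketch of that rule, assuming the annotated text is well-formed XML (the function name `entity_levels` and the dummy `<s>` wrapper are illustrative, not part of the shared-task tooling):

```python
import xml.etree.ElementTree as ET

def entity_levels(annotated):
    """Return (text, type, level) for every ENAMEX entity, inner entities first."""
    # Wrap in a dummy root so text with several top-level entities still parses.
    root = ET.fromstring(f"<s>{annotated}</s>")
    found = []

    def visit(node):
        # Level of the deepest entity nested inside this node (0 if none).
        deepest = 0
        for child in node:
            deepest = max(deepest, visit(child))
        if node.tag != "ENAMEX":
            return deepest
        level = deepest + 1
        found.append(("".join(node.itertext()), node.get("TYPE"), level))
        return level

    visit(root)
    return found

example = ('<ENAMEX TYPE="ORG">Khoa Toán, <ENAMEX TYPE="ORG">ĐHQG '
           '<ENAMEX TYPE="LOC">Hà Nội</ENAMEX></ENAMEX></ENAMEX>')
levels = entity_levels(example)
for text, etype, level in levels:
    print(level, etype, text)
```

Raw corpus text containing unescaped characters such as `&` would need escaping before this parser accepts it.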

Our approach
Formalize the NER task as a sequence labeling problem
Use the B-I-O encoding scheme
Investigate different approaches to tackle nested entities
Train separate models for each entity level vs. train a joint model for all levels
Word          Level-1 Tag  Level-2 Tag  Joint Tag
ông           O            O            O+O
Ngô_Văn_Quý   B-PER        O            B-PER+O
-             O            O            O+O
Phó           O            O            O+O
Chủ_tịch      O            O            O+O
UBND          O            B-ORG        O+B-ORG
TP            B-LOC        I-ORG        B-LOC+I-ORG
Hà_Nội        I-LOC        I-ORG        I-LOC+I-ORG
Table: Generating joint tags by combining the entity tags at all levels of a token
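The joint-tag construction in the table can be sketched as follows. This is an illustration only: the helper names `join_tags` and `split_tags` are hypothetical, and the `+` separator is taken from the table.

```python
def join_tags(level1, level2):
    """Combine per-token level-1 and level-2 BIO tags into joint tags."""
    if len(level1) != len(level2):
        raise ValueError("tag sequences must have the same length")
    return [f"{a}+{b}" for a, b in zip(level1, level2)]

def split_tags(joint):
    """Recover the level-1 and level-2 tag sequences from joint tags."""
    pairs = [tag.split("+") for tag in joint]
    return [p[0] for p in pairs], [p[1] for p in pairs]

words  = ["ông", "Ngô_Văn_Quý", "-", "Phó", "Chủ_tịch", "UBND", "TP", "Hà_Nội"]
level1 = ["O", "B-PER", "O", "O", "O", "O", "B-LOC", "I-LOC"]
level2 = ["O", "O", "O", "O", "O", "B-ORG", "I-ORG", "I-ORG"]

joint = join_tags(level1, level2)
for word, tag in zip(words, joint):
    print(word, tag)
```

A single sequence labeler trained on the joint tags predicts both levels at once; `split_tags` turns its output back into per-level sequences.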

2 System Description

System architecture

Conditional Random Fields
Directly model the conditional probability of a tag sequence $y = (y_1, y_2, \dots, y_m)$ given a word sequence $x = (x_1, x_2, \dots, x_m)$:
\[ P(y \mid x) = \frac{\exp(w \cdot F(y, x))}{\sum_{y' \in Y} \exp(w \cdot F(y', x))} \]
Global feature $F(y, x)$:
\[ F_j(y, x) = \sum_{i=1}^{m} f_j(y_{i-1}, y_i, x, i) \]
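To make the two formulas concrete, here is a toy sketch that evaluates P(y|x) by brute-force enumeration of all tag sequences. The three local features and their weights are invented for illustration, and a start symbol is assumed before the first tag. Real CRF toolkits compute the normalizer with the forward algorithm rather than by enumeration.

```python
import math
from itertools import product

TAGS = ["O", "B-LOC", "I-LOC"]

# Hypothetical local features f_j(y_prev, y_cur, x, i).
def f_cap_begins_loc(y_prev, y_cur, x, i):
    return 1.0 if x[i][0].isupper() and y_cur == "B-LOC" else 0.0

def f_lower_is_outside(y_prev, y_cur, x, i):
    return 1.0 if x[i][0].islower() and y_cur == "O" else 0.0

def f_bad_transition(y_prev, y_cur, x, i):
    # I-LOC should only follow B-LOC or I-LOC.
    return 1.0 if y_cur == "I-LOC" and y_prev not in ("B-LOC", "I-LOC") else 0.0

FEATURES = [f_cap_begins_loc, f_lower_is_outside, f_bad_transition]
WEIGHTS = [2.0, 1.5, -4.0]  # illustrative values of w, not trained

def global_features(y, x):
    # F_j(y, x) = sum over positions of f_j(y_{i-1}, y_i, x, i);
    # positions are 0-indexed here, with "<START>" standing in for y_0.
    padded = ("<START>",) + tuple(y)
    return [sum(f(padded[i], padded[i + 1], x, i) for i in range(len(x)))
            for f in FEATURES]

def score(y, x):
    return sum(w * F for w, F in zip(WEIGHTS, global_features(y, x)))

def prob(y, x):
    # Denominator: sum over every candidate tag sequence y'.
    z = sum(math.exp(score(c, x)) for c in product(TAGS, repeat=len(x)))
    return math.exp(score(y, x)) / z

x = ["thủ_đô", "Tripoli"]
best = max(product(TAGS, repeat=len(x)), key=lambda c: prob(c, x))
print(best, round(prob(best, x), 3))
```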

Preprocessing
Sentence segmentation
Rule-based method: a period followed by whitespace and a word starting with an uppercase character ends a sentence.
Exceptions: abbreviations such as "Mr. Minh" and "Bs. Tiến".
Word segmentation: RDRsegmenter (Nguyen et al., 2018), http://bit.ly/2FUiNKX
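The sentence-splitting rule could look roughly like the sketch below. The abbreviation list holds only the two examples from the slide, and the function name `split_sentences` is hypothetical; the actual rule set used in the system is not given.

```python
import re

# Assumed exception list: only the two abbreviations shown on the slide.
ABBREVIATIONS = {"Mr.", "Bs."}

def split_sentences(text):
    """Split on a period followed by whitespace and an uppercase letter."""
    sentences, start = [], 0
    for match in re.finditer(r"\.\s+", text):
        next_char = text[match.end():match.end() + 1]
        if not next_char.isupper():
            continue
        # The token that ends with this period, e.g. "Mr." or "Nội."
        token = text[start:match.start() + 1].split()[-1]
        if token in ABBREVIATIONS:
            continue
        sentences.append(text[start:match.start() + 1].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

sents = split_sentences("Mr. Minh đến Hà Nội. Bs. Tiến ở lại.")
for s in sents:
    print(s)
```

`str.isupper()` is used instead of `[A-Z]` so that Vietnamese capitals such as "Đ" count as uppercase.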

Features: word and word-shapes
Word features
Word-shape features (Minh et al., 2018)
Feature  Example
shape    “Đồng” → “ULLL”
shaped   “Đồng” → “UL”
type     “1234” → “AllDigit”
fregex   See (Le-Hong, 2016)
mix      “iPhone”
...      ...
Table: Word-shape features
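The first three rows of the table can be approximated as below. Only the "U"/"L" mapping and the "AllDigit" value come from the slide. Mapping digits to "D", keeping other characters unchanged, and returning "Other" for non-digit words are assumptions made for this sketch.

```python
import unicodedata
from itertools import groupby

def shape(word):
    """Map each character to U (upper), L (lower) or D (digit)."""
    word = unicodedata.normalize("NFC", word)  # keep "ồ" as one character
    out = []
    for ch in word:
        if ch.isupper():
            out.append("U")
        elif ch.islower():
            out.append("L")
        elif ch.isdigit():
            out.append("D")   # assumption: not shown on the slide
        else:
            out.append(ch)    # assumption: punctuation is kept as-is
    return "".join(out)

def shaped(word):
    """Collapse runs of identical shape symbols: ULLL -> UL."""
    return "".join(symbol for symbol, _ in groupby(shape(word)))

def word_type(word):
    # Only the "AllDigit" value appears on the slide; "Other" is a placeholder.
    return "AllDigit" if word.isdigit() else "Other"

print(shape("Đồng"), shaped("Đồng"), word_type("1234"))
```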

Word-representation features
Brown-cluster features
Train Brown clusters (5120 clusters) on 7GB of text data
Use the whole bitstring plus its prefixes of length 2, 4, 6, 8, 10, 12, 16, 20.
E.g.,
111111101111000 diễn_viên → {“11”, “1111”, “111111”, ..., “111111101111000”}
11111110111101111 người_mẫu → {“11”, “1111”, “111111”, ..., “11111110111101111”}
Word-embedding features
Same as (Le-Hong and Pham, 2017)
Train word embeddings of dimension 25 on 7GB of text data using GloVe
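A short sketch of the Brown-cluster prefix features, using the prefix lengths listed above. Skipping prefix lengths that are not shorter than the bitstring is an assumption, and `brown_features` is a hypothetical name.

```python
PREFIX_LENGTHS = (2, 4, 6, 8, 10, 12, 16, 20)

def brown_features(bitstring):
    """Return the bitstring prefixes followed by the whole bitstring."""
    # Assumption: a prefix length is skipped when the bitstring is not longer.
    features = [bitstring[:n] for n in PREFIX_LENGTHS if n < len(bitstring)]
    features.append(bitstring)
    return features

actor = brown_features("111111101111000")    # diễn_viên
model = brown_features("11111110111101111")  # người_mẫu
print(actor)
print(model)
```

The two words end up in different clusters but share every prefix up to length 12, so the model can still generalize between them through the shorter prefix features.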

3 Evaluation

Dataset
Type    Train                 Dev                 Test
        L1      L2     L3     L1     L2    L3     L1     L2    L3
LOC     8831    7      0      3043   2     0      2525   2     0
ORG     3471    1655   63     1203   690   14     1616   557   22
PER     6427    0      0      2168   0     0      3518   1     0
MISC    805     1      0      179    1     0      296    0     0
Total   19534   1663   63     6593   694   14     7955   561   22
Table: Number of entities of each type at each level in the train, development, and test sets (L1: level-1, L2: level-2, L3: level-3)

Evaluation on development set
Model          Precision  Recall  F1
Level-1 Model  91.04      84.41   87.60
Joint Model    90.42      84.72   87.47
Table: Evaluation results on the dev set for recognizing level-1 entities
Model          Precision  Recall  F1
Level-2 Model  85.81      72.44   78.56
Joint Model    84.36      77.06   80.54
Table: Evaluation results on the dev set for recognizing level-2 entities
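For reference, F1 in these tables is the harmonic mean of precision and recall. The sketch below recomputes it from the reported level-2 precision and recall (the function name `f1_score` is illustrative, not the official scorer):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Level-2 rows of the dev-set table (values in percent).
level2_separate = f1_score(85.81, 72.44)
level2_joint = f1_score(84.36, 77.06)
print(f"{level2_separate:.2f} {level2_joint:.2f}")
```

The joint model trades about 1.5 points of precision for about 4.6 points of recall on level-2 entities, which is what raises its F1.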

Submitted runs
Separated: a level-1 model plus a level-2 model
Joint: the joint model
Hybrid: the level-1 model recognizes level-1 entities, the joint model recognizes level-2 entities
Run    Method     Sentence segmentation
Run-1  Hybrid     YES
Run-2  Hybrid     NO
Run-3  Joint      YES
Run-4  Joint      NO
Run-5  Separated  YES
Run-6  Separated  NO
Table: Six submitted runs
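One way to read the hybrid setting is sketched below: level-1 tags come from the level-1 model, and level-2 tags are the second component of the joint model's tags. This is an interpretation of the run description, and `hybrid_tags` is a hypothetical helper.

```python
def hybrid_tags(level1_model_tags, joint_model_tags):
    """Pair level-1 tags from the level-1 model with level-2 tags from the joint model."""
    if len(level1_model_tags) != len(joint_model_tags):
        raise ValueError("both models must tag the same tokens")
    # Joint tags have the form "<level-1 tag>+<level-2 tag>".
    level2 = [tag.split("+")[1] for tag in joint_model_tags]
    return list(zip(level1_model_tags, level2))

# Illustrative predictions for the tokens "UBND TP Hà_Nội".
level1_pred = ["O", "B-LOC", "I-LOC"]
joint_pred = ["O+B-ORG", "B-LOC+I-ORG", "I-LOC+I-ORG"]
combined = hybrid_tags(level1_pred, joint_pred)
print(combined)
```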

Results
Run    Precision  Recall  F1
Run-1  76.08      70.68   73.28
Run-2  76.75      70.37   73.42
Run-3  76.32      70.25   73.16
Run-4  76.16      70.98   73.48
Run-5  75.70      70.28   72.89
Run-6  76.26      69.90   72.94
Table: Official evaluation results on the test set, which consider entities at all levels. Models were trained on the combination of the train and development datasets.

4 Conclusion

Conclusions
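A joint model that combines the tags of all levels improved recognition of nested entities: on the development set, level-2 F1 rose from 78.56 (separate level-2 model) to 80.54.
There is a big gap between accuracy on the development set (F1 around 87 on level-1 entities) and on the test set (F1 around 73 over all levels).
Domain adaptation techniques are needed.
A dictionary may help.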

Context/common-sense in NER
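INPUT: Ông nói: "Tôi đang dắt Bobby đi trên đường thì hai con chó tiến đến. Con chó giống Rottweiler tấn công Bobby từ phía sau. Bobby đã cố gắng chống trả, nhưng con chó trắng lớn kia không tha cho nó. Nó há miệng và ngoạm chặt lấy cổ chú chó của tôi, khiến nó không thể làm gì được nữa."
(English: He said: "I was walking Bobby along the road when two dogs approached. The Rottweiler attacked Bobby from behind. Bobby tried to fight back, but the big white dog would not let go of him. It opened its mouth and clamped down on my dog's neck, leaving him unable to do anything.")
SYSTEM OUTPUT: Ông nói: "Tôi đang dắt <ENAMEX TYPE="PERSON">Bobby</ENAMEX> đi trên đường thì hai con chó tiến đến. Con chó giống Rottweiler tấn công Bobby từ phía sau. <ENAMEX TYPE="PERSON">Bobby</ENAMEX> đã cố gắng chống trả, nhưng con chó trắng lớn kia không tha cho nó. Nó há miệng và ngoạm chặt lấy cổ chú chó của tôi, khiến nó không thể làm gì được nữa."
Bobby is a dog, yet the system tags it as PERSON; getting this right requires context or common sense.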
