This document discusses experiments analyzing how the performance of a named entity recognition (NER) tagger changes over time using a Portuguese news corpus spanning 8 years. The experiments measure corpus similarity, NER performance as measured by F1-score, and name list overlaps across different time gaps between training and test data. The results show that as the time gap increases, corpus similarity decreases, F1-score tends to decay, and name list overlaps decrease, indicating that an NER tagger's performance declines when applied to texts from time periods increasingly distant from its training data.
Is this NE tagger getting old?
Language Resources and Evaluation Conference
Marrakech, Morocco, May 28th–30th, 2008
Cristina Mota (IST & L2F INESC-ID, Portugal, and New York University, USA) and Ralph Grishman (New York University, USA)
(Advisors: Ralph Grishman & Nuno Mamede)
This research was funded by Fundação para a Ciência e a Tecnologia (doctoral scholarship SFRH/BD/3237/2000)
Cristina Mota and Ralph Grishman Is this NE tagger getting old?
Outline
1 Introduction
2 Corpus Analysis
3 NER Performance Analysis
4 Experiments
5 Final Remarks
What is NER?
Input: Mary is studying in Rabat at Mohammed V University
NE Tagger output: [Mary]PER is studying in [Rabat]LOC at [Mohammed V University]ORG
The Problem
[Figure: occurrences per 100,000 words of the names "UE", "CEE", "União Europeia", and "Comunidade Europeia", per semester (91a–98a)]
Do texts vary over time in a way that affects NE recognition?
Should NE taggers also be conceived as time-aware?
Approach
Corpus Analysis:
Measure corpus similarity based on words
Compute name list overlaps, by type and by token

NER Performance Analysis:
Assess performance by training and testing with different (train, test) configurations
Increase the time gap between training and test data
Corpus Similarity Algorithm (Kilgarriff, 2001)
Similarity(A, B):
  Split corpora A and B into k slices each
  Repeat m times:
    Randomly allocate k/2 slices to Ai and k/2 slices to Bi
    Construct word frequency lists for Ai and Bi
    Compute CBDF between Ai and Bi for the n most frequent words of the joint corpus (Ai + Bi)
    [CBDF = χ² By Degrees of Freedom]
  Output the mean and standard deviation of CBDF over all m iterations
Repeat using corpus A only: Similarity(A,A) → Homogeneity(A)
Repeat using corpus B only: Similarity(B,B) → Homogeneity(B)
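The procedure above can be sketched in a few lines of Python. This is a minimal, self-contained illustration, not the authors' code: the χ² statistic is the standard two-sample form, and the function names and default parameters (m, n) are placeholder assumptions.

```python
import random
from collections import Counter

def cbdf(freq_a, freq_b, n):
    """Chi-square By Degrees of Freedom over the n most frequent
    words of the joint corpus (lower = more alike)."""
    top = [w for w, _ in (freq_a + freq_b).most_common(n)]
    size_a, size_b = sum(freq_a.values()), sum(freq_b.values())
    chi2 = 0.0
    for w in top:
        oa, ob = freq_a[w], freq_b[w]
        # expected counts if both sides used the word at the same rate
        rate = (oa + ob) / (size_a + size_b)
        ea, eb = rate * size_a, rate * size_b
        chi2 += (oa - ea) ** 2 / ea + (ob - eb) ** 2 / eb
    return chi2 / (len(top) - 1)

def similarity(slices_a, slices_b, m=10, n=500, seed=0):
    """Mean CBDF over m random draws of half the slices of A
    against half the slices of B."""
    rng = random.Random(seed)
    scores = []
    for _ in range(m):
        half_a = rng.sample(slices_a, len(slices_a) // 2)
        half_b = rng.sample(slices_b, len(slices_b) // 2)
        fa = Counter(w for sl in half_a for w in sl)
        fb = Counter(w for sl in half_b for w in sl)
        scores.append(cbdf(fa, fb, n))
    return sum(scores) / len(scores)

def homogeneity(slices, m=10, n=500, seed=0):
    """Similarity(A, A): compare two disjoint random halves of A."""
    rng = random.Random(seed)
    scores = []
    for _ in range(m):
        s = list(slices)
        rng.shuffle(s)
        half = len(s) // 2
        fa = Counter(w for sl in s[:half] for w in sl)
        fb = Counter(w for sl in s[half:] for w in sl)
        scores.append(cbdf(fa, fb, n))
    return sum(scores) / len(scores)
```

Each slice is a list of word tokens; two corpora with identical slice contents score 0, while corpora with disjoint vocabularies score high.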
Corpus Similarity Algorithm (Kilgarriff, 2001)
  Corpus A          ½ Corpus A + ½ Corpus B   Corpus B
  D_AA′ 1           D_AB′ 1                   D_BB′ 1
  D_AA′ 2           D_AB′ 2                   D_BB′ 2
  ...               ...                       ...
  D_AA′ m           D_AB′ m                   D_BB′ m
  mean D̄_AA′        mean D̄_AB′                mean D̄_BB′
  = Homogeneity(A)  = Similarity(A, B)        = Homogeneity(B)

Lower values of D̄ ⇒ higher homogeneity/similarity
Name List Overlaps
type overlap = |TA ∩ TB| / (|TA| + |TB| − |TA ∩ TB|)   (1)

token overlap = Σ_{i=1}^{N} min(fA(i), fB(i)) / Σ_{i=1}^{N} max(fA(i), fB(i))   (2)
TA = list of different names (name types) of text A
fA (i ) = frequency of name i in text A
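Equations (1) and (2) translate directly into code. A minimal sketch (function names are illustrative, not from the paper):

```python
from collections import Counter

def type_overlap(names_a, names_b):
    """Jaccard overlap of the sets of distinct names: equation (1)."""
    ta, tb = set(names_a), set(names_b)
    inter = ta & tb
    return len(inter) / (len(ta) + len(tb) - len(inter))

def token_overlap(names_a, names_b):
    """Frequency-weighted overlap: equation (2). For every name seen
    in either text, sum the min of its two frequencies and divide by
    the sum of the max (a name absent from a text has frequency 0)."""
    fa, fb = Counter(names_a), Counter(names_b)
    names = set(fa) | set(fb)
    num = sum(min(fa[name], fb[name]) for name in names)
    den = sum(max(fa[name], fb[name]) for name in names)
    return num / den
```

On the worked example from the next slide (A: Mary ×3, Rabat ×5, Mohammed V University ×4; B: John ×1, Rabat ×2, Mohammed V University ×6), these return 2/4 and 6/15.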
Name List Overlaps
A name list: Mary (3), Rabat (5), Mohammed V University (4)
B name list: John (1), Rabat (2), Mohammed V University (6)
Type Overlap
|{Rabat, Mohammed V University}| / |{Mary, Rabat, Mohammed V University, John}| = 2/4

Token Overlap
(min(3,0) + min(5,2) + min(4,6) + min(0,1)) / (max(3,0) + max(5,2) + max(4,6) + max(0,1)) = 6/15
NE Tagger Description (Collins & Singer, 1999)
Tagging pipeline:
Raw text → POS tagging + parsing → shallow-parsed text → NE identification → text with unclassified NEs → list of examples (NE, context) → NE classification (seeded with name seeds) → text update + NE propagation → text with classified NEs

Classification in detail (co-training loop):
Name rules are initialized from the name seeds → label examples with name rules → infer contextual rules → label with contextual rules → infer name rules → label with name + contextual rules → list of labeled examples (NE, context, label) → repeat
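The co-training loop above can be sketched as a toy decision-list bootstrapper. This is a drastic simplification of Collins & Singer's algorithm, for intuition only: the two "views" are reduced to the full spelling string and the bare context string, and the precision threshold and round count are illustrative assumptions.

```python
from collections import defaultdict

def _infer(pairs, threshold):
    """Induce feature -> label rules, keeping only features that
    predict a single label with precision >= threshold."""
    counts = defaultdict(lambda: defaultdict(int))
    for feat, label in pairs:
        counts[feat][label] += 1
    rules = {}
    for feat, by_label in counts.items():
        label, n = max(by_label.items(), key=lambda kv: kv[1])
        if n / sum(by_label.values()) >= threshold:
            rules[feat] = label
    return rules

def cotrain(examples, seeds, rounds=5, threshold=0.95):
    """Minimal co-training sketch. `examples` is a list of
    (spelling, context) pairs; `seeds` maps a few spellings to
    labels. Spelling rules and context rules take turns labeling
    the data and inferring new rules from each other's labels."""
    spell_rules = dict(seeds)   # spelling -> label, seeded
    ctx_rules = {}              # context  -> label
    for _ in range(rounds):
        # label with spelling rules, then infer context rules
        labeled = [(s, c, spell_rules[s]) for s, c in examples if s in spell_rules]
        ctx_rules.update(_infer(((c, y) for _, c, y in labeled), threshold))
        # label with context rules, then infer new spelling rules
        labeled = [(s, c, ctx_rules[c]) for s, c in examples if c in ctx_rules]
        spell_rules.update(_infer(((s, y) for s, _, y in labeled), threshold))
    return spell_rules, ctx_rules
```

For example, with examples [("Mary", "studies at"), ("Rabat", "city of"), ("Lisbon", "city of"), ("John", "studies at")] and seeds {"Mary": "PER", "Rabat": "LOC"}, one round is enough to propagate LOC to Lisbon and PER to John via the shared contexts.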
Experimental Setting
CETEMPúblico (Santos & Rocha, 2001) is a public Portuguese journalistic corpus
Size: 180 million words
Time span: 8 years
Organization: randomly shuffled extracts [1 extract ≅ 2 paragraphs]
Classification: 10 topics and 16 time frames (year + semester)
Mark-up: paragraphs, sentences, enumeration lists and authors
[Figure: number of words per time frame (semesters 91a–98a) for the Culture, Sports, Economy, Politics, and Society topics]
Experimental Setting
Topic: politics
Time unit: year
Text unit: sentence
Size: 10 slices × 60,000 words per time frame
n most frequent words: 2,000
Names compared: 82,400 per time frame
Seeds (S): different names in the first 2,500 name instances [first 198 extracts per semester]
Test (T): next 208 extracts per semester, grouped by year
Unlabeled examples (U): first 82,456 names with context per year [following 7,856 extracts]
NER Performance: F-Measure over Time
[Figure: F-measure (%) vs. time gap in years (0–7)]
When the texts are from the same year (time gap = 0), the F-measure ranges approximately from 82% to 85%
When the texts are 5 years apart, the F-measure ranges from about 79% to 82%
As the time gap between (Sk, Uk) and Tj increases, the F-measure shows a tendency to decay
Training-test configuration: (Si ,Ui ,Tj ), i =91..98, j=91..98 [64 tests]
Politics Corpus Dissimilarity over Time
[Figure: dissimilarity (mean CBDF) vs. time gap in years (0–7)]
The homogeneity for all the texts is very close to 1
Increasing the time gap to one year, the dissimilarity ranges from 2.5 to 4.5
At a distance of five years, the dissimilarity ranges from 4.7 to almost 6.5
The dissimilarity shows a tendency to increase as the time gap increases
Corpus comparisons: (Ui ,Uj ), i =91..98, j=91..98 [64 comparisons; Higher values = Lower similarity]
Politics Name List Overlap over Time
[Figure: name type overlap (%) and name token overlap (%) vs. time gap in years (0–7)]
Within the same time frame, the type overlap varies between 5% and 6%
At a distance of 5 years, it varies between 3.5% and 4.5%
Within the same year, the name token overlap varies between 4.2% and 4.4%
At a distance of 5 years, it varies between 3.2% and 3.7%
Overlap between name lists thus also decreases over time
Corpus comparisons: (Ui ,Tj ), i =91..98, j=91..98 [64 comparisons]
F-Measure compared to Dissimilarity
[Figure: F-measure (%) vs. dissimilarity (mean CBDF)]
There is an inverse association between dissimilarity and F-measure: for higher levels of dissimilarity (i.e., higher distance values) we obtain lower performance values
OBS: Higher values = Lower similarity
Main Results
Within a period of 8 years we observed that:
Corpus similarity and name overlaps tend to decrease as the two corpora become more temporally distant
The performance of a co-training-based NE tagger trained and tested on those texts decays as we increase the time gap between the training and the test data
There is an association between the results of the corpus analysis and the tagger performance
Work in Progress
Other related issues we are currently investigating, aiming at better named entity recognition:
Analyze the contexts surrounding NEs, to verify whether they also tend to overlap less over time
Investigate how we can avoid the performance decay:
Do we need more data?
Do we need more labeled data within the same time frame?
Do we need more unlabeled data within the same time frame?