Elena Bolshakova and Natalia Efremova - A Heuristic Strategy for Extracting Terms from Scientific Texts
1. A HEURISTIC STRATEGY FOR EXTRACTING TERMS FROM SCIENTIFIC TEXTS
Elena I. Bolshakova, Natalia E. Efremova
Lomonosov Moscow State University, National Research University Higher School of Economics
Moscow, Russia
2. CONTENTS
● Approaches to term extraction
● Term extraction from scientific texts
● Types and patterns of extracted terms
● Term extraction procedures
● Steps of heuristic term extraction strategy
● Comparative evaluation of the strategy
● Conclusions
3. TERMS and TERM EXTRACTION
Terms are words or multiword units that refer to concepts of specific domains
Examples: nonlinear plan, coefficient adjustment learning
Term recognition techniques:
✓ statistical and linguistic criteria
✓ shallow syntactic analysis
Applications of automatic term extraction:
✓ compiling terminology dictionaries
✓ constructing thesauri and ontologies
✓ text abstracting and summarization
✓ computer-aided writing and editing of specialized texts
✓ construction of glossaries and subject indexes
4. APPROACHES to TERM EXTRACTION
Corpus-based terminology extraction:
● large text collections and corpora are processed
● statistical criteria of term recognition are exploited (such as the tf.idf measure and its numerous modifications)
● little linguistic information is used (such as the part of speech of words)
Term recognition in a single text:
● small and medium-sized texts are processed
● statistical measures become less significant, and contrast corpora are not always available
● more comprehensive linguistic information is required for reliable term extraction
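The tf.idf criterion mentioned for corpus-based extraction can be illustrated with a minimal sketch. The toy corpus, the whitespace tokenization and the unsmoothed log(N/df) weighting are assumptions made for illustration; the slides do not say which tf.idf variant is meant.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: score} dict per document.

    tf  = relative frequency of the word in the document
    idf = log(N / df), where df is the number of documents containing the word
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # count each word once per document
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append(
            {w: (c / total) * math.log(n_docs / df[w]) for w, c in counts.items()}
        )
    return scores

# Hypothetical toy corpus (not from the slides)
corpus = [
    "sterile neutrino mass and neutrino mixing",
    "return address and stack frame",
    "mass of the register and the frame",
]
scores = tf_idf(corpus)
```

A word that occurs in every document (here "and") gets a score of zero, which is why the measure loses discriminating power when only a single text is available.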
5. TERM EXTRACTION from SCIENTIFIC TEXTS
Scientific texts: intensive use of terms
● Our main goal: to improve the quality of automatic term extraction
  ➢ from a particular scientific text
  ➢ by exploiting various linguistic information about terms and their occurrences in texts
● Applications of term recognition in a single text:
  ✓ creation of glossaries and subject indexes
  ✓ checks of term consistency and accuracy
Scientific texts in Russian are processed in our work
6. STAGES of OUR RESEARCH
Our work included:
Empirical study of scientific texts and terminological dictionaries in Russian (on computer science and physics)
Formalization of linguistic features of multi-word terms and their occurrences in texts:
● typical term structures
● terminological contexts
● text variants of terms
LSPL (Lexico-Syntactic Pattern Language) is used as a tool
Specification of types of extracted terms on the basis of linguistic information used for term recognition
Development of extraction procedures for each term type
Testing the procedures and working out a strategy for combining the sets of terms extracted by them
7. TYPES and LSPL-PATTERNS of EXTRACTED TERMS (1)
● Term candidates have specified grammatical structures
  стерильный нейтрино – sterile neutrinos
  A N <A=N> (LSPL-pattern)
● Author's terms appear in contexts of definitions
  Вероятность есть степень возможности… – Probability is the measure of the likeliness…
  Term<c=nom> "есть" Defin<c=nom> => Term
● Term synonyms:
  инфракрасный (ИК) – infrared (IR)
  Term1 "(" Term2 ")" <Term1.c=Term2.c> => Term1, Term2
● Dictionary terms from a terminological dictionary
  адрес, адрес возврата – address, return address
  N1<адрес> [N2<возврат,c=gen>]
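The pattern A N <A=N> asks for an adjective followed by a noun that agrees with it. A minimal sketch of that check over hand-tagged tokens is given here; the `Token` structure, the tag set and the choice of gender, number and case as the agreement features are assumptions for illustration and do not reproduce the actual LSPL engine.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    lemma: str
    pos: str      # "A" = adjective, "N" = noun, "V" = verb (hypothetical tag set)
    gender: str
    number: str
    case: str

def agrees(a: Token, n: Token) -> bool:
    """Agreement condition <A=N>: same gender, number and case (assumed features)."""
    return (a.gender, a.number, a.case) == (n.gender, n.number, n.case)

def match_a_n(tokens):
    """Yield (adjective lemma, noun lemma) for every adjacent agreeing A N pair."""
    for a, n in zip(tokens, tokens[1:]):
        if a.pos == "A" and n.pos == "N" and agrees(a, n):
            yield (a.lemma, n.lemma)

# Hand-tagged toy input; a real system would take tags from a morphological analyzer.
sentence = [
    Token("внутренний", "A", "masc", "sg", "nom"),
    Token("регистр", "N", "masc", "sg", "nom"),
    Token("хранить", "V", "", "", ""),
    Token("адрес", "N", "masc", "sg", "acc"),
]
candidates = list(match_a_n(sentence))
```

Because the pattern fixes only the grammatical structure, any agreeing adjective-noun pair is returned, whether or not it is a real term.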
8. TYPES and LSPL-PATTERNS of EXTRACTED TERMS (2)
● Combinations of several multi-word terms
  N1 A N2<c=gen> <A=N2> => N1 N2<c=gen>, A N2 <A=N2>
  разрядность внутреннего регистра = разрядность регистра + внутренний регистр
  capacity of internal register = capacity of register + internal register
  A1 "и" A2 N <A1=A2=N> => A1 N <A1=N>, A2 N <A2=N>
  гравитационная и инертная масса = гравитационная масса + инертная масса
  gravitational and inertial mass = gravitational mass + inertial mass
● Text variants of a single term
  фрейм активации → фрейм, запись активации
  activation frame → frame, activation record
  A1 N <A1=N> => N, A2 N <A2=N> <Syn(A1,A2)>
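The two combination patterns can be read as rewrite rules that split a longer phrase into shorter term candidates. The sketch below applies them to (word, tag) pairs; the tag names are hypothetical, and the agreement and case conditions as well as normalization to the dictionary form are deliberately left out.

```python
def split_combination(tagged):
    """Split a tagged phrase into constituent term candidates.

    `tagged` is a list of (word, pos) pairs with hypothetical tags
    "A" (adjective), "N" (noun) and "CONJ" (conjunction).
    Agreement and case conditions of the patterns are NOT checked here,
    and words are kept in their surface form.
    """
    words = [w for w, _ in tagged]
    tags = [t for _, t in tagged]
    if tags == ["N", "A", "N"]:
        # N1 A N2  =>  N1 N2,  A N2
        n1, a, n2 = words
        return [f"{n1} {n2}", f"{a} {n2}"]
    if tags == ["A", "CONJ", "A", "N"]:
        # A1 "и" A2 N  =>  A1 N,  A2 N
        a1, _, a2, n = words
        return [f"{a1} {n}", f"{a2} {n}"]
    return []

parts = split_combination(
    [("гравитационная", "A"), ("и", "CONJ"), ("инертная", "A"), ("масса", "N")]
)
```

For the first pattern the adjective stays in the genitive ("внутреннего регистра"); producing "внутренний регистр" as on the slide would need a lemmatization step that this sketch omits.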
9. TERM EXTRACTION PROCEDURES
● Formalization => 6 groups of LSPL-patterns, according to types of extracted terms
● For each group, an automatic term extraction procedure was developed
● Each procedure was tested on texts in computer science and physics domains:
  ✓ sizes of the texts vary from 1500 to 4700 words (total volume ≈ 16000 words)
● Dictionary terms in physics (> 3000) and in computer science (> 4000) were used
10. EVALUATION of THE PROCEDURES
Recall and precision were measured both for extraction of terms and for recognition of their occurrences
For example:
The geodetic effect represents the effect of the curvature of spacetime… The geodetic effect was first predicted by ...
Extracted term: geodetic effect + two recognized occurrences
Procedure and Type of Terms | Extraction of Terms: Recall | Extraction of Terms: Precision | Recognition of their Occurrences: Recall | Recognition of their Occurrences: Precision
Term candidates             | 57.7% | 27.4% | 59.6% | 48.6%
Author's terms              | 92.3% | 95.9% | 73.7% | 77.9%
Synonyms of terms           | 64.0% | 49.9% | –     | –
Dictionary terms            | 94.0% | 83.2% | 89.2% | 72.0%
Term combination            | 81.7% | 24.7% | –     | –
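Recall and precision for a set of extracted terms can be computed as in the following minimal sketch. It assumes exact string matching against a manually prepared reference list; the reference set shown is hypothetical and the slides do not describe the matching criterion.

```python
def precision_recall(extracted, reference):
    """Return (precision, recall) as fractions between 0 and 1.

    precision = correct extracted terms / all extracted terms
    recall    = correct extracted terms / all reference terms
    """
    extracted, reference = set(extracted), set(reference)
    correct = extracted & reference
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(reference) if reference else 0.0
    return precision, recall

# Hypothetical data: three extracted candidates, four reference terms
extracted = {"geodetic effect", "curvature of spacetime", "important problem"}
reference = {"geodetic effect", "curvature of spacetime", "spacetime", "frame dragging"}
precision, recall = precision_recall(extracted, reference)
```

The same function applies to occurrences if each item is a (term, position) pair instead of a bare term.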
11. STRATEGY FOR TERM EXTRACTION: KEY IDEAS
Analysis of incompleteness and inaccuracy of term extraction shows:
● certain terms are not extracted because of their complex grammatical structure
● some patterns of term definitions are ambiguous (adding them increases recall but decreases precision)
● patterns of term combinations and term candidates fix only their grammatical structure, so many non-terms (e.g., important problem of astronomy) match the patterns
● dictionary terms are not recognized when they are broken up within term combinations
Linguistic features of terms are not mutually exclusive => the sets of terms extracted by the procedures intersect
So a strategy for combining the extracted sets by heuristic selection was worked out, in order to improve the quality of extraction
12. HEURISTIC STRATEGY: STEPS 1-3
● The final set S of terms is formed incrementally; initially S is empty
● In each step of the strategy, some terms from the pre-extracted sets of terms are added to S
Step   | Set selection and addition
       | S := ∅
Step 1 | S1 := AUTHOR'S TERMS + DICTIONARY TERMS that aren't fragments of TERM CANDIDATES
Step 2 | S2 := DICTIONARY TERMS that are constituents of CONJUNCTIONLESS TERM COMBINATIONS + CONJUNCTIONLESS TERM COMBINATIONS that include DICTIONARY TERMS as constituents
       | S := S1 ∪ S2
Step 3 | S3 := SYNONYMS of all terms that belong to the current S
       | S := S ∪ S3
13. HEURISTIC STRATEGY: STEPS 4-8
Step   | Set selection and addition
Step 4 | S4 := DICTIONARY TERMS and TERM CANDIDATES if they are constituents of a TERM COMBINATION WITH CONJUNCTION that includes a term from S, a DICTIONARY TERM or a broken TERM CANDIDATE
       | S := S ∪ S4
Step 5 | S5 := DICTIONARY TERMS and TERM CANDIDATES if they are constituents of a CONJUNCTIONLESS TERM COMBINATION that includes a broken term from S, a broken DICTIONARY TERM or a broken TERM CANDIDATE
       | If S3 ∪ S4 ∪ S5 ≠ ∅ then S := S ∪ S5; goto Step 3
Step 6 | S6 := TERM VARIANTS of all terms that belong to the current S
Step 7 | S7 := TERM CANDIDATES with frequency more than F
Step 8 | S8 := DICTIONARY TERMS that are not yet in S
       | After each step i = 6, 7, 8: If Si ≠ ∅ then S := S ∪ Si; goto Step 3
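The control flow of Steps 3-8 (repeat Steps 3-5 while they add terms, then try Steps 6-8 and return to Step 3 whenever one of them adds something) can be sketched as a fixpoint loop. This is only an illustration under stated assumptions: each Si is taken to contain only terms not yet in S (otherwise the loop would not terminate), the selection rules are passed in as hypothetical callables, and the toy rules below are much simpler than the conditions in the tables.

```python
def combine(initial, loop_rules, late_rules):
    """Fixpoint-style combination of pre-extracted term sets.

    initial    -- terms selected by Steps 1-2 (S1 ∪ S2)
    loop_rules -- callables S -> set, standing in for Steps 3-5
    late_rules -- callables S -> set, standing in for Steps 6-8
    Each rule returns candidate terms given the current S.
    """
    s = set(initial)
    while True:
        # Steps 3-5: repeat while they contribute new terms
        changed = True
        while changed:
            changed = False
            for rule in loop_rules:
                new = rule(s) - s
                if new:
                    s |= new
                    changed = True
        # Steps 6-8: the first step that adds terms sends control back to Step 3
        for rule in late_rules:
            new = rule(s) - s
            if new:
                s |= new
                break
        else:
            return s  # no step added anything: S is final

# Hypothetical pre-extracted data
synonyms = {"infrared radiation": {"IR radiation"}}
variants = {"activation frame": {"activation record"}}
frequent_candidates = {"sterile neutrino"}

def step3(s):  # synonyms of terms already in S
    return {syn for t in s for syn in synonyms.get(t, ())}

def step6(s):  # text variants of terms already in S
    return {v for t in s for v in variants.get(t, ())}

def step7(s):  # term candidates above the frequency threshold F
    return set(frequent_candidates)

final = combine({"infrared radiation", "activation frame"}, [step3], [step6, step7])
```

Since S only grows and the pre-extracted sets are finite, the loop stops once no rule yields a new term.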
14. COMPARATIVE EVALUATION of HEURISTIC STRATEGY
Collection of texts (≈ 33000 words) of different genres and sizes (on computer science and physics)
Comparison with several methods commonly used for term extraction from text corpora:
Mutual-Inf | extraction of two-word terms based on statistics of word occurrences and co-occurrences
Mod-Mutual | modification of the Mutual-Inf method
SP         | extraction of terms according to their grammatical structures
C-Value    | term recognition using frequencies of words and information about embedded terms
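As an illustration of the kind of statistic behind the Mutual-Inf baseline, the sketch below scores adjacent word pairs by pointwise mutual information. The slides do not give the exact formula used in the comparison, so the standard definition log2(p(x,y) / (p(x)·p(y))) and the whitespace tokenization are assumptions.

```python
import math
from collections import Counter

def pmi_bigrams(tokens):
    """Score each adjacent word pair by pointwise mutual information (log base 2)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores

tokens = "the geodetic effect is the effect of the curvature of spacetime".split()
scores = pmi_bigrams(tokens)
best_pair = max(scores, key=scores.get)
```

On a single short text the counts are tiny, so pairs of rare words all receive similar high scores; this is the weakness of purely statistical criteria that the slides point out for single-text extraction.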
15. EVALUATION of THE STRATEGY
● 17.6-point increase in F-measure for extraction of terms (61.8% vs. 44.2% for Mod-Mutual, the best of the compared methods)
● 11.7-point increase in F-measure for recognition of term occurrences (63.6% vs. 51.9% for Mod-Mutual)
Methods    | Extraction of Terms: Recall | Precision | F-measure | Recognition of their Occurrences: Recall | Precision | F-measure
Mutual-Inf | 27.3% | 13.0% | 17.6% | 24.4% | 20.4% | 22.2%
Mod-Mutual | 54.1% | 37.4% | 44.2% | 69.2% | 41.5% | 51.9%
SP         | 51.4% | 22.6% | 31.4% | 37.3% | 29.7% | 33.1%
C-Value    | 35.5% | 4.9%  | 8.6%  | 21.3% | 5.9%  | 9.3%
Strategy   | 53.6% | 73.1% | 61.8% | 68.1% | 59.7% | 63.6%
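The F-measure values for term extraction are consistent with the balanced F-measure, the harmonic mean of recall and precision. The slides do not state the formula, so that is an assumption; the short sketch below recomputes the column from the recall and precision figures reported for term extraction.

```python
def f_measure(recall, precision):
    """Balanced F-measure (harmonic mean), in the same units as the inputs."""
    if recall + precision == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (recall, precision) in percent for extraction of terms, as reported in the evaluation
results = {
    "Mutual-Inf": (27.3, 13.0),
    "Mod-Mutual": (54.1, 37.4),
    "SP": (51.4, 22.6),
    "C-Value": (35.5, 4.9),
    "Strategy": (53.6, 73.1),
}
f_scores = {name: round(f_measure(r, p), 1) for name, (r, p) in results.items()}

# Gain of the strategy over the best compared method, in percentage points
gain = round(f_scores["Strategy"] - f_scores["Mod-Mutual"], 1)
```

The recomputed gain matches the 17.6 figure quoted for extraction of terms.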
16. CONCLUSIONS
● We propose a heuristic strategy for term extraction based on various linguistic information, including
  ✓ grammatical structures of multiword scientific terms
  ✓ their text variants
  ✓ contexts of their usage
● The information has been represented as a set of LSPL lexico-syntactic patterns
● Experimental evaluation of our strategy shows an increase in F-measure in comparison with commonly used methods of term extraction
Nevertheless, the strategy needs further verification on texts of various scientific domains and sizes