Elena Bolshakova and Natalia Efremova - A Heuristic Strategy for Extracting Terms from Scientific Texts

A HEURISTIC STRATEGY FOR EXTRACTING TERMS FROM SCIENTIFIC TEXTS

Elena I. Bolshakova, Natalia E. Efremova

Lomonosov Moscow State University,
National Research University Higher School of Economics

Moscow, Russia

CONTENTS

● Approaches to term extraction
● Term extraction from scientific texts
● Types and patterns of extracted terms
● Term extraction procedures
● Steps of heuristic term extraction strategy
● Comparative evaluation of the strategy
● Conclusions

TERMS and TERM EXTRACTION

Terms are words or multiword units that refer to concepts of specific domains:
nonlinear plan, coefficient adjustment learning

Term recognition techniques:
  ✓ statistical and linguistic criteria
  ✓ shallow syntactic analysis

Applications of automatic term extraction:
  ✓ compiling terminology dictionaries
  ✓ constructing thesauri and ontologies
  ✓ text abstracting and summarization
  ✓ computer-aided writing and editing of specialized texts
  ✓ constructing glossaries and subject indexes

APPROACHES to TERM EXTRACTION

Corpus-based terminology extraction:
●  large text collections and corpora are processed
●  statistical criteria of term recognition are exploited (such as the tf.idf measure and its numerous modifications)
●  little linguistic information is used (such as the part of speech of words)

Term recognition in a single text:
●  small and medium-sized texts are processed
●  statistical measures become less significant, and contrast corpora are not always available
●  more comprehensive linguistic information is required for reliable term extraction
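To illustrate why corpus statistics lose force on a single text, here is a minimal sketch of the basic tf.idf weight. The function name `tf_idf`, the toy corpus, and the choice of the plain `tf * log(N / df)` variant are assumptions for illustration; the slides do not say which modification the authors have in mind.

```python
import math
from typing import List


def tf_idf(term: str, doc: List[str], corpus: List[List[str]]) -> float:
    """Basic tf.idf: relative frequency in doc times log(N / document frequency)."""
    df = sum(1 for d in corpus if term in d)  # documents containing the term
    if not doc or df == 0:
        return 0.0
    tf = doc.count(term) / len(doc)
    return tf * math.log(len(corpus) / df)


# Hypothetical toy corpus of three tokenized documents
corpus = [["neutrino", "mass"], ["mass", "frame"], ["frame", "address"]]
rare = tf_idf("neutrino", corpus[0], corpus)    # occurs in one document
common = tf_idf("mass", corpus[0], corpus)      # occurs in two documents
single = tf_idf("neutrino", corpus[0], [corpus[0]])  # corpus of one text
```

With a corpus consisting of a single text, `N / df` is 1 for every term that occurs, so the weight collapses to zero. This is one way to see why single-text extraction has to lean on linguistic information instead.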

TERM EXTRACTION from SCIENTIFIC TEXTS

Scientific texts: intensive use of terms

● Our main goal: to improve the quality of automatic term extraction
  ➢ from a particular scientific text
  ➢ by exploiting various linguistic information about terms and their occurrences in texts

● Applications of term recognition in a single text:
  ✓ creation of glossaries and subject indexes
  ✓ checks of term consistency and accuracy

Scientific texts in Russian are processed in our work

STAGES of OUR RESEARCH

Our work included:

  ✓ Empirical study of scientific texts and terminological dictionaries in Russian (on computer science and physics)

  ✓ Formalization of linguistic features of multi-word terms and their occurrences in texts:
     ●  typical term structures
     ●  terminological contexts
     ●  text variants of terms
    LSPL (Lexico-Syntactic Pattern Language) is used as a tool

  ✓ Specification of types of extracted terms on the basis of linguistic information used for term recognition

  ✓ Development of extraction procedures for each term type

  ✓ Testing the procedures and working out a strategy for combining the sets of terms extracted by them

TYPES and LSPL-PATTERNS of EXTRACTED TERMS (1)

● Term candidates have specified grammatical structures
    стерильный нейтрино – sterile neutrinos
    A N <A=N>   (LSPL-pattern)

● Author’s terms appear in contexts of definitions
    Вероятность есть степень возможности… – Probability is the measure of the likeliness…
    Term<c=nom> "есть" Defin<c=nom> => Term

● Term synonyms:
    инфракрасный (ИК) – infrared (IR)
    Term1 "("Term2")" <Term1.c=Term2.c> => Term1, Term2

● Dictionary terms from a terminological dictionary
    адрес, адрес возврата – address, return address
    N1<адрес> [N2<возврат,c=gen>]
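The first pattern, A N <A=N>, can be sketched in plain Python. This is not LSPL and not the authors' implementation: the `Token` class, its tag values, and the reading of <A=N> as agreement in gender, number and case are assumptions made for illustration, and the morphological tags are supplied by hand rather than by a real analyzer.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Token:
    lemma: str
    pos: str     # hypothetical tags: "A" adjective, "N" noun, "V" verb
    gender: str
    number: str
    case: str


def agree(a: Token, n: Token) -> bool:
    """Assumed meaning of <A=N>: same gender, number and case."""
    return (a.gender, a.number, a.case) == (n.gender, n.number, n.case)


def match_a_n(tokens: List[Token]) -> List[Tuple[str, str]]:
    """Return lemma pairs for every adjacent adjective + noun that agree."""
    return [
        (a.lemma, n.lemma)
        for a, n in zip(tokens, tokens[1:])
        if a.pos == "A" and n.pos == "N" and agree(a, n)
    ]


# Hand-tagged toy input (tags are illustrative, not analyzer output)
tokens = [
    Token("стерильный", "A", "neut", "pl", "nom"),
    Token("нейтрино", "N", "neut", "pl", "nom"),
    Token("существовать", "V", "", "pl", ""),
]
candidates = match_a_n(tokens)
```

Returning lemmas rather than surface forms mirrors the slide's example, where the candidate is shown in dictionary form.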

TYPES and LSPL-PATTERNS of EXTRACTED TERMS (2)

● Combinations of several multi-word terms

    N1 A N2<c=gen> <A=N2>  =>  N1 N2<c=gen>, A N2 <A=N2>
      разрядность внутреннего регистра = разрядность регистра + внутренний регистр
      capacity of internal register = capacity of register + internal register

    A1 "и" A2 N <A1=A2=N>  =>  A1 N <A1=N>, A2 N <A2=N>
      гравитационная и инертная масса = гравитационная масса + инертная масса
      gravitational and inertial mass = gravitational mass + inertial mass

● Text variants of a single term
    фрейм активации → фрейм, запись активации
    activation frame → frame, activation record
    A1 N <A1=N> => N, A2 N <A2=N> <Syn(A1,A2)>
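The two combination patterns amount to splitting one phrase into its constituent terms. A rough sketch follows, assuming the phrase is already lemmatized and tagged; the function name `split_combination` and the tag "CONJ" are hypothetical, and the agreement and genitive-case conditions of the real patterns are deliberately omitted.

```python
from typing import List, Tuple

Tagged = Tuple[str, str]  # (lemma, part-of-speech tag)


def split_combination(phrase: List[Tagged]) -> List[Tuple[str, str]]:
    """Split a term combination into constituent two-word terms (lemmas)."""
    lemmas = [lemma for lemma, _ in phrase]
    tags = [tag for _, tag in phrase]
    # N1 A N2  =>  N1 N2, A N2   (case and agreement checks omitted)
    if tags == ["N", "A", "N"]:
        return [(lemmas[0], lemmas[2]), (lemmas[1], lemmas[2])]
    # A1 "и" A2 N  =>  A1 N, A2 N
    if tags == ["A", "CONJ", "A", "N"] and lemmas[1] == "и":
        return [(lemmas[0], lemmas[3]), (lemmas[2], lemmas[3])]
    return []


register = split_combination(
    [("разрядность", "N"), ("внутренний", "A"), ("регистр", "N")]
)
mass = split_combination(
    [("гравитационный", "A"), ("и", "CONJ"), ("инертный", "A"), ("масса", "N")]
)
```

The first call yields the pairs for "capacity of register" and "internal register"; the second yields "gravitational mass" and "inertial mass", matching the two examples on the slide.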

TERM EXTRACTION PROCEDURES

●  Formalization => 6 groups of LSPL-patterns, according to types of extracted terms
●  For each group, an automatic term extraction procedure was developed
●  Each procedure was tested on texts in computer science and physics domains:
     ✓ sizes of the texts vary from 1500 to 4700 words (total volume ≈ 16000 words)
●  Dictionary terms in physics (> 3000) and in computer science (> 4000) were used

EVALUATION of THE PROCEDURES

Rates are given both for recognition of terms and for recognition of their occurrences

For example:
The geodetic effect represents the effect of the curvature of spacetime… The geodetic effect was first predicted by ...

Extracted term: geodetic effect + two recognized occurrences

Procedure and          Extraction of Terms       Recognition of their Occurrences
Type of Terms          Recall     Precision      Recall     Precision

Term candidates        57.7%      27.4%          59.6%      48.6%
Author’s terms         92.3%      95.9%          73.7%      77.9%
Synonyms of terms      64.0%      49.9%          –          –
Dictionary terms       94.0%      83.2%          89.2%      72.0%
Term combination       81.7%      24.7%          –          –
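For reference, recall and precision over sets of terms can be computed as below. The slides do not spell out how the rates were obtained, so the standard set-based definitions are assumed here, and the example sets are invented for illustration.

```python
from typing import Set, Tuple


def precision_recall(extracted: Set[str], gold: Set[str]) -> Tuple[float, float]:
    """Standard definitions: correct / extracted and correct / expected."""
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Invented example: one true term found, one non-term extracted, one term missed
extracted = {"geodetic effect", "important problem"}
gold = {"geodetic effect", "spacetime"}
precision, recall = precision_recall(extracted, gold)
```

The same function applies to occurrences if each occurrence is represented by a unique identifier, for example a (term, position) pair.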

STRATEGY FOR TERM EXTRACTION: KEY IDEAS

Analysis of incompleteness and inaccuracy of term extraction shows:
●  certain terms are not extracted because of their complex grammatical structure
●  some patterns of term definitions are ambiguous (adding them increases recall but decreases precision)
●  patterns of term combinations and term candidates fix only their grammatical structure, so many non-terms (e.g., important problem of astronomy) match the patterns
●  dictionary terms are not recognized when they are broken within term combinations

Linguistic features of terms are not mutually exclusive => the sets of terms extracted by the procedures intersect

So a strategy for combining the extracted sets by heuristic selection was worked out, in order to improve the quality of extraction

HEURISTIC STRATEGY: STEPS 1-3

● The final set S of terms is formed incrementally; initially S is empty
● In each step of the strategy some terms from the pre-extracted sets of terms are added to S

Step      Set selection and addition

          S := ∅
Step 1    S1 := AUTHOR’S TERMS + DICTIONARY TERMS that aren’t fragments of TERM CANDIDATES
Step 2    S2 := DICTIONARY TERMS that are constituents of CONJUNCTIONLESS TERM COMBINATIONS
                + CONJUNCTIONLESS TERM COMBINATIONS that include DICTIONARY TERMS as constituents
          S := S1 ∪ S2
Step 3    S3 := SYNONYMS of all terms that belong to the current S
          S := S ∪ S3

HEURISTIC STRATEGY: STEPS 4-8

Step      Set selection and addition

Step 4    S4 := DICTIONARY TERMS and TERM CANDIDATES if they are constituents of a TERM COMBINATION WITH CONJUNCTION
                that includes a term from S, a DICTIONARY TERM or a broken TERM CANDIDATE
          S := S ∪ S4
Step 5    S5 := DICTIONARY TERMS and TERM CANDIDATES if they are constituents of a CONJUNCTIONLESS TERM COMBINATION
                that includes a broken term from S, a broken DICTIONARY TERM or a broken TERM CANDIDATE
          If S3 ∪ S4 ∪ S5 ≠ ∅ then S := S ∪ S5; goto Step 3
Step 6    S6 := TERM VARIANTS of all terms that belong to the current S
Step 7    S7 := TERM CANDIDATES with frequency more than F
Step 8    S8 := DICTIONARY TERMS that are not yet in S
          After each step i = 6, 7, 8: if Si ≠ ∅ then S := S ∪ Si; goto Step 3
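The control flow of the steps can be sketched as a fixed-point loop. This is one possible reading, not the authors' code: each step is modelled as a hypothetical selector function that maps the current S to the terms it proposes, and the tests "≠ ∅" are interpreted as "adds at least one term not yet in S", an assumption that guarantees termination on finite pre-extracted sets.

```python
from typing import Callable, Sequence, Set

Selector = Callable[[Set[str]], Set[str]]  # current S -> proposed terms


def combine(s1: Set[str], s2: Set[str],
            inner: Sequence[Selector],    # stand-ins for Steps 3-5
            outer: Sequence[Selector]) -> Set[str]:  # stand-ins for Steps 6-8
    s = set(s1) | set(s2)                 # S := S1 ∪ S2
    while True:
        # Steps 3-5: repeat until a full pass adds nothing new
        changed = True
        while changed:
            changed = False
            for select in inner:
                new = select(s) - s
                if new:
                    s |= new
                    changed = True
        # Steps 6-8: the first step that adds terms sends us back to Step 3
        for select in outer:
            new = select(s) - s
            if new:
                s |= new
                break
        else:
            return s                      # nothing new in Steps 6-8: done


# Toy stand-ins for two of the steps (data invented for illustration)
synonyms = {"infrared": {"IR"}}
step3 = lambda s: {syn for term in s for syn in synonyms.get(term, set())}
step7 = lambda s: {"neutrino"}            # "frequent term candidates"
result = combine({"infrared"}, set(), [step3], [step7])
```

Passing the selectors as parameters keeps the loop independent of how each pre-extracted set is actually computed, which the slides leave to the individual extraction procedures.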

COMPARATIVE EVALUATION of HEURISTIC STRATEGY

  ✓ Collection of texts (≈ 33000 words) of different genres and sizes (on computer science and physics)
  ✓ Comparison with several methods commonly used for term extraction from text corpora:

Mutual-Inf    two-word term extraction based on statistics of word occurrences and co-occurrences
Mod-Mutual    modification of the Mutual-Inf method
SP            term extraction according to their grammatical structures
C-Value       term recognition by using frequencies of words and information about embedded terms

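EVALUATION of THE STRATEGY

● 17.6-point increase in F-measure for extraction of terms (61.8% vs. 44.2% for Mod-Mutual, the best of the compared methods)
● 11.7-point increase in F-measure for recognition of term occurrences (63.6% vs. 51.9% for Mod-Mutual)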

Methods        Extraction of Terms                   Recognition of their Occurrences
               Recall    Precision   F-measure       Recall    Precision   F-measure

Mutual-Inf     27.3%     13.0%       17.6%           24.4%     20.4%       22.2%
Mod-Mutual     54.1%     37.4%       44.2%           69.2%     41.5%       51.9%
SP             51.4%     22.6%       31.4%           37.3%     29.7%       33.1%
C-Value        35.5%     4.9%        8.6%            21.3%     5.9%        9.3%
Strategy       53.6%     73.1%       61.8%           68.1%     59.7%       63.6%
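The F-measure figures for extraction of terms can be re-derived from the recall and precision columns. The slides do not name the F-measure variant; the balanced form (harmonic mean of precision and recall) is assumed here because it reproduces the reported values for term extraction.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (harmonic mean); inputs and result in percent."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (recall, precision) for extraction of terms, copied from the table
rows = {
    "Mutual-Inf": (27.3, 13.0),
    "Mod-Mutual": (54.1, 37.4),
    "SP": (51.4, 22.6),
    "C-Value": (35.5, 4.9),
    "Strategy": (53.6, 73.1),
}
scores = {name: round(f_measure(p, r), 1) for name, (r, p) in rows.items()}

# Gain of the strategy over the best compared method, in percentage points
gain = round(scores["Strategy"] - scores["Mod-Mutual"], 1)
```

The resulting gain corresponds to the 17.6 figure quoted for extraction of terms, which indicates that the reported increase is measured against Mod-Mutual.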

CONCLUSIONS

●  We propose a heuristic strategy for term extraction based on various linguistic information, including
     ✓ grammatical structures of multiword scientific terms
     ✓ their text variants
     ✓ contexts of their usage
●  The information has been represented as a set of LSPL lexico-syntactic patterns
●  Experimental evaluation of our strategy shows an increase in F-measure in comparison with the commonly used methods of term extraction
●  Nevertheless, the strategy needs further verification on texts of various scientific domains and sizes

THANKS FOR YOUR ATTENTION!
