A HEURISTIC STRATEGY FOR EXTRACTING TERMS FROM SCIENTIFIC TEXTS

Elena I. Bolshakova, Natalia E. Efremova

Lomonosov Moscow State University,
National Research University Higher School of Economics
Moscow, Russia
  
CONTENTS
● Approaches to term extraction
● Term extraction from scientific texts
● Types and patterns of extracted terms
● Term extraction procedures
● Steps of the heuristic term extraction strategy
● Comparative evaluation of the strategy
● Conclusions
  
TERMS and TERM EXTRACTION
Terms are words or multiword units that refer to concepts of specific domains, for example: nonlinear plan, coefficient adjustment learning.

Term recognition techniques:
● statistical and linguistic criteria
● shallow syntactic analysis

Applications of automatic term extraction:
● compiling terminology dictionaries
● constructing thesauri and ontologies
● text abstracting and summarization
● computer-aided writing and editing of specialized texts
● construction of glossaries and subject indexes
  
APPROACHES to TERM EXTRACTION
Corpus-based terminology extraction:
● large text collections and corpora are processed
● statistical criteria of term recognition are exploited (such as the tf.idf measure and its numerous modifications)
● little linguistic information is used (such as the part of speech of words)

Term recognition in a single text:
● small and medium-sized texts are processed
● statistical measures become less significant, and contrast corpora are not always available
● more comprehensive linguistic information is required for reliable term extraction
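As a rough illustration of the statistical criteria mentioned above, here is a minimal tf.idf sketch. The slides do not say which tf.idf variant is meant, so the formula (relative term frequency times log(N / df)), the function name `tf_idf` and the toy documents are assumptions made for this example only.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: score} dict per document.

    Assumed variant: tf = relative frequency in the document,
    idf = log(number of documents / number of documents containing the word).
    """
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # count each word once per document
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            w: (c / len(doc)) * math.log(n_docs / df[w])
            for w, c in counts.items()
        })
    return scores

# Toy "corpus" of three tokenized documents (illustrative only).
docs = [
    ["sterile", "neutrino", "mass", "neutrino"],
    ["return", "address", "register"],
    ["inertial", "mass", "register"],
]
scores = tf_idf(docs)

# With a single text and no contrast corpus, idf is log(1/1) = 0 for every word.
single = tf_idf([docs[0]])[0]
```

In this variant every score collapses to zero when only one text is available, which is one way to see why statistical measures lose significance in the single-text setting.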
  
TERM EXTRACTION from SCIENTIFIC TEXTS
Scientific texts are characterized by intensive use of terms.

Our main goal: to improve the quality of automatic term extraction
● from a particular scientific text
● by exploiting various linguistic information about terms and their occurrences in texts

Applications of term recognition in a single text:
● creation of glossaries and subject indexes
● checks of term consistency and accuracy

Scientific texts in Russian are processed in our work.
  
STAGES of OUR RESEARCH
Our work included:
● Empirical study of scientific texts and terminological dictionaries in Russian (on computer science and physics)
● Formalization of linguistic features of multiword terms and their occurrences in texts, with LSPL (Lexico-Syntactic Pattern Language) used as a tool:
  ● typical term structures
  ● terminological contexts
  ● text variants of terms
● Specification of types of extracted terms on the basis of the linguistic information used for term recognition
● Development of extraction procedures for each term type
● Testing the procedures and working out a strategy for combining the sets of terms extracted by them
  
TYPES and LSPL-PATTERNS of EXTRACTED TERMS (1)
In the patterns, A and N stand for an adjective and a noun, <A=N> requires their grammatical agreement, and c is the grammatical case (nom, gen).
● Term candidates have specified grammatical structures
  Example: стерильный нейтрино (sterile neutrinos)
  LSPL-pattern: A N <A=N>
● Author’s terms appear in contexts of definitions
  Example: Вероятность есть степень возможности… (Probability is the measure of the likeliness…)
  LSPL-pattern: Term<c=nom> "есть" Defin<c=nom> => Term
● Term synonyms
  Example: инфракрасный (ИК), i.e. infrared (IR)
  LSPL-pattern: Term1 "(" Term2 ")" <Term1.c=Term2.c> => Term1, Term2
● Dictionary terms from a terminological dictionary
  Example: адрес, адрес возврата (address, return address)
  LSPL-pattern: N1<адрес> [N2<возврат,c=gen>]
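To make the first pattern concrete, the sketch below emulates `A N <A=N>` in plain Python over tokens that are assumed to be already morphologically analyzed. It is not LSPL and does not use any LSPL tooling; the `Token` class, its fields and the sample analysis are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    """Hypothetical output of a morphological analyzer."""
    lemma: str
    pos: str      # part-of-speech tag, e.g. "A" (adjective), "N" (noun), "V" (verb)
    case: str     # e.g. "nom", "gen", "acc"
    gender: str
    number: str

def agrees(adj, noun):
    """Agreement is approximated as equal case, gender and number."""
    return (adj.case, adj.gender, adj.number) == (noun.case, noun.gender, noun.number)

def match_a_n(tokens):
    """Emulate A N <A=N>: an adjective directly followed by an agreeing noun."""
    found = []
    for left, right in zip(tokens, tokens[1:]):
        if left.pos == "A" and right.pos == "N" and agrees(left, right):
            found.append(f"{left.lemma} {right.lemma}")
    return found

# Hand-made analysis of a short phrase (illustrative only).
tokens = [
    Token("стерильный", "A", "nom", "n", "sg"),
    Token("нейтрино", "N", "nom", "n", "sg"),
    Token("иметь", "V", "", "", ""),
    Token("малый", "A", "acc", "f", "sg"),
    Token("масса", "N", "acc", "f", "sg"),
]
candidates = match_a_n(tokens)
```

Both adjective-noun pairs match, which illustrates why purely structural patterns yield many candidates that still need to be filtered.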
  
TYPES and LSPL-PATTERNS of EXTRACTED TERMS (2)
● Combinations of several multiword terms
  LSPL-pattern: N1 A N2<c=gen> <A=N2> => N1 N2<c=gen>, A N2 <A=N2>
  Example: разрядность внутреннего регистра = разрядность регистра + внутренний регистр
  (capacity of internal register = capacity of register + internal register)
  LSPL-pattern: A1 "и" A2 N <A1=A2=N> => A1 N <A1=N>, A2 N <A2=N>
  Example: гравитационная и инертная масса = гравитационная масса + инертная масса
  (gravitational and inertial mass = gravitational mass + inertial mass)
● Text variants of a single term
  Example: фрейм активации → фрейм, запись активации
  (activation frame → frame, activation record)
  LSPL-pattern: A1 N <A1=N> => N, A2 N <A2=N> <Syn(A1,A2)>
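The two decompositions of term combinations can be sketched as follows. This is a simplified illustration, not the authors' procedure: tokens are hypothetical (word form, part-of-speech) pairs, the agreement and case conditions of the patterns are omitted, and the extracted parts are not normalized to the nominative.

```python
def split_conjunction(tokens):
    """A1 "и" A2 N  =>  A1 N, A2 N  (agreement checks omitted in this sketch)."""
    if len(tokens) == 4:
        (a1, p1), (conj, _), (a2, p2), (n, p3) = tokens
        if (p1, conj, p2, p3) == ("A", "и", "A", "N"):
            return [f"{a1} {n}", f"{a2} {n}"]
    return []

def split_nested(tokens):
    """N1 A N2  =>  N1 N2, A N2  (case and agreement checks omitted)."""
    if len(tokens) == 3 and [pos for _, pos in tokens] == ["N", "A", "N"]:
        (n1, _), (a, _), (n2, _) = tokens
        return [f"{n1} {n2}", f"{a} {n2}"]
    return []

# "gravitational and inertial mass"
masses = split_conjunction([
    ("гравитационная", "A"), ("и", "CONJ"), ("инертная", "A"), ("масса", "N"),
])

# "capacity of internal register"
registers = split_nested([
    ("разрядность", "N"), ("внутреннего", "A"), ("регистра", "N"),
])
```

The second result keeps the genitive form "внутреннего регистра"; producing the dictionary form "внутренний регистр" shown on the slide would require a morphological normalization step that is left out here.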
  	
  
TERM EXTRACTION PROCEDURES
● Formalization resulted in 6 groups of LSPL-patterns, according to the types of extracted terms
● For each group, an automatic term extraction procedure was developed
● Each procedure was tested on texts in the computer science and physics domains:
  ● sizes of the texts vary from 1500 to 4700 words (total volume ≈ 16000 words)
● Dictionary terms in physics (> 3000) and in computer science (> 4000) were used
  
EVALUATION of THE PROCEDURES
| Procedure and type of terms | Extraction of terms: Recall | Extraction of terms: Precision | Recognition of occurrences: Recall | Recognition of occurrences: Precision |
|---|---|---|---|---|
| Term candidates | 57.7% | 27.4% | 59.6% | 48.6% |
| Author’s terms | 92.3% | 95.9% | 73.7% | 77.9% |
| Synonyms of terms | 64.0% | 49.9% | – | – |
| Dictionary terms | 94.0% | 83.2% | 89.2% | 72.0% |
| Term combinations | 81.7% | 24.7% | – | – |

Rates are given both for the recognition of terms and for the recognition of their occurrences. For example:
The geodetic effect represents the effect of the curvature of spacetime… The geodetic effect was first predicted by ...
Extracted term: geodetic effect, plus two recognized occurrences.
  
STRATEGY FOR TERM EXTRACTION: KEY IDEAS
Analysis of the incompleteness and inaccuracy of term extraction shows:
● certain terms are not extracted because of their complex grammatical structure
● some patterns of term definitions are ambiguous (adding them increases recall but decreases precision)
● patterns of term combinations and term candidates fix only their grammatical structure, so many non-terms (e.g., important problem of astronomy) match the patterns
● dictionary terms are not recognized when they are broken within term combinations

Linguistic features of terms are not mutually exclusive => the sets of terms extracted by the procedures overlap.

So a strategy for combining the extracted sets by heuristic selection was worked out, in order to improve the quality of extraction.
  
HEURISTIC STRATEGY: STEPS 1-3
● The final set S of terms is formed incrementally; initially S is empty
● In each step of the strategy, some terms from the pre-extracted sets of terms are added to S

| Step | Set | Selection and addition |
|---|---|---|
| | | S := ∅ |
| Step 1 | S1 := | AUTHOR’S TERMS + DICTIONARY TERMS that are not fragments of TERM CANDIDATES |
| Step 2 | S2 := | DICTIONARY TERMS that are constituents of CONJUNCTIONLESS TERM COMBINATIONS + CONJUNCTIONLESS TERM COMBINATIONS that include DICTIONARY TERMS as constituents |
| | | S := S1 ∪ S2 |
| Step 3 | S3 := | SYNONYMS of all terms that belong to the current S |
| | | S := S ∪ S3 |

HEURISTIC STRATEGY: STEPS 4-8
| Step | Set | Selection and addition |
|---|---|---|
| Step 4 | S4 := | DICTIONARY TERMS and TERM CANDIDATES if they are constituents of a TERM COMBINATION WITH CONJUNCTION that includes a term from S, a DICTIONARY TERM or a broken TERM CANDIDATE |
| | | S := S ∪ S4 |
| Step 5 | S5 := | DICTIONARY TERMS and TERM CANDIDATES if they are constituents of a CONJUNCTIONLESS TERM COMBINATION that includes a broken term from S, a broken DICTIONARY TERM or a broken TERM CANDIDATE |
| | | If S3 ∪ S4 ∪ S5 ≠ ∅ then S := S ∪ S5; goto Step 3 |
| Step 6 | S6 := | TERM VARIANTS of all terms that belong to the current S |
| Step 7 | S7 := | TERM CANDIDATES with frequency more than F |
| Step 8 | S8 := | DICTIONARY TERMS that are not yet in S |

After each step i = 6, 7, 8: if Si ≠ ∅ then S := S ∪ Si; goto Step 3
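The control flow of the eight steps can be sketched as a fixed-point loop over sets. This is a heavily simplified reading rather than the authors' implementation: "fragment" is approximated by a substring test, steps 4 and 5 are merged without distinguishing combinations with and without conjunctions or "broken" terms, and "Si ≠ ∅" is interpreted as "Si contains terms not yet in S" so that the loop terminates. All parameter names and input shapes are hypothetical.

```python
def combine(author, dictionary, candidates, combos, synonyms, variants, freq, min_freq):
    """Simplified fixed-point version of the heuristic strategy.

    combos:             {combination: set of its constituent terms}
    synonyms, variants: {term: set of related terms}
    freq:               {candidate: number of occurrences}; min_freq plays the role of F
    """
    def is_fragment(term):
        # Assumption: a fragment is a proper substring of some term candidate.
        return any(term != c and term in c for c in candidates)

    # Step 1: author's terms + dictionary terms that are not fragments of candidates.
    s = set(author) | {t for t in dictionary if not is_fragment(t)}
    # Step 2: combinations that include dictionary terms, plus those dictionary terms.
    for combo, parts in combos.items():
        if parts & dictionary:
            s |= {combo} | (parts & dictionary)

    def grow(new):
        """Add unseen terms to S and report whether S changed."""
        fresh = set(new) - s
        s.update(fresh)
        return bool(fresh)

    while True:
        changed = False
        # Step 3: synonyms of terms already in S.
        changed |= grow(x for t in list(s) for x in synonyms.get(t, ()))
        # Steps 4-5 (merged): constituents of combinations that include a term from S.
        changed |= grow(p for parts in combos.values() if parts & s
                        for p in parts if p in dictionary or p in candidates)
        # Step 6: text variants of terms already in S.
        changed |= grow(x for t in list(s) for x in variants.get(t, ()))
        # Step 7: sufficiently frequent term candidates.
        changed |= grow(c for c in candidates if freq.get(c, 0) > min_freq)
        # Step 8: remaining dictionary terms.
        changed |= grow(dictionary)
        if not changed:
            return s

terms = combine(
    author={"probability", "activation frame"},
    dictionary={"internal register", "register", "infrared"},
    candidates={"capacity of internal register", "capacity of register",
                "sterile neutrino", "important problem"},
    combos={"capacity of internal register": {"capacity of register", "internal register"}},
    synonyms={"infrared": {"IR"}},
    variants={"activation frame": {"activation record"}},
    freq={"sterile neutrino": 3, "important problem": 1},
    min_freq=2,
)
```

In this toy run the rare candidate "important problem" never enters S, while "capacity of register" is admitted because it is a constituent of a combination that contains a dictionary term.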
  	
  
COMPARATIVE EVALUATION of HEURISTIC STRATEGY
● Collection of texts (≈ 33000 words) of different genres and sizes (on computer science and physics)
● Comparison with several methods commonly used for term extraction from text corpora:

| Method | Description |
|---|---|
| Mutual-Inf | extraction of two-word terms based on statistics of word occurrences and co-occurrences |
| Mod-Mutual | modification of the Mutual-Inf method |
| SP | extraction of terms according to their grammatical structures |
| C-Value | term recognition using frequencies of words and information about embedded terms |
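For orientation, a mutual-information score for two-word candidates can be sketched as pointwise mutual information (PMI) over bigrams. The slides do not give the exact formula used by the Mutual-Inf baseline, so this formulation, the function name `pmi_scores` and the sample sentence are assumptions.

```python
import math
from collections import Counter

def pmi_scores(tokens):
    """PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ) for adjacent word pairs."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = len(tokens) - 1
    return {
        (x, y): math.log2((c / n_bi) / ((unigrams[x] / n_uni) * (unigrams[y] / n_uni)))
        for (x, y), c in bigrams.items()
    }

tokens = "the geodetic effect is the effect of the curvature of spacetime".split()
scores = pmi_scores(tokens)
```

Here the pair ("geodetic", "effect") scores higher than ("the", "effect"), but on a text this short any pair of rare words scores equally well, which is consistent with the remark that purely statistical measures become less significant for single texts.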
  	
  
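EVALUATION of THE STRATEGY
| Method | Extraction of terms: Recall | Extraction of terms: Precision | Extraction of terms: F-measure | Recognition of occurrences: Recall | Recognition of occurrences: Precision | Recognition of occurrences: F-measure |
|---|---|---|---|---|---|---|
| Mutual-Inf | 27.3% | 13.0% | 17.6% | 24.4% | 20.4% | 22.2% |
| Mod-Mutual | 54.1% | 37.4% | 44.2% | 69.2% | 41.5% | 51.9% |
| SP | 51.4% | 22.6% | 31.4% | 37.3% | 29.7% | 33.1% |
| C-Value | 35.5% | 4.9% | 8.6% | 21.3% | 5.9% | 9.3% |
| Strategy | 53.6% | 73.1% | 61.8% | 68.1% | 59.7% | 63.6% |

● F-measure for extraction of terms increases by 17.6 percentage points over the best baseline (Mod-Mutual: 44.2% → 61.8%)
● F-measure for recognition of term occurrences increases by 11.7 percentage points over the best baseline (Mod-Mutual: 51.9% → 63.6%)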
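The F-measure values for term extraction are consistent with the balanced F-measure, i.e. the harmonic mean of recall and precision. The slides do not state the formula, so this is an assumption; the sketch below recomputes two rows of the table and the reported gain.

```python
def f_measure(recall, precision):
    """Balanced F-measure: harmonic mean of recall and precision (both in %)."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

# Extraction of terms, values taken from the table above.
strategy = f_measure(53.6, 73.1)
mod_mutual = f_measure(54.1, 37.4)
gain = strategy - mod_mutual   # difference in percentage points
```

Rounded to one decimal place, these give 61.8, 44.2 and 17.6, matching the table and the reported increase for term extraction.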
  
CONCLUSIONS
● We propose a heuristic strategy for term extraction based on various linguistic information, including
  ● grammatical structures of multiword scientific terms
  ● their text variants
  ● contexts of their usage
● The information has been represented as a set of LSPL lexico-syntactic patterns
● Experimental evaluation of our strategy shows an increase of F-measure in comparison with the commonly used methods of term extraction
● Nevertheless, the strategy needs further verification on texts of various scientific domains and sizes
  
 	
  
	
  
THANKS FOR YOUR ATTENTION!
  

More Related Content

Viewers also liked

Иосиф Иткин, Exactpro - TBA
Иосиф Иткин, Exactpro - TBAИосиф Иткин, Exactpro - TBA
Иосиф Иткин, Exactpro - TBA
AIST
 
Dmitriy Ignatov - AIST'2014 Opening
Dmitriy Ignatov - AIST'2014 OpeningDmitriy Ignatov - AIST'2014 Opening
Dmitriy Ignatov - AIST'2014 Opening
AIST
 
Aist exactpro
Aist exactproAist exactpro
Aist exactpro
AIST
 
Елена Захаренко и Евгений Альтман - Быстрый алгоритм оценки движения полным п...
Елена Захаренко и Евгений Альтман - Быстрый алгоритм оценки движения полным п...Елена Захаренко и Евгений Альтман - Быстрый алгоритм оценки движения полным п...
Елена Захаренко и Евгений Альтман - Быстрый алгоритм оценки движения полным п...
AIST
 

Viewers also liked (20)

Benjamin Lind - Organizations, State Interactions, and Field Stability: A Ne...
Benjamin Lind - Organizations, State Interactions, and Field Stability:  A Ne...Benjamin Lind - Organizations, State Interactions, and Field Stability:  A Ne...
Benjamin Lind - Organizations, State Interactions, and Field Stability: A Ne...
 
Valeri Labunets - The bichromatic excitable Schrodinger metamedium
Valeri Labunets - The bichromatic excitable Schrodinger metamediumValeri Labunets - The bichromatic excitable Schrodinger metamedium
Valeri Labunets - The bichromatic excitable Schrodinger metamedium
 
Ilya Trofimov - Distributed Coordinate Descent for L1-regularized Logistic Re...
Ilya Trofimov - Distributed Coordinate Descent for L1-regularized Logistic Re...Ilya Trofimov - Distributed Coordinate Descent for L1-regularized Logistic Re...
Ilya Trofimov - Distributed Coordinate Descent for L1-regularized Logistic Re...
 
Valeri Labunets - Fast multiparametric wavelet transforms and packets for ima...
Valeri Labunets - Fast multiparametric wavelet transforms and packets for ima...Valeri Labunets - Fast multiparametric wavelet transforms and packets for ima...
Valeri Labunets - Fast multiparametric wavelet transforms and packets for ima...
 
Vladimir Milov and Andrey Savchenko - Classification of Dangerous Situations...
Vladimir Milov and  Andrey Savchenko - Classification of Dangerous Situations...Vladimir Milov and  Andrey Savchenko - Classification of Dangerous Situations...
Vladimir Milov and Andrey Savchenko - Classification of Dangerous Situations...
 
Verichev Fedoseev - Robust Image Watermarking on Triangle Grid of Feature Points
Verichev Fedoseev - Robust Image Watermarking on Triangle Grid of Feature PointsVerichev Fedoseev - Robust Image Watermarking on Triangle Grid of Feature Points
Verichev Fedoseev - Robust Image Watermarking on Triangle Grid of Feature Points
 
E. Ostheimer, V. G. Labunets, D. E. Komarov, T. S. Fedorova and V. V. Ganzha ...
E. Ostheimer, V. G. Labunets, D. E. Komarov, T. S. Fedorova and V. V. Ganzha ...E. Ostheimer, V. G. Labunets, D. E. Komarov, T. S. Fedorova and V. V. Ganzha ...
E. Ostheimer, V. G. Labunets, D. E. Komarov, T. S. Fedorova and V. V. Ganzha ...
 
Vladimir Surin and Alexander Tyrsin - Research of properties of digital nois...
Vladimir Surin and  Alexander Tyrsin - Research of properties of digital nois...Vladimir Surin and  Alexander Tyrsin - Research of properties of digital nois...
Vladimir Surin and Alexander Tyrsin - Research of properties of digital nois...
 
Artyom Makovetskii - An Efficient Algorithm for Total Variation Denoising
Artyom Makovetskii - An Efficient Algorithm for Total Variation DenoisingArtyom Makovetskii - An Efficient Algorithm for Total Variation Denoising
Artyom Makovetskii - An Efficient Algorithm for Total Variation Denoising
 
Sergey Zaika and Andrew Toporkov - Semantic Web on Duty of E- Learning: Ontol...
Sergey Zaika and Andrew Toporkov - Semantic Web on Duty of E- Learning: Ontol...Sergey Zaika and Andrew Toporkov - Semantic Web on Duty of E- Learning: Ontol...
Sergey Zaika and Andrew Toporkov - Semantic Web on Duty of E- Learning: Ontol...
 
Dmitrii Stepanov, Aleksandr Bakhshiev, D.Gromoshinsky, N.Kirpan F.Gundelakh -...
Dmitrii Stepanov, Aleksandr Bakhshiev, D.Gromoshinsky, N.Kirpan F.Gundelakh -...Dmitrii Stepanov, Aleksandr Bakhshiev, D.Gromoshinsky, N.Kirpan F.Gundelakh -...
Dmitrii Stepanov, Aleksandr Bakhshiev, D.Gromoshinsky, N.Kirpan F.Gundelakh -...
 
Иосиф Иткин, Exactpro - TBA
Иосиф Иткин, Exactpro - TBAИосиф Иткин, Exactpro - TBA
Иосиф Иткин, Exactpro - TBA
 
Andrew Smirnov and Valentin Mendelev - Applying Word Embeddings to Leverage ...
Andrew Smirnov and  Valentin Mendelev - Applying Word Embeddings to Leverage ...Andrew Smirnov and  Valentin Mendelev - Applying Word Embeddings to Leverage ...
Andrew Smirnov and Valentin Mendelev - Applying Word Embeddings to Leverage ...
 
V. G. Labunets, F. S. Myasnikov, E. Ostheimer - Families of Heron Digital Fil...
V. G. Labunets, F. S. Myasnikov, E. Ostheimer - Families of Heron Digital Fil...V. G. Labunets, F. S. Myasnikov, E. Ostheimer - Families of Heron Digital Fil...
V. G. Labunets, F. S. Myasnikov, E. Ostheimer - Families of Heron Digital Fil...
 
Dmitriy Ignatov - AIST'2014 Opening
Dmitriy Ignatov - AIST'2014 OpeningDmitriy Ignatov - AIST'2014 Opening
Dmitriy Ignatov - AIST'2014 Opening
 
Anton Agafonov and Vladislav Myasnikov - An algorithm for traffic flow parame...
Anton Agafonov and Vladislav Myasnikov - An algorithm for traffic flow parame...Anton Agafonov and Vladislav Myasnikov - An algorithm for traffic flow parame...
Anton Agafonov and Vladislav Myasnikov - An algorithm for traffic flow parame...
 
Aist exactpro
Aist exactproAist exactpro
Aist exactpro
 
Елена Захаренко и Евгений Альтман - Быстрый алгоритм оценки движения полным п...
Елена Захаренко и Евгений Альтман - Быстрый алгоритм оценки движения полным п...Елена Захаренко и Евгений Альтман - Быстрый алгоритм оценки движения полным п...
Елена Захаренко и Евгений Альтман - Быстрый алгоритм оценки движения полным п...
 
Artem Lukanin - Normalization of Non-Standard Words with Finite State Transd...
Artem Lukanin - Normalization of Non-Standard Words  with Finite State Transd...Artem Lukanin - Normalization of Non-Standard Words  with Finite State Transd...
Artem Lukanin - Normalization of Non-Standard Words with Finite State Transd...
 
Aleksey Demidov - Evolving ontologies in the aspect of handling temporal or c...
Aleksey Demidov - Evolving ontologies in the aspect of handling temporal or c...Aleksey Demidov - Evolving ontologies in the aspect of handling temporal or c...
Aleksey Demidov - Evolving ontologies in the aspect of handling temporal or c...
 

Similar to Elena Bolshakova and Natalia Efremova - A Heuristic Strategy for Extracting Terms from Scientific Texts

Unit-1 PPL PPTvvhvmmmmmmmmmmmmmmmmmmmmmm
Unit-1 PPL PPTvvhvmmmmmmmmmmmmmmmmmmmmmmUnit-1 PPL PPTvvhvmmmmmmmmmmmmmmmmmmmmmm
Unit-1 PPL PPTvvhvmmmmmmmmmmmmmmmmmmmmmm
DhruvKushwaha12
 
Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...
Lifeng (Aaron) Han
 
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
Lifeng (Aaron) Han
 
Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...
Lifeng (Aaron) Han
 

Similar to Elena Bolshakova and Natalia Efremova - A Heuristic Strategy for Extracting Terms from Scientific Texts (20)

Text Processing Framework for Hindi
Text Processing Framework for HindiText Processing Framework for Hindi
Text Processing Framework for Hindi
 
Reference Scope Identification in Citing Sentences
Reference Scope Identification in Citing SentencesReference Scope Identification in Citing Sentences
Reference Scope Identification in Citing Sentences
 
Ontology matching
Ontology matchingOntology matching
Ontology matching
 
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRFEnd-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
 
Unit-1 PPL PPTvvhvmmmmmmmmmmmmmmmmmmmmmm
Unit-1 PPL PPTvvhvmmmmmmmmmmmmmmmmmmmmmmUnit-1 PPL PPTvvhvmmmmmmmmmmmmmmmmmmmmmm
Unit-1 PPL PPTvvhvmmmmmmmmmmmmmmmmmmmmmm
 
Natural Language Generation / Stanford cs224n 2019w lecture 15 Review
Natural Language Generation / Stanford cs224n 2019w lecture 15 ReviewNatural Language Generation / Stanford cs224n 2019w lecture 15 Review
Natural Language Generation / Stanford cs224n 2019w lecture 15 Review
 
Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...
 
Domain-Specific Term Extraction for Concept Identification in Ontology Constr...
Domain-Specific Term Extraction for Concept Identification in Ontology Constr...Domain-Specific Term Extraction for Concept Identification in Ontology Constr...
Domain-Specific Term Extraction for Concept Identification in Ontology Constr...
 
Bilingual terminology mining
Bilingual terminology miningBilingual terminology mining
Bilingual terminology mining
 
Extractive Document Summarization - An Unsupervised Approach
Extractive Document Summarization - An Unsupervised ApproachExtractive Document Summarization - An Unsupervised Approach
Extractive Document Summarization - An Unsupervised Approach
 
AINL 2016: Eyecioglu
AINL 2016: EyeciogluAINL 2016: Eyecioglu
AINL 2016: Eyecioglu
 
Formalising the Swedish Constructicon in Grammatical Framework
Formalising the Swedish Constructicon in Grammatical FrameworkFormalising the Swedish Constructicon in Grammatical Framework
Formalising the Swedish Constructicon in Grammatical Framework
 
Normunds Gruzitis - 2015 - Formalising the Swedish Constructicon in Grammatic...
Normunds Gruzitis - 2015 - Formalising the Swedish Constructicon in Grammatic...Normunds Gruzitis - 2015 - Formalising the Swedish Constructicon in Grammatic...
Normunds Gruzitis - 2015 - Formalising the Swedish Constructicon in Grammatic...
 
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
 
Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...
 
LiCord: Language Independent Content Word Finder
LiCord: Language Independent Content Word FinderLiCord: Language Independent Content Word Finder
LiCord: Language Independent Content Word Finder
 
Taxonomy extraction from automotive natural language requirements using unsup...
Taxonomy extraction from automotive natural language requirements using unsup...Taxonomy extraction from automotive natural language requirements using unsup...
Taxonomy extraction from automotive natural language requirements using unsup...
 
1 l5eng
1 l5eng1 l5eng
1 l5eng
 
Class Diagram Extraction from Textual Requirements Using NLP Techniques
Class Diagram Extraction from Textual Requirements Using NLP TechniquesClass Diagram Extraction from Textual Requirements Using NLP Techniques
Class Diagram Extraction from Textual Requirements Using NLP Techniques
 
D017232729
D017232729D017232729
D017232729
 

More from AIST

Алена Ильина и Иван Бибилов, GoTo - GoTo школы, конкурсы и хакатоны
Алена Ильина и Иван Бибилов, GoTo - GoTo школы, конкурсы и хакатоныАлена Ильина и Иван Бибилов, GoTo - GoTo школы, конкурсы и хакатоны
Алена Ильина и Иван Бибилов, GoTo - GoTo школы, конкурсы и хакатоны
AIST
 

More from AIST (20)

Alexey Mikhaylichenko - Automatic Detection of Bone Contours in X-Ray Images
Alexey Mikhaylichenko - Automatic Detection of Bone Contours in X-Ray  ImagesAlexey Mikhaylichenko - Automatic Detection of Bone Contours in X-Ray  Images
Alexey Mikhaylichenko - Automatic Detection of Bone Contours in X-Ray Images
 
Алена Ильина и Иван Бибилов, GoTo - GoTo школы, конкурсы и хакатоны
Алена Ильина и Иван Бибилов, GoTo - GoTo школы, конкурсы и хакатоныАлена Ильина и Иван Бибилов, GoTo - GoTo школы, конкурсы и хакатоны
Алена Ильина и Иван Бибилов, GoTo - GoTo школы, конкурсы и хакатоны
 
Станислав Кралин, Сайтсофт - Связанные открытые данные федеральных органов ис...
Станислав Кралин, Сайтсофт - Связанные открытые данные федеральных органов ис...Станислав Кралин, Сайтсофт - Связанные открытые данные федеральных органов ис...
Станислав Кралин, Сайтсофт - Связанные открытые данные федеральных органов ис...
 
Павел Браславский,Velpas - Velpas: мобильный визуальный поиск
Павел Браславский,Velpas - Velpas: мобильный визуальный поискПавел Браславский,Velpas - Velpas: мобильный визуальный поиск
Павел Браславский,Velpas - Velpas: мобильный визуальный поиск
 
Евгений Цымбалов, Webgames - Методы машинного обучения для задач игровой анал...
Евгений Цымбалов, Webgames - Методы машинного обучения для задач игровой анал...Евгений Цымбалов, Webgames - Методы машинного обучения для задач игровой анал...
Евгений Цымбалов, Webgames - Методы машинного обучения для задач игровой анал...
 
Александр Москвичев, EveResearch - Алгоритмы анализа данных в маркетинговых и...
Александр Москвичев, EveResearch - Алгоритмы анализа данных в маркетинговых и...Александр Москвичев, EveResearch - Алгоритмы анализа данных в маркетинговых и...
Александр Москвичев, EveResearch - Алгоритмы анализа данных в маркетинговых и...
 
Петр Ермаков, HeadHunter - Модерация резюме: от людей к роботам. Машинное обу...
Петр Ермаков, HeadHunter - Модерация резюме: от людей к роботам. Машинное обу...Петр Ермаков, HeadHunter - Модерация резюме: от людей к роботам. Машинное обу...
Петр Ермаков, HeadHunter - Модерация резюме: от людей к роботам. Машинное обу...
 
George Moiseev - Classification of E-commerce Websites by Product Categories
George Moiseev - Classification of E-commerce Websites by Product CategoriesGeorge Moiseev - Classification of E-commerce Websites by Product Categories
George Moiseev - Classification of E-commerce Websites by Product Categories
 
Elena Bruches - The Hybrid Approach to Part-of-Speech Disambiguation
Elena Bruches - The Hybrid Approach to Part-of-Speech DisambiguationElena Bruches - The Hybrid Approach to Part-of-Speech Disambiguation
Elena Bruches - The Hybrid Approach to Part-of-Speech Disambiguation
 
Marina Danshina - The methodology of automated decryption of znamenny chants
Marina Danshina - The methodology of automated decryption of znamenny chantsMarina Danshina - The methodology of automated decryption of znamenny chants
Marina Danshina - The methodology of automated decryption of znamenny chants
 
Edward Klyshinsky - The Corpus of Syntactic Co-occurences: the First Glance
Edward Klyshinsky - The Corpus of Syntactic Co-occurences: the First GlanceEdward Klyshinsky - The Corpus of Syntactic Co-occurences: the First Glance
Edward Klyshinsky - The Corpus of Syntactic Co-occurences: the First Glance
 
Galina Lavrentyeva - Anti-spoofing Methods for Automatic Speaker Verification...
Galina Lavrentyeva - Anti-spoofing Methods for Automatic Speaker Verification...Galina Lavrentyeva - Anti-spoofing Methods for Automatic Speaker Verification...
Galina Lavrentyeva - Anti-spoofing Methods for Automatic Speaker Verification...
 
Oleksandr Frei and Murat Apishev - Parallel Non-blocking Deterministic Algori...
Oleksandr Frei and Murat Apishev - Parallel Non-blocking Deterministic Algori...Oleksandr Frei and Murat Apishev - Parallel Non-blocking Deterministic Algori...

Elena Bolshakova and Natalia Efremova - A Heuristic Strategy for Extracting Terms from Scientific Texts

  • 1. A HEURISTIC STRATEGY FOR EXTRACTING TERMS FROM SCIENTIFIC TEXTS
    Elena I. Bolshakova, Natalia E. Efremova
    Lomonosov Moscow State University,
    National Research University Higher School of Economics
    Moscow, Russia
  • 2. CONTENTS
    ● Approaches to term extraction
    ● Term extraction from scientific texts
    ● Types and patterns of extracted terms
    ● Term extraction procedures
    ● Steps of heuristic term extraction strategy
    ● Comparative evaluation of the strategy
    ● Conclusions
  • 3. TERMS and TERM EXTRACTION
    Terms are words or multiword units that refer to concepts of specific domains
        nonlinear plan, coefficient adjustment learning
    Term recognition techniques:
        ✓ statistical and linguistic criteria
        ✓ shallow syntactic analysis
    Applications of automatic term extraction:
        ✓ compiling terminology dictionaries
        ✓ constructing thesauri and ontologies
        ✓ text abstracting and summarization
        ✓ computer-aided writing and editing of specialized texts
        ✓ construction of glossaries and subject indexes
  • 4. APPROACHES to TERM EXTRACTION
    Corpus-based terminology extraction:
        ● large text collections and corpora are processed
        ● statistical criteria of term recognition are exploited
          (like the tf-idf measure and its numerous modifications)
        ● only limited linguistic information is used
          (such as the part of speech of words)
    Term recognition in a single text:
        ● small and medium-sized texts are processed
        ● statistical measures become less significant,
          contrast corpora are not always available
        ● more comprehensive linguistic information is required
          for reliable term extraction
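To make the statistical criterion concrete, here is a minimal sketch of one common tf-idf variant. The slides do not say which variant or modification is meant, so the formula, the function name `tf_idf`, and the toy corpus are assumptions for illustration only.

```python
import math
from collections import Counter

def tf_idf(doc_tokens, corpus):
    """Score each word of one document against a list of tokenized documents.

    Assumed variant: relative term frequency times log(N / document frequency).
    """
    tf = Counter(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for word, count in tf.items():
        # Number of corpus documents that contain the word at least once.
        df = sum(1 for doc in corpus if word in doc)
        idf = math.log(n_docs / df) if df else 0.0
        scores[word] = (count / len(doc_tokens)) * idf
    return scores

# Hypothetical toy corpus: "the" occurs everywhere, "neutrino" in one text only.
corpus = [
    ["the", "sterile", "neutrino", "the", "neutrino"],
    ["the", "return", "address"],
    ["the", "internal", "register"],
]
scores = tf_idf(corpus[0], corpus)
```

A word that occurs in every document gets a score of zero, which is why such measures lose force when only a single text is available and there is no contrast corpus.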
  • 5. TERM EXTRACTION from SCIENTIFIC TEXTS
    Scientific texts: intensive use of terms
    ◆ Our main goal: to improve the quality of automatic term extraction
        ➢ from a particular scientific text
        ➢ by exploiting various linguistic information about
          terms and their occurrences in texts
    ◆ Applications of term recognition in a single text:
        ✓ creation of glossaries and subject indexes
        ✓ checks of term consistency and accuracy
    Scientific texts in Russian are processed in our work
  • 6. STAGES of OUR RESEARCH
    Our work included:
    ◆ Empirical study of scientific texts and terminological
      dictionaries in Russian (on computer science and physics)
    ◆ Formalization of linguistic features of multi-word terms
      and their occurrences in texts:
        ● typical term structures
        ● terminological contexts
        ● text variants of terms
      LSPL (Lexico-Syntactic Pattern Language) is used as a tool
    ◆ Specification of types of extracted terms on the basis of
      linguistic information used for term recognition
    ◆ Development of extraction procedures for each term type
    ◆ Testing the procedures and working out a strategy for
      combining the sets of terms extracted by them
  • 7. TYPES and LSPL-PATTERNS of EXTRACTED TERMS (1)
    ❑ Term candidates have specified grammatical structures
        стерильный нейтрино – sterile neutrinos
        A N <A=N>    (LSPL-pattern)
    ❑ Author's terms appear in contexts of definitions
        Вероятность есть степень возможности… –
        Probability is the measure of the likeliness…
        Term<c=nom> "есть" Defin<c=nom> => Term
    ❑ Term synonyms
        инфракрасный (ИК) – infrared (IR)
        Term1 "("Term2")" <Term1.c=Term2.c> => Term1, Term2
    ❑ Dictionary terms from a terminological dictionary
        адрес, адрес возврата – address, return address
        N1<адрес> [N2<возврат,c=gen>]
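As a rough illustration of what a pattern such as A N <A=N> checks, the sketch below scans pre-tagged tokens for an adjective followed by a noun that agrees with it. It is not LSPL and not the authors' implementation: the `Token` class, the tag names, and the reading of <A=N> as agreement in case, number and gender are assumptions, and the morphological tags are hard-coded rather than produced by a real analyzer.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    text: str
    pos: str      # "A" = adjective, "N" = noun; other tags are ignored
    case: str
    number: str
    gender: str

def agree(adj, noun):
    """Assumed meaning of <A=N>: same case, number and gender."""
    return (adj.case, adj.number, adj.gender) == (noun.case, noun.number, noun.gender)

def match_a_n(tokens):
    """Return adjacent adjective + noun pairs that agree grammatically."""
    found = []
    for left, right in zip(tokens, tokens[1:]):
        if left.pos == "A" and right.pos == "N" and agree(left, right):
            found.append(f"{left.text} {right.text}")
    return found

# Hypothetical hand-tagged input: "внутренний регистр" (internal register) agrees,
# while "внутренний" followed by the genitive "памяти" does not.
tokens = [
    Token("внутренний", "A", "nom", "sg", "m"),
    Token("регистр", "N", "nom", "sg", "m"),
    Token("внутренний", "A", "nom", "sg", "m"),
    Token("памяти", "N", "gen", "sg", "f"),
]
candidates = match_a_n(tokens)
```

Because such a pattern fixes only the grammatical structure, it also accepts ordinary adjective and noun pairs that are not terms, which is consistent with the low precision reported for term candidates.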
  • 8. TYPES and LSPL-PATTERNS of EXTRACTED TERMS (2)
    ❑ Combinations of several multi-word terms
        N1 A N2<c=gen> <A=N2>  =>  N1 N2<c=gen>, A N2 <A=N2>
            разрядность внутреннего регистра = разрядность регистра + внутренний регистр
            capacity of internal register = capacity of register + internal register
        A1 "и" A2 N <A1=A2=N>  =>  A1 N <A1=N>, A2 N <A2=N>
            гравитационная и инертная масса = гравитационная масса + инертная масса
            gravitational and inertial mass = gravitational mass + inertial mass
    ❑ Text variants of a single term
        фрейм активации → фрейм, запись активации
        activation frame → frame, activation record
        A1 N <A1=N> => N, A2 N <A2=N> <Syn(A1,A2)>
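The coordinated pattern A1 "и" A2 N can be sketched in a few lines. This is an illustrative approximation only: the function name `split_coordinated` and the (text, tag) token format are hypothetical, and the agreement condition <A1=A2=N> is deliberately left out to keep the sketch short.

```python
def split_coordinated(tokens):
    """Split 'A1 и A2 N' into the two terms 'A1 N' and 'A2 N'.

    tokens: list of (text, pos) pairs with pos "A" for adjectives and "N" for nouns.
    Grammatical agreement is NOT checked here (a simplifying assumption).
    """
    terms = []
    for i in range(len(tokens) - 3):
        (a1, p1), (conj, _), (a2, p2), (noun, p3) = tokens[i:i + 4]
        if p1 == "A" and conj == "и" and p2 == "A" and p3 == "N":
            terms += [f"{a1} {noun}", f"{a2} {noun}"]
    return terms

# The slide's own example: gravitational and inertial mass.
phrase = [("гравитационная", "A"), ("и", "CONJ"), ("инертная", "A"), ("масса", "N")]
parts = split_coordinated(phrase)
```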
  • 9. TERM EXTRACTION PROCEDURES
    ● Formalization  =>  6 groups of LSPL-patterns,
      according to types of extracted terms
    ● For each group, an automatic term extraction
      procedure was developed
    ● Each procedure was tested on texts in computer
      science and physics domains:
        ✓ sizes of the texts vary from 1500 to 4700
          words (total volume ≈ 16000 words)
    ● Dictionary terms in physics (> 3000) and in
      computer science (> 4000) were used
  • 10. EVALUATION of THE PROCEDURES
    Rates both for recognition of terms and their occurrences
    For example:
        The geodetic effect represents the effect of the curvature of
        spacetime… The geodetic effect was first predicted by ...
        Extracted term geodetic effect + two recognized occurrences

    Procedure and           Extraction of Terms       Recognition of their Occurrences
    Type of Terms           Recall      Precision     Recall      Precision
    Term candidates         57.7%       27.4%         59.6%       48.6%
    Author's terms          92.3%       95.9%         73.7%       77.9%
    Synonyms of terms       64.0%       49.9%         ––          ––
    Dictionary terms        94.0%       83.2%         89.2%       72.0%
    Term combination        81.7%       24.7%         ––          ––
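The slides report recall and precision without defining them; the sketch below uses the standard set-based definitions, which is an assumption about how the rates were computed. The function name and the toy term sets are hypothetical.

```python
def recall_precision(extracted, gold):
    """Return (recall, precision) of extracted terms against a gold-standard set."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    recall = correct / len(gold) if gold else 0.0
    precision = correct / len(extracted) if extracted else 0.0
    return recall, precision

# Toy data: two of four gold terms are found, plus one false candidate.
gold = {"geodetic effect", "spacetime", "curvature", "frame"}
extracted = {"geodetic effect", "spacetime", "important problem"}
recall, precision = recall_precision(extracted, gold)
```

The same function applies to occurrences if each occurrence is represented as, say, a (term, position) pair instead of a bare term.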
  • 11. STRATEGY FOR TERM EXTRACTION: KEY IDEAS
    Analysis of incompleteness and inaccuracy of term extraction shows:
    ● certain terms are not extracted because of their complex
      grammatical structure
    ● some patterns of term definitions are ambiguous (adding them
      increases recall but decreases precision)
    ● patterns of term combinations and term candidates fix only their
      grammatical structure, so many non-terms (e.g., important
      problem of astronomy) match the patterns
    ● dictionary terms are not recognized when they are
      broken up within term combinations
    Linguistic features of terms are not mutually exclusive
      =>  the sets of terms extracted by the procedures overlap
    So a strategy for combining the extracted sets by heuristic selection was
    worked out, in order to improve the quality of extraction
  • 12. HEURISTIC STRATEGY: STEPS 1-3
    ❑ The final set S of terms is formed incrementally; initially S is empty
    ❑ In each step of the strategy some terms from the pre-extracted sets
      of terms are added to S

    Step      Set     Selection and addition
              S  :=   ∅
    Step 1    S1 :=   AUTHOR'S TERMS + DICTIONARY TERMS that aren't
                      fragments of TERM CANDIDATES
    Step 2    S2 :=   DICTIONARY TERMS that are constituents of
                      CONJUNCTIONLESS TERM COMBINATIONS +
                      CONJUNCTIONLESS TERM COMBINATIONS that include
                      DICTIONARY TERMS as constituents
              S  :=   S1 ∪ S2
    Step 3    S3 :=   SYNONYMS of all terms that belong to the current S
              S  :=   S ∪ S3
  • 13. HEURISTIC STRATEGY: STEPS 4-8

    Step      Set     Selection and addition
    Step 4    S4 :=   DICTIONARY TERMS and TERM CANDIDATES if they are
                      constituents of a TERM COMBINATION WITH CONJUNCTION
                      that includes a term from S, a DICTIONARY TERM or a broken
                      TERM CANDIDATE
              S  :=   S ∪ S4
    Step 5    S5 :=   DICTIONARY TERMS and TERM CANDIDATES if they are
                      constituents of a CONJUNCTIONLESS TERM COMBINATION
                      that includes a broken term from S, a broken DICTIONARY
                      TERM or a broken TERM CANDIDATE
              If S3 ∪ S4 ∪ S5 ≠ ∅ then S := S ∪ S5; goto Step 3
    Step 6    S6 :=   TERM VARIANTS of all terms that belong to the current S
    Step 7    S7 :=   TERM CANDIDATES with frequency more than F
    Step 8    S8 :=   DICTIONARY TERMS that are not yet in S
              After each step i = 6, 7, 8:
              If Si ≠ ∅ then S := S ∪ Si; goto Step 3
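The control flow of Steps 1 to 8 can be sketched as a loop over pluggable selection functions. This is a reading of the step tables, not the authors' code: the name `run_strategy` and its parameters are hypothetical, and the conditions "≠ ∅" are interpreted as "the step produced at least one term not yet in S", an assumption that guarantees termination over a finite pool of pre-extracted terms.

```python
def run_strategy(initial, closure_steps, late_steps):
    """Build the final term set S incrementally.

    initial:       terms selected in Steps 1-2 (S1 ∪ S2).
    closure_steps: functions S -> set of terms, standing in for Steps 3, 4, 5.
    late_steps:    functions S -> set of terms, standing in for Steps 6, 7, 8.
    """
    s = set(initial)
    while True:
        # Steps 3-5: repeat until a full pass adds nothing new.
        changed = True
        while changed:
            changed = False
            for step in closure_steps:
                new = step(s) - s
                if new:
                    s |= new
                    changed = True
        # Steps 6-8: the first step that adds new terms sends us back to Step 3.
        for step in late_steps:
            new = step(s) - s
            if new:
                s |= new
                break
        else:
            # Steps 6, 7 and 8 all added nothing: S is final.
            return s

# Toy stand-ins for the pre-extracted sets (hypothetical data).
synonyms = {"infrared": {"IR"}}
dictionary_terms = {"infrared"}

def step3(s):
    # Synonyms of all terms currently in S.
    return {syn for term in s for syn in synonyms.get(term, ())}

def step8(s):
    # Dictionary terms that are not yet in S.
    return dictionary_terms - s

final = run_strategy({"sterile neutrino"}, [step3], [step8])
```

In this toy run, Step 8 adds the dictionary term, and the return to Step 3 then picks up its synonym, which shows why the later steps loop back instead of simply running once.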
  • 14. COMPARATIVE EVALUATION of HEURISTIC STRATEGY
    Collection of texts (≈ 33000 words) of different genres and sizes
    (on computer science and physics)
    Comparison with several methods commonly used for term
    extraction from text corpora:

    Mutual-Inf     two-word term extraction based on statistics of
                   word occurrences and co-occurrences
    Mod-Mutual     modification of the Mutual-Inf method
    SP             term extraction according to their grammatical
                   structures
    C-Value        term recognition by using frequencies of words
                   and information about embedded terms
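For orientation, co-occurrence statistics of the kind described for Mutual-Inf are often computed as pointwise mutual information over adjacent word pairs. The slides do not give the exact formula used in the comparison, so the sketch below is an assumed textbook variant with hypothetical names, not the evaluated method.

```python
import math
from collections import Counter

def pmi_scores(tokens):
    """Pointwise mutual information (log base 2) for each adjacent word pair."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores

# Toy text built from the slide's "geodetic effect" example.
words = "geodetic effect of the curvature of the geodetic effect".split()
pmi = pmi_scores(words)
```

On short texts such scores are unstable, since most pairs occur only once or twice, which fits the slides' remark that statistical measures become less significant for a single text.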
  • 15. EVALUATION of THE STRATEGY
    ❑ 17.6 percentage-point increase of F-measure for extraction of terms
      (over the best baseline, Mod-Mutual)
    ❑ 11.7 percentage-point increase of F-measure for recognition of term
      occurrences (over the best baseline, Mod-Mutual)

    Methods        Extraction of Terms                 Recognition of their Occurrences
                   Recall    Precision   F-measure     Recall    Precision   F-measure
    Mutual-Inf     27.3%     13.0%       17.6%         24.4%     20.4%       22.2%
    Mod-Mutual     54.1%     37.4%       44.2%         69.2%     41.5%       51.9%
    SP             51.4%     22.6%       31.4%         37.3%     29.7%       33.1%
    C-Value        35.5%     4.9%        8.6%          21.3%     5.9%        9.3%
    Strategy       53.6%     73.1%       61.8%         68.1%     59.7%       63.6%
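The F-measure column can be cross-checked from recall and precision, assuming the balanced F-measure (harmonic mean of the two); the slides do not state the formula, so this is an assumption, and the function name is hypothetical.

```python
def f_measure(recall, precision):
    """Balanced F-measure (F1): harmonic mean of recall and precision, in percent."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

# Strategy row of the table: term extraction, then occurrence recognition.
f_terms = f_measure(53.6, 73.1)
f_occurrences = f_measure(68.1, 59.7)
print(f"{f_terms:.1f} {f_occurrences:.1f}")  # prints "61.8 63.6"
```

Under this assumption the computed values agree with the reported 61.8% and 63.6% for the strategy.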
  • 16. CONCLUSIONS
    ● We propose a heuristic strategy for term extraction
      based on various linguistic information including
        ✓ grammatical structures of multiword scientific terms
        ✓ their text variants
        ✓ contexts of their usage
    ● The information has been represented as a set of LSPL
      lexico-syntactic patterns
    ● Experimental evaluation of our strategy shows an
      increase of F-measure in comparison with the
      commonly used methods of term extraction
    Nevertheless, the strategy needs further verification
    on texts of various scientific domains and sizes
  • 17. THANKS FOR YOUR ATTENTION!