
IE: Named Entity Recognition (NER)


Keywords: information extraction, named entity recognition (NER), text analytics, text mining, e-discovery, unstructured data, structured data, calendaring, standard evaluation per entity, standard evaluation per token, sequence classifier, sequence labeling, word shapes, Semantic Analysis in Language Technology



  1. Semantic Analysis in Language Technology
     http://stp.lingfil.uu.se/~santinim/sais/2016/sais_2016.htm
     Information Extraction (I): Named Entity Recognition (NER)
     Marina Santini, santinim@stp.lingfil.uu.se
     Department of Linguistics and Philology, Uppsala University, Uppsala, Sweden
     Spring 2016
  2. Previous Lecture: Distributional Semantics
     •  Starting from Shakespeare and IR (the term-document matrix) …
     •  Moving to context "windows" taken from the Brown corpus …
     •  Ending up with PPMI to weight word distributions …
     •  Mentioning the cosine metric to compare vectors …
  3. IR: Term-document matrix
     •  Each cell: count of term t in a document d, N_{t,d} (the term frequency of t in d)
     •  Each document is a count vector in ℕ^|V|: a column below

                  As You Like It   Twelfth Night   Julius Caesar   Henry V
        battle          1                1               8            15
        soldier         2                2              12            36
        fool           37               58               1             5
        clown           6              117               0             0
  4. Document similarity: Term-document matrix
     •  Two documents are similar if their vectors are similar

                  As You Like It   Twelfth Night   Julius Caesar   Henry V
        battle          1                1               8            15
        soldier         2                2              12            36
        fool           37               58               1             5
        clown           6              117               0             0
  5. The words in a term-document matrix
     •  Two words are similar if their vectors are similar
     •  (same term-document matrix as above, now reading the rows instead of the columns)
  6. Term-context matrix for word similarity
     •  Two words are similar in meaning if their context vectors are similar

                     aardvark   computer   data   pinch   result   sugar   …
        apricot         0          0        0       1        0       1
        pineapple       0          0        0       1        0       1
        digital         0          2        1       0        1       0
        information     0          1        6       0        4       0
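     The previous lecture's cosine metric can be made concrete on this toy matrix.
     Below is a minimal Python/numpy sketch (an illustration, not from the slides;
     the vectors are the rows of the term-context matrix above):

        import numpy as np

        # Rows of the term-context matrix above, one context-count vector per
        # word (columns: aardvark, computer, data, pinch, result, sugar).
        apricot     = np.array([0, 0, 0, 1, 0, 1])
        pineapple   = np.array([0, 0, 0, 1, 0, 1])
        digital     = np.array([0, 2, 1, 0, 1, 0])
        information = np.array([0, 1, 6, 0, 4, 0])

        def cosine(u, v):
            """Cosine of the angle between two count vectors."""
            return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

        print(cosine(apricot, pineapple))    # 1.0   (identical contexts)
        print(cosine(digital, information))  # ~0.67 (overlapping contexts)
        print(cosine(apricot, digital))      # 0.0   (no shared contexts)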
  7. Computing PPMI on a term-context matrix
     •  Matrix F with W rows (words) and C columns (contexts)
     •  f_{ij} is the number of times word w_i occurs in context c_j

        p_{ij} = \frac{f_{ij}}{\sum_{i=1}^{W} \sum_{j=1}^{C} f_{ij}}
        (denominator: the sum of all words in all contexts, i.e. all the numbers in the matrix)

        p_{i*} = \frac{\sum_{j=1}^{C} f_{ij}}{\sum_{i=1}^{W} \sum_{j=1}^{C} f_{ij}}
        (numerator: the count of all the contexts where the word appears)

        p_{*j} = \frac{\sum_{i=1}^{W} f_{ij}}{\sum_{i=1}^{W} \sum_{j=1}^{C} f_{ij}}
        (numerator: the count of all the words that occur in that context)

        pmi_{ij} = \log_2 \frac{p_{ij}}{p_{i*} \, p_{*j}}

        ppmi_{ij} = \begin{cases} pmi_{ij} & \text{if } pmi_{ij} > 0 \\ 0 & \text{otherwise} \end{cases}
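     A minimal numpy sketch of these formulas, assuming the toy term-context
     matrix from slide 6 (again an illustration, not part of the lecture):

        import numpy as np

        # Count matrix F: W rows (words) x C columns (contexts); the toy
        # apricot/pineapple/digital/information matrix from slide 6.
        F = np.array([[0, 0, 0, 1, 0, 1],
                      [0, 0, 0, 1, 0, 1],
                      [0, 2, 1, 0, 1, 0],
                      [0, 1, 6, 0, 4, 0]], dtype=float)

        total = F.sum()               # all the numbers in the matrix
        p_ij = F / total              # joint probabilities p_{ij}
        p_i = F.sum(axis=1) / total   # row (word) marginals p_{i*}
        p_j = F.sum(axis=0) / total   # column (context) marginals p_{*j}

        with np.errstate(divide="ignore", invalid="ignore"):
            pmi = np.log2(p_ij / np.outer(p_i, p_j))

        # Clip negative PMI (and the -inf/NaN from zero counts) to 0.
        ppmi = np.nan_to_num(np.maximum(pmi, 0))
        print(ppmi.round(2))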
  8. Summation: Sigma Notation (i)

        \sum_{n=1}^{4} n = 1 + 2 + 3 + 4 = 10

     It means: sum whatever appears after the Sigma, so here we sum n. What values
     does n take? They are shown below and above the Sigma: below is the index
     variable and its starting value (e.g. start from n = 1); above is the upper end
     of the range (e.g. up to 4). So n goes from 1 to 4, that is 1, 2, 3 and 4.
     (http://www.mathsisfun.com/algebra/sigma-notation.html)

     Note: in  p_{ij} = \frac{f_{ij}}{\sum_{i=1}^{W} \sum_{j=1}^{C} f_{ij}}  we
     cannot drop the f_{ij} inside the sums!
  9. Summation: Sigma Notation (ii)
     •  Additional examples
     •  Sums can be nested: the inner sum is evaluated once for each value of the
        outer index (see the worked example below)
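     As a worked example (mine, not from the slides), a nested sum is evaluated
     inner-first, once per value of the outer index:

        \sum_{i=1}^{2} \sum_{j=1}^{3} i \cdot j
          = \sum_{i=1}^{2} i \cdot (1 + 2 + 3)
          = 6 \cdot (1 + 2)
          = 18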
  10. Alternative notations … (Levy, 2012)
      •  When the range of the sum can be understood from context, it can be left
         out; or we may want to be vague about the precise range of the sum. For
         example, suppose that there are n variables, x_1 through x_n.
      •  In order to say that the sum of all n variables is equal to 1, we might
         simply write:

         \sum_i x_i = 1
  11. Formulas: Sigma Notation

         p_{ij} = \frac{f_{ij}}{\sum_{i=1}^{W} \sum_{j=1}^{C} f_{ij}}
      •  Numerator: f_{ij}, a single cell
      •  Denominator: sum the cells of all the words and the cells of all the contexts

         p_{i*} = \frac{\sum_{j=1}^{C} f_{ij}}{\sum_{i=1}^{W} \sum_{j=1}^{C} f_{ij}}
      •  Numerator: sum the cells over all contexts (all the columns of row i)

         p_{*j} = \frac{\sum_{i=1}^{W} f_{ij}}{\sum_{i=1}^{W} \sum_{j=1}^{C} f_{ij}}
      •  Numerator: sum the cells over all words (all the rows of column j)
  12. Living lexicon: built upon an underlying continuously updated corpus
      •  Drawbacks: updated but unstable and incomplete: missing words, missing
         linguistic information, etc.
      •  Multilinguality, function words, etc.
  13. Similarity
      •  Given the underlying statistical model, these words are similar
      (Fredrik Olsson)
  14. Gavagai blog
      •  Further reading (Magnus Sahlgren):
         https://www.gavagai.se/blog/2015/09/30/a-brief-history-of-word-embeddings/
  15. End of previous lecture
  16. Acknowledgements
      Most slides borrowed or adapted from:
      •  Dan Jurafsky and Christopher Manning, Coursera
      •  Dan Jurafsky and James H. Martin, J&M (2015, draft):
         https://web.stanford.edu/~jurafsky/slp3/
  17. Preliminary: What's Information Extraction (IE)?
      •  IE = text analytics = text mining = e-discovery, etc.
      •  The ultimate goal is to convert unstructured text into structured
         information, so that information of interest can easily be picked up.
      •  Unstructured data/text: email, PDF files, social media posts, tweets,
         text messages, blogs, basically any running text …
      •  Structured data/text: databases (XML, SQL, etc.), ontologies, dictionaries, etc.
  18. Information Extraction and Named Entity Recognition
      Introducing the tasks: getting simple structured information out of text
  19. Information Extraction
      •  Information extraction (IE) systems:
         •  Find and understand limited relevant parts of texts
         •  Gather information from many pieces of text
         •  Produce a structured representation of the relevant information:
            •  relations (in the database sense), a.k.a.
            •  a knowledge base
      •  Goals:
         1. Organize information so that it is useful to people
         2. Put information in a semantically precise form that allows further
            inferences to be made by computer algorithms
  20. Information Extraction: factual info
      •  IE systems extract clear, factual information
      •  Roughly: who did what to whom, when?
      •  E.g., gathering earnings, profits, board members, headquarters, etc. from
         company reports:
         •  "The headquarters of BHP Billiton Limited, and the global headquarters
            of the combined BHP Billiton Group, are located in Melbourne, Australia."
         •  headquarters("BHP Billiton Limited", "Melbourne, Australia")
      •  Learning drug-gene product interactions from medical research literature
  21. Low-level information extraction
      •  Is now available – and, I think, popular – in applications like Apple or
         Google mail, and in web indexing
      •  Often seems to be based on regular expressions and name lists (a minimal
         sketch follows below)
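      As an illustration of the regular-expression-plus-name-list approach, here
      is a minimal Python sketch (the patterns and the gazetteer are invented for
      the example; real mail clients use far richer rules):

         import re

         text = ("Lunch with Anna Smith on 12/03/2016 at 1:30 pm, "
                 "then call BHP Billiton.")

         # Toy date and time patterns of the kind a mail client might use
         # for calendaring.
         DATE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")
         TIME = re.compile(r"\b\d{1,2}:\d{2}\s*(?:am|pm)\b", re.IGNORECASE)

         # Toy name list (gazetteer) for organizations.
         ORGS = ["BHP Billiton", "European Commission"]

         print(DATE.findall(text))              # ['12/03/2016']
         print(TIME.findall(text))              # ['1:30 pm']
         print([o for o in ORGS if o in text])  # ['BHP Billiton']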
  22. Low-level information extraction
  23. Named Entity Recognition (NER)
      •  A very important sub-task: find and classify names in text
      •  An entity is a discrete thing like "IBM Corporation"
      •  "Named" means called "IBM" or "Big Blue", not "it" or "the company"
      •  Often extended in practice to things like dates, instances of products
         and chemical/biological substances that aren't really entities …
      •  But also used for times, dates, proteins, etc., which aren't entities
         but are easy-to-recognize semantic classes
  24. Named Entity Recognition (NER)
      •  A very important sub-task: find and classify names in text, for example:
         "The decision by the independent MP Andrew Wilkie to withdraw his support
         for the minority Labor government sounded dramatic but it should not
         further threaten its stability. When, after the 2010 election, Wilkie,
         Rob Oakeshott, Tony Windsor and the Greens agreed to support Labor, they
         gave just two guarantees: confidence and supply."
      •  You have a text, and you want to:
         1. find the things that are names: European Commission, John Lloyd Jones, etc.
         2. give them labels: ORG, PERS, etc.
  25. Named Entity Recognition (NER)
      •  A very important sub-task: find and classify names in text, for example:
         "The decision by the independent MP [Andrew Wilkie PER] to withdraw his
         support for the minority [Labor ORG] government sounded dramatic but it
         should not further threaten its stability. When, after the [2010 DATE]
         election, [Wilkie PER], [Rob Oakeshott PER], [Tony Windsor PER] and the
         [Greens ORG] agreed to support [Labor ORG], they gave just two
         guarantees: confidence and supply."
      •  Label inventory: Person, Date, Location, Organization
  26. Named Entity Recognition (NER)
      •  The uses:
         •  Named entities can be indexed, linked off, etc.
         •  Sentiment can be attributed to companies or products
         •  A lot of IE relations are associations between named entities
         •  For question answering, answers are often named entities
      •  Concretely:
         •  Many web pages tag various entities, with links to bio or topic pages, etc.
            •  Reuters' OpenCalais, Evri, AlchemyAPI, Yahoo's Term Extraction, …
         •  Apple/Google/Microsoft/… smart recognizers for document content
  27. Summary: getting simple structured information out of text
  28. Evaluation of Named Entity Recognition
      The extension of Precision, Recall, and the F measure to sequences
  29. The Named Entity Recognition Task
      Task: predict entities in a text

         Foreign     ORG
         Ministry    ORG
         spokesman   O
         Shen        PER
         Guofang     PER
         told        O
         Reuters     ORG
         …           …

      Standard evaluation is per entity, not per token
  30. P/R

         P = TP / (TP + FP)        R = TP / (TP + FN)

      •  FP = false alarm (it is not a NE, but it has been classified as one)
      •  FN = it really is a NE, but the system failed to recognize it
  31. Precision/Recall/F1 for IE/NER
      •  Recall and precision are straightforward for tasks like IR and text
         categorization, where there is only one grain size (documents)
      •  The measures behave a bit oddly for IE/NER when there are boundary
         errors (which are common):
         •  "First Bank of Chicago announced earnings …": if the system extracts
            only part of the name (say, "Bank of Chicago"), this counts as both
            an FP and an FN
         •  Selecting nothing would have been better
      •  Some other metrics (e.g., the MUC scorer) give partial credit according
         to complex rules (a scoring sketch follows below)
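      A minimal sketch of exact-match, per-entity scoring, showing how a single
      boundary error produces both an FP and an FN (illustrative Python; the MUC
      partial-credit rules are not implemented here):

         def entity_prf(gold, pred):
             """Exact-match entity-level P/R/F1.
             Entities are (start_token, end_token, type) tuples; an entity is
             correct only if boundaries and type both match exactly."""
             gold, pred = set(gold), set(pred)
             tp = len(gold & pred)
             fp = len(pred - gold)  # false alarms: predicted, not in gold
             fn = len(gold - pred)  # misses: in gold, not predicted
             p = tp / (tp + fp) if (tp + fp) else 0.0
             r = tp / (tp + fn) if (tp + fn) else 0.0
             f1 = 2 * p * r / (p + r) if (p + r) else 0.0
             return p, r, f1

         # Boundary error from the slide: gold "First Bank of Chicago"
         # (tokens 0-3), system predicts only "Bank of Chicago" (tokens 1-3).
         print(entity_prf(gold=[(0, 3, "ORG")], pred=[(1, 3, "ORG")]))
         # -> (0.0, 0.0, 0.0): the partial match costs one FP *and* one FN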
  32. Summary: be careful when interpreting the P/R/F1 measures
  33. Sequence Models for Named Entity Recognition
  34. The ML sequence model approach to NER
      Training:
      1. Collect a set of representative training documents
      2. Label each token with its entity class or other (O)
      3. Design feature extractors appropriate to the text and classes
      4. Train a sequence classifier to predict the labels from the data
      Testing:
      1. Receive a set of testing documents
      2. Run sequence model inference to label each token
      3. Appropriately output the recognized entities
  35. NER pipeline

      Representative documents → Human annotation → Annotated documents
        → Feature extraction → Training data → Sequence classifier → NER system
  36. Encoding classes for sequence labeling

         Token      IO encoding   IOB encoding
         Fred       PER           B-PER
         showed     O             O
         Sue        PER           B-PER
         Mengqiu    PER           B-PER
         Huang      PER           I-PER
         's         O             O
         new        O             O
         painting   O             O
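      A minimal sketch of producing IOB tags from entity spans, using the example
      above (illustrative Python; the token-offset spans are mine, chosen to match
      the table):

         def spans_to_iob(tokens, entities):
             """Tag tokens with IOB labels from (start, end, type) spans:
             B- marks the first token of an entity, I- the following
             tokens, O everything else."""
             tags = ["O"] * len(tokens)
             for start, end, etype in entities:
                 tags[start] = "B-" + etype
                 for i in range(start + 1, end + 1):
                     tags[i] = "I-" + etype
             return list(zip(tokens, tags))

         # "Sue" and "Mengqiu Huang" are two adjacent PER entities: IOB keeps
         # them apart (B-PER starts a new entity), plain IO tagging could not.
         tokens = ["Fred", "showed", "Sue", "Mengqiu", "Huang",
                   "'s", "new", "painting"]
         entities = [(0, 0, "PER"), (2, 2, "PER"), (3, 4, "PER")]
         for token, tag in spans_to_iob(tokens, entities):
             print(token, tag)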
  37. Features for sequence labeling
      •  Words
         •  Current word (essentially like a learned dictionary)
         •  Previous/next word (context)
      •  Other kinds of inferred linguistic classification
         •  Part-of-speech tags
      •  Label context
         •  Previous (and perhaps next) label
      (a feature-extractor sketch follows below)
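      One plausible shape for such a feature extractor, as a Python sketch (the
      feature names are invented for illustration; a real system would also add
      POS tags and the previous label):

         def token_features(tokens, i):
             """Feature dict for token i: current word, neighbours, and a
             simple orthographic cue, mirroring the feature types above."""
             word = tokens[i]
             return {
                 "word": word,
                 "word.lower": word.lower(),
                 "prev_word": tokens[i - 1] if i > 0 else "<S>",
                 "next_word": tokens[i + 1] if i < len(tokens) - 1 else "</S>",
                 "is_capitalized": word[0].isupper(),
             }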
  38. Features: Word substrings
      [figure: distributions over the classes drug, company, movie, place, and
      person for words containing the substrings "oxa", ":", and "field";
      example words: Cotrimoxazole, Wethersfield, "Alien Fury: Countdown to
      Invasion"]
  39. Features: Word shapes
      •  Map words to a simplified representation that encodes attributes such as
         length, capitalization, numerals, Greek letters, internal punctuation, etc.

         Varicella-zoster   Xx-xxx
         mRNA               xXXX
         CPA1               XXXd
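      A minimal word-shape function as a Python sketch (note that the slide's
      "Xx-xxx" for "Varicella-zoster" additionally collapses runs of repeated
      shape characters, which this simple version does not do):

         import re

         def word_shape(word):
             """Map a word to its shape: A-Z -> X, a-z -> x, digits -> d;
             other characters (hyphens, Greek letters, ...) are kept as-is."""
             shape = re.sub(r"[A-Z]", "X", word)
             shape = re.sub(r"[a-z]", "x", shape)
             return re.sub(r"[0-9]", "d", shape)

         print(word_shape("mRNA"))              # xXXX
         print(word_shape("CPA1"))              # XXXd
         print(word_shape("Varicella-zoster"))  # Xxxxxxxxx-xxxxxx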
  40. Sequence models
      •  Once you have designed the features, apply a sequence classifier
         (cf. PoS tagging), such as:
         •  Maximum Entropy Markov Models
         •  Conditional Random Fields
         •  etc.
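      A minimal training sketch with a Conditional Random Field, assuming the
      third-party sklearn-crfsuite package (my choice for illustration; the
      lecture names CRFs but prescribes no particular library):

         import sklearn_crfsuite  # pip install sklearn-crfsuite (assumed)

         # One toy training sentence: a feature dict per token (cf. the
         # feature slide above) paired with its IOB label sequence.
         tokens = ["Fred", "showed", "Sue", "Mengqiu", "Huang",
                   "'s", "new", "painting"]
         labels = ["B-PER", "O", "B-PER", "B-PER", "I-PER", "O", "O", "O"]

         X_train = [[{"word": w, "word.lower": w.lower(),
                      "is_capitalized": w[0].isupper()} for w in tokens]]
         y_train = [labels]

         crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1,
                                    max_iterations=100)
         crf.fit(X_train, y_train)
         print(crf.predict(X_train)[0])  # predicted label sequence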
  41. The end
