Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Relation Extraction


Published on

databases of relations, knowledge graph, DBpedia, freebase, ACE, relation extractors, hand-written patterns, supervised machine learning, semi-supervised learning, bootstrapping, distant supervision, unsupervised learning from the web, semantic analysis in language technology

Published in: Education
  • Be the first to comment

Relation Extraction

  1. 1. Seman&c  Analysis  in  Language  Technology 
 Relation Extraction Marina  San(ni   san$     Department  of  Linguis(cs  and  Philology   Uppsala  University,  Uppsala,  Sweden     Spring  2016      
  2. 2. Previous  Lecture:  Ques$on  Answering   2  
  3. 3. Ques$on  Answering  systems   •  Factoid  ques(ons:       •  Google   •  Wolfram   •  Ask  Jeeves   •  Start     •  ….   3   •  Approaches:       •  IR-­‐based   •  Knowelege  based   •  Hybrid  
  4. 4. Katz  et  al.  (2006)   hFp://$ons/FLAIRS0601KatzB.pdf     •  START  answers  natural  language  ques(ons  by  presen(ng  components  of  text  and   mul(-­‐media  informa(on  drawn  from  a  set  of  informa(on  resources  that  are   hosted  locally  or  accessed  remotely  through  the  Internet.     •  START  targets  high  precision  in  its  ques(on  answering.     •  The  START  system  analyzes  English  text  and  produces  a  knowledge  base  which   incorporates,  in  the  form  of  nested  ternary  expressions  (=triples),  the   informa(on  found  in  the  text.   4  
  5. 5. Is  it    true?:  hFp://   •  Ask  Jeeves,  more  correctly  known  as,  is  a  search  engine  founded   in  1996  in  California.     •  Ini(ally  it  represented  a  stereotypical  English  butler  who  would  "fetch"   the  answer  to  any  ques(on  asked.   •  is  now  considered  one  of  the  great  failures  of  the  internet.  The   ques(on  and  answer  feature  simply  didn't  work  as  well  as  hoped,  and   a^er  trying  his  hand  at  being  both  a  tradi(onal  search  engine  and  a   terrible  kind  of  "ar(ficial  AI"  with  a  bald  spot,     •  These  days  Jeeves  is  ranked  as  the  4th  most  successful  search  engine  on   the  web,  and  the  4th  most  successful  overall.  This  seems  impressive  un$l   you  consider  that  Google  holds  the  top  spot  with  95%  of  the  market.  It   has  even  fallen  behind  Bing;  enough  said.   5  
  6. 6. Search  engines  that  can  be  used  as  QA  systems   •  Yahoo   •  Bing   6  
  7. 7. Siri   hFp://     •  Siri  /ˈsɪri/  is  an  intelligent  personal  assistant  and  knowledge  navigator  which  works  as  an   applica(on  for  Apple  Inc.'s  iOS.   •   The  applica(on  uses  a  natural  language  user  interface  to  answer  ques$ons,  make   recommenda(ons,  and  perform  ac(ons  by  delega$ng  requests  to  a  set  of  Web  services.     •  The  so^ware,  both  in  its  original  version  and  as  an  iOS  applica(on,  adapts  to  the  user's   individual  language  usage  and  individual  searches  (preferences)  with  con(nuing  use,  and   returns  results  that  are  individualized.     •  The  name  Siri  is  Scandinavian,  a  short  form  of  the  Norse  name  Sigrid  meaning  "beauty"  and   "victory",  and  comes  from  the  intended  name  for  the  original  developer's  first  child.   7  
  8. 8. ChaFerbots   •  Siri…  conversa(onal  ”safety  net”.   •  Conversa(onal  agents  (chaker  bots,   and  personal  assistants)     àcustomer  care,  customer  analy(cs   (replacing/integra(ng  FAQs  and  help   desk)   8   Avatar: a picture of a person or animal that represents you on a computer screen, for example in some chat rooms or when you are playing games over the Internet
  9. 9. Eliza   hFp://   ELIZA  was  wriFen  at  MIT  by  Joseph  Weizenbaum  between  1964  and  1966     9  
  10. 10. General  IR  architecture  for  factoid  ques$ons   10   Document DocumentDocument Docume ntDocume ntDocume ntDocume ntDocume nt Question Processing Passage Retrieval Query Formulation Answer Type Detection Question Passage Retrieval Document Retrieval Answer Processing Answer passages Indexing Relevant Docs DocumentDocument Document
  11. 11. Things  to  extract  from  the  ques$on   •  Answer  Type  Detec(on   •  Decide  the  named  en$ty  type  (person,  place)   of  the  answer   •  Query  Formula(on   •  Choose  query  keywords  for  the  IR  system   •  Ques(on  Type  classifica(on   •  Is  this  a  defini(on  ques(on,  a  math  ques(on,  a   list  ques(on?   •  Focus  Detec(on   •  Find  the  ques(on  words  that  are  replaced  by   the  answer   •  Rela(on  Extrac(on   •  Find  rela(ons  between  en((es  in  the  ques(on  11  
  12. 12. 12   Common  Evalua$on  Metrics   1. Accuracy  (does  answer  match  gold-­‐labeled  answer?)   2. Mean  Reciprocal  Rank:     •  The  reciprocal  rank  of  a  query  response  is  the  inverse  of  the  rank  of  the   first  correct  answer.     •  The  mean  reciprocal  rank  is  the  average  of  the  reciprocal  ranks  of   results  for  a  sample  of  queries  Q   MRR = 1 rankii=1 N ∑ N =  
  13. 13. Common  Evalua$on  Metrics:  MRR   •  The  mean  reciprocal  rank  is  the  average  of  the  reciprocal  ranks   of  results  for  a  sample  of  queries  Q.   •  (ex  adapted  from  Wikipedia)   •  3  ranked  answers  for  a  query,  with  the  first  one  being  the  one  it  thinks  is   most  likely  correct     •  Given  those  3  samples,  we  could  calculate  the  mean  reciprocal  rank  as   (1/3  +  1/2  +  1)/3  =  0.61.   13  
  14. 14. Complex  ques$ons:  “What  is  the  ‘hajii’”?   •  The  (bokom-­‐up)  snippet  method   •  Find  a  set  of  relevant  documents   •  Extract  informa(ve  sentences  from  the  documents  (using  p-­‐idf,  MMR)   •  Order  and  modify  the  sentences  into  an  answer   •  The  (top-­‐down)  informa(on  extrac(on  method   •  build  specific  answerers  for  different  ques(on  types:   •  defini(on  ques(ons,   •  biography  ques(ons,     •  certain  medical  ques(ons  
  15. 15. Informa$on  that  should  be  in  the  answer   for  3  kinds  of  ques$ons  
  16. 16. Document Retrieval 11 Web documents 1127 total sentences Predicate Identification Data-Driven Analysis 383 Non-Specific Definitional sentences Sentence clusters, Importance ordering Definition Creation 9 Genus-Species Sentences The Hajj, or pilgrimage to Makkah (Mecca), is the central duty of Islam. The Hajj is a milestone event in a Muslim's life. The hajj is one of five pillars that make up the foundation of Islam. ... The Hajj, or pilgrimage to Makkah [Mecca], is the central duty of Islam. More than two million Muslims are expected to take the Hajj this year. Muslims must perform the hajj at least once in their lifetime if physically and financially able. The Hajj is a milestone event in a Muslim's life. The annual hajj begins in the twelfth month of the Islamic year (which is lunar, not solar, so that hajj and Ramadan fall sometimes in summer, sometimes in winter). The Hajj is a week-long pilgrimage that begins in the 12th month of the Islamic lunar calendar. Another ceremony, which was not connected with the rites of the Ka'ba before the rise of Islam, is the Hajj, the annual pilgrimage to 'Arafat, about two miles east of Mecca, toward Mina… "What is the Hajj?" (Ndocs=20, Len=8) Architecture  for  complex  ques$on  answering:     defini$on  ques$ons   S.  Blair-­‐Goldensohn,  K.  McKeown  and  A.  Schlaikjer.  2004.   Answering  Defini(on  Ques(ons:  A  Hyrbid  Approach.    
  17. 17. State-­‐of-­‐the-­‐art:  ex   •  Top  downMing  Tan,  Cicero  dos  Santos,  Bing  Xiang  &  Bowen   Zhou.  2015.  LSTM-­‐Based  Deep  Learning  Models  for  non  factoid   Answer  Selec(on.   •  Di  Wang  and  Eric  Nyberg.  2015.  A  Long  Short-­‐Term  Memory   Model  for  Answer  Sentence  Selec(on  in  Ques(on  Answering.  In   ACL  2015.s   •  Minwei  Feng,  Bing  Xiang,  Michael  R.  Glass,  Lidan  Wang,  Bowen   Zhou.  2015.  Applying  deep  learning  to  answer  selec(on:  A  study   and  an  open  task.     17   Deep  Learning  is  a  new  area  of  Machine  Learning   research.  Said  to  be  very  promising.  It  is  about  learning   mul(ple  levels  of  representa(on  and  abstrac(on  that   help  to  make  sense  of  data  such  as  images,  sound,  and   text.  It  is  based  on  neural  networks.    
  18. 18. Prac$cal  ac$vity   •  Start  seems  to  be  limited,  but  it  understands  natural  language   •  Google  (presumably  helped  by  Knowledge  Graph)  is  more   accurate,  but  skips  natural  language  (uses  keywords).     •  Google  is  customized  to  the  users’  preferences  (different  results)   •  Interes(ng  outcomes   •  Currency  vs.  Coin   •  What’s  love?   •  Lyric/song  vs.  Defini(on  ques(on   18  
  19. 19. What’s  the  meaning  of  life?   •  Google   19   Presumably  from  Knowledge  Graph…  
  20. 20. Start  and  the  42  puzzle   •  gg   20  
  21. 21. End  of  previous  lecture   21  
  22. 22. Acknowledgements Most  slides  borrowed  or  adapted  from:   Dan  Jurafsky  and  Christopher  Manning,  Coursera   Dan  Jurafsky  and  James  H.  Mar(n  (2015)         J&M(2015,  dra^):  hkps://              
  23. 23. Relation Extraction What  is  rela(on   extrac(on?  
  24. 24. Extrac$ng  rela$ons  from  text   •  Company  report:  “Interna(onal  Business  Machines  Corpora(on  (IBM  or  the   company)  was  incorporated  in  the  State  of  New  York  on  June  16,  1911,  as  the   Compu(ng-­‐Tabula(ng-­‐Recording  Co.  (C-­‐T-­‐R)…”   •  Extracted  Complex  Rela(on:   Company-­‐Founding      Company    IBM      Loca(on      New  York      Date      June  16,  1911      Original-­‐Name      Compu(ng-­‐Tabula(ng-­‐Recording  Co.   •  But  we  will  focus  on  the  simpler  task  of  extrac(ng  rela(on  triples   Founding-­‐year(IBM,1911)   Founding-­‐loca(on(IBM,New  York)  24  
  25. 25. Extrac$ng  Rela$on  Triples  from  Text    The  Leland  Stanford  Junior  University,   commonly  referred  to  as  Stanford   University  or  Stanford,  is  an  American   private  research  university  located  in   Stanford,  California  …  near  Palo  Alto,   California…  Leland  Stanford…founded   the  university  in  1891   Stanford EQ Leland Stanford Junior University Stanford LOC-IN California Stanford IS-A research university Stanford LOC-NEAR Palo Alto Stanford FOUNDED-IN 1891 Stanford FOUNDER Leland Stanford25  
  26. 26. Why  Rela$on  Extrac$on?   •  Create  new  structured  knowledge  bases,  useful  for  any  app   •  Augment  current  knowledge  bases   •  Adding  words  to  WordNet  thesaurus,  facts  to  FreeBase  or  DBPedia   •  Support  ques(on  answering   •  The  granddaughter  of  which  actor  starred  in  the  movie  “E.T.”?   (acted-in ?x “E.T.”)(is-a ?y actor)(granddaughter-of ?x ?y)! •  But  which  rela(ons  should  we  extract?   ! 26  
  27. 27. Automated  Content  Extrac$on  (ACE)   ARTIFACT GENERAL AFFILIATION ORG AFFILIATION PART- WHOLE PERSON- SOCIAL PHYSICAL Located Near Business Family Lasting Personal Citizen- Resident- Ethnicity- Religion Org-Location- Origin Founder Employment Membership Ownership Student-Alum Investor User-Owner-Inventor- Manufacturer Geographical Subsidiary Sports-Affiliation “Relation Extraction Task” 27   Automa(c  Content  Extrac(on   (ACE)  is  a  research  program  for   developing  advanced  Informa(on   extrac(on  technologies.   Given  a  text  in  natural  language,   the  ACE  challenge  is  to  detect:   •  en((es     •  rela(ons  between  en((es   •  events    
  28. 28. Automated  Content  Extrac$on  (ACE)   •  Physical-­‐Located                        PER-­‐GPE   !He was in Tennessee! •  Part-­‐Whole-­‐Subsidiary    ORG-­‐ORG        XYZ, the parent company of ABC! •  Person-­‐Social-­‐Family          PER-­‐PER   John’s wife Yoko! •  Org-­‐AFF-­‐Founder                      PER-­‐ORG   !Steve Jobs, co-founder of Apple…! •     28  
  29. 29. UMLS:  Unified  Medical  Language  System   •  134  en(ty  types,  54  rela(ons   Injury                          disrupts    Physiological  Func(on   Bodily  Loca(on                      loca(on-­‐of  Biologic  Func(on   Anatomical  Structure                      part-­‐of    Organism   Pharmacologic  Substance        causes    Pathological  Func(on   Pharmacologic  Substance        treats      Pathologic  Func(on   29  
  30. 30. Extrac$ng  UMLS  rela$ons  from  a  sentence    Doppler echocardiography can be used to diagnose left anterior descending artery stenosis in patients with type 2 diabetes! ê    Echocardiography,  Doppler  DIAGNOSES  Acquired  stenosis   30  
  31. 31. Databases  of  Wikipedia  Rela$ons   31   Rela(ons  extracted  from  Infobox   Stanford  state  California   Stanford  moko  “Die  Lu^  der  Freiheit  weht”   …   Wikipedia  Infobox  
  32. 32. Rela$on  databases     that  draw  from  Wikipedia   •  Resource  Descrip(on  Framework  (RDF)  triples   subject  predicate  object   Golden Gate Park location San Francisco! dbpedia:Golden_Gate_Park      dbpedia-­‐owl:loca(on      dbpedia:San_Francisco! •  The  DBpedia  project  uses  the  Resource  Descrip(on  Framework  (RDF)  to  represent  the  extracted  informa(on  and   consists  of  3  billion  RDF  triples,  580  million  extracted  from  the  English  edi(on  of  Wikipedia  and  2.46  billion  from  other   language  edi(ons  (wikipedia,  March  2016).   •  Frequent  Freebase  rela(ons:   people/person/na(onality,                                                                loca(on/loca(on/contains     people/person/profession,                                                                  people/person/place-­‐of-­‐birth     biology/organism_higher_classifica(on                      film/film/genre   32   DBpedia  is  a  project  aiming  to  extract   structured  content  from  the  informa(on   created  as  part  of  the  Wikipedia  project.   Freebase  was  a  large   collabora(ve  knowledge  base   consis(ng  of  data  composed   mainly  by  its  community   members  (cf  Seman(c  Web).  -­‐-­‐>   Knowledge  Graph:   hkps:// Freebase    
  33. 33. How  to  build  rela$on  extractors   1.  Hand-­‐wriken  pakerns   2.  Supervised  machine  learning   3.  Semi-­‐supervised  and  unsupervised     •  Bootstrapping  (using  seeds)   •  Distant  supervision   •  Unsupervised  learning  from  the  web   33  
  34. 34. Relation Extraction Using  pakerns  to   extract  rela(ons  
  35. 35. Rules  for  extrac$ng  IS-­‐A  rela$on     Early  intui(on  from  Hearst  (1992)     •  “Agar  is  a  substance  prepared  from  a  mixture  of   red  algae,  such  as  Gelidium,  for  laboratory  or   industrial  use”   •  What  does  Gelidium  mean?     •  How  do  you  know?`   35  
  36. 36. Rules  for  extrac$ng  IS-­‐A  rela$on     Early  intui(on  from  Hearst  (1992)     •  “Agar  is  a  substance  prepared  from  a  mixture  of   red  algae,  such  as  Gelidium,  for  laboratory  or   industrial  use”   •  What  does  Gelidium  mean?     •  How  do  you  know?`   36  
  37. 37. Hearst’s  PaFerns  for  extrac$ng  IS-­‐A  rela$ons   (Hearst,  1992):      Automa(c  Acquisi(on  of  Hyponyms   “Y such as X ((, X)* (, and|or) X)”! “such Y as X”! “X or other Y”! “X and other Y”! “Y including X”! “Y, especially X”! 37  
  38. 38. Hearst’s  PaFerns  for  extrac$ng  IS-­‐A  rela$ons   Hearst  paFern   Example  occurrences   X  and  other    Y   ...temples,  treasuries,  and  other  important  civic  buildings.   X  or  other    Y   Bruises,  wounds,  broken  bones  or  other  injuries...   Y  such  as  X   The  bow  lute,  such  as  the  Bambara  ndang...   Such    Y  as  X   ...such  authors  as  Herrick,  Goldsmith,  and  Shakespeare.   Y  including  X   ...common-­‐law  countries,  including  Canada  and  England...   Y  ,  especially  X   European  countries,  especially  France,  England,  and  Spain...   38  
  39. 39. Hand-­‐built  paFerns  for  rela$ons   •  Plus: •  Human patterns tend to be high-precision •  Can be tailored to specific domains •  Minus •  Human patterns are often low-recall •  A lot of work to think of all possible patterns! •  Don’t want to have to do this for every relation! •  We’d like better accuracy39  
  40. 40. Relation Extraction Supervised  rela(on   extrac(on  
  41. 41. Supervised  machine  learning  for  rela$ons   •  Choose  a  set  of  rela(ons  we’d  like  to  extract   •  Choose  a  set  of  relevant  named  en((es   •  Find  and  label  data   •  Choose  a  representa(ve  corpus   •  Label  the  named  en((es  in  the  corpus   •  Hand-­‐label  the  rela(ons  between  these  en((es   •  Break  into  training,  development,  and  test   •  Train  a  classifier  on  the  training  set   41  
  42. 42. How  to  do  classifica$on  in  supervised   rela$on  extrac$on   1.  Find  all  pairs  of  named  en((es  (usually  in  same  sentence)   2.  Decide  if  2  en((es  are  related   3.  If  yes,  classify  the  rela(on   •  Why  the  extra  step?   •  Faster  classifica(on  training  by  elimina(ng  most  pairs   •  Can  use  dis(nct  feature-­‐sets  appropriate  for  each  task.   42  
  43. 43. Word  Features  for  Rela$on  Extrac$on   •  Headwords  of  M1  and  M2,  and  combina(on   Airlines                          Wagner                              Airlines-­‐Wagner   •  Bag  of  words  and  bigrams  in  M1  and  M2                      {American,  Airlines,  Tim,  Wagner,  American  Airlines,  Tim  Wagner}   •  Words  or  bigrams  in  par(cular  posi(ons  le^  and  right  of  M1/M2   M2:  -­‐1  spokesman   M2:  +1  said   •  Bag  of  words  or  bigrams  between  the  two  en((es   {a,  AMR,  of,  immediately,  matched,  move,  spokesman,  the,  unit}   American  Airlines,  a  unit  of  AMR,  immediately  matched  the  move,  spokesman  Tim  Wagner  said   Men(on  1   Men(on  2   43  
  44. 44. Named  En$ty  Type  and  Men$on  Level   Features  for  Rela$on  Extrac$on   •  Named-­‐en(ty  types   •  M1:    ORG   •  M2:    PERSON   •  Concatena(on  of  the  two  named-­‐en(ty  types   •  ORG-­‐PERSON   •  En(ty  Level  of  M1  and  M2    (NAME,  NOMINAL,  PRONOUN)   •  M1:  NAME    [it    or  he  would  be  PRONOUN]   •  M2:  NAME    [the  company    would  be  NOMINAL]   American  Airlines,  a  unit  of  AMR,  immediately  matched  the  move,  spokesman  Tim  Wagner  said   Men(on  1   Men(on  2   44  
  45. 45. Parse  Features  for  Rela$on  Extrac$on   •  Base  syntac(c  chunk  sequence  from  one  to  the  other   NP          NP        PP      VP        NP        NP   •  Cons(tuent  path  through  the  tree  from  one  to  the  other   NP      é NP      é        S        é S        ê NP   •  Dependency  path                    Airlines        matched            Wagner      said   American  Airlines,  a  unit  of  AMR,  immediately  matched  the  move,  spokesman  Tim  Wagner  said   Men(on  1   Men(on  2   45  
  46. 46. American  Airlines,  a  unit  of  AMR,  immediately   matched  the  move,  spokesman  Tim  Wagner  said. 46  
  47. 47. Classifiers  for  supervised  methods   •  Now  you  can  use  any  classifier  you  like   •  MaxEnt   •  Naïve  Bayes   •  SVM   •  ...   •  Train  it  on  the  training  set,  tune  on  the  dev  set,  test  on  the  test   set   47  
  48. 48. Evalua$on  of  Supervised  Rela$on   Extrac$on   •  Compute  P/R/F1  for  each  rela(on   48   P = # of correctly extracted relations Total # of extracted relations R = # of correctly extracted relations Total # of gold relations F1 = 2PR P + R
  49. 49. Summary:  Supervised  Rela$on  Extrac$on   +    Can  get  high  accuracies  with  enough  hand-­‐labeled   training  data,  if  test  similar  enough  to  training   -­‐      Labeling  a  large  training  set  is  expensive   -­‐      Supervised  models  are  brikle,  don’t  generalize  well   to  different  genres     49  
  50. 50. Relation Extraction Semi-­‐supervised   and  unsupervised   rela(on  extrac(on  
  51. 51. Seed-­‐based  or  bootstrapping  approaches   to  rela$on  extrac$on   •  No  training  set?  Maybe  you  have:   •  A  few  seed  tuples    or   •  A  few  high-­‐precision  pakerns   •  Can  you  use  those  seeds  to  do  something  useful?   •  Bootstrapping:  use  the  seeds  to  directly  learn  to  populate  a   rela(on   51   Roughly  said:  Use  seeds  to  ini(alize  a   process  of  annota(on,  then  refine   through  itera(ons  
  52. 52. Rela$on  Bootstrapping  (Hearst  1992)   •  Gather  a  set  of  seed  pairs  that  have  rela(on  R   •  Iterate:   1.  Find  sentences  with  these  pairs   2.  Look  at  the  context  between  or  around  the  pair  and   generalize  the  context  to  create  pakerns   3.  Use  the  pakerns  for  grep  for  more  pairs     52  
  53. 53. Bootstrapping     •  <Mark  Twain,  Elmira>    Seed  tuple   •  Grep  (google)  for  the  environments  of  the  seed  tuple   “Mark  Twain  is  buried  in  Elmira,  NY.”   X  is  buried  in  Y   “The  grave  of  Mark  Twain  is  in  Elmira”   The  grave  of  X  is  in  Y   “Elmira  is  Mark  Twain’s  final  res(ng  place”   Y  is  X’s  final  res(ng  place.   •  Use  those  pakerns  to  grep  for  new  tuples   •  Iterate  53  
  54. 54. Dipre:  Extract  <author,book>  pairs   •  Start  with  5  seeds:       •  Find  Instances:   The  Comedy  of  Errors,  by    William  Shakespeare,  was   The  Comedy  of  Errors,  by    William  Shakespeare,  is   The  Comedy  of  Errors,  one  of  William  Shakespeare's  earliest  akempts   The  Comedy  of  Errors,  one  of  William  Shakespeare's  most   •  Extract  pakerns  (group  by  middle,  take  longest  common  prefix/suffix)   ?x , by ?y , ?x , one of ?y ‘s ! •  Now  iterate,  finding  new  seeds  that  match  the  pakern   ! Brin, Sergei. 1998. Extracting Patterns and Relations from the World Wide Web. Author   Book   Isaac  Asimov   The  Robots  of  Dawn   David  Brin   Star(de  Rising   James  Gleick   Chaos:  Making  a  New  Science   Charles  Dickens   Great  Expecta(ons   William  Shakespeare   The  Comedy  of  Errors   54  
  55. 55. Distant  Supervision   •  Combine  bootstrapping  with  supervised  learning   •  Instead  of  5  seeds,   •  Use  a  large  database  to  get  huge  #  of  seed  examples   •  Create  lots  of  features  from  all  these  examples   •  Combine  in  a  supervised  classifier   Snow,  Jurafsky,  Ng.  2005.  Learning  syntac(c  pakerns  for  automa(c  hypernym  discovery.  NIPS  17   Fei  Wu  and  Daniel  S.  Weld.  2007.    Autonomously  Seman(fying  Wikipeida.  CIKM  2007   Mintz,  Bills,  Snow,  Jurafsky.  2009.  Distant  supervision  for  rela(on  extrac(on  without  labeled  data.  ACL09   55  
  56. 56. Distant  supervision  paradigm   •  Like  supervised  classifica(on:   •  Uses  a  classifier  with  lots  of  features   •  Supervised  by  detailed  hand-­‐created  knowledge   •  Doesn’t  require  itera(vely  expanding  pakerns   •  Like  unsupervised  classifica(on:   •  Uses  very  large  amounts  of  unlabeled  data   •  Not  sensi(ve  to  genre  issues  in  training  corpus   56  
  57. 57. Distantly  supervised  learning     of  rela$on  extrac$on  paFerns     For  each  rela(on     For  each  tuple  in  big  database     Find  sentences  in  large  corpus   with  both  en((es     Extract  frequent  features   (parse,  words,  etc)     Train  supervised  classifier  using   thousands  of  pakerns   4 1 2 3 5 PER  was  born  in  LOC   PER,  born  (XXXX),  LOC   PER’s  birthplace  in  LOC     <Edwin  Hubble,  Marshfield>   <Albert  Einstein,  Ulm>   Born-­‐In   Hubble  was  born  in  Marshfield   Einstein,  born  (1879),    Ulm   Hubble’s  birthplace  in  Marshfield   P(born-in | f1,f2,f3,…,f70000)57  
  58. 58. Unsupervised  rela$on  extrac$on   •  Open  InformaLon  ExtracLon:     •  extract  rela(ons  from  the  web  with  no  training  data,  no  list  of  rela(ons   1.  Use  parsed  data  to  train  a  “trustworthy  tuple”  classifier   2.  Single-­‐pass  extract  all  rela(ons  between  NPs,  keep  if  trustworthy   3.  Assessor  ranks  rela(ons  based  on  text  redundancy   (FCI,  specializes  in,  so^ware  development)     (Tesla,  invented,  coil  transformer)   58   M.  Banko,  M.  Cararella,  S.  Soderland,  M.  Broadhead,  and  O.  Etzioni.   2007.  Open  informa(on  extrac(on  from  the  web.  IJCAI  
  59. 59. Evalua$on  of  Semi-­‐supervised  and   Unsupervised  Rela$on  Extrac$on   •  Since  it  extracts  totally  new  rela(ons  from  the  web     •  There  is  no  gold  set  of  correct  instances  of  rela(ons!   •  Can’t  compute  precision  (don’t  know  which  ones  are  correct)   •  Can’t  compute  recall  (don’t  know  which  ones  were  missed)   •  Instead,  we  can  approximate  precision  (only)   •   Draw  a  random  sample  of  rela(ons  from  output,  check  precision  manually   •  Can  also  compute  precision  at  different  levels  of  recall.   •  Precision  for  top  1000  new  rela(ons,  top  10,000  new  rela(ons,  top  100,000   •  In  each  case  taking  a  random  sample  of  that  set   •  But  no  way  to  evaluate  recall  59   ˆP = # of correctly extracted relations in the sample Total # of extracted relations in the sample
  60. 60. The end