Information Extraction
Ruben Izquierdo
ruben.izquierdobevia@vu.nl
http://rubenizquierdobevia.com
Information Extraction slides for the Text Mining course at the VU University of Amsterdam (2014-2015) by the CLTL group


1. Information Extraction
Ruben Izquierdo
ruben.izquierdobevia@vu.nl
http://rubenizquierdobevia.com

2. Text Mining Course
• 1) Introduction to Text Mining
• 2) Introduction to NLP
• 3) Named Entity Recognition and Disambiguation
• 4) Opinion Mining and Sentiment Analysis
• 5) Information Extraction
• 6) NewsReader and Visualisation
• 7) Guest Lecture and Q&A

3. Outline
1. What is Information Extraction
2. Main goals of Information Extraction
3. Information Extraction Tasks and Subtasks
4. MUC conferences
5. Main domains of Information Extraction
6. Methods for Information Extraction
  o Cascaded finite-state transducers
  o Regular expressions and patterns
  o Supervised learning approaches
  o Weakly supervised and unsupervised approaches
7. How far we are with IE
4. What is IE?
• Late 1970s, within the NLP field
• Find and extract automatically limited relevant parts of texts
• Merge information from many pieces of text

5. What is IE?
• Quite often in specialized domains
• Move from unstructured/semi-structured data to structured data
  o Schemas
  o Relations (as in a database)
  o Knowledge bases
  o RDF triples

6. What is IE? Unstructured text
• Natural language sentences
• Historically, NLP systems have been designed to process this type of data
• The meaning → linguistic analysis and natural language understanding

7. What is IE? Semi-structured text
• The physical layout helps the interpretation
• Processing half way: linguistic features ↔ positional features

8. What is IE?
9. Main goals of IE
• Fill a predefined "template" from raw text
• Extract who did what to whom and when
  o Event extraction
• Organize information so that it is useful to people
• Put information in a form that allows further inferences by computers
  o Big data

10. IE. Tasks & Subtasks
• Named Entity Recognition
  o Detection → Mr. Smith eats bitterballen; [Mr. Smith] : ENTITY
  o Classification → Mr. Smith eats bitterballen; [Mr. Smith] : PERSON
• Event extraction
  o The thief broke the door with a hammer
    • CAUSE_HARM → Verb: break; Agent: the thief; Patient: the door; Instrument: a hammer
• Coreference resolution
  o [Mr. Smith] eats bitterballen. Besides this, [he] only drinks Belgian beer.

11. IE. Tasks & Subtasks
• Relationship extraction
  o Bill works for IBM → PERSON works for ORGANISATION
• Terminology extraction
  o Finding relevant terms or multiwords in a given corpus
• Some concrete examples
  o Extracting earnings, profits, board members, headquarters from company reports
  o Searching the WWW for e-mail addresses for advertising (spamming)
  o Learning drug-gene product interactions from biomedical research papers
12. IE Tasks & Subtasks
• Apple Mail

13. MUC conferences
• Message Understanding Conference (MUC), held between 1987 and 1998
• Domain-specific texts + training examples + template definition
• Precision, Recall and F1 as evaluation
• Domains
  o MUC-1 (1987), MUC-2 (1989): Naval operations messages
  o MUC-3 (1991), MUC-4 (1992): Terrorism in Latin American countries
  o MUC-5 (1993): Joint ventures and microelectronics domain
  o MUC-6 (1995): News articles on management changes
  o MUC-7 (1998): Satellite launch reports

14. MUC conferences
Bridgestone Sports Co. said Friday it has set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be shipped to Japan.
The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 iron and "metal wood" clubs a month.
Example from MUC-5

15. Main domains of IE
• Terrorist events
• Joint ventures
• Plane crashes
• Disease outbreaks
• Seminar announcements
• Biological and medical domain

16. Outline
1. What is Information Extraction
2. Main goals of Information Extraction
3. Information Extraction Tasks and Subtasks
4. MUC conferences
5. Main domains of Information Extraction
6. Methods for Information Extraction
  o Cascaded finite-state transducers
  o Regular expressions and patterns
  o Supervised learning approaches
  o Weakly supervised and unsupervised approaches
7. How far we are with IE
17. Methods for IE
• Cascaded finite-state transducers
  o Rule based
  o Regular expressions
• Learning-based approaches
  o Traditional classifiers
    • Naïve Bayes, MaxEnt, SVM …
  o Sequence labeling models
    • HMM, CMM, CRF
• Unsupervised approaches
• Hybrid approaches
18. Cascaded finite-state transducers
• Emerging idea from MUC participants and approaches
• Decompose the task into small sub-tasks
• One element is read at a time from a sequence
  o Depending on its type, a certain transition is produced in the automaton to a new state
  o Some states are considered final (the input matches a certain pattern)
• Can be defined as a regular expression
19. Cascaded finite-state transducers
Finite automaton for noun groups
=> John's interesting book with a nice cover
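To make the automaton idea concrete, here is a minimal sketch in Python (not the lecture's exact machine): a finite automaton over POS tags that accepts simple noun groups such as "John's interesting book". The states, the tiny tagset, and the treatment of the possessive are simplifying assumptions; prepositional attachments like "with a nice cover" are left out.

```python
# Minimal finite automaton for simple noun groups (sketch, simplified tagset):
# optional determiner/possessive, any number of adjectives, one or more nouns.
from typing import List

TRANSITIONS = {  # state -> {POS tag -> next state}
    "START": {"DET": "MOD", "POS": "MOD", "ADJ": "MOD", "NOUN": "NOUN"},
    "MOD":   {"ADJ": "MOD", "NOUN": "NOUN"},
    "NOUN":  {"NOUN": "NOUN"},
}
FINAL_STATES = {"NOUN"}  # accepting only once a head noun has been seen

def accepts_noun_group(pos_tags: List[str]) -> bool:
    state = "START"
    for tag in pos_tags:
        state = TRANSITIONS.get(state, {}).get(tag)
        if state is None:          # no transition: the input is rejected
            return False
    return state in FINAL_STATES

# "John's interesting book" -> POS ADJ NOUN (treating "John's" as a possessive)
print(accepts_noun_group(["POS", "ADJ", "NOUN"]))  # True
print(accepts_noun_group(["DET", "ADJ"]))          # False: no head noun
```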
20. Cascaded finite-state transducers
• Earlier stages recognize smaller linguistic objects
  o Usually domain independent
• Later stages build on top of the previous ones
  o Usually domain dependent
• Typical IE systems
  1. Complex words
  2. Basic phrases
  3. Complex phrases
  4. Domain events
  5. Merging structures

21. Cascaded finite-state transducers
• Complex words
  o Multiwords: "set up", "trading house"
  o NE: "Bridgestone Sports Co."
• Basic phrases
  o Syntactic chunking
    • Noun groups (head noun + all modifiers)
    • Verb groups

22. Cascaded finite-state transducers

23. Cascaded finite-state transducers
• Complex phrases
  o Complex noun and verb groups built on the basis of syntactic information
• The attachment of appositives to their head noun group
  o "The joint venture, Bridgestone Sports Taiwan Co.,"
• The construction of measure phrases
  o "20,000 iron and 'metal wood' clubs a month"

24. Cascaded finite-state transducers
• Domain events
  o Recognize events and match them with "fillers" detected in previous steps
  o Requires domain-specific patterns
    • To recognize phrases of interest
    • To define what the roles are
  o Patterns can also be defined as finite-state machines or regular expressions
    • <Company/ies> <Set-up> <Joint-Venture> with <Company/ies>
    • <Company> <Capitalized> at <Currency>

25. Cascaded finite-state transducers
26. Regular Expressions
• 1950s, Stephen Kleene
• A string pattern that describes/matches a set of strings
• A regular expression consists of:
  o Characters
  o Operation symbols
    • Boolean (and/or)
    • Grouping (for defining scopes)
    • Quantification
27. Regular Expressions
Character : Description
a      : The character a
.      : Any single character
[abc]  : Any character in the brackets (OR): 'a' or 'b' or 'c'
[^abc] : Any character not in the brackets: any symbol that is not 'a' or 'b' or 'c'
*      : Quantifier. Matches the preceding element ZERO or more times
+      : Quantifier. Matches the preceding element ONE or more times
?      : Matches the preceding element zero or one time
|      : Choice (OR). Matches one of the expressions (before or after the |)
28. Regular Expressions
① .at → ???
29. Regular Expressions
① .at → hat, cat, bat, xat, …
② [hc]at → hat, cat
③ [^b]at → everything matched by .at except "bat"
④ [^hc]at → everything matched by .at except "hat" and "cat"
⑤ s.* → s, sssss, ssbsd2ck3e

30. Regular Expressions
① .at → hat, cat, bat, xat, …
② [hc]at → hat, cat
③ [^b]at → everything matched by .at except "bat"
④ [^hc]at → everything matched by .at except "hat" and "cat"
⑤ s.* → s, sssss, ssbsd2ck3e
⑥ [hc]*at → hat, cat, hhat, chat, cchhat, at, …
⑦ cat|dog → cat, dog
⑧ …
⑨ …
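These patterns can be checked directly with Python's re module; a quick sketch (the word list is chosen just for illustration, and re.fullmatch mirrors matching a whole word in isolation):

```python
import re

words = ["hat", "cat", "bat", "xat", "at", "chat", "dog"]

for pattern in [r".at", r"[hc]at", r"[^b]at", r"[^hc]at", r"[hc]*at", r"cat|dog"]:
    # fullmatch: the entire word must match the pattern
    print(pattern.ljust(10), [w for w in words if re.fullmatch(pattern, w)])

# .at        ['hat', 'cat', 'bat', 'xat']
# [hc]at     ['hat', 'cat']
# [^b]at     ['hat', 'cat', 'xat']
# [^hc]at    ['bat', 'xat']
# [hc]*at    ['hat', 'cat', 'at', 'chat']
# cat|dog    ['cat', 'dog']
```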
31. Using Regular Expressions
• Typically, extracting information from automatically generated webpages is easy
  o Wikipedia
    • To know the country for a given city
  o Amazon webpages
    • From a list of hits
  o Weather forecast webpages
  o DBpedia

32. Using Regular Expressions

33. Using Regular Expressions
• Some "unstructured" pieces of information keep some structure and are easy to capture by means of regular expressions
  o Phone numbers
  o What else?
  o …
  o …

34. Using Regular Expressions
• Some "unstructured" pieces of information keep some structure and are easy to capture by means of regular expressions
  o Phone numbers
  o E-mails
  o URLs / websites
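A minimal sketch of such patterns in Python. The regular expressions are deliberately simplified (real phone, e-mail and URL grammars are much messier), and the phone number in the sample text is made up:

```python
import re

TEXT = ("Mail ruben.izquierdobevia@vu.nl or call +31 20 123 4567; "
        "slides at http://rubenizquierdobevia.com")

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")   # user@domain.tld
PHONE = re.compile(r"\+?\d[\d ()-]{6,}\d")            # digit runs with spaces/()/-
URL   = re.compile(r"https?://[^\s,;]+")              # http(s) up to whitespace

print(EMAIL.findall(TEXT))  # ['ruben.izquierdobevia@vu.nl']
print(PHONE.findall(TEXT))  # ['+31 20 123 4567']
print(URL.findall(TEXT))    # ['http://rubenizquierdobevia.com']
```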
35. Using Regular Expressions
• Also used to detect relations and fill events
• Higher-level regular expressions make use of "objects" detected by lower-level patterns
• Some NLP information may help (POS tags, phrases, semantic word categories)
  o Crime-Victim can use things matched by "noun-group"
    • Prefiller: [pos: V, type-of-verb: KILL] (WordNet, MCR)
    • Filler: [phrase: NOUN-GROUP]
36. Using Regular Expressions
• Extracting relations between entities
  o Which PERSON holds what POSITION in what ORGANIZATION
    • [PER], [POSITION] of [ORG]
Entities:
  PER: Jose Mourinho
  POSITION: trainer
  ORG: Chelsea
Relation: (Jose Mourinho, trainer, Chelsea)

37. Using Regular Expressions
• Extracting relations between entities
  o Which PERSON holds what POSITION in what ORGANIZATION
    • [PER], [POSITION] of [ORG]
    • [ORG] (named, appointed, …) [PER] Prep [POSITION]
      o Nokia has appointed Rajeev Suri as President
  o Where an ORGANIZATION is located
    • [ORG] headquarters in [LOC]
      o NATO headquarters in Brussels
    • [ORG] [LOC] (division, branch, headquarters, …)
      o KFOR Kosovo headquarters
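A sketch of how such a higher-level pattern can run over entity mentions produced by an earlier NER step. The bracketed [TYPE:text] tagging format is invented here to keep the example self-contained:

```python
import re

# Output of a (hypothetical) NER step, with mentions wrapped as [TYPE:text]
TAGGED = "[ORG:Nokia] has appointed [PER:Rajeev Suri] as [POSITION:President]"

# Pattern: [ORG] (named|appointed) [PER] as [POSITION]
APPOINT = re.compile(
    r"\[ORG:(?P<org>[^\]]+)\] has (?:named|appointed) "
    r"\[PER:(?P<per>[^\]]+)\] as \[POSITION:(?P<pos>[^\]]+)\]"
)

m = APPOINT.search(TAGGED)
if m:
    # (PERSON, POSITION, ORGANIZATION) relation tuple
    print((m.group("per"), m.group("pos"), m.group("org")))
    # ('Rajeev Suri', 'President', 'Nokia')
```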
38. Extracting relations with patterns
• Hearst 1992
• What does Gelidium mean?
• "Αγαρ ισ α συβστανχε πρεπαρεδ φροµ α µιξτυρε οφ ρεδ αλγαε, συχη ασ Gelidium, φορ λαβορατορψ ορ ινδυστριαλ υσε" (the sentence deliberately rendered in unreadable Greek letters; see the next slide)
39. Extracting relations with patterns
• Hearst 1992
• What does Gelidium mean?
• "Agar is a substance prepared from a mixture of red algae, such as Gelidium, for laboratory or industrial use"
• How do you know?

40. Extracting relations with patterns
• Hearst 1992: Automatic Acquisition of Hyponyms (IS-A)
  X → Gelidium (sub-type), Y → red algae (super-type): X IS-A Y
• "Y such as X"
• "Y, such as X"
• "X or other Y"
• "X and other Y"
• "Y including X"
• …
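A minimal sketch of one Hearst pattern ("Y, such as X") in Python. Real systems match over chunked noun phrases; here a crude one-or-two-word stand-in for a noun phrase keeps the example self-contained:

```python
import re

NP = r"\w+(?: \w+)?"  # crude noun-phrase stand-in: at most two words
SUCH_AS = re.compile(rf"(?P<sup>{NP}),? such as (?P<sub>{NP})")

sentence = ("Agar is a substance prepared from a mixture of red algae, "
            "such as Gelidium, for laboratory or industrial use")

for m in SUCH_AS.finditer(sentence):
    print(m.group("sub"), "IS-A", m.group("sup"))
# Gelidium IS-A red algae
```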
41. Extracting relations with patterns
42. Hand-built patterns
• Positive
  o Tend to be high-precision
  o Can be adapted to specific domains
• Negative
  o Hand-built patterns are usually low-recall
  o A lot of work to think of all possible patterns
  o Need to create a lot of patterns for every relation

43. Learning-based Approaches
• Statistical techniques and machine learning algorithms
  o Automatically learn patterns and models for new domains
• Some types
  o Supervised learning of patterns and rules
  o Supervised learning for relation extraction
  o Supervised learning of sequential classifier methods
  o Weakly supervised and unsupervised
44. Supervised Learning of Patterns and Rules
• Aiming to reduce the knowledge-engineering bottleneck of creating an IE system for a new domain
• AutoSlog and PALKA → the first IE pattern-learning systems
  o AutoSlog: syntactic templates, lexico-syntactic patterns and manual review
• Learning algorithms → generate rules from annotated text
  o LIEP (Huffman 1996): syntactic paths, role fillers. Patterns that work well in training are kept
  o (LP)2 uses tagging rules and correction rules

45. Supervised Learning of Patterns and Rules
• Relational learning methods
  o RAPIER: rules for pre-filler, filler, and post-filler components. Each component is a pattern that consists of words, POS tags, and semantic classes.

46. Supervised Learning for relation extraction (I)
• Design a supervised machine learning framework
• Decide what relations we are interested in
• Choose what entities are relevant
• Find (or create) labeled data
  o Representative corpus
  o Label the entities in the corpus (automatic NER)
  o Hand-label relations between these entities
  o Split into train + dev + test
• Train, improve and evaluate
47. Supervised Learning for relation extraction (II)
• Relation extraction as a classification problem
• 2 classifiers
  o To decide if two entities are related
  o To decide the class for a pair of related entities
• Why 2?
  o Faster training by eliminating most pairs
  o Appropriate feature sets for each task
• Find all pairs of NEs (restricted to the sentence) and, for every pair (see the sketch below):
  1. Are the entities related? (classifier 1)
     • No → END
     • Yes → guess the class (classifier 2)
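A sketch of the two-classifier cascade with scikit-learn, under heavy simplification: the features are placeholders (slides 49-51 list realistic ones), the training pairs are toy data, and all names (featurize, detector, typer) are invented for this example.

```python
from itertools import combinations
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def featurize(e1, e2):
    # placeholder mention-pair features
    return {"head1": e1["head"], "head2": e2["head"],
            "type_pair": e1["type"] + "-" + e2["type"]}

# toy training pairs: (entity1, entity2, related?, relation type)
train = [
    ({"head": "Airlines", "type": "ORG"},  {"head": "Wagner", "type": "PER"}, 1, "employment"),
    ({"head": "Nokia",    "type": "ORG"},  {"head": "Suri",   "type": "PER"}, 1, "employment"),
    ({"head": "Airlines", "type": "ORG"},  {"head": "AMR",    "type": "ORG"}, 1, "part-of"),
    ({"head": "move",     "type": "MISC"}, {"head": "Wagner", "type": "PER"}, 0, "none"),
]
X = [featurize(e1, e2) for e1, e2, _, _ in train]

detector = make_pipeline(DictVectorizer(), LogisticRegression())  # classifier 1: related?
detector.fit(X, [rel for _, _, rel, _ in train])

typer = make_pipeline(DictVectorizer(), LogisticRegression())     # classifier 2: which class?
typer.fit([x for x, row in zip(X, train) if row[2] == 1],
          [row[3] for row in train if row[2] == 1])

def extract_relations(entities):
    for e1, e2 in combinations(entities, 2):      # all entity pairs in the sentence
        feats = featurize(e1, e2)
        if detector.predict([feats])[0] == 1:     # step 1: related at all?
            yield e1["head"], typer.predict([feats])[0], e2["head"]  # step 2
```

Training the cheap binary detector first discards most of the quadratically many pairs before the (typically more feature-rich) type classifier ever sees them, which is exactly the "Why 2?" argument on the slide.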
48. Supervised Learning for relation extraction (III)
• Are the two entities related?
• What is the type of relation?

49. Supervised Learning for relation extraction (IV)
"[American Airlines], a unit of AMR, immediately matched the move, spokesman [Tim Wagner] said"
• What features?
  o Head words of entity mentions and their combination
    • Airlines, Wagner, Airlines-Wagner
  o Bag-of-words in the two entity mentions
    • American, Airlines, Tim, Wagner, American Airlines, Tim Wagner
  o Words/bigrams in particular positions to the left and right
    • M2#-1: spokesman, M1#+1: said
  o Bag-of-words (or bigrams) between the 2 mentions
    • a, AMR, of, immediately, matched, move, spokesman, the, unit

50. Supervised Learning for relation extraction (V)
"[American Airlines], a unit of AMR, immediately matched the move, spokesman [Tim Wagner] said"
• What features?
  o Named entity types
    • M1: ORG, M2: PERSON
  o Entity level (Name, Nominal (NP), Pronoun)
    • M1: NAME ("it" or "he" would be PRONOUN)
    • M2: NAME ("the company" would be NOMINAL)
  o Basic chunk sequence from one entity to the other
    • NP NP PP VP NP NP
  o Constituency path in the parse tree
    • NP ↑ NP ↑ S ↑ S ↓ NP

51. Supervised Learning for relation extraction (VI)
"[American Airlines], a unit of AMR, immediately matched the move, spokesman [Tim Wagner] said"
• What features?
  o Trigger lists
    • For family → parent, wife, husband … (WordNet)
  o Gazetteers
    • Lists of countries …
  o …
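A sketch of how some of these mention-pair features could be computed on the example sentence. Tokenization, mention offsets, and the feature names are hard-coded assumptions for illustration:

```python
tokens = ["American", "Airlines", ",", "a", "unit", "of", "AMR", ",",
          "immediately", "matched", "the", "move", ",", "spokesman",
          "Tim", "Wagner", "said"]
m1 = (0, 2, "ORG")    # [American Airlines] as (start, end, type)
m2 = (14, 16, "PER")  # [Tim Wagner]

def pair_features(tokens, m1, m2):
    s1, e1, t1 = m1
    s2, e2, t2 = m2
    return {
        "head1": tokens[e1 - 1],                # head word of mention 1: Airlines
        "head2": tokens[e2 - 1],                # head word of mention 2: Wagner
        "head_combo": tokens[e1 - 1] + "-" + tokens[e2 - 1],
        "m2_prev": tokens[s2 - 1],              # word before mention 2: spokesman
        "m2_next": tokens[e2],                  # word after mention 2: said
        "bow_between": sorted(set(tokens[e1:s2]) - {","}),
        "type_pair": t1 + "-" + t2,             # ORG-PER
    }

print(pair_features(tokens, m1, m2))
# bow_between: ['AMR', 'a', 'immediately', 'matched', 'move', 'of',
#               'spokesman', 'the', 'unit']
```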
52. Supervised Learning for relation extraction (VII)
• Decide your algorithm
  o MaxEnt, Naïve Bayes, SVM
• Train the system on the training data
• Tune it on the dev set
• Test on the evaluation set
  o Traditional Precision, Recall and F-score

53. Sequential Classifier Methods
• IE as a classification problem using sequential learning models
• A classifier is induced from annotated data to scan a text sequentially from left to right and decide whether each piece of text must be extracted or not
• Decide what you want to extract
• Represent the annotated data in a proper way
54. Sequential Classifier Methods

55. Sequential Classifier Methods
• Typical steps for training
  o Get the annotated training data
  o Represent the data in IOB
  o Design feature extractors
  o Decide the algorithm to use
  o Train the models
• Testing steps
  o Get the test documents
  o Extract features
  o Run the sequence models
  o Extract the recognized entities
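"Represent the data in IOB" means tagging each token as B-TYPE (begins an entity), I-TYPE (inside one) or O (outside). A small sketch of the encoding, and of recovering entity spans from it:

```python
sentence = [
    ("American", "B-ORG"), ("Airlines", "I-ORG"), (",", "O"),
    ("a", "O"), ("unit", "O"), ("of", "O"), ("AMR", "B-ORG"), (",", "O"),
    ("spokesman", "O"), ("Tim", "B-PER"), ("Wagner", "I-PER"), ("said", "O"),
]

# Recover entity spans from the IOB labels
entities, current = [], None
for token, tag in sentence + [("", "O")]:      # sentinel flushes the last entity
    if tag.startswith("B-"):                   # a new entity begins
        if current: entities.append(current)
        current = (tag[2:], [token])
    elif tag.startswith("I-") and current:     # the current entity continues
        current[1].append(token)
    else:                                      # outside: close any open entity
        if current: entities.append(current)
        current = None

print([(t, " ".join(ws)) for t, ws in entities])
# [('ORG', 'American Airlines'), ('ORG', 'AMR'), ('PER', 'Tim Wagner')]
```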
56. Sequential Classifier Methods
• Algorithms
  o HMM
  o CMM
  o CRF
• Features
  o Words (current, previous, next)
  o Other linguistic information (PoS, chunks …)
  o Task-specific features (NER …)
    • Word shapes: abstract representation for words

57. Sequential Classifier Methods
• Algorithms
  o HMM
  o SVM
  o CRF
• Features
  o Words (current, previous, next)
  o Other linguistic information (PoS, chunks …)
  o Task-specific features (NER …)
    • Word shapes: abstract representation for words
58. Weakly supervised and unsupervised
• Manual annotation is also "expensive"
  o IE is quite domain specific → no reuse
• AutoSlog-TS:
  o Just needs 2 sets of documents: relevant/irrelevant
  o Syntactic templates + relevance according to the relevant set
• Ex-Disco (Yangarber et al. 2000)
  o No need for a preclassified corpus
  o Uses a small set of patterns to decide relevant/irrelevant

59. Weakly supervised and unsupervised
• OpeNER:
  o European project dealing with entity recognition, sentiment analysis and opinion mining, mainly in hotel reviews (also restaurants, attractions, news)
• Double propagation
  o Method to automatically gather opinion words and targets
    • From a large raw hotel corpus
    • Providing a set of seeds and patterns
60. Weakly supervised and unsupervised
• Seed list
  o + → good, nice
  o - → bad, ugly
• Patterns
  o a [EXP] [TAR]
  o the [EXP] [TAR]
• Polarity patterns
  o = (same polarity): [EXP] and [EXP]; [EXP], [EXP]
  o ! (opposite polarity): [EXP] but [EXP]
61. Weakly supervised and unsupervised
• Propagation method (see the sketch below)
  o 1) Get new targets using the seed expressions and the patterns
    • a nice [TAR], a bad [TAR], the ugly [TAR]
    • Output → new targets (hotel, room, location)
  o 2) Get new expressions using the previous targets and the patterns
    • a [EXP] hotel, the [EXP] location
    • Output → new expressions (expensive, cozy, perfect …)
  o Keep running 1 and 2 to get new EXP and TAR
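A toy sketch of this propagation loop in Python: alternately instantiate the "a/an/the [EXP] [TAR]" pattern with known expressions to harvest targets, then with known targets to harvest expressions, until nothing new turns up. The corpus string below is a stand-in for the large raw hotel corpus:

```python
import re

corpus = ("a nice hotel with a good location ; the ugly room had a bad bed ; "
          "a cozy room and an expensive hotel")

expressions = {"good", "nice", "bad", "ugly"}   # seed opinion words (+/- lists)
targets = set()

changed = True
while changed:
    changed = False
    for exp in sorted(expressions):             # step 1: known EXP -> new TAR
        for tar in re.findall(rf"(?:a|an|the) {exp} (\w+)", corpus):
            changed |= tar not in targets
            targets.add(tar)
    for tar in sorted(targets):                 # step 2: known TAR -> new EXP
        for exp in re.findall(rf"(?:a|an|the) (\w+) {tar}", corpus):
            changed |= exp not in expressions
            expressions.add(exp)

print(sorted(targets))      # ['bed', 'hotel', 'location', 'room']
print(sorted(expressions))  # the seeds plus 'cozy' and 'expensive'
```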
62. Weakly supervised and unsupervised
• Polarity guessing
  o Apply the polarity patterns to guess the polarity
    • = a nice(+) and cozy(?) → cozy(+)
    • ! clean(+) but expensive(?) → expensive(-)
https://github.com/opener-project/opinion-domain-lexicon-acquisition
63. Outline
1. What is Information Extraction
2. Main goals of Information Extraction
3. Information Extraction Tasks and Subtasks
4. MUC conferences
5. Main domains of Information Extraction
6. Methods for Information Extraction
  o Cascaded finite-state transducers
  o Regular expressions and patterns
  o Supervised learning approaches
  o Weakly supervised and unsupervised approaches
7. How far we are with IE

64. How good is IE
65. How good is IE
• Some progress has been made
• Still, the barrier of 60% seems difficult to break
• Most errors are on entities and event coreference
• Errors propagate
  o Entity recognition → 90%
  o One event → 4 entities
  o 0.9^4 ≈ 0.66 → roughly 60%
• A lot of knowledge is implicit or "common world knowledge"
66. How good is IE
Information Type : Accuracy
Entities         : 90-98%
Attributes       : 80%
Relations        : 60-70%
Events           : 50-60%
• Very optimistic numbers for well-established tasks
• The numbers go down for specific/new tasks
67. Information Extraction
Ruben Izquierdo
ruben.izquierdobevia@vu.nl
http://rubenizquierdobevia.com
