PhD defense Koen Deschacht


This thesis studies weakly supervised learning for information extraction methods in two settings: (1) unimodal weakly supervised learning, where annotated texts are augmented with a large corpus of unlabeled texts and (2) multimodal weakly supervised learning, where images or videos are augmented with texts that describe the content of these images or videos.

In the <b>unimodal</b> setting we find that traditional semi-supervised methods based on generative Bayesian models are not suitable for the textual domain, because text violates the assumptions these models make. We develop an unsupervised model, the latent words language model (LWLM), that learns accurate word similarities from a large corpus of unlabeled texts. We show that this model is a good model of natural language, predicting unseen texts better than previously proposed state-of-the-art language models. In addition, the learned word similarities can be used to automatically expand words in the annotated training data with synonyms, where the correct synonyms are chosen depending on the context. We show that this approach improves classifiers for word sense disambiguation and semantic role labeling.
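The slides report predictive quality on unseen text as perplexity, where lower is better. As a minimal sketch of how that measure is computed, assuming the model has already assigned a probability to every token of a held-out text (the function name `perplexity` and the toy probabilities are illustrative and not taken from the thesis):

```python
import math

def perplexity(token_probs):
    """Perplexity of a held-out text, given the model probability of each token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Average negative log-likelihood per token, then exponentiate.
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# Toy input: a model that gives every token probability 0.25 is as uncertain
# as a uniform choice among four words, so its perplexity is close to 4.
print(perplexity([0.25] * 8))
```

A model that assigns higher probabilities to the words that actually occur gets a lower perplexity, which is the sense in which one language model predicts unseen text better than another.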
The second part of this thesis discusses weakly supervised learning in a <b>multimodal</b> setting. We develop information extraction methods to extract information from texts that describe an image or video, and use this extracted information as a weak annotation of the image or video. A first model for the prediction of entities in an image uses two novel measures. The salience measure captures the importance of an entity, depending on the position of that entity in the discourse and in the sentence. The visualness measure captures the probability that an entity can be perceived visually, extracted from the WordNet database. We show that combining these measures results in an accurate prediction of the entities present in the image. We then discuss how this model can be used to learn a mapping from names in the text to faces in the image, and to retrieve images of a certain entity.
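A rough sketch of how the two measures could be combined is given here. It assumes that visualness is the share of an entity's similarity mass that falls on visual seed words, and that the combined score is the product of salience and visualness. Both choices, all function names, and every number are illustrative assumptions, not the formulas or values used in the thesis.

```python
# Hypothetical similarity scores in [0, 1]; a real system would derive them
# from WordNet. None of these numbers come from the thesis.
TOY_SIM = {
    ("guard", "person"): 0.9, ("guard", "vehicle"): 0.3,
    ("guard", "thought"): 0.2, ("guard", "power"): 0.2,
    ("age", "person"): 0.2, ("age", "vehicle"): 0.1,
    ("age", "thought"): 0.5, ("age", "power"): 0.4,
}

def toy_similarity(a, b):
    """Symmetric lookup in the toy table; unknown pairs score 0."""
    return TOY_SIM.get((a, b), TOY_SIM.get((b, a), 0.0))

def visualness(entity, visual_seeds, non_visual_seeds, similarity=toy_similarity):
    """Assumed definition: fraction of similarity mass on the visual seeds."""
    vis = sum(similarity(entity, s) for s in visual_seeds)
    non = sum(similarity(entity, s) for s in non_visual_seeds)
    total = vis + non
    return vis / total if total > 0 else 0.0

def appearance_score(salience, visual):
    """Assumed combination: a simple product of the two measures."""
    return salience * visual

VISUAL = ["person", "vehicle"]
NON_VISUAL = ["thought", "power"]
TOY_SALIENCE = {"guard": 0.6, "age": 0.5}  # hypothetical salience values

# Rank entities by how likely they are to appear in the image.
ranked = sorted(
    TOY_SALIENCE,
    key=lambda e: appearance_score(TOY_SALIENCE[e], visualness(e, VISUAL, NON_VISUAL)),
    reverse=True,
)
print(ranked)
```

With these toy numbers, "guard" ranks above "age" because it is both more salient and closer to the visual seeds, matching the intuition that entities close to visual seeds will be visual.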

We then turn to the automatic annotation of video. We develop a model that annotates a video with the visual verbs and their visual arguments, i.e. actions and arguments that can be observed in the video. The annotations of this system are successfully used to train a classifier that detects and classifies actions in the video. A second system annotates every scene in the video with the location of that scene. This system comprises a multimodal scene cut classifier that combines information from the text and the video, an IE algorithm that extracts possible locations from the text, and a novel way to propagate location labels from one scene to another, depending on the similarity of the scenes in the textual and visual domain.
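The idea of propagating location labels between similar scenes can be illustrated with a single-step weighted vote. This is a deliberately simplified stand-in, not the method of the thesis (which the slides describe as latent Dirichlet allocation plus visual reweighting). The function name, the mixing weight `alpha`, and the similarity matrices are all hypothetical.

```python
from collections import defaultdict

def propagate_locations(labels, text_sim, visual_sim, alpha=0.5):
    """Give each unlabeled scene the location with the highest similarity-weighted vote.

    labels      -- list with a location string or None per scene
    text_sim    -- square matrix of textual scene similarities
    visual_sim  -- square matrix of visual scene similarities
    alpha       -- assumed mixing weight between text and vision
    """
    result = list(labels)
    for i, label in enumerate(labels):
        if label is not None:
            continue  # keep locations already detected in the text
        votes = defaultdict(float)
        for j, other in enumerate(labels):
            if other is None:
                continue
            votes[other] += alpha * text_sim[i][j] + (1 - alpha) * visual_sim[i][j]
        if votes:
            result[i] = max(votes, key=votes.get)
    return result

# Toy episode: scene 1 has no explicit location but resembles scene 0.
labels = ["kitchen", None, "street"]
text_sim = [[1.0, 0.7, 0.1], [0.7, 1.0, 0.1], [0.1, 0.1, 1.0]]
visual_sim = [[1.0, 0.8, 0.2], [0.8, 1.0, 0.2], [0.2, 0.2, 1.0]]
print(propagate_locations(labels, text_sim, visual_sim))
```

In this toy case the unlabeled scene inherits "kitchen", since it is closer to the kitchen scene in both the textual and the visual similarity matrix.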

Published in: Technology

  1. Weakly supervised methods for information extraction. PhD defense, Koen Deschacht. Supervisors: Prof. Marie-Francine Moens, Prof. Danny De Schreye
  2. Overview
  3. Information extraction: detect and classify structures in unstructured text and images/video. Examples: word sense disambiguation (WSD), semantic role labeling (SRL), visual entity detection
  4. Information extraction: detect and classify structures in unstructured text and images/video. WSD: determine the meaning of a word. "He kicked the ball in the goal." "At a formal ball attendees wear evening attire." "He stood on the balls of his feet."
  5. Information extraction: detect and classify structures in unstructured text and images/video. SRL: who is doing what, where? "John broke the window with a stone." "John broke the window with little doubt." "The window broke."
  6. Information extraction: detect and classify structures in unstructured text and images/video. Who/what is present in the image? Hillary Clinton, Bill Clinton
  7. Information extraction: detect and classify structures in unstructured text and images/video. Common approach for word sense disambiguation, semantic role labeling, visual entity detection and many, many more...
  8. Information extraction: detect and classify structures in unstructured text and images/video. Common approach for word sense disambiguation, semantic role labeling, visual entity detection and many, many more: supervised machine learning methods
  9. Supervised machine learning: statistical methods that are trained on many annotated examples (SRL: 113,000 verbs; WSD: 250,000 words). Learn soft rules from the data
  10. Example: WSD. Ball = round object: (1) He kicked the ball in the goal. (2) Ricardo blocks the ball as Benzema tries to shoot. (3) Patrice Evra almost kicked the ball in his own goal. … Ball = formal dance: (1) Obama and his wife danced at the inaugural ball. (2) Casey Gillis was dressed in a white ball gown. (3) Dance Unlimited's Spring Ball takes place tomorrow. ...
  11. Example: WSD. Soft rules: if “kicked”, if “goal”, ... then ball = “round object”; if “dance”, if “gown”, ... then ball = “formal dance”. Machine learning methods can combine many complementary and/or contradicting rules
  12. Supervised machine learning: current state-of-the-art machine learning methods. Manually annotated corpus needed for every new task, language or domain; features need to be manually engineered; high variation of language limits performance even with large training corpora. Machine learning method often independent of task; successful for many tasks; flexible, fast development for new tasks; only some expert knowledge needed
  13. Solution: use unlabeled data. Unlabeled data: cheap, available for many domains and languages. Semi-supervised learning: optimize a single function that incorporates labeled and unlabeled data; violation of assumptions causes deteriorating results when adding more unlabeled data. Unsupervised learning: first learn a model on unlabeled data, then use the model in a supervised machine learning method
  14. Distributional hypothesis: it is possible to determine the meaning of a word by investigating its occurrence in a corpus. Example: what does “pulque” mean?
  15. Distributional hypothesis: it is possible to determine the meaning of a word by investigating its occurrence in a corpus. Example: “It takes a maguey plant twelve years before it is mature enough to produce the sap for pulque.” “The consumption of pulque peaked in the 1800’s.” “After the Conquest, pulque lost its sacred character, and both indigenous and Spanish people began to drink it.” “In this way, the making of pulque passed from being a home-made brew to one commercially produced.”
  16. Latent words language model: directed Bayesian model that models likely synonyms of a word, depending on context. Automatically learns synonyms and related words
  17. Latent words language model. Original sentence: "We hope there is an increasing need for reform"
  18. Latent words language model. Original sentence: "We hope there is an increasing need for reform". Automatically learned synonyms per word position: We: We, I, They, You, ...; hope: hope, believe, think, feel, ...; there: there, this, that, it, ...; is: is, was, 's, are, ...; an: an, the, no, some, ...; increasing: increasing, enormous, important, increased, ...; need: need, chance, demand, potential, ...; for: for, of, to, that, ...; reform: reform, restructuring, change, peace, ...
  19. Latent words language model. Original sentence: "We hope there is an increasing need for reform". Automatically learned synonyms per word position: We: We, I, They, You, ...; hope: hope, believe, think, feel, ...; there: there, this, that, it, ...; is: is, was, 's, are, ...; an: an, the, no, some, ...; increasing: increasing, enormous, important, increased, ...; need: need, chance, demand, potential, ...; for: for, of, to, that, ...; reform: reform, restructuring, change, peace, ... Time to compute all possible combinations: very, very long. Approximate by considering only the most likely: pretty fast
  20. LWLM: quality. Measure how well the model can predict new, previously unseen texts in terms of perplexity. Perplexity on Reuters / APNews / EnWiki: ADKN 114.96 / 134.42 / 161.41; IBM 108.38 / 125.65 / 149.21; LWLM 108.78 / 124.57 / 151.98; int. LWLM 96.45 / 112.81 / 138.03. LWLM outperforms other language models
  21. LWLM for information extraction. Word sense disambiguation: standard 66.32%; + cluster features 66.97%; + hidden words 67.61%. Semantic role labeling: [Figure: results for standard, + clusters and + hidden words; vertical axis 40% to 90%, horizontal axis 5%, 20%, 50%, 100%]. Latent words help with underspecification and ambiguity
  22. Automatic annotation of images & video: texts describe content of images. Extract information in structured format: entities, attributes, actions, locations
  23. Automatic annotation of images & video: texts describe content of images. Extract information in structured format: entities, attributes, actions, locations
  24. Annotation of entities in images: extract entities from descriptive news text that are present in the image. "Former President Bill Clinton, left, looks on as an honor guard folds the U.S. flag during a graveside service for Lloyd Bentsen in Houston, May 30, 2006. Bentsen, a former senator and former treasury secretary, died last week at the age of 85." Entities: service, Lloyd Bentsen, Bill Clinton, Houston, guard, age, flag, ...
  25. Annotation of entities in images. Assumption: an entity is present in the image if it is important in the descriptive text and possible to perceive visually. Salience: dependent on text; combines analysis of discourse and syntax. Visualness: independent of text; extracted from semantic database
  26. Annotation of entities in images. "Former President Bill Clinton, left, looks on as an honor guard folds the U.S. flag during a graveside service for Lloyd Bentsen in Houston, May 30, 2006. Bentsen, a former senator and former treasury secretary, died last week at the age of 85." Entities: Bill Clinton, service, guard, Lloyd Bentsen, Houston, flag, age, ...
  27. Salience: is the entity important in the descriptive text? Discourse model: important entities are referred to by other entities and terms; a graph models entities, co-referents and other terms; eigenvectors find the most important entities. Syntactic model: important entities appear high in the parse tree; important entities have many children in the tree
  28. Visualness: can the entity be perceived visually? Similarity measure on entities in WordNet: s(“car”, “truck”) = 0.88; s(“thought”, “house”) = 0.23; s(“car”, “horse”) = 0.38; s(“house”, “building”) = 0.91; s(“horse”, “cow”) = 0.79; s(“car”, “house”) = 0.40. Visual seeds: “person”, “vehicle”, “animal”, ... Non-visual seeds: “thought”, “power”, “air”, … Visualness: combine similarity measure and seeds, “entities close to visual seeds will be visual”
  29. Annotation of entities: results. Appearance model: combine visualness and salience. All entities: 26.66%; + visualness: 62.78%; + salience: 59.56%; + salience + visualness: 69.39%. Appearance model dramatically increases accuracy!
  30. Scene location annotation: annotate location of every scene in sitcom series. Input: video and transcript. "Shot of Buffy opening the refrigerator and taking out a carton of milk. Buffy sniffs the milk and puts it on the counter. In the background we see Dawn opening a cabinet to get out a box of cereal. Buffy turns away."
  31. Scene location annotation: annotate location of every scene in sitcom series. Dawn's room, the kitchen, the living room, the street
  32. Scene segmentation: segment transcript and video in scenes. Scene cut classifier in text; shot cut detector in video. Transcript with scene cuts: "Shot of Buffy opening the refrigerator and taking out a carton of milk. Buffy sniffs the milk and puts it on the counter. In the background we see Joyce drinking coffee and Dawn opening a cabinet to get out a box of cereal. ... Buffy & Riley move into the living room. They sit on the sofa. Buffy nods in resignation. Smooch. Riley gets up. Cut to a shot of a bright red convertible driving down the street. Giles is at the wheel, Buffy beside him and Dawn in the back. Classical music plays on the radio. ...."
  33. Scene segmentation: segment transcript and video in scenes. Scene cut classifier in text; shot cut detector in video
  34. Scene segmentation: segment transcript and video in scenes. Scene cut classifier in text; shot cut detector in video. "Shot of Buffy opening the refrigerator and taking out a carton of milk. ..." "Buffy & Riley move into the living room. They sit on the sofa. …" "Cut to a shot of a bright red convertible driving down the street. ...."
  35. Location detection and propagation. Detect locations in text: "Shot of Buffy opening the refrigerator and taking out a carton of milk. ... Buffy & Riley move into the living room. They sit on the sofa. Cut to a shot of a bright red convertible driving down the street." Propagate locations to other scenes. Latent Dirichlet allocation: learn correlation between locations and other objects (“refrigerator” → “kitchen”). Visual reweighting: visually similar scenes should be in the same location
  36. Location annotation results. Scene cut classifier: precision 91.71%, recall 97.48%, f1-measure 85.16%. Location detector: precision 68.75%, recall 75.54%, f1-measure 71.98%. Location annotation (only text / text + LDA / text + LDA + vision): episode 2: 54.72% / 58.89% / 57.39%; episode 3: 60.11% / 65.87% / 68.57%
  37. Contributions 1/2. The latent words language model: best n-gram language model; unsupervised learning of word similarities; unsupervised disambiguation of words. Using the latent words for WSD: best WSD system. Using the latent words for SRL: improvement of state-of-the-art classifier
  38. Contributions 2/2. Image annotation: first full analysis of entities in descriptive texts; visualness: capture knowledge from WordNet; salience: capture knowledge from syntactic properties. Location annotation: automatic annotation of locations from transcripts, including new locations and locations that are not explicitly mentioned
  39. Thank you! Questions? Comments?