This thesis studies weakly supervised learning methods for information extraction in two settings: (1) unimodal weakly supervised learning, where annotated texts are augmented with a large corpus of unlabeled texts, and (2) multimodal weakly supervised learning, where images or videos are augmented with texts that describe their content.
In the <b>unimodal</b> setting we find that traditional semi-supervised methods based on generative Bayesian models are not suitable for the textual domain because the assumptions made by these models are violated. We develop an unsupervised model, the latent words language model (LWLM), that learns accurate word similarities from a large corpus of unlabeled texts. We show that this model is a good model of natural language, offering better predictive quality on unseen texts than previously proposed state-of-the-art language models. In addition, the learned word similarities can be used to automatically expand words in the annotated training data with synonyms, where the correct synonyms are chosen depending on the context. We show that this approach improves classifiers for word sense disambiguation and semantic role labeling.
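The context-dependent synonym expansion described above can be sketched as follows. This is a toy illustration, not the LWLM itself: the conditional distributions and the `expand` helper are invented for exposition, and a trained model would supply the probabilities.

```python
# Toy sketch of context-dependent synonym expansion in the spirit of a latent
# words language model. All probabilities below are illustrative assumptions:
# P(latent word | observed word, context word) for two contexts of "bank".
latent_given_context = {
    ("bank", "river"): {"bank": 0.6, "shore": 0.3, "edge": 0.1},
    ("bank", "money"): {"bank": 0.7, "institution": 0.2, "lender": 0.1},
}

def expand(word, context_word, threshold=0.05):
    """Return context-appropriate substitutes whose probability exceeds threshold."""
    dist = latent_given_context.get((word, context_word), {word: 1.0})
    return sorted(
        (w for w, p in dist.items() if p >= threshold and w != word),
        key=lambda w: -dist[w],
    )

print(expand("bank", "river"))  # ['shore', 'edge']
print(expand("bank", "money"))  # ['institution', 'lender']
```

Because the distribution is conditioned on the context, the same word receives different expansions in different sentences, which is what makes the expanded training data useful for disambiguation tasks.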
The second part of this thesis discusses weakly supervised learning in a <b>multimodal</b> setting. We develop information extraction methods to extract information from texts that describe an image or video, and use this extracted information as a weak annotation of the image or video. A first model for the prediction of entities in an image uses two novel measures: the salience measure captures the importance of an entity, depending on the position of that entity in the discourse and in the sentence, while the visualness measure captures the probability that an entity can be perceived visually, estimated from the WordNet database. We show that combining these measures results in an accurate prediction of the entities present in the image. We then discuss how this model can be used to learn a mapping from names in the text to faces in the image, and to retrieve images of a certain entity.
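A minimal sketch of how the two measures might be combined to rank entities is shown below. The entity names, the scores, and the product combination rule are all assumptions for illustration; the thesis model defines its own combination.

```python
# Illustrative sketch: rank entities mentioned in a caption by combining a
# salience score (discourse/sentence position) with a visualness score
# (e.g., derived from WordNet). Scores below are invented toy values in [0, 1].
entities = {
    "woman":   (0.9, 0.95),
    "idea":    (0.7, 0.05),  # salient in the text, but not visually perceivable
    "bicycle": (0.5, 0.90),
}

def visual_entity_score(salience, visualness):
    # A simple product combination; the actual combination is a modeling choice.
    return salience * visualness

ranked = sorted(entities, key=lambda e: -visual_entity_score(*entities[e]))
print(ranked)  # ['woman', 'bicycle', 'idea']
```

The product has the intended effect: an entity that is salient in the discourse but abstract (like "idea") drops below a less salient but clearly visible object.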
We then turn to the automatic annotation of video. We develop a model that annotates a video with the visual verbs and their visual arguments, i.e. actions and arguments that can be observed in the video. The annotations of this system are successfully used to train a classifier that detects and classifies actions in the video. A second system annotates every scene in the video with the location of that scene. This system comprises a multimodal scene cut classifier that combines information from the text and the video, an information extraction algorithm that extracts possible locations from the text, and a novel way to propagate location labels from one scene to another, depending on the similarity of the scenes in the textual and visual domains.
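The label propagation step can be sketched as a nearest-neighbour assignment: a scene without an extracted location inherits the label of its most similar labeled scene. The scene names, similarity values, and the greedy rule are illustrative assumptions, not the thesis algorithm.

```python
# Hedged sketch of propagating location labels between scenes. A scene with no
# location extracted from the text inherits the label of its most similar
# labeled scene, where similarity combines textual and visual cues.
# All values below are invented for illustration.
scenes = ["s1", "s2", "s3"]
labels = {"s1": "kitchen", "s3": "street"}  # locations extracted from the text
similarity = {                              # combined text + video similarity
    ("s2", "s1"): 0.8,
    ("s2", "s3"): 0.3,
}

def propagate(scene):
    """Return the scene's own label, or the label of its most similar labeled scene."""
    if scene in labels:
        return labels[scene]
    best = max(
        (s for s in scenes if s in labels),
        key=lambda s: similarity.get((scene, s), 0.0),
    )
    return labels[best]

print(propagate("s2"))  # 'kitchen'
```

In practice the propagation would be iterative and soft rather than a single greedy lookup, but the sketch shows the core idea: cross-scene similarity lets sparse textual location mentions cover the whole video.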