
Simultaneous Joint and Conditional Modeling of Documents Tagged from Two Perspectives

Principal Software Engineer at SmartFocus US, Inc.
Jan. 15, 2012

  1. Simultaneous Joint and Conditional Modeling of Documents Tagged from Two Perspectives. Pradipto Das, Rohini Srihari and Yun Fu (SUNY Buffalo). CIKM 2011, Glasgow, Scotland
  2. Ubiquitous Bi-Perspective Document Structure (Wikipedia): words in the body indicative of important Wiki concepts, and actual human-generated Wiki category tags – words that summarize/categorize the document
  3. Ubiquitous Bi-Perspective Document Structure (StackOverflow): words indicative of questions, words indicative of answers, and actual tags for the forum post – even tag frequencies are given!
  4. Ubiquitous Bi-Perspective Document Structure (Yahoo! Flickr): words indicative of the document title, words indicative of the image description, and actual tags given by users
  5. Understanding the Two Perspectives: what if the documents are plain text files? (Example: a news article)
  6. Understanding the Two Perspectives  Imagine browsing over reports in a topic cluster (a news article): “It is believed US investigators have asked for, but have been so far refused access to, evidence accumulated by German prosecutors probing allegations that former GM director, Mr. Lopez, stole industrial secrets from the US group and took them with him when he joined VW last year. This investigation was launched by US President Bill Clinton and is in principle a far more simple or at least more single-minded pursuit than that of Ms. Holland. Dorothea Holland, until four months ago, was the only prosecuting lawyer on the German case.”
  7. Understanding the Two Perspectives  What words can we remember after a first browse? From the same news article, the words that stick are: German, US, investigations, GM, Dorothea Holland, Lopez, prosecute – the “document level” perspective
  8. Understanding the Two Perspectives  What helped us generate the document-level perspective? The “word level” perspective: named entities (LOCATION, MISC, ORGANIZATION, PERSON) and important verbs with their dependents (“what happened?”), annotated over the news article, give rise to the document-level words German, US, investigations, GM, Dorothea Holland, Lopez, prosecute
  9. What if we turn the document off?  Summarization power of the perspectives: with the article text hidden, the document-level words (German, US, investigations, GM, Dorothea Holland, Lopez, prosecute) together with the sentence boundaries still summarize the article
  10. Hypothesis • Documents are tagged from at least two different perspectives – either implicit or explicit – and one perspective affects the other – Simplest example of implicit WL tagging – binned positions indicating sections: the news article’s text falls into Begin (0), Middle (1) and End (2) bins – Simplest example of implicit DL tagging – a tag cloud (e.g. tagcrowd.com). The “word level” (WL) tags are usually some category descriptions
  11. How can the bi-level perspective be exploited?  Can we generate category labels for Wikipedia documents by looking at image captions?  Can we use images to label latent topics?  Can we build a topic model that incorporates both perspectives simultaneously?  the choice of document-level tags and its impact on performance  Can supervised and unsupervised generative models work together?
  12. Example – A Wikipedia Article on “fog” (sections marked with position bins 0, 1 and 2). Categories: Weather hazards to aircraft | Accidents involving fog | Snow or ice weather phenomena | Fog | Psychrometrics – labels by human editors
  13. The Wikipedia Article on “fog”  Take the first category label – “weather hazards to aircraft”  “aircraft” doesn’t occur in the document body!  “hazard” only appears in a section label that reads “Visibility hazards”  “weather” appears only 6 of its 15 times in the main body  However, if we look at the images, the concept of fog is related to concepts like fog over the Golden Gate Bridge, fog in streets, poor visibility and quality of air. Labels by the model from the title and image captions – Categories: fog, San Francisco, visible, high, temperature, streets, Bay, lake, California, bridge, air. Labels by human editors – Categories: Weather hazards to aircraft | Accidents involving fog | Snow or ice weather phenomena | Fog | Psychrometrics
  14. The Family of Tag-Topic Models • TagLDA: an occurrence of a word depends on how much of it is explained by a topic k and a WL tag t  Intuitively (figure: train/sample sets of large (L) and small (S) balls): LDA’s learnt “purple” topic can generate all 4 large balls with high probability; TagLDA learns the “purple” topic better, based on a constraint – it will generate a mix of large and small balls with high probability (a sketch of the factored word distribution appears after the slides)
  15. Faceted Bi-Perspective Document Organization (figure): topics conditioned on different section identifiers (WL tag categories); correspondence of DL tag words with content; topic marginals over image caption words; topic labeling
  16. The Family of Tag-Topic Models (figure): MMLDA, TagLDA and CorrMMLDA are the building blocks; METag2LDA combines TagLDA and MMLDA; CorrMETag2LDA combines TagLDA and CorrMMLDA. MM = Multinomial + Multinomial; ME = Multinomial + Exponential
  17. The Family of Tag-Topic Models • METag2LDA: a topic generating all DL tags in a document doesn’t necessarily mean that the same topic generates all words in the document • CorrMETag2LDA: a topic generating *all* DL tags in a document does mean that the same topic generates all words in the document – a considerable strong point (a toy generative sketch appears after the slides). Plate-diagram legend: topic concentration parameter; document-specific topic proportions; indicator variables; document content words; Document Level (DL) tags; Word Level (WL) tags; topic parameters; tag parameters
  18. Experiments  Wikipedia articles with images and captions, manually collected across the concepts {food, animal, countries, sport, war, transportation, nature, weapon, universe, ethnic groups}  Tags used:  DL tags – image caption words and the article titles  WL tags – positions of sections binned into 5 bins (a binning sketch appears after the slides)  Objective: to generate category labels for test documents  Evaluation: – perplexity, to compare performance among the various TagLDA models – WordNet-based similarity between actual category labels and model output
  19. Evaluations – Held-out Perplexity (chart: perplexity in millions, K=20/50/100/200, for MMLDA, TagLDA, corrLDA, METag2LDA and corrMETag2LDA on the selected Wikipedia articles; the perplexity definition appears after the slides)  WL tag categories – section positions in the document  DL tags – image caption words and article titles  TagLDA perplexity is comparable to MM(METag2)LDA: the image caption words plus article titles and the content words are independently discriminative enough  CorrMM(METag2)LDA performs best, since almost all image caption words and the article title for a Wikipedia document are about a specific topic, and the correspondence assumption is accepted by the model with much higher confidence
  20. Evaluations – Application End-Goals (chart: inverse hop distance in the WordNet ontology, K=20/50/100/200, for METag2LDA and corrMETag2LDA, AverageDistance and BestDistance)  Top 5 words from the caption vocabulary are chosen; max weighted average = 5, max best = 1  METag2LDA almost always wins by narrow margins  METag2LDA reweights the vocabulary of caption words and article titles that are about a topic, and hence may miss specializations relevant to the document within the top 5  In the WordNet ontology, specializations lead to larger hop distances  Ontology-based scoring helps explain connections of caption words to ground truths, e.g. skateboard → skate → glide → snowboard (a hop-distance sketch appears after the slides)
  21. Evaluations – Held-out Perplexity (charts: perplexity in millions, K=40/60/80/100, for MMLDA, METag2LDA, corrLDA, corrMETag2LDA and TagLDA on the DUC05 newswire dataset; recent experiments with TagLDA included)  WL tag categories – named entities  DL tags – abstract coherence markers like “subj” → “obj”, e.g. “Mary/Subj taught the class. Everybody liked Mary/Obj.” [coreference resolution ignored]  Abstract markers like “subj” → “obj” acting as the DL perspective are not document-discriminative markers; rather, they indicate a semantic perspective of coherence which is intricately linked to words  Topics are influenced both by non-sparse document-level coherence indicators like “subj” → “obj”, “subj” → “--”, etc. AND by document-level co-occurrence  Ignoring the DL perspective completely leads to a better fit by TagLDA, due to variations in word distributions only (a role-transition sketch appears after the slides)
  22. Evaluations – Application End-Goals (chart: PERSON named-entity coverage on the DUC05 data, K=40/60/80/100, for METag2LDA and CorrMETag2LDA)  Two PERSON NEs in the same docset, i.e. manual topic set, are related (G pairs in total); “A_B”, “A” and “B” are treated as separate PERSON NEs  For each docset in the DUC05 data: create a set of best topics for the docset, pull out the top PERSON-NE pairs from the PERSON-NE facets, and find how many matched over all documents in the docset (M in total)  Win over baseline = M/G, averaged over all docsets (a scoring sketch appears after the slides)  CorrMETag2LDA wins here because of the nature of the DL perspective (role transitions like “Subj → Obj” coherence markers): more topics are pulled out that group more PERSON NEs across documents, improving recall
  23. Model Usefulness and Applications • Applications – Document classification using reduced dimensions – Find faceted topics automatically through word level tags – Learn correspondences between perspectives – Label topics through document level multimedia – Create recommendations based on perspectives – Video analysis: word prediction given video features – Tying “multilingual comparable corpora” through topics – Multi-document summarization using coherence – E-Textbook aided discussion forum mining: • Explore topics through the lens of students and teachers • Label topics from posts through concepts in the e-textbook
  24. Summary • Flexible family of topic models that integrate a partitioned space of DL tags and words with WL tag categories – supervised models can collaborate with unsupervised generative models, i.e. the supervised models can be improved independently • Captioned multimedia objects like images, video and audio can provide intuitive latent-space labeling – a picture is worth a thousand words • Obtain “facets” in topics • As always, held-out perplexity should not be the sole judge of end-task performance
  25. Thanks! Special thanks to Jordan Boyd-Graber for useful discussions on TagLDA parameter regularizations
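
The constraint intuition on slide 14 can be sketched concretely. Below is a minimal Python sketch assuming a TagLDA-style word distribution that is a log-linear combination of a topic factor and a WL-tag factor; the names beta_k and gamma_t are illustrative placeholders, not the paper's notation.

    import numpy as np

    def tagged_word_distribution(beta_k, gamma_t):
        # p(w | topic k, WL tag t): every word occurrence is explained
        # partly by the topic and partly by the word-level tag.
        logits = beta_k + gamma_t   # log-scale factors over the vocabulary
        logits -= logits.max()      # guard against overflow in exp
        p = np.exp(logits)
        return p / p.sum()          # normalize to a distribution

Under such a parameterization a topic cannot put high mass on a word unless the word is also plausible under the observed WL tag, which is exactly the large-vs-small-ball constraint in the slide.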
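
The correspondence assumption on slide 17 can be illustrated with a toy generative sketch in the spirit of CorrMETag2LDA, following the Corr-LDA idea that each DL tag is emitted from the topic of some content word. All variable names and the exact factorization are assumptions for illustration, not the model's published specification.

    import numpy as np

    rng = np.random.default_rng(0)

    def generate_document(alpha, beta, gamma, eta, wl_tags, num_dl_tags):
        # alpha: scalar Dirichlet prior; beta[k], gamma[t]: log-scale
        # topic and WL-tag factors over the word vocabulary;
        # eta[k]: per-topic distribution over the DL-tag vocabulary;
        # wl_tags: observed WL tag index for each word position.
        K = beta.shape[0]
        theta = rng.dirichlet(alpha * np.ones(K))  # doc topic proportions
        topics, words = [], []
        for t in wl_tags:                          # generate content words
            k = rng.choice(K, p=theta)
            p_w = np.exp(beta[k] + gamma[t])
            p_w /= p_w.sum()
            topics.append(k)
            words.append(rng.choice(len(p_w), p=p_w))
        dl_tags = []
        for _ in range(num_dl_tags):               # correspondence step:
            y = rng.integers(len(topics))          # pick a word position,
            k = topics[y]                          # reuse its topic
            dl_tags.append(rng.choice(eta.shape[1], p=eta[k]))
        return words, dl_tags

Because every DL tag reuses the topic of some content word, a topic that explains the DL tags must also explain words in the document, which is the "does mean" half of the slide's contrast.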
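
Slide 18's WL tags (positions of sections binned into 5 bins) admit a one-line mapping. This sketch assumes sections are indexed 0..num_sections-1, which the slides do not specify.

    def wl_tag_for_section(section_index, num_sections, bins=5):
        # Map a section's position in the article to one of `bins`
        # word-level tags (0 = beginning, bins-1 = end).
        return min(bins - 1, section_index * bins // max(1, num_sections))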
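
The held-out perplexity plotted on slides 19 and 21 is presumably the standard definition over test documents (the slides do not spell out the formula):

    \mathrm{perplexity}(D_{\mathrm{test}}) = \exp\!\left( - \frac{\sum_{d=1}^{D} \log p(\mathbf{w}_d)}{\sum_{d=1}^{D} N_d} \right)

where N_d is the number of word tokens in test document d; lower is better.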
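
Slide 20's WordNet scoring can be approximated with NLTK's hypernym path distances. This sketch computes only the best (smallest) hop distance between two words; the paper's exact averaging and weighting are not reproduced here. It requires nltk with the wordnet corpus downloaded (nltk.download('wordnet')).

    from nltk.corpus import wordnet as wn

    def best_hop_distance(word_a, word_b):
        # Smallest shortest-path (hop) distance between any pair of
        # synsets of the two words, or None if they are unconnected.
        best = None
        for s1 in wn.synsets(word_a):
            for s2 in wn.synsets(word_b):
                d = s1.shortest_path_distance(s2)
                if d is not None and (best is None or d < best):
                    best = d
        return best

This mirrors the slide's skateboard → skate → glide → snowboard example: specializations show up as longer hop chains, hence larger distances.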
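
Slide 21's DL tags are entity role transitions. Here is a hedged sketch of extracting "subj → obj"-style markers, assuming each sentence has already been reduced to an entity-to-grammatical-role map (the slides ignore coreference resolution, and so does this sketch).

    def role_transitions(sentence_roles):
        # sentence_roles: list of dicts mapping entity -> role
        # ('subj', 'obj', ...) for consecutive sentences. Emits one
        # transition marker per entity per sentence, using '--' when
        # the entity was absent in the previous sentence.
        markers = []
        previous = {}
        for roles in sentence_roles:
            for entity, role in roles.items():
                markers.append(f"{previous.get(entity, '--')} -> {role}")
            previous = roles
        return markers

For example, role_transitions([{'Mary': 'subj'}, {'Mary': 'obj'}]) yields ['-- -> subj', 'subj -> obj'], matching the "Mary/Subj ... Mary/Obj" example on the slide.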
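
The win-over-baseline score on slide 22 reduces to a per-docset ratio. A sketch, assuming gold and model PERSON-NE pairs are given as sets (all container choices here are illustrative):

    def ne_pair_coverage(gold_pairs_by_docset, model_pairs_by_docset):
        # G = related gold PERSON-NE pairs in a docset,
        # M = gold pairs the model's top topic facets recover;
        # score = M/G averaged over docsets.
        scores = []
        for docset, gold in gold_pairs_by_docset.items():
            matched = gold & model_pairs_by_docset.get(docset, set())
            scores.append(len(matched) / len(gold) if gold else 0.0)
        return sum(scores) / max(1, len(scores))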

Editor's Notes

  1. Hyperlinked text in the body represents word-level tags; categories represent document-level tags
  2. Word-level tags: question/answer. Doc-level tags: actual tags for the forum post
  3. Word-level tags: title, image description. Doc-level tags: tags given by users
  4. Document about investigations. We don’t have annotations, but let’s see how they can be built up!
  5. Words to the right are relevant to the topic of the document set – mostly by frequency
  6. Since documents are mostly about some events, certain words strike us – NEs mentioned frequently and across sentences, and dependencies between subjects and objects of the important verbs from the document set
  7. The word-level and doc-level tagged words alone are sufficient to summarize the document as bags of words
  8. I don’t think we need this slide. I should explain these points while showing the previous slide!
  9. Cons: collocations need to be addressed; chains don’t involve causality, e.g. (fogs & accidents, [hop length = 12])
  10. Within the family of (corr)MM(E)(Tag2)LDAs modeling joint observations, corrMETag2LDA performs best