All good things

17,167
views

Published on


And out again I curve and flow
To join the gleaming seer,
For models may come and models may go,
But I go on for ever.

Published in: Technology, Education

Notes for slides
  • We are shaping a problem space. Each node is a problem and each peak represents a possible solution to that problem. Each problem has associated with it several smaller problems which need to be solved along the way, giving rise to the mountainous terrain. We do not actually see this landscape beforehand; we shape it as we move forward. A PhD candidate has to go from one peak to another to get a view of the entire landscape, from where the candidate can put the landscapes created by other luminaries in the field in perspective. So this is my long journey, and I did not want to get stuck on one peak only and explore low-lying hills (similar to writing one paper and then merely extending it). To create a landscape, we need tools to make the roads and clear away obstacles. But once that is done, it allows other researchers and practitioners to make use of the road infrastructure to build communities and businesses, if the peaks are interesting enough to attract visitors, and, of course, to go from one place to another with ease. So, let’s get started… the stories of my journey will need some time to be told… and…
  • The answers are coming in the next one and a half hours.
  • A very recent talk by David Blei, who is considered to be the father of topic modeling research, also listed the importance of the problem we tackled as one of the open problems
  • What do topic models do? For sure, they can identify signature words from a corpus of documents in a data-driven way. Also, you can figure out which of these topics belong to which classes of documents if you have that information. And people really wanted this for a long time!
  • These models are language agnostic (multilingual capability). Imagine automatically producing a larger font on some important words in an HTML document: easily done not just from the words alone, but also justified through their coherence properties.
  • From just counts to richness
  • Each node is labeled with a word and each hill brings related nodes together, with the closeness related to the length of the edges between them. Imagine all such points lying on a flat piece of paper on a uniform 2D grid of equal-length edges. Our job is to re-arrange the nodes, connect them according to their closeness and create the triangulations so that we can discover the landscape shown here. And we have to do this without the model ever having any idea of the 3D landscape. This brings us to an important question…
  • + Success of LDA + Almost 660 citations/year! + Really widely extended and applied in different contexts
  • But the success of LDA has really been in its generalization performance when fitting unseen documents to the trained topic space: much better generalization performance than PLSA or LSA. LDA can find a basis for distributions over topics, unlike SVD, which assumes one topic per document or computes a span over the topic vectors. Models improved and they became more and more complex…
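  • As a quick illustration of fitting held-out documents to a trained topic space, here is a minimal sketch using gensim's LDA (the toy texts are hypothetical; the thesis experiments used their own variational inference code, not gensim):
```python
# Fit LDA on a tiny toy corpus and score a held-out document (gensim assumed installed).
from gensim.corpora import Dictionary
from gensim.models import LdaModel

train = [["rocket", "nozzle", "fuel"], ["launch", "mission", "orbit"],
         ["rocket", "motor", "insulation"], ["space", "vehicle", "orbit"]]
heldout = [["rocket", "insulation", "fiber"]]          # unseen document

vocab = Dictionary(train)
lda = LdaModel([vocab.doc2bow(d) for d in train], num_topics=2, id2word=vocab, passes=10)
print(lda.log_perplexity([vocab.doc2bow(d) for d in heldout]))  # held-out per-word bound
```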
  • + Comparison of model complexities + Y-axis = Hair Loss axis, X-axis = model complexity
  • HL = Hair Loss axis. All of these models address the common problem of looking at central tendencies of data.
  • Why do we want to explore? We want to explore because we seek wisdom from everything that is happening around us. But where to start? Well, as Yoda points out, we can start at the centers of the data. Your wisdom: each one of us has our own model of wisdom that gets shaped through our personal exploration of the world around us; each one of us assumes that there is some hypothesis which gives rise to the data around us. Centers of data: a big data problem, lots of data around us, but which ones are meaningful? We need statistics from the data that meaningfully encode multiple views, i.e. modalities. Sufficient statistics (i.e. the function of a sample that encodes all information about the sample) usually represent the centers of the data.
  • + Let’s start at the central tendencies… + We want to go beyond words to full clips to visualize topics! We have devices which continuously capture data, and we seek wisdom from such large amounts of data. Wisdom is really about looking at a set of representative examples (centers); wisdom encodes variance in information compactly and completely, and this improves decision making.
  • What do the centers look like? These are actual outputs from one of our models: the ground-truth synopses overlaid on the multimedia topics obtained from training data.
  • Assume each data point has an associated binary labelBut we have no training data which is representative of the classes----------------------------------------------
  • With labels, we can optimize a loss function (similar to interpolation and extrapolation). But we do not have labels, so we need to make assumptions about some function of the data only, which summarizes all observations and how the observations vary from that summary, i.e. find the location and scale estimates as best as possible. Let’s choose the algorithm to be K-means, which yields a simple hypothesis set: assign each point x to cluster \arg\min_{i \in \{0,1\}} d(x, \mu_i).
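  • For concreteness, a minimal sketch of this K-means assignment and center update (toy data and names are hypothetical; NumPy assumed):
```python
import numpy as np

def assign_clusters(X, centers):
    # Each point goes to its nearest center: argmin over squared distances d(x, mu_i).
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)   # shape (N, K)
    return d2.argmin(axis=1)

def update_centers(X, labels, K):
    # New center = mean of the points assigned to it (the location estimate).
    return np.vstack([X[labels == k].mean(axis=0) for k in range(K)])

X = np.random.randn(14, 2)                                # 14 unlabeled points, as in the cartoon
centers = X[np.random.choice(len(X), 2, replace=False)]   # initialize from the data
for _ in range(10):                                       # alternate assignment and update steps
    labels = assign_clusters(X, centers)
    centers = update_centers(X, labels, 2)
```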
  • Let’s sample one more point from the ground-truth blue class and see whether K-means made the correct decisions based on limited samples, but without any ground-truth knowledge.
  • Clearly there are two misclassifications
  • We can have additional observed constraints on X: e.g. the words are structured into books as a collection of sections, each of which can focus on an idea in a coherent fashion. These structures give rise to co-occurrence, which has been exploited before in IR for thesaurus construction. The better the structure, the better the read; look at the Egyptian man, whose hair grew white just by scrolling through the scrolls. Rodin’s thinker replies to Dr. Corso’s Twitter comment on our CVPR paper with “#pow is in #doing”. There is an inherent partitioning of the linear position space of all words; this partitioning is the result of some sort of authorship (LDA with many authors = author topic model).
  • The success behind LDA is really about a balancing act. It is not easy to balance perfectly: x_9 and x_10 can be misclassified, since LDA may want to allocate as few topics to d_2 as possible and so chooses the red topic. Well, at least now we know why NikWalLenDA can rope-walk so easily.
  • Summarization problem (see TAC competitions from NIST)
  • + Earlier research on discourse analysis was mainly used for co-reference resolution + It has some really intriguing ideas! For a sequence of utterances to be a discourse, it must exhibit coherence. If we denote U_n and U_{n+1} to be adjacent utterances, the backward-looking center of U_n, denoted C_b(U_n), represents the entity currently being focused on in the discourse after U_n is interpreted. The forward-looking centers of U_n, denoted C_f(U_n), form an ordered list containing the entities mentioned in U_n, all of which can serve as C_b(U_{n+1}). In general, however, C_b(U_{n+1}) is the most highly ranked element of C_f(U_n) mentioned in U_{n+1}. The C_b of the first utterance in a discourse is undefined. Brennan et al. use the following ordering: Subject > Existential predicate nominal > Object > Indirect object or oblique > Demarcated adverbial PP
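  • A minimal sketch of that backward-looking center rule (entity/role inputs are hypothetical; in practice they come from parsing and coreference resolution):
```python
# Rank forward-looking centers by the Brennan et al. ordering, then pick
# Cb(U_{n+1}) as the highest-ranked element of Cf(U_n) mentioned in U_{n+1}.
ROLE_RANK = {"subject": 0, "existential": 1, "object": 2, "oblique": 3, "adverbial_pp": 4}

def forward_centers(entities):
    """Cf: the entities of an utterance ordered by grammatical-role rank."""
    return [e for e, _ in sorted(entities, key=lambda er: ROLE_RANK[er[1]])]

def backward_center(cf_prev, entities_next):
    """Cb of the next utterance; None models 'undefined' (e.g. for the first utterance)."""
    mentioned = {e for e, _ in entities_next}
    return next((e for e in cf_prev if e in mentioned), None)

u1 = [("Bob", "subject"), ("dealership", "object")]
u2 = [("John", "subject"), ("Fords", "object"), ("Bob", "oblique")]   # "his" resolves to Bob
print(forward_centers(u1))                                # ['Bob', 'dealership']
print(backward_center(forward_centers(u1), u2))           # 'Bob'
```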
  • + Inducing a coherence flow comes through a lot of good writing practice + Imputing a paragraph with salient concepts comes first to the minds of most authors, and they tend to focus on the topic, which here is {house, door, furniture, burglary}
  • + Incorporating coherence this way does not necessarily lead to the final summary being coherent + Coherence is best handled as a post-processing step using the Traveling Salesman Problem [Conroy et al.] + There are lots of open questions on just multi-document summarization itself… But what I really wanted is…
  • to “see” what topics mean. + Interpreting topics can still be tedious + Most LDA models ignore metadata even when it is useful. These are actual outputs: this is a tough event for matching words with frames. The event is “Working on a sewing project”.
  • This is again another tough event to match words with frames. The event is “Repairing an appliance”
  • + Describing a domain specific video with annotated keywords! This can be useful in robotic vision!
  • + allowing robots and video recording devices to communicate at a human level
  • Moving on – PART II + At this point I was not sure where I should be moving; I had only a very vague idea! + And you actually don’t know if there *are* other peaks! + As Yoda pointed out… “Clouded your future is!”
  • + So now let’s visit the document space again + We look at another model, TagLDA, which can incorporate a certain kind of domain knowledge into LDA + Document-partitioned words have associated annotations -> this gives rise to two different distributions over words, and each distribution affects the other + A word is observed under the effects of both these distributions
  • What does this representation buy us?+ Goal is to assign x_9 and x_10 to their correct cluster with the use of domain knowledge+ x_5 and x_10 are annotated with the orange label and x_5 co-occurs with x_9 both in d_1 and d_2+ It is thus likely that x_5, x_9 and x_10 belong to the same class since both documents d_1 and d_2 should contain as few topics as possible
  • + Fitting a model amounts to forming a hypothesis which can best explain a set of observations + TagLDA implicitly expands the hypothesis space of topics to search for the best explanation needed to describe the observations, with the help of the annotations from domain knowledge
  • What if we assume that there is an additional perspective over d_i w.r.t. x'? Is this an unnatural assumption?
  • Well, not at all! Word-level tags: hyperlinked text in the body. Document-level tags: categories.
  • word level tags: question/answerdocument level tags: actual tags for the forum post
  • word level tags: title, image descriptiondocument level tags: tags given by users
  • Is the bi-perspective nature of documents ubiquitous?
  • We don’t have annotations, but let’s see how they can be built up! It seems like this is a document on the investigation of industrial espionage.
  • Words to the right are relevant to the topic of the document set – mostly by frequency
  • Natural language processing based content annotation. Since documents are mostly about some events, certain words strike us: named entities mentioned frequently and across sentences, and dependencies between the subjects and objects of the important verbs from the document set.
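  • A minimal sketch of this kind of NLP-based annotation using spaCy (the model name is the standard small English model; the thesis used its own NER and parsing tooling, so this is only an illustration):
```python
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")
doc = nlp("US investigators asked German prosecutors for evidence. "
          "Prosecutors allege that Mr. Lopez stole industrial secrets.")

entity_counts = Counter(ent.text for ent in doc.ents)       # frequently mentioned named entities
verb_frames = [(tok.lemma_,                                  # important verbs with their
                [c.text for c in tok.children if c.dep_ in ("nsubj", "dobj")])
               for tok in doc if tok.pos_ == "VERB"]         # subject/object dependents
print(entity_counts.most_common(5))
print(verb_frames)
```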
  • The word- and document-level tagged words alone are sufficient to summarize the document as bags of words. So are we done with the summarization problem?
  • And now we want things like these! If you are in doubt, ask any member of Dr. Corso’s VPML Lab. But: + High-level descriptions are complex + Spoken language is complicated, with high degrees of paraphrasing. The translator does consciously what the author did instinctively.
  • How can bi-perspective topic models be exploited? The experiments really started off by looking at the image captions and category labels.
  • This slide is self explanatory
  • Some people call it a mere combination. But I say it is e-Harmony!
  • We now cover a particular METag^2LDA model. \pi: tag (i.e. word annotation) distribution over words. \beta: topic distributions over words. \mu and \sigma are fixed regularizers, i.e. some fixed priors that help in proper scaling of the parameters during optimization.
  • The joint probability distribution belongs to the exponential family, following the Maximum Entropy principle. In the original model, the hidden variables and parameters are coupled, leading to an exponential state space to search for the right posterior over the hidden variables.
  • Delete all observations and edges which lead to the passages of the Bayes Ball being obstructed---------------------------------
  • This results in decoupling of the variables over which the posterior needs to be computed. + The more the decoupling, the more tractable the inference.
  • + We use fixed regularizers here + Introducing exponential family priors for \pi and \beta would need more complicated inference machinery + There are several other approximation techniques to compute posteriors and hence marginals; Mean Field is a deterministic local optimization technique, but celebrities have endorsed it.
  • Even Adrian Monk likes Mean Field factorization!
  • And now let me introduce our friend, the Mixture of Gaussians for real-valued data. + Keep x_1 and x_2 fixed and try to explain the two samples through different location parameters of the Gaussians via the log likelihood + The two surfaces are the error surfaces of the mixture model likelihood for x_1 and x_2 individually + For discrete data, the mean parameters of the generating distribution are not discrete
  • Mixture of two Gaussians model. + Keep the two true location parameters fixed and try to explain samples generated at different distances from the two means through the log likelihood + There is a relation between the parameters of the distribution over the data (usually unknown) and the sufficient statistics as a function of the data only. Which leads us to…
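  • A minimal numerical sketch of that two-component Gaussian mixture log likelihood (equal mixing weights and unit variance; the means and sample points are hypothetical; SciPy assumed):
```python
import numpy as np
from scipy.stats import norm

def mixture_loglik(x, mu1, mu2, sigma=1.0, pi=0.5):
    # log p(x) = log( pi * N(x; mu1, sigma^2) + (1 - pi) * N(x; mu2, sigma^2) )
    return np.log(pi * norm.pdf(x, mu1, sigma) + (1 - pi) * norm.pdf(x, mu2, sigma))

mu1, mu2 = -2.0, 2.0                      # the two (fixed) location parameters
for x in [-2.0, 0.0, 2.0, 4.0]:           # samples at different distances from the means
    print(x, mixture_loglik(x, mu1, mu2))
```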
  • Mean parameters = expected sufficient statistics. Field = energy arising out of interactions with neighboring nodes (in mathematics a field is nothing but a space). The \mu_e (the red dots) are the extreme points of the polytope; each is a function of the sufficient statistics \Upsilon(z,x) for fixed x. When we optimize over this space, we select one of these red dots and, corresponding to it, there is an optimal mean parameter \mu^{\star}. Suppose we have the complete data as (Z,X), where Z = hidden variables and X = observations. M(G) is the mean parameter space corresponding to the expected sufficient statistics of the hidden variables in the original graphical model G. For discrete distributions, M(G) is a convex polytope due to the intersection of finitely many linear inequalities, i.e. half spaces. For each fixed x and p_{\theta} there is a \mu, and as p is varied holding x fixed, the set M is formed. \mu provides an alternative parameterization of the exponential family distribution, and any mean parameter in interior(M(G)) yields a lower bound on A(\theta). Such a mean parameter can be the mean parameter of a distribution whose moments can be easily computed, e.g. factored distributions, and those assumptions lead to a non-convex domain over which optimization is performed. A cartoon constraint is shown in the upper right corner. Z|x ~ Mult(\theta); \Upsilon(z) is the sufficient statistic for z. Log partition functions play an important role in the mapping of \mu to \theta and vice versa. M_F(G) is a subset of M(G) having only the extreme points in common and dependent on the factorization F over Z, which allows discovery of this backward mapping in finite time. The easiest implementation of the mean field principle is to consider no direct dependencies between the distributions of the hidden variables.
  • Classic estimator-finding problem: maximize the log likelihood, whose objective includes the empirical mean and the log partition function. Classic theorem: maximize over \mu, given a set of observations x, to get as close to \theta as possible. \mu is dependent on the sufficient statistics associated with the variables whose likelihood we need to maximize. A(\theta) is the log partition function expressed in terms of the dual A*(\mu). The dual A*(\mu) equals the negative entropy of the distribution with those mean parameters when the latter belong to interior(M(G)). The relation between the derivatives of the log partition functions of the primal and the dual is shown in the lower left corner.
  • Mean field approximation to the joint p(z_1, z_2, z_3): a product of independent Bernoulli distributions over the z_i. In this case, the mean field distributions are exactly in the same exponential family as the true distributions. Write down each q in exponential form with its log partition function (as a function of the canonical parameters in this case). Solve for A*(\mu) using the maximum over the dual formulation, yielding \theta(\mu) and A*(\mu). Solving for A(\theta) using A*(\mu) yields \mu(\theta) = exp(\theta)/(1+exp(\theta)).
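  • A minimal numerical check of these Bernoulli mappings (all values hypothetical; NumPy assumed): A(\theta) = log(1 + e^\theta), \mu(\theta) = e^\theta/(1+e^\theta), A*(\mu) = \mu log \mu + (1-\mu) log(1-\mu), and A(\theta) = \theta\mu - A*(\mu) at \mu = \mu(\theta):
```python
import numpy as np

def A(theta):            # Bernoulli log partition function
    return np.log1p(np.exp(theta))

def mu_of_theta(theta):  # forward mapping: canonical -> mean parameter
    return np.exp(theta) / (1.0 + np.exp(theta))

def A_star(mu):          # dual = negative entropy of Bernoulli(mu)
    return mu * np.log(mu) + (1 - mu) * np.log(1 - mu)

theta = 0.7
mu = mu_of_theta(theta)
print(A(theta), theta * mu - A_star(mu))   # the two values agree (conjugate duality)
```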
  • The goal is to find \mu from the sufficient statistics of Z. In practical problems there are exponentially many extreme points over all realizations of the sufficient statistics. This shows a cartoon illustration of the solution for mean parameters using linear programming.
  • Unfortunately, the set of \mu's under the factorization assumption is a strict subset of M(G). This subset itself would also be convex if it did not have to match its extreme points to those of the enclosing set. The region over which optimization for \mu needs to happen under the tractable-distribution assumption is thus non-convex; this means that we won't get a globally optimal solution.
  • Let us now look at the relation of this formulation to the mean field formulation of METag^2LDA. Here \theta^T \mu = \theta^T \mathbb{E}_q[\Upsilon(\theta, y, z)], which gives rise to terms such as \sum_{k=1}^{K} \phi_{m,k} \log \beta_{k, w_m} (the expectation of I_k[z_m] \log \beta_{k, w_m} under q), and -A^*(\mu) = H(q) = -\sum q \log q.
  • The big red box is the ELBO. Top: E-step inner loop (update the variational distributions for every document). Bottom: M-step parameter updates based on the mean parameters of the associated document-dependent sufficient statistics. For \beta and \pi, we only do MAP estimation here, corresponding to fixed priors which act as regularizers.
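  • A minimal, runnable sketch of the per-document E-step inner loop, written for plain LDA rather than METag^2LDA (the toy values of alpha, beta and the word counts are hypothetical; NumPy/SciPy assumed):
```python
import numpy as np
from scipy.special import digamma

def lda_e_step(word_ids, counts, alpha, beta, n_iters=50):
    """Coordinate ascent on the variational distributions (gamma, phi) for one document."""
    K = beta.shape[0]
    gamma = alpha + counts.sum() / K              # variational Dirichlet over topics
    phi = np.full((len(word_ids), K), 1.0 / K)    # variational multinomial per word
    for _ in range(n_iters):
        # phi_{n,k} is proportional to beta_{k, w_n} * exp(digamma(gamma_k))
        log_phi = np.log(beta[:, word_ids].T) + digamma(gamma)
        phi = np.exp(log_phi - log_phi.max(axis=1, keepdims=True))
        phi /= phi.sum(axis=1, keepdims=True)
        gamma = alpha + (counts[:, None] * phi).sum(axis=0)
    return gamma, phi

alpha = np.full(3, 0.1)                                   # 3 topics
beta = np.random.dirichlet(np.ones(20), size=3)           # topics over a 20-word vocabulary
gamma, phi = lda_e_step(np.array([0, 3, 7, 7]), np.array([1, 2, 1, 3]), alpha, beta)
```
The M-step would then re-estimate \beta (and, in the bi-perspective models, \pi) from the accumulated expected counts, with MAP smoothing from the fixed priors.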
  • All of this inference machinery *is needed* to generate exploratory outputs like this!
  • Non-Correspondence topic models vs. Correspondence topic models
  • + Within the family of (Corr)MM(E)(Tag2)LDAs modeling joint observations, Corr-METag2LDA performs best + We need to be careful about what kind of document-level tags we are considering: do those tags really collaborate in refining the topical perspective?
  • Cons: collocations need to be addressed; chains don't involve causality, e.g. (fogs & accidents, [hop length = 12])
  • So what’s next?
  • I never looked seriously at this paper “Modeling annotated data” until very late (around 2010)
  • From this to
  • This (actually the other way around)
  • Upper row – training (camera motion and shakes are a real problem for maintaining the bounding boxes)Lower row – trained models
  • + Role of alpha: alpha governs the latent proportion of topics from which a topic is drawn for every observation; alpha is a K-vector + Here each component of alpha is different, which helps assign different proportions of observations differently (e.g. one topic can focus solely on “stop-words”, another one on “commonly occurring words”, and the other ones on the different topics, etc.) + This helps identify a set of “basis” topic distributions, while SVD computes a span over topic distributions
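  • A minimal illustration of this role of alpha (hypothetical values; NumPy assumed): an asymmetric K-vector skews the per-document topic proportions drawn from the Dirichlet.
```python
import numpy as np

rng = np.random.default_rng(0)
alpha_symmetric = np.array([0.1, 0.1, 0.1])    # sparse proportions, no preferred topic
alpha_asymmetric = np.array([5.0, 0.5, 0.1])   # e.g. topic 0 reserved for "stop-word"-like terms
print(rng.dirichlet(alpha_symmetric, size=3))  # a few sampled topic-proportion vectors
print(rng.dirichlet(alpha_asymmetric, size=3))
```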
  • Translation formula (marginalization over topics). If there are two topics, i.e. K=2, then (e.g. for the 2nd term) 0.5*0.5 + 0.5*0.5 = 0.5 < 0*0.0001 + 0.9*0.9. The values of the inferred \phi's are very important for the real-valued data: separated Gaussians are better, but that does not always happen. This raises an issue where the real-valued data may need to be preprocessed to increase the chances of separation.
  • The sum over K is the marginalization over z in p(w,z)
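  • A minimal sketch of that topic marginalization for ranking the text vocabulary given a clip (all numbers are hypothetical; NumPy assumed):
```python
import numpy as np

phi_clip = np.array([0.0, 0.9])       # inferred topic responsibilities for the clip's features
beta = np.array([[0.5, 0.5],          # topic 0 over a 2-word text vocabulary
                 [0.0001, 0.9999]])   # topic 1 strongly prefers word 1
scores = phi_clip @ beta              # sum over topics of phi_k * p(word | topic k)
ranking = np.argsort(-scores)         # vocabulary permuted by predicted relevance
print(scores, ranking)
```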
  • Again all of these are needed to translate videos into text and vice versa
  • This is the core problem of video summarization
  • Psycholinguistic studies are needed to confirm, but that's not a concern at this point. In our dataset we have only one ground-truth summary; this is the base case for ROUGE evaluation.
  • There are no individual summaries for shots within the clip, only one high-level summary. Problems with shot-wise nearest-neighbor matching arise precisely for this reason. The dataset that we use for the video summarization task is released as part of NIST's 2011 TRECVID Multimedia Event Detection (MED) evaluation set. The dataset consists of a collection of Internet multimedia content posted to various Internet video hosting sites. The training set is organized into 15 event categories, some of which are: 1) Attempting a board trick 2) Feeding an animal 3) Landing a fish 4) Wedding ceremony 5) Working on a woodworking project etc. We use the videos and their textual metadata in all 15 events as training data. There are 2062 clips with summaries in the training set, with an almost equal distribution among the events. The test set which we use is called the Transparent Development (Dev-T) collection. The Dev-T collection includes positive instances of the first 5 training events and near-positive instances for the last 10 events, a total of 630 videos labeled with event category information (and associated human synopses which are compared against for summarization performance). Each summary is a short and very high-level description of the entire video and ranges from 2 to 40 words, but on average 10 words (with stopwords). We remove standard English stopwords and retain only the word morphologies (not required) from the synopses as our training vocabularies. The proportion of videos belonging to events 6 through 15 in the Dev-T set is much lower compared to the proportion for the other events, since those clips are considered to be "related" instances which cover only part of the event category specifications. The performance of our topic models is evaluated on those kinds of clips as well. The numbers of videos in events 6 through 15 in the Dev-T set are {4,9,5,7,8,3,3,3,10,8}, while there are around 120 videos per event for the first 5 events. All other videos in the Dev-T set neither have any event category label nor are identified as positive, negative or related videos, and we do not consider these videos in our experiments.
  • Test ELBOs on events 1-5 in the Dev-T set: measuring held-out log likelihoods on both videos and associated human summaries. Prediction ELBOs on events 1-5 in the Dev-T set: measuring held-out log likelihoods on just the videos, in the absence of the text. Lower inverse covariance contributes high positive values to the log likelihood, and the Gaussian entropy can be high too due to overlapping tails.
  • The HEXTAC scores can change from dataset to dataset but max around 40-45% for 100 word summaries
  • If we can achieve 10% of this for 10-word summaries, we are doing pretty well! Caveat: the text multi-document summarization task is much more complex than this simpler task (w.r.t. summarization).
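  • For reference, a minimal sketch of what a ROUGE-1 score measures for these short keyword summaries (single reference, plain unigram overlap; the stopword removal and stemming used in the actual evaluation are omitted here):
```python
from collections import Counter

def rouge_1(candidate, reference):
    cand, ref = Counter(candidate.lower().split()), Counter(reference.lower().split())
    overlap = sum(min(cand[w], ref[w]) for w in ref)
    return overlap / sum(ref.values()), overlap / sum(cand.values())   # recall, precision

print(rouge_1("man climbs indoor rock wall",
              "a man is climbing an artificial rock wall indoors"))
```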
  • Purely multinomial topic models showing lower ELBOs can perform quite well in BoW summarization. MMLDA assigns likelihoods based on successes and failures of independent events, and failures contribute highly negative terms to the log likelihoods, but this does not indicate the model's summarization performance, where low-probability terms are pruned out. Gaussian components allow different but related topics to model GIST features almost equally (strong overlap in the tails of the bell-shaped curves, i.e. the Gaussians) and show a poor permutation of predicted words due to the violation of the soft probabilistic constraint of correspondence (this also leads to higher entropy). The scaling of variables in these kinds of mixed-domain topic models needs to be looked at more closely.
  • To improve relevancy of the lingual descriptions generated for the domain specific test videos, we present… for the first time ever…
  • iAnalyze for your videos…
  • A computer science graduate should never have to cope with information twirling around his head! We need high quality tools to address this problem.
  • I took the late Amar Gopal Bose’s advice in preparing these slides: I took some time out to prepare them, leaving everything else behind. As Dr. Bose would say, “creativity never comes under emotional stress or tension. The real creativity comes when the mind finally relaxes and it is quiet and then you can focus.” Watch here [http://www.ndtv.com/video/player/news/remembering-amar-bose/282935?pfrom=home-topstories]. And yes, most of these slides were prepared with a Bose iOE2 headphone over my ears.
  • Transcript

    • 1. Natural Language Summarization of Text and Videos using Topic Models Pradipto Das PhD Dissertation Defense CSE Department, SUNY at Buffalo Rohini K. Srihari Sargur N. Srihari Aidong Zhang Professor and Committee Chair Distinguished Professor Professor and Chair CSE Dept., SUNY Buffalo CSE Dept., SUNY Buffalo CSE Dept., SUNY Buffalo Download this presentation from http://bit.ly/pdasthesispptx or http://bit.ly/pdasthesispptxpdf Primary committee members
    • 2. Using Tag-Topic Models and Rhetorical Structure Trees to Generate Bulleted List Summaries[journal submission] The Road Ahead (modulo presenter) Discovering Voter Preferences using Mixtures of Topic Models [AND Wkshp 2009] Simultaneous Joint and Conditional Modeling of documents Tagged from Two Perspectives [CIKM 2011] A Thousand Frames in just a Few Words: Lingual Descriptions of Videos through Latent Topic Models and Sparse Object Stitching [CVPR 2013] Translating Related Words to Videos and Back through Latent Topics [WSDM 2013] Introduction to LDA Learning to Summarize using Coherence [NIPS Wkshp 2009]
    • 3. • Stay hungry • Stay foolish The answers are coming within the next 60-75 minutes.. so.. Steve Jobs: Stanford Commencement Speech, 2005 there is great food, green tea and coffee at the back! But if you stay hungry I will happily grab the leftovers!
    • 4. Contributions of this thesis We can explore our data, extrapolate from our data and use context to guide decisions about new information Can we find topics from a corpus without human intervention? Can we use these topics to annotate documents and use annotations to organize, summarize and search text? Well, yes, LDA does that for us! That is so 2003! Well, can LDA model documents tagged from at least two different viewpoints or perspectives? No! Can we do that after reading this thesis? Yes we can! Can we generate bulleted lists from multiple documents after reading this thesis? Yes we can! Can we go further and translate videos into text and vice versa after reading this thesis? Yes we can! Bottomline:
    • 5. http://www.cs.princeton.edu/~blei/kdd-tutorial.pdf David Blei’s talk at KDD 2012, David Blei’s talk at ICML 2012
    • 6. • Unsupervised topic exploration using LDA – Full text of first 50 patents from uspto.gov using search keywords of “rocket” & full text of 50 scientific papers from American journal of Aerospace Engineering – Vocabulary size: 10102 words; Total word count: 219568. Theme 1 (topic from patent documents): insulation, composition, fiber, system, sensor, fire, water. Theme 2 (topic from journal papers): fuel, matter, A-B, engineer, tower, magnetic, electron. Theme 3 (topic from patent documents): launch, mission, space, system, vehicle, earth, orbit. Theme 4 (topic from journal papers): rocket, assembly, nozzle, surface, portion, ring, motor. Theme 5 (topic from journal papers): system, fuel, engine, combustion, propulsion, pump, oxidize. Explore and extrapolate from context
    • 7. Power of LDA: Language independence. Topics over words and topics over a controlled vocabulary, each shown with an English translation (the original non-English topic words are omitted in this transcript). Translations: Tsunami, earthquake, Chile, Pichilemu, gone, warning, news, city | flight, Air, France, Brazil, A, 447, disappear, ocean, France | China, Olympic, Beijing, Gore, function, stadium, games | Tsunami, earthquake, city, local, UTC, Mayor | Brazil, A, disappeared, search, flight, aircraft, ocean, ship, air, space | China, Olympic, Gore, gold, Beijing, National
    • 8. How does LDA look at documents? A boring view of Wikipedia
    • 9. What about other perspectives? Words forming other Wiki articles Article specific content words Words forming section titles An exciting view of Wikipedia
    • 10. Insulation, composition, fiber system, sensor, fire, water Fuel, matter, A-B Engineer, tower magnetic, electron Rocket, assembly, Nozzle, surface, Portion, ring, motor Launch, mission, Space, system, Vehicle, earth orbit We are identifying the landscape from within the landscape – similar to finding the map of a maze from within the maze! Fuel, matter, A-B Engineer, tower magnetic, electron Explore and extrapolate from context
    • 11. Mostly from premier topic model research groups Year I joined UB Today! Success of LDA: a Generative Model August
    • 12. Success of LDA • Fitting themes to an UNSEEN patent document on insulating a rocket motor using basalt fibers, nanoclay compositions etc. Theme 1 (topic from patent documents): insulation, composition, fiber, system, sensor, fire, water. Theme 2 (topic from journal papers): fuel, matter, A-B, engineer, tower, magnetic, electron. Theme 3 (topic from patent documents): launch, mission, space, system, vehicle, earth, orbit. Theme 4 (topic from journal papers): rocket, assembly, nozzle, surface, portion, ring, motor. Theme 5 (topic from journal papers): system, fuel, engine, combustion, propulsion, pump, oxidize. “What is claimed is: 1. An insulation composition comprising: a polymer comprising at least one of a nitrile butadiene rubber and polybenzimidazole fibers; basalt fibers having a diameter that is at least 5 .mu.m 2. (lots more) …”
    • 13. K-Means Hierarchical Clustering LDA: VB LDA: Gibbs Dynamic LDA MMLDA Corr-LDA Hierarchical LDA Markov LDA Syntactic LDA Suffix Tree LDA TagLDA Corr-METag2LDA Corr-MMGLDA Model Complexities (modulo presenter) GMM
    • 14. Model Complexities (modulo presenter) K-Means GMM Hierarchical Clustering LDA: VB Dynamic LDA MMLDA Corr-LDA Hierarchical LDA Markov LDA Syntactic LDA Suffix Tree LDA TagLDA Corr-METag2LDA Corr-MMGLDA Hair Loss LDA: Gibbs
    • 15. Why do we want to explore? Master Yoda, how do I find wisdom from so many things happening around us? Go to the center of the data and find your wisdom you will
    • 16. parkour perform traceur area flip footage jump park urban run outdoor outdoors kid group pedestrian playground lobster burger dress celery Christmas wrap roll mix tarragon steam season scratch stick live water lemon garlic floor parkour wall jump handrail locker contestant school run interview block slide indoor perform build tab duck make dog sandwich man outdoors guy bench black sit park white disgustingly toe cough feed rub contest parody Can you find your wisdom? Corr-MMGLDA
    • 17. Corr-MMGLDA parkour perform traceur area flip footage jump park urban run outdoor outdoors kid group pedestrian lobster burger dress celery Christmas wrap roll mix tarragon steam season scratch stick live water lemon floor parkour wall jump handrail locker contestant school run interview block slide indoor perform build tab duck make dog sandwich man outdoors guy bench black sit park white disgustingly toe cough feed rub contest parody tutorial: man explains how to make lobster rolls from scratch One guy is making sandwich outdoors montage of guys free running up a tree and through the woods interview with parkour contestants Kid does parkour around the park Footage of group of performing parkour outdoors A family holds a strange burger assembly and wrapping contest at Christmas Actual ground-truth synopses overlaid Man performs parkour in various locations Are these what you were thinking?
    • 18. The Classical Partitioning Problem • Fourteen unlabeled data points (1–14) • No ground truth label assignments are known
    • 19. The Classical Partitioning Problem • Distance from or closeness to a central point • Then, select the one with the lowest loss; for example the one shown – blue = +1, red = -1 • But we don’t really have a good way to measure loss here!
    • 20. Let’s sample one more point • Distance from or closeness to a central point • Then, select the one with the lowest loss; for example the one shown – blue = +1, red = -1 • But we don’t really have a good way to measure loss here!
    • 21. The Ground Truth – Two “Topics” The seven virtues The seven vices Assume, now, that we have some vocabulary V of English words X is a set of positions and each element of X is labeled with an element from V
    • 22. If X is a multi-set of words (set of positions), then it has an inherent structure in it: for e.g. • We no longer see: • We are used to: and #pow is in #doing Additional Partitioning: Documents The seven virtues The seven vices
    • 23. Success behind LDA • Allocate as few topics to a document • Allocate as few words to each topic I am Nik WalLenDA Balancing Act This checker board pattern has a significance – in general NP-Hard to figure out the correct pattern from limited samples even for 2 topics • The topic ALLOCATION is controlled by the parameter of a DIRICHLET distribution governing a LATENT proportion of topics over each document
    • 24. Current Timeline Consequent Timeline Event Categories: Accidents/Natural Disasters; Attacks (Criminal/Terrorist); Health & Safety; Endangered Resources; Investigations (Criminal/Legal/Other) Previously, long long time ago
    • 25. Centers of an utterance – Entities serving to link that utterance to other utterances in the current discourse segment Sparse Coherence Flows [Barbara J. Grosz, Scott Weinstein, and Arvind K. Joshi. Centering: A framework for modeling the local coherence of discourse. In Computational Linguistics, volume 21, pages 203–225, 1995] a. Bob opened a new dealership last week. [Cf=Bob, dealership; Cp=Bob; Cb=undefined] b. John took a look at the Fords in his lot. [Cf=John, Fords; Cp=John; Cb=Bob] {Retain} c. He ended up buying one. i. [Cf=John; Cp=John; Cb=John] {Smooth-Shift} OR ii. [Cf=Bob; Cp=Bob; Cb=Bob] {Continue} Previously, long long time ago Center approximation = the (word, [Grammatical/Semantic] role) pair (GSR), e.g. (Bob, Subject), (John, Subject), (dealership, Noun) Algorithmically By inspection For n+1 = 3 and case ii
    • 26. Global (document/section level) focus Problems with Centering Theory a. The house appeared to have been burgled. [Cf=house ] b. The door was ajar. [ Cb=house; Cf=door, house; Cp=door] c. The furniture was in disarray. [ Cb=house; Cf=furniture, house; Cp=furniture] {?} Previously, long long time ago For n+1 = 3 • Utterances like these are the majority in most free text documents [redundancy reduction] • In general, co-reference resolution is very HARD
    • 27. An example summary sentence from folder D0906B-A of TAC2009 A timeline: • “A fourth day of thrashing thunderstorms began to take a heavier toll on southern California on Sunday with at least three deaths blamed on the rain, as flooding and mudslides forced road closures and emergency crews carried out harrowing rescue operations.” The next two contextual sentences in the document of the previous sentence are: • “In Elysian Park, just north of downtown, a 42-year-old homeless man was killed and another injured when a mudslide swept away their makeshift encampment.” • “Another man was killed on Pacific Coast Highway in Malibu when his sport utility vehicle skidded into a mud patch and plunged into the Pacific Ocean.” If the query is, “Describe the effects and responses to the heavy rainfall and mudslides in Southern California,” observe the focus of attention on mudslides as subject in the first two sentences in the table below: Sentence-GSR grid for a sample summary document slice Summarization using Coherence • Incorporating coherence this way does not necessarily lead to the final summary being coherent • Coherence is best obtained in a post processing step using the Traveling Salesman Problem
    • 28. measure project lady tape indoor sew marker pleat highwaist zigzag scissor card mark teach cut fold stitch pin woman skirt machine fabric inside scissors make leather kilt man beltloop sew woman fabric make machine show baby traditional loom blouse outdoors blanket quick rectangle hood knit indoor stitch scissors pin cut iron studio montage measure kid penguin dad stuff thread One lady is doing sewing project indoors Woman demonstrating different stitches using a serger/sewing machine dad sewing up stuffed penguin for kids Woman makes a bordered hem skirt A pair of hands do a sewing project using a sewing machine ground-truth synopses overlaid But what we really want is this
    • 29. ground-truth synopses overlaid clock mechanism repair computer tube wash machine lapse click desk mouse time front wd40 pliers reattach knob make level video water control person clip part wire inside indoor whirlpool man gear machine guy repair sew fan test make replace grease vintage motor box indoor man tutorial fuse bypass brush wrench repairman lubricate workshop bottom remove screw unscrew screwdriver video wire How to repair the water level control mechanism on a Whirlpool washing machine a man is repairing a whirlpool washer how to remove blockage from a washing machine pump Woman demonstrates replacing a door hinge on a dishwasher A guy shows how to make repairs on a microwave How to fix a broken agitator on a Whirlpool washing machine A guy working on a vintage box fan And this
    • 30. And this
    • 31. And this
    • 32. Roadmap Introduction to LDA Discovering Voter Preferences Using Mixtures of Topic Models [AND’09 Oral] Learning to Summarize Using Coherence [NIPS 09 Poster] Core NLP including summarization, information extraction, unsupervised grammar induction, dependency parsing, rhetorical parsing, sentiment and polarity analysis… Non-parametric Bayes Applied StatisticsExit 2 Exit 1 Uncharted territory – proceed at your own risk
    • 33. Why When Who Where TagLDA: More Observed Constraints Domain knowledge Topic distribution over words Annotation/ Tag distribution over words Is there a model which can take additional clues and attempt to correct the misclassifications?
    • 34. Why When Who Where Domain knowledge Incorporating Prior Knowledge Topic distribution over words but conditioned over tags Number of parameters = (K+T)V TagLDA switches to this view for partial normalization of some weights - x5 and x10 are annotated with the orange label and x5 co-occurs with x9 both in documents d1 and d2 - It is thus likely that x5, x9 and x10 belong to the same class since both d1 and d2 should contain as few topics as possible
    • 35. Why When Who Where Domain knowledge Incorporating Prior Knowledge LDA TagLDA
    • 36. Incorporating Prior Knowledge With Additional Perspectives Why When Who Where Domain knowledge LDA TagLDA LDA
    • 37. Words indicative of important Wiki concepts Actual human generated Wiki category tags – words that summarize/ categorize the document Wikipedia Ubiquitous Bi-Perspective Document Structure
    • 38. Words indicative of questions Actual tags for the forum post – even frequencies are available! Words indicative of answers StackOverflow Ubiquitous Bi-Perspective Document Structure
    • 39. Words indicative of document title Actual tags given by users Words indicative of image description Yahoo! Flickr Ubiquitous Bi-Perspective Document Structure
    • 40. News Article What if the documents are plain text files? Understanding the Two Perspectives
    • 41. It is believed US investigators have asked for, but have been so far refused access to, evidence accumulated by German prosecutors probing allegations that former GM director, Mr. Lopez, stole industrial secrets from the US group and took them with him when he joined VW last year. This investigation was launched by US President Bill Clinton and is in principle a far more simple or at least more single-minded pursuit than that of Ms. Holland. Dorothea Holland, until four months ago was the only prosecuting lawyer on the German case. News Article Imagine browsing over many reports on an event Understanding the Two Perspectives
    • 42. It is believed US investigators have asked for, but have been so far refused access to, evidence accumulated by German prosecutors probing allegations that former GM director, Mr. Lopez, stole industrial secrets from the US group and took them with him when he joined VW last year. This investigation was launched by US President Bill Clinton and is in principle a far more simple or at least more single-minded pursuit than that of Ms. Holland. Dorothea Holland, until four months ago was the only prosecuting lawyer on the German case. News Article The “document level” perspective What words can we remember after a first browse? German, US, investigations, GM, Dorothea Holland, Lopez, prosecute Understanding the Two Perspectives
    • 43. Important Verbs and Dependents Named Entities What helped us remember? ORGANIZATION It is believed US investigators have asked for, but have been so far refused access to, evidence accumulated by German prosecutors probing allegations that former GM director, Mr. Lopez, stole industrial secrets from the US group and took them with him when he joined VW last year. This investigation was launched by US President Bill Clinton and is in principle a far more simple or at least more single-minded pursuit than that of Ms. Holland. Dorothea Holland, until four months ago was the only prosecuting lawyer on the German case. News Article LOCATION MISC PERSON WHAT HAPPENED? The “word level” perspective The “document level” perspective German, US, investigations, GM, Dorothea Holland, Lopez, prosecute Understanding the Two Perspectives
    • 44. Summarization power of the perspectives It is believed US investigators have asked for, but have been so far refused access to, evidence accumulated by German prosecutors probing allegations that former GM director, Mr. Lopez, stole industrial secrets from the US group and took them with him when he joined VW last year. This investigation was launched by US President Bill Clinton and is in principle a far more simple or at least more single-minded pursuit than that of Ms. Holland. Dorothea Holland, until four months ago was the only prosecuting lawyer on the German case. German, US, investigations, GM, Dorothea Holland, Lopez, prosecute Sentence Boundaries What if we turn the document off? Begin Middle End
    • 45. A young man climbs an artificial rock wall indoors Adjective modifier (What kind of wall?) Direct Object Direct Subject Adverb modifier (climbing where?) Major Topic: Rock climbing Sub-topics: artificial rock wall, indoor rock climbing gym And as if that wasn’t enough!
    • 46. Categories: Weather hazards to aircraft | Accidents involving fog | Snow or ice weather phenomena | Fog | Psychrometrics Labeled by human editors Beginning Middle End A Wikipedia Article on “fog”
    • 47. Wiki categories: Abstract or specific? • Take the first category label – “weather hazards to aircraft” • “aircraft” doesn’t occur in the document body! • “hazard” only appears in a section title read as “Visibility hazards” • “Weather” appears only 6 out of 15 times in the main body • However, the images suggest that fog is related to concepts like fog over the Golden Gate bridge, fog in streets, poor visibility and quality of air Labeled by human editors: Categories: Weather hazards to aircraft | Accidents involving fog | Snow or ice weather phenomena | Fog | Psychrometrics Labeled by a Tag2LDA model from title and image captions: Categories: fog, San Francisco, visible, high, temperature, streets, Bay, lake, California, bridge, air
    • 48. • How do we model such a document collection?
    • 49. METag2LDA Corr-METag2LDA MMLDA CorrMMLDA TagLDA Combines TagLDA and MMLDA Combines TagLDA and Corr-MMLDA MM = Multinomial + Multinomial; ME = Multinomial + Exponential Made Possible with Tag2LDA Models E-Harmony!
    • 50. Topic ALLOCATION is controlled by the parameter of a DIRICHLET distribution governing a LATENT proportion of topics over each document I am Nik WalLenDA Bi-Perspective Topic Model – METag2LDA And this balancing act got a whole lot tougher
    • 51. Exponential State Space Bayes Ball
    • 52. Constructing Variational Dual
    • 53. Mean Field Distributions
    • 54. Mean Field Distributions
    • 55. Mean Field Distributions Hmmm… a smudge… wipe.. wipe.. wipe.. 2 plates, 2 arrows, 4 circles… no smudges… even and nice!
    • 56. Mixture Model: Real valued data
    • 57. y x Mixture Model: Real valued data
    • 58. Mean Parameters
    • 59. Mean Field Optimization Empirical mean p belongs to exponential family by MaxEnt
    • 60. Forward Mapping Backward Mapping Mean Field Optimization Sufficient statistics
    • 61. Mean Field Optimization Very similar to finding the basic feasible solution (BFS) in linear programming • Start with pivot at the origin (only slack variables as solution) • Cycle the pivot through the extreme points i.e. replace slacks in BFS until solution is found
    • 62. Mean Field Optimization However, mean field optimization space is inherently non-convex over the set of tractable distributions due to the delta functions which match the extreme points of the convex hull of sufficient statistics of the original discrete distributions
    • 63. ELBO: Evidence Lower BOund
    • 64. Mean Field Inference
    • 65. Mean Field Inference
    • 66. Mean Field Inference ELBO
    • 67. Topics conditioned on different section identifiers (WL tag categories) Topic Marginals Topics over image captions Correspondence of DL tag words with content words Topic Labeling Faceted Bi-Perspective Document Organization All of the inference machinery *is needed* to generate exploratory outputs like this!
    • 68. • METag2LDA: A topic generating all DL tags in a document does not necessarily mean that the same topic generates all words in the document • Corr-METag2LDA: A topic generating *all* DL tags in a document does mean that the same topic generates all words in the document - a considerable strongpoint Topic concentration parameter Document specific topic proportions Document content words Document Level (DL) tags Word Level (WL) tags Indicator variables Topic Parameters Tag Parameters CorrME- Tag2LDA METag2LDA The Family of Tag2LDA Models
    • 69. Experiments • Wikipedia articles with images and captions manually collected along {food, animal, countries, sport, war, transportation, nature, weapon, universe and ethnic groups} concepts • Annotations/tags used: • DL Tags – image caption words and the article titles • WL Annotations – Positions of sections binned into 5 bins • Objective: to generate category labels for test documents • Evaluation – ELBO: to see performance among various TagLDA models – WordNet based similarity evaluation between actual category labels and proxies for them from caption words
    • 70. Held-out ELBO Selected Wikipedia Articles • WL annotations – Section positions in the document • DL tags – image caption words and article titles • TagLDA perplexity is comparable to MM(METag2)LDA • The (image caption words + article titles) and the content words are independently discriminative enough • Corr-MM(METag2)LDA performs best since almost all image caption words and the article title for a Wikipedia document are about a specific topic [Chart: held-out ELBO in millions for K=20, 50, 100, 200 across MMLDA, TagLDA, corrLDA, METag2LDA, corrMETag2LDA]
    • 71. Held-out ELBO DUC05 Newswire Dataset (Recent Experiments with TagLDA Included) • WL annotations – Named Entities • DL tags – abstract coherence tuples like (subject, object) e.g. “Mary(Subject) taught the class. Everybody liked Mary(Object).” [Ignoring coref resolution] • Abstract markers like (“subj” “obj”) acting as DL perspective are not document discriminative or even topical markers • Rather they indicate a semantic perspective of coherence which is intricately linked to words • Ignoring the DL perspective completely leads to a better fit by TagLDA due to variations in word distributions only [Charts: held-out ELBO in millions at 40, 60, 80, 100 topics for MMLDA, METag2LDA, corrLDA, corrMETag2LDA, TagLDA]
    • 72. Are Categories more abstract or specific? Inverse Hop distance in WordNet ontology • Top 5 words from the caption vocabulary are chosen • Max Weighted Average = 5, Max Best = 1 • METag2LDA almost always wins by narrow margins • METag2LDA reweights the vocabulary of caption words and article titles that are about a topic and hence may miss specializations relevant to the document within the top (5) ones • In the WordNet ontology, specializations lead to more hop distance • Ontology based scoring helps explain connections of caption words to ground truths e.g. Skateboard skate glide snowboard [Chart: average and best WordNet distance for METag2LDA and corrMETag2LDA at K=20, 50, 100, 200]
    • 73. • Applications – Document classification using reduced dimensions – Find faceted topics automatically through word level tags – Learn correspondences between perspectives – Label topics through document level multimedia – Create recommendations based on perspectives – Video analysis: word prediction given video features – Tying “multilingual comparable corpora” through topics – Multi-document summarization using coherence – E-Textbook aided discussion forum mining: • Explore topics through the lens of students and teachers • Label topics from posts through concepts in the e-textbook Model Usefulness and Applications
    • 74. Roadmap Introduction to LDA Discovering Voter Preferences Using Mixtures of Topic Models [AND’09 Oral] Learning to Summarize Using Coherence [NIPS 09 Poster] Core NLP including summarization, information extraction, unsupervised grammar induction, dependency parsing, rhetorical parsing, sentiment and polarity analysis… Non-parametric Bayes Computer Vision and Applications – Core Technologies Applied Statistics Supervised Learning, Structured Prediction Simultaneous Joint and Conditional Modeling of Documents Tagged from Two Perspectives [CIKM 2011 Oral]
    • 75. Mostly from premier topic model research groups Year I joined UB Today! Success of LDA: Image Annotation August
    • 76. Previously Words forming other Wiki articles Article specific content words Caption corresponding to the embedded multimedia [P. Das, R. K. Srihari and Y. Fu. “Simultaneous Joint and Conditional Modeling of Documents Tagged from Two Perspectives,” CIKM, Glasgow Scotland, 2011]
    • 77. Afterwards Words forming other Wiki articles Article specific content words Caption corresponding to the embedded multimedia [P. Das, R. K. Srihari and J. J. Corso. “Translating Related Words to Videos and Back through Latent Topics,” WSDM, Rome, Italy, 2013]
    • 78. • Expensive frame-wise manual annotation efforts by drawing bounding boxes • Difficulties: camera shakes, camera motion, zooming • Careful consideration to which objects/concepts to annotate? • Focus on object/concept detection – noisy for videos in-the-wild • Does not answer which objects/concepts are important for summary generation? Man with microphone Climbing person Annotations for training object/concept models Trained Models Information Extraction from Videos
    • 79. Learning latent translation spaces a.k.a topics A young man is climbing an artificial rock wall indoors Human Synopsis • Mixed membership of latent topics • Some topics capture observations that co-occur commonly • Other topics allow for discrimination • Different topics can be responsible for different modalities No annotations needed – only need clip level summary Translating across modalities MMGLDA model
    • 80. Translating across modalities Using learnt translation spaces for prediction ? Text Translation p(w_v | \mathbf{w}_d^O, \mathbf{w}_d^H) \propto \sum_{i=1}^{K} \big( \sum_{o=1}^{O} p(w_v | i)\, p(i | w_{d,o}^O) + \sum_{h=1}^{H} p(w_v | i)\, p(i | w_{d,h}^H) \big) • Topics are marginalized out to permute vocabulary for predictions • The lower the correlation among topics, the better the permutation • Sensitive to priors for real valued data MMGLDA model
    • 81. Translating across modalities Use learnt translation spaces for prediction ? Text Translation p(w_v | \mathbf{w}_d^O, \mathbf{w}_d^H) \propto \sum_{i=1}^{K} \big( \sum_{o=1}^{O} p(w_v | i)\, p(i | w_{d,o}^O) + \sum_{h=1}^{H} p(w_v | i)\, p(i | w_{d,h}^H) \big) • Topics are marginalized out to permute vocabulary for predictions • The lower the correlation among topics, the better the permutation • Sensitive to priors for real valued data Responsibility of topic i over real valued observations Responsibility of topic i over discrete video features Probability of learnt topic i explaining words in the text vocabulary MMGLDA model
    • 82. • We first formulated the MMGLDA model just two rooms left of where I am standing now! An aside
    • 83. 1. There is a guy climbing on a rock-climbing wall. Multiple Human Summaries: (Max 10 words i.e. imposing a length constraint) 2. A man is bouldering at an indoor rock climbing gym. 3. Someone doing indoor rock climbing. 4. A person is practicing indoor rock climbing. 5. A man is doing artificial rock climbing. To understand whether we speak all that we see?
    • 84. 1. There is a guy climbing on a rock-climbing wall. Multiple Human Summaries: (Max 10 words for imposing a length constraint) Hand holding climbing surface How many rocks? The sketch in the board Wrist-watch What’s there in the back? Color of the floor/wall Dress of the climber Not so important! 2. A man is bouldering at an indoor rock climbing gym. Empty slots 3. Someone doing indoor rock climbing. 4. A person is practicing indoor rock climbing. 5. A man is doing artificial rock climbing. Summaries point toward information needs! Center of Attentions: Central Objects and Actions
    • 85. Skateboarding Feeding animals Landing fishes Wedding ceremony Woodworking project Multimedia Topic Model – permute event specific vocabularies Bag of keywords multi-document summaries Sub-events e.g. skateboarding, snowboarding, surfing Multiple sets of documents (sets of frames in videos) Natural language multi-document summaries Multiple sentences (group of segments in frames) Once again: A Summarization Perspective
    • 86. Evaluation: Held out ELBOs • In a purely multinomial MMLDA model, failures of independent events contribute highly negative terms to the log likelihoods • NOT a measure of keyword summary generation power • Test ELBOs on events 1-5 in the Dev-T set • Prediction ELBOs on events 1-5 in the Dev-T set
    • 87. Skateboarding Feeding animals Landing fishes Wedding ceremony Woodworking project Multimedia Topic Model – permute event specific vocabularies Bag of words multi-document summaries Sub-events e.g. skateboarding, snowboarding, surfing Multiple sets of documents (sets of frames in videos) Natural language multi-document summaries Multiple sentences (group of segments in frames) • A c-SVM classifier from the libSVM package is used with default settings for multiclass (15 classes) classification • 55% test accuracy easily achievable (completely off-the-shelf) Evaluate using ROUGE-1 HEXTAC 2009: 100-word human references vs. 100-word manually extracted summaries Average Recall: 0.37916 (95%-confidence interval 0.37187 - 0.38661) Average Precision: 0.39142 (95%-confidence interval 0.38342 - 0.39923) Event Classification and Summarization
    • 88. Skateboarding Feeding animals Landing fishes Wedding ceremony Woodworking project Multimedia Topic Model – permute event specific vocabularies Bag of words multi-document summaries Sub-events e.g. skateboarding, snowboarding, surfing Multiple sets of documents (sets of frames in videos) Natural language multi-document summaries Multiple sentences (group of segments in frames) • A c-SVM classifier from the libSVM package is used with default settings for multiclass (15 classes) classification • 55% test accuracy easily achievable (completely off-the-shelf) Event Classification and Summarization Evaluate using ROUGE-1 HEXTAC 2009: 100-word human references vs. 100-word manually extracted summaries Average Recall: 0.37916 (95%-confidence interval 0.37187 - 0.38661) Average Precision: 0.39142 (95%-confidence interval 0.38342 - 0.39923) • If we can achieve 10% of this for 10 word summaries, we are doing pretty good! • Caveat – Text multi-document summarization task is much more complex
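For illustration, a minimal sketch of this event-classification step with scikit-learn's SVC, which wraps libSVM (random placeholder features and labels, not the TRECVID data):
```python
import numpy as np
from sklearn.svm import SVC

X_train, y_train = np.random.randn(200, 50), np.random.randint(0, 15, 200)   # 15 event classes
X_test, y_test = np.random.randn(60, 50), np.random.randint(0, 15, 60)
clf = SVC()                        # C-SVM with default settings; multiclass handled internally
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))   # accuracy on held-out clips
```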
    • 89. Evaluation: ROUGE-1 Performance. MMLDA can show a poor ELBO, which is a bit misleading: it performs quite well at predicting summary-worthy keywords. Sum-normalizing the real-valued data to lie in [0,1]^P distorts reality for Corr-MGLDA w.r.t. quantitative evaluation: the summary worthiness of its predicted keywords is not good, but its topics are good. MMGLDA produces better topics and a higher ELBO; the summary worthiness of its keywords is almost the same as MMLDA for lower n.
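    (A small sketch of the sum-normalization issue noted above: mapping real-valued features onto the simplex puts every dimension in [0, 1] but throws away scale, which is one way such a transform can distort the picture for quantitative evaluation. The example vectors are made up.)

```python
# Sketch of sum-normalization: the feature vector is rescaled to sum to 1,
# so every dimension lies in [0, 1]. Overall scale information is discarded.
import numpy as np

x_small = np.array([0.2, 0.3, 0.5])
x_large = np.array([20.0, 30.0, 50.0])   # 100x larger responses

def sum_normalize(x):
    return x / x.sum()

# Both vectors collapse to the same normalized representation.
print(sum_normalize(x_small))   # [0.2 0.3 0.5]
print(sum_normalize(x_large))   # [0.2 0.3 0.5]
```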
    • 90. Improving ROUGE-1/2 performance. Simply predicting more and more keywords (or creating sentences out of them) does not improve the relevance of the generated summaries. Instead, selecting sentences from the training set in an intuitive way almost doubles the relevance of the lingual descriptions.
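    (One illustrative way to realize "selecting sentences from the training set in an intuitive way" is to rank training sentences by overlap with the predicted keywords. The sketch below only conveys the flavor of such a step; it is not the actual selection rule used in the thesis.)

```python
# Illustrative sketch only: rank training sentences by unigram overlap with the
# predicted keyword set and keep the best-matching ones.
def select_sentences(training_sentences, predicted_keywords, max_sentences=2):
    keywords = set(predicted_keywords)
    scored = []
    for sent in training_sentences:
        tokens = set(sent.lower().split())
        scored.append((len(tokens & keywords), sent))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sent for score, sent in scored[:max_sentences] if score > 0]

sentences = [
    "A man is doing tricks on a skateboard at a skate park.",
    "The chef chops onions on a cutting board.",
]
print(select_sentences(sentences, ["skateboard", "tricks", "park"]))
```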
    • 91. ROUGE scores for the "YouCook" dataset [Corso et al.] (YouCook, iAnalyze). Das et al. WSDM 2013: Recall 1-gram 19.02, Recall 2-gram 0.006, Precision 1-gram 15.47, Precision 2-gram 0.006. Das et al. CVPR 2013: Recall 1-gram 32.87, Recall 2-gram 6.49, Precision 1-gram 25.76, Precision 2-gram 5.14.
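    (The ROUGE-1 recall and precision figures above are, at their core, unigram-overlap statistics against reference summaries. A minimal sketch of that computation follows; real ROUGE evaluations additionally use stemming, multiple references, length limits, and bootstrapped confidence intervals like the ones reported earlier.)

```python
# Minimal ROUGE-1 sketch: unigram recall/precision of a candidate summary
# against a single reference summary.
from collections import Counter

def rouge_1(candidate, reference):
    cand, ref = Counter(candidate.lower().split()), Counter(reference.lower().split())
    overlap = sum((cand & ref).values())          # clipped unigram matches
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    return recall, precision

r, p = rouge_1("a man rides a skateboard", "a man is doing tricks on a skateboard")
print(f"ROUGE-1 recall={r:.3f} precision={p:.3f}")   # recall=0.500 precision=0.800
```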
    • 92. Roadmap: Introduction to LDA; Discovering Voter Preferences Using Mixtures of Topic Models [AND'09 Oral]; Learning to Summarize Using Coherence [NIPS 09 Poster]; Non-parametric Bayes; Computer Vision and Applications – Core Technologies; Translating Related Words to Videos and Back through Latent Topics [WSDM 2013 Oral]; Applied Statistics; Supervised Learning, Structured Prediction; Simultaneous Joint and Conditional Modeling of Documents Tagged from Two Perspectives [CIKM 2011 Oral]; Core NLP including summarization, information extraction, unsupervised grammar induction, dependency parsing, rhetorical parsing, sentiment and polarity analysis; Using Tag-Topic Models and Rhetorical Structure Trees to Generate Bulleted List Summaries [to be submitted to TOIS]; Linear, Quadratic and Conic Programming Variants; A Thousand Frames in Just a Few Words: Lingual Descriptions of Videos through Latent Topic Models and Sparse Object Stitching [CVPR 2013 Spotlight].
    • 93. Just one last thing… • We want to analyze documents not only for topic discovery but also for turning these
    • 94. Just one last thing… • into this:
      - A previous study on sleep deprivation that less sleep resulted in impaired glucose metabolism.
      - Women who slept less than or equal to 5 hours a night were twice as likely to suffer from hypertension than women. [*]
      - Children ages 3 to 5 years get 11-13 hours of sleep per night.
      - Chronic sleep deprivation can do more it can also stress your heart.
      - Sleeping less than eight hours at night, frequent nightmares and difficulty initiating sleep were significantly associated with drinking.
      - A single night of sleep deprivation can limit the consolidation of memory the next day.
      - Women's health is much more at risk. [*]
      ([*] means that the sentences belong to the same document)
    • 95. Just one last thing… • using these: document sets or "Docsets" (Accidents and Natural Disasters, Attacks, Health and Safety, Endangered Resources, Investigations and Trials); a global Tag-Topic Model plus per-docset local models over documents and sentences; training uses the documents, then sentences from the docsets are fitted to the learnt model; each candidate summary sentence for a docset is weighted using both the local and the global models.
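    (The slide does not spell out how the local and global scores are combined; as a purely illustrative sketch, one could weight a candidate sentence log-linearly with a mixing parameter. Both scoring functions, the example scores, and the parameter `lam` are hypothetical placeholders, not the thesis' actual weighting scheme.)

```python
# Illustrative sketch only: combine a global (tag-topic) score with a docset-local
# score for each candidate summary sentence and pick the highest-scoring one.
def combined_score(global_log_prob, local_log_prob, lam=0.5):
    # Log-linear interpolation of the two model scores (lam is hypothetical).
    return lam * global_log_prob + (1.0 - lam) * local_log_prob

candidates = {
    "Chronic sleep deprivation can stress your heart.": (-42.1, -35.7),
    "The weather was pleasant on Tuesday.": (-51.9, -60.3),
}
best = max(candidates, key=lambda s: combined_score(*candidates[s]))
print(best)
```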
    • 96. Just one last thing… • and these: rhetorical structure trees over elementary discourse units, with relations such as Attribution, Cause, and Elaboration. For example:
      - Attribution + Joint: [Span 1, Satellite] "The National Sleep Foundation reported in 2006" / [Span 2, Nucleus] "that only 20 percent of adolescents get the recommended nine hours of sleep;" / [Span 3, Nucleus] "distractions such as computers or video games in kids' bedrooms may lessen sleep quality."
      - Explanation + Joint: [Span 1, Nucleus] "Sleep-deprived teens crash just about anywhere" / [Span 2, Nucleus] "because they're nocturnal" / [Span 3, Nucleus] "and need more than eight hours of sleep per day."
      - Contrast + Attribution: [Span 1, Nucleus] "Generations have praised the wisdom of getting up early in the morning," / [Span 2, Satellite] "but a Japanese study says" / [Span 3, Nucleus] "early-risers are actually at a higher risk of developing heart problems."
      - Attribution: [Span 1, Satellite] "Fortunately for sleepy women, a Penn State College of Medicine study found," / [Span 2, Nucleus] "that they're much better than men at enduring sleep deprivation," / [Span 3, Nucleus] "possibly because of 'profound demands of infant and child care" / [Span 4, Satellite] "placed on them for most of mankind's history."
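    (A minimal sketch of the tree structure behind these examples: leaves are elementary discourse units, internal nodes carry a rhetorical relation over nucleus/satellite children. The class and helper names are illustrative, not tied to any particular RST parser; the example rebuilds the first tree listed above.)

```python
# Minimal sketch of a rhetorical structure tree: leaves are EDUs, internal nodes
# hold a relation label over (nuclearity, subtree) children.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class RSTNode:
    relation: str                              # e.g. "Attribution", "Joint", "Contrast"
    children: List[Tuple[str, "RSTNode"]]      # (nuclearity, subtree) pairs
    text: str = ""                             # non-empty only for leaf EDUs

def leaf(text: str) -> RSTNode:
    return RSTNode(relation="Leaf", children=[], text=text)

# First example tree above: Attribution over spans 1-2, joined with span 3.
tree = RSTNode("Joint", [
    ("Nucleus", RSTNode("Attribution", [
        ("Satellite", leaf("The National Sleep Foundation reported in 2006")),
        ("Nucleus", leaf("that only 20 percent of adolescents get the recommended nine hours of sleep;")),
    ])),
    ("Nucleus", leaf("distractions such as computers or video games in kids' bedrooms may lessen sleep quality.")),
])
print(tree.relation, [nuclearity for nuclearity, _ in tree.children])
```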
    • 97. • With scores like these Just one last thing…
    • 98. Just one last thing… • and these
    • 99. The final song: Recap. • We want to analyze documents not only for topic discovery but also for turning these • into this • using these • and these • with scores like these • and these
    • 100. The ending… Interviewer: Do you agree with President Obama's approach towards Libya? Presidential Candidate: [Libya??] I just wanted to make sure we're talking about the same thing before I say, 'Yes, I agreed' or 'No I didn't agree.' I do not agree with the way he handled it for the following reason -- nope, that's a different one. I got all this stuff twirling around in my head. • So that we can always have the right information at our fingertips
    • 101. Summary • Topic models can now talk to structured prediction models • Efficient text summarization/translation of domain-specific videos is now possible • With multi-document summarization systems which exploit meaning in text, we are getting closer to our ultimate dream: construct an artificial assistant who can summarize a task using contextual exploratory analysis tools as well as deep NLP, and make decisions for us!
    • 102. Future Directions • Core Algorithms – Non-parametric Tag2LDA family models – Address sparsity in tags and scaling of real-valued variables in mixed domain topic models – Efficient inference with more structure among hidden variables • Applications – Type in text and get an object detector [borrowed from VPML] – Intention analysis of videographers in social networks and the evolution of intentions over time – Large scale visualization using rhetorics and topic analysis – Large scale multi-media multi-document summarization
    • 103. Thank You All for Listening Questions?