Arcomem training Topic Analysis Models advanced

This presentation on Topic Analysis Models is part of the ARCOMEM training curriculum. Feel free to roam around or contact us on Twitter via @arcomem to learn more about ARCOMEM training on archiving Social Media.


    Presentation Transcript

    • Topic Analysis in ARCOMEM (Yahoo Research Barcelona)
    • What is Probabilistic Topic Modelling? Exploring and retrieving meaningful information from large collections of textual documents is a challenging task. Probabilistic topic models are a suite of algorithms (a framework) that aim to discover thematic information in large archives of documents and annotate the documents with it. They do not require any prior annotation or labelling of the documents: topics emerge from the statistical analysis of the original texts.
    • Probabilistic Topic Model: Topic models are based on the idea that documents are mixtures of topics, where a topic is a probability distribution over a fixed vocabulary. A topic model is a generative model for documents: it specifies a simple probabilistic procedure by which documents can be generated. The idea is to study the co-occurrence of words, assuming that words that frequently co-occur express, or belong to, the same semantic concept. Example: a document d can be represented by the following mixture of topics: Biology 0.6, Physics 0.3, Mathematics 0.1. In the topic “Biology”, words such as “DNA”, “genetic”, and “evolution” have high probability.
    • Intuition behind topic modelling: Documents exhibit multiple topics. Each topic is individually interpretable, providing a probability distribution over words that picks out a coherent cluster of correlated terms. (Figure labels: Evolution, Biology, Genetics, Statistical Analysis.)
    • Generative process: We only observe the documents. Our goal is to infer the underlying topic structure: what are the topics? (Figure: two unknown topics, “Topic 1: ?” and “Topic 2: ?”, shown with the words: time series, nonlinear, mathematics, geometric, dynamics; ecologist, population, species, natural, nature, human.)
    • Text Modeling: A word is the basic unit of discrete data, defined to be an item from a vocabulary indexed by {1, ..., V}. A document is a sequence of N words denoted by w = (w_1, w_2, ..., w_N), where w_n is the nth word in the sequence. A corpus is a collection of M documents denoted by D = {w_1, w_2, ..., w_M}. Bag-of-words assumption: the only information relevant to the model is the number of times each word occurs in a document; word order is not considered.
    • Latent Dirichlet Allocation
    • The challenge is to identify, for each campaign, significant and important topics that are relevant to the two use cases, broadcasting and parliament libraries. Topic analysis provides semantically useful categories which allow end users to search and browse content archives.
    • Try out on SARA: Trending topics
    • Try out on SARA: Statistical Topic Models
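
The "Probabilistic Topic Model" and "Text Modeling" slides above describe a document as a mixture of topics and reduce it to word counts. The following is a minimal Python sketch of those two ideas, not the ARCOMEM implementation. The mixture (Biology 0.6, Physics 0.3, Mathematics 0.1) and the Biology words come from the slides; every word probability, the Physics vocabulary, the assignment of words to "Mathematics", and all function names are assumptions made for illustration. The sketch only runs the generative direction; inferring topics from observed documents, which is the actual goal of topic modelling, is not shown.

```python
import random
from collections import Counter

# Hypothetical topics: each topic is a probability distribution over words.
# All probabilities (and the Physics/Mathematics vocabularies) are assumed.
TOPICS = {
    "Biology": {"dna": 0.4, "genetic": 0.3, "evolution": 0.3},
    "Physics": {"energy": 0.5, "particle": 0.3, "quantum": 0.2},
    "Mathematics": {"geometric": 0.4, "nonlinear": 0.3, "dynamics": 0.3},
}

# Topic mixture of the example document d from the slides.
DOC_MIXTURE = {"Biology": 0.6, "Physics": 0.3, "Mathematics": 0.1}


def sample(distribution, rng):
    """Draw one key from a {key: probability} mapping."""
    keys = list(distribution)
    weights = [distribution[k] for k in keys]
    return rng.choices(keys, weights=weights, k=1)[0]


def generate_document(mixture, topics, n_words, rng):
    """For each word position, pick a topic from the document's mixture,
    then pick a word from that topic's distribution."""
    words = []
    for _ in range(n_words):
        topic = sample(mixture, rng)
        words.append(sample(topics[topic], rng))
    return words


def bag_of_words(words):
    """Bag-of-words view: keep only the count of each word, not the order."""
    return Counter(words)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    doc = generate_document(DOC_MIXTURE, TOPICS, n_words=20, rng=rng)
    print(doc)
    print(bag_of_words(doc))
```

Because `bag_of_words` returns only counts, two documents containing the same words in a different order map to the same representation, which is exactly what the bag-of-words assumption states.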