Semantic Transforms Using Collaborative Knowledge Bases

presented at WIN2012

Published in: Technology
Notes
  • Blei: “Much of my research is in topic models, which are a suite of algorithms to uncover the hidden thematic structure of a collection of documents. These algorithms help us develop new ways to search, browse and summarize large archives of texts.”
  • Here is an example of a paragraph. We assume that some number of topics exist in a document set. Each document is a mixture of these corpus-wide topics. Each topic is a distribution over words. Each word is drawn from one of those topics.
  • Describing what the topics mean is a different matter.
  • Use posterior expectations / approximate posterior inference: Gibbs sampling, variational inference
  • We chose this so that we can validate our results.
  • Pause… Thank you
  • Transcript of "Semantic Transforms Using Collaborative Knowledge Bases"

    1. Semantic Transforms Using Collaborative Knowledge Bases
       Yegin Genc, Winter Mason, Jeffrey V. Nickerson
       Stevens Institute of Technology
    2. Overview
       • Automatically understand online information
       • Using network artifacts, such as Wikipedia, to help
    3. Topic Models
       Algorithms to understand and organize documents by uncovering semantic structure of a document collection
       • Discover hidden themes – patterns of word use
       • Connect documents that exhibit similar patterns
    4. Latent Dirichlet Allocation (LDA)
       “In the computer science field of artificial intelligence, a genetic algorithm (GA) is a search heuristic that mimics the process of natural evolution. This heuristic is routinely used to generate useful solutions to optimization and search problems. Genetic algorithms belong to the larger class of evolutionary algorithms (EA), which generate solutions to optimization problems using techniques inspired by natural evolution, such as inheritance, mutation, selection, and crossover.” [1]

       Algorithms     0.28     Genetic        0.18
       Optimization   0.28     Natural        0.18
       Algorithm      0.14     Evolution      0.18
       Computer       0.14     Evolutionary   0.09
       Techniques     0.14     …
       …

       [1] http://en.wikipedia.org/wiki/Genetic_algorithm
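The generative story behind slide 4 (each document is a mixture of topics, each topic is a distribution over words, each word is drawn from one of those topics) can be sketched in a few lines. This is a minimal illustration, not the authors' code: the two word lists reuse the weights shown on the slide as relative weights only (the slide truncates both lists), and the function names, the Dirichlet parameter `alpha`, the document length, and the seed are all assumptions made for the example.

```python
import random

# Relative word weights read off slide 4. The slide truncates both lists,
# so these are illustrative weights, not complete distributions.
TOPICS = [
    {"algorithms": 0.28, "optimization": 0.28, "algorithm": 0.14,
     "computer": 0.14, "techniques": 0.14},
    {"genetic": 0.18, "natural": 0.18, "evolution": 0.18,
     "evolutionary": 0.09},
]


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, length, rng):
    """Generate one document: topic mixture first, then one topic per word."""
    theta = sample_dirichlet(alpha, rng)  # per-document topic proportions
    words = []
    for _ in range(length):
        # Pick a topic for this word position according to theta.
        k = rng.choices(range(len(topics)), weights=theta)[0]
        # Pick a word according to that topic's (relative) word weights.
        vocab = list(topics[k])
        weights = [topics[k][w] for w in vocab]
        words.append(rng.choices(vocab, weights=weights)[0])
    return theta, words


rng = random.Random(0)  # arbitrary fixed seed, for repeatability only
theta, words = generate_document(TOPICS, alpha=[1.0, 1.0], length=12, rng=rng)
print("topic proportions:", theta)
print("document:", " ".join(words))
```

Inference in LDA runs this story in reverse: given only the words, it estimates the topic proportions and the word distributions.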
    5. Topics from LDA
       computer     chemistry    cortex       orbit      infection
       methods      synthesis    stimulus     dust       immune
       number       oxidation    fig          jupiter    aids
       two          reaction     vision       line       infected
       principle    product      neuron       system     viral
       design       organic      recordings   solar      cells
       Five topics from a 50-topic LDA model fit to Science from 1980 – 2002 (Blei and Lafferty, 2009)

       methods k of the for the the operations the the the objects of the o and the of a of to a linear we of functional a of algorithm and to problem and to requires is problems for the we problems a that and in
       Ten randomly chosen topics from a 50-topic LDA model fit to abstracts from the Journal of the ACM (JACM) from the years 1987 to 2004 (Blei et al., 2010).
    6. The interpretation problem
       1. Labeling the topics is difficult (J. Chang et al., 2009)
       2. The relationships between topics are not identified
       3. The information in the topics is based solely on the input corpus
       4. The external validity of the topics may be limited
    7. Collaborative Knowledge Bases
       1. Labeled topics
       2. Connected to each other in a meaningful way
       3. Contain rich, focused information on particular topics
       4. Contain fresh, up-to-date information about practically everything
    8. Wikipedia Pages as Topics
       LDA topic: orbit, dust, jupiter, line, system, solar, gas, atmospheric, mars, field
       Wikipedia page: Solar System
       “The Solar System[a] consists of the Sun and the astronomical objects gravitationally bound in orbit around it, all of which formed from the collapse of a giant molecular cloud approximately 4.6 billion years ago…” (http://en.wikipedia.org/wiki/Solar_System)
    9. Wikipedia Pages as Topics
       Topics are characterized as distributions over observed words in Wikipedia pages.

       $\beta_k = p(W_i \mid k) = \frac{\{W_i \in k\}}{\sum_{i}^{N} \{W_i \in k\}}$

       βk : Per-topic word distribution

       Wikipedia
       Word          Freq.   βk
       orbit            34   0.12
       dust              7   0.02
       jupiter          36   0.12
       line              0   0.00
       system           76   0.26
       solar           110   0.38
       gas              11   0.04
       atmospheric       1   0.00
       mars              8   0.03
       field             8   0.03
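The per-topic word distribution on slide 9 can be reproduced from the frequency column, reading {W_i ∈ k} as the number of times word W_i occurs in Wikipedia page k. The sketch below assumes the normalization runs over the ten listed words only, which is the reading that matches the slide's rounded values. The function name is made up for the example.

```python
# Word frequencies for the "Solar System" page, copied from slide 9.
COUNTS = {
    "orbit": 34, "dust": 7, "jupiter": 36, "line": 0, "system": 76,
    "solar": 110, "gas": 11, "atmospheric": 1, "mars": 8, "field": 8,
}


def per_topic_word_distribution(counts):
    """Turn raw word counts for one page into relative frequencies."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("page contains none of the listed words")
    return {word: count / total for word, count in counts.items()}


beta_k = per_topic_word_distribution(COUNTS)
for word, p in beta_k.items():
    print(f"{word:12s} {COUNTS[word]:4d} {p:.2f}")
```

With these counts the denominator is 291, so "solar" gets 110 / 291 ≈ 0.38 and "system" gets 76 / 291 ≈ 0.26, as on the slide.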
    10. DOCUMENT – TOPIC    DOCUMENT – WORD    TOPIC – WORD
        Θ (D x K)           W (D x W)          β (K x W)
        [Diagram residue. LDA row: Z d,n, W d,n, n, d, k. WIKI row: Wiki (W x K), k, "d = d *".]
        D: Documents, K: Topics, W: Words
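The diagram on slide 10 did not survive extraction, so what follows is one plausible reading and not a confirmed description of the authors' method: a document's topic weights are taken as the product of its word-count vector (one row of the D x W matrix) with a Wikipedia-derived word-by-topic matrix (W x K). In this sketch the "Solar System" counts come from slide 9, while the second page, its counts, the sample document, and the final normalization step are hypothetical.

```python
def normalize(counts):
    """Convert raw counts into relative frequencies."""
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


# Word distributions per Wikipedia page (the K topics).
WIKI_BETA = {
    # Subset of the counts shown on slide 9.
    "Solar System": normalize(
        {"orbit": 34, "dust": 7, "jupiter": 36, "system": 76, "solar": 110}
    ),
    # Hypothetical page with made-up counts, for contrast only.
    "Operating system": normalize({"system": 60, "kernel": 25, "process": 15}),
}


def document_topic_scores(doc_counts, topic_word):
    """Score each topic as sum over words of count(word) * p(word | topic).

    This is one row of the matrix product (D x W) * (W x K), rescaled so
    that the scores sum to 1 (the rescaling is an assumption).
    """
    raw = {
        topic: sum(c * beta.get(w, 0.0) for w, c in doc_counts.items())
        for topic, beta in topic_word.items()
    }
    total = sum(raw.values())
    if total == 0:
        return {topic: 0.0 for topic in raw}
    return {topic: score / total for topic, score in raw.items()}


# Hypothetical document, represented as word counts.
doc = {"jupiter": 2, "orbit": 1, "dust": 1, "system": 1}
theta_d = document_topic_scores(doc, WIKI_BETA)
best = max(theta_d, key=theta_d.get)
print(theta_d, "->", best)
```

Because no inference step is involved, every topic arrives already labeled with its Wikipedia page title, which is the property the talk contrasts with unlabeled LDA topics.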
    11. Experiment
        Data
        • 617 abstracts from the Journal of the ACM
        • Classified into 80 categories by their authors
        • 53 categories have corresponding Wikipedia pages
        Abstracts
        {Article Name: On the (Im)possibility of Obfuscating Programs, Category: D.4. Operating Systems, Add. Category: F.1 Computation by Abstract Devices …}
        Category Mappings
        Category                               Wikipedia Page
        D.4 Operating Systems                  Operating System
        F.1 Computation by Abstract Devices    Abstract Machine
    12. Three variations of our method
        - Inbound links are Wikipedia pages that link to the topic page
        - Outbound links are Wikipedia pages linked to by the topic page
        - Text-based method only uses word distributions in topic pages
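The inbound/outbound distinction on slide 12 is easy to pin down on a small link graph. The sketch below only shows how the two link sets are defined; it does not show how the authors turned them into topic scores, which the slides do not say. The graph, page names, and function names are hypothetical.

```python
# Hypothetical link graph: page -> set of pages it links to.
LINKS = {
    "Operating system": {"Kernel (operating system)", "Process (computing)"},
    "Kernel (operating system)": {"Operating system"},
    "Process (computing)": {"Operating system", "Kernel (operating system)"},
    "Abstract machine": {"Operating system"},
}


def outbound_links(graph, page):
    """Pages that the given topic page links to."""
    return set(graph.get(page, ()))


def inbound_links(graph, page):
    """Pages that contain a link to the given topic page."""
    return {source for source, targets in graph.items() if page in targets}


topic = "Operating system"
print("outbound:", sorted(outbound_links(LINKS, topic)))
print("inbound: ", sorted(inbound_links(LINKS, topic)))
```

The two sets generally differ: here "Abstract machine" links to the topic page without being linked back, so it appears only among the inbound links.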
    13. Results
        Method            Primary        Primary or Additional
        Text              182 (29.5%)    314 (50.8%)
        Inbound links     131 (21.2%)    249 (40.0%)
        Outbound links     79 (12.8%)    166 (26.9%)
        The number (and percentage) of authors’ primary ACM topic labels, or authors’ primary + additional ACM topics, successfully identified by each method.
        LDA cannot be compared without an additional step mapping word distributions to ACM topics.
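The percentages on slide 13 can be checked against the counts, assuming the denominator is the 617 abstracts from slide 11 (the slide itself does not state the denominator). The names below are made up for the example.

```python
TOTAL_ABSTRACTS = 617  # from slide 11; assumed to be the denominator

# method -> (primary matches, primary-or-additional matches), from slide 13
RESULTS = {
    "Text": (182, 314),
    "Inbound links": (131, 249),
    "Outbound links": (79, 166),
}


def percentage(count, total=TOTAL_ABSTRACTS):
    """Share of abstracts, in percent, rounded to one decimal place."""
    return round(100 * count / total, 1)


table = {
    method: (percentage(primary), percentage(either))
    for method, (primary, either) in RESULTS.items()
}
for method, (p_primary, p_either) in table.items():
    print(f"{method:15s} {p_primary:5.1f}% {p_either:5.1f}%")
```

Under this assumption four of the six figures come out exactly as shown on the slide. The other two come out as 50.9% (slide: 50.8%) and 40.4% (slide: 40.0%), so the slide may use a slightly different rounding or denominator for those cells.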
    14. Results (Qualitative)
    15. Concluding Remarks
        The Wiki categories often match the categories that were chosen by the authors. When they don’t match, they generally appear plausible.
        Among the variations of our method, the text-based approach performed better than the link-based approaches.
        Among the link-based approaches, inbound links performed better than outbound links.
    16. Next Steps
        Dependent topic structures
        Combine heuristics with generative models:
          Wikipedia as a prior for the topic distribution
          Learn from the documents observed.