ESWC 2014 Tutorial part 3

http://tutorials.oeg-upm.net/socialweb/


  1. Social Web: Where are the Semantics? ESWC 2014. Miriam Fernández, Víctor Rodríguez-Doncel, Andrés García-Silva, Oscar Corcho. Ontology Engineering Group, UPM, Spain; Knowledge Media Institute, The Open University.
  2. Outline
     • Part 1: Understanding Social Media
       – Theory: background & applications described in this tutorial
       – Hands-on: data extraction from Twitter and Facebook
     • Part 2: Using semantics to represent data from SNS
       – Theory: using the Semantic Web to represent content, users and relations
       – Hands-on: applying and extending SIOC
     • Part 3: Using semantics to understand social media conversations
       – Theory: using semantics to understand topics in social media
       – Hands-on: using LDA to extract topics from social media
     • Part 4: Using semantics to understand user behaviour
  3. Why do we need semantics to understand social media?
     • Information overload
       – We need mechanisms to support:
         • better information search/recommendation
         • better information integration
         • automatic knowledge extraction
     • User-generated content is generally unstructured
       – Machines cannot understand this content!
     "The Semantic Web is an extension of the current Web in which information is given well-defined meaning, better enabling computers and people to work in cooperation." (Tim Berners-Lee, James Hendler, Ora Lassila, The Semantic Web, Scientific American, May 2001)
  4. Implicit vs. explicit semantics
     • Implicit semantics
       – Implicit, also called statistical, semantics focuses on extracting word senses by studying the patterns of human word usage in massive collections of text or other human-generated data.
       – It does not rely on an explicit formalisation/conceptualisation of knowledge.
     • Explicit semantics
       – Explicit semantics focuses on the analysis of content with the support of explicit conceptualisations in the form of ontologies and knowledge bases.
  5. Implicit semantics: topic models
     • Topic models: one possible way of extracting implicit semantics
  6. Bags of words
     Two example documents represented as word-count vectors:

     word        count        word        count
     ESWC        3            ISWC        3
     rank        1            rank        0
     technology  2            venue       1
     conference  1            semantic    5
     venue       1            conference  0
     semantic    5            Web         5
     Web         7            knowledge   0
     knowledge   5
  7. Term-document matrix
     • A very large, sparse matrix: one row per term, one column per document
     • A document can be seen as a vector (its column in the matrix)
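     A minimal sketch of both ideas in Scala (the naive tokenisation rule and all names are illustrative assumptions, not from the slides):

         object BagOfWords {
           // Lowercase and split on non-letters: a deliberately simple tokeniser.
           def counts(doc: String): Map[String, Int] =
             doc.toLowerCase.split("[^\\p{L}]+").filter(_.nonEmpty)
               .groupBy(identity).map { case (w, ws) => (w, ws.length) }

           // Each document becomes a vector over the shared vocabulary (mostly zeros).
           def termDocumentMatrix(docs: Seq[String]): (Seq[String], Seq[Array[Double]]) = {
             val bags  = docs.map(counts)
             val vocab = bags.flatMap(_.keys).distinct.sorted
             (vocab, bags.map(b => vocab.map(w => b.getOrElse(w, 0).toDouble).toArray))
           }
         }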
  8. Similarity measures
     • Useful to answer: how similar are two documents?
       – Distance measures between two document vectors A and B:
         • Cosine similarity = (A · B) / (‖A‖ ‖B‖) = 0.72
         • Jaccard index = |A ∩ B| / |A ∪ B| = 0.50
     • However:
       – synonyms will appear far apart even though they shouldn't
       – polysemous words will appear close even though they shouldn't
     • What is a document talking about?
       – "Explicit Semantic Analysis" (ESA)
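     A sketch of the two measures over the representations built above (names are illustrative):

         object Similarity {
           // Cosine similarity between two equal-length count vectors.
           def cosine(a: Array[Double], b: Array[Double]): Double = {
             val dot = a.zip(b).map { case (x, y) => x * y }.sum
             dot / (math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum))
           }

           // Jaccard index over the sets of words each document actually uses.
           def jaccard(a: Set[String], b: Set[String]): Double =
             (a intersect b).size.toDouble / (a union b).size
         }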
  9. Information retrieval, text mining
     • Classification: documents may belong to different classes
     • How relevant is a word x for a document y (or a class of documents)?

       TF-IDF(x, y) = tf_{x,y} × log(D / D_x)

       where tf_{x,y} = frequency of x in y, D = number of documents, and D_x = number of documents containing x
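     The formula translates directly into code (a sketch; the document representation is the word-count map from before):

         object TfIdf {
           // docs: word counts for every document in the collection.
           def tfIdf(x: String, y: Map[String, Int], docs: Seq[Map[String, Int]]): Double = {
             val tf = y.getOrElse(x, 0).toDouble     // frequency of x in y
             val dx = docs.count(_.contains(x))      // number of documents containing x
             if (dx == 0) 0.0 else tf * math.log(docs.size.toDouble / dx)
           }
         }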
  10. Latent Semantic Analysis
      • The term-document matrix is large; Singular Value Decomposition (SVD) is used to reduce its dimensionality: X = U D V^T
      • Latent Semantic Analysis: the rank of D can be reduced by keeping only the largest singular values
      • Meaning:
        – U = term-topic correlation
        – D = topic importance (the singular values)
        – V = document-topic correlation
      (Figure taken from http://faculty.washington.edu/jwilker/559/SteyversGriffiths.pdf)
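      A minimal LSA sketch using the Breeze linear-algebra library (the choice of library is an assumption; the slides do not prescribe one):

          import breeze.linalg.{DenseMatrix, svd}

          object Lsa {
            // Truncate the SVD to the k largest singular values: the reduced
            // U, D and V^T give the "latent topic" space described above.
            def reduce(termDoc: DenseMatrix[Double], k: Int) = {
              val svd.SVD(u, s, vt) = svd(termDoc)
              (u(::, 0 until k), s(0 until k), vt(0 until k, ::))
            }
          }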
  11. Semantics of a social media message
  12. Topics
  13. Discriminative models vs. generative models
      • Discriminative models (1 step):
        1. Directly infer the posterior probabilities p(C_k|x)
      • Generative models (2 steps):
        1. Infer the class-conditional densities p(x|C_k) and the priors p(C_k)
        2. Use Bayes' theorem to determine the posterior probabilities:

           p(C_1|x) = p(x|C_1) p(C_1) / (p(x|C_1) p(C_1) + p(x|C_2) p(C_2))

      • With a generative model we can generate samples x that are likely to have been produced by class C_1
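      A toy two-class instance of the formula (all numbers are made up for illustration):

          object BayesExample {
            val pC1 = 0.3; val pC2 = 0.7      // priors p(C1), p(C2)
            val pxC1 = 0.8; val pxC2 = 0.1    // likelihoods p(x|C1), p(x|C2)

            // Posterior p(C1|x) via Bayes' theorem: 0.24 / 0.31 ≈ 0.774
            val posteriorC1 = pxC1 * pC1 / (pxC1 * pC1 + pxC2 * pC2)
          }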
  14. Generative model
  15. LDA: a probabilistic generative model
      This is a probabilistic generative process: we can generate documents according to certain topics.
  16. Topic models
  17. Topic models
  18. Topic models
      • Topics known a priori vs. latent topics
      • With latent topics:
        – we don't know the topics in advance
        – we don't know the importance of each word in a topic
      • Latent topics are not pre-specified but found from the corpus
  19. Topic models vs. LSA
      • Latent Semantic Analysis (where Singular Value Decomposition reduces dimensionality) vs. topic models
      (Figure taken from http://faculty.washington.edu/jwilker/559/SteyversGriffiths.pdf)
  20. Topic models
      • How important is a word in a topic?
      • How important is a topic in a document?
  21. Topic models: LDA
      • Latent Dirichlet Allocation (LDA)
        – D documents, using a total of W words; K topics
        – LDA: each document d is a mixture of the K topics, with each topic being a multinomial distribution over a vocabulary of W words
        – θ_d: topic distribution for document d (~ Dirichlet(α))
        – ϕ_z: word probabilities for topic z (~ Dirichlet(β))
  22. Topic models
      • Latent Dirichlet Allocation (LDA)
        – Each document d is a mixture of the K topics, each topic being a multinomial distribution over a vocabulary of W words
        – Probability of a word: P(w_i) = Σ_{z=1..K} P(w_i | z) P(z | d), i.e. Σ_z ϕ_z(w_i) θ_d(z)
        – θ_d: topic distribution for document d (~ Dirichlet(α))
        – ϕ_z: word probabilities for topic z (~ Dirichlet(β))
        – Reminder: the Gamma function appears in the Dirichlet density
  23. Topic models
      • Joint probability distribution of all the random variables for a document, assuming we know α and β:

        p(θ, ϕ, z, w | α, β) = p(θ | α) p(ϕ | β) Π_i p(z_i | θ) p(w_i | ϕ_{z_i})

        – the product runs over each word in the document
        – p(w_i | ϕ_{z_i}): probability of a word in a document, knowing the distribution of words in its topic
        – p(z_i | θ): probability of a topic in a document, knowing the topic distribution for the document
        – p(θ | α), p(ϕ | β): the Dirichlet densities declared before
      • Given a set of documents D and the LDA model, we can use inference to find θ, ϕ and the topic assignment for each word
      • This is an intractable problem, but numerically solvable with Gibbs sampling (a Markov chain Monte Carlo method); a toy sampler is sketched below
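      A toy collapsed Gibbs sampler, only to make the resampling rule concrete (the corpus encoding and all names are illustrative assumptions; real implementations are listed on the next slide):

          import scala.util.Random

          object ToyLdaGibbs {
            // docs: each document is an array of word ids in 0 until W.
            def run(docs: Array[Array[Int]], W: Int, K: Int, alpha: Double, beta: Double,
                    iters: Int): (Array[Array[Int]], Array[Array[Int]]) = {
              val rnd = new Random(42)
              val ndk = Array.ofDim[Int](docs.length, K)   // topic counts per document
              val nkw = Array.ofDim[Int](K, W)             // word counts per topic
              val nk  = new Array[Int](K)                  // total words per topic
              // Random initial topic assignment for every word token.
              val z = docs.map(_.map(_ => rnd.nextInt(K)))
              for (d <- docs.indices; i <- docs(d).indices) {
                ndk(d)(z(d)(i)) += 1; nkw(z(d)(i))(docs(d)(i)) += 1; nk(z(d)(i)) += 1
              }
              for (_ <- 1 to iters; d <- docs.indices; i <- docs(d).indices) {
                val w = docs(d)(i); val old = z(d)(i)
                // Remove the current assignment from the counts...
                ndk(d)(old) -= 1; nkw(old)(w) -= 1; nk(old) -= 1
                // ...and resample: p(z = k | rest) ∝ (n_dk + α)(n_kw + β)/(n_k + Wβ)
                val p = Array.tabulate(K)(k =>
                  (ndk(d)(k) + alpha) * (nkw(k)(w) + beta) / (nk(k) + W * beta))
                var u = rnd.nextDouble() * p.sum; var k = 0
                while (u > p(k)) { u -= p(k); k += 1 }
                z(d)(i) = k
                ndk(d)(k) += 1; nkw(k)(w) += 1; nk(k) += 1
              }
              (ndk, nkw)   // normalising these (with the priors) estimates θ and ϕ
            }
          }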
  24. Topic models: summary
      • Latent Dirichlet Allocation is a generative probabilistic model which can discover latent topics in unlabelled data
      • Labelled LDA: a supervised version
      • Implementations (a usage sketch follows below):
        – Mallet
        – Stanford Topic Modelling Toolbox
      • Applications:
        – similarity between two documents
        – classification of texts
        – indexing of documents
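      A minimal sketch of driving Mallet from Scala (the pipe chain and parameter values are typical choices, not prescribed by the tutorial):

          import java.util.regex.Pattern
          import cc.mallet.pipe.{Pipe, CharSequence2TokenSequence, SerialPipes,
                                 TokenSequence2FeatureSequence, TokenSequenceLowercase}
          import cc.mallet.topics.ParallelTopicModel
          import cc.mallet.types.{Instance, InstanceList}

          object MalletLda {
            def topics(texts: Seq[String], k: Int = 20): ParallelTopicModel = {
              // Tokenise, lowercase, and map tokens to feature ids.
              val pipes = new SerialPipes(java.util.Arrays.asList[Pipe](
                new CharSequence2TokenSequence(Pattern.compile("\\p{L}+")),
                new TokenSequenceLowercase(),
                new TokenSequence2FeatureSequence()))
              val instances = new InstanceList(pipes)
              texts.zipWithIndex.foreach { case (t, i) =>
                instances.addThruPipe(new Instance(t, null, s"doc$i", null))
              }
              // k topics; alphaSum and beta set to commonly used default priors.
              val model = new ParallelTopicModel(k, 50.0 / k, 0.01)
              model.addInstances(instances)
              model.setNumIterations(1000)
              model.estimate()
              model
            }
          }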
  25. Some references
      • D. Blei, A. Ng, and M. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993-1022, January 2003.
      • T. Griffiths and M. Steyvers. Finding Scientific Topics. Proceedings of the National Academy of Sciences, 101 (suppl. 1):5228-5235, 2004.
      • A. García-Silva, V. Rodríguez-Doncel, and O. Corcho. Semantic Characterization of Tweets Using Topic Models: A Use Case in the Entertainment Domain. International Journal on Semantic Web and Information Systems (IJSWIS), 9(3):1-13, 2013.
  26. Social Web: Where are the Semantics? ESWC 2014. Víctor, Andrés, Oscar, Miriam. Universidad Politécnica de Madrid, Spain; Knowledge Media Institute, The Open University.
  27. Mixture model
      • We observe some data generated by a mixture of distributions, and we want to learn the parameters of those distributions.
      • A mixture model is a probabilistic model representing a linear combination of several probability density functions (PDFs).
      • We don't even see the colours (i.e. which component generated each observation)!
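      A sketch of the generative view: sampling from a two-component Gaussian mixture (the parameters are made up; recovering them from the samples alone, without seeing the "colours", is the learning problem, typically solved with EM):

          import scala.util.Random

          object MixtureDemo {
            val rnd     = new Random(7)
            val weights = Array(0.4, 0.6)               // mixing weights
            val params  = Array((0.0, 1.0), (5.0, 0.5)) // (mean, std dev) per component

            def sample(): Double = {
              val c = if (rnd.nextDouble() < weights(0)) 0 else 1  // the hidden "colour"
              val (mu, sigma) = params(c)
              mu + sigma * rnd.nextGaussian()                      // we only observe this value
            }
          }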
  28. Rule based
  29. Classes
      Four pairs of opposite sentiment classes, each pair abbreviated by its two poles, with a positive and a negative side:
      • SD: Satisfaction (+) / Dissatisfaction (-)
      • LH: Love (+) / Hate (-)
      • TF: Trust (+) / Fear (-)
      • HS: Happiness (+) / Sadness (-)
  30. Rule grammar
      import scala.util.parsing.combinator.syntactical.StandardTokenParsers

      object RuleGrammar extends StandardTokenParsers {
        lexical.delimiters += ("[", "]", "#", "->", "+", "-", "*", "/", "=", ".")
        // Registered as a delimiter so the lexer recognises the entity marker as one token.
        lexical.delimiters += ("<ENTITY>")
        lexical.reserved += ("_")

        // A rule set is a sequence of replacement and classification rules.
        def rule_set      = (replace_rule | classify_rule).*
        def replace_rule  = stringLit.* ~ "=" ~ stringLit
        def classify_rule = morpho_sequence ~ "->" ~ action

        // A morphological pattern: words, lemmas, POS tags, entities and wildcards.
        def morpho_sequence  = (word | lemma_pos | lemma | pos | entity | wildcard | limited_wildcard).*
        def word             = stringLit
        def lemma            = ident
        def pos              = "[" ~> ident <~ "]"
        def lemma_pos        = ident ~ "#" ~ ident
        def entity           = "<ENTITY>"
        def wildcard         = "*"
        def limited_wildcard = "/" ~> numericLit <~ "/"

        // An action either adjusts a class score or chunks the match.
        def action          = classify_action | chunk_action
        def classify_action = ident ~ ("+" | "-" | "*" | "/") ~ number
        def number          = numericLit | negative
        def negative        = "-" ~ numericLit
        def chunk_action    = ident <~ "."
      }

      Examples (Spanish rules, with English glosses):
      • "(cómo/cada día) odio (más) a (el/la/esta/…) ent", i.e. "(how/every day) I hate (more) the entity":
        /2/ ODIAR#V /1/ A#SP /1/ <ENTITY> -> LH -2
      • "mi/este odio a/por ent", i.e. "my/this hatred of/for the entity":
        D ODIO#NC [SP] <ENTITY> -> LH -1
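      A sketch of how the grammar might be driven (the slide shows no driver; the scanner/phrase pattern below is the standard one for StandardTokenParsers):

          object RuleGrammarDemo {
            def main(args: Array[String]): Unit = {
              // One classification rule from the examples above.
              val input  = "/2/ ODIAR#V /1/ A#SP /1/ <ENTITY> -> LH -2"
              val tokens = new RuleGrammar.lexical.Scanner(input)
              RuleGrammar.phrase(RuleGrammar.rule_set)(tokens) match {
                case RuleGrammar.Success(rules, _) => println(s"Parsed: $rules")
                case failure                       => println(s"Failed: $failure")
              }
            }
          }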
