Antonio Moreno
[Carlos Vicient]
Seminar at Poznan University of Technology, June 2014
Automatic and unsupervised topic discovery in social networks

Research seminar given at the Poznan University of Technology, Poland, June 2014. The topic was the automatic and unsupervised discovery of topics in social networks.

Published in: Technology, Education

  1. Antonio Moreno, [Carlos Vicient]. Seminar at Poznan University of Technology, June 2014
  2. • Introduction
     • Methodology of analysis
     • Case study
     • Conclusions and future work
  3. • Web 2.0 (Social Web)
       ◦ Huge amount of highly heterogeneous and unstructured user-generated data in the Web (e.g. Wikipedia, blogs) and in social networks (e.g. Facebook, Twitter)
     • Global aim of our work
       ◦ Develop tools based on Artificial Intelligence techniques that can analyze all this information in an automatic and unsupervised way and build knowledge structures
     • Some previous works
       ◦ Ontology-Based Information Extraction
       ◦ Ontology Learning from the Web
  4. Ontology Learning from Web pages. PhD thesis, D. Sánchez (2007)
  5. • Focus on Social Networks – Twitter
     • 500 million short messages (tweets) per day
     • Hashtags
  6. • Hashtags can be taken as indicators of the topic of a tweet
     • Given a large number of tweets, most approaches to automatic topic detection try to cluster tweets (or cluster hashtags) in some way
     • Most usual solution: cluster hashtags considering their syntactic co-occurrence
  7. • Synonymy: #illness, #disease
     • Polysemy: #operation
     • Lexical similarity: #pharmaceutical, #pharmaceuticals, #pharma, #pharmacy, #pharmacology
     • Acronyms: #AIDS, #HIV
     • Named entities: #MayoClinic, #AustinCancerCentre
     • Concatenation: #HighBloodPressure, #lungcancer
     • Feelings: #CancerSucks
     • Invented words, nonsense
  8. • A semantic management of hashtags will provide a more coherent classification than the usual ones based on syntactic co-occurrence.
     • Remainder of the talk:
       ◦ Unsupervised semantic clustering of hashtags
       ◦ Case study – Medical tweets
  9. • After obtaining the hashtags from a given corpus of tweets, a three-step analytic process is applied:
       ◦ Semantic annotation of hashtags
       ◦ Hashtag clustering
       ◦ Selection of relevant clusters
  10. • Idea: give meaning to each hashtag by linking it to a WordNet concept
        ◦ #SagradaFamilia => Church
        ◦ #LFC => Football Club
      • Rationale: if we are able to associate each hashtag with a concept in an ontology, we will be able to apply ontology-based semantic similarity measures to know the degree of relationship between pairs of hashtags
  11. • Step 1: The hashtag matches directly with a WordNet concept
        ◦ Word-breaking techniques and iterative prefix/suffix analysis are applied
        ◦ #Cathedral, #GothicCathedral match with the “Cathedral” concept
      • Easy, but most hashtags do not appear directly in WordNet
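The direct-matching step on slide 11 can be sketched roughly as follows. This is only an illustration under stated assumptions: `VOCAB` is a tiny hypothetical stand-in for WordNet's noun entries, the word breaking relies on capitalisation alone, and the function names `split_hashtag` and `match_concept` are not taken from the talk.

```python
import re

# Hypothetical stand-in for WordNet noun entries (assumption, not real WordNet data).
VOCAB = {"cathedral", "church", "high blood pressure"}

def split_hashtag(tag):
    """Break a CamelCase hashtag into lowercase words using capitalisation only."""
    body = tag.lstrip("#")
    # Runs of capitals (acronyms), capitalised or lowercase words, and digit runs.
    words = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", body)
    return [w.lower() for w in words]

def match_concept(tag, vocab=VOCAB):
    """Try the full word sequence first, then progressively shorter suffixes."""
    words = split_hashtag(tag)
    for start in range(len(words)):
        candidate = " ".join(words[start:])
        if candidate in vocab:
            return candidate
    return None  # not found directly; such hashtags need a further annotation step

print(match_concept("#GothicCathedral"))
print(match_concept("#SagradaFamilia"))
```

A hashtag written entirely in lowercase, such as #lungcancer, is not split by this sketch; a dictionary-based word breaker would be needed for that case.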
  12. [Diagram: WordNet]
  13. [Diagram: WordNet]
  14. [Diagram: WordNet, ?]
  15. [Diagram: WordNet, ?] #SagradaFamilia => {building, church, basilica}
  16. • At this point each hashtag h is associated with one (or several) WordNet concepts Lh
        ◦ The hashtags that have not been annotated in the previous step are dismissed
      • In order to apply a clustering process it is necessary to define a measure of semantic similarity between pairs of hashtags (i.e. between pairs of lists of WordNet concepts)
  17. • We have considered that the similarity between two hashtags h1 and h2 is the maximum similarity between a concept in Lh1 and a concept in Lh2
        ◦ Any ontology-based semantic similarity measure between concepts could be applied
      [Figure: h1 with concepts C1, C2; h2 with concepts C3, C4, C5; pairwise similarities 0.2, 0.1, 0.5, 0.6, 0.3, 0.1]
  18. Using this similarity between hashtags, we perform a hierarchical clustering of the set of hashtags
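The maximum-over-concept-pairs rule from slide 17 can be written in a few lines. In this sketch the function name `hashtag_similarity` is illustrative, and the assignment of the six figure values to specific concept pairs is an assumption, since the slide text does not preserve which value belongs to which pair.

```python
from itertools import product

def hashtag_similarity(concepts_h1, concepts_h2, concept_sim):
    """Similarity of two hashtags = best similarity over all concept pairs."""
    return max(
        (concept_sim(a, b) for a, b in product(concepts_h1, concepts_h2)),
        default=0.0,  # only reached if one hashtag has no concepts
    )

# Assumed mapping of the six values in the figure to concept pairs.
PAIR_SIM = {
    ("C1", "C3"): 0.2, ("C1", "C4"): 0.1, ("C1", "C5"): 0.5,
    ("C2", "C3"): 0.6, ("C2", "C4"): 0.3, ("C2", "C5"): 0.1,
}

def toy_concept_sim(a, b):
    return PAIR_SIM[(a, b)]

print(hashtag_similarity(["C1", "C2"], ["C3", "C4", "C5"], toy_concept_sim))  # 0.6
```

Whatever the true mapping of values to pairs, the result for this example is the largest of the six values, 0.6. A matrix of such pairwise hashtag similarities is what the hierarchical clustering would take as input.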
  19. • Due to the nature of social tags, traditional clustering methods provide solutions with a large number of irrelevant classes
      • It is important to analyse the clustering tree and determine which classes of hashtags are good enough to be shown to the user
  20. filtering (HC, minK, maxK, t1, t2)
        finalClusts := Ø
        forall k in maxK .. minK
          forall c in 1 .. k
            b := inter-cluster-homogeneity(HCkc)
            if ((b >= t1) && (|HCkc| >= t2) && (∄ e in finalClusts | e ⊆ HCkc))
              Add HCkc to finalClusts
        return finalClusts
      Cut the tree and obtain k classes
  21. Compute the homogeneity of each class
  22. A class is selected if it is big enough, it is homogeneous enough, and it is not a superset of any of the previously selected classes. A semantic centroid of each selected class is calculated
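The filtering pseudocode of slides 20 to 22 translates fairly directly into Python. The sketch below assumes two caller-supplied functions that the slides do not define: `cut(k)`, which returns the k classes obtained by cutting the clustering tree, and `homogeneity(cluster)`, which returns a score in [0, 1]. The toy cuts and scores are invented purely for illustration.

```python
def filter_clusters(cut, homogeneity, min_k, max_k, t1, t2):
    """Select classes that are homogeneous (>= t1), large (>= t2) and
    not a superset of a class that was already selected."""
    final = []
    for k in range(max_k, min_k - 1, -1):   # from fine cuts to coarse cuts
        for cluster in cut(k):
            cluster = frozenset(cluster)
            if (homogeneity(cluster) >= t1
                    and len(cluster) >= t2
                    and not any(e <= cluster for e in final)):
                final.append(cluster)
    return final

# Invented toy tree cuts and homogeneity scores (assumptions, not talk data).
CUTS = {
    3: [{"a", "b"}, {"c", "d"}, {"e"}],
    2: [{"a", "b", "c", "d"}, {"e"}],
    1: [{"a", "b", "c", "d", "e"}],
}
SCORES = {
    frozenset("ab"): 0.9, frozenset("cd"): 0.6, frozenset("e"): 1.0,
    frozenset("abcd"): 0.8, frozenset("abcde"): 0.3,
}

selected = filter_clusters(CUTS.__getitem__, SCORES.__getitem__,
                           min_k=1, max_k=3, t1=0.7, t2=2)
print(selected)
```

In this toy run only {a, b} is kept: {c, d} is not homogeneous enough, {e} is too small, and {a, b, c, d} is rejected because it contains the already selected {a, b}.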
  23. • 5000 medical tweets related to Oncology, extracted from Symplur (www.symplur.com)
      • From October 31st 2012 to January 11th 2013
      • The set contains 1086 different hashtags
      • Using the WordNet + Wikipedia semantic annotation process, 930 hashtags (85.6%) were annotated
        ◦ Half of the annotations are made in the first step (WordNet) and the other half in the second step (Wikipedia)
        ◦ 156 hashtags (14.4%) were removed
      [Figure: histogram of #tweets by hashtags/tweet, x-axis 1 to 12; bar labels 2530, 793, 769, 371, 293, 129, 52, 30, 2, 3, 24, 1]
  24. • The remaining 930 hashtags were manually examined.
        ◦ 536 (57.6%) were relevant medical hashtags, and they were classified in 16 manually labelled categories
          • Organs, professions, medical tests, etc.
        ◦ 394 (42.4%) were considered noisy or unrelated to Medicine
  25. • Wu-Palmer semantic similarity measure
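For reference, the Wu-Palmer measure scores two concepts as 2·depth(LCS) / (depth(c1) + depth(c2)), where LCS is their deepest common ancestor. The sketch below applies it to a small hypothetical taxonomy (not WordNet's actual hierarchy) and counts the root as depth 1, which is one common convention and an assumption here.

```python
# Hypothetical single-parent taxonomy; child -> parent, root has parent None.
PARENT = {
    "entity": None,
    "structure": "entity",
    "building": "structure",
    "church": "building",
    "cathedral": "church",
    "hospital": "building",
}

def path_to_root(concept):
    """Return [concept, parent, ..., root]."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path

def wu_palmer(c1, c2):
    """2 * depth(LCS) / (depth(c1) + depth(c2)), with the root at depth 1."""
    p1, p2 = path_to_root(c1), path_to_root(c2)
    ancestors2 = set(p2)
    lcs = next(c for c in p1 if c in ancestors2)  # deepest shared ancestor
    return 2 * len(path_to_root(lcs)) / (len(p1) + len(p2))

print(wu_palmer("cathedral", "hospital"))
print(wu_palmer("church", "cathedral"))
```

In this toy tree, "cathedral" and "hospital" meet at "building" (depth 3), giving 6/9, while "church" and "cathedral" meet at "church" (depth 4), giving 8/9.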
  26. • maxK=200, minK=5
        ◦ The algorithm proceeds from the cut that divides the set in 200 classes up to the cut that divides the set in 5 classes; thus, it moves from more particular classes to more general classes
      • t1: minimum inter-class-homogeneity
        ◦ All the values between 0 and 1 (in 0.1 steps) were tested.
        ◦ In this talk I will consider the value 0.70.
      • t2: minimum number of elements
        ◦ All the even values between 2 and 20 were tested.
        ◦ In this talk I will consider the value 10.
      • With these parameters, 31 classes were obtained
  27. • A: Manual set of 16 correct classes (536 HTs) + a noisy 17th class (394 HTs)
      • B: Set of 31 classes (930 HTs) obtained by the system
      • We calculate, for each class Bi in B
        ◦ Its semantic centroid
        ◦ Which is the class Aj in A with which it shares the most elements
          • Precision: How many items of Bi belong to Aj
          • Recall: How many items of Aj appear in Bi
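The per-class evaluation on slide 27 can be illustrated as follows, expressing precision and recall as fractions. The manual classes and the system class in this sketch are invented toy data, and `best_match` is an illustrative name.

```python
def best_match(b_class, manual_classes):
    """Find the manual class sharing the most items with b_class and
    return (label, precision, recall) with respect to it."""
    label, a_class = max(manual_classes.items(),
                         key=lambda kv: len(b_class & kv[1]))
    shared = len(b_class & a_class)
    precision = shared / len(b_class)   # share of Bi that belongs to Aj
    recall = shared / len(a_class)      # share of Aj that appears in Bi
    return label, precision, recall

# Invented toy data (not the classes from the case study).
MANUAL = {
    "organs": {"#lung", "#liver", "#kidney", "#heart"},
    "professions": {"#nurse", "#oncologist"},
}
system_class = {"#lung", "#liver", "#nurse"}

print(best_match(system_class, MANUAL))
```

Here the system class shares two items with "organs", so precision is 2/3 and recall is 2/4.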
  28. [Table: Classes in B (semantic centroid, size) | Best matching classes in A (precision, recall, manual label)]
  29. • The unsupervised analysis of the set of HTs contained in a corpus of tweets is very hard, because half of them may be noisy or unrelated to the domain, and they have a very heterogeneous nature
      • Our hypothesis is that semantic measures of similarity between HTs will lead to better classifications than standard co-occurrence techniques
      • In a test on 5000 medical tweets, 13 of the 16 manually labelled classes are found, with different degrees of precision and recall
  30. • Evaluate the quality of the semantic annotation step
      • Test different ontology-based semantic similarity measures in the clustering step
      • Explore in depth the influence of the thresholds on the selection step
      • Obtain as a result a hierarchy of classes at different levels of abstraction, rather than a partition
      • Test the system on different sets of tweets
        ◦ Size: from thousands to millions of tweets
        ◦ Domain: uni-domain or general corpus
  31. Antonio Moreno, [Carlos Vicient]
