Knoesis-Semantic filtering-Tutorials

Semantic Filtering as an example of Semantic technologies for real-time analysis. This presentation emphasizes the value of semantics for social data filtering, specifically for the challenges faced during dynamically evolving event analysis.

Published in: Education

  1. Semantic Filtering: An example of Semantic technologies for real-time analysis. Pavan Kapanipathi, Ohio Center of Excellence in Knowledge-enabled Computing (Kno.e.sis), Wright State University, USA. Tutorial @ Kno.e.sis Centre: Semantics Approach to Big Data and Event Processing, Oct 7-9, 2015
  2. Streams are everywhere: Social Data; Text; Images; Videos; Sensor Data; Streams
  3. Information Overload: 500M users generate 500M tweets per day. "It’s not information overload. It’s filter failure" -- Clay Shirky
  4. Each of our projects faces Information Overload • Disaster Management • Hazards SEES • Healthcare Issues • Depression • Societal Issues • Edrug Trends • Harassment
  5. Semantic Filtering • Filtering is necessary • Understanding the requirements and utilizing semantics for filtering is important
  6. Two Main Topics • Twarql: streaming annotation and flexible querying on Twitter • Continuous Semantics: tracking dynamic topics on Twitter
  7. Twarql: Tracking the health care debate in the United States on social media. Figure labels: Health Care Reform; Healthcare reform legislation in the United States; Patient Protection and Affordable Care Act (Obamacare)
  8. Twarql
  9. Extraction Pipeline - Tweet: "I think it’s good deal Apple Ipad Tablet (3G, wifi, WiFi + 3G) Hard Nylon Cube Carrying Case for ipad ( iPad.. http://bit.ly/cry6LF)". Extracted: Dbpedia:Ipad; Dbpedia:Tablet; URLs. http://penguinkang.com/tweetprobe/
  10. RDF • RDF Annotation • Common RDF/OWL data formats: FOAF, SIOC, OPO, MOAT
  11. Twarql – Use Case (:Health_care_reform)
  12. Demo: http://knoesis.wright.edu/library/tools/twarql/demo.swf
  13. Continuous Semantics
  14. Dynamic Topics: Continuously evolving on Twitter; Entity – Event relevance changes; Many entities are involved
  15. Dynamic Topics: Manually crawl using keywords: “indianelection”, “jan25”, “sandy”, “swineflu”, “ebola”
  16. Dynamic Topics: Manually updating keywords to get topic-relevant tweets is not feasible. “indianelection”, “modi”, “bjp”, “congress”; “jan25”, “egypt”, “tunisia”, “arabspring”; “sandy”, “newyork”, “redcross”, “fema”; “swineflu”; “ebola”
  17. Problem: How can we automatically update the filters to track a dynamically evolving topic on Twitter?
  18. Hashtags as Filters • Identify a topic on Twitter • Tweets with hashtags are more informative • Users have a lot of freedom to create them • Some get popular, most die
  19. Exploring Hashtags as Evolving Filters for Dynamic Topics (Hashtag Filters). Colorado Shooting (CS): Tweets: 122,062; Tags: 192,512; Distinct: 12,350; 100% Retrieval: 7,763. Occupy Wall Street (OWS): Tweets: 6,077,378; Tags: 15,963,209; Distinct: 191,602; 100% Retrieval: 21,314
  20. Hashtag distributions: The top 1% of hashtags retrieves around 85% of the tweets
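The coverage figure on the hashtag distribution slide can be reproduced on any collection with a few lines of code. The sketch below is illustrative only: the function name `coverage_of_top_hashtags`, the input shape (one set of hashtags per tweet), and the toy data are assumptions, not part of the original study.

```python
import math
from collections import Counter


def coverage_of_top_hashtags(tweets, fraction):
    """Share of tweets that contain at least one of the most frequent hashtags.

    tweets   -- list of sets of hashtags, one set per tweet (assumed input shape)
    fraction -- share of distinct hashtags to keep, e.g. 0.01 for the top 1%
    """
    if not tweets:
        return 0.0
    # Count each hashtag once per tweet.
    counts = Counter(tag for tags in tweets for tag in set(tags))
    if not counts:
        return 0.0
    # Keep at least one hashtag even for very small collections.
    n_top = max(1, math.ceil(len(counts) * fraction))
    top = {tag for tag, _ in counts.most_common(n_top)}
    covered = sum(1 for tags in tweets if top & set(tags))
    return covered / len(tweets)


# Toy data (hypothetical): "#a" is the single most frequent hashtag.
toy_tweets = [{"#a"}, {"#a", "#b"}, {"#a"}, {"#c"}]
toy_coverage = coverage_of_top_hashtags(toy_tweets, 0.3)
```

On the toy data the single most frequent hashtag already covers three of the four tweets, which is the same skew the slide reports at a much larger scale.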
  21. Hashtag Filters Co-occurrence Graph (Colorado Shooting, Occupy Wall Street): Event-related hashtags co-occur with each other
  22. Summarizing Hashtag Analysis: Starting with one of the event-relevant hashtags, by co-occurrence we can reach other relevant hashtags
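The idea summarized above, reaching further hashtags from one seed hashtag through co-occurrence, can be sketched as a simple counting step. This is a minimal sketch under stated assumptions: tweets are given as sets of hashtags, and `min_count` is a hypothetical cut-off introduced here for illustration.

```python
from collections import Counter


def cooccurring_hashtags(tweets, seed, min_count=1):
    """Hashtags that appear together with `seed`, most frequent first.

    tweets    -- list of sets of hashtags, one set per tweet (assumed input shape)
    seed      -- the event-relevant hashtag to start from
    min_count -- hypothetical cut-off: minimum number of shared tweets
    """
    seed = seed.lower()
    counts = Counter()
    for tags in tweets:
        tags = {t.lower() for t in tags}
        if seed in tags:
            counts.update(tags - {seed})
    return [(tag, n) for tag, n in counts.most_common() if n >= min_count]


# Toy data (hypothetical tweets).
toy_tweets = [
    {"#indianelection2015", "#modikisarkar"},
    {"#indianelection2015", "#modikisarkar", "#bjp"},
    {"#indianelection2015", "#cricket"},
    {"#cricket"},
]
candidates = cooccurring_hashtags(toy_tweets, "#indianelection2015", min_count=2)
```

Counting alone only produces candidates. An unrelated tag such as "#cricket" would also appear with a lower cut-off.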
  23. Determining Relevancy of Co-occurring Hashtags: #indianelection2015; #modikisarkar; Too many co-occurring hashtags
  24. Determining Relevancy of Co-occurring Hashtags: #indianelection2015; Co-occurring: #modikisarkar; Threshold δ; Preferably a prominent hashtag
  25. Hashtag Co-occurrence works? o No. Just co-occurrence does not work o Many noisy or unrelated hashtags co-occur o Determine the “dynamic” relevance of the top co-occurring hashtag with the dynamic topic
  26. Determining Relevancy of Co-occurring Hashtags (Vector Space Model): #indianelection2015; Co-occurring: #modikisarkar; Threshold δ; Latest K (200, 500); Entity Extraction and Scoring (Normalized Frequency Scoring): Narendra Modi: 0.9, BJP: 0.7, NDA: 0.6, India: 0.4, Elections: 0.2, Rahul Gandhi: 0.2, Congress: 0.2
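The entity scoring step for a co-occurring hashtag might look roughly like the following. The slide does not give the normalization, so this sketch assumes a score equal to the fraction of the latest K tweets that mention the entity. Entity extraction is assumed to have been done already, and the function name `entity_scores` is hypothetical.

```python
from collections import Counter


def entity_scores(tweet_entities, k):
    """Score entities by the share of the latest k tweets that mention them.

    tweet_entities -- list of sets of entity names, oldest tweet first
                      (assumed output of a separate entity extraction step)
    k              -- number of most recent tweets to consider (must be > 0)
    """
    if k <= 0:
        raise ValueError("k must be positive")
    latest = tweet_entities[-k:]
    if not latest:
        return {}
    counts = Counter()
    for entities in latest:
        counts.update(set(entities))  # count each entity once per tweet
    # Assumed normalization: divide by the number of tweets in the window.
    return {entity: n / len(latest) for entity, n in counts.items()}


# Toy data (hypothetical): entities found in four tweets of one hashtag.
toy_entities = [
    {"Narendra Modi", "BJP"},
    {"Narendra Modi"},
    {"India"},
    {"Narendra Modi", "BJP"},
]
hashtag_vector = entity_scores(toy_entities, k=4)
```

The resulting dictionary plays the role of the weighted entity vector shown for #modikisarkar on the slide.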
  27. Determining Relevancy of Co-occurring Hashtags (Vector Space Model): #indianelection2015; Co-occurring: #modikisarkar; Threshold δ; Latest K (200, 500); Entity Extraction and Scoring: Narendra Modi: 0.9, BJP: 0.7, NDA: 0.6, India: 0.4, Elections: 0.2, Rahul Gandhi: 0.2, Congress: 0.2; Indian General Election, 2014: Dynamically Updated Background Knowledge
  28. Event Relevant Background Knowledge o Wikipedia Event Pages
  29. Event Relevant Background Knowledge o Wikipedia Event Pages
  30. Event Relevant Background Knowledge o Entities mentioned on the Event page of Wikipedia are relevant to the Event
  31. Event Relevant Background Knowledge – Graph Structure o Wikipedia’s hyperlink structure is very rich o Page-Page (Wikipedia) links: Indian General Election, 2014; Narendra Modi; Rahul Gandhi; NDA (India); UPA (India); BJP; Indian National Congress
  32. Determining Relevancy of Co-occurring Hashtags (Vector Space Model): #indianelection2015; Co-occurring: #modikisarkar; Threshold δ; Latest K (200, 500); Entity Extraction and Scoring: Narendra Modi: 0.9, BJP: 0.7, NDA: 0.6, India: 0.4, Elections: 0.2, Rahul Gandhi: 0.2, Congress: 0.2; Indian General Election, 2014: Extract, Periodically Update Hyperlink structure; One hop from Event Page
  33. Event Relevant Background Knowledge o Hyperlink structure is dynamically updated: Indian General Election, 2014; Narendra Modi; Rahul Gandhi; NDA (India); UPA (India); BJP; Indian National Congress. Link dates: 10 May 2010
  34. Event Relevant Background Knowledge o Hyperlink structure is dynamically updated: Indian General Election, 2014; Narendra Modi; Rahul Gandhi; NDA (India); UPA (India); BJP; Indian National Congress. Link dates: 10 May 2010; 29 March 2013 (four links)
  35. Event Relevant Background Knowledge o Hyperlink structure is dynamically updated: Indian General Election, 2014; Narendra Modi; Rahul Gandhi; NDA (India); UPA (India); BJP; Indian National Congress. Link dates: 10 May 2010; 29 March 2013 (four links); 20 May 2013 (two links)
  36. Determining Relevancy of Co-occurring Hashtags (Vector Space Model): #indianelection2015; Co-occurring: #modikisarkar; Threshold δ; Latest K (200, 500); Entity Extraction and Scoring: Narendra Modi: 0.9, BJP: 0.7, NDA: 0.6, India: 0.4, Elections: 0.2, Rahul Gandhi: 0.2, Congress: 0.2; Indian General Election, 2014: Extract, Periodically Update Hyperlink structure; One hop from Event Page; Entity scoring based on relevance to the Event
  37. Hyperlink Entity Scoring o Edge Based Measure o Link Overlap Measure: Jaccard similarity o Out(c) are the links in Wikipedia page “c” o Final Score: r(c,E) = ed(c,E) + oco(c,E). Figure labels: India General Election, 2014; Narendra Modi; India General Election, 2009; Mutually Important; ed(c,E) = 1; ed(c,E) = 2
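The final score r(c,E) = ed(c,E) + oco(c,E) on the Hyperlink Entity Scoring slide can be sketched as below. The slide does not define ed(c,E) precisely, so this sketch assumes it counts link directions between the two pages (1 for a one-way link, 2 for a mutual link). The link overlap oco(c,E) is taken as the Jaccard similarity of Out(c) and Out(E), as the slide states. The function names and the toy link graph are hypothetical.

```python
def edge_score(c, event, out_links):
    """Assumed edge-based measure: number of link directions between c and event.

    Returns 2 for mutual links, 1 for a one-way link, 0 for no direct link.
    """
    forward = c in out_links.get(event, set())
    backward = event in out_links.get(c, set())
    return int(forward) + int(backward)


def link_overlap(c, event, out_links):
    """Jaccard similarity between the outgoing links of the two pages."""
    a = out_links.get(c, set())
    b = out_links.get(event, set())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def relevance(c, event, out_links):
    """r(c, E) = ed(c, E) + oco(c, E)."""
    return edge_score(c, event, out_links) + link_overlap(c, event, out_links)


# Toy one-hop link graph (hypothetical, not real Wikipedia data).
EVENT = "Indian General Election, 2014"
toy_links = {
    EVENT: {"Narendra Modi", "BJP", "Rahul Gandhi"},
    "Narendra Modi": {EVENT, "BJP"},
    "Rahul Gandhi": {"Indian National Congress"},
}
modi_score = relevance("Narendra Modi", EVENT, toy_links)
gandhi_score = relevance("Rahul Gandhi", EVENT, toy_links)
```

Because the hyperlink structure is updated periodically, the `out_links` mapping would be rebuilt on each update and the entity scores recomputed.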
  38. Determining Relevancy of Co-occurring Hashtags (Vector Space Model): #indianelection2015; Co-occurring: #modikisarkar; Threshold δ; Latest K (200, 500); Entity Extraction and Scoring: Narendra Modi: 0.9, BJP: 0.7, NDA: 0.6, India: 0.4, Elections: 0.2, Rahul Gandhi: 0.2, Congress: 0.2; Indian General Election, 2014: Extract, Periodically Update Hyperlink structure; One hop from Event Page; Entity scoring based on relevance to the Event: Indian General Elec: 1.0, India: 0.9, Elections: 0.7, UPA: 0.6, BJP: 0.3, NDA: 0.3, Narendra Modi: 0.3
  39. Determining Relevancy of Co-occurring Hashtags (Vector Space Model): #indianelection2015; Co-occurring: #modikisarkar; Threshold δ; Latest K (200, 500); Entity Extraction and Scoring: Narendra Modi: 0.9, BJP: 0.7, NDA: 0.6, India: 0.4, Elections: 0.2, Rahul Gandhi: 0.2, Congress: 0.2; Indian General Election, 2014: Extract, Periodically Update Hyperlink structure; One hop from Event Page; Entity scoring based on relevance to the Event: Indian General Elec: 1.0, India: 0.9, Elections: 0.7, UPA: 0.6, BJP: 0.3, NDA: 0.3, Narendra Modi: 0.3; Similarity Check; Relevance Score: 0.6
  40. Similarity Check o Set Based o Jaccard Similarity o Considers the entities without the scores o Vector Based o Symmetric o Cosine Similarity o Asymmetric o Subsumption Similarity
  41. Intuition behind Asymmetric Similarity. Figure labels: India General Election 2014; Narendra Modi; Penalized; Ignored; Symmetric; Asymmetric
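The three kinds of similarity check listed above (set based, symmetric vector based, asymmetric vector based) can be contrasted in a short sketch. Jaccard and cosine similarity follow their standard definitions. The talk's subsumption formula is not given in the slides, so `directional_overlap` below is only an assumed stand-in: it normalizes by the hashtag vector alone, so entities that occur only in the event vector do not lower the score.

```python
import math


def jaccard(a, b):
    """Set-based: compares the entity sets and ignores the scores."""
    keys_a, keys_b = set(a), set(b)
    union = keys_a | keys_b
    return len(keys_a & keys_b) / len(union) if union else 0.0


def cosine(a, b):
    """Vector-based and symmetric: cosine of the two weighted entity vectors."""
    dot = sum(w * b[e] for e, w in a.items() if e in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def directional_overlap(hashtag_vec, event_vec):
    """Assumed asymmetric measure (not necessarily the talk's formula).

    Shared weight is divided by the hashtag vector's total weight only.
    """
    total = sum(hashtag_vec.values())
    shared = sum(min(w, event_vec[e]) for e, w in hashtag_vec.items() if e in event_vec)
    return shared / total if total else 0.0


# Toy vectors (hypothetical): the hashtag covers a subset of the event's entities.
hashtag_vec = {"Narendra Modi": 1.0}
event_vec = {"Narendra Modi": 1.0, "Rahul Gandhi": 1.0}
```

With these toy vectors the symmetric measures are lowered by the event entity the hashtag never mentions, while `directional_overlap(hashtag_vec, event_vec)` is not. This matches the "Penalized" versus "Ignored" contrast in the intuition slide.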
  42. Determining Relevancy of Co-occurring Hashtags (Vector Space Model): #indianelection2015; Co-occurring: #modikisarkar; Threshold δ; Latest K (200, 500); Entity Extraction and Scoring: Narendra Modi: 0.9, BJP: 0.7, NDA: 0.6, India: 0.4, Elections: 0.2, Rahul Gandhi: 0.2, Congress: 0.2; Indian General Election, 2014: Extract, Periodically Update Hyperlink structure; One hop from Event Page; Entity scoring based on relevance to the Event: Indian General Elec: 1.0, India: 0.9, Elections: 0.7, UPA: 0.6, BJP: 0.3, NDA: 0.3, Narendra Modi: 0.3; Similarity Check; Relevance Score: 0.6
  43. Evaluation – Dataset o 2 events o US Presidential Elections (#election2012) o Hurricane Sandy (#sandy) o Top 25 co-occurring hashtags
  44. Evaluation – Strategy o Ranking Problem o Rank the Top 25 hashtags based on the relevancy of tweets to the event o Experiment with all the similarity metrics o Manually annotated the tweets of these hashtags as relevant/irrelevant (Gold Standard) o Ranking Evaluation Metrics o Mean Average Precision o NDCG
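The two ranking metrics named in the evaluation strategy have standard definitions, sketched here for reference. The input shapes are assumptions: a ranked list of 0/1 relevance labels for average precision, and a ranked list of graded gains for NDCG. Average precision is normalized here by the number of relevant items found in the ranked list.

```python
import math


def average_precision(relevances):
    """Average precision for one ranked list of 0/1 relevance labels."""
    hits = 0
    total = 0.0
    for rank, rel in enumerate(relevances, start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at this relevant position
    return total / hits if hits else 0.0


def mean_average_precision(ranked_lists):
    """Mean of the average precision over several ranked lists (e.g. one per event)."""
    if not ranked_lists:
        return 0.0
    return sum(average_precision(r) for r in ranked_lists) / len(ranked_lists)


def ndcg_at_k(gains, k):
    """Normalized discounted cumulative gain over the top k ranked items."""
    def dcg(values):
        return sum(g / math.log2(i + 1) for i, g in enumerate(values, start=1))

    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal else 0.0


# Toy ranking (hypothetical labels): relevant, irrelevant, relevant.
toy_ranking = [1, 0, 1]
toy_ap = average_precision(toy_ranking)
toy_ndcg = ndcg_at_k(toy_ranking, 3)
```

For the setup on the slide, each ranked list would hold the top 25 hashtags of one event, ordered by one similarity metric and labeled against the gold standard.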
  45. Evaluation
  46. Evaluation: Evaluated tweets comprising top-relevant hashtags detected for dynamic topics • NDCG - 92% at top-5 Mean Average Precision
  47. Conclusions • Semantic Technologies for Real-time filtering of Social Data – Wikipedia as a Dynamic Knowledge base for events – Determining relevant hashtags using Asymmetric similarity measure – More hashtags in turn increase the coverage of Tweets for events • Hashtag Analysis – Co-occurrence technique can be used to detect event-relevant hashtags – More popular hashtags are easier to detect via co-occurrence
  48. Thanks. Contact: @pavankaps, pavan@knoesis.org
