Knowledge base enabled Information Filtering on Social Web -- EMC

Knowledge-base Enabled Information
Filtering on Social Web
Pavan Kapanipathi
Kno.e.sis Center, Wright State University
Advisor: Amit Sheth
1

Social Web in 60 secs
500M users generate 500M tweets per day
4

Disaster Management Organizations
utilize Social Web
35% of 20M tweets during
Hurricane Sandy shared information
and news about the disaster
5

Personalized Filtering on Social Web
Following Dynamically
Evolving Topics as
interests
8

Personalization on Social Web
• Following Dynamically
Evolving Topics
• Indian Elections
• US Elections
• Healthcare Debate
9


Dynamic Topics
Continuously
Evolving on
Twitter
Entity – Event
relevance
changes
Many entities
are involved
12

Dynamic Topics
Manually crawl using
keywords
“indianelection” “jan25” “sandy”
“swineflu” “ebola”
13

Dynamic Topics
Manually updating
keywords to get topic
relevant tweets is not
feasible
“indianelection”
“modi”
“bjp”
“congress”
“jan25”
“egypt”
“tunisia”
“arabspring”
“sandy”
“newyork”
“redcross”
“fema”
“swineflu” “ebola”
14

Problem
How can we automatically update
the filters to track a dynamically
evolving topic on Twitter?
15

Hashtags as Filters
• Identify a topic on Twitter
• Tweets with hashtags are
more informative
• Users have a lot of freedom
to create them
• Some get popular, most die
16

Exploring Hashtags as Evolving
Filters for Dynamic Topics
Colorado Shooting
17

Colorado Shooting
Occupy Wall Street
18

Colorado Shooting (CS)
Tweets: 122,062
Tags: 192,512
Distinct: 12,350
100% Retrieval: 7,763
Occupy Wall Street (OWS)
Tweets: 6,077,378
Tags: 15,963,209
Distinct: 191,602
19

Colorado Shooting (CS)
Tweets: 122,062
Tags: 192,512
Distinct: 12,350
Occupy Wall Street (OWS)
Tweets: 6,077,378
Tags: 15,963,209
Distinct: 191,602
Hashtag Filters
20

Colorado Shooting Occupy Wall Street
Hashtag Filters Co-occurrence
Graph
21

Colorado Shooting Occupy Wall Street
Event Related
Hashtags co-occur
with each other
Hashtag Filters Co-occurrence
Graph
22

Summarizing Hashtag Analysis
Starting with one event-relevant
hashtag, we can reach other
relevant hashtags by following
co-occurrence
23
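The co-occurrence idea summarized above can be sketched in a few lines. This is a minimal illustration, not the pipeline used in the talk: the regular expression, the function name `cooccurring_hashtags`, and the sample tweets are all assumptions made for the example.

```python
import re
from collections import Counter

# Simple hashtag pattern (an assumption; real tweets may need a richer tokenizer).
HASHTAG = re.compile(r"#(\w+)")

def cooccurring_hashtags(tweets, seed):
    """Count hashtags that appear in the same tweet as the seed hashtag."""
    seed = seed.lower().lstrip("#")
    counts = Counter()
    for text in tweets:
        tags = {t.lower() for t in HASHTAG.findall(text)}
        if seed in tags:
            counts.update(tags - {seed})
    return counts

# Made-up tweets, for illustration only.
tweets = [
    "Shelters open in #newyork #sandy",
    "#sandy update from #fema and #redcross",
    "Donate to #redcross #sandy",
    "Great game tonight #nba",
]
counts = cooccurring_hashtags(tweets, "#sandy")
print(counts.most_common())
```

Starting from `#sandy`, the counter surfaces `#redcross`, `#fema`, and `#newyork` as candidate filters, while the unrelated `#nba` tweet is never touched.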

Determining Relevancy of
Co-occurring Hashtags
#indianelection2015
#modikisarkar
Too many
co-occurring hashtags
24

Hashtag Filters distributions
25

Not surprising
It’s a power-law
distribution
Hashtag distributions
26

Top 1% retrieves
around 85% of the
tweets
Hashtag distributions
27
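The "top 1% retrieves around 85%" observation is a coverage measurement over the hashtag frequency distribution. The sketch below shows one way such a number could be computed; the function name `coverage_of_top` and the toy data are assumptions, and the result on toy data says nothing about the real datasets.

```python
import math
from collections import Counter

def coverage_of_top(tweet_tags, fraction):
    """Share of tweets that contain at least one of the top `fraction` hashtags."""
    # Frequency = number of tweets a hashtag appears in.
    freq = Counter(tag for tags in tweet_tags for tag in set(tags))
    k = max(1, math.ceil(len(freq) * fraction))
    top = {tag for tag, _ in freq.most_common(k)}
    hits = sum(1 for tags in tweet_tags if top & set(tags))
    return hits / len(tweet_tags)

# Toy data: each inner list holds the hashtags of one tweet.
tweet_tags = [["a"], ["a", "b"], ["a"], ["c"]]
print(coverage_of_top(tweet_tags, 0.25))
```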

Clustering Coefficient of Hashtag
Co-occurrence network (1%)
Clustering coefficient
The top hashtags co-occur
with each other most often
28

Determining Relevancy of Co-occurring Hashtags
#indianelection2015
#modikisarkar
Co-occurring:
Threshold δ
Preferably a prominent hashtag
29

Does Hashtag Co-occurrence
Work?
o No. Co-occurrence alone does not work
o Many noisy or unrelated hashtags co-occur
o Determine the “dynamic” relevance of
the top co-occurring hashtag with the
dynamic topic
30

Determining Relevancy of Co-occurring Hashtags
#indianelection2015
#modikisarkar
Co-occurring:
Threshold
Latest K (200,500)
Narendra Modi: 0.9
BJP: 0.7
NDA: 0.6
India: 0.4
Elections: 0.2
Rahul Gandhi: 0.2
Congress: 0.2
Entity Extraction
and Scoring
δ
Normalized
Frequency
Scoring
31
(Vector Space Model)
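The "Entity Extraction and Scoring" step over the latest K tweets can be sketched as follows. Entity extraction itself is assumed to have happened already (each tweet arrives as a list of entity names), and normalizing by the most frequent entity is one plausible reading of "Normalized Frequency Scoring", not a detail confirmed by the slides. The name `score_entities` is hypothetical.

```python
from collections import Counter

def score_entities(entity_lists, k=200):
    """Score entities found in the latest k tweets by normalized frequency."""
    recent = entity_lists[-k:]  # keep only the latest k tweets
    counts = Counter(e for entities in recent for e in entities)
    if not counts:
        return {}
    top = max(counts.values())  # assumption: normalize by the most frequent entity
    return {entity: count / top for entity, count in counts.items()}

# Toy input: entities already extracted per tweet (illustrative values).
tweets_entities = [
    ["Narendra Modi", "BJP"],
    ["Narendra Modi"],
    ["BJP", "India"],
    ["Narendra Modi"],
]
scores = score_entities(tweets_entities, k=200)
print(scores)
```

The resulting dictionary is the hashtag's entity vector in the vector space model, ready to be compared against the event's vector.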

Determining Relevancy of Co-occurring Hashtags
(Vector Space Model)
#indianelection2015
#modikisarkar
Co-occurring:
Threshold
Latest K (200,500)
Narendra Modi: 0.9
BJP: 0.7
NDA: 0.6
India: 0.4
Elections: 0.2
Rahul Gandhi: 0.2
Congress: 0.2
Entity Extraction
and Scoring
Indian General
Election,_2014
Dynamically Updated
Background Knowledge
δ
32

Event Relevant Background
Knowledge
o Wikipedia Event Pages
33


o Entities mentioned on the Event page of
Wikipedia are relevant to the Event
Event Relevant Background Knowledge
35

o Wikipedia’s Hyperlink structure is very
rich
o Page-Page (Wikipedia) links
Indian General
Election, 2014
Narendra Modi
Rahul Gandhi
NDA (India)
UPA (India)
BJP
Indian National
Congress
Event Relevant Background Knowledge – Graph Structure
36

Determining Relevancy of Co-occurring Hashtags (Vector Space Model)
#indianelection2015
#modikisarkar
Co-occurring:
Threshold
Latest K (200,500)
Narendra Modi: 0.9
BJP: 0.7
NDA: 0.6
India: 0.4
Elections: 0.2
Rahul Gandhi: 0.2
Congress: 0.2
Entity Extraction
and Scoring
Indian General
Election,_2014
Extract, Periodically
Update Hyperlink structure
One hop from Event
Page
δ
37

o Hyperlink structure is dynamically
updated
[Figure, built up over slides 38 to 40: links among Indian General Election, 2014, Narendra Modi, Rahul Gandhi, BJP, and Indian National Congress, annotated with the dates 10 May 2010, 29 March 2013, and 20 May 2013]
Event Relevant Background Knowledge
40

Determining Relevancy of Co-occurring Hashtags (Vector Space Model)
#indianelection2015
#modikisarkar
Co-occurring:
Threshold
Latest K (200,500)
Narendra Modi: 0.9
BJP: 0.7
NDA: 0.6
India: 0.4
Elections: 0.2
Rahul Gandhi: 0.2
Congress: 0.2
Entity Extraction
and Scoring
Indian General
Election,_2014
Entity scoring based
on relevance to the Event
One hop from Event
Page
δ
41

o Edge Based Measure
o Link Overlap Measure: Jaccard similarity
o Out(c) are the links in Wikipedia page “c”
o Final Score: r(c,E) = ed(c,E) + oco(c,E)
Hyperlink Entity Scoring
[Figure: link examples among Indian General Election, 2014, Narendra Modi, and Indian General Election, 2009; one pair of pages is marked "Mutually Important"; edge scores shown: ed(c,E) = 1 and ed(c,E) = 2]
42
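The scoring r(c,E) = ed(c,E) + oco(c,E) can be sketched under two stated assumptions: the edge measure ed is taken to be 1 for a one-way link between the pages and 2 when they link to each other, and the link overlap oco is the Jaccard similarity of the out-link sets, oco(c,E) = |Out(c) ∩ Out(E)| / |Out(c) ∪ Out(E)|. The link data below is invented for illustration and is not taken from Wikipedia.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def edge_score(c, event, out_links):
    """ed(c, E): 1 per link direction present (assumed reading of the slide)."""
    forward = c in out_links.get(event, set())
    backward = event in out_links.get(c, set())
    return int(forward) + int(backward)

def relevance(c, event, out_links):
    """r(c, E) = ed(c, E) + oco(c, E)."""
    overlap = jaccard(out_links.get(c, set()), out_links.get(event, set()))
    return edge_score(c, event, out_links) + overlap

# Hypothetical out-link sets, for illustration only.
E = "Indian general election, 2014"
out_links = {
    E: {"Narendra Modi", "BJP", "Indian general election, 2009"},
    "Narendra Modi": {E, "BJP"},
    "Indian general election, 2009": {"BJP"},
}
for page in sorted(out_links[E]):
    print(page, relevance(page, E, out_links))
```

Because every candidate is one hop from the event page, the forward link always exists, so under this reading a page scores ed = 2 only when it links back to the event.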

Determining Relevancy of Co-occurring Hashtags (Vector Space Model)
#indianelection2015
#modikisarkar
Co-occurring:
Threshold
Latest K (200,500)
Narendra Modi: 0.9
BJP: 0.7
NDA: 0.6
India: 0.4
Elections: 0.2
Rahul Gandhi: 0.2
Congress: 0.2
Entity Extraction
and Scoring
Indian General
Election,_2014
One hop from Event
Page
Indian General Elec: 1.0
India: 0.9
Elections: 0.7
UPA: 0.6
BJP: 0.3
NDA: 0.3
Narendra Modi: 0.3
δ
43

Determining Relevancy of Co-occurring Hashtags (Vector Space Model)
#indianelection2015
#modikisarkar
Co-occurring:
Threshold
Latest K (200,500)
Narendra Modi: 0.9
BJP: 0.7
NDA: 0.6
India: 0.4
Elections: 0.2
Rahul Gandhi: 0.2
Congress: 0.2
Entity Extraction
and Scoring
Indian General
Election,_2014
One hop from Event
Page
India: 0.9
Elections: 0.7
UPA: 0.6
BJP: 0.3
NDA: 0.3
Narendra Modi: 0.3
Similarity
Check
Relevance Score: 0.6
δ
44

o Set Based
o Jaccard Similarity
o Considers the entities without the scores
o Vector Based
o Symmetric
o Cosine Similarity
o Asymmetric
o Subsumption Similarity
Similarity Check
45
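The three kinds of similarity check can be compared on the example vectors from the earlier slides. Jaccard and cosine are standard. The slides do not give the formula for subsumption similarity, so `containment` below is only a hypothetical asymmetric stand-in: it normalizes by the hashtag vector alone, so event entities the hashtag never mentions are ignored rather than penalized.

```python
import math

def jaccard(h, e):
    """Set-based: compares entity sets and ignores the scores."""
    a, b = set(h), set(e)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cosine(h, e):
    """Symmetric vector-based similarity."""
    dot = sum(w * e.get(k, 0.0) for k, w in h.items())
    norm_h = math.sqrt(sum(w * w for w in h.values()))
    norm_e = math.sqrt(sum(w * w for w in e.values()))
    return dot / (norm_h * norm_e) if norm_h and norm_e else 0.0

def containment(h, e):
    """Hypothetical asymmetric measure: weight of h that is covered by e."""
    total = sum(h.values())
    covered = sum(min(w, e.get(k, 0.0)) for k, w in h.items())
    return covered / total if total else 0.0

# Entity vectors copied from the example slides.
hashtag = {"Narendra Modi": 0.9, "BJP": 0.7, "NDA": 0.6, "India": 0.4,
           "Elections": 0.2, "Rahul Gandhi": 0.2, "Congress": 0.2}
event = {"India": 0.9, "Elections": 0.7, "UPA": 0.6, "BJP": 0.3,
         "NDA": 0.3, "Narendra Modi": 0.3}

print(jaccard(hashtag, event))
print(cosine(hashtag, event))
print(containment(hashtag, event), containment(event, hashtag))
```

Swapping the arguments of `containment` changes the result, which is the asymmetry the next slide motivates. None of these functions is claimed to reproduce the 0.6 relevance score shown in the pipeline diagram.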

India General
Election 2014
Narendra
Modi
Intuition behind
Asymmetric
India General
Election 2014
Narendra
Modi
Penalized
Ignored
Similarity
Symmetric
Asymmetric
46


o 2 events
o US Presidential Elections (#election2012)
o Hurricane Sandy (#sandy)
o Top 25 co-occurring hashtags
Evaluation – Dataset
48

o Ranking Problem
o Rank the Top 25 hashtags based on the
relevancy of tweets to the event
o Experiment with all the similarity metrics
o Manually annotated the tweets of these
hashtags as relevant/irrelevant (Gold
Standard)
o Ranking Evaluation Metrics
o Mean Average Precision
o NDCG
Evaluation –
Strategy
49
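Both ranking metrics named in the strategy are standard and can be sketched briefly. The binary relevance lists and graded gains below are made-up inputs; how the talk derived per-hashtag gains from the annotated tweets is not stated, so that mapping is left out.

```python
import math

def average_precision(rels):
    """Average precision of one ranked list of binary relevance labels."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at this relevant position
    return total / hits if hits else 0.0

def mean_average_precision(runs):
    """Mean of average precision over several ranked lists."""
    return sum(average_precision(r) for r in runs) / len(runs)

def dcg(gains):
    """Discounted cumulative gain with a log2 rank discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg(gains, k=None):
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    ranked = gains[:k] if k else gains
    ideal = sorted(gains, reverse=True)[:len(ranked)]
    best = dcg(ideal)
    return dcg(ranked) / best if best else 0.0

# Illustrative rankings: 1 = relevant hashtag, 0 = irrelevant.
print(average_precision([1, 0, 1, 1]))
print(mean_average_precision([[1, 0, 1, 1], [1, 1, 0, 0]]))
print(ndcg([0, 1]))
```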

Evaluation
Evaluated tweets containing the top
relevant hashtags detected for
dynamic topics
• NDCG - 92% at top-5 Mean Average
Precision
51

A little
pause for
Questions?
52

Personalized Filtering
53
User Interest
Identification/User
Modeling
Filtering Module
Twitter Streaming API
Tweets
Network
Filtered
Tweets

54
User Interest
Identification/User
Modeling
Filtering Module
Tweets
Network
Filtered
Tweets
Dynamic Topics
as Interests
Interest: Indian Elections

55
User Interest
Identification/User
Modeling
Filtering Module
Tweets
Network
Filtered
Tweets
A Significant
Module

o User Interest Identification on Twitter
o Content-based (Only Tweets)
o Term-based (semantic, web, #semanticweb)
o Entity-based (semantic web <same as> #semanticweb)
o Interest Graphs derived from knowledge-base
(Hierarchical Interest Graphs)
o Collaborative (Users’ Friends)
o Hybrid
User Modeling
56

A simple solution to most problems I
am trying to solve

Hierarchical
Interest Graphs
58

What is in your mind? (Next
concept/term)
59

What is in your mind? (Next
concept/term)
Fruit
Other Fruit
Names
61

Cognitive Science
o Human memory has been argued to be
structured as a hierarchy of concepts
(Semantic Network)
o Spreading activation theory has been
utilized to simulate search on semantic
network
o This theory has not been well explored
for user interest modeling
62

Hierarchical Interest Graphs
o Extending user profiles from Twitter to
comprise a hierarchy of concepts
o The hierarchy of concepts is derived from the
Wikipedia Category Structure
o Each concept in the hierarchy is scored
based on the user's extent of interest
63

Semantic
Search
Linked Data Metadata
0.8 0.2 0.6
Scores for
Interests
65
User Interests

Internet
Semantic
Search
Technology
World Wide Web
Semantic
Web
Structured
Information
0.8 0.2 0.6
Scores for
Interests
66
User Interests

Internet
Semantic
Search
Technology
World Wide Web
Semantic
Web
Structured
Information
0.8 0.2 0.6
Scores for
Interests
67
User Interests
0.7
0.5
0.4
0.3

70
Wikipedia Category Graph
Contains
Cycles
More abstract:
World Wide Web or
Semantic Web?

71
Wikipedia Hierarchy
Hierarchical Levels
No Cycles
1
2
3
4
5
6

73
http://en.wikipedia.org/wiki/Semantic_search
http://en.wikipedia.org/wiki/Ontology
o Extracting Wikipedia entities
o Interest Scoring
o Frequency based
User Profile Generation

Internet
Semantic
Search
Technology
World Wide Web
Semantic
Web
User Interests
Structured
Information
0.8 0.2 0.6
Scores for
Interests
74

76
Cricket
M S Dhoni Virat Kohli
Sachin
Tendulkar
Sports
Indian
Cricket
Indian
Cricketers
0.8 0.2 0.6
0.5
0.4
0.25
0.1
Activation Function
Determines the extent of spreading
Example

o Simple Activation Function
$A_j = \sum_{i=0}^{n} A_i \times W_{ij} \times D$
$i$ is the activated child or subcategory of $j$.
$j$ is the category to be activated.
$W_{ij}$ is the edge weight between $j$ and $i$.
$D$ is the decay factor.
77
Activation Function
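A minimal sketch of the simple activation function, assuming an acyclic child-to-parent hierarchy, a uniform edge weight, and a decay of 0.5. The hierarchy, the weight, and the decay value are illustrative assumptions; the entity scores 0.8, 0.2, and 0.6 are borrowed from the example slide, and the category scores produced here are not meant to match the ones shown there.

```python
from collections import defaultdict

def spread_activation(entity_scores, parents, decay=0.5, weight=1.0):
    """Propagate scores upward: A_j = sum over activated children i of A_i * W_ij * D."""
    activation = defaultdict(float)
    frontier = dict(entity_scores)
    while frontier:  # terminates because the hierarchy is assumed acyclic
        next_frontier = defaultdict(float)
        for node, a_i in frontier.items():
            for parent in parents.get(node, ()):
                next_frontier[parent] += a_i * weight * decay
        for node, a in next_frontier.items():
            activation[node] += a
        frontier = next_frontier
    return dict(activation)

# Assumed child -> parents mapping, for illustration only.
parents = {
    "M S Dhoni": ["Indian Cricketers"],
    "Virat Kohli": ["Indian Cricketers"],
    "Sachin Tendulkar": ["Indian Cricketers"],
    "Indian Cricketers": ["Indian Cricket"],
    "Indian Cricket": ["Cricket"],
    "Cricket": ["Sports"],
}
interests = {"M S Dhoni": 0.8, "Virat Kohli": 0.2, "Sachin Tendulkar": 0.6}
result = spread_activation(interests, parents)
print(result)
```

With a decay below 1, categories further from the user's entities receive less activation, which is what keeps very abstract categories from dominating the profile.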

o Uneven distribution of nodes in the
hierarchy
o Many-many for category-subcategory
relationships
78
Challenges – Wikipedia
Category Graph


Addressing Uneven Node
Distribution
[Figure: number of nodes (0 to 300,000) at each hierarchical level (1 to 16)]
81


Preferential Path Constraint –
Many to Many Links
[Figure: many-to-many links example, with labels 1 to 4]
85

Boosting Common Ancestors
o Nodes that intersect domains/subcategories
activated by diverse entities
86

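[Figure: category hierarchy for five cricketers. Entities: M S Dhoni, Virat Kohli, Sachin Tendulkar, Michael Clarke, Shane Watson. Categories: Indian Cricketers, Indian Cricket, Australian Cricketers, Australian Cricket, Cricket, Sports. Node labels shown: 3, 3, 5, 5 and 2, 2]
87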
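One way to read the numbers in the cricket example is as the count of distinct user entities that activate each category, which is what a common-ancestor boost would need. The sketch below computes that count; the hierarchy edges and the function name `activating_entity_counts` are assumptions, and how the count is turned into a boost factor is not specified here.

```python
from collections import defaultdict

def activating_entity_counts(parents, user_entities):
    """For each category, count the distinct user entities that reach it."""
    reached = defaultdict(set)
    for entity in user_entities:
        stack, seen = [entity], set()
        while stack:
            node = stack.pop()
            for parent in parents.get(node, ()):
                if parent not in seen:
                    seen.add(parent)
                    reached[parent].add(entity)
                    stack.append(parent)
    return {category: len(entities) for category, entities in reached.items()}

# Assumed child -> parents mapping for the example.
parents = {
    "M S Dhoni": ["Indian Cricketers"],
    "Virat Kohli": ["Indian Cricketers"],
    "Sachin Tendulkar": ["Indian Cricketers"],
    "Michael Clarke": ["Australian Cricketers"],
    "Shane Watson": ["Australian Cricketers"],
    "Indian Cricketers": ["Indian Cricket"],
    "Australian Cricketers": ["Australian Cricket"],
    "Indian Cricket": ["Cricket"],
    "Australian Cricket": ["Cricket"],
    "Cricket": ["Sports"],
}
players = ["M S Dhoni", "Virat Kohli", "Sachin Tendulkar",
           "Michael Clarke", "Shane Watson"]
entity_counts = activating_entity_counts(parents, players)
print(entity_counts)
```

"Cricket" is reached from both the Indian and the Australian branch, so it collects all five entities, which makes it a natural candidate for a boost over either national category.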


o Bell
$A_j = \sum_{i=0}^{n} A_i \times F_j$
o Bell Log
$A_j = \sum_{i=0}^{n} A_i \times FL_j$
o Priority Intersect
$A_j = \sum_{i=0}^{n} A_i \times FL_j \times P_{ji} \times B_j$
89
Activation Functions

Evaluation
User Study
• 37 Users
• 30K Tweets
Evaluated the top-10 categories of
interests derived from the hierarchy
• 76% Mean Average Precision
• 98% Mean Reciprocal Recall
• 70% of these categories are not mentioned in the users' tweets
90

o Working on a Tweet recommendation
system that utilizes Hierarchical
Interest Graph
o Preliminary results are “interesting” 
91
Tweet Recommendation using
Hierarchical Interest Graph

Conclusion
o Focus on “Information” overload instead of
“Data” overload.
o Personalized Information Filtering
o Knowledge-base enabled solutions for
challenges in Tweets filtering
o Wikipedia hyperlink structure and category
graph leveraged for Twitter data filtering
o More Research on User Specific Attribute
Extraction (Personalization) from Twitter
Data
o Activity Estimation
o Location Prediction

kHealth
Knowledge-enabled Healthcare
Applied to ADHF, Asthma, GI, and Dementia
94

Through physical monitoring and
analysis, our cellphones could act as
an early warning system to detect
serious health conditions, and
provide actionable information
canary in a coal mine
Empowering Individuals (who are not Larry Smarr!) for their own health
kHealth: knowledge-enabled healthcare
95

Motivational Scenario
Manually going through
news articles, diabetes
forums, blogs, etc.
- Time consuming
- Relevant?
Interesting?
Informative? Useful?
97
How about all the relevant and important health
information aggregated at one platform?
A diabetic patient is interested in keeping himself up to date with
new information about diabetes

98
Search and Explore
X Controls
Cancer
X = diet, treatment, exercise
(Pattern-based Approach
leveraging domain
semantics)
Top Health News
Informative news about selected
disease
Faceted search (by health topics)
Learn about disease
Source: Wikipedia
Search &
Explore
Top Health
News
Tweet
Traffic
Learn about
Disease
Home

Thanks
Contact:
Email: pavan@knoesis.org
Twitter: @pavankaps
Webpage:
http://knoesis.org/researchers/pavan
99
