User Interests Identification From Twitter using Hierarchical Knowledge Base

Twitter, due to its massive growth as a social networking platform, has been in focus for the analysis of its user-generated content for personalization and recommendation tasks. A common challenge across these tasks is identifying user interests from tweets. Semantic enrichment of Twitter posts, to determine user interests, has been an active area of research in the recent past. These approaches typically use available public knowledge-bases (such as Wikipedia) to spot entities and create entity-based user profiles. However, exploitation of such knowledge-bases to create richer user profiles is yet to be explored. In this work, we leverage hierarchical relationships present in knowledge-bases to infer user interests expressed as a Hierarchical Interest Graph. We argue that the hierarchical semantics of concepts can enhance existing systems to personalize or recommend items based on a varied level of conceptual abstractness. We demonstrate the effectiveness of our approach through a user study which shows an average of approximately eight of the top ten weighted hierarchical interests in the graph being relevant to a user's interests.

Published in: Software, Education, Technology

  1. 1. Pavan Kapanipathi*, Prateek Jain^, Chitra Venkataramani^, Amit Sheth*. *Kno.e.sis Center, Wright State University. ^IBM TJ Watson Research Center. #eswc2014
  2. 2. • Motivation • Background • Approach • Evaluation • Conclusion & Future Work
  3. 3. • Motivation • Approach • Evaluation • Conclusion & Future Work
  4. 4. • Tapping into social networks to identify interests is not new (2006+), and it works. ◦ Google, Bing, Samsung TV, etc. • Twitter content ◦ 500M+ users generating 500M+ tweets per day. ◦ Public and useful for research.
  5. 5. • Interests with little or no semantics ◦ Bag of Words [1] ◦ Bag of Concepts • Some semantics ◦ Bag of Linked Entities, with the intention of using Knowledge Bases [2, 3]
     [1] Alan Mislove, Bimal Viswanath, Krishna P. Gummadi, and Peter Druschel. You Are Who You Know: Inferring User Profiles in Online Social Networks. WSDM ’10.
     [2] Fabian Abel, Qi Gao, Geert-Jan Houben, and Ke Tao. Analyzing User Modeling on Twitter for Personalized News Recommendations. UMAP ’11.
     [3] Fabrizio Orlandi, John Breslin, and Alexandre Passant. Aggregated, Interoperable and Multi-domain User Profiles for the Social Web. I-SEMANTICS ’12.
  6. 6.
  7. 7. • How can semantics/Knowledge Bases be utilized to infer interests? ◦ Extensive use of Knowledge Bases to infer user interests from tweets is yet to be explored. • As a first step, we utilize hierarchical relationships.
  8. 8. [Diagram labels] Internet Semantic Search Linked Data Metadata Technology World Wide Web Semantic Web Entities Structured Information
  9. 9. • Addressing the data sparsity problem ◦ Infer more interests of the users with less data. • Flexibility for recommendations ◦ Recommend about Sports or Football ▪ The KB knows that Football is a sub-category of Sports ◦ Resource Description Framework and Semantic Web ▪ RDF has less data online to recommend.
  10. 10. • Motivation • Approach • Evaluation • Conclusion & Future Work
  11. 11. [Diagram labels] Tweets, Interest Hierarchy
  12. 12. [Diagram labels] Tweets, Interest Hierarchy
  13. 13. • Selecting an ontology ◦ Available: Wikipedia, Dmoz, OpenCyc, Freebase ◦ Our framework can adapt to any ontology • Wikipedia ◦ Diverse domains and coverage ◦ Resemblance to a taxonomy ◦ Extracted structured Wikipedia: DBpedia ◦ Existing entity recognition techniques (explained further)
  14. 14. • 4.2 million articles • 0.8 million Wikipedia categories • 2.0 million category-subcategory relationships • Challenges ◦ Crowd-sourced, hence noisy ◦ Not a hierarchy/taxonomy ▪ It is a graph ▪ It has cycles
  15. 15. • Clean-up: removed Wiki admin categories • The Hierarchical Interest Graph needs a base hierarchy ◦ Shortest path from the root node ▪ Root node: Category:Main Topic Classifications ▪ Assumption: the number of hops to the root node determines the level of abstraction of the category.
  16. 16. [Diagram labels] Agriculture Science Science Education Scientists Main topic classifications Sports Health Health Care Health Economics; Level: 1, Level: 2, Level: 3
  17. 17. • Removing links that do not conform to a hierarchy
  18. 18. [Diagram labels] Tweets, Interest Hierarchy
  19. 19. • Extracting Wikipedia concepts from tweets • Interest scoring. http://en.wikipedia.org/wiki/Semantic_search http://en.wikipedia.org/wiki/Ontology
  20. 20. ◦ Issues relevant to entity extraction are handled by the web services ▪ Stop-word removal, URLs, disambiguation, etc.
     Table columns: Precision, Recall, F-measure, Usability, Rate Limit, License
     Text Razor: precision 64.6, recall 26.9, F-measure 38.0, usability: web service, rate limit 500/day
     Zemanta: precision 57.7, recall 31.8, F-measure 41.0, usability: web service, rate limit 10000/day
     *L. Derczynski, D. Maynard, N. Aswani, and K. Bontcheva. Microblog-genre noise and impact on semantic annotation accuracy. In Proceedings of the 24th ACM Conference on Hypertext and Social Media, HT ’13.
  21. 21. • Scoring Wikipedia concepts
  22. 22. [Diagram labels] Internet Semantic Search Linked Data Metadata Technology World Wide Web Semantic Web User Interests Structured Information; scores for interests: 0.8, 0.2, 0.6
  23. 23. [Diagram labels] Tweets, Interest Hierarchy
  24. 24. • Result (challenges) ◦ Infers more categories without context ◦ Equal weights regardless of interest score ◦ Cannot rank categories of interest for a user ◦ We use Spreading Activation. [Diagram labels] Cricket M S Dhoni Virat Kohli Sachin Tendulkar Sports Indian Cricket Indian Cricketers Honorary Members of the Order of Australia Order of Australia Awards Culture
  25. 25. • Graph algorithm to find contextual nodes ◦ Cognitive Sciences ◦ Neural Networks ◦ Information Retrieval ▪ Associative, Semantic Networks ◦ Semantic Web ▪ Context Generation
  26. 26. [Diagram labels] Cricket M S Dhoni Virat Kohli Sachin Tendulkar Sports Indian Cricket Indian Cricketers; values: 0.8, 0.2, 0.6, 0.5, 0.4, 0.25, 0.1. Activation function: determines the extent of spreading.
  27. 27.
  28. 28. • No decay, no weighted edge ◦ Result: most generic categories ranked higher • Decay over the hops of the activation ◦ 0.4, 0.6, 0.8 ◦ Result: same as above
  29. 29. [Diagram labels] Agriculture Science Science Education Scientists Main topic classifications Sports Health Health Care Health Economics; Level: 1. Main Topic Classification – 1; Technology – 2; Science – 2; Sports – 2; Business – 2; … …; Technology Companies – 3; Scientists – 3
  30. 30. • Uneven distribution of nodes in the hierarchy • Many-to-many category-subcategory relationships
  31. 31. • Uneven distribution of nodes in the hierarchy • Many-to-many category-subcategory relationships [Chart: number of nodes (0 to 300,000) by hierarchical level (1 to 16)]
  32. 32. • Uneven distribution of nodes in the hierarchy • Many-to-many category-subcategory relationships
  33. 33. • Uneven distribution of nodes in the hierarchy • Many-to-many category-subcategory relationships
  34. 34. [Chart: number of nodes (0 to 300,000) by hierarchical level (1 to 16)]
  35. 35. [Diagram labels] 1 2 3 4
  36. 36. • Nodes that intersect domains/subcategories, activated by diverse entities
  37. 37. [Diagram labels] Cricket M S Dhoni Virat Kohli Sachin Tendulkar Sports Indian Cricket Indian Cricketers Michael Clarke Shane Watson Australian Cricket Australian Cricketers; values: 3, 3, 5, 5, 2, 2
  38. 38.
  39. 39.
  40. 40. • Motivation • Approach • Evaluation • Conclusion & Future Work
  41. 41. • User study data ◦ 37 users ◦ 31,927 tweets • Hierarchical Interest Graph ◦ 111,535 category interests ◦ 3000 categories/user ◦ Ranking evaluation: top-50 categories
  42. 42. • How many relevant/irrelevant Hierarchical Interests are retrieved at top-k ranks? ◦ Graded Precision • How well are the retrieved relevant Hierarchical Interests ranked at top-k? ◦ Mean Average Precision • How early in the ranked Hierarchical Interests can we find a relevant result? ◦ Mean Reciprocal Recall
  43. 43. Priority Intersect works best, with • 76% Mean Average Precision • 98% Mean Reciprocal Recall
  44. 44. • How many of the categories inferred by the system were not explicitly mentioned by the user in tweets? (Semantic Web and Category:Semantic Web) Priority Intersect at top-10: • 52% of categories were not mentioned in tweets by the user • 65% of which were marked relevant • 10% were marked May-be
  45. 45. • Mapped (string match) categories of Wikipedia to Dmoz. ◦ ~141K categories mapped • Compared all the category-subcategory relationships of the mapped categories in the hierarchy to the manually created Dmoz. ◦ 87% precision (relationships in the hierarchy that were also found in Dmoz)
  46. 46. • Motivation • Approach • Evaluation • Conclusion & Future Work
  47. 47. • Hierarchical Interest Graph (hierarchical representation of user interests) ◦ Each interest carries a hierarchical level, which gives flexibility for personalizing and recommending based on its abstractness. • We semantically enhanced user profiles of interests from Twitter using Knowledge Bases. ◦ Inferred abstract/hierarchical interests of Twitter users using Wikipedia ◦ This can help reduce the data sparsity problem by inferring relevant interests. • The top-1 hierarchical interest generated by the system was correct for 36 out of 37 user-study participants. ◦ Mean Average Precision at top-10 is 0.76
  48. 48. • Measuring the impact of Hierarchical Interest Graphs on recommendation of movies/music ◦ Datasets ▪ Movielens ▪ Lastfm • Tuning the system to utilize the hierarchical levels of interests for personalization and recommendation ◦ Sports (most abstract interest) ◦ Baseball (specific interest)
  49. 49. Contact: Pavan Kapanipathi. Twitter: @pavankaps. Email: pavan@knoesis.org. More info: Knoesis Wiki – Hierarchical Interest Graph
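
The base hierarchy described on slides 15 to 17 (levels assigned by shortest path from the root, then removal of links that do not fit a hierarchy) can be sketched roughly as below. This is a minimal illustration and not the authors' implementation: the function names, the toy category graph, and the pruning rule (keep a link only when the subcategory sits at a strictly deeper level than its parent) are assumptions, since the slides do not state the exact rule. The root is numbered level 1, following slide 29.

```python
from collections import deque


def assign_levels(subcategories, root):
    """Breadth-first search from the root; level = shortest hop count + 1."""
    levels = {root: 1}
    queue = deque([root])
    while queue:
        parent = queue.popleft()
        for child in subcategories.get(parent, ()):
            if child not in levels:  # first visit is the shortest path
                levels[child] = levels[parent] + 1
                queue.append(child)
    return levels


def prune_links(subcategories, levels):
    """Assumed rule: keep parent -> child links that go strictly deeper."""
    return {
        parent: [c for c in children
                 if c in levels and levels[c] > levels[parent]]
        for parent, children in subcategories.items()
        if parent in levels
    }


# Hypothetical toy graph: it contains a same-level link and a cycle.
toy_graph = {
    "Main topic classifications": ["Science", "Health", "Sports"],
    "Science": ["Scientists", "Health"],  # Health is already at level 2
    "Health": ["Health care"],
    "Scientists": ["Science"],            # cycle back to a shallower node
}

levels = assign_levels(toy_graph, "Main topic classifications")
hierarchy = prune_links(toy_graph, levels)
print(levels)
print(hierarchy)
```

After pruning, the remaining links form a directed acyclic graph, because every kept link increases the level.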

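Slides 24 to 28 and 36 to 37 describe spreading activation from scored entities up the category hierarchy. The sketch below is a hedged illustration of plain decayed spreading only. The decay factor, the rule that a category fires at most once per entity, the toy scores, and all names are assumptions; the transcript does not give the activation functions (including Priority Intersect), so none of them is reproduced here. The sketch also counts how many distinct entities activate each category, which is the kind of signal the intersect idea on slide 36 relies on.

```python
from collections import defaultdict


def spread_activation(entity_scores, parents, decay=0.5):
    """Push each entity's score upward, multiplying by `decay` per hop."""
    activation = defaultdict(float)
    activators = defaultdict(set)
    for entity, score in entity_scores.items():
        frontier = {entity: score}
        visited = {entity}
        while frontier:
            next_frontier = defaultdict(float)
            for node, value in frontier.items():
                for parent in parents.get(node, ()):
                    if parent not in visited:
                        next_frontier[parent] += value * decay
            for node, value in next_frontier.items():
                activation[node] += value
                activators[node].add(entity)
            # A node fires once per entity, at the first hop reaching it.
            visited.update(next_frontier)
            frontier = next_frontier
    counts = {node: len(ents) for node, ents in activators.items()}
    return dict(activation), counts


# Hypothetical entity scores and child -> parent links.
entity_scores = {"M S Dhoni": 0.8, "Virat Kohli": 0.2, "Sachin Tendulkar": 0.6}
parents = {
    "M S Dhoni": ["Indian cricketers"],
    "Virat Kohli": ["Indian cricketers"],
    "Sachin Tendulkar": ["Indian cricketers"],
    "Indian cricketers": ["Indian cricket"],
    "Indian cricket": ["Cricket"],
    "Cricket": ["Sports"],
}

activation, entity_counts = spread_activation(entity_scores, parents)
for category, value in sorted(activation.items(), key=lambda kv: -kv[1]):
    print(f"{category}: {value:.3f} (entities: {entity_counts[category]})")
```

The per-category entity count is kept separate from the activation value so that different ways of combining the two can be tried without rerunning the spreading step.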
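
The ranking questions on slide 42 can be made concrete with a small sketch of two of the measures. It assumes binary relevance judgments per ranked category (the user study also had a May-be grade, which is not modeled here) and normalizes average precision by the number of relevant items found in the list, which is one common variant. The second function takes the reciprocal of the rank of the first relevant result, matching the slide's question "how early can we find a relevant result"; the slides call the averaged value Mean Reciprocal Recall. The judgment data and all names are hypothetical.

```python
def average_precision(relevance):
    """Mean of precision@rank taken at each rank holding a relevant item."""
    hits = 0
    total = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def reciprocal_rank(relevance):
    """1 / rank of the first relevant item, or 0.0 if none is relevant."""
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0


# Hypothetical judgments: one ranked list of True/False per user.
judgments = {
    "user_a": [True, False, True, True],
    "user_b": [False, True, False, False],
}

map_score = mean(average_precision(r) for r in judgments.values())
mrr_score = mean(reciprocal_rank(r) for r in judgments.values())
print(f"MAP: {map_score:.3f}")
print(f"Mean reciprocal rank of first relevant result: {mrr_score:.3f}")
```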