Harnessing Volume and Velocity Challenge on the Social Web using Crowd-Sourced Knowledge Bases

Pavan Kapanipathi's talk at IBM's Frontiers of Cloud Computing and Big Data Workshop 2014. http://researcher.ibm.com/researcher/view_group_subpage.php?id=5565

With the growing adoption of the social web, users, and Twitter users in particular, face information overload. Unless a user is willing to restrict their sources (e.g., the number of accounts they follow), important information relevant to their interests often goes unnoticed. The reasons include: (1) the information may be posted at a time when the user is not looking; (2) the user may be unaware of the information source and hence not follow it; and (3) the information arrives at a rate faster than the user can consume. Furthermore, temporally relevant information that is discovered late may be of no use.

My research addresses these challenges by:
(1) Generating user interest profiles from Twitter using Wikipedia. The interests gleaned from users' Twitter data can be leveraged by personalization and recommendation systems to reduce information overload (the Volume challenge) for users.
(2) Filtering Twitter data relevant to dynamically evolving entities. In addition to Volume, this addresses the Velocity challenge of delivering relevant information in real time. The approach is deployed on Twitris to crawl dynamic event-relevant tweets for analysis. The prominent aspect of both approaches is the use of a crowd-sourced knowledge base such as Wikipedia.
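
The slides name spreading activation over the Wikipedia category graph as the basis for the interest profiles in (1). The following is a minimal sketch of that general idea, not the author's implementation: the toy hierarchy `PARENTS`, the function `spread_activation`, and the `decay` and `max_hops` parameters are all assumptions made for illustration. Category names are borrowed from the example slide.

```python
from collections import defaultdict

# Hypothetical toy hierarchy (child -> parent categories); the real
# Wikipedia category graph is far larger and is not a clean tree.
PARENTS = {
    "Semantic Search": ["Semantic Web"],
    "Linked Data": ["Semantic Web"],
    "Metadata": ["Structured Information"],
    "Semantic Web": ["World Wide Web"],
    "World Wide Web": ["Internet"],
    "Internet": ["Technology"],
    "Structured Information": ["Technology"],
}


def spread_activation(seed_scores, parents, decay=0.5, max_hops=3):
    """Propagate scores from mentioned entities up to broader categories.

    Each hop passes `decay` times a node's incoming score to its parents,
    so broad categories far from the evidence receive less weight.
    """
    activation = defaultdict(float)
    for node, score in seed_scores.items():
        activation[node] += score

    frontier = dict(seed_scores)
    for _ in range(max_hops):
        next_frontier = defaultdict(float)
        for node, score in frontier.items():
            for parent in parents.get(node, []):
                next_frontier[parent] += score * decay
        for node, score in next_frontier.items():
            activation[node] += score
        frontier = next_frontier
    return dict(activation)


# Assumed input: entities spotted in a user's tweets, one mention each.
seeds = {"Semantic Search": 1.0, "Linked Data": 1.0, "Metadata": 1.0}
profile = spread_activation(seeds, PARENTS)
ranked = sorted(profile.items(), key=lambda kv: kv[1], reverse=True)
```

In this sketch, "Semantic Web" scores higher than "Structured Information" because two mentioned entities feed into it rather than one. A real system would also need to handle cycles and uneven category sizes in the Wikipedia graph, which this example ignores.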

Published in: Data & Analytics

  1. 1. Pavan Kapanipathi Kno.e.sis Center, Wright State University Dayton, OH Frontiers of Cloud Computing and Big Data Workshop 2014 IBM TJ Watson Research Center Yorktown Heights, NY
  2. 2. Overview o Social Web in 60 seconds o Twitter o Big Data Challenges on Social Web o Addressing Volume o Hierarchical Interest Graphs o Addressing Velocity o Tracking Dynamic Topics on Twitter o Conclusion
  3. 3. Social Web in 60 secs
  4. 4. Twitter 500M tweets per day
  5. 5. Leveraging Twitter o Brands are monitoring Twitter o 62% active in 2011 to 97% active in 2013 o Twitter is used for disaster management o 35% of 20M tweets during Hurricane Sandy shared information and news o Personalization using Twitter o Search engines use influence scores derived from the Twitter network.
  6. 6. Challenges The Four Vs of Big Data: Volume (Scale of Data), Velocity (Streaming Data), Variety (Different Forms of Data), Veracity (Uncertainty of Data) • Data Perspective: 12TB of data/day. • Information Perspective: Information that interests me. Reducing information overload for users. • Tracking dynamic topics on Twitter. • Improving recall of relevant, dynamic streaming Twitter data for real-time Twitter analysis.
  7. 7. Addressing Volume: Information Perspective
  8. 8. State of the Art
  9. 9. Addressing Volume: User Perspective o Approach o Generating interest profiles of users by understanding their activities on Twitter. o Filtering/recommending content that matches their interests. o Determining user interests from tweets o Exploiting a knowledge base to gain further insights about the interests and infer a hierarchical interest graph.
  10. 10. 10
  11. 11. [Figure] Category nodes: Internet, Semantic Search, Linked Data, Metadata, Technology, World Wide Web, Semantic Web, Structured Information. Legend: User Interests, Scores for Interests. Scores shown: 0.5, 0.8, 0.2, 0.6, 0.7, 0.4, 0.3.
  12. 12. Hierarchical Interest Graphs o Spreading Activation Theory o Wikipedia Graph based Distributional Semantics o Hierarchical Interest Graph with scores for each category in the Hierarchy.
  13. 13. Evaluation Evaluated the top-10 interest categories derived from the hierarchy • 76% Mean Average Precision • 98% Mean Reciprocal Rank
  14. 14. Addressing Velocity: Tracking Dynamic Topics on Twitter
  15. 15. Tracking Dynamic Events on Twitter o Twitris – A Semantic Web application for analyzing tweets. o Political, Disasters & Healthcare tweets o Event relevant tweets o Twitter Streaming API, Keywords/geo-location based. o Dynamic events are not easy to crawl using these techniques. o Hashtags as queries.
  16. 16. Hashtag Analysis for Dynamic Topics Colorado Shooting OWS
  17. 17. Hashtag Analysis for Dynamic Topics Hashtags co-occur with each other Colorado Shooting OWS
  18. 18. Hashtag Analysis for Dynamic Topics Power-law distribution Hashtags co-occur with each other Colorado Shooting OWS
  19. 19. Hashtag Analysis for Dynamic Topics Power-law distribution Hashtags co-occur with each other Colorado Shooting OWS Very few hashtags are popular. Top 1% can get 85% of the tweets.
  20. 20. Hashtag Analysis for Dynamic Topics Power-law distribution Hashtags co-occur with each other Colorado Shooting OWS Very few hashtags are popular. Top 1% can get 85% of the tweets. Clustering coefficient
  21. 21. Hashtag Analysis for Dynamic Topics Power-law distribution Hashtags co-occur with each other Colorado Shooting OWS Very few hashtags are popular. Top 1% can get 85% of the tweets. The top hashtags co-occur with each other the most. Clustering coefficient
  22. 22. Approach using Wikipedia o Input an event-relevant hashtag and the corresponding event's Wikipedia page. o Utilize the dynamically evolving hyperlink structure of the Wikipedia event page. o Determine event-relevant hashtags based on their similarity to the event page and their co-occurrence with the existing relevant hashtags.
  23. 23. Evaluation Evaluated tweets containing the top relevant hashtags detected for dynamic topics • NDCG: 92% at top-5 • Mean Average Precision
  24. 24. Conclusion o What this presentation covered o Big Data challenges in leveraging Twitter. o Focus on “Information” overload instead of “Data” overload. o Wikipedia categories in the hierarchy are treated as user interests. o Evolving set of hashtags as queries for dynamic events. o What I missed (catch me at the poster session) o How are knowledge bases exploited in our work? o Impact of knowledge bases, specifically Wikipedia.
  25. 25. Thanks Contact: Email: pavan@knoesis.org Twitter: @pavankaps
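
Slide 22 describes selecting event-relevant hashtags by combining similarity to the event's Wikipedia page with co-occurrence alongside already relevant hashtags. The sketch below shows one way such a combination could look. It is an illustration under stated assumptions, not the deployed Twitris method: Jaccard overlap as the similarity measure, the linear weight `alpha`, the helper names, and the sample tweets are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard overlap of two term collections (0.0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def score_hashtags(tweets, seed_hashtags, event_terms, alpha=0.5):
    """Score candidate hashtags for relevance to an event.

    tweets: list of (hashtags, terms) pairs, one per tweet.
    seed_hashtags: hashtags already known to be event-relevant.
    event_terms: terms taken from the event page (assumed here to be
        anchor texts of its current hyperlinks).
    """
    seeds = set(seed_hashtags)
    stats = {}
    for hashtags, terms in tweets:
        tags = set(hashtags)
        for tag in tags - seeds:
            entry = stats.setdefault(
                tag, {"count": 0, "with_seed": 0, "terms": set()}
            )
            entry["count"] += 1
            if tags & seeds:
                entry["with_seed"] += 1
            entry["terms"].update(terms)

    scores = {}
    for tag, entry in stats.items():
        similarity = jaccard(entry["terms"], event_terms)
        cooccurrence = entry["with_seed"] / entry["count"]
        scores[tag] = alpha * similarity + (1 - alpha) * cooccurrence
    return scores


# Made-up sample data for an OWS-like event.
tweets = [
    (["#ows", "#occupy"], ["zuccotti", "park", "protest"]),
    (["#occupy"], ["wall", "street", "protest"]),
    (["#ows", "#nyc"], ["zuccotti", "park"]),
    (["#nyc"], ["pizza", "subway"]),
]
event_terms = {"zuccotti", "park", "protest", "wall", "street"}
scores = score_hashtags(tweets, ["#ows"], event_terms)
```

Here `#occupy` outranks `#nyc`: both co-occur with the seed hashtag in half of their tweets, but only `#occupy` appears with terms that closely match the event page. Re-reading `event_terms` from the page as it is edited, and promoting high-scoring hashtags into the seed set, would give the evolving query set that the slides describe.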