MDS 2011 Presentation: An Unsupervised Approach to Discovering and Disambiguating Social Media Profiles


Published in: Technology, Business

  1. 1. An Unsupervised Approach to Discovering and Disambiguating Social Media Profiles<br />Mining Data Semantics Workshop 2011<br />Carlton Northern<br />Old Dominion University<br />8/25/2011<br />1<br />
  2. 2. Background<br />Digital Preservation<br />How are students using social media as a digital preservation strategy?<br />Evaluating Personal Archiving Strategies for Internet-based Information - Marshall, McCown, Nelson<br />2<br />
  3. 3. Goal<br />Ascertain the set of social media profiles for ODU CS students.<br />{<br />}<br />...<br />3<br />
  4. 4. What's out there already?<br />4<br />
  5. 5. Intelius<br />5<br />
  6. 6. Wink / MyLife<br />6<br />
  7. 7. Google<br />7<br />
  8. 8. Requirements and Assumptions<br />Approach must be automated: no human interaction except for a search query consisting of:<br />location<br />organization<br />profession/education domain<br />Achieve precision of 0.7 or higher and f-measure of 0.5 or higher, comparable to a human performing the same task<br />Must find profiles not indexed by search engines<br />Can use any means available, including search engines, page scraping, web service APIs, etc.<br />Only publicly declared identities; do not expose obfuscated identities,<br />e.g., “Bruce Wayne” -> “Batman”<br />Find profiles from 25 pre-defined sites (next slide)<br />Approach must be extensible,<br />i.e., new social media sites can be added with minimal changes.<br />8<br />
  9. 9. Social Media Sites<br />9<br />
  10. 10. Approach<br />10<br />
  11. 11. Algorithm<br />Discovery Phase<br />Generate Usernames<br />Check Rapportive<br />Disambiguation Phase<br />Assign Points for Keywords, Email, Me and Friend Links<br />Check Google and Yahoo<br />Check Sites for Profiles<br />Check Sites for Profiles<br />Check Social Graph<br />Remove Duplicates<br />*Run multiple times<br />11<br />
  12. 12. Discovery Phase<br />12<br />
  13. 13. Starting Information<br />Given:<br />Full name, e.g. Carlton Northern<br />CS username, e.g. cnorther<br />CS email, e.g.<br />.forward files -><br />CS profile URI, e.g.<br />Inferred:<br />School affiliation, e.g. Old Dominion<br />Approximate location, e.g. Norfolk, Hampton Roads<br />Computer Science affiliation, e.g. software engineer<br />13<br />
  14. 14. Username Generation<br />Generate usernames from derivatives of the full name, e.g. for “Carlton Northern” we have:<br />cnorthern<br />northernc<br />carlton.northern<br />carlton_northern<br />carlton-northern<br />14<br />
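
A minimal sketch of the username derivation listed on this slide. The function name `generate_usernames` is hypothetical, and the sketch assumes a simple "First Last" name and that the hyphenated variant uses the full last name; the deck does not show the actual implementation.

```python
def generate_usernames(full_name):
    """Return candidate usernames derived from a full name.

    Assumption: the first token is the given name and the last token
    is the family name; middle names are ignored.
    """
    parts = full_name.lower().split()
    first, last = parts[0], parts[-1]
    return [
        first[0] + last,      # first initial + last name
        last + first[0],      # last name + first initial
        f"{first}.{last}",    # dot-separated
        f"{first}_{last}",    # underscore-separated
        f"{first}-{last}",    # hyphen-separated
    ]


print(generate_usernames("Carlton Northern"))
```

A known account name, such as the departmental CS username, could simply be appended to this list as an additional candidate.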
  15. 15. Poll Sites<br />Issue an HTTP GET to determine whether a profile exists for a generated username<br />Create site templates for links:<br />’username here’<br />’username here’<br />’username here’<br />2016 students × 6 usernames × 25 sites ≈ 302k requests<br />GET HTTP/1.1<br />If the response is 200 OK, the profile exists; otherwise it does not.<br />Soft 404s can be somewhat problematic but can be handled.<br />Some sites detect robots and present a CAPTCHA, which is also problematic.<br />15<br />
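
The polling step can be sketched roughly as follows. The site templates here are placeholders on `example` domains (the deck's real templates are not reproduced), and the helper names are hypothetical. The sketch treats only a 200 status as a hit and does not attempt soft-404 or CAPTCHA detection.

```python
import urllib.error
import urllib.request

# Hypothetical templates; each real site would get its own URL pattern.
SITE_TEMPLATES = [
    "https://site-a.example/{username}",
    "https://site-b.example/user/{username}",
]


def candidate_urls(usernames, templates=SITE_TEMPLATES):
    """Expand every template with every generated username."""
    return [t.format(username=u) for t in templates for u in usernames]


def total_requests(students, usernames_per_student, sites):
    """Number of GET requests needed to poll every combination."""
    return students * usernames_per_student * sites


def fetch_status(url, timeout=10):
    """Issue a GET and return the HTTP status code.

    HTTP error responses (404, 403, ...) are returned as codes;
    network failures (urllib.error.URLError) propagate to the caller.
    """
    request = urllib.request.Request(url, method="GET")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


def profile_exists(status_code):
    """Apply the slide's rule: 200 means the profile exists."""
    return status_code == 200


print(total_requests(2016, 6, 25))  # 302400
```

At roughly 302k requests, a real run would also need rate limiting per site, which this sketch leaves out.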
  16. 16. Google’s Social Graph API<br />Run existing profile URLs through Google Social Graph to find “me” links.<br />16<br />
  17. 17. “Me” Links<br />“me” links are links in Friend of a Friend (FOAF) and XHTML Friends Network (XFN) that identify two pages as belonging to the same identity<br />For example, a me link from my CS profile page to Twitter:<br /><html> <br /> <head> <br /> <title>Carlton Northern's CS Home Page</title> <br /> </head> <br /> <body> <br /> stuff here ...<br /> <a href= rel="me">My Twitter</a><br /> </body><br /></html><br />17<br />
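
As an illustration of how `rel="me"` links could be pulled out of a profile page without the Social Graph API, here is a small sketch using Python's standard `html.parser`. The class name, function name, and sample URLs are hypothetical; this is not the approach the deck describes, only a local equivalent of the same idea.

```python
from html.parser import HTMLParser


class MeLinkParser(HTMLParser):
    """Collect href values of <a> tags whose rel attribute contains 'me'."""

    def __init__(self):
        super().__init__()
        self.me_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # rel may hold several space-separated values, e.g. rel="me nofollow".
        rel_values = (attr_map.get("rel") or "").lower().split()
        href = attr_map.get("href")
        if "me" in rel_values and href:
            self.me_links.append(href)


def extract_me_links(html_text):
    parser = MeLinkParser()
    parser.feed(html_text)
    return parser.me_links


page = (
    "<html><body>"
    '<a href="https://twitter.example/someuser" rel="me">My Twitter</a>'
    '<a href="https://other.example/">Unrelated link</a>'
    "</body></html>"
)
print(extract_me_links(page))
```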
  18. 18. Rapportive<br />Rapportive is a contact relationship management (CRM) tool that sits on top of Gmail<br />It uses AJAX and JSON to serve content to its Gmail widget.<br />Mined .forward files on the CS departmental server<br />Found only 24 email addresses out of 2016 students<br />Ran CS and non-CS email addresses through Rapportive’s not-so-public API to access its results.<br />Produced 15.9% of our truth set profile results, with only 1.6% being unique to Rapportive<br />18<br />
  19. 19. Google and Yahoo<br />Query Google and Yahoo using their respective APIs.<br />"carlton northern" AND norfolk<br />"carlton northern" AND "computer science"<br />"carlton northern" AND "old dominion"<br />"carlton northern" site:<br />GeoNames could be used to derive nearby cities to automatically form search queries<br />The same could be done with WordNet to derive profession or education terms<br />19<br />
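
The query patterns on this slide could be generated along these lines. The function name `build_queries` and the `example.com` site value are hypothetical, and the sketch only builds query strings; it does not call any search API.

```python
def _term(text):
    """Lowercase a term and quote it if it contains more than one word."""
    text = text.lower()
    return f'"{text}"' if " " in text else text


def build_queries(full_name, context_terms, sites=()):
    """Combine a quoted full name with location/organization/domain terms."""
    name = f'"{full_name.lower()}"'
    queries = [f"{name} AND {_term(term)}" for term in context_terms]
    queries += [f"{name} site:{site}" for site in sites]
    return queries


for query in build_queries(
    "Carlton Northern",
    ["Norfolk", "computer science", "Old Dominion"],
    sites=["example.com"],
):
    print(query)
```

Terms derived from a gazetteer or a thesaurus could be passed in through `context_terms` in the same way.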
  20. 20. Google and Yahoo<br />Calls to Google and Yahoo need to be limited because of API restrictions.<br />Google restricts use to about 1,000 requests per hour<br />Furthermore, best results are in the first 1 – 8 positions of the result set<br />20<br />
  21. 21. Disambiguation Phase<br />21<br />
  22. 22. From a public Facebook profile you can (sometimes) get a person's full name, city/area, friends, and picture<br />22<br />Personally Identifiable Information Poor Profile<br />23<br />
  23. 23. Personally Identifiable Information Rich Profile<br />24<br />
  24. 24. Point System<br />Simple point system:<br />Keyword matching<br />Link community structure analysis<br />Extraction of semantic and feature data from profiles<br />A profile with 11 points is considered validated.<br />Scores can range from a negative total to about 50.<br />25<br />
  25. 25. Keyword Matching<br />1 point for weak indicators<br />one-word terms like “programmer” or “student”<br />4 points for stronger indicators<br />terms of two or more words like “computer science” or “software engineer”<br />7 points for very strong indicators<br />locations, e.g. “norfolk” or “portsmouth”<br />Localized advertisements can be problematic<br />2 points for first name or given name<br />4 points for last name<br />26<br />
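
A rough sketch of the keyword scoring on this slide, using the point values it lists. The function names are hypothetical. Two assumptions are made that the deck does not state: each indicator is counted at most once per page, and a score of 11 or more counts as validated.

```python
import re

VALIDATION_THRESHOLD = 11  # from the Point System slide


def keyword_score(text, first_name, last_name, weak_terms, strong_terms, locations):
    """Score a profile page's text with the slide's point values."""
    text = text.lower()

    def found(term):
        # Whole-word match so "student" does not match "students".
        return re.search(r"\b" + re.escape(term.lower()) + r"\b", text) is not None

    score = 0
    score += sum(1 for term in weak_terms if found(term))    # weak indicators
    score += sum(4 for term in strong_terms if found(term))  # stronger indicators
    score += sum(7 for term in locations if found(term))     # very strong indicators
    if found(first_name):
        score += 2
    if found(last_name):
        score += 4
    return score


def is_validated(score):
    # Assumption: reaching the threshold is enough.
    return score >= VALIDATION_THRESHOLD


sample = "Carlton Northern, software engineer and student in Norfolk"
score = keyword_score(
    sample,
    first_name="Carlton",
    last_name="Northern",
    weak_terms=["programmer", "student"],
    strong_terms=["computer science", "software engineer"],
    locations=["norfolk", "portsmouth"],
)
print(score, is_validated(score))  # 18 True
```

Because location terms carry 7 points each, a localized advertisement on an unrelated page can push a score most of the way to the threshold, which is the problem the slide warns about.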
  26. 26. Name Matching<br />Facebook, LinkedIn, Google, and Twitter use real names, so:<br />2 points for a first name or diminutive/nickname<br />5 points for a last name<br />Subtract 21 points if neither a first name (or nickname/diminutive) nor a last name is found<br />Watch out for diminutives/nicknames!<br /><br />LinkedIn provides location<br />add or subtract 7 points<br />27<br />
  27. 27. Link Community Structure Analysis<br />Retrieve all links in a page and see if they point to other validated profiles in the data set, if so, assign 5 points<br />28<br />Validated Profile<br />Not-Validated Profile<br />Assign 5 points to Michael’s Twitter profile<br />
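
The link check on this slide can be sketched as below. The names are hypothetical, and the sketch assumes the 5 points are awarded once per page, however many validated profiles it links to; the deck does not say whether points accumulate per link.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect every href found in <a> tags, ignoring trailing slashes."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.add(href.rstrip("/"))


def link_structure_points(html_text, validated_urls, points=5):
    """Return `points` if the page links to any already validated profile."""
    collector = LinkCollector()
    collector.feed(html_text)
    validated = {url.rstrip("/") for url in validated_urls}
    return points if collector.links & validated else 0


validated_profiles = {"https://site-a.example/carlton"}
candidate_page = '<a href="https://site-a.example/carlton/">friend</a>'
print(link_structure_points(candidate_page, validated_profiles))
```

Since newly validated profiles can in turn validate others, a pass like this would be repeated until no scores change, in line with the "run multiple times" note on the algorithm slide.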
  28. 28. Me Links and Email Matching<br />10 points if a profile is found from Rapportive<br />10 points if a profile has a me link from an already validated profile<br />29<br />Validated Profile<br />Not-Validated Profile<br />Assign 10 points to Carlton’s Twitter profile<br />
  29. 29. Experiment<br />30<br />
  30. 30. Dataset<br />2016 students from our departmental server<br />142 graduate<br />1874 undergraduate<br />Generated 9GB worth of data<br />Truth set: 20 graduate students and 2 professors from our research group, Web Science and Digital Libraries<br />Use the information retrieval metrics of precision, recall, and f-measure to assess our truth set<br />31<br />
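
For reference, the three metrics named on this slide can be computed from a set of discovered profiles and a truth set as follows. The function name and the sample profile labels are made up for illustration and are not data from the study.

```python
def precision_recall_f(found, truth):
    """Compute precision, recall, and balanced f-measure for two sets."""
    found, truth = set(found), set(truth)
    true_positives = len(found & truth)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Illustrative labels only: 4 profiles found, 6 in the truth set, 3 in common.
found_profiles = {"a", "b", "c", "d"}
truth_profiles = {"a", "b", "c", "e", "f", "g"}
print(precision_recall_f(found_profiles, truth_profiles))
```

Precision penalizes wrongly attributed profiles, while recall penalizes profiles that were never discovered.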
  31. 31. Truth Set Results Summary<br />32<br />
  32. 32. Social Media Web Site Results<br />33<br />
  33. 33. Whole Set Service Graph<br />34<br />
  34. 34. 35<br />
  35. 35. Truth Set User Graph<br />36<br />
  36. 36. Whole Set User Graph<br />37<br />
  37. 37. 38<br />
  38. 38. Whole Set User Graph Without Blogger Links<br />39<br />
  39. 39. Closeup<br />40<br />
  40. 40. Future Work<br />Facial recognition<br />Better link community structure analysis<br />Perform a quantitative social media digital preservation study<br />Remove social media sites that produced few or no results (unpopular) and add new ones (<br />41<br />
  41. 41. Potential Impacts/Uses<br />Open source intelligence gathering<br />“Open source” as in publicly available information<br />Social media research<br />Measure the social health of an organization<br />42<br />
  42. 42. Conclusions<br />Completely automated, with the only human interaction being the creation of the search query<br />Precision 0.863, recall 0.526, f-measure 0.632<br />The approach uses non-traditional search mechanisms to achieve its goals<br />Only publicly available information was used<br />43<br />
  43. 43. 44<br />Carlton Northern<br /><br /><br />