
MDS 2011 Presentation: An Unsupervised Approach to Discovering and Disambiguating Social Media Profiles


Published in: Technology, Business

  1. An Unsupervised Approach to Discovering and Disambiguating Social Media Profiles
     Mining Data Semantics Workshop 2011
     Carlton Northern
     Old Dominion University
     8/25/2011
  2. Background
     Digital Preservation
     How are students using social media as a digital preservation strategy?
     Evaluating Personal Archiving Strategies for Internet-based Information - Marshall, McCown, Nelson
  3. Goal
     Ascertain the set of social media profiles for ODU CS students.
  4. What's out there already?
  5. Intelius
  6. Wink / my life
  7. Google
  8. Requirements and Assumptions
     Approach must be automated: no human interaction except for a search query consisting of:
       location
       organization
       profession/education domain
     Achieve precision of 0.7 or higher and f-measure of 0.5 or higher, comparable to a human performing the same activity
     Must find profiles not indexed by search engines
     Can use any means available, including search engines, page scraping, web service APIs, etc.
     Only publicly declared identities; do not expose obfuscated identities
       e.g., "Bruce Wayne" -> "Batman"
     Find profiles from 25 pre-defined sites (next slide)
     Approach must be extensible, i.e., new social media sites can be added with minimal changes
  9. Social Media Sites
  10. Approach
  11. Algorithm
      Discovery Phase
      Generate Usernames
      Check Rapportive
      Disambiguation Phase
      Assign Points for Keywords, Email, Me and Friend Links
      Check Google and Yahoo
      Check Sites for Profiles
      Check Sites for Profiles
      Check Social Graph
      Remove Duplicates
      *Run multiple times
  12. Discovery Phase
  13. Starting Information
      Given:
        Full name, e.g. Carlton Northern
        CS username, e.g. cnorther
        CS email, e.g.
        .forward files ->
        CS profile URI, e.g.
      Inferred:
        School affiliation, e.g. Old Dominion
        Approximate location, e.g. Norfolk, Hampton Roads
        Computer Science affiliation, e.g. software engineer
  14. Username Generation
      Generate usernames from full name derivatives, e.g. for “Carlton Northern” we have:
        cnorthern
        northernc
        carlton.northern
        carlton_northern
        carlton-northern
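
The derivation patterns listed on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `generate_usernames` is hypothetical, the patterns are read off the slide's examples, and the hyphenated form is assumed to use the full last name.

```python
def generate_usernames(full_name: str) -> list[str]:
    """Return candidate usernames derived from a 'First Last' style name."""
    parts = full_name.lower().split()
    first, last = parts[0], parts[-1]
    candidates = [
        f"{first[0]}{last}",  # first initial + last name
        f"{last}{first[0]}",  # last name + first initial
        f"{first}.{last}",    # dot-separated
        f"{first}_{last}",    # underscore-separated
        f"{first}-{last}",    # hyphen-separated (assumed full last name)
    ]
    # Preserve order while dropping accidental duplicates.
    return list(dict.fromkeys(candidates))


print(generate_usernames("Carlton Northern"))
```

The known CS username (e.g. cnorther) would presumably be added to this list as one more candidate.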
  15. Poll Sites
      Issue HTTP GET to determine if a profile exists with a generated username
      Create site templates for links:
        ’username here’
        ’username here’
        ’username here’
      2016 students x 6 usernames x 25 sites = 302k requests
      GET HTTP/1.1
      If the response is 200 OK, the profile exists; otherwise it doesn’t.
      Soft 404s can be somewhat problematic but can be handled.
      Some sites detect robots and present a CAPTCHA, which is also problematic.
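
A minimal sketch of the polling step, assuming URL templates with a `{username}` placeholder. The template URLs below are placeholders (the real site URLs are not given in this transcript), and `http_status` and `poll_sites` are hypothetical names. Soft 404s and CAPTCHA pages are not handled here.

```python
import urllib.error
import urllib.request
from typing import Callable

# Hypothetical templates; the actual sites and URL patterns are not in the slides.
SITE_TEMPLATES = {
    "site-a": "https://site-a.example/{username}",
    "site-b": "https://site-b.example/users/{username}",
}


def http_status(url: str, timeout: float = 10.0) -> int:
    """Issue an HTTP GET and return the status code (0 on network failure)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 404 when no such profile exists
    except OSError:
        return 0  # DNS failure, refused connection, timeout, ...


def poll_sites(usernames, templates, get_status: Callable[[str], int] = http_status):
    """Return the candidate profile URLs that answer with 200 OK."""
    found = []
    for template in templates.values():
        for username in usernames:
            url = template.format(username=username)
            if get_status(url) == 200:
                found.append(url)
    return found


# Request volume from the slide: students x usernames x sites.
total_requests = 2016 * 6 * 25
print(total_requests)  # 302400
```

Passing the status function as a parameter keeps the loop testable without touching the network, and gives one place to add throttling or soft-404 detection later.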
  16. Google’s Social Graph API
      Run existing profile URLs through Google Social Graph to find “me” links.
  17. “Me” Links
      “me” links are links in Friend of a Friend (FOAF) and XHTML Friends Network (XFN) that specify the same identity
      For example, a me link from my CS profile page to Twitter:
        <html>
         <head>
          <title>Carlton Northern's CS Home Page</title>
         </head>
         <body>
          stuff here ...
          <a href= rel="me">My Twitter</a>
         </body>
        </html>
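
As an illustration of how such links can be picked out of a page without a graph API, here is a sketch using Python's standard `html.parser`. The class and function names are hypothetical, and the URLs in the sample page are placeholders rather than the ones from the slide.

```python
from html.parser import HTMLParser


class MeLinkParser(HTMLParser):
    """Collect href values of <a> and <link> elements whose rel includes 'me'."""

    def __init__(self):
        super().__init__()
        self.me_links = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attributes = dict(attrs)
        # rel may hold several space-separated values, e.g. rel="me nofollow".
        rel_values = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        if "me" in rel_values and href:
            self.me_links.append(href)


def extract_me_links(html: str) -> list[str]:
    parser = MeLinkParser()
    parser.feed(html)
    return parser.me_links


# Hypothetical page content with placeholder URLs.
page = """
<html><body>
  <a href="https://twitter.example/someuser" rel="me">My Twitter</a>
  <a href="https://other.example/friend" rel="friend">A friend</a>
</body></html>
"""
print(extract_me_links(page))
```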
  18. Rapportive
      Rapportive is a contact relationship management (CRM) tool that sits on top of Gmail
      Uses AJAX and JSON to serve up content to their Gmail widget.
      Mined .forward files on the CS departmental server
        Found only 24 email addresses out of 2016 students
      Run CS and non-CS email addresses through Rapportive’s not-so-public API to access their results.
      Produced 15.9% of our truth set profile results, with only 1.6% being unique to Rapportive
  19. Google and Yahoo
      Query Google and Yahoo using their respective APIs.
        "carlton northern" AND norfolk
        "carlton northern" AND "computer science"
        "carlton northern" AND "old dominion"
        "carlton northern" site:
      Geonames could be used to derive nearby cities to automatically form search queries
      The same could be done with WordNet to derive profession or education terms
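
The query patterns on this slide could be generated along these lines. This is a sketch only: `build_queries` is a hypothetical helper, the `site:` value in the usage example is a placeholder, and no search API is called.

```python
def build_queries(full_name, context_terms, sites=()):
    """Build search queries pairing a quoted name with context terms or sites."""
    name = f'"{full_name.lower()}"'
    queries = []
    for term in context_terms:
        # Quote multi-word terms so they are matched as phrases.
        quoted = f'"{term}"' if " " in term else term
        queries.append(f"{name} AND {quoted}")
    for site in sites:
        queries.append(f"{name} site:{site}")
    return queries


for query in build_queries(
    "Carlton Northern",
    ["norfolk", "computer science", "old dominion"],
    sites=["site-a.example"],  # placeholder domain
):
    print(query)
```

The context terms are where Geonames (nearby cities) or WordNet (related profession terms) output would be fed in.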
  20. Google and Yahoo
      Calls to Google and Yahoo need to be limited because of API restrictions.
        Google restricts use to about 1,000 requests per hour
      Furthermore, the best results are in the first 1 to 8 positions of the result set
  21. Disambiguation Phase
  22. From a public Facebook profile you can (sometimes) get a person's full name, city/area, friends, and picture
      Personally Identifiable Information Poor Profile
  23. Personally Identifiable Information Rich Profile
  24. Point System
      Simple point system:
        Keyword matching
        Link community structure analysis
        Extraction of semantic and feature data from profiles
      A profile that reaches 11 points is considered validated.
      Scores can range from negative totals to about 50.
  25. Keyword Matching
      1 point for weak indicators
        single-word terms like “programmer” or “student”
      4 points for stronger indicators
        terms of 2 or more words like “computer science” or “software engineer”
      7 points for very strong indicators
        locations, e.g. “norfolk” or “portsmouth”
        Localized advertisements can be problematic
      2 points for first name or given name
      4 points for last name
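
A sketch of how these weights might be applied to the text of a candidate profile page. The term lists contain only the examples named on the slide, the function names are hypothetical, and counting each distinct term at most once is an assumption (the slide does not say how repeated matches are treated).

```python
import re

WEAK_TERMS = ["programmer", "student"]                    # 1 point each
STRONG_TERMS = ["computer science", "software engineer"]  # 4 points each
LOCATION_TERMS = ["norfolk", "portsmouth"]                # 7 points each


def contains(text: str, term: str) -> bool:
    """Whole-word (or whole-phrase) match inside already lowercased text."""
    return re.search(rf"\b{re.escape(term)}\b", text) is not None


def keyword_score(page_text: str, first_name: str, last_name: str) -> int:
    text = page_text.lower()
    score = 0
    score += sum(1 for term in WEAK_TERMS if contains(text, term))
    score += sum(4 for term in STRONG_TERMS if contains(text, term))
    score += sum(7 for term in LOCATION_TERMS if contains(text, term))
    if contains(text, first_name.lower()):
        score += 2
    if contains(text, last_name.lower()):
        score += 4
    return score


sample = "Carlton Northern is a computer science student in Norfolk."
print(keyword_score(sample, "Carlton", "Northern"))  # 1 + 4 + 7 + 2 + 4 = 18
```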
  26. Name Matching
      Facebook, LinkedIn, Google, and Twitter use real names, so:
        2 points for a first name or diminutive/nickname
        5 points for a last name
        Subtract 21 points if neither a first name (or diminutive/nickname) nor a last name is found
      Watch out for diminutives/nicknames!
      LinkedIn provides location
        add or subtract 7 points
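
One possible reading of these rules in code. The function names are hypothetical, the diminutive list is supplied by the caller (the slides do not say where it comes from), and the 21-point penalty is assumed to apply only when no form of the first name and no last name is found.

```python
def name_score(display_name, first_name, last_name, diminutives=()):
    """Score the display name of a profile on a real-name site."""
    tokens = display_name.lower().split()
    first_variants = {first_name.lower(), *(d.lower() for d in diminutives)}
    first_found = any(token in first_variants for token in tokens)
    last_found = last_name.lower() in tokens
    if not first_found and not last_found:
        return -21  # assumed penalty case: nothing about the name matches
    return (2 if first_found else 0) + (5 if last_found else 0)


def location_score(profile_location, expected_locations):
    """Add 7 when the profile location matches an expected place, else subtract 7."""
    location = profile_location.lower()
    matched = any(place.lower() in location for place in expected_locations)
    return 7 if matched else -7


print(name_score("Carlton Northern", "Carlton", "Northern"))            # 7
print(location_score("Norfolk, Virginia Area", ["norfolk", "hampton roads"]))  # 7
```

The -21 penalty is large enough to cancel a location match plus several keyword matches, which is presumably why it is set so high relative to the 11-point threshold.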
  27. Link Community Structure Analysis
      Retrieve all links in a page and check whether they point to other validated profiles in the data set; if so, assign 5 points
      Validated Profile
      Not-Validated Profile
      Assign 5 points to Michael’s Twitter profile
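
A sketch of the link check, assuming the page's outgoing links have already been extracted. The names are hypothetical, the URLs are placeholders, and awarding the 5 points once per page (rather than once per matching link) is an assumption.

```python
def normalize(url: str) -> str:
    """Crude normalization for comparison: trim, drop trailing slash, lowercase."""
    return url.strip().rstrip("/").lower()


def link_structure_score(page_links, validated_profiles) -> int:
    """Return 5 if any link on the page targets an already validated profile."""
    validated = {normalize(url) for url in validated_profiles}
    return 5 if any(normalize(link) in validated for link in page_links) else 0


validated = ["https://site-a.example/cnorthern"]
links_on_candidate = ["https://site-a.example/cnorthern/", "https://news.example/"]
print(link_structure_score(links_on_candidate, validated))  # 5
```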
  28. Me Links and Email Matching
      10 points if a profile is found from Rapportive
      10 points if a profile has a me link from an already validated profile
      Validated Profile
      Not-Validated Profile
      Assign 10 points to Carlton’s Twitter profile
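
Because a me link only counts when its source is already validated, validation naturally runs in passes (the algorithm slide notes "Run multiple times"). The sketch below assumes the 11-point threshold from the point system slide; the function name, the data layout, and the placeholder profile identifiers are all hypothetical.

```python
VALIDATION_THRESHOLD = 11


def validate_profiles(base_scores, me_links, rapportive_hits, max_passes=5):
    """Repeatedly score candidates until no new profile is validated.

    base_scores: profile -> points from keyword/name matching
    me_links: profile -> profiles it points to with rel="me"
    rapportive_hits: set of profiles returned by the email lookup
    """
    validated = set()
    for _ in range(max_passes):
        newly_validated = set()
        for profile, base in base_scores.items():
            if profile in validated:
                continue
            score = base
            if profile in rapportive_hits:
                score += 10
            # Me link from a profile validated in an earlier pass.
            if any(profile in me_links.get(source, ()) for source in validated):
                score += 10
            if score >= VALIDATION_THRESHOLD:
                newly_validated.add(profile)
        if not newly_validated:
            break
        validated |= newly_validated
    return validated


result = validate_profiles(
    base_scores={"cs-page": 15, "twitter": 3, "blog": 0, "other": 4},
    me_links={"cs-page": ["twitter"], "twitter": ["blog"]},
    rapportive_hits={"other"},
)
print(sorted(result))  # ['cs-page', 'other', 'twitter']
```

Here "twitter" is only validated on the second pass, once "cs-page" has been validated, while "blog" stays at 10 points and falls short of the threshold.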
  29. Experiment
  30. Dataset
      2016 students from our departmental server
        142 graduate
        1874 undergraduate
      Generated 9 GB worth of data
      Truth set: 20 graduate students and 2 professors from our research group, Web Science and Digital Libraries
      Use the information retrieval metrics of precision, recall, and f-measure to assess our truth set
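
For reference, the three metrics can be computed from a set of found profiles and a truth set as below. This assumes the balanced F-measure (harmonic mean of precision and recall); the slides do not state which variant or averaging was used, and the sample sets are made up for illustration.

```python
def precision_recall_f(found: set, truth: set):
    """Return (precision, recall, balanced F-measure) for a set of found items."""
    true_positives = len(found & truth)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Made-up example: 4 profiles found, 3 of them correct, 6 in the truth set.
found = {"a", "b", "c", "d"}
truth = {"a", "b", "c", "e", "f", "g"}
print(precision_recall_f(found, truth))
```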
  31. Truth Set Results Summary
  32. Social Media Web Site Results
  33. Whole Set Service Graph
  35. Truth Set User Graph
  36. Whole Set User Graph
  38. Whole Set User Graph Without Blogger Links
  39. Closeup
  40. Future Work
      Facial recognition
      Better link community structure analysis
      Perform quantitative social media digital preservation study
      Remove social media sites that produced few or no results (unpopular) and add new ones (
  41. Potential Impacts/Uses
      Open source intelligence gathering
        “Open source” as in publicly available information
      Social media research
      Measure the social health of an organization
  42. Conclusions
      Completely automated, with the only human interaction being the creation of the search query
      Precision 0.863, recall 0.526, f-measure 0.632
      The approach uses non-traditional search mechanisms to achieve its goals
      Only publicly available information was used
  43. Carlton Northern