An Unsupervised Approach to Discovering and Disambiguating Social Media Profiles
Mining Data Semantics Workshop 2011
Carlton Northern, Old Dominion University
8/25/2011
Background
• Digital preservation
• How are students using social media as a digital preservation strategy?
• Marshall, McCown, Nelson, "Evaluating Personal Archiving Strategies for Internet-based Information": http://www.cs.odu.edu/~mln/pubs/archiving-2007/eval-personal-arch-strat-archiving07.pdf
Goal
• Ascertain the set of social media profiles for ODU CS students.
What's out there already?
• Intelius
• Wink / MyLife
• Google
Requirements and Assumptions
• The approach must be automated: no human interaction except for a search query consisting of location, organization, and profession/education domain.
• Achieve precision of 0.7 or higher and F-measure of 0.5 or higher, comparable to a human performing the same activity.
• Must find profiles not indexed by search engines.
• May use any means available, including search engines, page scraping, web service APIs, etc.
• Only publicly declared identities; do not expose obfuscated identities, e.g., "Bruce Wayne" -> "Batman".
• Find profiles from 25 pre-defined sites (next slide).
• The approach must be extensible, i.e., new social media sites can be added with minimal changes.
Social Media Sites
Approach
Algorithm (flowchart)
• Discovery Phase: Generate Usernames; Check Sites for Profiles; Check Rapportive; Check Google and Yahoo; Check Social Graph
• Disambiguation Phase: Assign Points for Keywords, Email, Me and Friend Links
• Remove Duplicates
• * Run multiple times
Discovery Phase
Starting Information
Given:
• Full name, e.g., Carlton Northern
• CS username, e.g., cnorther
• CS email, e.g., cnorther@cs.odu.edu
• .forward files -> carlton.northern@gmail.com
• CS profile URI, e.g., http://www.cs.odu.edu/~cnorther
Inferred:
• School affiliation, e.g., Old Dominion
• Approximate location, e.g., Norfolk, Hampton Roads
• Computer Science affiliation, e.g., software engineer
Username Generation
• Generate usernames from derivatives of the full name; e.g., for "Carlton Northern" we have:
  cnorthern
  northernc
  carlton.northern
  carlton_northern
  carlton-northern
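The derivation rules above can be sketched as a small function. This is only an illustration: the name `generate_usernames` is hypothetical, and it assumes the last variant on the slide is the hyphenated full name and that only the first and last name parts are used.

```python
def generate_usernames(full_name):
    """Return candidate usernames derived from a full name.

    Assumption: only the first and last tokens of the name are used,
    and all candidates are lower-cased.
    """
    parts = full_name.lower().split()
    if len(parts) < 2:
        raise ValueError("expected at least a first and a last name")
    first, last = parts[0], parts[-1]
    return [
        first[0] + last,      # first initial + last name
        last + first[0],      # last name + first initial
        f"{first}.{last}",    # dot-separated
        f"{first}_{last}",    # underscore-separated
        f"{first}-{last}",    # hyphen-separated
    ]
```

The sixth username counted on the next slide is presumably the known CS username (e.g., cnorther), which would simply be appended to this list.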
Poll Sites
• Issue an HTTP GET to determine whether a profile exists for a generated username.
• Create site templates for links:
  http://www.facebook.com/'username here'
  http://www.stumbleupon.com/stumbler/'username here'
  https://picasaweb.google.com/'username here'
• 2016 students x 6 usernames x 25 sites = 302,400 (about 302k) requests
• Example: GET http://www.facebook.com/carlton.northern HTTP/1.1
• If the response is 200, the profile exists; otherwise it does not.
• Soft 404s can be somewhat problematic but can be handled.
• Some sites detect robots and present a CAPTCHA, which is also problematic.
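A minimal sketch of the polling step, assuming Python's standard `urllib`. The function names and the `{username}` placeholder syntax are illustrative choices, not the author's code; the three templates are the ones listed on the slide. Soft-404 and CAPTCHA handling are deliberately left out.

```python
import urllib.error
import urllib.request

# Site templates from the slide; "{username}" is our placeholder syntax.
SITE_TEMPLATES = [
    "http://www.facebook.com/{username}",
    "http://www.stumbleupon.com/stumbler/{username}",
    "https://picasaweb.google.com/{username}",
]

def candidate_urls(usernames, templates=SITE_TEMPLATES):
    """Expand every template with every generated username."""
    return [t.format(username=u) for t in templates for u in usernames]

def http_status(url, timeout=10):
    """Issue a GET and return the HTTP status code, or None on network failure."""
    request = urllib.request.Request(url, method="GET")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code   # e.g. 404: the server answered, profile not found
    except OSError:
        return None         # DNS failure, timeout, refused connection, ...

def existing_profiles(urls, get_status=http_status):
    """Keep only the URLs that answer with 200."""
    return [url for url in urls if get_status(url) == 200]
```

Passing `get_status` as a parameter keeps the filtering logic testable without network access; a real run would also need rate limiting, since the full data set implies 2016 x 6 x 25 = 302,400 requests.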
Google's Social Graph API
• Run existing profile URLs through the Google Social Graph API to find "me" links.
"Me" Links
• "me" links are links in Friend of a Friend (FOAF) and XHTML Friends Network (XFN) markup that identify two pages as belonging to the same identity.
• For example, a me link from my CS profile page to Twitter:

    <html>
      <head>
        <title>Carlton Northern's CS Home Page</title>
      </head>
      <body>
        stuff here ...
        <a href="http://twitter.com/carltonnorthern" rel="me">My Twitter</a>
      </body>
    </html>
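As an illustration of how such links could be extracted locally (the talk itself relies on Google's Social Graph API for this), here is a sketch using Python's standard `html.parser`. The class and function names are hypothetical.

```python
from html.parser import HTMLParser

class MeLinkParser(HTMLParser):
    """Collect the href of every <a> whose rel attribute contains "me"."""

    def __init__(self):
        super().__init__()
        self.me_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        # rel may hold several space-separated values, e.g. rel="me nofollow"
        rel_values = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        if href and "me" in rel_values:
            self.me_links.append(href)

def find_me_links(html_text):
    """Return all rel="me" link targets found in an HTML document."""
    parser = MeLinkParser()
    parser.feed(html_text)
    return parser.me_links
```

Applied to the example page above, `find_me_links` would return the Twitter URL and ignore ordinary links.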
Rapportive
• Rapportive is a contact relationship management (CRM) tool that sits on top of Gmail.
• It uses AJAX and JSON to serve content to its Gmail widget.
• Mined .forward files on the CS departmental server; found only 24 email addresses out of 2016 students.
• Ran CS and non-CS email addresses through Rapportive's not-so-public API to access its results.
• Produced 15.9% of our truth set profile results, with only 1.6% being unique to Rapportive.
Google and Yahoo
• Query Google and Yahoo using their respective APIs:
  "carlton northern" AND norfolk
  "carlton northern" AND "computer science"
  "carlton northern" AND "old dominion"
  "carlton northern" site:http://www.facebook.com
• GeoNames could be used to derive nearby cities and form search queries automatically.
• The same could be done with WordNet to derive profession or education terms.
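The query patterns above can be generated mechanically. The sketch below only builds the query strings; it does not call any search API, and the function names and argument layout are assumptions made for illustration.

```python
def quote_if_phrase(term):
    """Wrap multi-word terms in double quotes so they match as a phrase."""
    return f'"{term}"' if " " in term else term

def build_queries(full_name, locations, domain_terms, organizations, sites):
    """Build search queries in the style shown on the slide."""
    quoted_name = f'"{full_name.lower()}"'
    queries = []
    for term in [*locations, *domain_terms, *organizations]:
        queries.append(f"{quoted_name} AND {quote_if_phrase(term)}")
    for site in sites:
        queries.append(f"{quoted_name} site:{site}")
    return queries
```

The `locations` and `domain_terms` lists are where GeoNames-derived cities or WordNet-derived profession terms would be supplied.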
Google and Yahoo (continued)
• Calls to Google and Yahoo need to be limited because of API restrictions.
• Google restricts use to about 1,000 requests per hour.
• Furthermore, the best results are in the first 1 to 8 positions of the result set.
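One simple way to stay under an hourly quota is to space calls evenly. The sketch below is an assumption about how throttling could be done, not the author's implementation; `Throttle` and `top_results` are hypothetical names, and the clock and sleep functions are injectable so the logic can be checked without waiting.

```python
import time

class Throttle:
    """Space calls so that at most max_per_hour happen in any hour."""

    def __init__(self, max_per_hour=1000, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 3600.0 / max_per_hour  # 3.6 s for 1,000 per hour
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now

def top_results(results, limit=8):
    """Keep only the leading positions, where the best results tend to be."""
    return results[:limit]
```

A caller would invoke `throttle.wait()` before each search request and pass the returned result list through `top_results`.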
Disambiguation Phase
Personally Identifiable Information Rich Profile
Point System
• Simple point system based on:
  keyword matching
  link community structure analysis
  extraction of semantic and feature data from profiles
• A profile with 11 points or more is considered validated.
• Scores can range from a negative total to about 50.
Keyword Matching
• 1 point for weak indicators: one-word terms like "programmer" or "student"
• 4 points for stronger indicators: terms of two or more words like "computer science" or "software engineer"
• 7 points for very strong indicators: locations, e.g., "norfolk" or "portsmouth" (localized advertisements can be problematic)
• 2 points for first name or given name
• 4 points for last name
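The weights above can be sketched as a scoring function. The term lists are passed in by the caller; the sketch assumes each term counts once per page no matter how often it occurs, which the slides do not specify, and all names here are illustrative.

```python
import re

WEAK_POINTS = 1        # one-word terms
STRONG_POINTS = 4      # terms of two or more words
LOCATION_POINTS = 7    # location names
FIRST_NAME_POINTS = 2
LAST_NAME_POINTS = 4

def contains_term(text, term):
    """Case-insensitive whole-word (or whole-phrase) match."""
    pattern = r"\b" + re.escape(term.lower()) + r"\b"
    return re.search(pattern, text.lower()) is not None

def keyword_score(text, weak_terms, strong_terms, locations, first_name, last_name):
    """Sum the keyword points for one profile page (each term counted once)."""
    score = 0
    score += WEAK_POINTS * sum(contains_term(text, t) for t in weak_terms)
    score += STRONG_POINTS * sum(contains_term(text, t) for t in strong_terms)
    score += LOCATION_POINTS * sum(contains_term(text, t) for t in locations)
    if contains_term(text, first_name):
        score += FIRST_NAME_POINTS
    if contains_term(text, last_name):
        score += LAST_NAME_POINTS
    return score
```

Whole-word matching avoids counting "student" inside unrelated longer words, but it does nothing about localized advertisements that mention a nearby city.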
Name Matching
• Facebook, LinkedIn, Google, and Twitter use real names, so:
  2 points for a first name or diminutive/nickname
  5 points for a last name
  subtract 21 points if neither a first name (or diminutive/nickname) nor a last name is found
• Watch out for diminutives/nicknames: http://code.google.com/p/nickname-and-diminutive-names-lookup/
• LinkedIn provides a location: add or subtract 7 points depending on whether it matches.
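A sketch of the name rule under one reading of the slide: the 21-point penalty applies only when neither a first-name form nor the last name appears in the profile's display name. The tiny nickname table is a hypothetical stand-in for a real diminutive lookup, and the function name is illustrative.

```python
# Hypothetical nickname table; a real run would load a diminutive lookup list.
NICKNAMES = {"carlton": {"carl"}}

FIRST_NAME_POINTS = 2
LAST_NAME_POINTS = 5
NO_NAME_PENALTY = -21

def name_score(display_name, first_name, last_name, nicknames=NICKNAMES):
    """Score a profile's display name against the person's known name."""
    tokens = set(display_name.lower().split())
    first = first_name.lower()
    accepted_first = {first} | nicknames.get(first, set())
    has_first = bool(tokens & accepted_first)
    has_last = last_name.lower() in tokens
    if not has_first and not has_last:
        return NO_NAME_PENALTY
    return FIRST_NAME_POINTS * has_first + LAST_NAME_POINTS * has_last
```

The large penalty means a profile on a real-name site whose displayed name shares nothing with the student is very unlikely to reach the validation threshold from keywords alone.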
Link Community Structure Analysis
• Retrieve all links in a page and check whether they point to other validated profiles in the data set; if so, assign 5 points.
• Figure: validated and not-validated profiles; example: assign 5 points to Michael's Twitter profile.
Me Links and Email Matching
• 10 points if a profile is found via Rapportive
• 10 points if a profile has a me link from an already validated profile
• Figure: validated and not-validated profiles; example: assign 10 points to Carlton's Twitter profile.
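Because friend-link and me-link points depend on which profiles are already validated, the scoring has to be repeated until nothing changes (the "run multiple times" note in the algorithm). The sketch below is one possible way to do that, assuming a validation threshold of 11 points and base scores already computed from keyword and name matching; every name in it is hypothetical.

```python
VALIDATION_THRESHOLD = 11
FRIEND_LINK_POINTS = 5    # page links to an already validated profile
ME_LINK_POINTS = 10       # a validated profile has a rel="me" link to this page
RAPPORTIVE_POINTS = 10    # profile was returned by Rapportive

def validate_profiles(base_scores, outgoing_links, me_links, from_rapportive):
    """Return the set of validated profile URLs.

    base_scores:     dict url -> points from keyword and name matching
    outgoing_links:  dict url -> set of URLs linked from that page
    me_links:        dict url -> set of URLs that page marks with rel="me"
    from_rapportive: set of URLs found through Rapportive
    """
    validated = set()
    while True:
        newly_validated = set()
        for url, base in base_scores.items():
            if url in validated:
                continue
            score = base
            if url in from_rapportive:
                score += RAPPORTIVE_POINTS
            if outgoing_links.get(url, set()) & validated:
                score += FRIEND_LINK_POINTS
            if any(url in me_links.get(v, set()) for v in validated):
                score += ME_LINK_POINTS
            if score >= VALIDATION_THRESHOLD:
                newly_validated.add(url)
        if not newly_validated:
            return validated
        validated |= newly_validated
```

Each pass can only add profiles, so the loop ends after at most as many passes as there are profiles.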
Experiment
Dataset
• 2016 students from our departmental server: 142 graduate, 1874 undergraduate
• Generated 9 GB of data
• Truth set: 20 graduate students and 2 professors from our research group, Web Science and Digital Libraries
• Use the information retrieval metrics of precision, recall, and F-measure to assess our truth set
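For reference, the three metrics can be computed from the set of profiles the system found and the set known to be correct. This sketch uses the standard definitions with F-measure as the harmonic mean of precision and recall; the slides do not say how scores were aggregated across users, so per-user averaging may give different totals.

```python
def precision_recall_f(found, truth):
    """Compute precision, recall, and F-measure for two sets of profile URLs."""
    true_positives = len(found & truth)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

For example, finding four profiles of which three are among six true profiles gives precision 0.75 and recall 0.5.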
Truth Set Results Summary
Social Media Web Site Results
Whole Set Service Graph
Truth Set User Graph
Whole Set User Graph
Whole Set User Graph Without Blogger Links
Closeup
Future Work
• Facial recognition
• Better link community structure analysis
• Perform a quantitative social media digital preservation study
• Remove social media sites that produced few or no results (unpopular) and add new ones (foursquare.com)?
Potential Impacts/Uses
• Open source intelligence gathering ("open source" as in publicly available information)
• Social media research
• Measure the social health of an organization
Conclusions
• Completely automated; the only human interaction is the creation of the search query.
• Precision 0.863, recall 0.526, F-measure 0.632
• The approach uses non-traditional search mechanisms to achieve its goals.
• Only publicly available information was used.
Carlton Northern
carlton.northern@gmail.com
http://carlton-northern.com/
MDS 2011 Presentation: An Unsupervised Approach to Discovering and Disambiguating Social Media Profiles

  • 1. An Unsupervised Approach to Discovering and Disambiguating Social Media Profiles. Mining Data Semantics Workshop 2011. Carlton Northern, Old Dominion University, 8/25/2011
  • 2. Background: Digital Preservation. How are students using social media as a digital preservation strategy? Reference: "Evaluating Personal Archiving Strategies for Internet-based Information" - Marshall, McCown, Nelson. http://www.cs.odu.edu/~mln/pubs/archiving-2007/eval-personal-arch-strat-archiving07.pdf
  • 3. Goal: Ascertain the set of social media profiles for ODU CS students.
  • 4. What's out there already?
  • 6. Wink / MyLife
  • 8. Requirements and Assumptions: The approach must be automated, with no human interaction except for a search query consisting of location, organization, and profession/education domain. Achieve precision of 0.7 or higher and f-measure of 0.5 or higher, comparable to a human performing the same activity. Must find profiles not indexed by search engines. Can use any means available, including search engines, page scraping, web service APIs, etc. Only publicly declared identities; do not expose obfuscated identities, e.g. "Bruce Wayne" -> "Batman". Find profiles from 25 pre-defined sites (next slide). The approach must be extensible, i.e. new social media sites can be added with minimal changes.
  • 11. Algorithm (flowchart). Discovery Phase: Generate Usernames; Check Rapportive; Check Google and Yahoo; Check Sites for Profiles; Check Social Graph; Remove Duplicates. Disambiguation Phase: Assign Points for Keywords, Email, Me and Friend Links. *Run multiple times.
  • 13. Starting Information. Given: full name, e.g. Carlton Northern; CS username, e.g. cnorther; CS email, e.g. cnorther@cs.odu.edu; .forward files -> carlton.northern@gmail.com; CS profile URI, e.g. http://www.cs.odu.edu/~cnorther. Inferred: school affiliation, e.g. Old Dominion; approximate location, e.g. Norfolk, Hampton Roads; Computer Science affiliation, e.g. software engineer.
  • 14. Username Generation: Generate usernames from derivatives of the full name, e.g. for "Carlton Northern" we have: cnorthern, northernc, carlton.northern, carlton_northern, carlton-northern.
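The derivation above can be sketched roughly as follows. This is a minimal illustration, assuming a simple "First Last" name; the function name `generate_usernames` is hypothetical, and only the patterns visible on the slide are produced (the hyphenated form is assumed to use the full last name).

```python
def generate_usernames(full_name: str) -> list[str]:
    """Derive candidate usernames from a "First Last" name.

    Only the patterns listed on the slide are reproduced; any other
    derivatives used in the study are not reconstructed here.
    """
    parts = full_name.lower().split()
    first, last = parts[0], parts[-1]
    candidates = [
        first[0] + last,      # cnorthern
        last + first[0],      # northernc
        f"{first}.{last}",    # carlton.northern
        f"{first}_{last}",    # carlton_northern
        f"{first}-{last}",    # carlton-northern (full last name assumed)
    ]
    # Keep the original order while dropping accidental duplicates.
    return list(dict.fromkeys(candidates))


print(generate_usernames("Carlton Northern"))
```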
  • 15. Poll Sites: Issue an HTTP GET to determine whether a profile exists for a generated username. Create site templates for links: http://www.facebook.com/'username here', http://www.stumbleupon.com/stumbler/'username here', https://picasaweb.google.com/'username here'. 2016 students x 6 usernames x 25 sites = 302k requests. Example: GET http://www.facebook.com/carlton.northern HTTP/1.1. If the response is 200, the profile exists; otherwise it does not. Soft 404s can be somewhat problematic but can be handled. Some sites detect robots and present a CAPTCHA, which is also problematic.
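A rough sketch of the polling step, using only the Python standard library. The three URL templates come from the slide; the names `SITE_TEMPLATES`, `candidate_urls`, and `profile_exists` are hypothetical, and the soft-404 and CAPTCHA handling mentioned above is deliberately left out.

```python
import urllib.request

# URL templates from the slide; "{username}" marks the substitution point.
SITE_TEMPLATES = {
    "facebook": "http://www.facebook.com/{username}",
    "stumbleupon": "http://www.stumbleupon.com/stumbler/{username}",
    "picasaweb": "https://picasaweb.google.com/{username}",
}


def candidate_urls(usernames, templates=SITE_TEMPLATES):
    """Expand every (site, username) pair into a profile URL to poll."""
    return [
        (site, template.format(username=name))
        for site, template in templates.items()
        for name in usernames
    ]


def profile_exists(url, timeout=10.0):
    """Issue an HTTP GET and treat a 200 response as "profile exists".

    Soft 404s and CAPTCHA pages can also answer 200, so a real crawler
    would need an additional per-site content check (not shown).
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers HTTP error statuses (HTTPError), network failures, and timeouts.
        return False


# Request volume quoted on the slide: students x usernames x sites.
total_requests = 2016 * 6 * 25
print(total_requests)  # 302400, i.e. about 302k
```

Keeping the templates in a single mapping is one way to meet the extensibility requirement: adding a site means adding one entry.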
  • 16. Google's Social Graph API: Run existing profile URLs through Google Social Graph to find "me" links.
  • 17. "Me" Links: "me" links are links in Friend of a Friend (FOAF) and XHTML Friends Network (XFN) markup that identify the same person. For example, a me link from my CS profile page to Twitter:
        <html>
          <head>
            <title>Carlton Northern's CS Home Page</title>
          </head>
          <body>
            stuff here ...
            <a href="http://twitter.com/carltonnorthern" rel="me">My Twitter</a>
          </body>
        </html>
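For pages fetched directly rather than through a graph API, XFN-style me links can be pulled out with the standard-library HTML parser. This is only a sketch: the class and function names are hypothetical, and FOAF documents, which are RDF rather than HTML, are not handled.

```python
from html.parser import HTMLParser


class MeLinkParser(HTMLParser):
    """Collect href values of <a> and <link> tags whose rel includes "me"."""

    def __init__(self):
        super().__init__()
        self.me_links = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attributes = dict(attrs)
        # rel is a space-separated token list, e.g. rel="me nofollow".
        rel_tokens = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        if "me" in rel_tokens and href:
            self.me_links.append(href)


def extract_me_links(html_text):
    """Return every me-link target found in an HTML document."""
    parser = MeLinkParser()
    parser.feed(html_text)
    return parser.me_links


page = """
<html>
<head><title>Carlton Northern's CS Home Page</title></head>
<body>
<a href="http://twitter.com/carltonnorthern" rel="me">My Twitter</a>
<a href="http://www.cs.odu.edu/">Department</a>
</body>
</html>
"""
print(extract_me_links(page))
```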
  • 18. Rapportive: Rapportive is a contact relationship management (CRM) tool that sits on top of Gmail. It uses AJAX and JSON to serve content to its Gmail widget. We mined .forward files on the CS departmental server and found only 24 email addresses for the 2016 students. We ran CS and non-CS email addresses through Rapportive's not-so-public API to access its results. This produced 15.9% of our truth set profile results, with only 1.6% being unique to Rapportive.
  • 19. Google and Yahoo: Query Google and Yahoo using their respective APIs, e.g. "carlton northern" AND norfolk; "carlton northern" AND "computer science"; "carlton northern" AND "old dominion"; "carlton northern" site:http://www.facebook.com. GeoNames could be used to derive nearby cities to form search queries automatically. The same could be done with WordNet to derive profession or education terms.
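The query patterns above could be assembled along these lines. The helper names `quote` and `build_queries` are hypothetical, and the lists of context terms and sites are supplied by hand here rather than derived from GeoNames or WordNet.

```python
def quote(term):
    """Wrap multi-word terms in double quotes so they are searched as phrases."""
    return f'"{term}"' if " " in term else term


def build_queries(full_name, context_terms, sites):
    """Build search strings in the style shown on the slide."""
    name = f'"{full_name.lower()}"'
    queries = [f"{name} AND {quote(term)}" for term in context_terms]
    queries += [f"{name} site:{site}" for site in sites]
    return queries


queries = build_queries(
    "Carlton Northern",
    ["norfolk", "computer science", "old dominion"],
    ["http://www.facebook.com"],
)
for query in queries:
    print(query)
```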
  • 20. Google and Yahoo: Calls to Google and Yahoo need to be limited because of API restrictions. Google restricts use to about 1,000 requests per hour. Furthermore, the best results appear in positions 1-8 of the result set.
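One simple way to respect such a limit is to space requests evenly and keep only the leading results. The sketch below assumes the roughly 1,000 requests per hour quoted above; `Throttle`, `min_interval_seconds`, and `top_results` are hypothetical names, and the clock and sleep functions are injectable so the logic can be exercised without actually waiting.

```python
import time


def min_interval_seconds(requests_per_hour):
    """Smallest gap between requests that stays within the hourly limit."""
    return 3600.0 / requests_per_hour


class Throttle:
    """Space calls evenly so an hourly request budget is not exceeded."""

    def __init__(self, requests_per_hour, clock=time.monotonic, sleep=time.sleep):
        self.interval = min_interval_seconds(requests_per_hour)
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = None

    def wait(self):
        """Block until the next request is allowed, then reserve the slot."""
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.interval


def top_results(results, limit=8):
    """Keep only the leading results, where the best matches tend to be."""
    return results[:limit]


print(min_interval_seconds(1000))  # seconds between requests at 1,000 per hour
```

Calling `Throttle(1000).wait()` before each API request would hold the rate at one request every 3.6 seconds.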
  • 24. Point System: A simple point system based on keyword matching, link community structure analysis, and extraction of semantic and feature data from profiles. A profile with 11 points or more is considered validated. Scores can range from negative totals to about 50.
  • 25. Keyword Matching: 1 point for weak indicators: one-word terms like "programmer" or "student". 4 points for stronger indicators: terms of two or more words, like "computer science" or "software engineer". 7 points for very strong indicators: locations, e.g. "norfolk" or "portsmouth" (localized advertisements can be problematic). 2 points for first name or given name. 4 points for last name.
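The keyword rules can be illustrated with a small scoring function. This is a sketch under stated assumptions: the term lists contain only the examples from the slide, each distinct term is counted once per page (the slides do not say whether repeats count), and `keyword_score` is a hypothetical name.

```python
import re

WEAK_TERMS = {"programmer", "student"}                    # 1 point each
STRONG_TERMS = {"computer science", "software engineer"}  # 4 points each
LOCATION_TERMS = {"norfolk", "portsmouth"}                # 7 points each


def keyword_score(page_text, first_name, last_name):
    """Score a page by the keyword rules; each distinct term counts once."""
    text = page_text.lower()

    def contains(term):
        # Whole-word match so a term does not fire inside a longer word.
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        return re.search(pattern, text) is not None

    score = 0
    score += sum(1 for term in WEAK_TERMS if contains(term))
    score += sum(4 for term in STRONG_TERMS if contains(term))
    score += sum(7 for term in LOCATION_TERMS if contains(term))
    if contains(first_name):
        score += 2
    if contains(last_name):
        score += 4
    return score


sample = "Carlton Northern is a computer science student in Norfolk."
print(keyword_score(sample, "Carlton", "Northern"))  # 1 + 4 + 7 + 2 + 4 = 18
```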
  • 26. Name Matching: Facebook, LinkedIn, Google, and Twitter use real names, so: 2 points for a first name or diminutive/nickname; 5 points for a last name; subtract 21 points if neither a first name (or nickname/diminutive) nor a last name is found. Watch out for diminutives/nicknames! http://code.google.com/p/nickname-and-diminutive-names-lookup/ LinkedIn provides a location, so add or subtract 7 points depending on whether it matches.
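A sketch of the name and location rules follows. Several details are assumptions: the 21-point penalty is applied only when neither the first name (or a nickname) nor the last name appears, the `NICKNAMES` table is a tiny hypothetical stand-in for a real nickname lookup, and the function names are made up for illustration.

```python
# Hypothetical stand-in for a nickname/diminutive lookup table.
NICKNAMES = {"carlton": {"carl"}}


def name_score(display_name, first_name, last_name, nicknames=NICKNAMES):
    """Score a real-name field: +2 first name or nickname, +5 last name.

    Assumption: the 21-point penalty applies only when neither is present.
    """
    tokens = set(display_name.lower().split())
    first = first_name.lower()
    accepted_first = {first} | nicknames.get(first, set())
    first_found = bool(tokens & accepted_first)
    last_found = last_name.lower() in tokens
    if not first_found and not last_found:
        return -21
    return (2 if first_found else 0) + (5 if last_found else 0)


def location_score(profile_location, known_locations):
    """+7 if the profile's location mentions a known location, otherwise -7."""
    text = profile_location.lower()
    matched = any(location.lower() in text for location in known_locations)
    return 7 if matched else -7


print(name_score("Carl Northern", "Carlton", "Northern"))
print(name_score("Bruce Wayne", "Carlton", "Northern"))
print(location_score("Norfolk, Virginia Area", ["norfolk", "hampton roads"]))
```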
  • 27. Link Community Structure Analysis: Retrieve all links in a page and see whether they point to other validated profiles in the data set; if so, assign 5 points. Figure: validated and not-validated profiles; assign 5 points to Michael's Twitter profile.
  • 28. Me Links and Email Matching: 10 points if a profile is found from Rapportive. 10 points if a profile has a me link from an already validated profile. Figure: validated and not-validated profiles; assign 10 points to Carlton's Twitter profile.
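Because both link-based rules depend on which profiles are already validated, they lend themselves to repeated passes. The sketch below is one possible reading, not the study's implementation: the profile labels are hypothetical, each rule is credited at most once per profile, and the 11-point validation threshold is taken from the point system slide.

```python
VALIDATION_THRESHOLD = 11   # validated at 11 points (point system slide)
LINK_POINTS = 5             # page links to an already validated profile
ME_LINK_POINTS = 10         # validated profile has a me link to this page


def propagate_link_points(scores, links, me_links, max_passes=10):
    """Apply the link rules repeatedly until no score changes.

    scores:   profile -> points from the other rules
    links:    profile -> set of profiles its page links to
    me_links: profile -> set of profiles it marks with rel="me"
    """
    scores = dict(scores)
    awarded = set()  # (rule, profile) pairs already credited
    for _ in range(max_passes):
        validated = {p for p, s in scores.items() if s >= VALIDATION_THRESHOLD}
        changed = False
        for profile in scores:
            if profile in validated:
                continue
            if ("link", profile) not in awarded and links.get(profile, set()) & validated:
                scores[profile] += LINK_POINTS
                awarded.add(("link", profile))
                changed = True
            has_me_link = any(profile in me_links.get(v, set()) for v in validated)
            if ("me", profile) not in awarded and has_me_link:
                scores[profile] += ME_LINK_POINTS
                awarded.add(("me", profile))
                changed = True
        if not changed:
            break
    return scores


initial = {"carlton_cs": 20, "carlton_twitter": 4, "michael_twitter": 7, "blog": 0}
links = {"michael_twitter": {"carlton_twitter"}}
me_links = {"carlton_cs": {"carlton_twitter"}}
print(propagate_link_points(initial, links, me_links))
```

In this example the first pass validates `carlton_twitter` through the me link, and only the second pass can then credit `michael_twitter` for linking to it, which is one motivation for running the scoring more than once.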
  • 30. Dataset: 2016 students from our departmental server: 142 graduate, 1874 undergraduate. Generated 9 GB worth of data. Truth set: 20 graduate students and 2 professors from our research group, Web Science and Digital Libraries. Use the information retrieval metrics of precision, recall, and f-measure to assess our truth set.
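For reference, the three metrics can be computed from a set of discovered profiles and a hand-built truth set as sketched below. The balanced F1 form is assumed, since the slides do not say which f-measure variant or averaging scheme was used, and the profile identifiers are made up.

```python
def precision_recall_f(found, truth):
    """Return (precision, recall, F1) for discovered vs. true profiles."""
    found, truth = set(found), set(truth)
    true_positives = len(found & truth)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical example: 4 profiles found, 3 of them correct, 6 in the truth set.
found = {"twitter/a", "facebook/a", "flickr/a", "myspace/x"}
truth = {"twitter/a", "facebook/a", "flickr/a", "linkedin/a", "delicious/a", "youtube/a"}
precision, recall, f_measure = precision_recall_f(found, truth)
print(precision, recall, round(f_measure, 3))
```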
  • 31. Truth Set Results Summary
  • 32. Social Media Web Site Results
  • 33. Whole Set Service Graph
  • 35. Truth Set User Graph
  • 36. Whole Set User Graph
  • 38. Whole Set User Graph Without Blogger Links
  • 40. Future Work: Facial recognition. Better link community structure analysis. Perform a quantitative social media digital preservation study. Remove social media sites that produced few or no results (unpopular) and add new ones (foursquare.com)?
  • 41. Potential Impacts/Uses: Open source intelligence gathering ("open source" as in publicly available information). Social media research. Measure the social health of an organization.
  • 42. Conclusions: Completely automated, with the only human interaction being the creation of the search query. Precision 0.863, recall 0.526, f-measure 0.632. The approach uses non-traditional search mechanisms to achieve its goals. Only publicly available information was used.
  • 43. Carlton Northern, carlton.northern@gmail.com, http://carlton-northern.com/