Harnessing the Social Web: The Science of Identity Disambiguation Matthew Rowe and Fabio Ciravegna Organisations, Information and Knowledge Group University of Sheffield, UK Web Science 2010
Outline Problem Dissemination of personal information across the Web Motivation Need for automation Harnessing the Social Web Identity Disambiguation Inference Rules Self-training Evaluation Results Conclusions
Problem A large amount of the information now residing on the World Wide Web is personal information Disseminated voluntarily: homepages, profile pages Or involuntarily: telephone directories, electoral registers The sensitive nature of this information has led to: Identity Theft: act of stealing a person’s identity and reusing it Currently costs the UK economy £1.2 billion http://www.identitytheft.org.uk/faqs.asp Lateral Surveillance: act of watching someone without their knowledge Often performed by employers vetting potential employees And by socialites vetting prospective dates Could affect reputation if detrimental content exists
Motivation To avoid such practices, web users must manually collect web resources which may cite them and then decide which do The latter stage of this process is referred to as disambiguation Decides which web resources are references and produces a unary set of identity web references for a given person However, this practice is Time consuming Expensive Must be repeated often as more and more data is published on the Web Automated disambiguation techniques can replace this manual processing To function effectively however, seed data (background knowledge about a person) is required: Expensive to produce (e.g. filling in an extensive form) Must contain sufficient features describing a person’s identity
Harnessing the Social Web Overcome the problem of producing seed data manually by harnessing the Social Web Social Web platforms such as Facebook, Twitter and MySpace allow web users to build an online persona/identity visible to others Sociological studies have argued for the similarity between online and offline identities (Hart et al, 2008) states that online social networks are merely extensions of offline lives (Ellison et al, 2007) states that Social Web platforms are used to reinforce established offline relationships A user study was conducted to assess the relationship between digital identities constructed on Social Web platforms and their real world equivalent using 50 participants from the University of Sheffield (25 male, 25 female) with a wide age range (18 – 45) Study consisted of three stages 1. Participants listed their real world social network 2. Digital social network was extracted from Facebook for each participant 3. Digital and real world networks were compared Relevance: proportion of digital social network containing strong-tied relationships Coverage: extent to which the real world network is replicated online Results from the user study show Coverage range of 0.5 to 1 with an average of 0.77 Indicating that, on average, 77% of a person’s real world social network is replicated online Average relevance of 0.23 Indicating that, on average, 23% of a person’s digital social network contains strong-tied relationships
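The two measures from the user study can be sketched in a few lines of Python. This is only an illustration: it assumes each network is a set of contact names and that the real world network a participant lists stands in for their strong-tied relationships. The function names and the sample contacts are hypothetical and are not taken from the study.

```python
def coverage(real_world: set[str], digital: set[str]) -> float:
    """Share of the real world network that also appears in the digital one."""
    if not real_world:
        return 0.0
    return len(real_world & digital) / len(real_world)


def relevance(real_world: set[str], digital: set[str]) -> float:
    """Share of the digital network made up of real world (strong-tied) contacts."""
    if not digital:
        return 0.0
    return len(real_world & digital) / len(digital)


# Hypothetical participant: 4 real world contacts, 10 Facebook friends.
real_world = {"alice", "bob", "carol", "dave"}
digital = {"alice", "bob", "carol", "erin", "frank",
           "grace", "heidi", "ivan", "judy", "mallory"}

print(coverage(real_world, digital))   # 3 of 4 real world contacts are online: 0.75
print(relevance(real_world, digital))  # 3 of 10 online contacts are strong ties: 0.3
```

Read this way, the reported averages say that most of a person's real world network can be recovered from the platform (coverage 0.77), while the digital network also contains many weaker ties (relevance 0.23).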
Collecting Seed Data from the Social Web
Collecting Seed Data from the Social Web
<http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#me> rdf:type foaf:Person ;
    foaf:name "Matthew Rowe" ;
    foaf:homepage <www.dcs.shef.ac.uk/~mrowe> ;
    foaf:mbox <m.rowe@dcs.shef.ac.uk> ;
    foaf:based_near <http://www.geonames.org/2638077> ;
    foaf:knows <http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#fabio> ;
    foaf:knows <http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#sam> .
<http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#fabio> rdf:type foaf:Person ;
    foaf:name "Fabio Ciravegna" ;
    foaf:mbox <fabio@dcs.shef.ac.uk> ;
    foaf:homepage <http://www.dcs.shef.ac.uk/~fabio> .
<http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#sam> rdf:type foaf:Person ;
    foaf:name "Sam Chapman" ;
    foaf:mbox <sam@dcs.shef.ac.uk> ;
    foaf:homepage <http://www.dcs.shef.ac.uk/~sam> .
<http://www.geonames.org/2638077> rdf:type geo:Feature ;
    geo:name "Sheffield" ;
    geo:inCountry "UK" .
Identity Disambiguation: Inference Rules Rules provide a means to logically infer conclusions based on the presence of information In the context of identity disambiguation, rules replicate the cognitive process by which a human decides if a web resource refers to a given entity Using background knowledge known about the entity Uses a supervised approach by only using the provided seed data to make decisions Rules are built from the seed data as follows: RDF instances are extracted from the seed data (e.g. an instance of a person or location) A rule is constructed from the information in each instance description Rules are then added to the rule base, which is then applied to a collected set of web resources to disambiguate identity web references If the triple pattern in the antecedent (the if part) of the rule matches the knowledge structure of a web resource then a web reference is inferred
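A rough Python sketch of this idea follows. It is not the authors' rule language or rule base: it assumes triples are plain (subject, property, value) tuples, builds one rule per seed instance from all of that instance's property values, and fires the rule when a single subject in a web resource carries every one of them. The names build_rule and rule_fires and the two sample pages are hypothetical.

```python
Triple = tuple[str, str, str]
Antecedent = frozenset[tuple[str, str]]


def build_rule(instance: str, seed: set[Triple]) -> Antecedent:
    """Collect the (property, value) pairs describing one seed instance."""
    return frozenset((p, o) for (s, p, o) in seed if s == instance)


def rule_fires(antecedent: Antecedent, resource: set[Triple]) -> bool:
    """True if one subject in the resource matches every pair in the antecedent."""
    if not antecedent:
        return False
    subjects = {s for (s, _, _) in resource}
    return any(
        all((s, p, o) in resource for (p, o) in antecedent)
        for s in subjects
    )


# Seed data reduced to two properties to keep the example short.
seed = {
    ("#me", "foaf:name", "Matthew Rowe"),
    ("#me", "foaf:mbox", "m.rowe@dcs.shef.ac.uk"),
}
rule = build_rule("#me", seed)

# Hypothetical web resources, already converted to triples.
page_a = {
    ("_:b0", "foaf:name", "Matthew Rowe"),
    ("_:b0", "foaf:mbox", "m.rowe@dcs.shef.ac.uk"),
    ("_:b0", "foaf:topic_interest", "Semantic Web"),
}
page_b = {
    ("_:b0", "foaf:name", "Matthew Rowe"),
    ("_:b0", "foaf:based_near", "Leeds"),
}

print(rule_fires(rule, page_a))  # True: name and mailbox both match
print(rule_fires(rule, page_b))  # False: the name alone is not enough
```

Because the whole antecedent has to match, a rule of this shape rarely fires on the wrong person, but it also misses any page that mentions only part of the seed description.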
Identity Disambiguation: Inference Rules
Identity Disambiguation: Self-training Self-training provides a semi-supervised approach to disambiguation: 1. Seed data collected from the Social Web provides the positive training data 2. Possible web citations provide the unlabelled data 3. Negative training data is generated using Rocchio classification over the unlabelled data 4. Positive and negative training data is then used to train an initial classifier 5. Classifier is applied to the unlabelled data and labels each example 6. Training sets (positive and negative) are enlarged with the examples from the unlabelled data which exhibit the strongest classification confidences 7. Examples are removed from the unlabelled data, reducing its size 8. Steps 4-7 are repeated until all unlabelled data has been classified Tested 3 different machine learning classifiers: Perceptron, Support Vector Machines and Naïve Bayes RDF models (for both the seed data and the web resources) are converted into machine learning instances RDF instances from the models are used as features for the machine learning instances This permits the variation of distinct feature similarity measures between 3 different RDF graph matching techniques: RDF Entailment: does one graph subsume that of another? Inverse Functional Property Matching: do property values match in distinct graphs where the property is inverse functional? Jaccard similarity (strict graph equivalence): are the graphs identical?
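The loop can be sketched in plain Python under some loud assumptions. Each resource is reduced to a set of feature strings; the "classifier" is a nearest-neighbour score based on Jaccard overlap of those sets, standing in for the Perceptron, SVM and Naïve Bayes models that were actually tested; and the initial negatives are simply the unlabelled examples least similar to the seed, which replaces the Rocchio step. All names and sample data are hypothetical.

```python
def jaccard(a: frozenset, b: frozenset) -> float:
    """Overlap of two feature sets, from 0 (disjoint) to 1 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def score(x, positives, negatives) -> float:
    """Above zero: x is closer to the positive set than to the negative set."""
    return (max(jaccard(x, p) for p in positives)
            - max(jaccard(x, n) for n in negatives))


def self_train(seed, unlabelled, n_neg=1, k=1):
    positives = list(seed)
    # Stand-in for the Rocchio step: least seed-like examples become negatives.
    pool = sorted(unlabelled, key=lambda x: max(jaccard(x, p) for p in positives))
    negatives, pool = pool[:n_neg], pool[n_neg:]
    labels = {x: False for x in negatives}
    while pool:
        # "Train" and classify: score every remaining example.
        scores = {x: score(x, positives, negatives) for x in pool}
        ranked = sorted(pool, key=lambda x: abs(scores[x]), reverse=True)
        confident, pool = ranked[:k], ranked[k:]
        # Enlarge the training sets with the most confident examples.
        for x in confident:
            labels[x] = scores[x] > 0
            (positives if labels[x] else negatives).append(x)
    return labels


seed = [frozenset({"name:Matthew Rowe", "mbox:m.rowe@dcs.shef.ac.uk",
                   "knows:Fabio Ciravegna", "knows:Sam Chapman"})]
page_a = frozenset({"name:Matthew Rowe", "knows:Fabio Ciravegna", "org:Sheffield"})
page_b = frozenset({"name:Matthew Rowe", "org:Sheffield", "topic:Semantic Web"})
page_c = frozenset({"name:Matthew Rowe", "job:plumber", "city:Leeds"})
page_d = frozenset({"name:John Smith", "job:plumber", "city:Leeds"})

labels = self_train(seed, [page_a, page_b, page_c, page_d])
print(labels[page_a], labels[page_b], labels[page_c], labels[page_d])
```

In this toy run page_b shares only a name with the seed, but once page_a has been accepted the feature "org:Sheffield" becomes part of the positive set and page_b is then classified as a reference too.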
Identity Disambiguation: Self-training Intuition is that as the classifier learns from unlabelled data it will learn from previously unknown features Seed data only covers a portion of a person’s identity Will lead to the detection of more web references This is similar to the cognitive process by which humans identify web citations  Only a portion of background knowledge is known at the start As more web references are found, the knowledge of the person is expanded
Evaluation Dataset 50 members of the Semantic Web and Web 2.0 communities Collected seed data from Facebook and Twitter Collected possible web citations from searching WWW and the Semantic Web for each participant Converted each returned resource into an RDF model representation ~346 web resources to be analysed for each participant Evaluation Measures Information retrieval metrics: precision, recall and f-measure Web presence level: proportion of web resources that refer to each participant (e.g. 50 of 350 web resources refer to a given person, then web presence is 14%) Baseline Measure: Human Processing Group of 12 raters manually processed a portion of the dataset for each participant 3 raters performed disambiguation for each participant, then used interrater agreement (Hripcsak & Rothschild, 2005) to calculate IR metrics
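The evaluation measures can be computed as in the sketch below. It assumes the system output and the gold standard are both sets of resource identifiers for one participant; the function name and the sample numbers (other than the 50 of 350 web presence example from the slide) are hypothetical, and the interrater agreement step used for the human baseline is not covered.

```python
def evaluate(predicted: set[str], actual: set[str], total_resources: int) -> dict[str, float]:
    """Precision, recall, F-measure and web presence for one participant."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    denominator = precision + recall
    f_measure = 2 * precision * recall / denominator if denominator else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        # Proportion of all collected resources that really refer to the person.
        "web_presence": len(actual) / total_resources,
    }


# 350 collected resources, 50 of which truly refer to the participant.
actual = {f"r{i}" for i in range(50)}
# Hypothetical system output: 36 correct references and 4 false positives.
predicted = {f"r{i}" for i in range(36)} | {f"r{i}" for i in range(100, 104)}

for name, value in evaluate(predicted, actual, 350).items():
    print(f"{name}: {value:.2f}")
```

With these numbers precision is 36/40, recall is 36/50, and web presence is 50/350, which is the 14% quoted on the slide.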
Evaluation: Inference Rules Yields high levels of precision, but poor recall scores Specific nature of rules leads to poor application to new instances Consistently outperforms humans in terms of precision for all web presence levels At low levels of web presence, where web references are sparse, humans perform poorly This is characterised by a “Needle in a Haystack” problem Inhibited by the lack of web references to learn from
Evaluation: Self-training Perceptron and SVM combined with Entailment outperform humans Due to the high levels of recall achieved by these permutations Entailment leads to a reduction in overfitting to the training data Precision is lowered, but recall is improved significantly Performance also remains consistent for all web presence levels Jaccard uses strict matching between RDF instances, leading to high precision levels Poor recall levels due to overfitting to training data: unable to generalise to new instances
Conclusions Social Web platforms provide a useful source for identity information Significant similarity between real world and digital social networks This can in turn be used to support automated disambiguation techniques Inference Rules, using a supervised strategy, yield high precision levels yet fail to detect a large portion of identity web references Self-training overcomes the limitations of supervised techniques by learning from disambiguation decisions High recall levels demonstrate the effectiveness of such methods to detect a large portion of web references Future work will look to combine these two methods Enlarging the positive training data using Inference Rules, given their high precision levels Then applying Self-training to increase recall levels
Twitter: @mattroweshow Web: http://www.dcs.shef.ac.uk/~mrowe Email: m.rowe@dcs.shef.ac.uk Questions? (Hart et al, 2008) - J. Hart, C. Ridley, F. Taher, C. Sas, and A. Dix. Exploring the Facebook experience: a new approach to usability. In NordiCHI ’08: Proceedings of the 5th Nordic conference on Human-computer interaction, pages 471–474, New York, NY, USA, 2008. ACM. (Ellison et al, 2007) - N. B. Ellison, C. Steinfield, and C. Lampe. The benefits of Facebook friends: Social capital and college students’ use of online social network sites. Journal of Computer-Mediated Communication, 12:1143–1168, 2007. (Hripcsak & Rothschild, 2005) - G. Hripcsak and A. S. Rothschild. Agreement, the f-measure, and reliability in information retrieval. Journal of the American Medical Informatics Association, 12(3):296–298, 2005.

Semantic Technologies: Representing Semantic Data
 

Harnessing the Social Web: The Science of Identity Disambiguation

  • 1. Harnessing the Social Web: The Science of Identity Disambiguation. Matthew Rowe and Fabio Ciravegna, Organisations, Information and Knowledge Group, University of Sheffield, UK. Web Science 2010
  • 2. Outline: Problem (dissemination of personal information across the Web); Motivation (need for automation); Harnessing the Social Web; Identity Disambiguation (Inference Rules, Self-training); Evaluation; Results; Conclusions
  • 3. Problem: A large amount of the information now residing on the World Wide Web is personal information, disseminated voluntarily (homepages, profile pages) or involuntarily (telephone directories, electoral registers). The sensitive nature of this information has led to: Identity Theft, the act of stealing a person’s identity and reusing it, which currently costs the UK economy £1.2 billion (http://www.identitytheft.org.uk/faqs.asp); and Lateral Surveillance, the act of watching someone without their knowledge, often performed by employers vetting potential employees and by socialites vetting prospective dates, which could affect reputation if detrimental content exists.
  • 4. Motivation: To avoid such practices, web users must manually collect web resources which may cite them and then decide which do. The latter stage of this process is referred to as disambiguation: it decides which web resources are references and produces a unary set of identity web references for a given person. However, this practice is time consuming, expensive, and must be repeated often as more and more data is published on the Web. Automated disambiguation techniques can replace this manual processing. To function effectively, however, seed data (background knowledge about a person) is required, which is expensive to produce (e.g. filling in an extensive form) and must contain sufficient features describing a person’s identity.
  • 5. Harnessing the Social Web: Overcome the problem of producing seed data manually by harnessing the Social Web. Social Web platforms such as Facebook, Twitter and MySpace allow web users to build an online persona/identity visible to others. Sociological studies have argued for the similarity between online and offline identities: (Hart et al, 2008) state that online social networks are merely extensions of offline lives; (Ellison et al, 2007) state that Social Web platforms are used to reinforce established offline relationships. A user study was conducted to assess the relationship between digital identities constructed on Social Web platforms and their real-world equivalents, using 50 participants from the University of Sheffield (25 male, 25 female) with a wide age range (18 – 45). The study consisted of three stages: 1. Participants listed their real-world social network. 2. The digital social network was extracted from Facebook for each participant. 3. Digital and real-world networks were compared. Two measures were used. Relevance: the proportion of the digital social network made up of strong-tied relationships. Coverage: the extent to which the real-world network is replicated online. Results from the user study show a coverage range of 0.5 to 1 with an average of 0.77, indicating that, on average, 77% of a person’s real-world social network is replicated online; and an average relevance of 0.23, indicating that, on average, 23% of a person’s digital social network consists of strong-tied relationships.
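
The two measures from the user study can be sketched as simple set ratios. This is a minimal illustration, not the authors' code: it assumes that the network a participant listed by hand stands in for their strong ties, and the function names and sample contacts are hypothetical.

```python
def coverage(real_world: set[str], digital: set[str]) -> float:
    """Share of the real-world network that also appears online."""
    if not real_world:
        raise ValueError("real-world network is empty")
    return len(real_world & digital) / len(real_world)


def relevance(real_world: set[str], digital: set[str]) -> float:
    """Share of the digital network made up of strong ties.

    Assumption: the hand-listed real-world network stands in for strong ties.
    """
    if not digital:
        raise ValueError("digital network is empty")
    return len(real_world & digital) / len(digital)


# Hypothetical participant: 4 real-world contacts, 10 online contacts.
real = {"ann", "bob", "cat", "dan"}
online = {"ann", "bob", "cat", "eve", "fay", "gus", "hal", "ivy", "jon", "kim"}

cov = coverage(real, online)   # 3 of 4 real-world contacts are online
rel = relevance(real, online)  # 3 of 10 online contacts are strong ties
print(f"coverage={cov:.2f} relevance={rel:.2f}")
```

High coverage with low relevance, as in the study's averages, means the online network contains most real-world contacts plus many weaker ties.
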
  • 6. Collecting Seed Data from the Social Web
  • 7. Collecting Seed Data from the Social Web: <http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#me> rdf:type foaf:Person ; foaf:name "Matthew Rowe" ; foaf:homepage <www.dcs.shef.ac.uk/~mrowe> ; foaf:mbox <m.rowe@dcs.shef.ac.uk> ; foaf:based_near <http://www.geonames.org/2638077> ; foaf:knows <http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#fabio> ; foaf:knows <http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#sam> . <http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#fabio> rdf:type foaf:Person ; foaf:name "Fabio Ciravegna" ; foaf:mbox <fabio@dcs.shef.ac.uk> ; foaf:homepage <http://www.dcs.shef.ac.uk/~fabio> . <http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#sam> rdf:type foaf:Person ; foaf:name "Sam Chapman" ; foaf:mbox <sam@dcs.shef.ac.uk> ; foaf:homepage <http://www.dcs.shef.ac.uk/~sam> . <http://www.geonames.org/2638077> rdf:type geo:Feature ; geo:name "Sheffield" ; geo:inCountry "UK" .
  • 8. Identity Disambiguation: Inference Rules. Rules provide a means to logically infer conclusions based on the presence of information. In the context of identity disambiguation, rules replicate the cognitive process by which a human decides if a web resource refers to a given entity, using background knowledge known about the entity. This is a supervised approach, using only the provided seed data to make decisions. Rules are built from the seed data as follows: RDF instances are extracted from the seed data (e.g. an instance of a person or location); a rule is constructed from the information in each instance description; the rules are then added to the rule base and applied to a collected set of web resources to disambiguate identity web references. If the triple pattern in the antecedent (the "if" part) of a rule matches the knowledge structure of a web resource, then a web reference is inferred.
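
The rule-building steps above can be sketched in plain Python. The slide does not spell out the rule language, so this is a minimal sketch under stated assumptions: triples are plain `(subject, predicate, object)` tuples, one rule is built per seed instance, and a rule fires only when a single subject in the web resource carries every `(predicate, object)` pair of the antecedent. The names `build_rules`, `rule_fires` and `is_web_reference` are hypothetical.

```python
from collections import defaultdict

Triple = tuple[str, str, str]


def group_by_subject(triples: list[Triple]) -> dict[str, set[tuple[str, str]]]:
    """Collect the (predicate, object) pairs attached to each subject."""
    grouped: dict[str, set[tuple[str, str]]] = defaultdict(set)
    for s, p, o in triples:
        grouped[s].add((p, o))
    return grouped


def build_rules(seed: list[Triple]) -> dict[str, frozenset[tuple[str, str]]]:
    """One rule per seed instance; its antecedent is the instance description."""
    return {s: frozenset(po) for s, po in group_by_subject(seed).items()}


def rule_fires(antecedent: frozenset[tuple[str, str]], resource: list[Triple]) -> bool:
    """True if some subject in the resource matches the whole antecedent."""
    return any(antecedent <= po for po in group_by_subject(resource).values())


def is_web_reference(rules: dict, resource: list[Triple]) -> bool:
    """A web reference is inferred as soon as any rule in the rule base fires."""
    return any(rule_fires(a, resource) for a in rules.values())


seed = [
    ("#me", "foaf:name", "Matthew Rowe"),
    ("#me", "foaf:homepage", "www.dcs.shef.ac.uk/~mrowe"),
]
page_full = [
    ("_:b1", "foaf:name", "Matthew Rowe"),
    ("_:b1", "foaf:homepage", "www.dcs.shef.ac.uk/~mrowe"),
    ("_:b1", "foaf:mbox", "m.rowe@dcs.shef.ac.uk"),
]
page_name_only = [("_:b2", "foaf:name", "Matthew Rowe")]

rules = build_rules(seed)
print(is_web_reference(rules, page_full))       # every antecedent pair is present
print(is_web_reference(rules, page_name_only))  # homepage is missing, so no match
```

Requiring the full antecedent to match is a deliberately strict choice: it favours precision, and a page that mentions only part of the seed description is not matched.
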
  • 10. Identity Disambiguation: Self-training. Self-training provides a semi-supervised approach to disambiguation: 1. Seed data collected from the Social Web provides the positive training data. 2. Possible web citations provide the unlabelled data. 3. Negative training data is generated using Rocchio classification over the unlabelled data. 4. Positive and negative training data is then used to train an initial classifier. 5. The classifier is applied to the unlabelled data and labels each example. 6. The training sets (positive and negative) are enlarged with the examples from the unlabelled data which exhibit the strongest classification confidences. 7. These examples are removed from the unlabelled data, reducing its size. 8. Steps 4-7 are repeated until all unlabelled data has been classified. Three different machine learning classifiers were tested: Perceptron, Support Vector Machines and Naïve Bayes. RDF models (for both the seed data and the web resources) are converted into machine learning instances, with RDF instances from the models used as features for the machine learning instances. This permits the feature similarity measure to be varied between 3 different RDF graph matching techniques. RDF Entailment: does one graph subsume another? Inverse Functional Property Matching: do property values match in distinct graphs where the property is inverse functional? Jaccard similarity (strict graph equivalence): are the graphs identical?
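
The self-training loop described in this slide can be sketched as follows. This is an illustration under loud assumptions, not the evaluated system: instances are plain feature sets, the classifier is a stand-in that compares mean Jaccard overlap with the positive and negative sets (not Perceptron, SVM or Naïve Bayes), and the initial negatives are simply the unlabelled items least similar to the seed (a crude substitute for Rocchio classification). All names and sample features are hypothetical.

```python
def jaccard(a: frozenset, b: frozenset) -> float:
    """Overlap of two feature sets, from 0 (disjoint) to 1 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_similarity(x: frozenset, group: list) -> float:
    return sum(jaccard(x, g) for g in group) / len(group) if group else 0.0


def score(x: frozenset, pos: list, neg: list) -> float:
    """Stand-in classifier: above 0 means x resembles the positives more."""
    return mean_similarity(x, pos) - mean_similarity(x, neg)


def self_train(seed_pos: list, unlabelled: list, n_neg: int = 1, batch: int = 1) -> dict:
    pos, pool = list(seed_pos), list(unlabelled)
    # Initial negatives (assumed substitute for Rocchio): least seed-like items.
    pool.sort(key=lambda x: max(jaccard(x, p) for p in pos))
    neg, pool = pool[:n_neg], pool[n_neg:]
    labels = {x: False for x in neg}
    while pool:
        # "Retraining" here is just rescoring against the enlarged sets.
        scored = [(score(x, pos, neg), x) for x in pool]
        scored.sort(key=lambda t: abs(t[0]), reverse=True)
        # Move the most confident examples into the training sets.
        for s, x in scored[:batch]:
            labels[x] = s > 0
            (pos if s > 0 else neg).append(x)
            pool.remove(x)
    return labels


seed = [frozenset({"name:Matthew Rowe", "homepage:shef", "knows:Fabio"})]
a = frozenset({"name:Matthew Rowe", "knows:Fabio", "topic:semantic web"})
b = frozenset({"name:Matthew Rowe", "topic:semantic web", "knows:Sam"})
c = frozenset({"name:Matthew Rowe", "team:rugby", "city:Leeds"})
d = frozenset({"name:M Rowe", "team:rugby", "city:Leeds"})

labels = self_train(seed, [a, b, c, d])
for page, label in labels.items():
    print(sorted(page), label)
```

In this toy run, page `b` shares only the name with the seed, but it is accepted after page `a` joins the positives and contributes the feature `topic:semantic web`. Page `c` shares the same name yet is rejected because it overlaps more with the negative example.
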
  • 11. Identity Disambiguation: Self-training. The intuition is that as the classifier learns from unlabelled data it will learn from previously unknown features. Seed data only covers a portion of a person’s identity, so this will lead to the detection of more web references. This is similar to the cognitive process by which humans identify web citations: only a portion of background knowledge is known at the start, and as more web references are found, the knowledge of the person is expanded.
  • 12. Evaluation. Dataset: 50 members of the Semantic Web and Web 2.0 communities. Collected seed data from Facebook and Twitter. Collected possible web citations by searching the WWW and the Semantic Web for each participant. Converted each returned resource into an RDF model representation, giving ~346 web resources to be analysed for each participant. Evaluation Measures: information retrieval metrics (precision, recall and f-measure) and web presence level, the proportion of web resources that refer to each participant (e.g. if 50 of 350 web resources refer to a given person, then web presence is 14%). Baseline Measure: Human Processing. A group of 12 raters manually processed a portion of the dataset for each participant; 3 raters performed disambiguation for each participant, and interrater agreement (Hripcsak & Rothschild, 2005) was then used to calculate the IR metrics.
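
The evaluation measures named above can be computed as in the following sketch. It assumes the f-measure is the balanced F1 (the slide does not state the weighting), and the function names and page identifiers are hypothetical.

```python
def precision_recall_f1(predicted: set[str], actual: set[str]) -> tuple[float, float, float]:
    """Standard IR metrics over predicted vs. true web references."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    # Balanced F1 is an assumption; the slide only says "f-measure".
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def web_presence(n_references: int, n_resources: int) -> float:
    """Proportion of collected web resources that refer to the person."""
    if n_resources <= 0:
        raise ValueError("n_resources must be positive")
    return n_references / n_resources


# Hypothetical output of a disambiguation method for one participant.
predicted = {"p1", "p2", "p3", "p4"}
actual = {"p1", "p2", "p5"}
p, r, f = precision_recall_f1(predicted, actual)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")

# The slide's example: 50 of 350 resources refer to the person.
presence = web_presence(50, 350)
print(f"web presence={presence:.0%}")
```
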
  • 13. Evaluation: Inference Rules. Yields high levels of precision, but poor recall scores: the specific nature of rules leads to poor application to new instances. Consistently outperforms humans in terms of precision for all web presence levels. At low levels of web presence, where web references are sparse, humans perform poorly. This is characterised as a “Needle in a Haystack” problem: humans are inhibited by the lack of web references to learn from.
  • 14. Evaluation: Self-training. Perceptron and SVM combined with Entailment outperform humans, due to the large levels of recall achieved by these permutations. Entailment leads to a reduction in overfitting to the training data: precision is lowered, but recall is improved significantly. Performance also remains consistent for all web presence levels. Jaccard uses strict matching between RDF instances, leading to high precision levels but poor recall levels: it overfits to the training data and is unable to generalise to new instances.
  • 15. Conclusions. Social Web platforms provide a useful source of identity information: there is significant similarity between real-world and digital social networks, which can in turn be used to support automated disambiguation techniques. Inference Rules, using a supervised strategy, yield high precision levels yet fail to detect a large portion of identity web references. Self-training overcomes the limitations of supervised techniques by learning from disambiguation decisions; high recall levels demonstrate the effectiveness of such methods in detecting a large portion of web references. Future work will look to combine these two methods: enlarging the positive training data using Inference Rules, given their high precision levels, then applying Self-training to increase recall levels.
  • 16. Twitter: @mattroweshow Web: http://www.dcs.shef.ac.uk/~mrowe Email: m.rowe@dcs.shef.ac.uk Questions? (Hart et al, 2008) - J. Hart, C. Ridley, F. Taher, C. Sas, and A. Dix. Exploring the facebook experience: a new approach to usability. In NordiCHI ’08: Proceedings of the 5th Nordic conference on Human-computer interaction, pages 471–474, New York, NY, USA, 2008. ACM. (Ellison et al, 2007) - N. B. Ellison, C. Steinfield, and C. Lampe. The benefits of facebook friends: Social capital and college students’ use of online social network sites. Journal of Computer Mediated Communication, 12:1143–1168, 2007. (Hripcsak & Rothschild, 2005) - G. Hripcsak and A. S. Rothschild. Agreement, the f-measure, and reliability in information retrieval. Journal of American Medical Informatics Association, 12(3):296–298, 2005.