Disambiguating Identity Web References using Social Data Matthew Rowe Organisations, Information and Knowledge Group Department of Computer Science University of Sheffield
Outline Problem Setting Research Questions Claims of the Thesis State of the Art Requirements for Disambiguation and Seed Data Disambiguating Identity Web References Leveraging Seed Data from the Social Web Generating Metadata Models Disambiguation Techniques Evaluation Conclusions Dissemination and Impact
Personal Information on the Web. Personal information on the Web is disseminated voluntarily and involuntarily. Increase in personal information: identity theft; lateral surveillance. Web users must discover their identity web references, a 2-stage process: finding, then disambiguating. Disambiguation = reduction of web reference ambiguity. My thesis addresses disambiguation.
Ambiguity!
Matthew Rowe: Composer
Matthew Rowe: Cyclist
Matthew Rowe: Gardener
Matthew Rowe: Song Writer
Matthew Rowe: PhD Student
Problem Setting. Performing disambiguation manually: time consuming; laborious; must handle masses of information; repeated often; the Web keeps changing. Solution = automated techniques: alleviate the need for humans; need background knowledge (Who am I searching for? What makes them unique?).
Research Questions. How can identity web references be disambiguated automatically? 1. Alleviate human processing 2. Supervision 3. Seed Data 4. Interpretation: How can automated techniques interpret information?
State of the Art. Disambiguation techniques are divisible into 2 types. Seeded techniques, e.g. [Bekkerman and McCallum, 2005], commercial services. Pros: disambiguate web references for a single person. Cons: require seed data; no explanation of how seed data is acquired. Unseeded techniques, e.g. [Song et al, 2007]. Pros: require no background knowledge. Cons: group web references into clusters; need to choose the correct cluster.
Requirements. Requirements for Seeded Disambiguation: bootstrap the disambiguation process with minimal supervision; achieve disambiguation accuracy comparable to human processing; cope with web resources not containing seed data features; disambiguation must be effective for all individuals. Requirements for Seed Data: produce seed data with minimal cost; generate reliable seed data.
Disambiguating Identity Web References
Harnessing the Social Web WWW has evolved into a web of participation Digital identity is important on the Social Web Digital identity is fragmented across the Social Web Data Portability from Social Web platforms is limited http://www.economist.com/business/displaystory.cfm?story_id=10880936
Data found on Social Web platforms is representative of real identity information
User Study. Data found on Social Web platforms is representative of real identity information. 50 participants from the University of Sheffield. Consisted of 3 stages; each participant: (1) listed their real-world social network, (2) extracted their digital social network, (3) compared the networks. Relevance: 0.23. Coverage: 0.77. Updates previous findings [Subrahmanyam et al, 2008]. M Rowe. The Credibility of Digital Identity Information on the Social Web: A User Study. In proceedings of 4th Workshop on Information Credibility on the Web, World Wide Web Conference 2010. Raleigh, USA. (2010)
Disambiguating Identity Web References
Leveraging Seed Data from the Social Web. 3. Seed Data
Leveraging Seed Data from the Social Web Link things together!
Leveraging Seed Data from the Social Web. Blocking step; compare values of Inverse Functional Properties; compare Geo URIs; compare Geo data. M Rowe. Interlinking Distributed Social Graphs. In proceedings of Linked Data on the Web Workshop, World Wide Web Conference, Madrid, Spain. (2009)
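The comparison of inverse functional property values can be sketched as follows. This is only an illustration under stated assumptions, not the procedure from the cited paper: the profile dictionaries, the platform ids, the `IFPS` tuple and the functions `block_by_name`, `same_person` and `interlink` are all hypothetical, and the blocking step is assumed here to group profiles by normalised name. The only fact relied on is that FOAF declares `foaf:homepage` and `foaf:mbox_sha1sum` as inverse functional.

```python
from collections import defaultdict

# FOAF properties declared inverse functional: a shared value implies the same person.
IFPS = ("foaf:mbox_sha1sum", "foaf:homepage")


def block_by_name(profiles):
    """Blocking step (assumed): only profiles with the same normalised name are compared."""
    blocks = defaultdict(list)
    for profile in profiles:
        blocks[profile["foaf:name"].strip().lower()].append(profile)
    return blocks


def same_person(a, b):
    """True when both profiles share a value for any inverse functional property."""
    return any(a.get(k) is not None and a.get(k) == b.get(k) for k in IFPS)


def interlink(profiles):
    """Return pairs of profile ids judged to describe the same person."""
    links = []
    for block in block_by_name(profiles).values():
        for i, a in enumerate(block):
            for b in block[i + 1:]:
                if same_person(a, b):
                    links.append((a["id"], b["id"]))
    return links


# Hypothetical profiles exported from three platforms.
profiles = [
    {"id": "platformA:1", "foaf:name": "Alice Example",
     "foaf:homepage": "http://example.org/alice"},
    {"id": "platformB:7", "foaf:name": "alice example",
     "foaf:homepage": "http://example.org/alice"},
    {"id": "platformC:3", "foaf:name": "Alice Example",
     "foaf:homepage": "http://example.org/other"},
]
links = interlink(profiles)
```

Blocking keeps the number of pairwise comparisons small; the Geo URI and Geo data comparisons listed on the slide are not covered by this sketch.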
Leveraging Seed Data from the Social Web Allows remote resource information to change Automated techniques: Follow the links Retrieve the instance information
Disambiguating Identity Web References
Generating Metadata Models. Input to disambiguation techniques is a set of web resources. Web resources come in many flavours: data models; XHTML documents containing embedded semantics; HTML documents. 4. Interpretation: How can automated techniques interpret information? Solution = Semantic Web technologies! Convert web resources to RDF. Metadata descriptions = ontology concepts. Information is consistent and interpretable.
Generating RDF Models from XHTML Documents http://events.linkeddata.org/ldow2009/
Generating RDF Models from XHTML Documents
Generating RDF Models from HTML Documents. Rise in use of lowercase semantics! However, only 2.6% of web documents contain semantics [Mika et al, 2009]. Majority of the web is HTML: bad for machines. Must extract person information, then build an RDF model. Person information is structured for legibility and for segmentation, i.e. logical distinction between elements.
Generating RDF Models from HTML Documents
Generating RDF Models from HTML Documents
Need a Document Object Model
Therefore Tidy it!
1 window = info about 1 person
Get XPath expression to the window
E.g. name, email, www, location
Train model parameters: transition probabilities, emission probabilities, start probabilities
Use Viterbi algorithm to label tokens with states
Returns most likely state sequence
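The labelling step can be illustrated with a small Viterbi decoder. This is a sketch under loud assumptions: the state names come from the slide, but the observation classes in `observe`, the probability tables `START`, `TRANS` and `EMIT`, and the example tokens are hand-set and hypothetical, whereas the deck describes training these parameters from data.

```python
import math

STATES = ("name", "email", "www", "location")


def observe(token):
    """Map a token to a coarse observation class (hypothetical feature)."""
    if "@" in token:
        return "has_at"
    if token.startswith(("http://", "www.")):
        return "url_like"
    return "word"


# Hand-set toy parameters; in the deck these are learned, not chosen.
START = {"name": 0.7, "email": 0.1, "www": 0.1, "location": 0.1}
TRANS = {
    "name": {"name": 0.5, "email": 0.3, "www": 0.1, "location": 0.1},
    "email": {"name": 0.1, "email": 0.1, "www": 0.5, "location": 0.3},
    "www": {"name": 0.1, "email": 0.1, "www": 0.1, "location": 0.7},
    "location": {"name": 0.2, "email": 0.1, "www": 0.1, "location": 0.6},
}
EMIT = {
    "name": {"word": 0.98, "has_at": 0.01, "url_like": 0.01},
    "email": {"word": 0.01, "has_at": 0.98, "url_like": 0.01},
    "www": {"word": 0.01, "has_at": 0.01, "url_like": 0.98},
    "location": {"word": 0.98, "has_at": 0.01, "url_like": 0.01},
}


def viterbi(tokens):
    """Return the most likely state sequence for the tokens (log-space Viterbi)."""
    obs = [observe(t) for t in tokens]
    # best[s] = (log-probability of the best path ending in s, that path)
    best = {s: (math.log(START[s]) + math.log(EMIT[s][obs[0]]), [s]) for s in STATES}
    for o in obs[1:]:
        nxt = {}
        for s in STATES:
            score, path = max(
                ((best[p][0] + math.log(TRANS[p][s]), best[p][1]) for p in STATES),
                key=lambda pair: pair[0],
            )
            nxt[s] = (score + math.log(EMIT[s][o]), path + [s])
        best = nxt
    return max(best.values(), key=lambda pair: pair[0])[1]


tokens = ["Matthew", "Rowe", "m.rowe@example.org", "http://example.org/~mrowe", "Sheffield"]
labels = viterbi(tokens)
# labels == ["name", "name", "email", "www", "location"]
```

Working in log space avoids numeric underflow on longer token windows.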
Disambiguating Identity Web References
Disambiguation 1: Inference Rules
1. Extract instances from Seed Data
2. For each instance, build a rule:
Add triples to the rule
Create a new rule if a triple's predicate is Inverse Functional
3. Apply the rules to the web resources
Disambiguation 1: Inference Rules
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
CONSTRUCT { <http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#me> foaf:page ?url }
WHERE {
  <http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#me> foaf:name ?n .
  ?url foaf:topic ?p .
  ?p foaf:name ?n .
  <http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#me> foaf:knows ?q .
  ?q foaf:name ?m .
  ?url foaf:topic ?r .
  ?r foaf:name ?m
}
Disambiguation 1: Inference Rules
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
CONSTRUCT { <http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#me> foaf:page ?url }
WHERE {
  <http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#me> foaf:name ?n .
  ?url foaf:topic ?p .
  ?p foaf:name ?n .
  <http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#me> foaf:knows ?q .
  ?q foaf:homepage ?h .
  ?url foaf:topic ?r .
  ?r foaf:homepage ?h
}
Disambiguation 1: Inference Rules
Advantages:
Applies graph patterns
Disadvantages:
Strict matching: lack of generalisation
M Rowe. Inferring Web Citations using Social Data and SPARQL Rules. In proceedings of Linking of User Profiles and Applications in the Social Semantic Web, Extended Semantic Web Conference 2010. Heraklion, Crete. (2010)
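One reading of the rule-building steps is sketched below: predicates that are not inverse functional are added to a single rule, and each inverse functional predicate gets a rule of its own. This is an interpretation for illustration, not the implementation from the cited paper; the functions `rule_for` and `build_rules`, the `IFPS` set and the variable names `?v0`, `?v1` are hypothetical. The query shape follows the CONSTRUCT rules shown on the slides.

```python
FOAF = "http://xmlns.com/foaf/0.1/"

# FOAF properties declared inverse functional (subset, used for illustration).
IFPS = {"foaf:homepage", "foaf:mbox", "foaf:mbox_sha1sum"}


def rule_for(person_uri, predicates):
    """Build one CONSTRUCT rule that matches a page mentioning the person's
    name and a known contact who agrees on every given predicate."""
    me = f"<{person_uri}>"
    lines = [
        f"PREFIX foaf: <{FOAF}>",
        f"CONSTRUCT {{ {me} foaf:page ?url }}",
        "WHERE {",
        f"  {me} foaf:name ?n .",
        "  ?url foaf:topic ?p .",
        "  ?p foaf:name ?n .",
        f"  {me} foaf:knows ?q .",
        "  ?url foaf:topic ?r .",
    ]
    for i, predicate in enumerate(predicates):
        # The seed-data contact ?q and the page topic ?r must share the value.
        lines.append(f"  ?q {predicate} ?v{i} .")
        lines.append(f"  ?r {predicate} ?v{i} .")
    lines.append("}")
    return "\n".join(lines)


def build_rules(person_uri, contact_predicates):
    """Group non-inverse-functional predicates into one rule and emit a
    separate rule per inverse functional predicate."""
    shared = [p for p in contact_predicates if p not in IFPS]
    rules = [rule_for(person_uri, shared)] if shared else []
    rules += [rule_for(person_uri, [p]) for p in contact_predicates if p in IFPS]
    return rules


rules = build_rules(
    "http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#me",
    ["foaf:name", "foaf:homepage"],
)
```

Each generated string could then be run against the RDF models of the web resources by any SPARQL engine; because the patterns require exact value matches, the sketch inherits the strict-matching limitation noted on the slide.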
Disambiguation 2: Random Walks. Seed data and web resources are RDF. RDF has a graph structure: <subject, predicate, object> maps to <source_node, edge, target_node>. Graph-based disambiguation techniques, e.g. [Jiang et al, 2009]: build a graph-space; partition data points in the graph-space. Requires methods to: compile a graph-space; compare nodes; cluster nodes.
Disambiguation 2: Random Walks
Via common resources/literals
Disambiguation: Random Walks
Disambiguation 2: Random Walks
Inhibit transitions through the graph space
Get the component containing the social graph
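Compiling a graph-space from triples and extracting the component that contains the social graph might look like the sketch below. It assumes an undirected graph in which a subject and object are joined whenever a triple links them, so that seed data and web resources connect via common resources or literals; the triples, node names and the functions `build_graph` and `component` are hypothetical.

```python
from collections import defaultdict, deque


def build_graph(triples):
    """Treat every <subject, predicate, object> triple as an undirected edge."""
    adj = defaultdict(set)
    for subject, _predicate, obj in triples:
        adj[subject].add(obj)
        adj[obj].add(subject)
    return adj


def component(adj, start):
    """Breadth-first search for all nodes reachable from the start node."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adj[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


# Hypothetical seed data plus two web resources; only page:1 shares a literal
# ("Alice Example") with the seed data.
triples = [
    ("seed:me", "foaf:knows", "seed:alice"),
    ("seed:alice", "foaf:name", "Alice Example"),
    ("page:1", "foaf:topic", "page:1#p"),
    ("page:1#p", "foaf:name", "Alice Example"),
    ("page:2", "foaf:topic", "page:2#p"),
    ("page:2#p", "foaf:name", "Bob Other"),
]
adj = build_graph(triples)
comp = component(adj, "seed:me")
```

Web resources outside the returned component share nothing with the seed data, so a random walk starting from the social graph could never reach them.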
Disambiguation 2: Random Walks
Commute Time distance
Leave node i, reach node j, return to node i
Optimum Transitions
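Commute time distance can be computed as the expected number of steps to walk from node i to node j plus the expected number to return. The sketch below assumes a simple random walk on a connected, unweighted, undirected graph and solves the hitting-time equations by Gauss-Jordan elimination; the functions `hitting_times` and `commute_time` and the example graph are hypothetical, and the Optimum Transitions measure is not covered.

```python
def hitting_times(adj, target):
    """Expected number of steps for a simple random walk to first reach
    target from every other node (the graph must be connected)."""
    nodes = [n for n in adj if n != target]
    idx = {n: k for k, n in enumerate(nodes)}
    size = len(nodes)
    # Augmented system: h(u) - sum_v P(u, v) * h(v) = 1, with h(target) = 0.
    a = [[0.0] * (size + 1) for _ in range(size)]
    for u in nodes:
        row = a[idx[u]]
        row[idx[u]] += 1.0
        row[size] = 1.0
        for v in adj[u]:
            if v != target:
                row[idx[v]] -= 1.0 / len(adj[u])
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(size):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    h = {n: a[idx[n]][size] / a[idx[n]][idx[n]] for n in nodes}
    h[target] = 0.0
    return h


def commute_time(adj, i, j):
    """Leave node i, reach node j, return to node i."""
    return hitting_times(adj, j)[i] + hitting_times(adj, i)[j]


# Path graph a - b - c.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
distance_ac = commute_time(adj, "a", "c")  # 8.0 up to rounding
distance_ab = commute_time(adj, "a", "b")  # 4.0 up to rounding
```

Nodes that are closely tied to the social graph have short commute times to it, which is what makes the measure usable for comparing nodes before clustering.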

Presiding Officer Training module 2024 lok sabha electionsanshu789521
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfSumit Tiwari
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application ) Sakshi Ghasle
 
MENTAL STATUS EXAMINATION format.docx
MENTAL     STATUS EXAMINATION format.docxMENTAL     STATUS EXAMINATION format.docx
MENTAL STATUS EXAMINATION format.docxPoojaSen20
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17Celine George
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 

Recently uploaded (20)

Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxContemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website App
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha elections
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
9953330565 Low Rate Call Girls In Rohini Delhi NCR
9953330565 Low Rate Call Girls In Rohini  Delhi NCR9953330565 Low Rate Call Girls In Rohini  Delhi NCR
9953330565 Low Rate Call Girls In Rohini Delhi NCR
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application )
 
Staff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSDStaff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSD
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
MENTAL STATUS EXAMINATION format.docx
MENTAL     STATUS EXAMINATION format.docxMENTAL     STATUS EXAMINATION format.docx
MENTAL STATUS EXAMINATION format.docx
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 

Automated Disambiguation of Identity Web References Using Social Data

  • 1. Disambiguating Identity Web References using Social Data Matthew Rowe Organisations, Information and Knowledge Group Department of Computer Science University of Sheffield
  • 2. Outline Problem Setting Research Questions Claims of the Thesis State of the Art Requirements for Disambiguation and Seed Data Disambiguating Identity Web References Leveraging Seed Data from the Social Web Generating Metadata Models Disambiguation Techniques Evaluation Conclusions Dissemination and Impact
  • 3. Personal Information on the Web Personal information on the Web is disseminated: Voluntarily Involuntarily Increase in personal information: Identity Theft Lateral Surveillance Web users must discover their identity web references 2 stage process Finding Disambiguating Disambiguation = reduction of web reference ambiguity My thesis addresses disambiguation
  • 10. Problem Setting Performing disambiguation manually: Time consuming Laborious Handle masses of information Repeated often The Web keeps changing Solution = automated techniques Alleviate the need for humans Need background knowledge Who am I searching for? What makes them unique?
  • 11.
  • 12. State of the Art Disambiguation techniques are divisible into 2 types: Seeded techniques E.g. [Bekkerman and McCallum, 2005], Commercial Services Pros Disambiguate web references for a single person Cons: Require seed data No explanation of how seed data is acquired Unseeded techniques E.g. [Song et al, 2007] Pros Require no background knowledge Cons Groups web references into clusters Need to choose the correct cluster
  • 13. Requirements Requirements for Seeded Disambiguation: Bootstrap the disambiguation process with minimal supervision Achieve disambiguation accuracy comparable to human processing Cope with web resources not containing seed data features Disambiguation must be effective for all individuals Requirements for Seed Data: Produce seed data with minimal cost Generate reliable seed data
  • 15. Harnessing the Social Web WWW has evolved into a web of participation Digital identity is important on the Social Web Digital identity is fragmented across the Social Web Data Portability from Social Web platforms is limited http://www.economist.com/business/displaystory.cfm?story_id=10880936
  • 16. Data found on Social Web platforms is representative of real identity information
  • 17. User Study Data found on Social Web platforms is representative of real identity information 50 participants from the University of Sheffield Consisted of 3 stages, each participant: List real world social network Extract digital social network Compare networks Relevance: 0.23 Coverage: 0.77 Updates previous findings [Subrahmanyam et al, 2008] M Rowe. The Credibility of Digital Identity Information on the Social Web: A User Study. In proceedings of 4th Workshop on Information Credibility on the Web, World Wide Web Conference 2010. Raleigh, USA. (2010)
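The two user-study measures can be sketched as simple set ratios. The slide reports only the resulting values, so the formulas below are an assumption based on the notes' descriptions: the listed real-world network is treated as the strong ties, relevance is the share of the digital network that overlaps with it, and coverage is the share of the real-world network found online. The function name and the sample contacts are hypothetical.

```python
def relevance_and_coverage(real_world, digital):
    """Return (relevance, coverage) for two collections of contact names.

    Assumed definitions (not stated as formulas in the slides):
      relevance = |real ∩ digital| / |digital|
      coverage  = |real ∩ digital| / |real|
    """
    real_world, digital = set(real_world), set(digital)
    overlap = real_world & digital
    relevance = len(overlap) / len(digital) if digital else 0.0
    coverage = len(overlap) / len(real_world) if real_world else 0.0
    return relevance, coverage


# Hypothetical participant: 4 real-world contacts, 10 online contacts.
real = {"alice", "bob", "carol", "dave"}
online = {"alice", "bob", "carol", "erin", "frank",
          "grace", "heidi", "ivan", "judy", "mallory"}
rel, cov = relevance_and_coverage(real, online)
print(f"relevance={rel:.2f} coverage={cov:.2f}")
```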
  • 19.
  • 20. Leveraging Seed Data from the Social Web Link things together!
  • 21.
  • 22. Leveraging Seed Data from the Social Web Allows remote resource information to change Automated techniques: Follow the links Retrieve the instance information
  • 24. Generating Metadata Models Input to disambiguation techniques is a set of web resources Web resources come in many flavours: Data models XHTML documents containing embedded semantics HTML documents 4. Interpretation: How can automated techniques interpret information? Solution = Semantic Web technologies! Convert web resources to RDF Metadata descriptions = ontology concepts Information is Consistent Interpretable
  • 25. Generating RDF Models from XHTML Documents http://events.linkeddata.org/ldow2009/
  • 26. Generating RDF Models from XHTML Documents
  • 27. Generating RDF Models from HTML Documents Rise in use of lowercase semantics! However only 2.6% of web documents contain semantics [Mika et al, 2009] Majority of the web is HTML Bad for machines Must extract person information Then build an RDF model Person information is structured for legibility and for segmentation, i.e. logical distinction between elements
  • 28. Generating RDF Models from HTML Documents
  • 29.
  • 30. Need a Document Object Model
  • 31.
  • 32. 1 window = Info about 1 person
  • 33.
  • 34. E.g. name, email, www, location
  • 35. Train model parameters: Transition probs, emission probs, start probs
  • 36. Use Viterbi algorithm to label tokens with states
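The labelling step can be illustrated with a small Viterbi decoder. This is a minimal sketch, not the thesis implementation: the three states, the hand-set probabilities and the sample tokens are all hypothetical, whereas a real model would learn the start, transition and emission probabilities from training data as slide 35 describes.

```python
import math


def viterbi(tokens, states, start_p, trans_p, emit_p, unseen=1e-6):
    """Return the most likely state sequence for a non-empty token list.

    Works in log-space; `unseen` is an assumed floor for tokens a state
    has never emitted.
    """
    def emit(state, token):
        return math.log(emit_p[state].get(token, unseen))

    # best[t][s] = (log-prob of the best path ending in s at step t, previous state)
    best = [{s: (math.log(start_p[s]) + emit(s, tokens[0]), None) for s in states}]
    for token in tokens[1:]:
        column = {}
        for s in states:
            score, prev = max(
                (best[-1][p][0] + math.log(trans_p[p][s]), p) for p in states
            )
            column[s] = (score + emit(s, token), prev)
        best.append(column)

    # Backtrack from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return list(reversed(path))


# Hypothetical toy model: HTML tags act as delimiters ("other").
states = ["name", "email", "other"]
start_p = {"name": 0.6, "email": 0.1, "other": 0.3}
trans_p = {
    "name": {"name": 0.5, "email": 0.3, "other": 0.2},
    "email": {"name": 0.1, "email": 0.1, "other": 0.8},
    "other": {"name": 0.3, "email": 0.2, "other": 0.5},
}
emit_p = {
    "name": {"Matthew": 0.5, "Rowe": 0.5},
    "email": {"m.rowe@dcs.shef.ac.uk": 1.0},
    "other": {"<td>": 0.5, "</td>": 0.5},
}
tokens = ["<td>", "Matthew", "Rowe", "</td>", "<td>", "m.rowe@dcs.shef.ac.uk", "</td>"]
labels = viterbi(tokens, states, start_p, trans_p, emit_p)
print(list(zip(tokens, labels)))
```

Working in log-space avoids numeric underflow on long context windows, which is why the sketch adds log-probabilities rather than multiplying raw ones.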
  • 37.
  • 39.
  • 40. Add triples to the rule
  • 41. Create a new rule if a triple’s predicate is Inverse Functional 3. Apply the rules to the web resources
  • 53. Create a new rule if a triple’s predicate is Inverse Functional 3. Apply the rules PREFIX foaf: <http://xmlns.com/foaf/0.1/> CONSTRUCT { <http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#me> foaf:page ?url } WHERE { <http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#me> foaf:name ?n . ?url foaf:topic ?p . ?p foaf:name ?n . <http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#me> foaf:knows ?q . ?q foaf:homepage ?h . ?url foaf:topic ?r . ?r foaf:homepage ?h }
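What the CONSTRUCT rule on slide 53 checks can be mimicked over plain Python triples. This is only an approximation under stated assumptions: a real system would run the query with a SPARQL engine over the merged graph, whereas the sketch keeps seed data and page data as separate lists of (subject, predicate, object) tuples. All example.org URIs and the helper names are hypothetical.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
ME = "http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#me"


def objects(triples, subject, predicate):
    """All objects o such that (subject, predicate, o) is in triples."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def rule_fires(seed, page_url, page):
    """True if a page topic shares my name AND a page topic shares a
    homepage with someone I know (the two halves of the WHERE clause)."""
    my_names = objects(seed, ME, FOAF + "name")
    friend_homepages = {
        h
        for q in objects(seed, ME, FOAF + "knows")
        for h in objects(seed, q, FOAF + "homepage")
    }
    topics = objects(page, page_url, FOAF + "topic")
    topic_names = {n for t in topics for n in objects(page, t, FOAF + "name")}
    topic_homepages = {h for t in topics for h in objects(page, t, FOAF + "homepage")}
    return bool(my_names & topic_names) and bool(friend_homepages & topic_homepages)


# Hypothetical seed data: my name plus one known person and their homepage.
seed = [
    (ME, FOAF + "name", "Matthew Rowe"),
    (ME, FOAF + "knows", "http://example.org/people#fc"),
    ("http://example.org/people#fc", FOAF + "homepage", "http://example.org/~fc"),
]
# Hypothetical page that mentions both me and the known person.
url_a = "http://example.org/workshop.html"
page_a = [
    (url_a, FOAF + "topic", "_:p1"),
    ("_:p1", FOAF + "name", "Matthew Rowe"),
    (url_a, FOAF + "topic", "_:p2"),
    ("_:p2", FOAF + "homepage", "http://example.org/~fc"),
]
# Hypothetical page that only shares the name (e.g. the cyclist).
url_b = "http://example.org/cycling.html"
page_b = [
    (url_b, FOAF + "topic", "_:p3"),
    ("_:p3", FOAF + "name", "Matthew Rowe"),
]
print(rule_fires(seed, url_a, page_a), rule_fires(seed, url_b, page_b))
```

The second page illustrates the strictness noted on slide 56: a name match alone never fires the rule, which favours precision over recall.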
  • 54.
  • 55.
  • 56. Strict matching: lack of generalisation. M Rowe. Inferring Web Citations using Social Data and SPARQL Rules. In proceedings of Linking of User Profiles and Applications in the Social Semantic Web, Extended Semantic Web Conference 2010. Heraklion, Crete. (2010)
  • 57. Disambiguation 2: Random Walks Seed data and web resources are RDF RDF has a graph structure: <subject, predicate, object> <source_node, edge, target_node> Graph-based disambiguation techniques: E.g. [Jiang et al, 2009] Build a graph-space Partition data points in the graph-space Requires methods to: Compile a graph-space Compare nodes Cluster nodes
  • 58.
  • 59.
  • 61.
  • 62. Inhibit transitions through the graph space
  • 63.
  • 64.
  • 66. Leave node i : reach node j : return to node i
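The "leave node i, reach node j, return to node i" measure is the commute time of a random walk: the expected steps from i to j plus the expected steps back. The sketch below computes it by iterating the hitting-time equations on a small unweighted, connected graph. It is a simplification under stated assumptions: the slides weight edges by ontological properties, which is not modelled here, and the graph and function names are hypothetical.

```python
def hitting_times(adjacency, target, sweeps=10_000, tol=1e-12):
    """Expected number of steps for a uniform random walk to first reach
    `target` from every node, solved by repeated in-place sweeps over
    h(n) = 1 + mean(h(neighbours)), with h(target) = 0."""
    h = {node: 0.0 for node in adjacency}
    for _ in range(sweeps):
        delta = 0.0
        for node, neighbours in adjacency.items():
            if node == target:
                continue
            new = 1.0 + sum(h[n] for n in neighbours) / len(neighbours)
            delta = max(delta, abs(new - h[node]))
            h[node] = new
        if delta < tol:
            break
    return h


def commute_time(adjacency, i, j):
    """Expected steps to go from i to j and back to i."""
    return hitting_times(adjacency, j)[i] + hitting_times(adjacency, i)[j]


# Hypothetical path graph a - b - c (assumed connected and undirected).
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(round(commute_time(graph, "a", "c"), 6))
```

Nodes joined by many short paths get a small commute time, so the value can serve as a distance when clustering.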
  • 68.
  • 70. Leave node i : reach node j : return to node i
  • 72.
  • 74. Every point is in a cluster
  • 75.
  • 76.
  • 77. Relies on tuning clustering threshold. M Rowe. Applying Semantic Social Graphs to Disambiguate Identity References. In proceedings of European Semantic Web Conference 2009, Heraklion, Crete. (2009)
  • 78. Disambiguation 3: Self-training Classic ML scenario: Lots of unlabelled data Limited labelled data Disambiguating identity web references is just the same! Possible web citations = large Social data = small Semi-supervised learning is a solution Train a classifier Using labelled and unlabelled data! Classification task is binary Does this web resource refer to person X or not?
  • 79. Disambiguation 3: Self-training Positive training data = seed data Generate negative training data: Via Rocchio classification: Build centroid vectors: positive set and negative set Negative set = unlabelled data Compare possible web citations with vectors Choose strongest negatives
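The negative-selection step can be sketched as follows: build one centroid from the positive (seed) vectors and one from all unlabelled vectors, then keep the unlabelled items that sit closest to the negative centroid and furthest from the positive one. This is a simplified reading of Rocchio classification (no alpha/beta weighting), and the term vectors and page names are hypothetical.

```python
import math
from collections import Counter


def centroid(vectors):
    """Average a list of sparse term-weight dicts."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {term: weight / len(vectors) for term, weight in total.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


def strongest_negatives(positives, unlabelled, k):
    """Return up to k unlabelled names that look most negative."""
    pos_c = centroid(positives)
    neg_c = centroid(list(unlabelled.values()))  # negative set = unlabelled data
    margin = {
        name: cosine(vec, neg_c) - cosine(vec, pos_c)
        for name, vec in unlabelled.items()
    }
    ranked = sorted(margin, key=margin.get, reverse=True)
    return [name for name in ranked[:k] if margin[name] > 0]


# Hypothetical seed vectors and candidate web citations.
positives = [
    {"sheffield": 1, "semantic": 1, "web": 1},
    {"sheffield": 1, "phd": 1, "web": 1},
]
unlabelled = {
    "page_phd": {"sheffield": 1, "semantic": 1, "phd": 1},
    "page_cyclist": {"cycling": 1, "race": 1, "team": 1},
    "page_composer": {"orchestra": 1, "conductor": 1, "music": 1},
}
negatives = strongest_negatives(positives, unlabelled, k=2)
print(negatives)
```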
  • 83. Disambiguation 3: Self-training Begin Self-training: 1. Train the Classifier 2. Classify the web resources 3. Rank classifications 4. Enlarge training sets 5. Repeat steps 1-4
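The loop can be sketched in a few lines. The classifier here is a deliberately simple stand-in (nearest labelled example under Jaccard similarity), not the Perceptron, SVM or Naïve Bayes models that were evaluated; the feature sets, the `per_round` and `threshold` parameters and all names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two feature sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def score(features, positives, negatives):
    """Stand-in classifier: > 0 means closer to the positive examples."""
    best_pos = max(jaccard(features, p) for p in positives)
    best_neg = max(jaccard(features, n) for n in negatives)
    return best_pos - best_neg


def self_train(positives, negatives, unlabelled, per_round=1, threshold=0.1):
    positives, negatives = list(positives), list(negatives)
    pool = dict(unlabelled)
    labels = {}
    while pool:
        # Steps 1-2: "train" on the current sets and classify the pool.
        scored = {name: score(f, positives, negatives) for name, f in pool.items()}
        # Step 3: rank classifications by confidence.
        ranked = sorted(scored, key=lambda n: abs(scored[n]), reverse=True)
        confident = [n for n in ranked[:per_round] if abs(scored[n]) >= threshold]
        if not confident:
            break
        # Step 4: enlarge the training sets with the most confident decisions.
        for name in confident:
            features = pool.pop(name)
            labels[name] = scored[name] > 0
            (positives if labels[name] else negatives).append(features)
    # Anything never confidently labelled gets the final model's decision.
    for name, features in pool.items():
        labels[name] = score(features, positives, negatives) > 0
    return labels


# Hypothetical data: one positive seed, one negative, three candidate pages.
seed_pos = [{"sheffield", "semantic web", "ciravegna"}]
seed_neg = [{"cycling", "team sky"}]
candidates = {
    "p1": {"sheffield", "semantic web", "foaf"},
    "p2": {"foaf", "linked data", "rdf"},
    "p3": {"cycling", "tour of britain"},
}
result = self_train(seed_pos, seed_neg, candidates)
print(result)
```

In this toy run `p2` shares nothing with the seed data and is only recognised after `p1` has joined the positive set, which is the benefit self-training aims for. The same mechanism is why early mistakes can reinforce themselves.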
  • 84. Disambiguation 3: Self-training Training/Testing data is RDF Convert to a machine learning dataset Features = RDF instances Vary the feature similarity measure: Jaccard Similarity Inverse Functional Property Matching RDF Entailment Tested three different classifiers: Perceptron Support Vector Machine Naïve Bayes
  • 85. Disambiguation 3: Self-training Advantages Directly learn from disambiguation decisions Utilise abundance of unlabelled data Disadvantages Requires reliable negatives Mistakes can reinforce themselves M Rowe and F Ciravegna. Harnessing the Social Web: The Science of Identity Disambiguation. In proceedings of Web Science Conference 2010. Raleigh, USA. (2010)
  • 86. Evaluation Measures: Precision, Recall, F-Measure Dataset 50 participants from the Semantic Web and Web 2.0 communities ~17300 web resources: 346 web resources for each participant Baselines Baseline 1: Person name as positive classification Baseline 2: Hierarchical Clustering using Person Names [Malin, 2005] Baseline 3: Human Processing
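For reference, the three measures can be computed per participant from the set of web resources a technique accepts and the set that truly refer to the person. This is a minimal sketch using the standard definitions; the resource identifiers are hypothetical.

```python
def precision_recall_f1(predicted, actual):
    """Standard precision, recall and F-measure (F1) over two sets."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical outcome for one participant.
accepted = {"u1", "u2", "u3", "u4"}   # resources labelled as citing the person
relevant = {"u1", "u2", "u5"}         # resources that really cite the person
p, r, f = precision_recall_f1(accepted, relevant)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```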
  • 87. Evaluation: Inference Rules High precision Better than humans Precise graph pattern matching Low recall Rules are strict No room for variability Hard to generalise No learning from disambiguation decisions
  • 88. Evaluation: Random Walks High recall Higher than humans Incorporates unlabelled data into random walks Uses features not in the seed data Precision Lower than humans and rules Ambiguous name literals lead to false positives
  • 89. Evaluation: Self-training High Recall SVM + Entailment classifies 91% of references High F-Measure Higher than humans Perceptron + Entailment and SVM + Entailment
  • 90.
  • 91. Conclusions: Claims Automated disambiguation techniques are able to replace human processing Techniques are comparable to humans Overcome manual processing Data found on Social Web platforms is representative of real identity information 77% of a real world social network is covered online Social data provides the background knowledge required by automated disambiguation techniques Techniques function using social data Biographical and social network enables disambiguation
  • 92. Dissemination and Impact Published 21 peer-reviewed publications Paper in the Journal of Web Semantics (impact: 3.5) Presented work at many international conferences Program committee member for 5 international workshops Invited Expert for the World Wide Web Consortium’s Social Web Incubator Group Listed as one of top 100 visionaries “discussing the future of the web” http://www.semanticweb.com/semanticweb100/ Linked Data service for the DCS http://data.dcs.shef.ac.uk Best Poster at the Extended Semantic Web Conference 2010 Tools widely used by the Semantic Web community FOAF Generator Social Identity Schema Mapping (SISM) Vocabulary
  • 93. Twitter: @mattroweshow Web: http://www.dcs.shef.ac.uk/~mrowe Email: m.rowe@dcs.shef.ac.uk Questions? For a condensed version of my thesis: M Rowe and F Ciravegna. Disambiguating Identity Web References using Web 2.0 Data and Semantics. In Press for special issue on "Web 2.0" in the Journal of Web Semantics. (2010)

Editor's Notes

  1. Voluntary: e.g. personal web pages, blog pages. Involuntary: e.g. publication of electoral registers, people listings/aggregators (123people.co.uk). Automated techniques require background knowledge! Expensive to produce manually (e.g. form filling). Must be accurate. Common problem in Machine Learning! [Yu, 2004] highlights the painstaking methods required to acquire labelled/seed data.
  2. 1,580,000 results returned for my name. I am: a conductor; a cyclist; wrote the song “Wannabe” by the Spice Girls; PhD student. That is only the first page! It gets worse later on: lawyer, surfer, another PhD student.
  3. 1,900,000 results returned for my name. I am: a conductor; a cyclist; wrote the song “Wannabe” by the Spice Girls; PhD student. That is only the first page! It gets worse later on: lawyer, surfer, another PhD student.
  7. From heterogeneous sources!
  8. KnowItAll system [Etzioni et al, 2005]: identifies facts in web pages and populates a knowledge base. DBPedia project [Auer et al, 2008]: extracts information from Wikipedia and builds a large machine-readable knowledge base. Social Web platforms such as Facebook and MySpace: social data, sufficient to support automated techniques.
  9. Seeded techniques: [Bekkerman and McCallum, 2005]. Seed data = the social network of a person. Web pages are collected and clustered based on link structures. Unseeded techniques: [Song et al, 2007]. Aligns person names with web page topics. Cluster pages = generative model built from topic list. Seeded techniques are suited to this thesis’ problem setting: no need to partition web citations into k clusters; handling a large amount of irrelevant information; able to focus on disambiguating web references for a single person; in line with state of the art approaches.
  10. Places requirements on the technique + seed data
  11. My solution to the problem setting. User-centric approach. Semantic Web technologies = consistent interpretation of information.
  12. Web Evolution: Wikipedia = wisdom of the crowd; blogging platforms = web users share thoughts/opinions; Web = a Social Web. Digital Identity: rich functionalities; users build bespoke identities. Digital identity can be divided into 3 tiers. My Identity: persistent identity information (name, date of birth, genealogical relations). Shared Identity: social networks, friend relationships. Abstracted Identity: demographics of usage (e.g. community of practice). ID Fragmentation: MySpace = share/discuss music; LinkedIn = make business connections. Data Portability: info in proprietary formats; hard to link together. Need to solve these issues.
  13. Social Web platforms: maintain offline relationships [Hart et al, 2008]; reinforce existing offline relationships [Ellison et al, 2007].
  14. Social Web platforms: maintain offline relationships [Hart et al, 2008]; reinforce existing offline relationships [Ellison et al, 2007]. Real world network: contains strong-tied relationships [Donath & Boyd, 2004]. Relevance: ratio of strong-tied to weak-tied relationships in the digital social network. Coverage: proportion of the real-world network in the digital social network. Results: Relevance: 23% of a digital social network contains strong-tied relationships. Coverage: 77% of a participant’s real-world social network appears online. Different from findings by [Subrahmanyam et al, 2008]: 49% for coverage (they define it as overlap), due to different demographics.
  16. Can get digital ID info from the Social Web. Representative of real world identities. Need to overcome the data portability issue and identity fragmentation. 1. Export individual social graphs: use Semantic Web technologies (RDF). 2. Interlink social graphs: identify equivalent people; more complete ID profile for use. Approach: 1. Perform a blocking step by only comparing people with the same name. 2. Detect equivalent unique identifiers. 3. Reason over geographical information. Similar to reference reconciliation SOA [Dong et al 2005]: exploiting contextual information. Produces an interlinked social graph using Linked Data principles.
  17. Explain what RDF is! Graph-like model of data. Export individual social graphs from Facebook, Twitter, etc! Overcomes data portability issues!
  18. Want to detect equivalent instances!!
  21. We now have our Seed data! It is in machine readable RDF Using FOAF + Geonames
  22. Some flavours taste better to machines
  23. XHTML document contains embedded semantics. Both human and machine-readable. Well formed structure with lightweight markup for DOM elements, e.g. Microformats, RDF in Attributes (RDFa). XSL Transformations are specified in a document’s header. This allows the XHTML structure to be converted into RDF. GRDDL is used to lift an RDF model from a given document.
  25. Designed for human consumption, not machines. Hard to build a metadata model from. HTML markup controls the arrangement and presentation of information. Formatting provides logical distinctions between pieces of information in a given HTML document.
  26. 1. Tidy HTML: to avoid poorly formed tags. 2. Derive context windows: using a lightweight person name gazetteer; 1 window = information about 1 person; use HTML DOM to identify windows. 3. Extract person information: use a HMM trained for the extraction. Input: tokenized context window. Viterbi algorithm calculates the most likely state sequence using the parameters of the trained HMM. Pros: uses clues in the input; HTML tags act as delimiters to identify the sequence of states. 4. Build a metadata model from the extracted information.
  30. We now have our seed data and a collection of web resources Both are in RDF! Now we can pass them onto the disambiguation techniques 3 techniques were explored: 1. rule-based 2. graph-based 3. semi-supervised machine learning
  31. Data base = populated with the provided seed data. Rule base = rules built from seed data. Rules allow a web citation to be INFERRED based on the presence of information. Seed data = limited in size. Build rules from RDF instances. RDF instance = unique object (e.g. a social network member). General-to-specific strategy inspired by FOIL [Quinlan, 1997]:
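A general-to-specific rule builder in the spirit of FOIL might look like the sketch below. It is only an illustration under strong assumptions: instances are modelled as sets of hypothetical `(property, value)` pairs rather than RDF triples, negative instances are assumed to be available, and a simple "most negatives excluded" count replaces FOIL's information gain. The slides do not state how the thesis scores or terminates specialisation.

```python
def build_rule(positive, negatives):
    """Greedily specialise a rule body until it covers no negative instance.

    positive:  set of (property, value) pairs describing one seed instance
    negatives: list of such sets describing instances the rule must not match
    """
    body = set()                  # most general rule: empty body matches everything
    covered = list(negatives)     # negatives still matched by the current body
    candidates = set(positive)
    while covered and candidates:
        # Pick the condition that excludes the most still-covered negatives.
        best = max(sorted(candidates),
                   key=lambda cond: sum(cond not in neg for neg in covered))
        body.add(best)
        candidates.discard(best)
        covered = [neg for neg in covered if best in neg]
    return body

def matches(body, instance):
    """A rule fires when every condition in its body is present in the instance."""
    return body <= instance

# Hypothetical seed instance and two namesakes (all values invented for illustration).
seed = {("name", "Matthew Rowe"),
        ("affiliation", "University of Sheffield"),
        ("knows", "Alice Example")}
others = [{("name", "Matthew Rowe"), ("occupation", "cyclist")},
          {("name", "Matthew Rowe"), ("affiliation", "University of Sheffield")}]

rule = build_rule(seed, others)
```

Starting from the empty body and adding one condition at a time keeps rules as general as the negatives allow, which matters when seed data is limited in size.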
  37. [Jiang et al, 2009]: graph-space = web pages and their features (e.g. person name, organisation); cluster = maximally-connected cliques. [On & Lee, 2007]: nodes = feature vectors of web pages (based on named entities); edges = weighted based on TF/IDF similarity score; cluster = spectral clustering of normalized cuts. No work has used existing metadata models (e.g. RDF)!
  38. Seed data AND web resources = RDF, which has a graph structure. Build a graph space. Get the strongly connected component, as the graph contains islands of connected nodes. Traverse the graph-space using Random Walks: based on a first-order Markov chain; gives the probability of moving from node i to node j given t steps. Weight edges based on ontological properties of the concepts. Measure distances between nodes: Commute Time, Optimum Transitions. Cluster root nodes based on distance measures.
  42. The adjacency matrix gives the local similarity of nodes in the space. Probability of moving from one node to another given that t steps have been traversed.
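The t-step probability of a first-order Markov chain can be sketched by row-normalising the adjacency matrix and raising it to the power t. The three-node path graph and the uniform weights are assumptions for illustration; the slide's ontology-based edge weights would simply replace the 1s. Every node is assumed to have at least one edge, as is the case inside a connected component.

```python
def transition_matrix(adj):
    """Row-normalise a weighted adjacency matrix into one-step probabilities."""
    return [[w / sum(row) for w in row] for row in adj]

def matmul(a, b):
    """Plain-Python matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def t_step_probabilities(adj, t):
    """Entry [i][j] is the probability of being at node j after t steps from node i."""
    p = transition_matrix(adj)
    n = len(p)
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # identity = 0 steps
    for _ in range(t):
        result = matmul(result, p)
    return result

# Hypothetical graph-space: a path 0 - 1 - 2 with unit edge weights.
ADJ = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]

two_steps = t_step_probabilities(ADJ, 2)
```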
  43. Commute Time: if many paths exist between nodes AND those paths are short in length, THEN Commute Time decreases! Optimum Transitions.
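Commute time is the expected number of steps for a random walk to go from node i to node j and back again. The sketch below estimates it from hitting times computed by fixed-point iteration, which is an assumption made for brevity; the slides do not say how the thesis computes it (a closed form via the pseudo-inverse of the graph Laplacian is a common alternative). The two toy graphs are hypothetical.

```python
def transition_matrix(adj):
    """Row-normalise a weighted adjacency matrix into one-step probabilities."""
    return [[w / sum(row) for w in row] for row in adj]

def hitting_times(p, target, iterations=2000):
    """Expected steps to first reach `target` from every node (fixed-point iteration)."""
    n = len(p)
    h = [0.0] * n
    for _ in range(iterations):
        h = [0.0 if i == target
             else 1.0 + sum(p[i][k] * h[k] for k in range(n))
             for i in range(n)]
    return h

def commute_time(adj, i, j):
    """Expected round-trip length i -> j -> i on a connected graph."""
    p = transition_matrix(adj)
    return hitting_times(p, j)[i] + hitting_times(p, i)[j]

# Path 0 - 1 - 2: a single, two-step route between nodes 0 and 2.
PATH = [[0, 1, 0],
        [1, 0, 1],
        [0, 1, 0]]
# Triangle: the same graph plus a direct edge 0 - 2 (more and shorter paths).
TRIANGLE = [[0, 1, 1],
            [1, 0, 1],
            [1, 1, 0]]

ct_path = commute_time(PATH, 0, 2)
ct_triangle = commute_time(TRIANGLE, 0, 2)
```

Adding the direct edge gives the walk an extra, shorter route between the two nodes, so the commute time drops, which is the behaviour the slide describes.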
  44. Kernel functions over the graph space. We now have similarity measures between nodes! The lower the distances/steps, the greater the similarity!
  47. Common scenario = lots of unlabelled data, limited labelled data; a common problem of machine learning. Supervised = learn from labelled data. Unsupervised = use only unlabelled data: build a generative model of the data. Semi-supervised = uses both labelled and unlabelled data, overcoming the limitation of insufficient labelled data. However, such methods have the tendency to snowball: if unlabelled data is used incorrectly, then mistakes reinforce themselves.
  48. Self-training: trains an initial classifier; the classifier is then retrained using classified instances; improves upon the original hypothesis. Generate negative training data: binary classification (does web resource X cite person Y or not?); positive set = seed data collected from the Social Web; use Rocchio classification to generate negative examples (a form of relevance feedback used for query optimisation). Begin self-training: train the classifier; apply the classifier to unlabelled data; enlarge the training data with the strongest classifications; retrain the classifier; repeat the process until no unlabelled data remains.
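The self-training loop (train, classify, keep the strongest classifications, retrain, repeat until no unlabelled data remains) can be sketched as below. A nearest-centroid classifier over toy two-dimensional vectors is swapped in purely to keep the example self-contained; it is not one of the classifiers evaluated in the thesis, and the feature vectors, the `batch` parameter and the margin-based confidence are all assumptions.

```python
def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(len(vectors[0]))]

def distance(a, b):
    """Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def train(labelled):
    """labelled: list of (vector, label) pairs with labels True (cites) / False."""
    return {lab: centroid([v for v, l in labelled if l == lab]) for lab in (True, False)}

def classify(model, v):
    """Return (predicted label, confidence), confidence being the distance margin."""
    d_pos, d_neg = distance(v, model[True]), distance(v, model[False])
    return d_pos < d_neg, abs(d_pos - d_neg)

def self_train(labelled, unlabelled, batch=1):
    """Move the most confident predictions into the training set until none remain."""
    labelled, unlabelled = list(labelled), list(unlabelled)
    while unlabelled:
        model = train(labelled)
        ranked = sorted(unlabelled, key=lambda v: classify(model, v)[1], reverse=True)
        for v in ranked[:batch]:
            labelled.append((v, classify(model, v)[0]))
            unlabelled.remove(v)
    return train(labelled), labelled

# One positive seed, one generated negative, three unlabelled web resources (toy values).
SEED = [((1.0, 1.0), True), ((0.0, 0.0), False)]
WEB = [(0.9, 0.8), (0.2, 0.3), (0.6, 0.7)]

model, final = self_train(SEED, WEB)
decisions = dict(final)
```

Because each retraining step trusts earlier predictions, a wrong early label shifts the centroids and can bias every later decision, which is the snowball risk noted on the slide.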
  52. Jaccard = strict. IFP = less strict, but requires certain properties. Entailment = allows variability.
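Why Jaccard counts as strict can be seen from a small sketch. It assumes, hypothetically, that two instances are compared as sets of exact feature strings; the slide does not spell out how features are represented.

```python
def jaccard(a, b):
    """Size of the intersection divided by the size of the union of two feature sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Invented feature values: a name variant shares nothing with the full name.
seed_features = {"Matthew Rowe", "University of Sheffield", "Sheffield"}
page_features = {"M. Rowe", "University of Sheffield", "Sheffield"}

score = jaccard(seed_features, page_features)
```

Only exact matches contribute to the intersection, so "M. Rowe" and "Matthew Rowe" are treated as different features.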
  53. SNOWBALL
  54. Precision = proportion of web resources which are correctly labelled as citing a person. Recall = proportion of web references which are correctly disambiguated. F-Measure = harmonic mean of precision and recall.
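The three measures can be computed as in the following sketch, which uses the standard set-based definitions. The resource identifiers are invented for illustration.

```python
def precision_recall_f(predicted, actual):
    """predicted: resources labelled as citing the person; actual: those that truly do."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

# Four resources were labelled as citing the person; three actually do.
p, r, f = precision_recall_f({"a", "b", "c", "d"}, {"a", "b", "e"})
```

Here two of the four labelled resources are correct (precision 0.5) and two of the three true references were found (recall 2/3), giving an F-measure of 4/7.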
  55. Achieves high levels of precision, outperforming humans and other baselines. SPARQL rules require strict literal and resource matching within the triple patterns; this leads to poor recall levels, however. Unable to learn from past disambiguation decisions. At lower levels of web presence (where identity web references are sparse) rules outperform all baselines in terms of F-measure. Humans find it difficult to detect sparse web references; automating disambiguation at such levels is more suitable.
  56. Achieves higher levels of recall for both distance measures than human processing. Commute Time yields higher precision levels than Optimum Transitions, due to the round-trip cost used to cluster web resources with the social graph. Performance improves as web presence levels increase: Random Walks performs better where feature sets are large in size, indicative of large web presence. Precision levels are lower than those of inference rules: clustering using commute time and optimum transitions leads to an increase in false positives; ambiguous nodes in the graph-space lead to incorrect disambiguation decisions, e.g. a literal in a metadata model denoting a person's name.
  57. Entailment consistently achieves the highest F-measure scores for each classifier: reduction in overfitting to training data; generalises well to new instances; characterised by the recall level achieved with SVM. Two permutations outperform humans: Perceptron and SVM with Entailment. Jaccard and IFP perform well at low levels of web presence; F-measure reduces as identity web references grow in number; strict feature matching leads to poor recall levels (overfitting to training data). Self-training outperforms both Random Walks and Inference Rules for certain permutations: direct use of disambiguation decisions allows classifiers to improve upon their initial hypothesis.
  58. 17 as first author. 5 papers accepted for publication since thesis submission.