This document summarizes Matthew Rowe's PhD thesis on automatically disambiguating identity web references. The thesis claims that automated techniques can replace humans in performing disambiguation at scale and with high accuracy by leveraging seed data from social web platforms. It outlines three disambiguation techniques evaluated in the thesis: 1) inference rules, 2) random walks on graphs, and 3) self-training classifiers. Evaluation results show the self-training approach achieves recall comparable to humans while the rule-based approach has the highest precision. The document also discusses generating metadata models from web resources and requirements for seed data and disambiguation techniques.
Automated Disambiguation of Identity Web References Using Social Data
1. Disambiguating Identity Web References using Social Data Matthew Rowe Organisations, Information and Knowledge Group Department of Computer Science University of Sheffield
2. Outline Problem Setting Research Questions Claims of the Thesis State of the Art Requirements for Disambiguation and Seed Data Disambiguating Identity Web References Leveraging Seed Data from the Social Web Generating Metadata Models Disambiguation Techniques Evaluation Conclusions Dissemination and Impact
3. Personal Information on the Web Personal information on the Web is disseminated: Voluntarily Involuntarily Increase in personal information: Identity Theft Lateral Surveillance Web users must discover their identity web references 2 stage process Finding Disambiguating Disambiguation = reduction of web reference ambiguity My thesis addresses disambiguation
10. Problem Setting Performing disambiguation manually: Time consuming Laborious Handle masses of information Repeated often The Web keeps changing Solution = automated techniques Alleviate the need for humans Need background knowledge Who am I searching for? What makes them unique?
12. State of the Art Disambiguation techniques are divisible into 2 types: Seeded techniques E.g. [Bekkerman and McCallum, 2005], Commercial Services Pros Disambiguate web references for a single person Cons: Require seed data No explanation of how seed data is acquired Unseeded techniques E.g. [Song et al, 2007] Pros Require no background knowledge Cons Groups web references into clusters Need to choose the correct cluster
13. Requirements Requirements for Seeded Disambiguation: Bootstrap the disambiguation process with minimal supervision Achieve disambiguation accuracy comparable to human processing Cope with web resources not containing seed data features Disambiguation must be effective for all individuals Requirements for Seed Data: Produce seed data with minimal cost Generate reliable seed data
15. Harnessing the Social Web WWW has evolved into a web of participation Digital identity is important on the Social Web Digital identity is fragmented across the Social Web Data Portability from Social Web platforms is limited http://www.economist.com/business/displaystory.cfm?story_id=10880936
16. Data found on Social Web platforms is representative of real identity information
17. User Study Data found on Social Web platforms is representative of real identity information. 50 participants from the University of Sheffield. Consisted of 3 stages; each participant: 1) lists their real-world social network, 2) extracts their digital social network, 3) compares the networks. Relevance: 0.23. Coverage: 0.77. Updates previous findings [Subrahmanyam et al, 2008]. M Rowe. The Credibility of Digital Identity Information on the Social Web: A User Study. In proceedings of 4th Workshop on Information Credibility on the Web, World Wide Web Conference 2010. Raleigh, USA. (2010)
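The two user-study measures reduce to simple set operations; a minimal sketch, assuming both networks are plain sets of contact names (the contacts and the strong-tied subset below are hypothetical placeholders):

```python
# Relevance and coverage of a digital social network vs. a real-world one.
# A minimal sketch: both networks are modelled as plain sets of contact names.

def relevance(digital, strong_ties):
    """Proportion of the digital network made up of strong-tied contacts."""
    return len(digital & strong_ties) / len(digital) if digital else 0.0

def coverage(real_world, digital):
    """Proportion of the real-world network that appears online."""
    return len(real_world & digital) / len(real_world) if real_world else 0.0

real_world = {"alice", "bob", "carol", "dave"}          # offline contacts
digital = {"alice", "bob", "carol", "eve", "mallory"}   # online contacts
strong_ties = {"alice"}                                 # strong-tied subset

print(coverage(real_world, digital))    # 3 of 4 offline contacts found online
print(relevance(digital, strong_ties))  # 1 of 5 online contacts is strong-tied
```

With the study's definitions, coverage of 0.77 means roughly three quarters of a participant's offline network was recoverable online, while relevance of 0.23 means most digital contacts were weak ties.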
22. Leveraging Seed Data from the Social Web Allows remote resource information to change Automated techniques: Follow the links Retrieve the instance information
24. Generating Metadata Models Input to disambiguation techniques is a set of web resources Web resources come in many flavours: Data models XHTML documents containing embedded semantics HTML documents 4. Interpretation: How can automated techniques interpret information? Solution = Semantic Web technologies! Convert web resources to RDF Metadata descriptions = ontology concepts Information is Consistent Interpretable
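As a toy illustration of what a metadata model looks like, a person description can be held as a set of (subject, predicate, object) triples. Plain tuples stand in for a proper RDF library such as rdflib, and the URIs and values are hypothetical:

```python
# A minimal sketch of a metadata model as RDF-style triples using plain tuples.
# In practice an RDF library (e.g. rdflib) and ontology concepts (FOAF) are used.
FOAF = "http://xmlns.com/foaf/0.1/"

def person_to_rdf(uri, name, homepage=None, knows=()):
    """Build (subject, predicate, object) triples describing a person."""
    triples = {(uri, FOAF + "name", name)}
    if homepage:
        triples.add((uri, FOAF + "homepage", homepage))
    for friend in knows:
        triples.add((uri, FOAF + "knows", friend))
    return triples

model = person_to_rdf("#me", "Matthew Rowe", knows=("#p1",))
```

Whatever the source format (RDF, XHTML with embedded semantics, plain HTML), everything downstream consumes this one consistent, interpretable representation.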
27. Generating RDF Models from HTML Documents Rise in use of lowercase semantics! However, only 2.6% of web documents contain semantics [Mika et al, 2009]. The majority of the web is HTML — bad for machines. Must extract person information, then build an RDF model. Person information is structured for legibility, which allows segmentation, i.e. a logical distinction between elements.
53. 2. Create a new rule if a triple’s predicate is Inverse Functional. 3. Apply the rules.
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
CONSTRUCT { <http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#me> foaf:page ?url }
WHERE {
  <http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#me> foaf:name ?n .
  ?url foaf:topic ?p .
  ?p foaf:name ?n .
  <http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#me> foaf:knows ?q .
  ?q foaf:homepage ?h .
  ?url foaf:topic ?r .
  ?r foaf:homepage ?h
}
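The name-matching graph pattern of such a rule can be mirrored in plain Python over a tuple-based triple store. A sketch only (the thesis executes SPARQL CONSTRUCT queries over RDF; the data below is hypothetical):

```python
# Sketch of rule-based inference: infer <#me> foaf:page ?url whenever ?url has
# a topic whose foaf:name matches the seed person's name. Plain tuples stand in
# for an RDF store; in practice this is a SPARQL CONSTRUCT query.
FOAF = "http://xmlns.com/foaf/0.1/"
ME = "#me"

def infer_pages(triples):
    names = {o for s, p, o in triples if s == ME and p == FOAF + "name"}
    inferred = set()
    for url, p1, topic in triples:
        if p1 == FOAF + "topic":
            for s, p2, name in triples:
                if s == topic and p2 == FOAF + "name" and name in names:
                    inferred.add((ME, FOAF + "page", url))
    return inferred

data = {
    (ME, FOAF + "name", "Matthew Rowe"),
    ("http://example.org/page", FOAF + "topic", "_:p"),
    ("_:p", FOAF + "name", "Matthew Rowe"),
}
inferred = infer_pages(data)  # the page is inferred as a web citation
```

The exact-match condition (`name in names`) is precisely what makes this family of rules precise but brittle, as the evaluation notes.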
56. Strict matching: lack of generalisation. M Rowe. Inferring Web Citations using Social Data and SPARQL Rules. In proceedings of Linking of User Profiles and Applications in the Social Semantic Web, Extended Semantic Web Conference 2010. Heraklion, Crete. (2010)
57. Disambiguation 2: Random Walks Seed data and web resources are RDF RDF has a graph structure: <subject, predicate, object> <source_node, edge, target_node> Graph-based disambiguation techniques: E.g. [Jiang et al, 2009] Build a graph-space Partition data points in the graph-space Requires methods to: Compile a graph-space Compare nodes Cluster nodes
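The walk itself reduces to powers of a row-stochastic transition matrix. A small numpy sketch over a hypothetical 4-node graph-space with uniform edge weights (the thesis instead weights edges by the ontological properties of the concepts):

```python
import numpy as np

# A first-order random walk over a small graph-space: normalise the adjacency
# matrix row-wise, then raise it to the power t for t-step probabilities.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

P = A / A.sum(axis=1, keepdims=True)   # row-stochastic transition matrix
P_t = np.linalg.matrix_power(P, 3)     # P_t[i, j] = prob. of i -> j in t = 3 steps
```

Each row of `P_t` remains a probability distribution, which is what the distance measures (commute time, optimum transitions) are later derived from.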
77. Relies on tuning clustering threshold. M Rowe. Applying Semantic Social Graphs to Disambiguate Identity References. In proceedings of European Semantic Web Conference 2009, Heraklion, Crete. (2009)
78. Disambiguation 3: Self-training Classic ML scenario: Lots of unlabelled data Limited labelled data Disambiguating identity web references is just the same! Possible web citations = large Social data = small Semi-supervised learning is a solution Train a classifier Using labelled and unlabelled data! Classification task is binary Does this web resource refer to person X or not?
79. Positive training data = seed data Generate negative training data: Via Rocchio classification: Build centroid vectors: positive set and negative set Negative set = unlabelled data Compare possible web citations with vectors Choose strongest negatives Disambiguation 3: Self-training
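The Rocchio step can be sketched with centroid vectors: unlabelled resources closest to the unlabelled centroid and farthest from the positive centroid are chosen as strong negatives. The term-frequency feature vectors below are hypothetical:

```python
import numpy as np

# Sketch of Rocchio-style negative selection: build a centroid for the positive
# (seed) set and one for the unlabelled pool, then rank unlabelled resources by
# how much more they resemble the unlabelled centroid than the positive one.

def strong_negatives(positives, unlabelled, k=1):
    pos_c = positives.mean(axis=0)
    neg_c = unlabelled.mean(axis=0)
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    scores = [cos(u, neg_c) - cos(u, pos_c) for u in unlabelled]
    order = np.argsort(scores)[::-1]      # most negative-looking first
    return order[:k]

positives = np.array([[1.0, 0.0], [0.9, 0.1]])
unlabelled = np.array([[0.95, 0.05], [0.0, 1.0], [0.1, 0.9]])
picked = strong_negatives(positives, unlabelled, k=1)
```

Here the second unlabelled vector, which looks nothing like the seed data, is selected as the strongest negative.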
83. Begin Self-training: Train the Classifier Classify the web resources Rank classifications Enlarge training sets Repeat steps 1-4 Disambiguation 3: Self-training
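The self-training loop (steps 1-4 above) can be sketched with a nearest-centroid classifier standing in for the perceptron/SVM/Naive Bayes used in the thesis; features and the initial labelled examples are hypothetical:

```python
import numpy as np

# Sketch of self-training: 1. train, 2. classify the pool, 3. rank decisions,
# 4. enlarge the training sets with the strongest decision, then repeat.

def self_train(pos, neg, unlabelled):
    pos, neg = list(pos), list(neg)
    pool = list(unlabelled)
    while pool:
        pos_c, neg_c = np.mean(pos, axis=0), np.mean(neg, axis=0)
        # classification margin: positive if closer to the positive centroid
        margins = [np.linalg.norm(u - neg_c) - np.linalg.norm(u - pos_c)
                   for u in pool]
        best = int(np.argmax(np.abs(margins)))   # strongest decision first
        (pos if margins[best] > 0 else neg).append(pool.pop(best))
    return np.array(pos), np.array(neg)

pos, neg = self_train([[1.0, 1.0]], [[0.0, 0.0]],
                      [[0.9, 0.8], [0.1, 0.2], [0.6, 0.6]])
```

Taking the most confident classification at each round is what keeps early mistakes from snowballing — though, as noted below, it cannot eliminate that risk.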
84. Training/Testing data is RDF Convert to a machine learning dataset Features = RDF instances Vary the feature similarity measure: Jaccard Similarity Inverse Functional Property Matching RDF Entailment Tested three different classifiers: Perceptron Support Vector Machine Naïve Bayes Disambiguation 3: Self-training
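Of the three feature-similarity measures, Jaccard is the simplest to sketch; the IFP-matching and entailment variants additionally need schema knowledge, so only Jaccard is shown, with hypothetical feature values:

```python
# Sketch of the Jaccard similarity measure over two sets of RDF feature values.

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| over two sets of feature values."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

seed = {"Matthew Rowe", "University of Sheffield"}
resource = {"Matthew Rowe", "Spice Girls"}
score = jaccard(seed, resource)   # 1 shared value of 3 distinct
```

Because only exact value overlaps count, Jaccard is the strictest of the three measures, which explains its overfitting behaviour in the evaluation.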
85. Advantages Directly learn from disambiguation decisions Utilise abundance of unlabelled data Disadvantages Requires reliable negatives Mistakes can reinforce themselves M Rowe and F Ciravegna. Harnessing the Social Web: The Science of Identity Disambiguation. In proceedings of Web Science Conference 2010. Raleigh, USA. (2010) Disambiguation 3: Self-training
86. Evaluation Measures: Precision, Recall, F-Measure Dataset 50 participants from the Semantic Web and Web 2.0 communities ~17300 web resources: 346 web resources for each participant Baselines Baseline 1: Person name as positive classification Baseline 2: Hierarchical Clustering using Person Names [Malin, 2005] Baseline 3: Human Processing
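The three measures over binary disambiguation decisions can be sketched as follows; the gold and predicted label sets are hypothetical:

```python
# Sketch of the evaluation measures: precision, recall and F-measure computed
# from the set of true identity web references vs. the set labelled as citations.

def precision_recall_f1(gold, predicted):
    tp = len(gold & predicted)                      # true positives
    p = tp / len(predicted) if predicted else 0.0   # precision
    r = tp / len(gold) if gold else 0.0             # recall
    f = 2 * p * r / (p + r) if p + r else 0.0       # harmonic mean
    return p, r, f

gold = {"r1", "r2", "r3", "r4"}        # true identity web references
predicted = {"r1", "r2", "r5"}         # resources labelled as citations
p, r, f = precision_recall_f1(gold, predicted)
```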
87. Evaluation: Inference Rules High precision Better than humans Precise graph pattern matching Low recall Rules are strict No room for variability Hard to generalise No learning from disambiguation decisions
88. Evaluation: Random Walks High recall Higher than humans Incorporates unlabelled data into random walks Uses features not in the seed data Precision Lower than humans and rules Ambiguous name literals lead to false positives
89. Evaluation: Self-training High Recall SVM + Entailment classifies 91% of references High F-Measure Higher than humans Perceptron + Entailment and SVM + Entailment
91. Conclusions: Claims Automated disambiguation techniques are able to replace human processing: techniques are comparable to humans and overcome the limitations of manual processing. Data found on Social Web platforms is representative of real identity information: 77% of a real-world social network is covered online. Social data provides the background knowledge required by automated disambiguation techniques: the techniques function using social data; biographical and social network information enables disambiguation.
92. Dissemination and Impact Published 21 peer-reviewed publications Paper in the Journal of Web Semantics (impact: 3.5) Presented work at many international conferences Program committee member for 5 international workshops Invited Expert for the World Wide Web Consortium’s Social Web Incubator Group Listed as one of top 100 visionaries “discussing the future of the web” http://www.semanticweb.com/semanticweb100/ Linked Data service for the DCS Best Poster at the Extended Semantic Web Conference 2010 http://data.dcs.shef.ac.uk Tools widely used by the Semantic Web community FOAF Generator Social Identity Schema Mapping (SISM) Vocabulary
93. Twitter: @mattroweshow Web: http://www.dcs.shef.ac.uk/~mrowe Email: m.rowe@dcs.shef.ac.uk Questions? For a condensed version of my thesis: M Rowe and F Ciravegna. Disambiguating Identity Web References using Web 2.0 Data and Semantics. In Press for special issue on "Web 2.0" in the Journal of Web Semantics. (2010)
Editor's Notes
Voluntary: e.g. personal web pages, blog pages. Involuntary: e.g. publication of electoral registers, people listings/aggregators (123people.co.uk). Automated techniques require background knowledge! Expensive to produce manually (e.g. form filling). Must be accurate. A common problem in machine learning! [Yu, 2004] highlights the painstaking methods required to acquire labelled/seed data.
1,580,000 results returned for my name. I am: a conductor; a cyclist; wrote the song “Wannabe” by the Spice Girls; a PhD student. That is only the first page! It gets worse later on: lawyer, surfer, another PhD student.
1,900,000 results returned for my name. I am: a conductor; a cyclist; wrote the song “Wannabe” by the Spice Girls; a PhD student. That is only the first page! It gets worse later on: lawyer, surfer, another PhD student.
From heterogeneous sources!
KnowItAll system [Etzioni et al, 2005]: identifies facts in web pages and populates a knowledge base. DBpedia project [Auer et al, 2008]: extracts information from Wikipedia and builds a large machine-readable knowledge base. Social Web platforms such as Facebook and MySpace: social data sufficient to support automated techniques.
Seeded techniques: [Bekkerman and McCallum, 2005]. Seed data = the social network of a person; web pages are collected and clustered based on link structures. Unseeded techniques: [Song et al, 2007]. Aligns person names with web page topics; clusters pages via a generative model built from a topic list. Seeded techniques suit this thesis’ problem setting: no need to partition web citations into k clusters or handle a large amount of irrelevant information; able to focus on disambiguating web references for a single person, in line with state-of-the-art approaches.
Places requirements on the technique + seed data
My solution to the problem settingUser-centric approachSemantic Web technologies = consistent interpretation of information
Web evolution: Wikipedia = wisdom of the crowd; blogging platforms = web users share thoughts/opinions; the Web = a Social Web. Digital identity: rich functionalities; users build bespoke identities. Digital identity can be divided into 3 tiers. My Identity: persistent identity information (name, date of birth, genealogical relations). Shared Identity: social networks, friend relationships. Abstracted Identity: demographics of usage (e.g. community of practice). Identity fragmentation: MySpace = share/discuss music; LinkedIn = make business connections. Data portability: info in proprietary formats, hard to link together. Need to solve these issues.
Social Web platforms maintain offline relationships [Hart et al, 2008] and reinforce existing offline relationships [Ellison et al, 2007]. A real-world network contains strong-tied relationships [Donath & Boyd, 2004]. Relevance = ratio of strong-tied to weak-tied relationships in the digital social network. Coverage = proportion of the real-world network in the digital social network. Results: relevance — 23% of a digital social network contains strong-tied relationships; coverage — 77% of a participant’s real-world social network appears online. Different from findings by [Subrahmanyam et al, 2008]: 49% for coverage (they define it as overlap), due to different demographics.
Can get digital identity info from the Social Web — representative of real-world identities. Need to overcome the data portability issue and identity fragmentation. 1. Export individual social graphs using Semantic Web technologies: RDF. 2. Interlink social graphs: identify equivalent people for a more complete identity profile. Approach: 1. perform a blocking step by only comparing people with the same name; 2. detect equivalent unique identifiers; 3. reason over geographical information. Similar to the reference reconciliation state of the art [Dong et al, 2005]: exploiting contextual information. Produces an interlinked social graph using Linked Data principles.
Explain what RDF is! A graph-like model of data. Export individual social graphs from Facebook, Twitter, etc. — overcomes data portability issues!
Want to detect equivalent instances!!
We now have our Seed data! It is in machine readable RDF Using FOAF + Geonames
Some flavours taste better to machines
XHTML documents contain embedded semantics: both human- and machine-readable; well-formed structure with lightweight markup for DOM elements, e.g. Microformats, RDF in Attributes (RDFa). XSL Transformations are specified in a document’s header; this allows the XHTML structure to be converted into RDF. GRDDL is used to lift an RDF model from a given document.
Designed for human consumption — hard for machines to build a metadata model from. HTML markup controls the arrangement and presentation of information; formatting provides logical distinctions between pieces of information in a given HTML document.
1. Tidy HTML to avoid poorly formed tags. 2. Derive context windows using a lightweight person-name gazetteer: 1 window = information about 1 person; use the HTML DOM to identify windows. 3. Extract person information using an HMM trained for the extraction. Input: tokenised context window; the Viterbi algorithm calculates the most likely state sequence using the parameters of the trained HMM. Pros: uses clues in the input; HTML tags act as delimiters to identify the sequence of states. 4. Build a metadata model from the extracted information.
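The Viterbi decoding in step 3 can be sketched over a trained HMM's parameters; the states, symbols and probabilities below are hypothetical toy values, not the thesis's trained model:

```python
import numpy as np

# Sketch of Viterbi decoding: given HMM parameters, recover the most likely
# state sequence for a tokenised context window (observations as symbol ids).

def viterbi(obs, start, trans, emit):
    delta = start * emit[:, obs[0]]              # best path prob. per state
    back = []
    for o in obs[1:]:
        scores = delta[:, None] * trans          # extend every path one step
        back.append(scores.argmax(axis=0))       # best predecessor per state
        delta = scores.max(axis=0) * emit[:, o]
    path = [int(delta.argmax())]                 # backtrack from best end state
    for ptr in reversed(back):
        path.append(int(ptr[path[-1]]))
    return path[::-1]

start = np.array([0.6, 0.4])                     # e.g. states NAME vs OTHER
trans = np.array([[0.7, 0.3], [0.4, 0.6]])
emit = np.array([[0.9, 0.1], [0.2, 0.8]])        # 2 observation symbols
decoded = viterbi([0, 0, 1], start, trans, emit)
```

In the extraction setting the observations are tokens of the context window (with HTML tags acting as delimiters) and the states are the person-information fields being extracted.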
We now have our seed data and a collection of web resources Both are in RDF! Now we can pass them onto the disambiguation techniques 3 techniques were explored: 1. rule-based 2. graph-based 3. semi-supervised machine learning
The data base is populated with the provided seed data; the rule base holds rules built from the seed data. Rules allow a web citation to be INFERRED based on the presence of information. Seed data is limited in size, so build rules from RDF instances (an RDF instance = a unique object, e.g. a social network member), using a general-to-specific strategy inspired by FOIL [Quinlan, 1997]:
Seed data AND web resources are RDF, which has a graph structure. State of the art: [Jiang et al, 2009] — graph-space = web pages and their features (e.g. person name, organisation); cluster = maximally-connected cliques. [On & Lee, 2007] — nodes = feature vectors of web pages (based on named entities); edges weighted by TF/IDF similarity score; cluster = spectral clustering of normalised cuts. No work has used existing metadata models (e.g. RDF)! Approach: build a graph-space; get the strongly connected component, as the graph contains islands of connected nodes. Traverse the graph-space using random walks based on a first-order Markov chain, giving the probability of moving from node i to node j given t steps. Weight edges based on ontological properties of the concepts. Measure distances between nodes: commute time, optimum transitions. Cluster root nodes based on the distance measures.
The adjacency matrix gives the local similarity of nodes in the space. Probability of moving from one node to another given that t steps have been traversed.
Commute time: IF many paths exist between nodes AND those paths are short in length, THEN commute time decreases! Optimum transitions.
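Commute time is commonly computed from the pseudoinverse of the graph Laplacian via C(i, j) = vol(G) · (L⁺ᵢᵢ + L⁺ⱼⱼ − 2 L⁺ᵢⱼ); a sketch over a hypothetical 4-node graph-space (the thesis additionally weights edges by ontological properties):

```python
import numpy as np

# Sketch of commute time between graph nodes via the Laplacian pseudoinverse.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A        # graph Laplacian
Lp = np.linalg.pinv(L)                # Moore-Penrose pseudoinverse
vol = A.sum()                         # graph volume (sum of degrees)

def commute_time(i, j):
    return vol * (Lp[i, i] + Lp[j, j] - 2 * Lp[i, j])
```

Nodes 0 and 1, joined by several short paths, have a lower commute time than node 0 and the leaf node 3 — exactly the "many short paths" behaviour described above.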
Kernel functions over the graph-space. We now have similarity measures between nodes! The lower the distances/steps, the greater the similarity!
Common scenario: lots of unlabelled data, limited labelled data — a common problem in machine learning. Supervised = learn from labelled data. Unsupervised = use only unlabelled data (build a generative model of the data). Semi-supervised = uses both labelled and unlabelled data, overcoming the limitation of insufficient labelled data. However, these methods have a tendency to snowball: if unlabelled data is used incorrectly, mistakes reinforce themselves. Self-training trains an initial classifier; the classifier is then retrained using classified instances, improving upon the original hypothesis. Generate negative training data: binary classification (does web resource X cite person Y or not?); positive set = seed data collected from the Social Web; use Rocchio classification to generate negative examples (a form of relevance feedback used for query optimisation). Begin self-training: train the classifier; apply the classifier to unlabelled data; enlarge the training data with the strongest classifications; retrain the classifier; repeat the process until no unlabelled data remains.
Jaccard = strict. IFP = less strict, but requires certain properties. Entailment = allows variability.
SNOWBALL
Precision = proportion of web resources which are correctly labelled as citing a person. Recall = proportion of web references which are correctly disambiguated. F-measure = harmonic mean of precision and recall.
Achieves high levels of precision, outperforming humans and the other baselines. SPARQL rules require strict literal and resource matching within the triple patterns; this leads to poor recall levels, however, and the rules are unable to learn from past disambiguation decisions. At lower levels of web presence (where identity web references are sparse), rules outperform all baselines in terms of f-measure: humans find it difficult to detect sparse web references, so automating disambiguation at such levels is more suitable.
Achieves higher levels of recall for both distance measures than human processing. Commute time yields higher precision levels than optimum transitions, due to the round-trip cost used to cluster web resources with the social graph. Performance improves as web presence levels increase: random walks perform better where feature sets are large in size, indicative of a large web presence. Precision levels are less than inference rules: clustering using commute time and optimum transitions leads to an increase in false positives, as ambiguous nodes in the graph-space (e.g. a literal in a metadata model denoting a person’s name) lead to incorrect disambiguation decisions.
Entailment consistently achieves the highest f-measure scores for each classifier: a reduction in overfitting to the training data means it generalises well to new instances, characterised by the recall level achieved with the SVM. Two permutations outperform humans: Perceptron with Entailment and SVM with Entailment. Jaccard and IFP perform well at low levels of web presence, but f-measure reduces as identity web references grow in number: strict feature matching leads to poor recall levels and overfitting to the training data. Self-training outperforms both random walks and inference rules for certain permutations: direct use of disambiguation decisions allows classifiers to improve upon their initial hypothesis.
17 as first author. 5 papers accepted for publication since thesis submission.