Disambiguating Identity Web References using Social Data<br />Matthew Rowe<br />Organisations, Information and Knowledge G...
Outline<br />Problem Setting<br />Research Questions<br />Claims of the Thesis<br />State of the Art<br />Requirements for...
Personal Information on the Web<br />Personal information on the Web is disseminated:<br />Voluntarily<br />Involuntarily<...
Ambiguity!<br />
Matthew Rowe: Composer<br />
Matthew Rowe: Cyclist<br />
Matthew Rowe: Gardener<br />
Matthew Rowe: Song Writer<br />
Matthew Rowe: PhD Student<br />
Problem Setting<br />Performing disambiguation manually:<br />Time consuming<br />Laborious<br />Handle masses of informat...
Research Questions<br />How can identity web references be disambiguated automatically?<br />Alleviate human processing:<b...
State of the Art<br />Disambiguation techniques are divisible into 2 types: <br />Seeded techniques<br />E.g. [Bekkerman a...
Requirements<br />Requirements for Seeded Disambiguation:<br />Bootstrap the disambiguation process with minimal supervisi...
Disambiguating Identity Web References<br />
Harnessing the Social Web<br />WWW has evolved into a web of participation<br />Digital identity is important on the Socia...
Data found on Social Web platforms is representative of real identity information<br />
User Study<br />Data found on Social Web platforms is representative of real identity information<br />50 participants fro...
Disambiguating Identity Web References<br />
Leveraging Seed Data from the Social Web<br />3. Seed Data:<br /><ul><li>How can this be gathered inexpensively?</li></li>...
Leveraging Seed Data from the Social Web<br />Link things together!<br />
Leveraging Seed Data from the Social Web<br />Blocking Step<br /><ul><li>Only compare people with the same name</li></ul>C...
Leveraging Seed Data from the Social Web<br />Allows remote resource information to change<br />Automated techniques:<br /...
Disambiguating Identity Web References<br />
Generating Metadata Models<br />Input to disambiguation techniques is a set of web resources<br />Web resources come in ma...
Generating RDF Models from XHTML Documents<br />http://events.linkeddata.org/ldow2009/<br />
Generating RDF Models from XHTML Documents<br />
Generating RDF Models from HTML Documents<br />Rise in use of lowercase semantics!<br />However only 2.6% of web documents...
Generating RDF Models from HTML Documents<br />
Generating RDF Models from HTML Documents<br /><ul><li>HTML is often poorly structured
Need a Document Object Model
Therefore Tidy it!</li></li></ul><li>Generating RDF Models from HTML Documents<br /><ul><li>Identify document segments for...
1 window = Info about 1 person
Get Xpath expression to the window</li></li></ul><li>Generating RDF Models from HTML Documents<br /><ul><li>Extract inform...
E.g. name, email, www, location
Train model parameters: Transition probs, emission probs, start probs
Use Viterbi algorithm to label tokens with states
Returns most likely state sequence</li></li></ul><li>Generating RDF Models from HTML Documents<br />M Rowe. Data.dcs: Conv...
Disambiguating Identity Web References<br />
Disambiguation 1: Inference Rules<br />1. Extract instances from Seed Data<br />2. For each instance, build a rule:<br /><...
Add triples to the rule
Create a new rule if a triple’s predicate is Inverse Functional</li></ul>3. Apply the rules to the web resources<br />
Disambiguation 1: Inference Rules<br />1. Extract instances from Seed Data<br />2. For each instance, build a rule:<br /><...
Add triples to the rule
Create a new rule if a triple’s predicate is Inverse Functional</li></ul>3. Apply the rules to the web resources<br />
Disambiguation 1: Inference Rules<br />PREFIX foaf:<http://xmlns.com/foaf/0.1/><br />CONSTRUCT { <http://www.dcs.shef.ac.u...
Add triples to the rule
Create a new rule if a triple’s predicate is Inverse Functional</li></ul>3. Apply the rules to the web resources<br />
Disambiguation 1: Inference Rules<br />PREFIX foaf:<http://xmlns.com/foaf/0.1/><br />CONSTRUCT { <http://www.dcs.shef.ac.u...
Add triples to the rule
Create a new rule if a triple’s predicate is Inverse Functional</li></ul>3. Apply the rules to the web resources<br />
Disambiguation 1: Inference Rules<br />1. Extract instances<br />2. For each instance, build a rule:<br /><ul><li>Build a ...
Add triples to the rule
Create a new rule if a triple’s predicate is Inverse Functional</li></ul>3. Apply the rules<br />PREFIX foaf:<http://xmlns...
Disambiguation 1: Inference Rules<br />Advantages:<br /><ul><li>Highly precise
Applies graph patterns</li></ul>Disadvantages:<br /><ul><li>Does not learn from past decisions (supervised)
Strict matching: lack of generalisation</li></ul>M Rowe. Inferring Web Citations using Social Data and SPARQL Rules. In pr...
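The inference-rules technique described above can be sketched in plain Python. This is an illustrative simplification under assumed names: the `IFPS` set, the toy triples and the matching logic are invented for the example; the thesis itself encodes such rules as SPARQL CONSTRUCT queries over RDF.

```python
# Sketch of the rule-based technique: build graph-pattern rules from
# seed-data triples (one general rule containing every pattern, plus a
# standalone rule per inverse-functional property), then apply them to
# web resources. Vocabulary and data below are invented for illustration.

IFPS = {"foaf:mbox_sha1sum", "foaf:homepage"}   # inverse-functional predicates

def build_rules(seed_triples):
    rules = [list(seed_triples)]                 # general rule: all triple patterns
    for s, p, o in seed_triples:
        if p in IFPS:                            # an IFP alone identifies the person
            rules.append([(s, p, o)])
    return rules

def rule_matches(rule, resource_triples):
    # a rule fires when every (predicate, object) pattern occurs in the resource
    facts = {(p, o) for _, p, o in resource_triples}
    return all((p, o) in facts for _, p, o in rule)

def infer_citations(rules, resources):
    # return URIs of web resources inferred to cite the seed person
    return [uri for uri, triples in resources.items()
            if any(rule_matches(r, triples) for r in rules)]

seed = [("me", "foaf:name", "Matthew Rowe"),
        ("me", "foaf:mbox_sha1sum", "ab12cd")]
resources = {
    "http://example.org/a": [("x", "foaf:mbox_sha1sum", "ab12cd")],
    "http://example.org/b": [("y", "foaf:name", "Matthew Rowe")],
}
print(infer_citations(build_rules(seed), resources))   # ['http://example.org/a']
```

Note how the strict matching shows up: resource `b` carries only the (very common) name literal, so neither the general rule nor the IFP rule fires, which mirrors the high-precision/low-recall behaviour noted above.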
Disambiguation 2: Random Walks<br />Seed data and web resources are RDF<br />RDF has a graph structure:<br /><subject, pre...
Disambiguation 2: Random Walks<br /><ul><li>Link the social graph with the web resources
Via common resources/literals</li></li></ul><li>Disambiguation 2: Random Walks<br />
Disambiguation 2: Random Walks<br />
Disambiguation 2: Random Walks<br /><ul><li>Graph space may contain islands of nodes
Inhibit transitions through the graph space
Get the component containing the social graph</li></li></ul><li>Disambiguation 2: Random Walks<br /><ul><li>Perform Random...
Disambiguation 2: Random Walks<br /><ul><li>Measure Distances:
Commute Time distance
Leave node i : reach node j : return to node i
Optimum Transitions
PhD Viva - Disambiguating Identity Web References using Social Data
  • Voluntary: e.g. personal web pages, blog pages. Involuntary: e.g. publication of electoral registers, people listings/aggregators (123people.co.uk). Automated techniques require background knowledge! Expensive to produce manually (e.g. form filling), and must be accurate. A common problem in machine learning! [Yu, 2004] highlights the painstaking methods required to acquire labelled/seed data.
  • 1,580,000 results returned for my name. I am: a conductor, a cyclist, the writer of the song “Wannabe” by the Spice Girls, a PhD student. That is only the first page! It gets worse later on: lawyer, surfer, another PhD student.
  • From heterogeneous sources!
  • KnowItAll system [Etzioni et al, 2005]: identifies facts in web pages and populates a knowledge base. DBPedia project [Auer et al, 2008]: extracts information from Wikipedia and builds a large machine-readable knowledge base. Social Web platforms such as Facebook and MySpace: social data sufficient to support automated techniques.
  • Seeded techniques [Bekkerman and McCallum, 2005]: seed data = the social network of a person; web pages are collected and clustered based on link structures. Unseeded techniques [Song et al, 2007]: aligns person names with web page topics; clusters pages using a generative model built from a topic list. Seeded techniques are suited to this thesis’ problem setting: no need to partition web citations into k clusters or to handle a large amount of irrelevant information; able to focus on disambiguating web references for a single person, in line with state-of-the-art approaches.
  • Places requirements on the technique + seed data
  • My solution to the problem setting: a user-centric approach; Semantic Web technologies = consistent interpretation of information.
  • Web Evolution: Wikipedia = wisdom of the crowd; blogging platforms = web users share thoughts/opinions; the Web = a Social Web. Digital Identity: rich functionalities; users build bespoke identities. Digital identity can be divided into 3 tiers. My Identity: persistent identity information (name, date of birth, genealogical relations). Shared Identity: social networks, friend relationships. Abstracted Identity: demographics of usage (e.g. community of practice). ID Fragmentation: MySpace = share/discuss music; LinkedIn = make business connections. Data Portability: info is held in proprietary formats, hard to link together. Need to solve these issues.
  • Social Web platforms: maintain offline relationships [Hart et al, 2008]; reinforce existing offline relationships [Ellison et al, 2007]. A real-world network contains strong-tied relationships [Donath & Boyd, 2004]. Relevance: ratio of strong-tied to weak-tied relationships in the digital social network. Coverage: proportion of the real-world network in the digital social network. Results: Relevance: 23% of a digital social network contains strong-tied relationships. Coverage: 77% of a participant’s real-world social network appears online. Different from findings by [Subrahmanyam et al, 2008]: 49% for coverage (they define it as overlap), due to different demographics.
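The two user-study measures can be illustrated as simple set ratios. The sets below are made-up examples, not the study data; the helper names are assumptions for the sketch.

```python
# Toy computation of the user-study measures: relevance = proportion of
# the digital social network that is strong-tied; coverage = proportion
# of the real-world network that appears online. Data is invented.

def relevance(digital, strong_ties):
    return len(digital & strong_ties) / len(digital)

def coverage(real_world, digital):
    return len(real_world & digital) / len(real_world)

digital = {"alice", "bob", "carol", "dan"}   # digital social network
strong = {"alice"}                           # strong-tied contacts
real = {"alice", "bob", "eve"}               # real-world social network

print(relevance(digital, strong))            # 0.25
print(coverage(real, digital))               # 0.666...
```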
  • Can get digital ID info from the Social Web, representative of real-world identities. Need to overcome the data portability issue and identity fragmentation. 1. Export individual social graphs, using Semantic Web technologies: RDF. 2. Interlink social graphs: identify equivalent people, giving a more complete ID profile for use. Approach: 1. Perform a blocking step by only comparing people with the same name. 2. Detect equivalent unique identifiers. 3. Reason over geographical information. Similar to reference reconciliation SOA [Dong et al, 2005]: exploiting contextual information. Produces an interlinked social graph using Linked Data principles.
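The blocking and unique-identifier steps of the interlinking approach can be sketched as follows. The profile data and field names are invented for illustration, and the geographical-reasoning step is omitted; this is a simplification, not the thesis implementation.

```python
from collections import defaultdict
from itertools import combinations

# Sketch of interlinking exported social graphs: block candidate profiles
# by (normalised) name, then assert owl:sameAs between profiles that share
# a unique identifier (an inverse-functional property such as an email
# hash). Profile data below is invented.

def interlink(profiles):
    """profiles: {uri: {"name": str, "ids": set}} -> list of sameAs triples."""
    blocks = defaultdict(list)
    for uri, p in profiles.items():
        blocks[p["name"].strip().lower()].append(uri)    # blocking step
    links = []
    for uris in blocks.values():
        for a, b in combinations(uris, 2):
            if profiles[a]["ids"] & profiles[b]["ids"]:  # shared unique identifier
                links.append((a, "owl:sameAs", b))
    return links

profiles = {
    "fb:mr": {"name": "Matthew Rowe", "ids": {"sha:ab"}},
    "tw:mr": {"name": "matthew rowe", "ids": {"sha:ab", "sha:cd"}},
    "fb:jr": {"name": "Jane Rowe", "ids": {"sha:ab"}},
}
print(interlink(profiles))   # [('fb:mr', 'owl:sameAs', 'tw:mr')]
```

Note that `fb:jr` is never compared with the others, even though it shares an identifier: the blocking step restricts comparisons to same-name candidates, which is what keeps the pairwise matching cheap.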
  • Explain what RDF is! A graph-like model of data. Export individual social graphs from Facebook, Twitter, etc. Overcomes data portability issues!
  • Want to detect equivalent instances!!
  • We now have our seed data! It is in machine-readable RDF, using FOAF + Geonames.
  • Some flavours taste better to machines
  • XHTML documents contain embedded semantics: both human- and machine-readable; a well-formed structure with lightweight markup for DOM elements, e.g. Microformats, RDF in Attributes (RDFa). XSL Transformations are specified in a document’s header; this allows the XHTML structure to be converted into RDF. GRDDL is used to lift an RDF model from a given document.
  • Designed for human consumption, not machines: hard to build a metadata model from. HTML markup controls the arrangement and presentation of information; formatting provides logical distinctions between pieces of information in a given HTML document.
  • 1. Tidy HTML, to avoid poorly formed tags. 2. Derive context windows, using a lightweight person name gazetteer; 1 window = information about 1 person; use the HTML DOM to identify windows. 3. Extract person information: use an HMM trained for the extraction; input = tokenized context window; the Viterbi algorithm calculates the most likely state sequence using the parameters of the trained HMM. Pros: uses clues in the input; HTML tags act as delimiters to identify the sequence of states. 4. Build a metadata model from the extracted information.
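The Viterbi step of the extraction can be sketched with a minimal decoder. The states, probabilities and tokens below are hand-set toy values for illustration, not the trained HMM parameters from the thesis.

```python
# Minimal Viterbi decoder for the extraction step: label the tokens of a
# context window with states such as NAME or EMAIL. The tiny hand-set
# model below stands in for the trained HMM.

def viterbi(tokens, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for the observed tokens."""
    V = [{s: (start_p[s] * emit_p[s](tokens[0]), [s]) for s in states}]
    for tok in tokens[1:]:
        row = {}
        for s in states:
            prob, path = max(
                (V[-1][prev][0] * trans_p[prev][s] * emit_p[s](tok),
                 V[-1][prev][1]) for prev in states)
            row[s] = (prob, path + [s])
        V.append(row)
    return max(V[-1].values())[1]

states = ["NAME", "EMAIL"]
start = {"NAME": 0.8, "EMAIL": 0.2}
trans = {"NAME": {"NAME": 0.6, "EMAIL": 0.4},
         "EMAIL": {"NAME": 0.5, "EMAIL": 0.5}}
# toy emission model: tokens containing "@" look like email addresses
emit = {"NAME":  lambda t: 0.1 if "@" in t else 0.9,
        "EMAIL": lambda t: 0.9 if "@" in t else 0.1}

print(viterbi(["Matthew", "Rowe", "m.rowe@dcs.shef.ac.uk"],
              states, start, trans, emit))   # ['NAME', 'NAME', 'EMAIL']
```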
  • We now have our seed data and a collection of web resources. Both are in RDF! Now we can pass them on to the disambiguation techniques. 3 techniques were explored: 1. rule-based, 2. graph-based, 3. semi-supervised machine learning.
  • Data base = populated with the provided seed data. Rule base = rules built from the seed data. Rules allow a web citation to be INFERRED based on the presence of information. Seed data = limited in size. Build rules from RDF instances; an RDF instance = a unique object (e.g. a social network member). General-to-specific strategy inspired by FOIL [Quinlan, 1997].
  • Seed data AND web resources = RDF, which has a graph structure. State of the Art: [Jiang et al, 2009]: graph-space = web pages and their features (e.g. person name, organisation); cluster = maximally-connected cliques. [On & Lee, 2007]: nodes = feature vectors of web pages (based on named entities); edges = weighted based on TF/IDF similarity score; cluster = spectral clustering of normalized cuts. No work has used existing metadata models (e.g. RDF)! Approach: build a graph space; get the strongly connected component, as the graph contains islands of connected nodes; traverse the graph-space using Random Walks, based on a first-order Markov chain, giving the probability of moving from node i to node j given t steps; weight edges based on ontological properties of the concepts; measure distances between nodes (Commute Time, Optimum Transitions); cluster root nodes based on the distance measures.
  • The adjacency matrix gives the local similarity of nodes in the space: the probability of moving from one node to another given that t steps have been traversed.
  • Commute Time: if many paths exist between nodes AND those paths are short in length, THEN Commute Time decreases! Optimum Transitions.
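Commute time (leave node i, reach node j, return to node i) is the sum of the two hitting times h(i, j) + h(j, i), and hitting times satisfy h(i, j) = 1 + Σₖ P(i → k)·h(k, j). A sketch under assumed simplifications: an unweighted toy graph and fixed-point iteration instead of the thesis' weighted graph space.

```python
# Commute-time sketch: expected number of steps for a random walk to
# leave node i, reach node j, and return to i. Hitting times are found
# by iterating h(i, j) = 1 + sum_k P(i -> k) * h(k, j) to convergence.
# The graph below is a toy example, not the thesis graph space.

def hitting_time(adj, target, iters=2000):
    """Expected steps to reach `target` from every node (unweighted graph)."""
    h = {n: 0.0 for n in adj}
    for _ in range(iters):
        for n in adj:
            if n != target:
                h[n] = 1 + sum(h[m] for m in adj[n]) / len(adj[n])
    return h

def commute_time(adj, i, j):
    return hitting_time(adj, j)[i] + hitting_time(adj, i)[j]

# Path graph a - b - c: fewer/longer paths mean larger commute times.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(commute_time(adj, "a", "b"))   # 4.0
print(commute_time(adj, "a", "c"))   # 8.0
```

The numbers match the known identity that commute time between nodes equals 2m times their effective resistance (m = 2 edges here), which is why adding short parallel paths between two nodes drives their commute time down.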
  • Kernel functions over the graph space. We now have similarity measures between nodes! The lower the distances/steps, the greater the similarity!
  • Common scenario = lots of unlabelled data, limited labelled data; a common problem of machine learning. Supervised = learn from labelled data. Unsupervised = use only unlabelled data, building a generative model of the data. Semi-supervised = uses both labelled and unlabelled data, overcoming the limitation of insufficient labelled data. However, mistakes have a tendency to snowball: if unlabelled data is used incorrectly, then mistakes reinforce themselves. Self-training: trains an initial classifier; the classifier is then retrained using classified instances, improving upon the original hypothesis. Generate negative training data: binary classification (does web resource X cite person Y or not?); positive set = seed data collected from the Social Web; use Rocchio Classification, a form of relevance feedback used for query optimisation, to generate negative examples. Begin self-training: train the classifier; apply the classifier to unlabelled data; enlarge the training data with the strongest classifications; retrain the classifier; repeat the process until no unlabelled data remains.
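The loop can be sketched with a nearest-centroid (Rocchio-style) classifier over toy 2-dimensional feature vectors. Everything here is an assumption made for illustration: the data, the distance-based confidence and the "furthest from the positive centroid" negative generation; the thesis used richer features derived from the RDF models.

```python
# Sketch of self-training with Rocchio-generated negatives: seed the
# negative set with the points furthest from the positive centroid, then
# repeatedly label the most confident unlabelled point and retrain.
# Toy vectors, invented for illustration.

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def centroid(vecs):
    return tuple(sum(xs) / len(xs) for xs in zip(*vecs))

def self_train(pos, unlabelled):
    c_pos = centroid(pos)
    # Rocchio-style negatives: the two points furthest from the positives
    neg = sorted(unlabelled, key=lambda v: dist(v, c_pos))[-2:]
    unlabelled = [v for v in unlabelled if v not in neg]
    labels = {}
    while unlabelled:
        c_pos, c_neg = centroid(pos), centroid(neg)   # "retrain" the classifier
        # confidence = margin between the two centroid distances
        best = max(unlabelled, key=lambda v: abs(dist(v, c_neg) - dist(v, c_pos)))
        if dist(best, c_pos) < dist(best, c_neg):
            pos.append(best); labels[best] = True     # resource cites the person
        else:
            neg.append(best); labels[best] = False
        unlabelled.remove(best)
    return labels

pos = [(1.0, 1.0), (1.2, 0.9)]                        # seed (positive) examples
unl = [(1.1, 1.0), (0.9, 1.1), (5.0, 5.0), (5.2, 4.8), (4.9, 5.1)]
print(self_train(pos, unl))
```

Labelling the highest-margin point first is what limits the snowballing risk mentioned above: a confidently wrong early decision would still contaminate every later retraining round.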
  • Jaccard = strict. IFP = less strict, but requires certain properties. Entailment = allows variability.
  • SNOWBALL
  • Precision = proportion of web resources which are correctly labelled as citing a person. Recall = proportion of web references which are correctly disambiguated. F-Measure = harmonic mean of precision and recall.
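The three measures above, computed for a toy set of decisions. `predicted` (resources labelled as citing the person) and `relevant` (ground-truth citations) are made-up values for illustration.

```python
# Evaluation measures for the disambiguation decisions, on invented data.

def prf(predicted, relevant):
    tp = len(predicted & relevant)               # correctly labelled resources
    precision = tp / len(predicted)
    recall = tp / len(relevant)
    f_measure = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f_measure

predicted = {"page1", "page2", "page3", "page4"}
relevant = {"page1", "page2", "page5"}
print(prf(predicted, relevant))   # precision 0.5, recall 0.666..., F 0.571...
```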
  • Achieves high levels of precision, outperforming humans and other baselines. SPARQL rules require strict literal and resource matching within the triple patterns; however, this leads to poor recall levels. Unable to learn from past disambiguation decisions. At lower levels of web presence (where identity web references are sparse), rules outperform all baselines in terms of f-measure. Humans find it difficult to detect sparse web references, so automating disambiguation at such levels is more suitable.
  • Achieves higher levels of recall for both distance measures than human processing. Commute Time yields higher precision levels than Optimum Transitions, due to the round-trip cost used to cluster web resources with the social graph. Performance improves as web presence levels increase: Random Walks performs better where feature sets are large in size, indicative of a large web presence. Precision levels are lower than inference rules: clustering using commute time and optimum transitions leads to an increase in false positives, as ambiguous nodes in the graph-space lead to incorrect disambiguation decisions (e.g. a literal in a metadata model denoting a person’s name).
  • Entailment consistently achieves the highest f-measure scores for each classifier: a reduction in overfitting to the training data means it generalises well to new instances, characterised by the recall level achieved with the SVM. Two permutations outperform humans: Perceptron and SVM with Entailment. Jaccard and IFP perform well at low levels of web presence, but f-measure reduces as identity web references grow in number: strict feature matching leads to poor recall levels and overfitting to the training data. Self-training outperforms both Random Walks and Inference Rules for certain permutations: direct use of disambiguation decisions allows classifiers to improve upon their initial hypothesis.
  • 17 as first author. 5 papers accepted for publication since thesis submission.

    1. 1. Disambiguating Identity Web References using Social Data<br />Matthew Rowe<br />Organisations, Information and Knowledge Group<br />Department of Computer Science<br />University of Sheffield<br />
    2. 2. Outline<br />Problem Setting<br />Research Questions<br />Claims of the Thesis<br />State of the Art<br />Requirements for Disambiguation and Seed Data<br />Disambiguating Identity Web References<br />Leveraging Seed Data from the Social Web<br />Generating Metadata Models<br />Disambiguation Techniques<br />Evaluation<br />Conclusions<br />Dissemination and Impact<br />
    3. 3. Personal Information on the Web<br />Personal information on the Web is disseminated:<br />Voluntarily<br />Involuntarily<br />Increase in personal information:<br />Identity Theft<br />Lateral Surveillance<br />Web users must discover their identity web references<br />2 stage process<br />Finding<br />Disambiguating<br />Disambiguation = reduction of web reference ambiguity<br />My thesis addresses disambiguation<br />
    4. 4. Ambiguity!<br />
    5. 5. Matthew Rowe: Composer<br />
    6. 6. Matthew Rowe: Cyclist<br />
    7. 7. Matthew Rowe: Gardener<br />
    8. 8. Matthew Rowe: Song Writer<br />
    9. 9. Matthew Rowe: PhD Student<br />
    10. 10. Problem Setting<br />Performing disambiguation manually:<br />Time consuming<br />Laborious<br />Handle masses of information<br />Repeated often<br />The Web keeps changing<br />Solution = automated techniques<br />Alleviate the need for humans<br />Need background knowledge<br />Who am I searching for?<br />What makes them unique? <br />
    11. 11. Research Questions<br />How can identity web references be disambiguated automatically?<br />Alleviate human processing:<br /><ul><li>Can automated techniques replace humans?</li></ul>Supervision:<br /><ul><li>Can automated techniques function independently?</li></ul>Seed Data:<br /><ul><li>How can this be gathered inexpensively?</li></ul>Interpretation:<br /><ul><li>How can automated techniques interpret information?</li></li></ul><li>Claims of the Thesis<br /><ul><li>Automated disambiguation techniques are able to replace human processing</li></ul>Retrieve and process information at large-scale<br />With high accuracy<br />Data found on Social Web platforms is representative of real identity information<br />Platforms allow users to build a digital identity<br /><ul><li>Social data provides the background knowledge required by automated disambiguation techniques</li></ul>Overcoming the burden of seed data generation<br />
    12. State of the Art<br />Disambiguation techniques are divisible into 2 types: <br />Seeded techniques<br />E.g. [Bekkerman and McCallum, 2005], Commercial Services<br />Pros:<br />Disambiguate web references for a single person<br />Cons:<br />Require seed data<br />No explanation of how seed data is acquired<br />Unseeded techniques<br />E.g. [Song et al, 2007]<br />Pros:<br />Require no background knowledge<br />Cons:<br />Groups web references into clusters<br />Need to choose the correct cluster<br />
    13. Requirements<br />Requirements for Seeded Disambiguation:<br />Bootstrap the disambiguation process with minimal supervision<br />Achieve disambiguation accuracy comparable to human processing<br />Cope with web resources not containing seed data features<br />Disambiguation must be effective for all individuals<br />Requirements for Seed Data:<br />Produce seed data with minimal cost<br />Generate reliable seed data<br />
    14. Disambiguating Identity Web References<br />
    15. Harnessing the Social Web<br />WWW has evolved into a web of participation<br />Digital identity is important on the Social Web<br />Digital identity is fragmented across the Social Web<br />Data Portability from Social Web platforms is limited<br />http://www.economist.com/business/displaystory.cfm?story_id=10880936<br />
    16. Data found on Social Web platforms is representative of real identity information<br />
    17. User Study<br />Data found on Social Web platforms is representative of real identity information<br />50 participants from the University of Sheffield<br />Consisted of 3 stages, each participant:<br />List real world social network<br />Extract digital social network<br />Compare networks<br />Relevance: 0.23<br />Coverage: 0.77<br />Updates previous findings<br />[Subrahmanyam et al, 2008]<br />M Rowe. The Credibility of Digital Identity Information on the Social Web: A User Study. In proceedings of 4th Workshop on Information Credibility on the Web, World Wide Web Conference 2010. Raleigh, USA. (2010)<br />
    18. Disambiguating Identity Web References<br />
    19. Leveraging Seed Data from the Social Web<br />3. Seed Data:<br /><ul><li>How can this be gathered inexpensively?</li></li></ul><li>Leveraging Seed Data from the Social Web<br />Use Semantics!<br />M Rowe and F Ciravegna. Getting to Me - Exporting Semantic Social Network Information from Facebook. In proceedings of Social Data on the Web Workshop, ISWC 2008, Karlsruhe, Germany. (2008)<br />http://www.dcs.shef.ac.uk/~mrowe/foafgenerator.html<br />
    20. Leveraging Seed Data from the Social Web<br />Link things together!<br />
    21. Leveraging Seed Data from the Social Web<br />Blocking Step<br /><ul><li>Only compare people with the same name</li></ul>Compare values of Inverse Functional Properties<br /><ul><li>E.g. Homepage/Email</li></ul>Compare Geo URIs<br /><ul><li>E.g. Matching locations</li></ul>Compare Geo data<br /><ul><li>Using Linked Data sources</li></ul>M Rowe. Interlinking Distributed Social Graphs. In proceedings of Linked Data on the Web Workshop, World Wide Web Conference, Madrid, Spain. (2009)<br />
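The interlinking steps above can be sketched in a few lines. This is an illustrative sketch only, not the thesis implementation: the record shapes, URIs and the `mbox`/`homepage` keys are invented stand-ins for the FOAF properties being compared.

```python
def block_by_name(people):
    """Blocking step: group person records so only same-name pairs are compared."""
    blocks = {}
    for person in people:
        blocks.setdefault(person["name"], []).append(person)
    return blocks

def same_person(a, b, ifps=("mbox", "homepage")):
    """Records denote the same person if any Inverse Functional Property matches."""
    return any(a.get(p) and a.get(p) == b.get(p) for p in ifps)

def interlink(people):
    """Return links between records judged to denote the same person."""
    links = []
    for records in block_by_name(people).values():
        for i in range(len(records)):
            for j in range(i + 1, len(records)):
                if same_person(records[i], records[j]):
                    links.append((records[i]["uri"], records[j]["uri"]))
    return links

# Invented records: two profiles share an email address, the third does not
people = [
    {"uri": "a#me", "name": "Matthew Rowe", "mbox": "mailto:m.rowe@dcs.shef.ac.uk"},
    {"uri": "b#me", "name": "Matthew Rowe", "mbox": "mailto:m.rowe@dcs.shef.ac.uk"},
    {"uri": "c#me", "name": "Matthew Rowe", "mbox": "mailto:other@example.org"},
]
print(interlink(people))  # [('a#me', 'b#me')]
```

Blocking keeps the pairwise comparison tractable: only records sharing a name are ever compared, and a single matching Inverse Functional Property value is enough to link two of them.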
    22. Leveraging Seed Data from the Social Web<br />Allows remote resource information to change<br />Automated techniques:<br />Follow the links<br />Retrieve the instance information<br />
    23. Disambiguating Identity Web References<br />
    24. Generating Metadata Models<br />Input to disambiguation techniques is a set of web resources<br />Web resources come in many flavours:<br />Data models<br />XHTML documents containing embedded semantics<br />HTML documents<br />4. Interpretation:<br />How can automated techniques interpret information?<br />Solution = Semantic Web technologies!<br />Convert web resources to RDF<br />Metadata descriptions = ontology concepts<br />Information is:<br />Consistent<br />Interpretable<br />
    25. Generating RDF Models from XHTML Documents<br />http://events.linkeddata.org/ldow2009/<br />
    26. Generating RDF Models from XHTML Documents<br />
    27. Generating RDF Models from HTML Documents<br />Rise in use of lowercase semantics!<br />However only 2.6% of web documents contain semantics<br />[Mika et al, 2009]<br />Majority of the web is HTML<br />Bad for machines<br />Must extract person information<br />Then build an RDF model<br />Person information is structured<br />for legibility<br />for segmentation<br />i.e. logical distinction between elements<br />
    28. Generating RDF Models from HTML Documents<br />
    29. Generating RDF Models from HTML Documents<br /><ul><li>HTML is often poorly structured
    30. Need a Document Object Model
    31. Therefore Tidy it!</li></li></ul><li>Generating RDF Models from HTML Documents<br /><ul><li>Identify document segments for extraction
    32. 1 window = Info about 1 person
    33. Get XPath expression to the window</li></li></ul><li>Generating RDF Models from HTML Documents<br /><ul><li>Extract information using a Hidden Markov Model
    34. E.g. name, email, www, location
    35. Train model parameters: Transition probs, emission probs, start probs
    36. Use Viterbi algorithm to label tokens with states
    37. Returns most likely state sequence</li></li></ul><li>Generating RDF Models from HTML Documents<br />M Rowe. Data.dcs: Converting Legacy Data into Linked Data. In proceedings of Linked Data on the Web Workshop, World Wide Web Conference 2010. Raleigh, USA. (2010)<br />
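The extraction step above can be illustrated with a toy Viterbi decoder. The states and probabilities below are invented stand-ins for the trained HMM parameters, not the values used in the thesis.

```python
def viterbi(tokens, states, start_p, trans_p, emit_p):
    """Label a token sequence with the most likely HMM state sequence."""
    smooth = 1e-6  # fallback emission probability for unseen tokens
    # V[t][s] = (best path probability ending in state s at step t, previous state)
    V = [{s: (start_p[s] * emit_p[s].get(tokens[0], smooth), None) for s in states}]
    for t in range(1, len(tokens)):
        V.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][r][0] * trans_p[r][s] * emit_p[s].get(tokens[t], smooth), r)
                for r in states
            )
            V[t][s] = (prob, prev)
    # Backtrack from the most probable final state
    state = max(V[-1], key=lambda s: V[-1][s][0])
    path = [state]
    for t in range(len(tokens) - 1, 0, -1):
        state = V[t][state][1]
        path.insert(0, state)
    return path

# Toy parameters (invented; the thesis trains these from labelled person records)
states = ("name", "email")
start_p = {"name": 0.8, "email": 0.2}
trans_p = {"name": {"name": 0.5, "email": 0.5},
           "email": {"name": 0.5, "email": 0.5}}
emit_p = {"name": {"matthew": 0.9}, "email": {"m.rowe@dcs.shef.ac.uk": 0.9}}
print(viterbi(["matthew", "m.rowe@dcs.shef.ac.uk"], states, start_p, trans_p, emit_p))
# ['name', 'email']
```

The returned state sequence is exactly the token labelling described on the slide: each token is tagged with the person-information field it most likely belongs to.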
    38. Disambiguating Identity Web References<br />
    39. Disambiguation 1: Inference Rules<br />1. Extract instances from Seed Data<br />2. For each instance, build a rule:<br /><ul><li>Build a skeleton rule
    40. Add triples to the rule
    41. Create a new rule if a triple’s predicate is Inverse Functional</li></ul>3. Apply the rules to the web resources<br />
    45. Disambiguation 1: Inference Rules<br />PREFIX foaf:<http://xmlns.com/foaf/0.1/><br />CONSTRUCT { <http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#me> foaf:page ?url }<br />WHERE {<br /><http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#me> foaf:name ?n .<br /> ?url foaf:topic ?p .<br /> ?p foaf:name ?n .<br /><http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#me> foaf:knows ?q .<br /> ?q foaf:name ?m .<br /> ?url foaf:topic ?r .<br /> ?r foaf:name ?m<br />}<br />1. Extract instances<br />2. For each instance, build a rule:<br /><ul><li>Build a skeleton rule
    46. Add triples to the rule
    47. Create a new rule if a triple’s predicate is Inverse Functional</li></ul>3. Apply the rules to the web resources<br />
    48. Disambiguation 1: Inference Rules<br />PREFIX foaf:<http://xmlns.com/foaf/0.1/><br />CONSTRUCT { <http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#me> foaf:page ?url }<br />WHERE {<br /><http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#me> foaf:name ?n .<br /> ?url foaf:topic ?p .<br /> ?p foaf:name ?n .<br /><http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#me> foaf:knows ?q .<br /> ?q foaf:homepage ?h .<br /> ?url foaf:topic ?r .<br /> ?r foaf:homepage ?h<br />}<br />1. Extract instances<br />2. For each instance, build a rule:<br /><ul><li>Build a skeleton rule
    49. Add triples to the rule
    50. Create a new rule if a triple’s predicate is Inverse Functional</li></ul>3. Apply the rules to the web resources<br />
    51. Disambiguation 1: Inference Rules<br />1. Extract instances<br />2. For each instance, build a rule:<br /><ul><li>Build a skeleton rule
    52. Add triples to the rule
    53. Create a new rule if a triple’s predicate is Inverse Functional</li></ul>3. Apply the rules<br />PREFIX foaf:<http://xmlns.com/foaf/0.1/><br />CONSTRUCT { <http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#me> foaf:page ?url }<br />WHERE {<br /><http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#me> foaf:name ?n .<br /> ?url foaf:topic ?p .<br /> ?p foaf:name ?n .<br /><http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#me> foaf:knows ?q .<br /> ?q foaf:homepage ?h .<br /> ?url foaf:topic ?r .<br /> ?r foaf:homepage ?h<br />}<br />
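The effect of applying such a rule can be approximated without a SPARQL engine. The sketch below is a minimal graph-pattern matcher over triples-as-tuples; the example resource URL is invented, and a real deployment would hand the CONSTRUCT query above to a SPARQL processor instead.

```python
ME = "http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#me"

def match_pattern(pattern, triples):
    """Return every variable binding under which all pattern triples occur."""
    bindings = [{}]
    for p in pattern:
        next_bindings = []
        for env in bindings:
            for t in triples:
                env2 = dict(env)
                for term, value in zip(p, t):
                    if term.startswith("?"):                 # variable: bind or check
                        if env2.setdefault(term, value) != value:
                            break
                    elif term != value:                      # constant: exact match
                        break
                else:
                    next_bindings.append(env2)
        bindings = next_bindings
    return bindings

# Seed data plus one candidate web resource, flattened into triples
triples = [
    (ME, "foaf:name", "Matthew Rowe"),
    ("http://example.org/page", "foaf:topic", "_:p1"),
    ("_:p1", "foaf:name", "Matthew Rowe"),
]
# Body of the rule: the resource mentions a person bearing the seed name
pattern = [
    (ME, "foaf:name", "?n"),
    ("?url", "foaf:topic", "?p"),
    ("?p", "foaf:name", "?n"),
]
# CONSTRUCT part: one inferred foaf:page triple per binding
inferred = {(ME, "foaf:page", env["?url"]) for env in match_pattern(pattern, triples)}
print(inferred)
```

Each binding that satisfies the whole pattern yields one inferred `foaf:page` triple, i.e. one disambiguated identity web reference.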
    54. Disambiguation 1: Inference Rules<br />Advantages:<br /><ul><li>Highly precise
    55. Applies graph patterns</li></ul>Disadvantages:<br /><ul><li>Does not learn from past decisions (supervised)
    56. Strict matching: lack of generalisation</li></ul>M Rowe. Inferring Web Citations using Social Data and SPARQL Rules. In proceedings of Linking of User Profiles and Applications in the Social Semantic Web, Extended Semantic Web Conference 2010. Heraklion, Crete. (2010)<br />
    57. Disambiguation 2: Random Walks<br />Seed data and web resources are RDF<br />RDF has a graph structure:<br /><subject, predicate, object><br /><source_node, edge, target_node><br />Graph-based disambiguation techniques:<br />E.g. [Jiang et al, 2009]<br />Build a graph-space<br />Partition data points in the graph-space<br />Requires methods to:<br />Compile a graph-space<br />Compare nodes<br />Cluster nodes<br />
    58. Disambiguation 2: Random Walks<br /><ul><li>Link the social graph with the web resources
    59. Via common resources/literals</li></li></ul><li>Disambiguation 2: Random Walks<br />
    60. Disambiguation 2: Random Walks<br />
    61. Disambiguation 2: Random Walks<br /><ul><li>Graph space may contain islands of nodes
    62. Inhibit transitions through the graph space
    63. Get the component containing the social graph</li></li></ul><li>Disambiguation 2: Random Walks<br /><ul><li>Perform Random Walks through the graph</li></ul>Derive Adjacency Matrix<br />Derive Diagonal Degree Matrix<br />Compute Transition Probability Matrix<br />
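Numerically, the three derivations above look like this for a toy 3-node graph (adjacency values invented for illustration); the transition probability matrix is P = D⁻¹A:

```python
# Toy 3-node undirected graph
A = [
    [0, 1, 1],   # node 0 is linked to nodes 1 and 2
    [1, 0, 0],
    [1, 0, 0],
]
degrees = [sum(row) for row in A]  # the diagonal of the degree matrix D
# P[i][j] = probability that a random walk at node i steps to node j
P = [[A[i][j] / degrees[i] for j in range(len(A))] for i in range(len(A))]
print(P)  # each row sums to 1
```

Row-normalising the adjacency matrix by node degree is what turns graph structure into walk probabilities: well-connected nodes spread their probability mass over more neighbours.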
    64. Disambiguation 2: Random Walks<br /><ul><li>Measure Distances:
    65. Commute Time distance
    66. Leave node i : reach node j : return to node i
    67. Optimum Transitions
    68. Move through the graph until probability peaks</li></li></ul><li>Disambiguation 2: Random Walks<br /><ul><li>Measure Distances:
    69. Commute Time distance
    70. Leave node i : reach node j : return to node i
    71. Optimum Transitions
    72. Move through the graph until probability peaks</li></li></ul><li>Disambiguation 2: Random Walks<br /><ul><li>Group web resources with social graph
    73. Via agglomerative clustering
    74. Every point is in a cluster
    75. Merge clusters until none can be merged</li></li></ul><li>Disambiguation 2: Random Walks<br />Advantages:<br /><ul><li>Semi-supervised
    76. Exploits the graph structure of RDF</li></ul>Disadvantages:<br /><ul><li>Computationally heavy (Matrix powers!)
    77. Relies on tuning clustering threshold</li></ul>M Rowe. Applying Semantic Social Graphs to Disambiguate Identity References. In proceedings of European Semantic Web Conference 2009, Heraklion, Crete. (2009)<br />
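The clustering step described above can be sketched as single-linkage agglomerative clustering. In the thesis the distances are commute times between graph nodes; here 1-D points and absolute difference stand in as an invented toy distance.

```python
def agglomerative(points, dist, threshold):
    """Single-linkage agglomerative clustering: start with singleton
    clusters and merge the closest pair until none are within threshold."""
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        # find the closest pair of clusters under single linkage
        d, i, j = min(
            (min(dist(a, b) for a in clusters[i] for b in clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > threshold:
            break  # no pair can be merged: clustering is finished
        clusters[i].extend(clusters.pop(j))
    return clusters

# Toy 1-D points: the two nearby points merge, the distant one stays alone
print(agglomerative([0.0, 0.1, 5.0], lambda a, b: abs(a - b), 1.0))
# [[0.0, 0.1], [5.0]]
```

The threshold is the tuning parameter the disadvantages slide refers to: set too low, nothing merges with the social graph; too high, false positives are absorbed.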
    78. Disambiguation 3: Self-training<br />Classic ML scenario:<br />Lots of unlabelled data<br />Limited labelled data<br />Disambiguating identity web references is just the same!<br />Possible web citations = large<br />Social data = small<br />Semi-supervised learning is a solution<br />Train a classifier<br />Using labelled and unlabelled data!<br />Classification task is binary<br />Does this web resource refer to person X or not?<br />
    79. Positive training data = seed data<br />Generate negative training data:<br />Via Rocchio classification:<br />Build centroid vectors: positive set and negative set<br />Negative set = unlabelled data<br />Compare possible web citations with vectors<br />Choose strongest negatives<br />Disambiguation 3: Self-training<br />
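The Rocchio-style negative selection above can be sketched over toy 2-D term vectors (values invented): build a centroid from the seed positives and one from the unlabelled data standing in for the negative set, then keep the items most confidently on the negative side.

```python
import math

def centroid(vectors):
    """Mean vector of a set of equal-length feature vectors."""
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(len(vectors[0]))]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

def strongest_negatives(positives, unlabelled, k):
    """Rank unlabelled items by how much closer they are to the 'negative'
    centroid (built from the unlabelled data itself) than to the positive
    (seed) centroid; take the top k as reliable negatives."""
    pos_c, neg_c = centroid(positives), centroid(unlabelled)
    ranked = sorted(unlabelled,
                    key=lambda v: cosine(v, neg_c) - cosine(v, pos_c),
                    reverse=True)
    return ranked[:k]

# The second unlabelled item is clearly unlike the seed vector
print(strongest_negatives([[1.0, 0.0]], [[0.9, 0.1], [0.0, 1.0]], 1))
# [[0.0, 1.0]]
```

Only the strongest negatives are kept because, as the later slides note, unreliable negatives would be reinforced by the self-training loop.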
    83. Begin Self-training:<br />1. Train the Classifier<br />2. Classify the web resources<br />3. Rank classifications<br />4. Enlarge training sets<br />Repeat steps 1-4<br />Disambiguation 3: Self-training<br />
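The self-training loop above can be sketched end to end. The `CentroidClassifier` and the 1-D features are invented purely for illustration; the thesis trains Perceptron, SVM and Naïve Bayes classifiers over RDF-derived features.

```python
class CentroidClassifier:
    """Toy 1-D classifier, invented for illustration: confidence is how
    much nearer a value sits to the positive mean than the negative mean."""
    def fit(self, xs, ys):
        pos = [x for x, y in zip(xs, ys) if y == 1]
        neg = [x for x, y in zip(xs, ys) if y == 0]
        self.pos_mean = sum(pos) / len(pos)
        self.neg_mean = sum(neg) / len(neg)
    def confidence(self, x):
        return abs(x - self.neg_mean) - abs(x - self.pos_mean)

def self_train(classifier, positives, negatives, unlabelled, rounds=5, top_k=2):
    """The loop above: train, classify, rank by confidence, enlarge the
    training sets with the most confident decisions, and repeat."""
    for _ in range(rounds):
        if not unlabelled:
            break
        classifier.fit(positives + negatives,
                       [1] * len(positives) + [0] * len(negatives))
        ranked = sorted(unlabelled, key=lambda x: abs(classifier.confidence(x)),
                        reverse=True)
        for x in ranked[:top_k]:
            unlabelled.remove(x)
            (positives if classifier.confidence(x) > 0 else negatives).append(x)
    return positives, negatives

print(self_train(CentroidClassifier(), [1.0], [0.0], [0.9, 0.1]))
# ([1.0, 0.9], [0.0, 0.1])
```

Moving only the top-ranked decisions into the training sets each round is what keeps the loop from compounding its own mistakes, though (as the disadvantages slide notes) early errors can still reinforce themselves.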
    84. Training/Testing data is RDF<br />Convert to a machine learning dataset<br />Features = RDF instances<br />Vary the feature similarity measure:<br />Jaccard Similarity<br />Inverse Functional Property Matching<br />RDF Entailment<br />Tested three different classifiers:<br />Perceptron<br />Support Vector Machine<br />Naïve Bayes<br />Disambiguation 3: Self-training<br />
    85. Advantages:<br />Directly learn from disambiguation decisions<br />Utilise abundance of unlabelled data<br />Disadvantages:<br />Requires reliable negatives<br />Mistakes can reinforce themselves<br />M Rowe and F Ciravegna. Harnessing the Social Web: The Science of Identity Disambiguation. In proceedings of Web Science Conference 2010. Raleigh, USA. (2010)<br />Disambiguation 3: Self-training<br />
    86. Evaluation<br />Measures:<br />Precision, Recall, F-Measure<br />Dataset:<br />50 participants from the Semantic Web and Web 2.0 communities<br />~17300 web resources: 346 web resources for each participant<br />Baselines:<br />Baseline 1: Person name as positive classification<br />Baseline 2: Hierarchical Clustering using Person Names<br />[Malin, 2005]<br />Baseline 3: Human Processing<br />
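Of the similarity measures listed, Jaccard is the simplest to state: the overlap of two feature value sets divided by their union (the example values below are invented).

```python
def jaccard(a, b):
    """Jaccard similarity between two feature value sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Two instances sharing one of three distinct values: similarity 1/3
print(jaccard({"Matthew Rowe", "Sheffield"}, {"Matthew Rowe", "OAK Group"}))
```

IFP matching and RDF entailment replace this set overlap with stricter or more semantics-aware comparisons over the same feature values.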
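The evaluation measures can be stated directly from decision counts (the counts below are invented for illustration, not results from the thesis):

```python
def prf(true_pos, false_pos, false_neg):
    """Precision, recall and balanced F-measure from decision counts."""
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    f = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f

# E.g. 8 correct citations found, 2 false positives, 4 citations missed
precision, recall, f_measure = prf(8, 2, 4)
print(precision, recall, f_measure)  # 0.8, 2/3, and their harmonic mean 8/11
```

Precision rewards the rule-based technique's strictness, recall rewards the random-walk technique's coverage, and F-measure balances the two, which is why the three techniques separate so clearly on the following slides.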
    87. Evaluation: Inference Rules<br />High precision<br />Better than humans<br />Precise graph pattern matching<br />Low recall<br />Rules are strict<br />No room for variability<br />Hard to generalise<br />No learning from disambiguation decisions<br />
    88. Evaluation: Random Walks<br />High recall<br />Higher than humans<br />Incorporates unlabelled data into random walks<br />Uses features not in the seed data<br />Precision<br />Lower than humans and rules<br />Ambiguous name literals lead to false positives<br />
    89. Evaluation: Self-training<br />High Recall<br />SVM + Entailment classifies 91% of references<br />High F-Measure<br />Higher than humans<br />Perceptron + Entailment and SVM + Entailment<br />
    90. Conclusions: Research Questions<br />Alleviate human processing:<br /><ul><li>Can automated techniques replace humans?</li></ul>Performance is comparable to humans<br />Suited to low web presence<br />Supervision:<br /><ul><li>Can automated techniques function independently?</li></ul>Inference Rules : Induce rules from seed data<br />Random Walks : Graph space built from models<br />Self-training : Learn + retrain a classifier<br />Seed Data:<br /><ul><li>How can this be gathered inexpensively?</li></ul>Utilise Social Web platforms<br />Digital identities are similar to real world identities<br />Interpretation:<br /><ul><li>How can automated techniques interpret information?</li></ul>Solution = Semantic Web technologies<br />Convert web resources into metadata models<br />
    91. Conclusions: Claims<br />Automated disambiguation techniques are able to replace human processing<br />Techniques are comparable to humans<br />Overcome manual processing<br />Data found on Social Web platforms is representative of real identity information<br />77% of a real world social network is covered online<br />Social data provides the background knowledge required by automated disambiguation techniques<br />Techniques function using social data<br />Biographical and social network data enables disambiguation<br />
    92. Dissemination and Impact<br />Published 21 peer-reviewed publications<br />Paper in the Journal of Web Semantics (impact: 3.5)<br />Presented work at many international conferences<br />Program committee member for 5 international workshops<br />Invited Expert for the World Wide Web Consortium’s Social Web Incubator Group<br />Listed as one of top 100 visionaries “discussing the future of the web”<br />http://www.semanticweb.com/semanticweb100/<br />Linked Data service for the DCS<br />Best Poster at the Extended Semantic Web Conference 2010<br />http://data.dcs.shef.ac.uk<br />Tools widely used by the Semantic Web community<br />FOAF Generator<br />Social Identity Schema Mapping (SISM) Vocabulary<br />
    93. Twitter: @mattroweshow<br />Web: http://www.dcs.shef.ac.uk/~mrowe<br />Email: m.rowe@dcs.shef.ac.uk<br />Questions?<br />For a condensed version of my thesis:<br />M Rowe and F Ciravegna. Disambiguating Identity Web References using Web 2.0 Data and Semantics. In Press for special issue on "Web 2.0" in the Journal of Web Semantics. (2010)<br />
