PhD Viva - Disambiguating Identity Web References using Social Data
  • Voluntary: e.g. personal web pages, blog pages. Involuntary: e.g. publication of electoral registers, people listings/aggregators (123people.co.uk). Automated techniques require background knowledge, which is expensive to produce manually (e.g. form filling) and must be accurate. This is a common problem in machine learning: [Yu, 2004] highlights the painstaking methods required to acquire labelled/seed data.
  • 1,900,000 results are returned for my name. According to the results I am: a conductor; a cyclist; the writer of the song “Wannabe” by the Spice Girls; a PhD student. And that is only the first page! It gets worse later on: a lawyer, a surfer, another PhD student.
  • From heterogeneous sources!
  • The KnowItAll system [Etzioni et al, 2005] identifies facts in web pages and populates a knowledge base. The DBpedia project [Auer et al, 2008] extracts information from Wikipedia and builds a large machine-readable knowledge base. Social Web platforms such as Facebook and MySpace provide social data sufficient to support automated techniques.
  • Seeded techniques: [Bekkerman and McCallum, 2005], where the seed data is the social network of a person and web pages are collected and clustered based on link structure. Unseeded techniques: [Song et al, 2007], which aligns person names with web page topics and clusters pages using a generative model built from a topic list. Seeded techniques suit this thesis’ problem setting: there is no need to partition web citations into k clusters, they handle a large amount of irrelevant information, and they can focus on disambiguating web references for a single person, in line with state-of-the-art approaches.
  • Places requirements on the technique + seed data
  • My solution to the problem setting: a user-centric approach, with Semantic Web technologies providing a consistent interpretation of information.
  • Web Evolution: Wikipedia = wisdom of the crowd; blogging platforms = web users sharing thoughts/opinions; the Web has become a Social Web. Digital Identity: rich functionalities let users build bespoke identities. Digital identity can be divided into 3 tiers. My Identity: persistent identity information (name, date of birth, genealogical relations). Shared Identity: social networks, friend relationships. Abstracted Identity: demographics of usage (e.g. community of practice). Identity Fragmentation: MySpace = share/discuss music; LinkedIn = make business connections. Data Portability: information is held in proprietary formats and is hard to link together. We need to solve these issues.
  • Social Web platforms maintain offline relationships [Hart et al, 2008] and reinforce existing offline relationships [Ellison et al, 2007]. The real-world network contains strong-tied relationships [Donath & Boyd, 2004]. Relevance: the ratio of strong-tied to weak-tied relationships in the digital social network. Coverage: the proportion of the real-world network that appears in the digital social network. Results: for relevance, 23% of a digital social network contains strong-tied relationships; for coverage, 77% of a participant’s real-world social network appears online. This differs from the findings of [Subrahmanyam et al, 2008], who report 49% for coverage (they define it as overlap), due to different demographics.
  • We can get digital identity information from the Social Web, and it is representative of real-world identities. We need to overcome the data portability issue and identity fragmentation. 1. Export individual social graphs using Semantic Web technologies (RDF). 2. Interlink the social graphs by identifying equivalent people, giving a more complete identity profile. Approach: 1. perform a blocking step by only comparing people with the same name; 2. detect equivalent unique identifiers; 3. reason over geographical information. This is similar to the reference reconciliation state of the art [Dong et al, 2005] in exploiting contextual information, and it produces an interlinked social graph using Linked Data principles.
  • Explain what RDF is: a graph-like model of data. Export individual social graphs from Facebook, Twitter, etc. This overcomes the data portability issues!
  • Want to detect equivalent instances!!
  • We now have our seed data! It is in machine-readable RDF, using FOAF and GeoNames.
  • Some flavours taste better to machines
  • An XHTML document contains embedded semantics and is both human- and machine-readable: a well-formed structure with lightweight markup for DOM elements, e.g. Microformats or RDF in Attributes (RDFa). XSL Transformations are specified in a document’s header, allowing the XHTML structure to be converted into RDF; GRDDL is used to lift an RDF model from a given document.
  • HTML is designed for human consumption; it is hard for machines to build a metadata model from it. HTML markup controls the arrangement and presentation of information, and formatting provides logical distinctions between pieces of information in a given HTML document.
  • 1. Tidy the HTML to avoid poorly formed tags. 2. Derive context windows using a lightweight person-name gazetteer (1 window = information about 1 person), using the HTML DOM to identify windows. 3. Extract person information using an HMM trained for the extraction: the input is a tokenized context window, and the Viterbi algorithm calculates the most likely state sequence using the parameters of the trained HMM. Pros: it uses clues in the input, and HTML tags act as delimiters to identify the sequence of states. 4. Build a metadata model from the extracted information.
  • We now have our seed data and a collection of web resources, both in RDF! Now we can pass them on to the disambiguation techniques. 3 techniques were explored: 1. rule-based; 2. graph-based; 3. semi-supervised machine learning.
  • The data base is populated with the provided seed data; the rule base holds rules built from the seed data. Rules allow a web citation to be INFERRED based on the presence of information. Seed data is limited in size, so rules are built from RDF instances (an RDF instance is a unique object, e.g. a social network member), using a general-to-specific strategy inspired by FOIL [Quinlan, 1997].
  • The seed data AND the web resources are RDF, which has a graph structure. State of the art: [Jiang et al, 2009] builds a graph-space of web pages and their features (e.g. person name, organisation) and clusters maximally-connected cliques; [On & Lee, 2007] uses feature vectors of web pages (based on named entities) as nodes, weights edges by TF/IDF similarity score, and clusters via spectral clustering of normalized cuts. No work has used existing metadata models such as RDF! Build a graph space and take the strongly connected component, as the graph contains islands of connected nodes. Traverse the graph-space using Random Walks based on a first-order Markov chain, giving the probability of moving from node i to node j given t steps. Weight edges based on ontological properties of the concepts. Measure distances between nodes with Commute Time and Optimum Transitions, and cluster root nodes based on the distance measures.
  • The adjacency matrix gives the local similarity of nodes in the space; the random walk then gives the probability of moving from one node to another given that t steps have been traversed.
  • Commute Time: if many paths exist between nodes AND those paths are short in length, THEN the commute time decreases! Optimum Transitions.
  • Kernel functions over the graph space: we now have similarity measures between nodes! The lower the distance/number of steps, the greater the similarity.
  • A common scenario: lots of unlabelled data, limited labelled data; a common problem in machine learning. Supervised = learn from labelled data. Unsupervised = use only unlabelled data, building a generative model of the data. Semi-supervised = use both labelled and unlabelled data, overcoming the limitation of insufficient labelled data. However, mistakes have a tendency to snowball: if unlabelled data is used incorrectly, mistakes reinforce themselves. Self-training trains an initial classifier, which is then retrained using its own classified instances, improving upon the original hypothesis. Generating negative training data: the classification is binary (does web resource X cite person Y or not?); the positive set is the seed data collected from the Social Web; Rocchio classification, a form of relevance feedback used for query optimisation, generates the negative examples. Begin self-training: train the classifier, apply it to the unlabelled data, enlarge the training data with the strongest classifications, retrain the classifier, and repeat until no unlabelled data remains.
  • Jaccard = strict. IFP = less strict, but requires certain properties. Entailment = allows variability.
  • SNOWBALL
  • Precision = the proportion of web resources labelled as citing a person that are correct. Recall = the proportion of identity web references that are correctly disambiguated. F-measure = the harmonic mean of precision and recall.
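    In standard form (the usual definitions, stated here only for reference):

    P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F = \frac{2 P R}{P + R}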
  • Achieves high levels of precision, outperforming humans and the other baselines: SPARQL rules require strict literal and resource matching within the triple patterns. However, this leads to poor recall levels, and the rules are unable to learn from past disambiguation decisions. At lower levels of web presence (where identity web references are sparse), rules outperform all baselines in terms of f-measure: humans find it difficult to detect sparse web references, so automating disambiguation at such levels is more suitable.
  • Achieves higher levels of recall for both distance measures than human processing. Commute Time yields higher precision levels than Optimum Transitions, due to the round-trip cost used to cluster web resources with the social graph. Performance improves as web presence levels increase: Random Walks performs better where feature sets are large in size, indicative of a large web presence. Precision levels are lower than inference rules: clustering using commute time and optimum transitions leads to an increase in false positives, as ambiguous nodes in the graph-space (e.g. a literal in a metadata model denoting a person’s name) lead to incorrect disambiguation decisions.
  • Entailment consistently achieves the highest f-measure scores for each classifier: a reduction in overfitting to the training data means it generalises well to new instances, characterised by the recall level achieved with the SVM. Two permutations outperform humans: Perceptron with Entailment and SVM with Entailment. Jaccard and IFP perform well at low levels of web presence, but their f-measure reduces as identity web references grow in number: strict feature matching leads to poor recall levels and overfitting to the training data. Self-training outperforms both Random Walks and Inference Rules for certain permutations: the direct use of disambiguation decisions allows classifiers to improve upon their initial hypothesis.
  • 17 as first author; 5 papers accepted for publication since thesis submission.

Presentation Transcript

  • Disambiguating Identity Web References using Social Data
    Matthew Rowe
    Organisations, Information and Knowledge Group
    Department of Computer Science
    University of Sheffield
  • Outline
    Problem Setting
    Research Questions
    Claims of the Thesis
    State of the Art
    Requirements for Disambiguation and Seed Data
    Disambiguating Identity Web References
    Leveraging Seed Data from the Social Web
    Generating Metadata Models
    Disambiguation Techniques
    Evaluation
    Conclusions
    Dissemination and Impact
  • Personal Information on the Web
    Personal information on the Web is disseminated:
    Voluntarily
    Involuntarily
    Increase in personal information:
    Identity Theft
    Lateral Surveillance
    Web users must discover their identity web references
    2 stage process
    Finding
    Disambiguating
    Disambiguation = reduction of web reference ambiguity
    My thesis addresses disambiguation
  • Ambiguity!
  • Matthew Rowe: Composer
  • Matthew Rowe: Cyclist
  • Matthew Rowe: Gardener
  • Matthew Rowe: Song Writer
  • Matthew Rowe: PhD Student
  • Problem Setting
    Performing disambiguation manually:
    Time consuming
    Laborious
    Handle masses of information
    Repeated often
    The Web keeps changing
    Solution = automated techniques
    Alleviate the need for humans
    Need background knowledge
    Who am I searching for?
    What makes them unique?
  • Research Questions
    How can identity web references be disambiguated automatically?
    Alleviate human processing:
    • Can automated techniques replace humans?
    Supervision:
    • Can automated techniques function independently?
    Seed Data:
    • How can this be gathered inexpensively?
    Interpretation:
    • How can automated techniques interpret information?
  • Claims of the Thesis
    • Automated disambiguation techniques are able to replace human processing
    Retrieve and process information at large-scale
    With high accuracy
    Data found on Social Web platforms is representative of real identity information
    Platforms allow users to build a digital identity
    • Social data provides the background knowledge required by automated disambiguation techniques
    Overcoming the burden of seed data generation
  • State of the Art
    Disambiguation techniques are divisible into 2 types:
    Seeded techniques
    E.g. [Bekkerman and McCallum, 2005], Commercial Services
    Pros
    Disambiguate web references for a single person
    Cons:
    Require seed data
    No explanation of how seed data is acquired
    Unseeded techniques
    E.g. [Song et al, 2007]
    Pros
    Require no background knowledge
    Cons
    Groups web references into clusters
    Need to choose the correct cluster
  • Requirements
    Requirements for Seeded Disambiguation:
    Bootstrap the disambiguation process with minimal supervision
    Achieve disambiguation accuracy comparable to human processing
    Cope with web resources not containing seed data features
    Disambiguation must be effective for all individuals
    Requirements for Seed Data:
    Produce seed data with minimal cost
    Generate reliable seed data
  • Disambiguating Identity Web References
  • Harnessing the Social Web
    WWW has evolved into a web of participation
    Digital identity is important on the Social Web
    Digital identity is fragmented across the Social Web
    Data Portability from Social Web platforms is limited
    http://www.economist.com/business/displaystory.cfm?story_id=10880936
  • Data found on Social Web platforms is representative of real identity information
  • User Study
    Data found on Social Web platforms is representative of real identity information
    50 participants from the University of Sheffield
    Consisted of 3 stages, each participant:
    List real world social network
    Extract digital social network
    Compare networks
    Relevance: 0.23
    Coverage: 0.77
    Updates previous findings
    [Subrahmanyam et al, 2008]
    M Rowe. The Credibility of Digital Identity Information on the Social Web: A User Study. In proceedings of 4th Workshop on Information Credibility on the Web, World Wide Web Conference 2010. Raleigh, USA. (2010)
  • Disambiguating Identity Web References
  • Leveraging Seed Data from the Social Web
    3. Seed Data:
    • How can this be gathered inexpensively?
  • Leveraging Seed Data from the Social Web
    Use Semantics!
    M Rowe and F Ciravegna. Getting to Me - Exporting Semantic Social Network Information from Facebook. In proceedings of Social Data on the Web Workshop, ISWC 2008, Karlsruhe, Germany. (2008)
    http://www.dcs.shef.ac.uk/~mrowe/foafgenerator.html
  • Leveraging Seed Data from the Social Web
    Link things together!
  • Leveraging Seed Data from the Social Web
    Blocking Step
    • Only compare people with the same name
    Compare values of Inverse Functional Properties
    • E.g. Homepage/Email
    Compare Geo URIs
    • E.g. Matching locations
    Compare Geo data
    • Using Linked Data sources
    M Rowe. Interlinking Distributed Social Graphs. In proceedings of Linked Data on the Web Workshop, World Wide Web Conference, Madrid, Spain. (2009)
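    As a rough illustration of the blocking and identifier-matching steps above, a minimal sketch in Python with rdflib; the file names and the exact property list are assumptions, and the geo reasoning step is omitted:

    # Sketch: interlink exported social graphs via inverse functional properties.
    from rdflib import Graph
    from rdflib.namespace import FOAF, OWL

    g = Graph()
    g.parse("facebook_graph.rdf")   # hypothetical exported social graphs
    g.parse("twitter_graph.rdf")

    # FOAF marks properties such as foaf:mbox and foaf:homepage as
    # owl:InverseFunctionalProperty: sharing a value implies the same person.
    IFPS = [FOAF.mbox, FOAF.mbox_sha1sum, FOAF.homepage]

    # 1. Blocking step: only compare people that share a foaf:name
    people_by_name = {}
    for person, name in g.subject_objects(FOAF.name):
        people_by_name.setdefault(str(name), []).append(person)

    # 2. Detect equivalent unique identifiers within each block
    for name, people in people_by_name.items():
        for i, a in enumerate(people):
            for b in people[i + 1:]:
                for ifp in IFPS:
                    if set(g.objects(a, ifp)) & set(g.objects(b, ifp)):
                        g.add((a, OWL.sameAs, b))  # link equivalent people
                        break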
  • Leveraging Seed Data from the Social Web
    Allows remote resource information to change
    Automated techniques:
    Follow the links
    Retrieve the instance information
  • Disambiguating Identity Web References
  • Generating Metadata Models
    Input to disambiguation techniques is a set of web resources
    Web resources come in many flavours:
    Data models
    XHTML documents containing embedded semantics
    HTML documents
    4. Interpretation:
    How can automated techniques interpret information?
    Solution = Semantic Web technologies!
    Convert web resources to RDF
    Metadata descriptions = ontology concepts
    Information is
    Consistent
    Interpretable
  • Generating RDF Models from XHTML Documents
    http://events.linkeddata.org/ldow2009/
  • Generating RDF Models from XHTML Documents
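    The lifting step can be sketched as a minimal GRDDL-style pass in Python, assuming lxml is available and the XHTML document names its transformation via a link with rel="transformation" in its head:

    # Sketch: lift an RDF model from an XHTML document via its declared XSLT.
    from urllib.request import urlopen
    from lxml import etree

    XHTML = {"x": "http://www.w3.org/1999/xhtml"}

    def grddl_lift(url):
        """Apply the XSL transformation(s) named in the document head."""
        doc = etree.parse(urlopen(url))
        hrefs = doc.xpath('//x:head/x:link[@rel="transformation"]/@href',
                          namespaces=XHTML)
        for href in hrefs:
            xslt = etree.XSLT(etree.parse(urlopen(href)))
            yield str(xslt(doc))  # the lifted model, serialised (e.g. RDF/XML)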
  • Generating RDF Models from HTML Documents
    Rise in use of lowercase semantics!
    However, only 2.6% of web documents contain semantics
    [Mika et al, 2009]
    Majority of the web is HTML
    Bad for machines
    Must extract person information
    Then build an RDF model
    Person information is structured
    for legibility
    for segmentation
    i.e. logical distinction between elements
  • Generating RDF Models from HTML Documents
  • Generating RDF Models from HTML Documents
    • HTML is often poorly structured
    • Need a Document Object Model
    • Therefore Tidy it!
  • Generating RDF Models from HTML Documents
    • Identify document segments for extraction
    • 1 window = Info about 1 person
    • Get the XPath expression to the window
  • Generating RDF Models from HTML Documents
    • Extract information using a Hidden Markov Model
    • E.g. name, email, www, location
    • Train model parameters: Transition probs, emission probs, start probs
    • Use Viterbi algorithm to label tokens with states
    • Returns most likely state sequence
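    The decoding step can be sketched as a textbook Viterbi pass; the states and probability tables below are illustrative stand-ins for the trained HMM parameters:

    # Sketch: Viterbi decoding of a tokenized context window.
    import math

    def viterbi(tokens, states, start_p, trans_p, emit_p):
        """Return the most likely state sequence for the token sequence."""
        logp = lambda x: math.log(x if x > 0 else 1e-12)  # smoothed log-prob
        V = [{s: logp(start_p[s]) + logp(emit_p[s].get(tokens[0], 0))
              for s in states}]
        back = [{}]
        for t in range(1, len(tokens)):
            V.append({})
            back.append({})
            for s in states:
                prev = max(states, key=lambda p: V[t - 1][p] + logp(trans_p[p][s]))
                V[t][s] = (V[t - 1][prev] + logp(trans_p[prev][s])
                           + logp(emit_p[s].get(tokens[t], 0)))
                back[t][s] = prev
        state = max(V[-1], key=V[-1].get)        # best final state
        path = [state]
        for t in range(len(tokens) - 1, 0, -1):  # trace back
            state = back[t][state]
            path.insert(0, state)
        return path  # e.g. ["name", "name", "other", "email"]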
  • Generating RDF Models from HTML Documents
    M Rowe. Data.dcs: Converting Legacy Data into Linked Data. In proceedings of Linked Data on the Web Workshop, World Wide Web Conference 2010. Raleigh, USA. (2010)
  • Disambiguating Identity Web References
  • Disambiguation 1: Inference Rules
    1. Extract instances from Seed Data
    2. For each instance, build a rule:
    • Build a skeleton rule
    • Add triples to the rule
    • Create a new rule if a triple’s predicate is Inverse Functional
    3. Apply the rules to the web resources
  • Disambiguation 1: Inference Rules
    PREFIX foaf:<http://xmlns.com/foaf/0.1/>
    CONSTRUCT { <http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#me> foaf:page ?url }
    WHERE {
    <http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#me> foaf:name ?n .
    ?url foaf:topic ?p .
    ?p foaf:name ?n .
    <http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#me> foaf:knows ?q .
    ?q foaf:name ?m .
    ?url foaf:topic ?r .
    ?r foaf:name ?m
    }
    1. Extract instances
    2. For each instance, build a rule:
    • Build a skeleton rule
    • Add triples to the rule
    • Create a new rule if a triple’s predicate is Inverse Functional
    3. Apply the rules to the web resources
  • Disambiguation 1: Inference Rules
    PREFIX foaf:<http://xmlns.com/foaf/0.1/>
    CONSTRUCT { <http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#me> foaf:page ?url }
    WHERE {
    <http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#me> foaf:name ?n .
    ?url foaf:topic ?p .
    ?p foaf:name ?n .
    <http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#me> foaf:knows ?q .
    ?q foaf:homepage ?h .
    ?url foaf:topic ?r .
    ?r foaf:homepage ?h
    }
    1. Extract instances
    2. For each instance, build a rule:
    • Build a skeleton rule
    • Add triples to the rule
    • Create a new rule if a triple’s predicate is Inverse Functional
    3. Apply the rules to the web resources
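    Applying the rules can be sketched with rdflib; the file names are placeholders, the database holds the seed data plus the web resources' metadata models, and each rule is a CONSTRUCT query like those above:

    # Sketch: apply a generated CONSTRUCT rule over the populated data base.
    from rdflib import Graph

    db = Graph()
    db.parse("seed_data.rdf")        # social graph leveraged from the Social Web
    db.parse("web_resources.rdf")    # metadata models of candidate resources

    rule = open("rule.sparql").read()  # e.g. the foaf:knows rule shown above
    result = db.query(rule)

    # For a CONSTRUCT query rdflib exposes the inferred triples as a graph;
    # each inferred foaf:page triple marks a disambiguated web citation.
    for s, p, o in result.graph:
        db.add((s, p, o))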
  • Disambiguation 1: Inference Rules
    Advantages:
    • Highly precise
    • Applies graph patterns
    Disadvantages:
    • Does not learn from past decisions (supervised)
    • Strict matching: lack of generalisation
    M Rowe. Inferring Web Citations using Social Data and SPARQL Rules. In proceedings of Linking of User Profiles and Applications in the Social Semantic Web, Extended Semantic Web Conference 2010. Heraklion, Crete. (2010)
  • Disambiguation 2: Random Walks
    Seed data and web resources are RDF
    RDF has a graph structure:
    <subject, predicate, object>
    <source_node, edge, target_node>
    Graph-based disambiguation techniques:
    E.g. [Jiang et al, 2009]
    Build a graph-space
    Partition data points in the graph-space
    Requires methods to:
    Compile a graph-space
    Compare nodes
    Cluster nodes
  • Disambiguation 2: Random Walks
    • Link the social graph with the web resources
    • Via common resources/literals
  • Disambiguation 2: Random Walks
  • Disambiguation 2: Random Walks
    • Graph space may contain islands of nodes
    • Inhibit transitions through the graph space
    • Get the component containing the social graph
  • Disambiguation 2: Random Walks
    • Perform Random Walks through the graph
    Derive Adjacency Matrix
    Derive Diagonal Degree Matrix
    Compute Transition Probability Matrix
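    The three matrix steps, as a numpy sketch over a toy weighted graph:

    # Sketch: random-walk matrices for a small graph-space.
    import numpy as np

    A = np.array([[0., 1., 1.],   # adjacency matrix: edge weights between nodes
                  [1., 0., 0.],   # (weights reflect the ontological properties
                  [1., 0., 0.]])  # of the linking concepts)

    D = np.diag(A.sum(axis=1))          # diagonal degree matrix
    P = np.linalg.inv(D) @ A            # transition probability matrix, D^-1 A
    P_t = np.linalg.matrix_power(P, 3)  # probability of i -> j in t = 3 steps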
  • Disambiguation 2: Random Walks
    • Measure Distances:
    • Commute Time distance
    • Leave node i : reach node j : return to node i
    • Optimum Transitions
    • Move through the graph until probability peaks
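    Continuing the sketch above, Commute Time has a standard spectral form via the pseudoinverse of the graph Laplacian (assuming the graph is the connected component extracted earlier):

    # Sketch: commute time distance between nodes i and j.
    L = D - A                      # graph Laplacian
    L_pinv = np.linalg.pinv(L)     # Moore-Penrose pseudoinverse
    vol = A.sum()                  # volume of the graph

    def commute_time(i, j):
        # Small when many short paths make the round trip i -> j -> i cheap
        return vol * (L_pinv[i, i] + L_pinv[j, j] - 2 * L_pinv[i, j])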
  • Disambiguation 2: Random Walks
    • Group web resources with social graph
    • Via agglomerative clustering
    • Every point is in a cluster
    • Merge clusters until none can be merged
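    The merging loop can be sketched as threshold-based single-linkage agglomeration over those distances; the merge threshold is the tuning parameter noted under the disadvantages below:

    # Sketch: agglomerative clustering until no clusters can be merged.
    def agglomerate(nodes, dist, threshold):
        clusters = [{n} for n in nodes]   # every point starts in its own cluster
        merged = True
        while merged:
            merged = False
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    d = min(dist(a, b)    # single-linkage: closest members
                            for a in clusters[i] for b in clusters[j])
                    if d < threshold:
                        clusters[i] |= clusters.pop(j)
                        merged = True
                        break
                if merged:
                    break
        return clusters

    # e.g. agglomerate(range(len(A)), commute_time, threshold=2.5)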
  • Disambiguation 2: Random Walks
    Advantages:
    • Semi-supervised
    • Exploits the graph structure of RDF
    Disadvantages:
    • Computationally heavy (Matrix powers!)
    • Relies on tuning clustering threshold
    M Rowe. Applying Semantic Social Graphs to Disambiguate Identity References. In proceedings of European Semantic Web Conference 2009, Heraklion, Crete. (2009)
  • Disambiguation 3: Self-training
    Classic ML scenario:
    Lots of unlabelled data
    Limited labelled data
    Disambiguating identity web references is just the same!
    Possible web citations = large
    Social data = small
    Semi-supervised learning is a solution
    Train a classifier
    Using labelled and unlabelled data!
    Classification task is binary
    Does this web resource refer to person X or not?
  • Positive training data = seed data
    Generate negative training data:
    Via Rocchio classification:
    Build centroid vectors: positive set and negative set
    Negative set = unlabelled data
    Compare possible web citations with vectors
    Choose strongest negatives
    Disambiguation 3: Self-training
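    A sketch of the negative selection, assuming the resources have already been vectorised (e.g. TF/IDF over their literals):

    # Sketch: Rocchio-style selection of the strongest negatives.
    import numpy as np

    def strongest_negatives(positive, unlabelled, k):
        """positive, unlabelled: row-per-resource feature matrices."""
        pos_c = positive.mean(axis=0)    # centroid of the seed (positive) set
        neg_c = unlabelled.mean(axis=0)  # centroid of the unlabelled set

        def cos(u, v):
            return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

        # Strong negatives sit much closer to the negative centroid
        margin = np.array([cos(d, neg_c) - cos(d, pos_c) for d in unlabelled])
        return unlabelled[np.argsort(-margin)[:k]]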
  • Begin Self-training:
    Train the Classifier
    Classify the web resources
    Rank classifications
    Enlarge training sets
    Repeat steps 1-4
    Disambiguation 3: Self-training
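    The loop itself is compact; a sketch around any classifier exposing fit/predict_proba, where the feature matrices are numpy arrays and the confidence threshold is an assumption:

    # Sketch: self-training until no unlabelled data remains.
    import numpy as np

    def self_train(clf, X_lab, y_lab, X_unlab, threshold=0.9):
        while len(X_unlab) > 0:
            clf.fit(X_lab, y_lab)                  # 1. train the classifier
            probs = clf.predict_proba(X_unlab)     # 2. classify web resources
            conf = probs.max(axis=1)               # 3. rank classifications
            take = conf >= threshold
            if not take.any():
                take = conf == conf.max()          # fall back to the single best
            X_lab = np.vstack([X_lab, X_unlab[take]])            # 4. enlarge
            y_lab = np.concatenate([y_lab, probs[take].argmax(axis=1)])
            X_unlab = X_unlab[~take]               # 5. repeat steps 1-4
        clf.fit(X_lab, y_lab)
        return clf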
  • Training/Testing data is RDF
    Convert to a machine learning dataset
    Features = RDF instances
    Vary the feature similarity measure:
    Jaccard Similarity
    Inverse Functional Property Matching
    RDF Entailment
    Tested three different classifiers:
    Perceptron
    Support Vector Machine
    Naïve Bayes
    Disambiguation 3: Self-training
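    The first two similarity measures are simple to state over sets of feature values; entailment is looser and is only described here:

    # Sketch: feature similarity measures over sets of RDF values.
    def jaccard(a, b):
        """Strict set overlap between two instances' feature values."""
        return len(a & b) / len(a | b) if a | b else 0.0

    def ifp_match(a, b, ifp_values):
        """Match only on values of inverse functional properties (e.g. mbox)."""
        return bool((a & b) & ifp_values)

    # RDF entailment is looser still: two features match when one instance's
    # description can be entailed from the other's, allowing variability.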
  • Advantages
    Directly learn from disambiguation decisions
    Utilise abundance of unlabelled data
    Disadvantages
    Requires reliable negatives
    Mistakes can reinforce themselves
    M Rowe and F Ciravegna. Harnessing the Social Web: The Science of Identity Disambiguation. In proceedings of Web Science Conference 2010. Raleigh, USA. (2010)
    Disambiguation 3: Self-training
  • Evaluation
    Measures:
    Precision, Recall, F-Measure
    Dataset
    50 participants from the Semantic Web and Web 2.0 communities
    ~17,300 web resources: 346 for each participant
    Baselines
    Baseline 1: Person name as positive classification
    Baseline 2: Hierarchical Clustering using Person Names
    [Malin, 2005]
    Baseline 3: Human Processing
  • Evaluation: Inference Rules
    High precision
    Better than humans
    Precise graph pattern matching
    Low recall
    Rules are strict
    No room for variability
    Hard to generalise
    No learning from disambiguation decisions
  • Evaluation: Random Walks
    High recall
    Higher than humans
    Incorporates unlabelled data into random walks
    Uses features not in the seed data
    Precision
    Lower than humans and rules
    Ambiguous name literals lead to false positives
  • Evaluation: Self-training
    High Recall
    SVM + Entailment classifies 91% of references
    High F-Measure
    Higher than humans
    Perceptron + Entailment and SVM + Entailment
  • Conclusions: Research Questions
    Alleviate human processing:
    • Can automated techniques replace humans?
    Performance is comparable to humans
    Suited to low web presence
    Supervision:
    • Can automated techniques function independently?
    Inference Rules : Induce rules from seed data
    Random Walks : Graph space built from models
    Self-training : Learn + retrain a classifier
    Seed Data:
    • How can this be gathered inexpensively?
    Utilise Social Web platforms
    Digital identities are similar to real world identities
    Interpretation:
    • How can automated techniques interpret information?
    Solution = Semantic Web technologies
    Convert web resources into metadata models
  • Conclusions: Claims
    Automated disambiguation techniques are able to replace human processing
    Techniques are comparable to humans
    Overcome manual processing
    Data found on Social Web platforms is representative of real identity information
    77% of a real world social network is covered online
    Social data provides the background knowledge required by automated disambiguation techniques
    Techniques function using social data
    Biographical and social network information enables disambiguation
  • Dissemination and Impact
    Published 21 peer-reviewed publications
    Paper in the Journal of Web Semantics (impact: 3.5)
    Presented work at many international conferences
    Program committee member for 5 international workshops
    Invited Expert for the World Wide Web Consortium’s Social Web Incubator Group
    Listed as one of the top 100 visionaries “discussing the future of the web”
    http://www.semanticweb.com/semanticweb100/
    Linked Data service for the DCS
    Best Poster at the Extended Semantic Web Conference 2010
    http://data.dcs.shef.ac.uk
    Tools widely used by the Semantic Web community
    FOAF Generator
    Social Identity Schema Mapping (SISM) Vocabulary
  • Twitter: @mattroweshow
    Web: http://www.dcs.shef.ac.uk/~mrowe
    Email: m.rowe@dcs.shef.ac.uk
    Questions?
    For a condensed version of my thesis:
    M Rowe and F Ciravegna. Disambiguating Identity Web References using Web 2.0 Data and Semantics. In Press for special issue on "Web 2.0" in the Journal of Web Semantics. (2010)