This document summarizes Matthew Rowe's PhD thesis on automatically disambiguating identity web references. The thesis claims that automated techniques can replace humans in performing disambiguation at scale and with high accuracy by leveraging seed data from social web platforms. It outlines three disambiguation techniques evaluated in the thesis: 1) inference rules, 2) random walks on graphs, and 3) self-training classifiers. Evaluation results show the self-training approach achieves recall comparable to humans while the rule-based approach has the highest precision. The document also discusses generating metadata models from web resources and requirements for seed data and disambiguation techniques.
Automated Disambiguation of Identity Web References Using Social Data
1. Disambiguating Identity Web References using Social Data Matthew Rowe Organisations, Information and Knowledge Group Department of Computer Science University of Sheffield
2. Outline Problem Setting Research Questions Claims of the Thesis State of the Art Requirements for Disambiguation and Seed Data Disambiguating Identity Web References Leveraging Seed Data from the Social Web Generating Metadata Models Disambiguation Techniques Evaluation Conclusions Dissemination and Impact
3. Personal Information on the Web Personal information on the Web is disseminated: Voluntarily Involuntarily Increase in personal information: Identity Theft Lateral Surveillance Web users must discover their identity web references 2 stage process Finding Disambiguating Disambiguation = reduction of web reference ambiguity My thesis addresses disambiguation
10. Problem Setting Performing disambiguation manually: Time consuming Laborious Handle masses of information Repeated often The Web keeps changing Solution = automated techniques Alleviate the need for humans Need background knowledge Who am I searching for? What makes them unique?
12. State of the Art Disambiguation techniques are divisible into 2 types: Seeded techniques E.g. [Bekkerman and McCallum, 2005], Commercial Services Pros Disambiguate web references for a single person Cons: Require seed data No explanation of how seed data is acquired Unseeded techniques E.g. [Song et al, 2007] Pros Require no background knowledge Cons Groups web references into clusters Need to choose the correct cluster
13. Requirements Requirements for Seeded Disambiguation: Bootstrap the disambiguation process with minimal supervision Achieve disambiguation accuracy comparable to human processing Cope with web resources not containing seed data features Disambiguation must be effective for all individuals Requirements for Seed Data: Produce seed data with minimal cost Generate reliable seed data
15. Harnessing the Social Web WWW has evolved into a web of participation Digital identity is important on the Social Web Digital identity is fragmented across the Social Web Data Portability from Social Web platforms is limited http://www.economist.com/business/displaystory.cfm?story_id=10880936
16. Data found on Social Web platforms is representative of real identity information
17. User Study Data found on Social Web platforms is representative of real identity information. 50 participants from the University of Sheffield. Consisted of 3 stages; each participant: 1) lists their real-world social network, 2) extracts their digital social network, 3) compares the networks. Relevance: 0.23. Coverage: 0.77. Updates previous findings [Subrahmanyam et al, 2008]. M Rowe. The Credibility of Digital Identity Information on the Social Web: A User Study. In proceedings of 4th Workshop on Information Credibility on the Web, World Wide Web Conference 2010. Raleigh, USA. (2010)
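The two user-study measures reduce to simple set operations; a minimal sketch, assuming both networks are plain sets of contact names (the contacts and the strong-tied subset below are hypothetical placeholders):

```python
# Relevance and coverage of a digital social network vs. a real-world one.
# A minimal sketch: both networks are modelled as plain sets of contact names.

def relevance(digital, strong_ties):
    """Proportion of the digital network made up of strong-tied contacts."""
    return len(digital & strong_ties) / len(digital) if digital else 0.0

def coverage(real_world, digital):
    """Proportion of the real-world network that appears online."""
    return len(real_world & digital) / len(real_world) if real_world else 0.0

real_world = {"alice", "bob", "carol", "dave"}          # offline contacts
digital = {"alice", "bob", "carol", "eve", "mallory"}   # online contacts
strong_ties = {"alice"}                                 # strong-tied subset

print(coverage(real_world, digital))    # 3 of 4 offline contacts found online
print(relevance(digital, strong_ties))  # 1 of 5 online contacts is strong-tied
```

With the study's definitions, coverage of 0.77 means roughly three quarters of a participant's offline network was recoverable online, while relevance of 0.23 means most digital contacts were weak ties.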
22. Leveraging Seed Data from the Social Web Allows remote resource information to change Automated techniques: Follow the links Retrieve the instance information
24. Generating Metadata Models Input to disambiguation techniques is a set of web resources Web resources come in many flavours: Data models XHTML documents containing embedded semantics HTML documents 4. Interpretation: How can automated techniques interpret information? Solution = Semantic Web technologies! Convert web resources to RDF Metadata descriptions = ontology concepts Information is Consistent Interpretable
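As a toy illustration of what a metadata model looks like, a person description can be held as a set of (subject, predicate, object) triples. Plain tuples stand in for a proper RDF library such as rdflib, and the URIs and values are hypothetical:

```python
# A minimal sketch of a metadata model as RDF-style triples using plain tuples.
# In practice an RDF library (e.g. rdflib) and ontology concepts (FOAF) are used.
FOAF = "http://xmlns.com/foaf/0.1/"

def person_to_rdf(uri, name, homepage=None, knows=()):
    """Build (subject, predicate, object) triples describing a person."""
    triples = {(uri, FOAF + "name", name)}
    if homepage:
        triples.add((uri, FOAF + "homepage", homepage))
    for friend in knows:
        triples.add((uri, FOAF + "knows", friend))
    return triples

model = person_to_rdf("#me", "Matthew Rowe", knows=("#p1",))
```

Whatever the source format (RDF, XHTML with embedded semantics, plain HTML), everything downstream consumes this one consistent, interpretable representation.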
27. Generating RDF Models from HTML Documents Rise in use of lowercase semantics! However, only 2.6% of web documents contain semantics [Mika et al, 2009]. The majority of the web is HTML — bad for machines. Must extract person information, then build an RDF model. Person information is structured for legibility, which allows segmentation, i.e. a logical distinction between elements.
53. 2. Create a new rule if a triple’s predicate is Inverse Functional. 3. Apply the rules.
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
CONSTRUCT { <http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#me> foaf:page ?url }
WHERE {
  <http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#me> foaf:name ?n .
  ?url foaf:topic ?p .
  ?p foaf:name ?n .
  <http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#me> foaf:knows ?q .
  ?q foaf:homepage ?h .
  ?url foaf:topic ?r .
  ?r foaf:homepage ?h
}
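The name-matching graph pattern of such a rule can be mirrored in plain Python over a tuple-based triple store. A sketch only (the thesis executes SPARQL CONSTRUCT queries over RDF; the data below is hypothetical):

```python
# Sketch of rule-based inference: infer <#me> foaf:page ?url whenever ?url has
# a topic whose foaf:name matches the seed person's name. Plain tuples stand in
# for an RDF store; in practice this is a SPARQL CONSTRUCT query.
FOAF = "http://xmlns.com/foaf/0.1/"
ME = "#me"

def infer_pages(triples):
    names = {o for s, p, o in triples if s == ME and p == FOAF + "name"}
    inferred = set()
    for url, p1, topic in triples:
        if p1 == FOAF + "topic":
            for s, p2, name in triples:
                if s == topic and p2 == FOAF + "name" and name in names:
                    inferred.add((ME, FOAF + "page", url))
    return inferred

data = {
    (ME, FOAF + "name", "Matthew Rowe"),
    ("http://example.org/page", FOAF + "topic", "_:p"),
    ("_:p", FOAF + "name", "Matthew Rowe"),
}
inferred = infer_pages(data)  # the page is inferred as a web citation
```

The exact-match condition (`name in names`) is precisely what makes this family of rules precise but brittle, as the evaluation notes.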
56. Strict matching: lack of generalisation. M Rowe. Inferring Web Citations using Social Data and SPARQL Rules. In proceedings of Linking of User Profiles and Applications in the Social Semantic Web, Extended Semantic Web Conference 2010. Heraklion, Crete. (2010)
57. Disambiguation 2: Random Walks Seed data and web resources are RDF RDF has a graph structure: <subject, predicate, object> <source_node, edge, target_node> Graph-based disambiguation techniques: E.g. [Jiang et al, 2009] Build a graph-space Partition data points in the graph-space Requires methods to: Compile a graph-space Compare nodes Cluster nodes
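The walk itself reduces to powers of a row-stochastic transition matrix. A small numpy sketch over a hypothetical 4-node graph-space with uniform edge weights (the thesis instead weights edges by the ontological properties of the concepts):

```python
import numpy as np

# A first-order random walk over a small graph-space: normalise the adjacency
# matrix row-wise, then raise it to the power t for t-step probabilities.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

P = A / A.sum(axis=1, keepdims=True)   # row-stochastic transition matrix
P_t = np.linalg.matrix_power(P, 3)     # P_t[i, j] = prob. of i -> j in t = 3 steps
```

Each row of `P_t` remains a probability distribution, which is what the distance measures (commute time, optimum transitions) are later derived from.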
77. Relies on tuning clustering threshold. M Rowe. Applying Semantic Social Graphs to Disambiguate Identity References. In proceedings of European Semantic Web Conference 2009, Heraklion, Crete. (2009)
78. Disambiguation 3: Self-training Classic ML scenario: Lots of unlabelled data Limited labelled data Disambiguating identity web references is just the same! Possible web citations = large Social data = small Semi-supervised learning is a solution Train a classifier Using labelled and unlabelled data! Classification task is binary Does this web resource refer to person X or not?
79. Positive training data = seed data Generate negative training data: Via Rocchio classification: Build centroid vectors: positive set and negative set Negative set = unlabelled data Compare possible web citations with vectors Choose strongest negatives Disambiguation 3: Self-training
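The Rocchio step can be sketched with centroid vectors: unlabelled resources closest to the unlabelled centroid and farthest from the positive centroid are chosen as strong negatives. The term-frequency feature vectors below are hypothetical:

```python
import numpy as np

# Sketch of Rocchio-style negative selection: build a centroid for the positive
# (seed) set and one for the unlabelled pool, then rank unlabelled resources by
# how much more they resemble the unlabelled centroid than the positive one.

def strong_negatives(positives, unlabelled, k=1):
    pos_c = positives.mean(axis=0)
    neg_c = unlabelled.mean(axis=0)
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    scores = [cos(u, neg_c) - cos(u, pos_c) for u in unlabelled]
    order = np.argsort(scores)[::-1]      # most negative-looking first
    return order[:k]

positives = np.array([[1.0, 0.0], [0.9, 0.1]])
unlabelled = np.array([[0.95, 0.05], [0.0, 1.0], [0.1, 0.9]])
picked = strong_negatives(positives, unlabelled, k=1)
```

Here the second unlabelled vector, which looks nothing like the seed data, is selected as the strongest negative.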
83. Begin Self-training: Train the Classifier Classify the web resources Rank classifications Enlarge training sets Repeat steps 1-4 Disambiguation 3: Self-training
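The self-training loop (steps 1-4 above) can be sketched with a nearest-centroid classifier standing in for the perceptron/SVM/Naive Bayes used in the thesis; features and the initial labelled examples are hypothetical:

```python
import numpy as np

# Sketch of self-training: 1. train, 2. classify the pool, 3. rank decisions,
# 4. enlarge the training sets with the strongest decision, then repeat.

def self_train(pos, neg, unlabelled):
    pos, neg = list(pos), list(neg)
    pool = list(unlabelled)
    while pool:
        pos_c, neg_c = np.mean(pos, axis=0), np.mean(neg, axis=0)
        # classification margin: positive if closer to the positive centroid
        margins = [np.linalg.norm(u - neg_c) - np.linalg.norm(u - pos_c)
                   for u in pool]
        best = int(np.argmax(np.abs(margins)))   # strongest decision first
        (pos if margins[best] > 0 else neg).append(pool.pop(best))
    return np.array(pos), np.array(neg)

pos, neg = self_train([[1.0, 1.0]], [[0.0, 0.0]],
                      [[0.9, 0.8], [0.1, 0.2], [0.6, 0.6]])
```

Taking the most confident classification at each round is what keeps early mistakes from snowballing — though, as noted below, it cannot eliminate that risk.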
84. Training/Testing data is RDF Convert to a machine learning dataset Features = RDF instances Vary the feature similarity measure: Jaccard Similarity Inverse Functional Property Matching RDF Entailment Tested three different classifiers: Perceptron Support Vector Machine Naïve Bayes Disambiguation 3: Self-training
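Of the three feature-similarity measures, Jaccard is the simplest to sketch; the IFP-matching and entailment variants additionally need schema knowledge, so only Jaccard is shown, with hypothetical feature values:

```python
# Sketch of the Jaccard similarity measure over two sets of RDF feature values.

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| over two sets of feature values."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

seed = {"Matthew Rowe", "University of Sheffield"}
resource = {"Matthew Rowe", "Spice Girls"}
score = jaccard(seed, resource)   # 1 shared value of 3 distinct
```

Because only exact value overlaps count, Jaccard is the strictest of the three measures, which explains its overfitting behaviour in the evaluation.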
85. Advantages Directly learn from disambiguation decisions Utilise abundance of unlabelled data Disadvantages Requires reliable negatives Mistakes can reinforce themselves M Rowe and F Ciravegna. Harnessing the Social Web: The Science of Identity Disambiguation. In proceedings of Web Science Conference 2010. Raleigh, USA. (2010) Disambiguation 3: Self-training
86. Evaluation Measures: Precision, Recall, F-Measure Dataset 50 participants from the Semantic Web and Web 2.0 communities ~17300 web resources: 346 web resources for each participant Baselines Baseline 1: Person name as positive classification Baseline 2: Hierarchical Clustering using Person Names [Malin, 2005] Baseline 3: Human Processing
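The three measures over binary disambiguation decisions can be sketched as follows; the gold and predicted label sets are hypothetical:

```python
# Sketch of the evaluation measures: precision, recall and F-measure computed
# from the set of true identity web references vs. the set labelled as citations.

def precision_recall_f1(gold, predicted):
    tp = len(gold & predicted)                      # true positives
    p = tp / len(predicted) if predicted else 0.0   # precision
    r = tp / len(gold) if gold else 0.0             # recall
    f = 2 * p * r / (p + r) if p + r else 0.0       # harmonic mean
    return p, r, f

gold = {"r1", "r2", "r3", "r4"}        # true identity web references
predicted = {"r1", "r2", "r5"}         # resources labelled as citations
p, r, f = precision_recall_f1(gold, predicted)
```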
87. Evaluation: Inference Rules High precision Better than humans Precise graph pattern matching Low recall Rules are strict No room for variability Hard to generalise No learning from disambiguation decisions
88. Evaluation: Random Walks High recall Higher than humans Incorporates unlabelled data into random walks Uses features not in the seed data Precision Lower than humans and rules Ambiguous name literals lead to false positives
89. Evaluation: Self-training High Recall SVM + Entailment classifies 91% of references High F-Measure Higher than humans Perceptron + Entailment and SVM + Entailment
91. Conclusions: Claims Automated disambiguation techniques are able to replace human processing: techniques are comparable to humans and overcome the limitations of manual processing. Data found on Social Web platforms is representative of real identity information: 77% of a real-world social network is covered online. Social data provides the background knowledge required by automated disambiguation techniques: the techniques function using social data; biographical and social network information enables disambiguation.
92. Dissemination and Impact Published 21 peer-reviewed publications Paper in the Journal of Web Semantics (impact: 3.5) Presented work at many international conferences Program committee member for 5 international workshops Invited Expert for the World Wide Web Consortium’s Social Web Incubator Group Listed as one of top 100 visionaries “discussing the future of the web” http://www.semanticweb.com/semanticweb100/ Linked Data service for the DCS Best Poster at the Extended Semantic Web Conference 2010 http://data.dcs.shef.ac.uk Tools widely used by the Semantic Web community FOAF Generator Social Identity Schema Mapping (SISM) Vocabulary
93. Twitter: @mattroweshow Web: http://www.dcs.shef.ac.uk/~mrowe Email: m.rowe@dcs.shef.ac.uk Questions? For a condensed version of my thesis: M Rowe and F Ciravegna. Disambiguating Identity Web References using Web 2.0 Data and Semantics. In Press for special issue on "Web 2.0" in the Journal of Web Semantics. (2010)
Editor's Notes
Voluntary: e.g. personal web pages, blog pages. Involuntary: e.g. publication of electoral registers, people listings/aggregators (123people.co.uk). Automated techniques require background knowledge! Expensive to produce manually (e.g. form filling). Must be accurate. A common problem in machine learning! [Yu, 2004] highlights the painstaking methods required to acquire labelled/seed data.
1,580,000 results returned for my name. I am: a conductor; a cyclist; wrote the song “Wannabe” by the Spice Girls; a PhD student. That is only the first page! It gets worse later on: lawyer, surfer, another PhD student.
1,900,000 results returned for my name. I am: a conductor; a cyclist; wrote the song “Wannabe” by the Spice Girls; a PhD student. That is only the first page! It gets worse later on: lawyer, surfer, another PhD student.
From heterogeneous sources!
KnowItAll system [Etzioni et al, 2005]: identifies facts in web pages and populates a knowledge base. DBpedia project [Auer et al, 2008]: extracts information from Wikipedia and builds a large machine-readable knowledge base. Social Web platforms such as Facebook and MySpace: social data sufficient to support automated techniques.
Seeded techniques: [Bekkerman and McCallum, 2005]. Seed data = the social network of a person; web pages are collected and clustered based on link structures. Unseeded techniques: [Song et al, 2007]. Aligns person names with web page topics; clusters pages via a generative model built from a topic list. Seeded techniques suit this thesis’ problem setting: no need to partition web citations into k clusters or handle a large amount of irrelevant information; able to focus on disambiguating web references for a single person, in line with state-of-the-art approaches.
Places requirements on the technique + seed data
My solution to the problem settingUser-centric approachSemantic Web technologies = consistent interpretation of information
Web evolution: Wikipedia = wisdom of the crowd; blogging platforms = web users share thoughts/opinions; the Web = a Social Web. Digital identity: rich functionalities; users build bespoke identities. Digital identity can be divided into 3 tiers. My Identity: persistent identity information (name, date of birth, genealogical relations). Shared Identity: social networks, friend relationships. Abstracted Identity: demographics of usage (e.g. community of practice). Identity fragmentation: MySpace = share/discuss music; LinkedIn = make business connections. Data portability: info in proprietary formats, hard to link together. Need to solve these issues.
Social Web platforms maintain offline relationships [Hart et al, 2008] and reinforce existing offline relationships [Ellison et al, 2007]. A real-world network contains strong-tied relationships [Donath & Boyd, 2004]. Relevance = ratio of strong-tied to weak-tied relationships in the digital social network. Coverage = proportion of the real-world network in the digital social network. Results: relevance — 23% of a digital social network contains strong-tied relationships; coverage — 77% of a participant’s real-world social network appears online. Different from findings by [Subrahmanyam et al, 2008]: 49% for coverage (they define it as overlap), due to different demographics.
Can get digital identity info from the Social Web — representative of real-world identities. Need to overcome the data portability issue and identity fragmentation. 1. Export individual social graphs using Semantic Web technologies: RDF. 2. Interlink social graphs: identify equivalent people for a more complete identity profile. Approach: 1. perform a blocking step by only comparing people with the same name; 2. detect equivalent unique identifiers; 3. reason over geographical information. Similar to the reference reconciliation state of the art [Dong et al, 2005]: exploiting contextual information. Produces an interlinked social graph using Linked Data principles.
Explain what RDF is! A graph-like model of data. Export individual social graphs from Facebook, Twitter, etc. — overcomes data portability issues!
Want to detect equivalent instances!!
We now have our Seed data! It is in machine readable RDF Using FOAF + Geonames
Some flavours taste better to machines
XHTML documents contain embedded semantics: both human- and machine-readable; well-formed structure with lightweight markup for DOM elements, e.g. Microformats, RDF in Attributes (RDFa). XSL Transformations are specified in a document’s header; this allows the XHTML structure to be converted into RDF. GRDDL is used to lift an RDF model from a given document.
Designed for human consumption — hard for machines to build a metadata model from. HTML markup controls the arrangement and presentation of information; formatting provides logical distinctions between pieces of information in a given HTML document.
1. Tidy HTML to avoid poorly formed tags. 2. Derive context windows using a lightweight person-name gazetteer: 1 window = information about 1 person; use the HTML DOM to identify windows. 3. Extract person information using an HMM trained for the extraction. Input: tokenised context window; the Viterbi algorithm calculates the most likely state sequence using the parameters of the trained HMM. Pros: uses clues in the input; HTML tags act as delimiters to identify the sequence of states. 4. Build a metadata model from the extracted information.
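The Viterbi decoding in step 3 can be sketched over a trained HMM's parameters; the states, symbols and probabilities below are hypothetical toy values, not the thesis's trained model:

```python
import numpy as np

# Sketch of Viterbi decoding: given HMM parameters, recover the most likely
# state sequence for a tokenised context window (observations as symbol ids).

def viterbi(obs, start, trans, emit):
    delta = start * emit[:, obs[0]]              # best path prob. per state
    back = []
    for o in obs[1:]:
        scores = delta[:, None] * trans          # extend every path one step
        back.append(scores.argmax(axis=0))       # best predecessor per state
        delta = scores.max(axis=0) * emit[:, o]
    path = [int(delta.argmax())]                 # backtrack from best end state
    for ptr in reversed(back):
        path.append(int(ptr[path[-1]]))
    return path[::-1]

start = np.array([0.6, 0.4])                     # e.g. states NAME vs OTHER
trans = np.array([[0.7, 0.3], [0.4, 0.6]])
emit = np.array([[0.9, 0.1], [0.2, 0.8]])        # 2 observation symbols
decoded = viterbi([0, 0, 1], start, trans, emit)
```

In the extraction setting the observations are tokens of the context window (with HTML tags acting as delimiters) and the states are the person-information fields being extracted.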
We now have our seed data and a collection of web resources Both are in RDF! Now we can pass them onto the disambiguation techniques 3 techniques were explored: 1. rule-based 2. graph-based 3. semi-supervised machine learning
The data base is populated with the provided seed data; the rule base holds rules built from the seed data. Rules allow a web citation to be INFERRED based on the presence of information. Seed data is limited in size, so build rules from RDF instances (an RDF instance = a unique object, e.g. a social network member), using a general-to-specific strategy inspired by FOIL [Quinlan, 1997]:
Seed data AND web resources are RDF, which has a graph structure. State of the art: [Jiang et al, 2009] — graph-space = web pages and their features (e.g. person name, organisation); cluster = maximally-connected cliques. [On & Lee, 2007] — nodes = feature vectors of web pages (based on named entities); edges weighted by TF/IDF similarity score; cluster = spectral clustering of normalised cuts. No work has used existing metadata models (e.g. RDF)! Approach: build a graph-space; get the strongly connected component, as the graph contains islands of connected nodes. Traverse the graph-space using random walks based on a first-order Markov chain, giving the probability of moving from node i to node j given t steps. Weight edges based on ontological properties of the concepts. Measure distances between nodes: commute time, optimum transitions. Cluster root nodes based on the distance measures.
The adjacency matrix gives the local similarity of nodes in the space. Probability of moving from one node to another given that t steps have been traversed.
Commute time: IF many paths exist between nodes AND those paths are short in length, THEN commute time decreases! Optimum transitions.
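Commute time is commonly computed from the pseudoinverse of the graph Laplacian via C(i, j) = vol(G) · (L⁺ᵢᵢ + L⁺ⱼⱼ − 2 L⁺ᵢⱼ); a sketch over a hypothetical 4-node graph-space (the thesis additionally weights edges by ontological properties):

```python
import numpy as np

# Sketch of commute time between graph nodes via the Laplacian pseudoinverse.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A        # graph Laplacian
Lp = np.linalg.pinv(L)                # Moore-Penrose pseudoinverse
vol = A.sum()                         # graph volume (sum of degrees)

def commute_time(i, j):
    return vol * (Lp[i, i] + Lp[j, j] - 2 * Lp[i, j])
```

Nodes 0 and 1, joined by several short paths, have a lower commute time than node 0 and the leaf node 3 — exactly the "many short paths" behaviour described above.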
Kernel functions over the graph-space. We now have similarity measures between nodes! The lower the distances/steps, the greater the similarity!
Common scenario: lots of unlabelled data, limited labelled data — a common problem in machine learning. Supervised = learn from labelled data. Unsupervised = use only unlabelled data (build a generative model of the data). Semi-supervised = uses both labelled and unlabelled data, overcoming the limitation of insufficient labelled data. However, these methods have a tendency to snowball: if unlabelled data is used incorrectly, mistakes reinforce themselves. Self-training trains an initial classifier; the classifier is then retrained using classified instances, improving upon the original hypothesis. Generate negative training data: binary classification (does web resource X cite person Y or not?); positive set = seed data collected from the Social Web; use Rocchio classification to generate negative examples (a form of relevance feedback used for query optimisation). Begin self-training: train the classifier; apply the classifier to unlabelled data; enlarge the training data with the strongest classifications; retrain the classifier; repeat the process until no unlabelled data remains.
Jaccard = strict. IFP = less strict, but requires certain properties. Entailment = allows variability.
SNOWBALL
Precision = proportion of web resources which are correctly labelled as citing a person. Recall = proportion of web references which are correctly disambiguated. F-measure = harmonic mean of precision and recall.
Achieves high levels of precision, outperforming humans and the other baselines. SPARQL rules require strict literal and resource matching within the triple patterns; this leads to poor recall levels, however, and the rules are unable to learn from past disambiguation decisions. At lower levels of web presence (where identity web references are sparse), rules outperform all baselines in terms of f-measure: humans find it difficult to detect sparse web references, so automating disambiguation at such levels is more suitable.
Achieves higher levels of recall for both distance measures than human processing. Commute time yields higher precision levels than optimum transitions, due to the round-trip cost used to cluster web resources with the social graph. Performance improves as web presence levels increase: random walks perform better where feature sets are large in size, indicative of a large web presence. Precision levels are less than inference rules: clustering using commute time and optimum transitions leads to an increase in false positives, as ambiguous nodes in the graph-space (e.g. a literal in a metadata model denoting a person’s name) lead to incorrect disambiguation decisions.
Entailment consistently achieves the highest f-measure scores for each classifier: a reduction in overfitting to the training data means it generalises well to new instances, characterised by the recall level achieved with the SVM. Two permutations outperform humans: Perceptron with Entailment and SVM with Entailment. Jaccard and IFP perform well at low levels of web presence, but f-measure reduces as identity web references grow in number: strict feature matching leads to poor recall levels and overfitting to the training data. Self-training outperforms both random walks and inference rules for certain permutations: direct use of disambiguation decisions allows classifiers to improve upon their initial hypothesis.
17 as first author. 5 papers accepted for publication since thesis submission.