Harnessing the Social Web: The Science of Identity Disambiguation
Slides from Web Science Conference 2010

Published in: Education
  • 1. Harnessing the Social Web: The Science of Identity Disambiguation
    Matthew Rowe and Fabio Ciravegna
    Organisations, Information and Knowledge Group
    University of Sheffield, UK
    Web Science 2010
  • 2. Outline
    Dissemination of personal information across the Web
    Need for automation
    Harnessing the Social Web
    Identity Disambiguation
    Inference Rules
  • 3. Problem
    Large amount of information now residing on the World Wide Web is personal information
    Disseminated voluntarily: homepages, profiles pages
    Or involuntarily: telephone directories, electoral registers
    Sensitive nature of this information has led to:
    Identity Theft: act of stealing a person’s identity and reusing it
    Currently costs UK economy £1.2 billion
    Lateral Surveillance: act of watching someone without their knowledge
    Often performed by employers vetting potential employees
    And by socialites vetting prospective dates
    Could affect reputation if detrimental content exists
  • 4. Motivation
    To avoid such practices, web users must manually collect web resources which may cite them and then decide which do
    The latter stage of this process is referred to as disambiguation
    Decides which web resources are references and produces a unary set of identity web references for a given person
    However, this practice is
    Time consuming
    Must be repeated often as more and more data is published on the Web
    Automated disambiguation techniques can replace this manual processing
    To function effectively however, seed data (background knowledge about a person) is required:
    Expensive to produce (e.g. filling in an extensive form)
    Must contain sufficient features describing a person’s identity
  • 5. Harnessing the Social Web
    Overcome the problem of producing seed data manually by harnessing the Social Web
    Social Web platforms such as Facebook, Twitter and MySpace allow web users to build an online persona/identity visible to others
    Sociological studies have argued for the similarity between online and offline identities
    (Hart et al, 2008) state that online social networks are merely extensions of offline lives
    (Ellison et al, 2007) state that Social Web platforms are used to reinforce established offline relationships
    A user study was conducted to assess the relationship between digital identities constructed on Social Web platforms and their real world equivalent, using 50 participants from the University of Sheffield (25 male, 25 female) with a wide age range (18 – 45)
    Study consisted of three stages
    1. Participants listed their real world social network
    2. Digital social network was extracted from Facebook for each participant
    3. Digital and real world networks were compared
    Relevance: proportion of digital social network containing strong-tied relationships
    Coverage: extent to which the real world network is replicated online
    Results from the user study show
    Coverage range of 0.5 to 1 with an average of 0.77
    Indicating that, on average, 77% of a person’s real world social network is replicated online
    Average relevance of 0.23
    Indicating that, on average, 23% of a person’s digital social network consists of strong-tied relationships
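
The two measures can be read as set overlaps. The sketch below is one plausible reading, not the study's stated formulas: it assumes the real world network a participant lists is their set of strong ties, so coverage is the overlap divided by the real world network and relevance is the overlap divided by the digital network. The names and data are invented for illustration.

```python
def coverage(real_world: set, digital: set) -> float:
    """Fraction of the real world network that also appears online."""
    if not real_world:
        return 0.0
    return len(real_world & digital) / len(real_world)


def relevance(real_world: set, digital: set) -> float:
    """Fraction of the digital network made up of real world (strong) ties."""
    if not digital:
        return 0.0
    return len(real_world & digital) / len(digital)


# Hypothetical participant: 4 real world contacts, 10 Facebook friends.
real = {"alice", "bob", "carol", "dave"}
online = {"alice", "bob", "carol", "erin", "frank",
          "grace", "heidi", "ivan", "judy", "mallory"}

print(coverage(real, online))   # 3 of 4 real world contacts are online
print(relevance(real, online))  # 3 of 10 online friends are strong ties
```

High coverage combined with low relevance, as in the study's averages, means most close contacts are present online but are surrounded by many weaker ties.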
  • 6. Collecting Seed Data from the Social Web
  • 7. Collecting Seed Data from the Social Web
    rdf:type foaf:Person ;
    foaf:name "Matthew Rowe" ;
    foaf:homepage <> ;
    foaf:mbox <> ;
    foaf:based_near <> ;
    foaf:knows <> ;
    foaf:knows <> .
    rdf:type foaf:Person ;
    foaf:name "Fabio Ciravegna" ;
    foaf:mbox <> ;
    foaf:homepage <> .
    rdf:type foaf:Person ;
    foaf:name "Sam Chapman" ;
    foaf:mbox <> ;
    foaf:homepage <> .
    rdf:type geo:Feature ;
    geo:name "Sheffield" ;
    geo:inCountry "UK" .
  • 8. Identity Disambiguation: Inference Rules
    Rules provide a means to logically infer conclusions based on the presence of information
    In the context of identity disambiguation, rules replicate the cognitive process by which a human decides if a web resource refers to a given entity
    Using background knowledge known about the entity
    Uses a supervised approach by only using the provided seed data to make decisions
    Rules are built from the seed data as follows:
    RDF instances are extracted from the seed data (e.g. an instance of a person or location)
    A rule is constructed from the information in each instance description
    Rules are then added to the rule base, which is then applied to a collected set of web resources to disambiguate identity web references
    If the triple pattern in the antecedent (the if part) of the rule matches the knowledge structure of a web resource then a web reference is inferred
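
As a rough illustration of the rule-building steps above, the sketch below turns one seed instance into a rule and tests it against a web resource. The representation is an assumption made for illustration only: triples are plain `(subject, predicate, object)` tuples, a rule's antecedent is the set of `(predicate, object)` pairs describing the instance, and a rule fires when a single node in the resource carries every pair. The mailbox URI and node names are made up.

```python
from collections import defaultdict


def build_rule(instance_triples):
    """Antecedent: the (predicate, object) pairs describing one seed instance."""
    return frozenset((p, o) for _, p, o in instance_triples)


def rule_fires(rule, resource_triples):
    """True if some single node in the resource carries every pair in the rule."""
    by_subject = defaultdict(set)
    for s, p, o in resource_triples:
        by_subject[s].add((p, o))
    return any(rule <= pairs for pairs in by_subject.values())


# Hypothetical seed instance (the URI is a placeholder, not from the slides).
seed_instance = [
    ("_:fabio", "foaf:name", "Fabio Ciravegna"),
    ("_:fabio", "foaf:mbox", "mailto:person@example.org"),
]
rule = build_rule(seed_instance)

page_a = [  # carries both the name and the mailbox on one node
    ("_:n1", "foaf:name", "Fabio Ciravegna"),
    ("_:n1", "foaf:mbox", "mailto:person@example.org"),
    ("_:n1", "foaf:based_near", "Sheffield"),
]
page_b = [  # name only: the antecedent is not fully matched
    ("_:n7", "foaf:name", "Fabio Ciravegna"),
]

print(rule_fires(rule, page_a))
print(rule_fires(rule, page_b))
```

Requiring every pair to match makes a rule precise but brittle: a page that mentions only the name is passed over.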
  • 9. Identity Disambiguation: Inference Rules
  • 10. Identity Disambiguation: Self-training
    Self-training provides a semi-supervised approach to disambiguation:
    1. Seed data collected from the Social Web provides the positive training data
    2. Possible web citations provide the unlabelled data
    3. Negative training data is generated using Rocchio classification over the unlabelled data
    4. Positive and negative training data is then used to train an initial classifier
    5. Classifier is applied to the unlabelled data and labels each example
    6. Training sets (positive and negative) are enlarged with the examples from the unlabelled data which exhibit the strongest classification confidences
    7. Examples are removed from the unlabelled data, reducing its size
    8. Steps 4-7 are repeated until all unlabelled data has been classified
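
The loop described above can be sketched in a few lines of plain Python. This is a toy version under stated assumptions, not the authors' implementation: examples are small numeric vectors assumed to be distinct, the negative set comes from a simplified Rocchio-style centroid comparison, and a nearest-centroid scorer stands in for the Perceptron, SVM and Naïve Bayes classifiers. All function names and data are hypothetical.

```python
import math


def centroid(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rocchio_negatives(positive, unlabelled):
    """Step 3: unlabelled items nearer the unlabelled centroid than the positive one."""
    pos_c, unl_c = centroid(positive), centroid(unlabelled)
    negs = [u for u in unlabelled if cosine(u, unl_c) > cosine(u, pos_c)]
    # Fallback so the negative set is never empty: take the least positive-like item.
    return negs or [min(unlabelled, key=lambda u: cosine(u, pos_c))]


def self_train(positive, unlabelled, batch=1):
    positive, pool = list(positive), list(unlabelled)
    if not pool:
        return {}
    negative = rocchio_negatives(positive, pool)
    labels = {}
    while pool:
        # Step 4: "train" the stand-in classifier (one centroid per class).
        pos_c, neg_c = centroid(positive), centroid(negative)
        # Step 5: score each remaining example; the sign is the predicted label.
        scored = [(cosine(u, pos_c) - cosine(u, neg_c), u) for u in pool]
        # Step 6: move the most confident examples into the training sets.
        scored.sort(key=lambda pair: abs(pair[0]), reverse=True)
        for score, u in scored[:batch]:
            is_reference = score > 0
            (positive if is_reference else negative).append(u)
            labels[tuple(u)] = is_reference
            pool.remove(u)  # Step 7: shrink the unlabelled set.
    return labels  # Step 8 is the while loop itself.


seed = [[1.0, 0.0], [0.9, 0.1]]
candidates = [[1.0, 0.2], [0.8, 0.0], [0.0, 1.0], [0.1, 0.9]]
print(self_train(seed, candidates))
```

The `batch` parameter controls how many of the most confident examples are promoted per round; smaller values retrain more often at a higher cost.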
    Tested 3 different machine learning classifiers: Perceptron, Support Vector Machines and Naïve Bayes
    RDF models (for both the seed data and the web resources) are converted into machine learning instances
    RDF instances from the models are used as features for the machine learning instances
    This permits the variation of distinct feature similarity measures between 3 different RDF graph matching techniques:
    RDF Entailment: does one graph subsume that of another?
    Inverse Functional Property Matching: do property values match in distinct graphs where the property is inverse functional?
    Jaccard similarity (strict graph equivalence): are the graphs identical?
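
To make the three matching techniques concrete, the sketch below compares two instance descriptions held as sets of `(predicate, object)` pairs. It is deliberately simplified: entailment is reduced to a subset test (blank nodes and RDFS inference are ignored), and the values are invented. `foaf:mbox` and `foaf:homepage` are used as the inverse functional properties because FOAF declares them as such.

```python
def jaccard(a: set, b: set) -> float:
    """Overlap of two descriptions; 1.0 means the graphs are identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def entails(a: set, b: set) -> bool:
    """Simplified subsumption: every statement in b also appears in a."""
    return b <= a


INVERSE_FUNCTIONAL = {"foaf:mbox", "foaf:homepage"}


def ifp_match(a: set, b: set, ifps=INVERSE_FUNCTIONAL) -> bool:
    """True if the descriptions share a value for any inverse functional property."""
    keys_a = {(p, o) for p, o in a if p in ifps}
    keys_b = {(p, o) for p, o in b if p in ifps}
    return bool(keys_a & keys_b)


# Hypothetical descriptions; the URIs are placeholders.
seed = {
    ("foaf:name", "Matthew Rowe"),
    ("foaf:mbox", "mailto:m.rowe@example.org"),
    ("foaf:homepage", "http://example.org/~rowe"),
}
found = {
    ("foaf:name", "M. Rowe"),
    ("foaf:mbox", "mailto:m.rowe@example.org"),
}

print(jaccard(seed, found))    # 1 shared pair out of 4 distinct pairs
print(entails(seed, found))    # the name variant breaks subsumption
print(ifp_match(seed, found))  # the shared mailbox is enough
```

In this example a strict comparison rejects the match because of a name variant, while the shared mailbox alone identifies the same person.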
  • 11. Identity Disambiguation: Self-training
    Intuition is that as the classifier learns from unlabelled data it will learn from previously unknown features
    Seed data only covers a portion of a person’s identity
    Will lead to the detection of more web references
    This is similar to the cognitive process by which humans identify web citations
    Only a portion of background knowledge is known at the start
    As more web references are found, the knowledge of the person is expanded
  • 12. Evaluation
    50 members of the Semantic Web and Web 2.0 communities
    Collected seed data from Facebook and Twitter
    Collected possible web citations from searching WWW and the Semantic Web for each participant
    Converted each returned resource into an RDF model representation
    ~346 web resources to be analysed for each participant
    Evaluation Measures
    Information retrieval metrics: precision, recall and f-measure
    Web presence level: proportion of web resources that refer to each participant (e.g. 50 of 350 web resources refer to a given person, then web presence is 14%)
    Baseline Measure: Human Processing
    Group of 12 raters manually processed a portion of the dataset for each participant
    3 raters performed disambiguation for each participant, then used interrater agreement (Hripcsak & Rothschild, 2005) to calculate IR metrics
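
The evaluation measures are standard and can be computed from sets of resource identifiers. The sketch below uses invented identifiers and reuses the slide's own web presence example (50 references among 350 resources); the function names are illustrative.

```python
def precision_recall_f1(predicted: set, actual: set):
    """Precision, recall and F-measure for one participant's web references."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def web_presence(n_references: int, n_resources: int) -> float:
    """Proportion of collected web resources that refer to the participant."""
    return n_references / n_resources


# Hypothetical run: 50 true references, of which a cautious method finds 20.
actual = set(range(50))
predicted = set(range(20))

p, r, f = precision_recall_f1(predicted, actual)
print(p, r, round(f, 3))
print(round(web_presence(50, 350), 2))  # the slide's example, about 14%
```

A method that returns only matches it is sure of scores perfect precision here but recovers fewer than half of the references, which pulls the F-measure down.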
  • 13. Evaluation: Inference Rules
    Yields high levels of precision, but poor recall scores
    Specific nature of rules leads to poor application to new instances
    Consistently outperforms humans in terms of precision for all web presence levels
    At low levels of web presence, where web references are sparse, humans perform poorly
    This is characterised by a “Needle in a Haystack” problem
    Inhibited by the lack of web references to learn from
  • 14. Evaluation: Self-training
    Perceptron and SVM combined with Entailment outperform humans
    Due to the large levels of recall achieved by these permutations
    Entailment leads to a reduction in overfitting to the training data
    Precision is lowered, but recall is improved significantly
    Performance also remains consistent for all web presence levels
    Jaccard uses strict matching between RDF instances, leading to high precision levels
    Poor recall levels due to overfitting to training data
    Unable to generalise to new instances
  • 15. Conclusions
    Social Web platforms provide a useful source for identity information
    Significant similarity between real world and digital social networks
    This can in turn be used to support automated disambiguation techniques
    Inference Rules, using a supervised strategy, yield high precision levels yet fail to detect a large portion of identity web references
    Self-training overcomes the limitations of supervised techniques by learning from disambiguation decisions
    High recall levels demonstrate the effectiveness of such methods to detect a large portion of web references
    Future work will look to combine these two methods together
    Enlarging the positive training data using Inference Rules – given their high precision levels
    Then applying Self-training to increase recall levels
  • 16. Twitter: @mattroweshow
    (Hart et al, 2008) - J. Hart, C. Ridley, F. Taher, C. Sas, and A. Dix. Exploring the facebook experience: a new approach to usability. In NordiCHI ’08: Proceedings of the 5th Nordic conference on Human-computer interaction, pages 471–474, New York, NY, USA, 2008. ACM
    (Ellison et al, 2007) - N. B. Ellison, C. Steinfield, and C. Lampe. The benefits of facebook friends: Social capital and college students’ use of online social network sites. Journal of Computer Mediated Communication, 12:1143–1168, 2007.
    (Hripcsak & Rothschild, 2005) – G. Hripcsak and A. S. Rothschild. Agreement, the f-measure, and reliability in information retrieval. Journal of American Medical Informatics Association, 12(3):296–298, 2005.