Linking data without common identifiers
Presentation Transcript

  • 1. Linking data without common identifiers
    Semantic Web Meetup, 2011-08-23
    Lars Marius Garshol, <larsga@bouvet.no>
    http://twitter.com/larsga
  • 2. About me
    Lars Marius Garshol, Bouvet consultant
    Worked with semantic technologies since 1999
    mostly Topic Maps
    Before that I worked with XML
    Also background from
    Opera Software (Unicode support)
    open source development (Java/Python)
  • 3. Agenda
    Linking data – the problem
    Record linkage theory
    Duke
    A real-world example
    Usage at Hafslund
  • 4. The problem
    Data sets from different sources generally don’t share identifiers
    names tend to be spelled every which way
    So how can they be linked?
  • 5. A real-world example
  • 6. A difficult problem
    It requires n² comparisons for n records
    a million comparisons for 1000 records, 100 million for 10,000, ...
    Exact string comparison is not enough
    must handle misspellings, name variants, etc
    Interpreting the data can be difficult even for a human being
    is the address different because there are two different people, or because the person moved?
    ...
  • 7. Help from an unexpected quarter
    Statisticians to the rescue!
  • 8. Record linkage
    Statisticians very often must connect data sets from different sources
    They call it “record linkage”
    term coined in 1946 (1)
    mathematical foundations laid in 1959 (2)
    formalized in 1969 as the “Fellegi-Sunter” model (3)
    A whole subject area has been developed with well-known techniques, methods, and tools
    these can of course be applied outside of statistics
    (1) http://ajph.aphapublications.org/cgi/reprint/36/12/1412
    (2) http://www.sciencemag.org/content/130/3381/954.citation
    (3) http://www.jstor.org/pss/2286061
  • 9. Other terms for the same thing
    Has been independently invented many times
    Under many names
    entity resolution
    identity resolution
    merge/purge
    deduplication
    ...
    This makes Googling for information an absolute nightmare
  • 10. Application areas
    Statistics (obviously)
    Data cleaning
    Data integration
    Conversion
    Fraud detection / intelligence / surveillance
  • 11. Mathematical model
  • 12. Model, simplified
    We work with records, each of which has fields with values
    before doing record linkage we’ve processed the data so that they have the same set of fields
    (fields which exist in only one of the sources get discarded, as they are no help to us)
    We compare values for the same fields one by one, then estimate the probability that records represent the same entity given these values
    probability depends on which field and what values
    estimation can be very complex
    Finally, we can compute overall probability
  • 13. Example
  • 14. String comparisons
    Must handle spelling errors and name variants
    must estimate probability that values belong to same entity despite differences
    Examples
    John Smith ≈ Jonh Smith
    J. Random Hacker ≈ James Random Hacker
    ...
    Many well-known algorithms have been developed
    no one best choice, unfortunately
    some are eager to merge, others less so
    Many are computationally expensive
    O(n²) where n is the length of the string is common
    this gets very costly when number of record pairs to compare is already high
  • 15. Examples of algorithms
    Levenshtein
    edit distance
    i.e. the number of edits needed to change one string into the other
    not so eager to merge
    Jaro-Winkler
    specifically for short person names
    very promiscuous
    Soundex
    essentially a phonetic hashing algorithm
    very fast, but also very crude
    Longest common subsequence / q-grams
    haven’t tried this yet
    Monge-Elkan
    one of the best, but computationally expensive
    TFIDF
    based on term frequency
    studies often find this performs best, but it’s expensive
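To make the first algorithm on the list concrete, here is a minimal sketch of Levenshtein edit distance with the classic dynamic-programming recurrence, plus one common way of normalizing distance into a 0.0–1.0 similarity. This is an illustration of the technique, not Duke's implementation; the normalization choice is an assumption.

```python
def levenshtein(a, b):
    """Edit distance: the number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    # Classic dynamic-programming solution, O(len(a) * len(b)).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[len(b)]

def similarity(a, b):
    """Normalize distance into a similarity in [0.0, 1.0]."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

# "Jonh" is two substitutions away from "John", so the
# misspelled name still scores a fairly high similarity.
similarity("John Smith", "Jonh Smith")  # 0.8
```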
  • 16. Existing record linkage tools
    Commercial tools
    big, sophisticated, and expensive
    have found little information on what they actually do
    presumably also effective
    Open source tools
    generally made by and for statisticians
    nice user interfaces and rich configurability
    architecture often not as flexible as it could be
  • 17. Standard algorithm
    n² comparisons for n records is unacceptable
    must reduce number of direct comparisons
    Solution
    produce a key from field values,
    sort records by the key,
    for each record, compare with n nearest neighbours
    sometimes several different keys are produced, to increase chances of finding matches
    Downsides
    requires coming up with a key
    difficult to apply incrementally
    sorting is expensive
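The key-plus-sort approach above (often called the sorted-neighborhood method) can be sketched in a few lines. The key function shown here (first three letters of the name, uppercased) is a hypothetical choice for illustration; picking a good key is exactly the difficulty the slide mentions.

```python
def sorted_neighborhood(records, key, window=3):
    """Compare each record only with its `window` nearest
    neighbours after sorting by a blocking key, instead of
    examining all n^2 pairs."""
    srt = sorted(records, key=key)
    pairs = []
    for i, rec in enumerate(srt):
        for other in srt[i + 1 : i + 1 + window]:
            pairs.append((rec, other))
    return pairs  # candidate pairs for detailed comparison

# Hypothetical blocking key: first three letters of the name.
records = [{"NAME": "John Smith"}, {"NAME": "Jonh Smith"},
           {"NAME": "Ann Jones"}, {"NAME": "Anne Jones"}]
candidates = sorted_neighborhood(records,
                                 key=lambda r: r["NAME"][:3].upper())
```

Note the downside the slide points out: "Jonh" sorts under "JON" rather than "JOH", so a misspelling in the key's own characters can push a true match outside the window on larger data sets.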
  • 18. Good research papers
    Threat and Fraud Intelligence, Las Vegas Style, Jeff Jonas
    http://jeffjonas.typepad.com/IEEE.Identity.Resolution.pdf
    Real-world data is dirty: Data Cleansing and the Merge/Purge Problem, Hernandez & Stolfo
    http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.30.3496&rep=rep1&type=pdf
    Swoosh: a generic approach to entity resolution, Benjelloun, Garcia-Molina et al
    http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.122.5696&rep=rep1&type=pdf
  • 19. DUplicate KillEr
    Duke
  • 20. Context
    Doing a project for Hafslund where we integrate data from many sources
    Entities are duplicated both inside systems and across systems
    [Diagram: suppliers, customers, and companies duplicated across the CRM, Billing, and ERP systems]
  • 21. Requirements
    Must be flexible and configurable
    no way to know in advance exactly what data we will need to deduplicate
    Must scale to large data sets
    CRM alone has 1.4 million customer records
    that’s 2 trillion comparisons with naïve approach
    Must have an API
    project uses SDshare protocol everywhere
    must therefore be able to build connectors
    Must be able to work incrementally
    process data as it changes and update conclusions on the fly
  • 22. Reviewed existing tools...
    ...but didn’t find anything matching the criteria
    Therefore...
  • 23. Duke
    Java deduplication engine
    released as open source
    http://code.google.com/p/duke/
    Does not use key approach
    instead indexes data with Lucene
    does Lucene searches to find potential matches
    Still a work in progress, but
    high performance (1M records in 10 minutes),
    several data sources and comparators,
    being used for real in real projects,
    flexible architecture
  • 24. How it works
    XML configuration file defines setup
    lists data sources, properties, probabilities, etc
    Data sources provide streams of records
    Steps:
    normalize values so they are ready for comparison
    collect a batch of records, then index it with Lucene
    compare records one by one against Lucene index
    do detailed comparisons against search results
    pairs with probability above configurable threshold are considered matches
    For now, only API and command-line
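The indexing and search steps can be illustrated with a toy inverted index standing in for Lucene (Duke itself uses real Lucene searches; everything below is a simplified stand-in for illustration). The point is that detailed comparison only runs against the records a cheap token search returns, not against the whole data set.

```python
from collections import defaultdict

def build_index(records):
    """Toy stand-in for the Lucene index: token -> record ids."""
    index = defaultdict(set)
    for rid, rec in enumerate(records):
        for value in rec.values():
            for token in value.lower().split():
                index[token].add(rid)
    return index

def candidates(index, rec):
    """The 'search' step: find records sharing at least one token.
    Only these go on to detailed field-by-field comparison."""
    found = set()
    for value in rec.values():
        for token in value.lower().split():
            found |= index[token]
    return found
```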
  • 25. Record comparisons in detail
    A comparator compares field values
    returns a number in the range 0.0 – 1.0
    the field probability is the configured high probability if the number is 1.0, the low probability if it is 0.0, and otherwise in between the two
    Probability for entire record is computed using Bayes’s formula
    approach known as “naïve Bayes” in research literature
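A sketch of this combination step, under two stated assumptions: the mapping from comparator output into the low–high range is taken as linear here (Duke's exact interpolation may differ), and the per-field probabilities are combined with repeated application of Bayes' formula assuming field independence, which is what "naïve Bayes" refers to.

```python
def field_probability(sim, low, high):
    """Map a comparator similarity in [0.0, 1.0] into the field's
    configured low-high probability range (linear mapping assumed)."""
    return low + sim * (high - low)

def combine(probs):
    """Naive-Bayes combination: fold per-field probabilities
    together with Bayes' formula, assuming independence."""
    p = 0.5  # uninformative prior
    for q in probs:
        p = (p * q) / (p * q + (1 - p) * (1 - q))
    return p
```

With this formula, several moderately strong field matches reinforce each other into a high overall probability, while a field that scores below 0.5 actively pulls the overall probability down.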
  • 26. Components
    Data sources
    CSV
    JDBC
    Sparql
    NTriples
    <plug in your own>
    Comparators
    ExactComparator
    NumericComparator
    SoundexComparator
    TokenizedComparator
    Levenshtein
    JaroWinkler
    Dice coefficient
  • 27. Features
    Fairly complete command-line tool
    with debugging support
    API for embedding the tool
    Two modes:
    deduplication (all records matched against each other)
    linking (only matching across groups of records)
    Pluggable data sources, comparators, and cleaners
    Framework for composing cleaners
    Incremental processing
    High performance
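The cleaner-composition idea mentioned above can be sketched as chaining simple value-normalizing functions. The specific cleaners below (lowercasing, punctuation stripping) are illustrative examples, not Duke's class names.

```python
def lowercase(value):
    return value.lower()

def strip_punctuation(value):
    return "".join(ch for ch in value if ch.isalnum() or ch.isspace())

def compose(*cleaners):
    """Chain cleaners left to right into a single cleaner."""
    def cleaned(value):
        for cleaner in cleaners:
            value = cleaner(value)
        return value
    return cleaned

clean = compose(strip_punctuation, lowercase)
clean("J. Random Hacker")  # -> "j random hacker"
```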
  • 28. A real-world example
    Matching Mondial and DBpedia
  • 29. Finding properties to match
    Need properties providing identity evidence
    Matching on the properties in bold below
    Extracted data to CSV for ease of use
  • 30. Configuration – data sources
    <group>
      <csv>
        <param name="input-file" value="dbpedia.csv"/>
        <param name="header-line" value="false"/>
        <column name="1" property="ID"/>
        <column name="2"
                cleaner="no.priv...CountryNameCleaner"
                property="NAME"/>
        <column name="3" property="AREA"/>
        <column name="4"
                cleaner="no.priv...CapitalCleaner"
                property="CAPITAL"/>
      </csv>
    </group>

    <group>
      <csv>
        <param name="input-file" value="mondial.csv"/>
        <column name="id" property="ID"/>
        <column name="country"
                cleaner="no.priv...examples.CountryNameCleaner"
                property="NAME"/>
        <column name="capital"
                cleaner="no.priv...LowerCaseNormalizeCleaner"
                property="CAPITAL"/>
        <column name="area" property="AREA"/>
      </csv>
    </group>
  • 31. Configuration – matching
    <schema>
      <threshold>0.65</threshold>
      <property type="id">
        <name>ID</name>
      </property>
      <property>
        <name>NAME</name>
        <comparator>no.priv.garshol.duke.Levenshtein</comparator>
        <low>0.3</low>
        <high>0.88</high>
      </property>
      <property>
        <name>AREA</name>
        <comparator>AreaComparator</comparator>
        <low>0.2</low>
        <high>0.6</high>
      </property>
      <property>
        <name>CAPITAL</name>
        <comparator>no.priv.garshol.duke.Levenshtein</comparator>
        <low>0.4</low>
        <high>0.88</high>
      </property>
    </schema>

    <object class="no.priv.garshol.duke.NumericComparator"
            name="AreaComparator">
      <param name="min-ratio" value="0.7"/>
    </object>

    Duke analyzes this setup and decides that only NAME and CAPITAL need to be searched on in Lucene.
  • 32. Result
    Correct links found: 206 / 217 (94.9%)
    Wrong links found: 0 / 12 (0.0%)
    Unknown links found: 0
    Percent of links correct 100.0%, wrong 0.0%, unknown 0.0%
    Records with no link: 25
    Precision 100.0%, recall 94.9%, f-number 0.974
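The figures on this slide follow from the standard definitions, which can be checked directly: precision is the fraction of found links that are correct, recall the fraction of test-file links that were found, and the f-number their harmonic mean.

```python
correct, total_test_links = 206, 217
wrong = 0

found = correct + wrong
precision = correct / found           # 206/206 = 1.0
recall = correct / total_test_links   # 206/217 ~ 0.949
f_number = 2 * precision * recall / (precision + recall)  # ~ 0.974
```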
  • 33. Examples
  • 34. Choosing the right match
  • 35. An example of failure
    Duke doesn’t find this match
    no tokens matching exactly
    Lucene search finds nothing
    Detailed comparison gives correct result
    so only problem is Lucene search
    Lucene does have Levenshtein search, but
    in Lucene 3.x it’s very slow
    therefore not enabled now
    thinking of adding option to enable where needed
    Lucene 4.x will fix the performance problem
  • 36. The effects of value matching
  • 37. Duke in real life
    Usage at Hafslund
  • 38. The SESAM project
    Building a new archival system
    does automatic tagging of documents based on knowledge about the data
    knowledge extracted from backend systems
    Search engine-based frontend
    using Recommind for this
    Very flexible architecture
    extracted data stored in triple store (Virtuoso)
    all data transfer based on SDshare protocol
    data extracted from RDBMSs with Ontopia’s DB2TM
  • 39. The big picture
    [Architecture diagram: the 360, ERP, CRM, and Billing source systems are connected over SDshare to the Virtuoso triple store, which is where the duplicates end up; Duke and Recommind are also connected over SDshare, and the Duke feed contains owl:sameAs and haf:possiblySameAs statements]
  • 40. Experiences so far
    Incremental processing works fine
    links added and retracted as data changes
    Performance not an issue at all
    but then only ~50,000 records so far...
    Matching works well, but not perfect
    data are very noisy and messy
    also, matching is hard
  • 41. Duke roadmap
    0.3
    clean up the public API and document properly
    maybe some more comparators
    support for writing owl:sameAs to a SPARQL endpoint
    0.4
    add a web service interface
    0.5 and onwards
    more comparators
    maybe some parallelism
  • 42. Slides will appear on http://slideshare.net/larsga
    Comments/questions?