Deduplication

Describes the basic issues of detecting duplicates in messy data and a proposed open source Java engine for solving it.

Transcript

  • 1. Deduplication
    Bouvet BigOne, 2011-04-13
    Lars Marius Garshol, <larsga@bouvet.no>
    http://twitter.com/larsga
  • 2. Getting started
    Baby steps
  • 3. The problem
    The suppliers table
  • 4. The problem – take 2
    Overlapping tables in several systems: Suppliers, Customers (three separate tables), Companies
    Systems involved: CRM, Billing, ERP
  • 5. But ... what about identifiers?
    No, there are no system IDs across these tables
    Yes, there are outside identifiers
    organization number for companies
    personal number for people
    But, these are problematic
    many records don't have them
    they are inconsistently formatted
    sometimes they are misspelled
    some parts of huge organizations have the same org number, but need to be treated as separate
  • 6. First attempt at solution
    I wrote a simple Python script in ~2 hours
    It does the following:
    load all records
    normalize the data
    strip extra whitespace, lowercase, remove letters from org codes...
    use Bayesian inference for matching (see the sketch below)
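A minimal Java sketch of the normalization step described on this slide (the original was a ~2 hour Python script; the field names "name" and "orgnr" and the sample values are illustrative only):

```java
// Sketch of the normalization described above; field names and sample values
// are assumptions for illustration, not taken from the actual data.
import java.util.HashMap;
import java.util.Map;

public class Normalizer {

    // Collapse whitespace and lowercase free-text fields such as names.
    static String normalizeText(String value) {
        if (value == null) return "";
        return value.trim().toLowerCase().replaceAll("\\s+", " ");
    }

    // Organization numbers: keep digits only, dropping stray letters and spaces.
    static String normalizeOrgCode(String value) {
        if (value == null) return "";
        return value.replaceAll("[^0-9]", "");
    }

    public static void main(String[] args) {
        Map<String, String> record = new HashMap<String, String>();
        record.put("name", "  ACME   Corp ");
        record.put("orgnr", "NO 987 654 321");

        System.out.println(normalizeText(record.get("name")));     // acme corp
        System.out.println(normalizeOrgCode(record.get("orgnr"))); // 987654321
    }
}
```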
  • 7. Configuration
  • 8. Matching
    The per-field probabilities combine to a match probability of 0.93 (see the sketch below)
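The transcript does not include the matching table shown on the slide, but a figure like 0.93 is consistent with a standard naive-Bayes combination of per-field probabilities. A hedged sketch, with made-up field probabilities chosen only to reproduce a similar number:

```java
// One standard naive-Bayes way to combine independent per-field match
// probabilities into an overall probability. The field probabilities below
// are invented for illustration; the slide's actual numbers are not shown
// in the transcript.
public class NaiveBayesCombiner {

    static double combine(double[] fieldProbabilities) {
        double matchProduct = 1.0;
        double nonMatchProduct = 1.0;
        for (double p : fieldProbabilities) {
            matchProduct *= p;
            nonMatchProduct *= (1.0 - p);
        }
        return matchProduct / (matchProduct + nonMatchProduct);
    }

    public static void main(String[] args) {
        // Illustrative only: e.g. name agrees strongly, address less so.
        double[] probs = {0.8, 0.7, 0.6};
        System.out.println(combine(probs)); // about 0.93
    }
}
```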
  • 9. Problems
    The functions comparing values are still pretty primitive
    Performance is abysmal
    90 minutes to process 14,500 records
    performance is O(n²)
    total number of records is ~2.5 million
    time to process all records: 1 year 10 months
    Now what?
  • 10. An idea
    Well, we don't necessarily need to compare each record with all others if we have indexes
    we can look up the records which have matching values
    Use DBM for the indexes, for example
    Unfortunately, these only allow exact matching
    But, we can break up complex values into tokens, and index those
    Hang on, isn't this rather like a search engine?
    Bing!
    Let's try Lucene!
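A minimal in-memory sketch of the candidate-lookup idea on this slide: break field values into tokens, index record ids by token, and only compare records that share at least one token. A plain HashMap stands in for DBM or Lucene here; the record ids and values are illustrative:

```java
// Tokenize each value, index record ids by token, and fetch candidate records
// by shared tokens instead of comparing every record with every other.
import java.util.*;

public class TokenIndex {
    private final Map<String, Set<Integer>> index = new HashMap<String, Set<Integer>>();

    private static String[] tokenize(String value) {
        return value.toLowerCase().split("\\W+");
    }

    public void add(int recordId, String value) {
        for (String token : tokenize(value)) {
            Set<Integer> ids = index.get(token);
            if (ids == null) {
                ids = new HashSet<Integer>();
                index.put(token, ids);
            }
            ids.add(recordId);
        }
    }

    // Candidates: every record sharing at least one token with the query value.
    public Set<Integer> candidates(String value) {
        Set<Integer> result = new HashSet<Integer>();
        for (String token : tokenize(value)) {
            Set<Integer> ids = index.get(token);
            if (ids != null) result.addAll(ids);
        }
        return result;
    }

    public static void main(String[] args) {
        TokenIndex idx = new TokenIndex();
        idx.add(1, "Acme Corp Oslo");
        idx.add(2, "Acme Corporation");
        idx.add(3, "Globex Industries");
        System.out.println(idx.candidates("ACME corp")); // records 1 and 2, not 3
    }
}
```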
  • 11. Lucene-based prototype
    I whip out Jython and try it
    New script first builds Lucene index
    Then searches all records against the index
    Time to process 14,500 records: 1 minute
    Now we're talking...
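The prototype itself was a Jython script; below is a hedged Java sketch of the same two-pass approach against the Lucene 3.x API (index every record, then search each record back against the index to find candidates). The field name and sample values are assumptions for illustration:

```java
// Two-pass candidate lookup with Lucene 3.x: build the index, then query it
// with each record. The "name" field and the sample records are invented.
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class LucenePrototype {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_31);

        // Pass 1: index every record.
        IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(Version.LUCENE_31, analyzer));
        String[] records = {"acme corp oslo", "acme corporation", "globex industries"};
        for (String name : records) {
            Document doc = new Document();
            doc.add(new Field("name", name, Field.Store.YES, Field.Index.ANALYZED));
            writer.addDocument(doc);
        }
        writer.close();

        // Pass 2: search each record back against the index to find candidates.
        IndexSearcher searcher = new IndexSearcher(IndexReader.open(dir));
        QueryParser parser = new QueryParser(Version.LUCENE_31, "name", analyzer);
        for (String name : records) {
            ScoreDoc[] hits =
                searcher.search(parser.parse(QueryParser.escape(name)), 10).scoreDocs;
            System.out.println(name + " -> " + hits.length + " candidate(s)");
        }
        searcher.close();
    }
}
```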
  • 12. Reality sets in
    A splash of cold water to the face
  • 13. Prior art
    It turns out people have been doing this before
    They call it
    entity resolution
    identity resolution
    merge/purge
    deduplication
    record linkage
    ...
    This makes Googling for information an absolute nightmare
  • 14. Existing tools
    Several commercial tools
    they look big and expensive: we skip those
    Stian found some open source tools
    Oyster: slow, bad architecture, primitive matching
    SERF: slow, bad architecture
    So, it seems we still have to do it ourselves
  • 15. Finds in the research literature
    General
    problem is well-understood
    "naïve Bayes" is naïve
    lots of interesting work on value comparisons
    performance problem 'solved' with "blocking"
    build a key from parts of the data
    sort records by key
    compare each record with m nearest neighbours
    performance goes from O(n²) to O(n·m) (see the sketch after this slide)
    parallel processing widely used
    Swoosh paper
    compare and merge should have ICAR¹ properties
    optimal algorithms for general merge found
    run-time for 14,000 records ~1.5 hours...
    ¹ Idempotence, commutativity, associativity, representativity
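A minimal sketch of the sorted-neighbourhood blocking mentioned above: build a crude key from parts of each record, sort by that key, and compare each record only with its m nearest neighbours, giving O(n·m) candidate pairs. The Record class and key function are illustrative, not taken from the talk:

```java
// Sorted-neighbourhood blocking: sort records by a key built from the data,
// then compare each record only with its m nearest neighbours in key order.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class Blocking {

    static class Record {
        final String name;
        Record(String name) { this.name = name; }

        // Crude blocking key: first four characters of the normalized name.
        String key() {
            String n = name.toLowerCase().replaceAll("\\W", "");
            return n.substring(0, Math.min(4, n.length()));
        }
    }

    // O(n·m) candidate pairs instead of the O(n²) all-pairs comparison.
    static void findCandidatePairs(List<Record> records, int m) {
        Collections.sort(records, new Comparator<Record>() {
            public int compare(Record a, Record b) {
                return a.key().compareTo(b.key());
            }
        });
        for (int i = 0; i < records.size(); i++) {
            for (int j = i + 1; j <= i + m && j < records.size(); j++) {
                System.out.println("compare: " + records.get(i).name +
                                   " <-> " + records.get(j).name);
            }
        }
    }

    public static void main(String[] args) {
        List<Record> records = new ArrayList<Record>(Arrays.asList(
                new Record("Acme Corp"), new Record("Globex AS"),
                new Record("ACME Corporation"), new Record("Initech")));
        findCandidatePairs(records, 2);
    }
}
```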
  • 16. DUplicate KillEr
    Duke
  • 17. Java deduplication engine
    Work in progress
    so far spent only ~10 hours on it
    only core built so far
    Based on Lucene 3.1
    Open source (on Google Code)
    http://code.google.com/p/duke/
    Blazingly fast
    14,500 records in 30 seconds before optimization
    26,000 records in 40 seconds before optimization
    40,000 records in 60 seconds before optimization
  • 18. Architecture
    Flow: data in → equivalences out
    Components (from the diagram): SDshare client, SDshare server, RDF frontend, Datastore API, Duke engine, Lucene, H2 database
  • 19. Architecture #2
    Flow: data in → link file out
    Components (from the diagram): command-line client, more frontends (CSV frontend among them), Datastore API, Duke engine, Lucene
  • 23. Architecture #3
    Flow: data in → equivalences out
    Components (from the diagram): REST interface, X frontend, Datastore API, Duke engine, Lucene, H2 database
  • 24. Weaknesses
    Tied to naïve Bayes model
    research shows more sophisticated models perform better
    non-trivial to reconcile these with index lookup
    Value comparison sophistication limited
    Lucene does support Levenshtein queries
    (these are slow, though; will be fast in 4.x; see the sketch after this slide)
    Haven't yet tested with millions of records
    could be that something causes it to blow up under the load
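For reference, a hedged sketch of what a Levenshtein (fuzzy) lookup looks like in the Lucene 3.x API mentioned above; the field name and similarity threshold are illustrative:

```java
// Fuzzy lookup with Lucene 3.x: FuzzyQuery takes a term and a minimum-similarity
// threshold. The "name" field and the 0.7 threshold are assumptions for
// illustration; lower thresholds allow more edits but are slower in 3.x.
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;

public class FuzzyLookup {
    // Find records whose "name" field roughly matches the given value.
    static TopDocs fuzzySearch(IndexSearcher searcher, String value) throws Exception {
        FuzzyQuery query = new FuzzyQuery(new Term("name", value), 0.7f);
        return searcher.search(query, 10);
    }
}
```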
  • 25. Comments/questions?