Describes the basic issues of detecting duplicates in messy data and proposes an open source Java engine for solving the problem.


  1. 1. Deduplication<br />Bouvet BigOne, 2011-04-13<br />Lars Marius Garshol, <><br /><br />
  2. 2. Getting started<br />Baby steps<br />
  3. 3. The problem<br />The suppliers table<br />
  4. 4. The problem – take 2<br />Suppliers<br />Customers<br />Customers<br />Customers<br />Companies<br />CRM<br />Billing<br />ERP<br />
  5. 5. But ... what about identifiers?<br />No, there are no system IDs across these tables<br />Yes, there are outside identifiers<br />organization number for companies<br />personal number for people<br />But, these are problematic<br />many records don't have them<br />they are inconsistently formatted<br />sometimes they are misspelled<br />some parts of huge organizations have the same org number, but need to be treated as separate<br />
  6. 6. First attempt at solution<br />I wrote a simple Python script in ~2 hours<br />It does the following:<br />load all records<br />normalize the data<br />strip extra whitespace, lowercase, remove letters from org codes...<br />use Bayesian inference for matching<br />
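The normalization step could look roughly like this in Python (the function names and the org-code rule are illustrative assumptions, not the script's actual code):

```python
import re

def normalize(value):
    """Collapse runs of whitespace and lowercase a field value."""
    return re.sub(r"\s+", " ", value).strip().lower()

def normalize_org_code(code):
    """Strip everything but the digits of an organization number."""
    return re.sub(r"[^0-9]", "", code)

record = {"name": "  ACME   Corp ", "orgcode": "NO 987 654 321"}
clean = {"name": normalize(record["name"]),
         "orgcode": normalize_org_code(record["orgcode"])}
```

With this, `"  ACME   Corp "` and `"Acme corp"` normalize to the same string, so the later comparison step sees them as equal.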
  7. 7. Configuration<br />
  8. 8. Matching<br />This sums out to 0.93 probability<br />
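One way per-field probabilities can "sum out" to a single number is the standard naive Bayes combination rule; this sketch and its example probabilities are my assumption, not the slide's actual configuration:

```python
from functools import reduce

def combine(probabilities):
    """Naive Bayes combination of per-field match probabilities:
    weigh the product of the p_i against the product of (1 - p_i)."""
    p = reduce(lambda acc, pi: acc * pi, probabilities, 1.0)
    q = reduce(lambda acc, pi: acc * (1 - pi), probabilities, 1.0)
    return p / (p + q)

# three fields agreeing with varying confidence
print(round(combine([0.8, 0.9, 0.6]), 2))  # -> 0.98
```

Fields that match strongly push the combined probability toward 1, while disagreeing fields (p below 0.5) pull it back down.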
  9. 9. Problems<br />The functions comparing values are still pretty primitive<br />Performance is abysmal<br />90 minutes to process 14,500 records<br />performance is O(n²)<br />total number of records is ~2.5 million<br />time to process all records: 1 year 10 months<br />Now what?<br />
  10. 10. An idea<br />Well, we don't necessarily need to compare each record with all others if we have indexes<br />we can look up the records which have matching values<br />Use DBM for the indexes, for example<br />Unfortunately, these only allow exact matching<br />But, we can break up complex values into tokens, and index those<br />Hang on, isn't this rather like a search engine?<br />Bing!<br />Let's try Lucene!<br />
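The token-index idea can be sketched with a plain in-memory inverted index — a toy stand-in for the DBM/Lucene indexes the slide mentions:

```python
from collections import defaultdict

def tokens(value):
    return value.lower().split()

def build_index(records):
    """Map each token to the ids of the records containing it."""
    index = defaultdict(set)
    for rid, value in records.items():
        for token in tokens(value):
            index[token].add(rid)
    return index

def candidates(index, value, exclude=None):
    """Union of all records sharing at least one token with the query."""
    result = set()
    for token in tokens(value):
        result |= index[token]
    result.discard(exclude)
    return result

records = {1: "ACME Corp", 2: "Acme Corporation", 3: "Globex Inc"}
index = build_index(records)
# record 1 only needs to be compared against record 2, not every record
print(candidates(index, records[1], exclude=1))  # -> {2}
```

Instead of comparing each record against all others, we only compare it against the candidates the index returns — which is exactly what a search engine like Lucene does at scale.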
  11. 11. Lucene-based prototype<br />I whip out Jython and try it<br />New script first builds Lucene index<br />Then searches all records against the index<br />Time to process 14,500 records: 1 minute<br />Now we're talking...<br />
  12. 12. Reality sets in<br />A splash of cold water to the face<br />
  13. 13. Prior art<br />It turns out people have been doing this before<br />They call it<br />entity resolution<br />identity resolution<br />merge/purge<br />deduplication<br />record linkage<br />...<br />This makes Googling for information an absolute nightmare<br />
  14. 14. Existing tools<br />Several commercial tools<br />they look big and expensive: we skip those<br />Stian found some open source tools<br />Oyster: slow, bad architecture, primitive matching<br />SERF: slow, bad architecture<br />So, it seems we still have to do it ourselves<br />
  15. 15. Finds in the research literature<br />General<br />problem is well-understood<br />"naïve Bayes" is naïve<br />lots of interesting work on value comparisons<br />performance problem 'solved' with "blocking"<br />build a key from parts of the data<br />sort records by key<br />compare each record with m nearest neighbours<br />performance goes from O(n²) to O(n·m)<br />parallel processing widely used<br />Swoosh paper<br />compare and merge should have ICAR¹ properties<br />optimal algorithms for general merge found<br />run-time for 14,000 records ~1.5 hours...<br />¹ Idempotence, commutativity, associativity, reflexivity<br />
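The blocking scheme described above, in its sorted-neighbourhood variant, can be sketched like this (the choice of blocking key is a hypothetical example):

```python
def blocking_key(record):
    """Hypothetical key: first three letters of the name plus zip code."""
    return record["name"][:3].lower() + record.get("zip", "")

def candidate_pairs(records, m=3):
    """Sorted-neighbourhood blocking: sort records by key, then pair
    each record with its m nearest neighbours in key order."""
    ordered = sorted(records, key=blocking_key)
    pairs = []
    for i, rec in enumerate(ordered):
        for j in range(i + 1, min(i + 1 + m, len(ordered))):
            pairs.append((rec, ordered[j]))
    return pairs  # at most n*m pairs instead of n*(n-1)/2
```

Likely duplicates end up adjacent in key order, so comparing only nearby records catches most matches while the pair count drops from O(n²) to O(n·m).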
  16. 16. DUplicate KillEr<br />Duke<br />
  17. 17. Java deduplication engine<br />Work in progress<br />so far spent only ~10 hours on it<br />only core built so far<br />Based on Lucene 3.1<br />Open source (on Google Code)<br /><br />Blazingly fast<br />14,500 records in 30 seconds before optimization<br />26,000 records in 40 seconds before optimization<br />40,000 records in 60 seconds before optimization<br />
  18. 18. Architecture<br />data in<br />equivalences out<br />SDshare client<br />SDshare server<br />RDF frontend<br />Datastore API<br />Duke engine<br />Lucene<br />H2 database<br />
  19. 19. Architecture #2<br />data in<br />link file out<br />Command-line client<br />More frontends:<br /><ul><li> JDBC</li><li> SPARQL</li><li> RDF file</li><li> ...</li></ul>CSV frontend<br />Datastore API<br />Duke engine<br />Lucene<br />
  23. 23. Architecture #3<br />data in<br />equivalences out<br />REST interface<br />X frontend<br />Datastore API<br />Duke engine<br />Lucene<br />H2 database<br />
  24. 24. Weaknesses<br />Tied to naïve Bayes model<br />research shows more sophisticated models perform better<br />non-trivial to reconcile these with index lookup<br />Value comparison sophistication limited<br />Lucene does support Levenshtein queries<br />(these are slow, though. will be fast in 4.x)<br />Haven't yet tested with millions of records<br />could be that something causes it to blow up under the load<br />
  25. 25. Comments/questions?<br />