Linking data without common identifiers

  1. Linking data without common identifiers
     Semantic Web Meetup, 2011-08-23
     Lars Marius Garshol, <larsga@bouvet.no>
     http://twitter.com/larsga

  2. About me
     - Lars Marius Garshol, Bouvet consultant
     - Worked with semantic technologies since 1999, mostly Topic Maps
     - Before that I worked with XML
     - Also background from Opera Software (Unicode support) and open source development (Java/Python)

  3. Agenda
     - Linking data – the problem
     - Record linkage theory
     - Duke
     - A real-world example
     - Usage at Hafslund
  4. The problem
     - Data sets from different sources generally don't share identifiers
       - names tend to be spelled every which way
     - So how can they be linked?

  5. A real-world example

  6. A difficult problem
     - It requires n² comparisons for n records
       - a million comparisons for 1,000 records, 100 million for 10,000, ...
     - Exact string comparison is not enough
       - must handle misspellings, name variants, etc.
     - Interpreting the data can be difficult even for a human being
       - is the address different because there are two different people, or because the person moved?
       - ...

  7. Help from an unexpected quarter
     - Statisticians to the rescue!
  8. Record linkage
     - Statisticians very often must connect data sets from different sources
     - They call it "record linkage"
       - term coined in 1946 [1]
       - mathematical foundations laid in 1959 [2]
       - formalized in 1969 as the "Fellegi-Sunter" model [3]
     - A whole subject area has been developed, with well-known techniques, methods, and tools
       - these can of course be applied outside of statistics
     [1] http://ajph.aphapublications.org/cgi/reprint/36/12/1412
     [2] http://www.sciencemag.org/content/130/3381/954.citation
     [3] http://www.jstor.org/pss/2286061

  9. Other terms for the same thing
     - Has been independently invented many times, under many names:
       - entity resolution
       - identity resolution
       - merge/purge
       - deduplication
       - ...
     - This makes Googling for information an absolute nightmare

  10. Application areas
     - Statistics (obviously)
     - Data cleaning
     - Data integration
     - Conversion
     - Fraud detection / intelligence / surveillance

  11. Mathematical model
  12. Model, simplified
     - We work with records, each of which has fields with values
       - before doing record linkage we've processed the data so that both sources have the same set of fields
       - fields which exist in only one of the sources get discarded, as they are no help to us
     - We compare values for the same fields one by one, then estimate the probability that the records represent the same entity given those values
       - the probability depends on which field it is and what the values are
       - estimation can be very complex
     - Finally, we can compute an overall probability (see the sketch below)
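     A minimal sketch of this model, assuming nothing about Duke's actual classes: records are
     treated as plain field-to-value maps, and each shared field is compared with some comparator
     (here just exact match). The names RecordPair and compareValues are illustrative.

         import java.util.Map;

         public class RecordPair {

           // Compare two records field by field; fields present in only one
           // record are skipped, since they provide no identity evidence.
           public static void compare(Map<String, String> r1, Map<String, String> r2) {
             for (String field : r1.keySet()) {
               if (!r2.containsKey(field))
                 continue;
               double similarity = compareValues(r1.get(field), r2.get(field));
               System.out.println(field + ": " + similarity);
             }
           }

           // Crude placeholder comparator: exact match or nothing. Real record
           // linkage uses string comparators like those on the next slides.
           private static double compareValues(String v1, String v2) {
             return v1.equalsIgnoreCase(v2) ? 1.0 : 0.0;
           }

           public static void main(String[] args) {
             // Illustrative values only
             compare(Map.of("NAME", "John Smith", "ADDRESS", "Main St. 5"),
                     Map.of("NAME", "Jonh Smith", "ADDRESS", "Main St. 5"));
           }
         }

     The per-field similarities are then turned into probabilities and combined into an overall
     probability, as described on slide 25.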
  13. Example

  14. String comparisons
     - Must handle spelling errors and name variants
       - must estimate the probability that values belong to the same entity despite the differences
     - Examples:
       - John Smith ≈ Jonh Smith
       - J. Random Hacker ≈ James Random Hacker
       - ...
     - Many well-known algorithms have been developed
       - no one best choice, unfortunately
       - some are eager to merge, others less so
     - Many are computationally expensive
       - O(n²), where n is the length of the string, is common
       - this gets very costly when the number of record pairs to compare is already high
  15. Examples of algorithms
     - Levenshtein (sketched below)
       - edit distance, i.e. the number of edits needed to change string 1 into string 2
       - not so eager to merge
     - Jaro-Winkler
       - designed specifically for short person names
       - very promiscuous
     - Soundex
       - essentially a phonetic hashing algorithm
       - very fast, but also very crude
     - Longest common subsequence / q-grams
       - haven't tried these yet
     - Monge-Elkan
       - one of the best, but computationally expensive
     - TFIDF
       - based on term frequency
       - studies often find this performs best, but it's expensive
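     A minimal sketch of the Levenshtein distance mentioned above: the number of single-character
     insertions, deletions, and substitutions needed to turn one string into the other. This is the
     textbook dynamic-programming version, not Duke's implementation, and it shows why the cost is
     O(n²) in the string length.

         public class Levenshtein {

           public static int distance(String s1, String s2) {
             int[] prev = new int[s2.length() + 1];
             int[] curr = new int[s2.length() + 1];
             for (int j = 0; j <= s2.length(); j++)
               prev[j] = j; // turning "" into the first j characters of s2 costs j insertions

             for (int i = 1; i <= s1.length(); i++) {
               curr[0] = i; // turning the first i characters of s1 into "" costs i deletions
               for (int j = 1; j <= s2.length(); j++) {
                 int cost = s1.charAt(i - 1) == s2.charAt(j - 1) ? 0 : 1;
                 curr[j] = Math.min(Math.min(curr[j - 1] + 1,   // insertion
                                             prev[j] + 1),      // deletion
                                    prev[j - 1] + cost);        // substitution
               }
               int[] tmp = prev; prev = curr; curr = tmp;
             }
             return prev[s2.length()];
           }

           public static void main(String[] args) {
             // The transposed characters in "John"/"Jonh" count as two substitutions
             System.out.println(distance("John Smith", "Jonh Smith")); // prints 2
           }
         }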
  16. Existing record linkage tools
     - Commercial tools
       - big, sophisticated, and expensive
       - have found little information on what they actually do
       - presumably also effective
     - Open source tools
       - generally made by and for statisticians
       - nice user interfaces and rich configurability
       - architecture often not as flexible as it could be
  17. Standard algorithm
     - n² comparisons for n records is unacceptable
       - must reduce the number of direct comparisons
     - Solution (sketched below):
       - produce a key from the field values
       - sort the records by the key
       - for each record, compare it with its n nearest neighbours
       - sometimes several different keys are produced, to increase the chances of finding matches
     - Downsides:
       - requires coming up with a key
       - difficult to apply incrementally
       - sorting is expensive
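     A minimal sketch of this sorted-neighbourhood approach. The key choice (the first four letters
     of the normalized name) and the window size are illustrative assumptions, not taken from any
     particular tool.

         import java.util.*;

         public class SortedNeighbourhood {

           // Build a crude sorting key from the NAME field.
           static String key(Map<String, String> record) {
             String name = record.get("NAME").toLowerCase().replaceAll("[^a-z]", "");
             return name.substring(0, Math.min(4, name.length()));
           }

           // Sort by key, then compare each record only with its nearest neighbours.
           static void findCandidates(List<Map<String, String>> records, int window) {
             records.sort(Comparator.comparing(SortedNeighbourhood::key));
             for (int i = 0; i < records.size(); i++)
               for (int j = i + 1; j < Math.min(i + 1 + window, records.size()); j++)
                 // only these pairs go on to detailed field-by-field comparison
                 System.out.println(records.get(i).get("NAME") + " <-> " +
                                    records.get(j).get("NAME"));
           }

           public static void main(String[] args) {
             List<Map<String, String>> records = new ArrayList<>(List.of(
                 Map.of("NAME", "John Smith"),
                 Map.of("NAME", "J. Random Hacker"),
                 Map.of("NAME", "Jonh Smith")));
             findCandidates(records, 2); // window of 2 neighbours
           }
         }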
  18. Good research papers
     - Threat and Fraud Intelligence, Las Vegas Style, Jeff Jonas
       http://jeffjonas.typepad.com/IEEE.Identity.Resolution.pdf
     - Real-world Data is Dirty: Data Cleansing and the Merge/Purge Problem, Hernandez & Stolfo
       http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.30.3496&rep=rep1&type=pdf
     - Swoosh: A Generic Approach to Entity Resolution, Benjelloun, Garcia-Molina et al.
       http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.122.5696&rep=rep1&type=pdf

  19. Duke
     - DUplicate KillEr
  20. Context
     - Doing a project for Hafslund where we integrate data from many sources
     - Entities are duplicated both inside systems and across systems
     [Diagram: suppliers, customers, and companies duplicated across the CRM, Billing, and ERP systems]
  21. Requirements
     - Must be flexible and configurable
       - no way to know in advance exactly what data we will need to deduplicate
     - Must scale to large data sets
       - the CRM alone has 1.4 million customer records
       - that's ~2 trillion comparisons with the naïve approach
     - Must have an API
       - the project uses the SDshare protocol everywhere
       - must therefore be able to build connectors
     - Must be able to work incrementally
       - process data as it changes and update conclusions on the fly

  22. Reviewed existing tools...
     - ...but didn't find anything matching the criteria
     - Therefore...

  23. Duke
     - Java deduplication engine
       - released as open source
       - http://code.google.com/p/duke/
     - Does not use the key approach
       - instead indexes the data with Lucene
       - does Lucene searches to find potential matches
     - Still a work in progress, but
       - high performance (1M records in 10 minutes)
       - several data sources and comparators
       - being used for real in real projects
       - flexible architecture
  24. How it works
     - An XML configuration file defines the setup
       - lists data sources, properties, probabilities, etc.
     - Data sources provide streams of records
     - Steps (sketched below):
       - normalize values so they are ready for comparison
       - collect a batch of records, then index it with Lucene
       - compare records one by one against the Lucene index
       - do detailed comparisons against the search results
       - pairs with probability above a configurable threshold are considered matches
     - For now, only an API and a command-line tool
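     A rough sketch of the candidate-lookup step in this flow. Duke uses Lucene for the index; here
     a simple in-memory token index stands in for Lucene, just to show the shape of index-then-search.
     The class and field names are illustrative, not Duke's API.

         import java.util.*;

         public class CandidateLookup {

           // token -> ids of records whose NAME field contains that token
           private final Map<String, Set<String>> index = new HashMap<>();

           public void index(String id, Map<String, String> record) {
             for (String token : record.get("NAME").toLowerCase().split("\\s+"))
               index.computeIfAbsent(token, t -> new HashSet<>()).add(id);
           }

           // Return ids of indexed records sharing at least one NAME token.
           // These candidates then get the detailed field-by-field comparison.
           public Set<String> candidates(Map<String, String> record) {
             Set<String> result = new HashSet<>();
             for (String token : record.get("NAME").toLowerCase().split("\\s+"))
               result.addAll(index.getOrDefault(token, Set.of()));
             return result;
           }

           public static void main(String[] args) {
             CandidateLookup lookup = new CandidateLookup();
             lookup.index("1", Map.of("NAME", "John Smith"));
             lookup.index("2", Map.of("NAME", "James Random Hacker"));
             // shares the tokens "random" and "hacker" with record 2
             System.out.println(lookup.candidates(Map.of("NAME", "J. Random Hacker")));
           }
         }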
  25. Record comparisons in detail
     - The comparator compares the field values
       - returns a number in the range 0.0–1.0
       - the probability for the field is the configured high probability if the number is 1.0, the low probability if it is 0.0, and otherwise somewhere in between the two
     - The probability for the entire record is computed using Bayes's formula (see the sketch below)
       - this approach is known as "naïve Bayes" in the research literature
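     A sketch of combining per-field probabilities in the naïve Bayes style referred to above. Each
     field contributes a probability that the two records represent the same entity, and the
     estimates are folded together pairwise. Whether this is exactly Duke's formula is an
     assumption; the slide only names the approach.

         public class BayesCombination {

           // Combine the current estimate with the evidence from one more field.
           static double combine(double estimate, double fieldProbability) {
             return (estimate * fieldProbability) /
                    (estimate * fieldProbability + (1 - estimate) * (1 - fieldProbability));
           }

           public static void main(String[] args) {
             // Start from an indifferent 0.5, then fold in field evidence using
             // the high/low values from the configuration on slide 31:
             // NAME matched perfectly (0.88), CAPITAL did not match at all (0.4).
             double p = 0.5;
             p = combine(p, 0.88);
             p = combine(p, 0.40);
             System.out.println(p); // ~0.83, above the 0.65 threshold
           }
         }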
  26. Components
     - Data sources
       - CSV
       - JDBC
       - SPARQL
       - NTriples
       - <plug in your own>
     - Comparators
       - ExactComparator
       - NumericComparator
       - SoundexComparator
       - TokenizedComparator
       - Levenshtein
       - JaroWinkler
       - Dice coefficient

  27. Features
     - Fairly complete command-line tool, with debugging support
     - API for embedding the tool
     - Two modes:
       - deduplication (all records matched against each other)
       - linking (only matching across groups of records)
     - Pluggable data sources, comparators, and cleaners
     - Framework for composing cleaners
     - Incremental processing
     - High performance

  28. A real-world example
     - Matching Mondial and DBpedia
  29. Finding properties to match
     - Need properties that provide identity evidence
     - Matching on NAME, AREA, and CAPITAL (shown in bold in the slide's property table)
     - Extracted the data to CSV for ease of use
  30. Configuration – data sources

      <group>
        <csv>
          <param name="input-file" value="dbpedia.csv"/>
          <param name="header-line" value="false"/>
          <column name="1" property="ID"/>
          <column name="2"
                  cleaner="no.priv...CountryNameCleaner"
                  property="NAME"/>
          <column name="3"
                  property="AREA"/>
          <column name="4"
                  cleaner="no.priv...CapitalCleaner"
                  property="CAPITAL"/>
        </csv>
      </group>

      <group>
        <csv>
          <param name="input-file" value="mondial.csv"/>
          <column name="id" property="ID"/>
          <column name="country"
                  cleaner="no.priv...examples.CountryNameCleaner"
                  property="NAME"/>
          <column name="capital"
                  cleaner="no.priv...LowerCaseNormalizeCleaner"
                  property="CAPITAL"/>
          <column name="area"
                  property="AREA"/>
        </csv>
      </group>

  31. Configuration – matching

      <schema>
        <threshold>0.65</threshold>

        <property type="id">
          <name>ID</name>
        </property>
        <property>
          <name>NAME</name>
          <comparator>no.priv.garshol.duke.Levenshtein</comparator>
          <low>0.3</low>
          <high>0.88</high>
        </property>
        <property>
          <name>AREA</name>
          <comparator>AreaComparator</comparator>
          <low>0.2</low>
          <high>0.6</high>
        </property>
        <property>
          <name>CAPITAL</name>
          <comparator>no.priv.garshol.duke.Levenshtein</comparator>
          <low>0.4</low>
          <high>0.88</high>
        </property>
      </schema>

      <object class="no.priv.garshol.duke.NumericComparator"
              name="AreaComparator">
        <param name="min-ratio" value="0.7"/>
      </object>

      Duke analyzes this setup and decides that only NAME and CAPITAL need to be searched on in Lucene.
  32. Result
     - Correct links found: 206 / 217 (94.9%)
     - Wrong links found: 0 / 12 (0.0%)
     - Unknown links found: 0
     - Percent of links correct: 100.0%, wrong: 0.0%, unknown: 0.0%
     - Records with no link: 25
     - Precision 100.0%, recall 94.9%, f-number 0.974
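     A worked check of the quality figures above, computed from the counts on this slide (206 of the
     217 true links found, 0 wrong links). The formulas are the standard ones; this is not Duke's own
     reporting code.

         public class LinkageMetrics {
           public static void main(String[] args) {
             double correctFound = 206, totalCorrect = 217, wrongFound = 0;

             double precision = correctFound / (correctFound + wrongFound);      // 1.000
             double recall    = correctFound / totalCorrect;                     // 0.949
             double f         = 2 * precision * recall / (precision + recall);   // 0.974

             System.out.printf("precision=%.3f recall=%.3f f=%.3f%n", precision, recall, f);
           }
         }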
  33. Examples

  34. Choosing the right match

  35. An example of failure
     - Duke doesn't find this match
       - no tokens match exactly, so the Lucene search finds nothing
       - the detailed comparison gives the correct result, so the only problem is the Lucene search
     - Lucene does have Levenshtein search, but
       - in Lucene 3.x it's very slow, so it is not enabled for now
       - thinking of adding an option to enable it where needed
       - Lucene 4.x will fix the performance problem

  36. The effects of value matching

  37. Duke in real life
     - Usage at Hafslund

  38. The SESAM project
     - Building a new archival system
       - does automatic tagging of documents based on knowledge about the data
       - knowledge extracted from backend systems
     - Search-engine-based frontend
       - using Recommind for this
     - Very flexible architecture
       - extracted data stored in a triple store (Virtuoso)
       - all data transfer based on the SDshare protocol
       - data extracted from RDBMSs with Ontopia's DB2TM
  39. The big picture
      [Architecture diagram: the ERP, CRM, and Billing systems (which contain duplicates) feed the Virtuoso triple store over SDshare; Duke reads from and writes back to Virtuoso over SDshare, producing owl:sameAs and haf:possiblySameAs links; 360 and Recommind are fed from Virtuoso over SDshare]
  40. Experiences so far
     - Incremental processing works fine
       - links added and retracted as the data changes
     - Performance not an issue at all
       - but then, only ~50,000 records so far...
     - Matching works well, but not perfectly
       - the data are very noisy and messy
       - also, matching is hard

  41. Duke roadmap
     - 0.3
       - clean up the public API and document it properly
       - maybe some more comparators
       - support for writing owl:sameAs to a SPARQL endpoint
     - 0.4
       - add a web service interface
     - 0.5 and onwards
       - more comparators
       - maybe some parallelism

  42. Slides will appear on http://slideshare.net/larsga
     - Comments/questions?
