Paper presentation @ SEBD '09

488 views

Published on

Time-completeness trade-offs in record linkage using Adaptive query Processing

Published in: Technology
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
488
On SlideShare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
2
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Paper presentation @ SEBD '09

  1. 1. Time-completeness trade-offs in record linkage using Adaptive query Processing Paolo Missier, Alvaro A. A. Fernandes School of Computer Science, University of Manchester Roald Lengu, Giovanna Guerrini DISI, Universita di Genova, Italy Marco Mesiti DiCo, Universita di Milano, Italy SEBD 2009, Camogli, Italy June, 2009
  2. 2. Conclusions• Optimistic approach to online record linkage – Based on implicit referential integrity assumption – When assumption not true, goes back to pessimistic• Technical approach based on autonomic computing – Adaptive query processing – Mix of exact / approximate physical join operators• Applications: on-the-fly integration scenarios (mashups, personal dataspaces, sensor data streams)• Need to relax the referential integrity assumption -- use other sources for result size estimationP. Missier, SEBD, June 2009, Camogli, Italy
  3. 3. Context: On-the-fly data integration • “Situational Applications” • Data mashups and personal dataspaces • slight mismatches in records lead to incomplete integration – due to different encodings, conventions – due to errors in data R A=B S C D A B A Madchester Y X Manchester A case of record linkage Manchester Z Y X Manchester Manchester Z 3P. Missier, SEBD, June 2009, Camogli, Italy
  4. 4. Offline vs online linkage • Offline record linkage: – performed once before queries involving joins – 1. reconcile R and S on joining attributes R.A, S.B using your favourite record linkage technique R,S → R’,S’ – 2. perform regular equijoin on the transformed tables: R S ➡ ok for tables that can be analysed ahead of the join ➡ ok when multiple queries issued on integrated tables 4P. Missier, SEBD, June 2009, Camogli, Italy
  5. 5. Offline vs online linkage • Offline record linkage: – performed once before queries involving joins – 1. reconcile R and S on joining attributes R.A, S.B using your favourite record linkage technique R,S → R’,S’ – 2. perform regular equijoin on the transformed tables: R S ➡ ok for tables that can be analysed ahead of the join ➡ ok when multiple queries issued on integrated tables • Online linkage: – performed while answering a query – exact join similarity (or approximate) join 4P. Missier, SEBD, June 2009, Camogli, Italy
  6. 6. Record linkage and similarity joins Historical timeline: from: N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: similarity measures and algorithms. Tutorial in SIGMOD 06. 5P. Missier, SEBD, June 2009, Camogli, Italy
  7. 7. Record linkage and similarity joins Historical timeline: from: N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: similarity measures and algorithms. Tutorial in SIGMOD 06. 5P. Missier, SEBD, June 2009, Camogli, Italy
  8. 8. Measuring string similarity using q-grams• q-grams map string s to a set q(s) of substrings of length q: Ex.: 3-grams: q(“Manchester”) = {‘Man’, ‘anc’, ‘nch’, ‘che’, ‘hes’, ‘est ’, ‘ste’, ‘ter’}. q(“Madchester”) = {‘Mad’, ‘adc’, ‘dch’, ‘che’, ‘hes’, ‘est ’, ‘ste’, ‘ter’}. |q(s1 ) ∩ q(s2 )| sim(s1 , s2 ) = (Jaccard coefficient) |q(s1 ) ∪ q(s2 )| sim(r1 , r2 ) < θ1 → not match θ2 < sim(r1 , r2 ) → matchP. Missier, SEBD, June 2009, Camogli, Italy
  9. 9. Similarity symmetric hash join R S insert insert R hash table probe S hash table set intersection computed m x using a variation of the probe n y y r symmetric hash join x s new primitive join operator [CGK06] [R.m,S.s] [R.n, S.r] • Main sources of complexity: – overhead for storing and indexing q-grams – cost of computing set intersection Efficient relational representation: [CGK06] S. Chaudhuri, V. Ganti and R. Kaushik, “A primitive operator for similarity joins in data cleaning” (ICDE’06)‫‏‬ 7P. Missier, SEBD, June 2009, Camogli, Italy
  10. 10. Is full similarity join always necessary?• Pessimistic: always pays full complexity cost• Typical mismatch rate in real datasets around 5% Research Goal: explore optimistic approach • detect mismatches and react as you go • requires estimates of incremental join result size • statistical + reactive • expect to sacrifice join result completeness for faster execution Approach: - combine exact and approximate join operators using Adaptive Query Processing techniques 8P. Missier, SEBD, June 2009, Camogli, Italy
  11. 11. Autonomic computing framework [KC03] J. O. Kephart and D. M. Chess. The vision of autonomic computing. IEEE Computer, 36(1):41–50, 2003. 9P. Missier, SEBD, June 2009, Camogli, Italy
  12. 12. Autonomic computing framework monitor respond assess 9P. Missier, SEBD, June 2009, Camogli, Italy
  13. 13. Autonomic computing framework monitor incremental join result size monitor respond assess 9P. Missier, SEBD, June 2009, Camogli, Italy
  14. 14. Autonomic computing framework monitor incremental join result size monitor estimate join result size respond assess compute divergence from exp 9P. Missier, SEBD, June 2009, Camogli, Italy
  15. 15. Autonomic computing framework monitor incremental join result size monitor estimate join result size respond assess compute swap divergence join from exp operators • when using exact join: • if observed/estimated sizes diverge “too much”, then switch to approximate join • when using approximate join: • if observed/estimated converge, then switch to exact join 9P. Missier, SEBD, June 2009, Camogli, Italy
  16. 16. Technical approach and challenges • Assess: – Need additional assumption for result size estimation – Estimating result size at specific points during join execution • Respond: – When and how can physical join operators switched? – Can we avoid loss of work? • operator replacement in pipelined query plans [EFP06] [EFP06] K. Eurviriyanukul, A. A. A. Fernandes, and N. W. Paton. A foundation for the replacement of pipelined physical join operators in adaptive query processing. In EDBT Workshops 2006, LNCS 4254 10P. Missier, SEBD, June 2009, Camogli, Italy
  17. 17. Assessment• Expectation of matching records• Used to derive simple result size estimation model R (parent) S (child) symmetric hash c x n join, again y b d y x a• When there are no mismatches: after scanning n < |S| tuples on S: P(a=x in |S| has been matched) = P(tuple c=x is in top n of R) = n/|R| Thus, join result size On after n tuples is a binomial random variable: n On ∼ bin(n, ) |R| 11P. Missier, SEBD, June 2009, Camogli, Italy
  18. 18. Detecting divergent observed result size ¯Observation On is outlier wrt expected result size On=> divergence ¯ Pn,p(n) (On ≤ O) ≤ θoutwhere Pn,p(n) (.) is the binomial cdf with parameters n, p(n) 12
  19. 19. Detecting divergent observed result size ¯Observation On is outlier wrt expected result size On=> divergence ¯ Pn,p(n) (On ≤ O) ≤ θoutwhere Pn,p(n) (.) is the binomial cdf with parameters n, p(n) outlier detection on experimental datasets with various mismatch patterns 12
  20. 20. Detecting divergent observed result size ¯Observation On is outlier wrt expected result size On=> divergence ¯ Pn,p(n) (On ≤ O) ≤ θoutwhere Pn,p(n) (.) is the binomial cdf with parameters n, p(n) outlier detection on experimental datasets with various mismatch patterns Note: when the data does not follow our referential constraint hypothesis, the model leads to a pure similarity join 12
  21. 21. Condition for operator replacement• Goal of AQP: – switch operators without loss of intermediate work• sufficient condition: switch when the operator reaches a quiescent state [EPF06][EPF06] K. Eurviriyanukul, A. A. A. Fernandes, and N. W. Paton. A foundation for thereplacement of pipelined physical join operators in adaptive query processing. In EDBT 13Workshops 2006, LNCS 4254
  22. 22. Adaptivity: responder state machine Each state corresponds to a combination of physical join operators left: exact left: approximate right: exact right: approximate left: exact left: approximate right: approximate right: exact 14P. Missier, SEBD, June 2009, Camogli, Italy
  23. 23. Rationale for state transitions lex / rex evidence that lex / lap / evidence that leftleft and /or right rap rex and /or right input input perturbed no longer perturbed lap / rap this is formalized in terms of predicates on the observable variables - outlier detection and more (omitted)P. Missier, SEBD, June 2009, Camogli, Italy
  24. 24. Experimental evaluation• Marginal gain of hybrid algorithm: – level of completeness • R: result size for approx join only • r: result size for exact only • rabs: result size actually observed grel = (rabs – r) / (R – r)‫‏‬ 16P. Missier, SEBD, June 2009, Camogli, Italy
  25. 25. Experimental evaluation• Marginal gain of hybrid algorithm: – level of completeness • R: result size for approx join only • r: result size for exact only • rabs: result size actually observed grel = (rabs – r) / (R – r)‫‏‬ • Marginal Cost: – baseline: exact join throughout • model marginal cost of hybrid algorithm unit cost of executing one step in one state – (experimental) • number of steps in each state • unit state transition cost (experimental) • number of state transitions over entire join 16P. Missier, SEBD, June 2009, Camogli, Italy
  26. 26. Test datasets Datasets chosen as representative of 4 distinct patterns we expect our results to vary: • uniform perturbation: evidence grows slowly => slow reaction • bursty perturbation: strong evidence => timely reactionP. Missier, SEBD, June 2009, Camogli, Italy
  27. 27. ResultsP. Missier, SEBD, June 2009, Camogli, Italy
  28. 28. ResultsP. Missier, SEBD, June 2009, Camogli, Italy
  29. 29. Conclusions• Optimistic approach to online record linkage – Based on implicit referential integrity assumption – When assumption not true, goes back to pessimistic• Technical approach based on autonomic computing – Adaptive query processing – Mix of exact / approximate physical join operators• Applications: on-the-fly integration scenarios (mashups, personal dataspaces, sensor data streams)• Need to relax the referential integrity assumption -- use other sources for result size estimationP. Missier, SEBD, June 2009, Camogli, Italy

×