Invited talk @ Cardiff University, 2008: Approximate entity reconciliation for on-the-fly integration in data mashups


  1. Approximate entity reconciliation for on-the-fly integration in data mashups
     Paolo Missier, Alvaro A. A. Fernandes (School of Computer Science, University of Manchester)
     Roald Lengu, Giovanna Guerrini (DISI, Università di Genova, Italy)
     Marco Mesiti (DiCo, Università di Milano, Italy)
  2. Outline
     • New data integration scenarios:
       – occasional integration with little prior knowledge about the sources
     • Context: data mashups and personal dataspaces
     • How to ensure that we are not missing any data in the process?
       – how costly (in response time) is it to guarantee completeness?
       – can we trade completeness for response time?
     • Technically speaking, a convergence of:
       – record linkage (an old data quality favourite)
       – approximate joins
       – adaptive query processing
  3. Early example
     • sources 1..n: collection of car insurance DBs
       – data changes frequently
       – schemas can be analysed / integrated using traditional techniques
     • source n+1: reference street atlas
     • target app: mapping accident hotspots
       – alert service to drivers, for example
       – useful information for decision makers
     (image from housingmaps.com)
  5. Mashups: the IBM view, 2006
     VLDB 2006 keynote by Anant Jhingran (CTO, Information Management, IBM Silicon Valley Laboratory, San Jose, CA): "Enterprise information mashups: integrating information, simply"
     Situational applications:
     • applications that come together to solve some immediate business problem
     • constructed "on the fly" for some transient need, and possibly short-lived
     • data never seen before, consumed on the spot
       – would take too long for the IT department to provide them
       – RSS feeds / data streams
  6. IBM Mashup Center
     • mashup workflow
     • leverages Lotus, DB2 plus LDAP, Web Services, ...
  7. Yahoo Pipes
     • Is there actually a "join" in the set of operators?
     • Also Google Mashup Editor, and more...
  8. Dataspaces (slides 8 to 10: figures only)
  11. Integration in dataspaces (slides 11 to 13: figures only)
  14. Assumptions
     – no prior knowledge of the data sets (streams) to be joined
     – assumptions on implicit parent-child attribute relationships
     – no guarantee of matching values
     Running example:
     • sources 1..n: collection of car insurance DBs
     • source n+1: reference street atlas
     • target app: mapping accident hotspots
  15. The broad context: record linkage
     • Are two (slightly) different records two different surface representations of the same real-world entity?
       – Record values incomplete:
           Name: John Smith; SSN: ; Address: 477 Cedar Street
           Name: John Smith; SSN: 123-45-6789; Address:
       – Twins or typo?
           Brendan Hughes; Address: 564 Hickory Pl.
           Brenda Hughes; Address: 564 Hickory Pl.
       – Conflict between forenames and phone number:
           Name: Jean Smith; Phone #: (337) 555-6676
           Name: ; Phone #: (337) 555 5676
       – Same SSN, different names: ??
           Name: Alice Jones; SSN: 123-45-6789
           Name: Lois Avon; SSN: 123-45-6789
     • A difficult / uncertain decision process:
       – which attributes should be considered for matching?
       – what are their relative weights?
       – context: relative frequency of values?
       – external knowledge, user input
  17. Results on record linkage
     A mature field with ample literature:
     – 1969: I.P. Fellegi and A.B. Sunter, "A Theory for Record Linkage," J. Am. Statistical Assoc., vol. 64, no. 328, pp. 1183-1210, Dec. 1969
     – 2007: A. K. Elmagarmid, P. G. Ipeirotis, V. S. Verykios, "Duplicate Record Detection: A Survey", IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 1, Jan 2007
     Embedded slide from "Record Linkage: Similarity Measures and Algorithms", Nick Koudas (University of Toronto), Sunita Sarawagi (IIT Bombay), Divesh Srivastava (AT&T Labs-Research), SIGMOD 2006 Data Quality tutorial:
     Application: Merging Lists
     • Application: merge address lists (customer lists, company lists) to avoid redundancy
     • Current status: "standardize", different values treated as distinct for analysis
       – lot of heterogeneity
       – need approximate joins
     • Relevant technologies
       – approximate joins
       – clustering/partitioning
  21. Offline vs online linkage
     • Offline linkage:
       – performed once, before any queries involving joins
       – reconcile R and S on joining attributes R.A, S.B using your favourite record linkage technique: R → R′, S → S′
       – perform a regular equijoin on the transformed tables: R′ ⋈ S′
       ➡ ok for tables that can be analysed ahead of the join
       ➡ suitable when multiple queries are issued on the integrated tables
     • Online linkage:
       – performed just-in-time before a query
       – the exact join is replaced by an approximate join
  23. Integration with approximate joins
     • Assume relational data: tables R, S
     • Assume schema integration is understood
       – we focus on data integration only
     • Ultimately, data integration involves joining tables: $R \bowtie_{A=B} S$
     • An ordinary "exact" match misses out on similar values
     • This compromises integration completeness
     (figure: example join in which the value "Mcrosoft" in one table fails to match "Microsoft" in the other)
  24. Approximate joins
     Historical timeline (figure), from: N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: similarity measures and algorithms. Tutorial in SIGMOD 06.
  27. Edit distance / similarity functions
     • Core sub-problem in approximate join:
       – define / choose a distance function between values in pairs of joining attributes
     1. Similarity function $sim(r_1, r_2)$ between record pairs $r_1, r_2$
     2. Decision rules of the form:
          $sim(r_1, r_2) < \theta_1$ → not match
          $\theta_1 \le sim(r_1, r_2) \le \theta_2$ → unknown
          $\theta_2 < sim(r_1, r_2)$ → match
     A common choice of similarity function in the context of approximate joins is one based on string q-grams.
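The three-way decision rule on this slide can be sketched as a small function. This is only an illustration: the function name `classify` and the threshold values used in the example are assumptions, not something the slides specify.

```python
def classify(sim: float, theta1: float, theta2: float) -> str:
    """Map a similarity score to a three-way linkage decision.

    theta1 and theta2 are the lower and upper thresholds of the decision
    rule; scores between them (inclusive) are left undecided.
    """
    if theta1 > theta2:
        raise ValueError("theta1 must not exceed theta2")
    if sim < theta1:
        return "not match"
    if sim > theta2:
        return "match"
    return "unknown"


# Hypothetical thresholds, chosen only for illustration.
print(classify(0.2, 0.4, 0.8))
print(classify(0.5, 0.4, 0.8))
print(classify(0.9, 0.4, 0.8))
```

Pairs that fall in the "unknown" band are the ones that would typically need extra evidence, such as external knowledge or user input.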
  29. Measuring string similarity using q-grams
     • q-grams map a string s to a set q(s) of substrings of length q. Example with 3-grams:
       q("Microsoft Corporation") = {'Mic', 'icr', 'cro', 'ros', 'oso', 'sof', 'oft', 'ft ', 't C', ' Co', 'Cor', 'orp'}
       q("Mcrosoft Corporation") = {'Mcr', 'cro', 'ros', 'oso', 'sof', 'oft', 'ft ', 't C', ' Co', 'Cor', 'orp', 'rp#'}
     • Jaccard coefficient:
       $sim(s_1, s_2) = \frac{|q(s_1) \cap q(s_2)|}{|q(s_1) \cup q(s_2)|}$
     This is a commonly used measure of string similarity.
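A minimal sketch of the q-gram Jaccard similarity described on this slide. The helper names `qgrams` and `jaccard` are illustrative, and the sketch uses plain unpadded q-grams, which is an assumption; the slide does not say how string ends are padded.

```python
def qgrams(s: str, q: int = 3) -> set[str]:
    """Return the set of all substrings of length q in s (no padding)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def jaccard(s1: str, s2: str, q: int = 3) -> float:
    """Jaccard coefficient of the q-gram sets of s1 and s2."""
    g1, g2 = qgrams(s1, q), qgrams(s2, q)
    union = g1 | g2
    if not union:
        # Both strings are shorter than q: fall back to exact equality.
        return 1.0 if s1 == s2 else 0.0
    return len(g1 & g2) / len(union)


print(sorted(qgrams("Mcrosoft")))
print(jaccard("Microsoft", "Mcrosoft"))
```

For "Microsoft" and "Mcrosoft" the two sets share 5 of 8 distinct 3-grams, so a single dropped character still leaves a similarity of 0.625, whereas an exact match would simply fail.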
  30. Online linkage using q-grams
     – the approximate join is a θ-join: $R \bowtie_{\theta_{A,B}} S$
     – where $\theta_{A,B}$ incorporates a similarity measure, e.g. Jaccard
     • Naïve method: for each record pair, compute the similarity score
       – I/O and CPU intensive, not scalable
     • Goal: reduce the $O(n^2)$ cost to $O(n \cdot w)$, where $w \ll n$
       – reduce the number of pairs on which similarity is computed
       – take advantage of efficient relational join methods
  31. Efficient relational approximate joins
     Idea: reduce the approximate join to an aggregated set intersection. The q-gram count is a necessary condition for a match:
       $dis(s_1, s_2) \le d$ only if $|q(s_1) \cap q(s_2)| \ge \max(|s_1|, |s_2|) - (d - 1) \times q - 1$
     In practice:
     • known similarity measures can be used to compare pairs of records
     • cheap filters (length, count, position) prune non-matches
     • implementation using standard SQL
       – cost-based join methods
     Efficient relational representation: [CGK06] S. Chaudhuri, V. Ganti and R. Kaushik, "A primitive operator for similarity joins in data cleaning" (ICDE'06)
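The count filter on this slide can be sketched as follows. The sketch assumes the usual formulation from the q-gram literature: strings are padded with q − 1 marker characters at each end and q-grams are counted as a multiset. The function names and the padding characters `#` and `$` are illustrative assumptions (the input strings must not contain them). The filter only prunes candidates; pairs that pass still need an exact edit-distance check.

```python
from collections import Counter


def padded_qgrams(s: str, q: int = 3) -> Counter:
    """Multiset of q-grams of s, padded with q-1 markers at each end."""
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def passes_count_filter(s1: str, s2: str, d: int, q: int = 3) -> bool:
    """Necessary condition for edit distance(s1, s2) <= d.

    Returns False only when the pair certainly cannot be within distance d.
    """
    shared = sum((padded_qgrams(s1, q) & padded_qgrams(s2, q)).values())
    bound = max(len(s1), len(s2)) - (d - 1) * q - 1
    return shared >= bound


print(passes_count_filter("Microsoft", "Mcrosoft", d=1))
print(passes_count_filter("Microsoft", "Oracle", d=1))
```

Because the test is a comparison of a count against a bound, it maps naturally onto a relational GROUP BY with a HAVING clause over a q-gram table, which is what makes an implementation in standard SQL possible.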
  32. Is full approximate join always necessary?
     • Remaining sources of complexity:
       – overhead for storing and indexing q-grams
       – cost of computing set intersection
     • Typical mismatch rate in real datasets is around 5%
     • The complexity of a full-fledged approximate join is not fully justified
     Research hypothesis: time-completeness trade-offs. Offer users the option to trade completeness of integration for the time required to complete the join.
  33. Adaptive query processing
     Idea: implement a hybrid join algorithm that combines exact and approximate join.
     Intuition: leverage known results on Adaptive Query Processing
     – developed in the context of query re-optimization
     – switch physical join operators in mid-flight
     [DIR07] A. Deshpande, Z. G. Ives, and V. Raman. Adaptive query processing. Foundations and Trends in Databases, 1(1):1–140, 2007
     See also the VLDB 2007 tutorial at http://www.vldb2007.org/program/slides/s1426-deshpande.pdf
  34. Autonomic computing framework
     [KC03] J. O. Kephart and D. M. Chess. The vision of autonomic computing. IEEE Computer, 36(1):41–50, 2003.
     (figure: monitor → assess → respond loop)
     • monitor: incremental result size
     • assess: estimate result size, compute divergence
     • respond: switch join operators
     Start with an exact join (optimistically). At step t during the execution:
     • estimate the expected size of the join result Ōt at that point
     • monitor the actual size Ot of the result
       – when using exact join: if Ōt and Ot diverge "too much", then switch to approximate join
       – when using approximate join: if Ōt and Ot are very close, then switch to exact join
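The monitor / assess / respond loop on this slide can be sketched as a tiny controller. This is a simplified illustration, not the authors' implementation: the class name, the use of a relative gap as the divergence measure, and both threshold values are assumptions made for the example. The slides later replace the simple gap with a statistical outlier test.

```python
class HybridJoinController:
    """Decides, step by step, whether to run the exact or approximate join."""

    def __init__(self, diverge_threshold: float = 0.2,
                 converge_threshold: float = 0.05) -> None:
        self.mode = "exact"  # start with an exact join, optimistically
        self.diverge_threshold = diverge_threshold    # hypothetical value
        self.converge_threshold = converge_threshold  # hypothetical value

    def step(self, expected: float, observed: float) -> str:
        """Assess the gap between expected and observed size, then respond."""
        gap = abs(expected - observed) / max(expected, 1.0)
        if self.mode == "exact" and gap > self.diverge_threshold:
            self.mode = "approximate"
        elif self.mode == "approximate" and gap < self.converge_threshold:
            self.mode = "exact"
        return self.mode


controller = HybridJoinController()
for expected, observed in [(100, 95), (100, 60), (100, 90), (100, 99)]:
    print(expected, observed, controller.step(expected, observed))
```

Using two different thresholds gives the controller some hysteresis, so it does not flip operators on every small fluctuation of the observed result size.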
  37. Technical approach and challenges
     Need to add several new capabilities to a standard query processing infrastructure:
     • Assess:
       – estimating result size at specific points during join execution
     • Respond:
       – switching between join operators at specific points during execution
         • Adaptive Query Processing (AQP): operator replacement in pipelined query plans [EFP06]
       – adding an approximate join operator to the query processor [CGK06]
     [EFP06] K. Eurviriyanukul, A. A. A. Fernandes, and N. W. Paton. A foundation for the replacement of pipelined physical join operators in adaptive query processing. In EDBT Workshops 2006, LNCS 4254
     [CGK06] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE 2006, p. 5.
  38. Symmetric hash join
     Well-known join operator:
     – basis for approximate join [CGK06]
     – can be applied to streams of data
       • it can read tuples from whichever input is available, and it incrementally produces output based on the tuples received so far
     – a pipelined operator ← this is a key requirement for use in AQP
     When a tuple appears at either input, it is incrementally added to the corresponding hash table (build) and probed against the opposite hash table (probe).
     (figure: the R hash table holds tuples m and n with join values x and y; the S hash table holds tuples r and s with join values y and x; probing produces the results [R.m, S.s] and [R.n, S.r])
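The build-then-probe behaviour described on this slide can be sketched as below. This is a minimal in-memory illustration of a symmetric hash join on exact key equality; the class and method names are assumptions, and the approximate variant built on top of it in [CGK06] is not shown.

```python
from collections import defaultdict


class SymmetricHashJoin:
    """Pipelined equijoin: each arriving tuple is built, then probed."""

    def __init__(self) -> None:
        # One hash table per input, keyed on the join attribute value.
        self.tables = {"R": defaultdict(list), "S": defaultdict(list)}

    def insert(self, side: str, key: str, tuple_id: str) -> list[tuple[str, str]]:
        """Add a tuple arriving on `side` and return the new join results."""
        if side not in self.tables:
            raise ValueError("side must be 'R' or 'S'")
        other = "S" if side == "R" else "R"
        self.tables[side][key].append(tuple_id)       # build
        matches = self.tables[other].get(key, [])     # probe
        if side == "R":
            return [(tuple_id, m) for m in matches]
        return [(m, tuple_id) for m in matches]


# Tuples arrive interleaved from both inputs, as in the slide's figure.
join = SymmetricHashJoin()
print(join.insert("R", "x", "m"))
print(join.insert("S", "y", "r"))
print(join.insert("R", "y", "n"))
print(join.insert("S", "x", "s"))
```

Results are emitted as soon as both matching tuples have arrived, regardless of which input delivered its tuple first, which is what makes the operator usable on streams.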
  46. Estimating result size
     • Exploit the implicit parent-child key assumption:
       – at the end of the join, we expect a result of size |S|
     (figure: R (parent) with key attribute c, S (child) with referencing attribute a)
     • When there are no mismatches, after scanning n < |S| tuples on S:
       P(tuple with a = x in S has been matched) = P(tuple with c = x is in the top n of R) = n/|R|
     Thus, the join result size $O_n$ is a binomial random variable:
       $O_n \sim bin\left(n, \frac{n}{|R|}\right)$
  47. Detecting divergent observed result size
     Observation $\bar{O}_n$ is an outlier with respect to the expected result size $O_n$ after n tuples have been scanned, if:
       $P_{n,p(n)}(O_n \le \bar{O}_n) \le \theta_{out}$
     where $P_{n,p(n)}(\cdot)$ is the cumulative distribution function for a binomial with parameters n, p(n).
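A sketch of the outlier test, under the binomial model from the previous slide with p(n) = n/|R|. The function names and the default value of `theta_out` are assumptions for illustration; the slides tune the thresholds empirically and do not give values. The sketch also assumes n ≤ |R| so that p(n) is a valid probability.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p), computed by direct summation."""
    k = min(k, n)
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))


def expected_result_size(n: int, r_size: int) -> float:
    """Mean of Binomial(n, n/|R|): the expected result size after n tuples."""
    return n * n / r_size


def is_outlier(observed: int, n: int, r_size: int,
               theta_out: float = 0.05) -> bool:
    """True if the observed result size is improbably small after n tuples."""
    p = n / r_size  # p(n) from the no-mismatch model
    return binom_cdf(observed, n, p) <= theta_out


print(expected_result_size(10, 20))
print(is_outlier(observed=0, n=10, r_size=20))
print(is_outlier(observed=5, n=10, r_size=20))
```

The test is one-sided: only a result that is too small relative to the model counts as evidence of mismatches, since missing matches are what an exact join produces on perturbed data.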
  49. Instantiating the MAR framework
     (figure: monitor → assess → respond loop, with completed stages ticked)
     • monitor ✔: incremental result size $O_n$
     • assess ✔: estimate result size, compute divergence predicates σ(t), µ(t), π(t)
     • respond: switch join operators
     Predicates:
     • $\sigma(n) \equiv P_{n,p(n)}(O_n \le \bar{O}_n) \le \theta_{out}$: discrepancy detected
     • $\mu_i(t) \equiv \frac{A_{t,W}}{W} \le \theta_{curpert}$: current perturbations on left/right?
     • $\pi_i(t) \equiv \sum_{t' < t} I(\mu_i(t')) \le \theta_{pastpert}$: past perturbations on left/right?
  52. Responder's state machine
     • Operator switch defined in terms of state transitions
     • Owing to symmetry, we can use a different operator on each of the two tables
     States:
     – left: exact, right: exact
     – left: approximate, right: approximate
     – left: exact, right: approximate
     – left: approximate, right: exact
  53. Rationale for state transitions
     (figure: state diagram over the states lex/rex, lex/rap, lap/rex and lap/rap)
     • transitions towards approximate operators: evidence that left and/or right input is perturbed
     • transitions towards exact operators: evidence that left and/or right input is no longer perturbed
     The predicates σ(t), µ(t), π(t) provide the evidence needed to drive the transitions.
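  54. Assessment → state transitions
     $\sigma(n) \equiv P_{n,p(n)}(O_n \le \bar{O}_n) \le \theta_{out}$
     $\mu_i(t) \equiv \frac{A_{t,W}}{W} \le \theta_{curpert}$
     $\pi_i(t) \equiv \sum_{t' < t} I(\mu_i(t')) \le \theta_{pastpert}$
     $\varphi_0(t) = \neg\sigma(t) \wedge \mu_{left}(t) \wedge \mu_{right}(t)$
     $\varphi_1(t) = \sigma(t) \wedge \neg\mu_{left}(t) \wedge \neg\mu_{right}(t)$
     $\varphi_2(t) = \sigma(t) \wedge \neg\mu_{left}(t) \wedge \mu_{right}(t) \wedge \pi_{left}(t)$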
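The three transition conditions on this slide are plain boolean combinations of the assessment predicates, which can be sketched as below. The function name and the dictionary layout are illustrative assumptions. The slide does not spell out which state transition each condition triggers, so the sketch only evaluates the conditions and deliberately stops short of changing state.

```python
def transition_predicates(sigma: bool, mu_left: bool, mu_right: bool,
                          pi_left: bool) -> dict[str, bool]:
    """Evaluate phi0, phi1 and phi2 from the assessment predicates at time t."""
    return {
        "phi0": (not sigma) and mu_left and mu_right,
        "phi1": sigma and (not mu_left) and (not mu_right),
        "phi2": sigma and (not mu_left) and mu_right and pi_left,
    }


# No discrepancy, and both perturbation predicates hold.
print(transition_predicates(False, True, True, False))
# Discrepancy detected, and neither perturbation predicate holds.
print(transition_predicates(True, False, False, False))
```

The three conditions are mutually exclusive by construction: phi0 requires ¬σ while the other two require σ, and phi1 and phi2 disagree on µ_right.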
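  55. Completing the loop
     (figure: monitor → assess → respond loop with all stages ticked, annotated with $\delta_{adapt}$)
     • monitor ✔: incremental result size $O_n$
     • assess ✔: estimate result size, compute divergence
     • respond ✔: switch join operators
     Assessment predicates:
     $\sigma(n) \equiv P_{n,p(n)}(O_n \le \bar{O}_n) \le \theta_{out}$
     $\mu_i(t) \equiv \frac{A_{t,W}}{W} \le \theta_{curpert}$
     $\pi_i(t) \equiv \sum_{t' < t} I(\mu_i(t')) \le \theta_{pastpert}$
     Transition conditions:
     $\varphi_0(t) = \neg\sigma(t) \wedge \mu_{left}(t) \wedge \mu_{right}(t)$
     $\varphi_1(t) = \sigma(t) \wedge \neg\mu_{left}(t) \wedge \neg\mu_{right}(t)$
     $\varphi_2(t) = \sigma(t) \wedge \neg\mu_{left}(t) \wedge \mu_{right}(t) \wedge \pi_{left}(t)$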
  56. Note on operator replacement
     • Details on how to switch operators on the fly are omitted
       – main point: pipelined operators expose specific quiescent states where replacement can take place with no loss of work [EFP06]
     [EFP06] K. Eurviriyanukul, A. A. A. Fernandes, and N. W. Paton. A foundation for the replacement of pipelined physical join operators in adaptive query processing. In EDBT Workshops 2006, LNCS 4254
  58. Experimental evaluation: trade-off analysis
     • Benefits:
       – achieved level of result completeness
       – baseline: approximate join throughout
         • model the marginal gain of the hybrid algorithm
     • Cost:
       – baseline: exact join throughout
         • model the marginal cost of the hybrid algorithm
  59. Test datasets
     Datasets chosen as representative of 4 distinct patterns across which we expect our results to vary, for example:
     • uniform perturbation: evidence grows slowly => slow reaction
     • bursty perturbation: strong evidence => timely reaction
  60. Parameter tuning and gain/cost models
     • Each of the MAR parameters was tuned empirically
     • Experiments were executed using the best possible configuration
     • Nice result: the parameter setting is largely independent of the specific variant pattern
     Relative gain $g_{rel}$:
     • R: result size for approximate join only
     • r: result size for exact join only
     • $r_{abs}$: result size actually observed
       $g_{rel} = \frac{r_{abs} - r}{R - r}$
     (details on cost model omitted)
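As a worked illustration of the relative gain formula, a minimal sketch follows. The function name and the example result sizes are made up for illustration and are not taken from the experiments.

```python
def relative_gain(r_abs: float, r_exact: float, r_approx: float) -> float:
    """g_rel = (r_abs - r) / (R - r).

    r_exact is r (exact join only), r_approx is R (approximate join only),
    and r_abs is the result size the hybrid algorithm actually produced.
    """
    if r_approx == r_exact:
        raise ValueError("no mismatches: exact and approximate results coincide")
    return (r_abs - r_exact) / (r_approx - r_exact)


# Hypothetical sizes: exact finds 900 matches, approximate 1000, hybrid 950.
print(relative_gain(r_abs=950, r_exact=900, r_approx=1000))
```

A gain of 0 means the hybrid did no better than the exact join, and a gain of 1 means it recovered everything the full approximate join would have found.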
  61. Cost model
     • unit cost of executing one step in state i: $w_i$
       – weights determined experimentally
     • number of steps in each state: $t_i$
     • unit state transition cost (experimental): $v_i$
     • number of state transitions: $tr_i$
     Total absolute cost: $c_{abs} = \sum_i sc_i + \sum_i tc_i$
     Relative cost, with c the best cost (exact only) and C the worst cost (approximate only):
       $c_{rel} = \frac{c_{abs}}{C - c}$
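The cost model can be sketched as below. The slide does not define $sc_i$ and $tc_i$ explicitly; the sketch assumes $sc_i = w_i \cdot t_i$ (cost of the steps spent in state i) and $tc_i = v_i \cdot tr_i$ (cost of the transitions of kind i), which is the natural reading of the listed quantities but remains an assumption. Function names and the example numbers are likewise illustrative.

```python
def absolute_cost(step_weights: list[float], steps: list[int],
                  transition_weights: list[float],
                  transitions: list[int]) -> float:
    """c_abs = sum_i sc_i + sum_i tc_i, assuming sc_i = w_i*t_i, tc_i = v_i*tr_i."""
    state_cost = sum(w * t for w, t in zip(step_weights, steps))
    transition_cost = sum(v * tr for v, tr in zip(transition_weights, transitions))
    return state_cost + transition_cost


def relative_cost(c_abs: float, c_exact: float, c_approx: float) -> float:
    """c_rel = c_abs / (C - c), as stated on the slide."""
    if c_approx == c_exact:
        raise ValueError("exact and approximate costs coincide")
    return c_abs / (c_approx - c_exact)


# Hypothetical run: 10 cheap exact steps, 5 expensive approximate steps,
# and one transition in each direction.
c_abs = absolute_cost([1.0, 3.0], [10, 5], [2.0, 2.0], [1, 1])
print(c_abs)
print(relative_cost(c_abs, c_exact=15.0, c_approx=45.0))
```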
  62. Results (slides 62 and 63: figures only)
  64. Discussion
     • Results are similar across different variant patterns
       – good!
     • Transition cost is not overwhelming:
       – we never pay more for the hybrid than for the approximate join
       – this gives us a good space for trade-offs
       – we could let users tune the algorithm without fear of "breaking" it
  65. Conclusions
     • An exact / approximate hybrid approach to joins in the presence of violations of implicit referential integrity across tables
       – relational setting
     • Approach based on autonomic computing principles
       – adaptive query processing techniques
     • Application: on-the-fly integration scenarios (mashups, personal dataspaces)
     • Results: cost / completeness trade-off analysis
       – initial, encouraging experimental conclusions
     The study requires additional testing on real datasets.
  66. References used in the presentation
     • A. Halevy and D. Maier, Dataspaces: the Tutorial, VLDB 2008 tutorial, Auckland, NZ, Aug 2008
     • N. Koudas, S. Sarawagi, D. Srivastava, Record Linkage: Similarity Measures and Algorithms, VLDB 2006 tutorial, Seoul, Korea, 2006
     • [FS69] I.P. Fellegi and A.B. Sunter, A Theory for Record Linkage, J. Am. Statistical Assoc., vol. 64, no. 328, pp. 1183-1210, Dec. 1969
     • [EIV07] A. K. Elmagarmid, P. G. Ipeirotis, V. S. Verykios, Duplicate Record Detection: A Survey, IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 1, Jan 2007
     • [KC03] J. O. Kephart and D. M. Chess. The vision of autonomic computing. IEEE Computer, 36(1):41–50, 2003
     • [EFP06] K. Eurviriyanukul, A. A. A. Fernandes, and N. W. Paton. A foundation for the replacement of pipelined physical join operators in adaptive query processing. In EDBT Workshops 2006, LNCS 4254
     • [CGK06] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE 2006, p. 5
