
2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleaning Algorithms


We study the problem of introducing errors into clean databases for the purpose of benchmarking data-cleaning algorithms. Our goal is to provide users with the highest possible level of control over the error-generation process, and at the same time develop solutions that scale to large databases. We show in the paper that the error-generation problem is surprisingly challenging, and in fact, NP-complete. To provide a scalable solution, we develop a correct and efficient greedy algorithm that sacrifices completeness, but succeeds under very reasonable assumptions. To scale to millions of tuples, the algorithm relies on several non-trivial optimizations, including a new symmetry property of data quality constraints. The trade-off between control and scalability is the main technical contribution of the paper.

Published in: Science

  1. P. Arocena, B. Glavic, G. Mecca, R. J. Miller, P. Papotti, D. Santoro
     University of Toronto, Illinois Institute of Technology, Università della Basilicata, Arizona State University
     Sep 7th 2016
  2. Overview (P. Arocena, B. Glavic, G. Mecca, R. J. Miller, P. Papotti, D. Santoro; VLDB 2016 - Sep 7th)
     ‣ Motivations and Goals
     ‣ Main Ideas
     ‣ Optimizations
     ‣ Experimental Results
  4. Motivation
     • Data quality is a crucial task in data management
     • Many automatic and semi-automatic data-cleaning algorithms have been proposed
       • constraint-based: Beskales et al. VLDB10; Bohannon et al. SIGMOD05; Chu et al. ICDE13; Cong et al. VLDB07; Geerts et al. VLDB14; …
       • statistics-based: Berti-Equille et al. ICDE1; Dasu et al. VLDB12; Prokoshyna et al. VLDB1; Yakout et al. SIGMOD13; …
     • “What is the right tool for my data-cleaning task?”
  5. Challenges
     • No openly available tools or datasets for benchmarking data-cleaning algorithms
     • Approaches are usually evaluated by using either
       • manually generated errors: very expensive!
       • automatically introduced errors in clean data: algorithms are highly sensitive to the characteristics of the errors!
     • Need for scalable and robust evaluation
  6. Contribution
     • BART: Benchmarking Algorithms for data Repairing and Translation
       • an open-source error-generation system with a high level of control over the errors
     • Input: a database that is clean wrt a set of data-quality rules, and a set of configuration parameters
     • Output: a dirty database (obtained through a set of cell changes) and an estimate of how hard it will be to restore the original values
  7. Overview
     ‣ Motivations and Goals
     ‣ Main Ideas
       ‣ Detectability
       ‣ Repairability
       ‣ Violation-Generation Queries
     ‣ Optimizations
     ‣ Experimental Results
  8. A Motivating Example

     Player
          Name      Season   Team       Stadium           Goals
     t1   Giovinco  2013-14  Juventus   Juventus Stadium  3
     t2   Giovinco  2014-15  Toronto    BMO Field         23
     t3   Pirlo     2014-15  Juventus   Juventus Stadium  5
     t4   Pirlo     2015-16  N.Y. City  Yankee St.        0
     t5   Vidal     2014-15  Juventus   Juventus Stadium  5
     t6   Vidal     2015-16  Bayern     Allianz Arena     3

     • Quality Rules (functional dependencies)
       • (1) Name, Season → Team
       • (2) Team → Stadium
     • Represented as Denial Constraints: a very expressive language to capture most data-quality rules used for data repairing (FDs, CFDs, Cleaning EGDs, Editing Rules, Fixing Rules, Ordering Constraints)
       • (1) dc1: ¬( Player(n, s, t, st, g), Player(n’, s’, t’, st’, g’), n=n’, s=s’, t ≠ t’ )
       • (2) dc2: ¬( Player(n, s, t, st, g), Player(n’, s’, t’, st’, g’), t=t’, st ≠ st’ )
     • Violation: an instance I violates ¬(φ(x)) if there is an assignment m s.t. I ⊨ φ(m(x))
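
The violation definition on this slide can be made concrete with a small sketch. It encodes the Player table and the bodies of dc1 and dc2 as plain Python and enumerates the tuple pairs that satisfy a constraint body. The names `PLAYER`, `dc1`, `dc2`, `with_value` and `violations` are illustrative choices, not part of BART. Because both constraints contain a ≠ comparison, a tuple can never violate them together with itself, so the sketch only looks at pairs of distinct tuples.

```python
from itertools import permutations

# Player table from the slide: (name, season, team, stadium, goals)
PLAYER = {
    "t1": ("Giovinco", "2013-14", "Juventus", "Juventus Stadium", 3),
    "t2": ("Giovinco", "2014-15", "Toronto", "BMO Field", 23),
    "t3": ("Pirlo", "2014-15", "Juventus", "Juventus Stadium", 5),
    "t4": ("Pirlo", "2015-16", "N.Y. City", "Yankee St.", 0),
    "t5": ("Vidal", "2014-15", "Juventus", "Juventus Stadium", 5),
    "t6": ("Vidal", "2015-16", "Bayern", "Allianz Arena", 3),
}
NAME, SEASON, TEAM, STADIUM, GOALS = range(5)


def dc1(a, b):
    """Body of dc1: same name and season, different team."""
    return a[NAME] == b[NAME] and a[SEASON] == b[SEASON] and a[TEAM] != b[TEAM]


def dc2(a, b):
    """Body of dc2: same team, different stadium."""
    return a[TEAM] == b[TEAM] and a[STADIUM] != b[STADIUM]


def with_value(row, index, value):
    """Return a copy of `row` with one cell replaced (a cell change)."""
    return row[:index] + (value,) + row[index + 1:]


def violations(instance, body):
    """Ordered pairs of tuple ids whose tuples satisfy the constraint body."""
    return [
        (i, j)
        for i, j in permutations(instance, 2)
        if body(instance[i], instance[j])
    ]


# The original instance is clean with respect to both constraints.
clean_dc1 = violations(PLAYER, dc1)
clean_dc2 = violations(PLAYER, dc2)

# Changing t5.Stadium makes t5 disagree with t1 and t3 on dc2.
dirty = dict(PLAYER)
dirty["t5"] = with_value(PLAYER["t5"], STADIUM, "Camp Nou")
dirty_dc2 = violations(dirty, dc2)
print(dirty_dc2)
```

Each violating pair appears in both orders because an assignment may map either atom of the constraint to either tuple.
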
  9. A Motivating Example: Cell Changes
     • ch1: t5.Stadium := “Camp Nou”
       ✔ ch1 is a detectable change: dc2 is violated, since t1, t3 and t5 have the same team but different stadiums. We call {t1, t3, t5} the context equivalence class.
       ✔ ch1 is easy to correct: the original value “Juventus Stadium” appears in t1 and t3.
     • Repairability: the probability of restoring t5.Stadium to its original value by picking a Stadium value uniformly at random from its context equivalence class
     • Rep = 2 / 3 ≈ 0.67
  10. A Motivating Example: Cell Changes
     • ch2: t1.Season := “2014-15”
       ✔ ch2 is a detectable change: dc1 is violated, since t1 and t2 have the same name and season but different teams, stadiums and goals.
       ✘ ch2 is hard to correct: the original value “2013-14” disappears from the instance.
     • Repairability: 0 / 2 = 0
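
The two repairability figures above follow directly from the definition: count how often the original value still occurs among the context values, and divide by the size of the context. The sketch below is a minimal illustration under that reading. The function name `repairability` and the way the context values are passed in are assumptions made for the example.

```python
from fractions import Fraction


def repairability(context_values, original):
    """Chance that a uniform random pick from the context restores `original`."""
    hits = sum(1 for value in context_values if value == original)
    return Fraction(hits, len(context_values))


# ch1: Stadium values of {t1, t3, t5} after t5.Stadium := "Camp Nou"
rep_ch1 = repairability(
    ["Juventus Stadium", "Juventus Stadium", "Camp Nou"], "Juventus Stadium"
)

# ch2: Season values of {t1, t2} after t1.Season := "2014-15"
rep_ch2 = repairability(["2014-15", "2014-15"], "2013-14")

print(rep_ch1, rep_ch2)
```
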
  11. A Motivating Example: Cell Changes
     • ch3: t5.Name := “Pirlo” ✘ (an undetectable change)
     • ch2: t1.Season := “2014-15” ✔
     • ch4: t3.Name := “Pirlo” ✔ ✘
     • We need to keep track of the context of each change
  12. Violation-Generation Queries
     • Each comparison of a dc suggests a different strategy for finding cells to modify in order to generate detectable errors
     • Starting from a dc, we generate a set of vio-gen queries

          Name      Season   Team
     t1   Giovinco  2013-14  Juventus
     t2   Giovinco  2013-14  Juventus
     t3   Pirlo     2013-14  N.Y. City

     dc1: ¬( Player(n, s, t, st, g), Player(n’, s’, t’, st’, g’), n=n’, s=s’, t ≠ t’ )

     • vio-gen query: Player(n, s, t, st, g), Player(n’, s’, t’, st’, g’), n=n’, s=s’, t = t’
       • Result of the query: t1, t2
       • We’ll have a detectable change by making t1.Team and t2.Team different: t1.Team := “Juve” ✔
     • vio-gen query: Player(n, s, t, st, g), Player(n’, s’, t’, st’, g’), n ≠ n’, s=s’, t ≠ t’
       • Result of the query: t2, t3
       • We’ll have a detectable change by making t2.Name and t3.Name equal: t3.Name := “Giovinco” ✔
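
The idea of deriving one vio-gen query per comparison can be sketched in a few lines. The example below assumes a simplified setting that is narrower than the denial constraints handled in the paper: every comparison relates the same attribute in two tuples of one relation, and only = and ≠ occur. The names `DC1`, `vio_gen_queries` and `run` are hypothetical. Each generated query negates exactly one comparison of the constraint and keeps the others, so its results are tuple pairs that are one cell change away from a violation.

```python
import operator
from itertools import combinations

OPS = {"=": operator.eq, "!=": operator.ne}
FLIP = {"=": "!=", "!=": "="}

# dc1 as a list of (attribute, comparison) between two Player tuples
DC1 = [("name", "="), ("season", "="), ("team", "!=")]

ROWS = {
    "t1": {"name": "Giovinco", "season": "2013-14", "team": "Juventus"},
    "t2": {"name": "Giovinco", "season": "2013-14", "team": "Juventus"},
    "t3": {"name": "Pirlo", "season": "2013-14", "team": "N.Y. City"},
}


def vio_gen_queries(dc):
    """One query per comparison: negate that comparison, keep the rest."""
    return [
        [(attr, FLIP[op] if k == target else op) for k, (attr, op) in enumerate(dc)]
        for target in range(len(dc))
    ]


def run(query, rows):
    """Unordered tuple-id pairs that satisfy every comparison in `query`."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(rows.items(), 2)
        if all(OPS[op](a[attr], b[attr]) for attr, op in query)
    ]


queries = vio_gen_queries(DC1)
for query in queries:
    print(query, "->", run(query, ROWS))
```

The query that flips `team` returns the pair (t1, t2), and the query that flips `name` returns pairs that include (t2, t3), matching the two cases on the slide.
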
  13. Error-Generation Task
     • EG-Task E = {S, Σ, I, CONF}
       • S: a relational schema
       • Σ: a set of denial constraints over S
       • I: an instance over schema S that is clean wrt Σ
       • CONF: configuration parameters (% of detectable errors, % of random errors)
     • Theorem 1: Generating the requested number of detectable errors is NP-complete (data complexity)
  14. Overview
     ‣ Motivations and Goals
     ‣ Main Ideas
     ‣ Optimizations
     ‣ Experimental Results
  15. Optimizations
     • Greedy PTIME algorithm
       • two cell changes cannot share a context
       • sound but not complete
       • in practice, for low error ratios (~10-20%), the probability of success is very high
     • Main cost factor
       • executing vio-gen queries on the DBMS
       • optimizations for symmetric constraints and cross-products
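
The rule that two cell changes cannot share a context can be illustrated with a short greedy selection loop. This is only a sketch of the non-overlap rule under stated assumptions, not the algorithm from the paper: candidates are assumed to arrive as (cell, context) pairs, a context is modeled as a set of cell identifiers, and the function name `greedy_select` and the cell names `c1` … `c7` are hypothetical.

```python
def greedy_select(candidates, requested):
    """Pick up to `requested` changes whose cells and contexts never overlap."""
    used = set()  # cells already changed or belonging to an accepted context
    chosen = []
    for cell, context in candidates:
        footprint = {cell} | set(context)
        if footprint & used:
            continue  # would share a context with an earlier change: skip it
        chosen.append(cell)
        used |= footprint
        if len(chosen) == requested:
            break
    return chosen


# Hypothetical candidates: the second one overlaps the first on cell "c3".
candidates = [
    ("c1", {"c2", "c3"}),
    ("c4", {"c3", "c5"}),
    ("c6", {"c7"}),
]

two_changes = greedy_select(candidates, requested=2)
three_changes = greedy_select(candidates, requested=3)
print(two_changes, three_changes)
```

Every change that is returned keeps its context untouched, which is the sense in which such a strategy is sound. Asking for three changes returns only two here, which illustrates why a single greedy pass that never backtracks is not complete.
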
  16. Symmetric Constraints
     • Computing joins may be expensive!
     • We identify a class of DCs (that includes FDs and most CFDs) where group-by can be used to reduce the size of join inputs
     • Idea: find and execute isomorphic subqueries to avoid redundant work
     • Example: Player(n, s, t, st), Player(n’, s’, t’, st’), n=n’, s=s’, t ≠ t’
       1. Formula graph: the two Player atoms, connected by the edges n = n’, s = s’, t ≠ t’
       2. Reduced formula with adornments: Player(n=, s=, t ≠, st)
       3. Group-by query:
          SELECT name, season, team
          FROM player
          WHERE (name, season) IN
            (SELECT name, season
             FROM player
             GROUP BY name, season
             HAVING count(DISTINCT team) > 1)
          ORDER BY name, season
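
To see that the group-by formulation finds the same tuples as the self-join, the sketch below runs both against a tiny in-memory SQLite table. The table contents and the `JOIN_QUERY` text are made up for the example, and `team` is added to the ORDER BY only to make the output order deterministic. The row-value form `(name, season) IN (...)` needs SQLite 3.15 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE player (name TEXT, season TEXT, team TEXT, stadium TEXT)")
conn.executemany(
    "INSERT INTO player VALUES (?, ?, ?, ?)",
    [
        ("Giovinco", "2013-14", "Juventus", "Juventus Stadium"),
        ("Giovinco", "2013-14", "Juve", "Juventus Stadium"),  # injected error
        ("Pirlo", "2013-14", "N.Y. City", "Yankee St."),
    ],
)

# Straightforward self-join on the comparisons n=n', s=s', t != t'.
JOIN_QUERY = """
SELECT DISTINCT p1.name, p1.season, p1.team
FROM player p1 JOIN player p2
  ON p1.name = p2.name AND p1.season = p2.season AND p1.team <> p2.team
ORDER BY p1.name, p1.season, p1.team
"""

# Group-by formulation from the slide: one scan with grouping, no self-join.
GROUP_BY_QUERY = """
SELECT name, season, team
FROM player
WHERE (name, season) IN
  (SELECT name, season
   FROM player
   GROUP BY name, season
   HAVING count(DISTINCT team) > 1)
ORDER BY name, season, team
"""

join_result = conn.execute(JOIN_QUERY).fetchall()
group_result = conn.execute(GROUP_BY_QUERY).fetchall()
print(group_result)
```
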
  17. Cross Products: A Common Pattern
     • dc4: ¬( Player(n, s, t, st, g), Player(n’, s’, t’, st’, g’), t=t’, st ≠ st’ )
     • vio-gen query: Player(n, s, t, st, g), Player(n’, s’, t’, st’, g’), t ≠ t’, st ≠ st’
     • The result of the vio-gen query will be all possible pairs of players with different team and different stadium → quadratic cost
     • However: we are typically only interested in a small set of cells
     • Solution: we materialize a random sample of the tuples in Player in main memory and compute the cross product to identify cells to change and their contexts
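
The sampling idea can be sketched as follows: instead of asking the DBMS for every pair with different team and different stadium, draw a small random sample and pair it up in memory. The function name `sample_pairs`, the `seed` parameter and the row layout are assumptions for illustration, and the sample size would be tuned to the number of cells actually needed.

```python
import random
from itertools import combinations

PLAYER = [
    {"id": "t1", "team": "Juventus", "stadium": "Juventus Stadium"},
    {"id": "t2", "team": "Toronto", "stadium": "BMO Field"},
    {"id": "t3", "team": "Juventus", "stadium": "Juventus Stadium"},
    {"id": "t4", "team": "N.Y. City", "stadium": "Yankee St."},
    {"id": "t5", "team": "Juventus", "stadium": "Juventus Stadium"},
    {"id": "t6", "team": "Bayern", "stadium": "Allianz Arena"},
]


def sample_pairs(rows, sample_size, seed=0):
    """Pairs from a random sample with different team and different stadium."""
    rng = random.Random(seed)
    sample = rng.sample(rows, min(sample_size, len(rows)))
    return [
        (a["id"], b["id"])
        for a, b in combinations(sample, 2)
        if a["team"] != b["team"] and a["stadium"] != b["stadium"]
    ]


# A sample of 3 tuples yields at most 3 pairs, instead of all pairs of the table.
small = sample_pairs(PLAYER, sample_size=3)
# Sampling the whole table reproduces the full result of the vio-gen query.
full = sample_pairs(PLAYER, sample_size=len(PLAYER))
print(len(small), len(full))
```

The work done in memory grows with the square of the sample size rather than the square of the table size, which is the point of the optimization.
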
  18. Overview
     ‣ Motivations and Goals
     ‣ Main Ideas
     ‣ Optimizations
     ‣ Experimental Results
  19. Evaluation of the Tools
     • Tools
       • Llunatic: Geerts et al. VLDB14
       • Holistic: Chu et al. ICDE13
       • Greedy: Bohannon et al. SIGMOD05, Cong et al. VLDB07
       • Sampling: Beskales et al. VLDB10
     • Tasks
       • Constraint-based with 5% errors and different repairability levels: High (~0.8), Med (~0.5), and Low (~0.25)
  20. Scalability Results
  21. Lessons Learned
     • Automated tools are essential for robust and broad empirical evaluations
     • Data repairing is not yet mature: there is no definitive automatic data-repairing algorithm yet
     • Repairability matters
       • We need to document our dirty data
       • Algorithms are sensitive to error characteristics!
     • Generating errors is hard