
2016 VLDB - The iBench Integration Metadata Generator


Given the maturity of the data integration field it is surprising that rigorous empirical evaluations of research ideas are so scarce. We identify a major roadblock for empirical work - the lack of comprehensive metadata generators that can be used to create benchmarks for different integration tasks. This makes it difficult to compare integration solutions, understand their generality, and understand their performance. We present iBench, the first metadata generator that can be used to evaluate a wide range of integration tasks (data exchange, mapping creation, mapping composition, schema evolution, among many others). iBench permits control over the size and characteristics of the metadata it generates (schemas, constraints, and mappings). Our evaluation demonstrates that iBench can efficiently generate very large, complex, yet realistic scenarios with different characteristics. We also present an evaluation of three mapping creation systems using iBench and show that the intricate control that iBench provides over metadata scenarios can reveal new and important empirical insights. iBench is an open-source, extensible tool that we are providing to the community. We believe it will raise the bar for empirical evaluation and comparison of data integration systems.

2016 VLDB - The iBench Integration Metadata Generator

  1. 1. The iBench Integration Metadata Generator Patricia C. Arocena (University of Toronto), Boris Glavic (IIT DBGroup), Radu Ciucanu (Université Blaise Pascal), Renée J. Miller (University of Toronto)
  2. 2. Outline 1) Introduction 2) The Metadata Generation Problem 3) The iBench Metadata Generator 4) Experiments 5) Conclusions and Future Work 2 VLDB 2016 - The iBench Integration Metadata Generator
  3. 3. Overview • Challenges of evaluating integration systems – Metadata is central • Various types of metadata used by integration tasks – Quality is as important as (or maybe more important than) performance • Often requires “gold standard” solution • Goal: make empirical evaluations … • … more robust, repeatable, shareable, and broad • … less painful and time-consuming • Contributions – iBench – a flexible metadata generator for evaluating integration systems – Evaluate iBench performance + use it in evaluations 3 VLDB 2016 - The iBench Integration Metadata Generator
  4. 4. Integration Tasks Many integration tasks work with metadata: • Data Exchange – Input: Schemas, Constraints, (Source Instance), Mappings – Output: Executable Transformations, (Target Instance) • Schema Mapping Generation – Input: Schemas, Constraints, Instance Data, Correspondences – Output: Mappings, Transformations • Schema Matching – Input: Schemas, (Instance Data), (Constraints) – Output: Correspondences • Virtual Data Integration – Input: Schemas, Instance Data, Mappings, Queries – Output: Rewritten Queries, Certain Query Results • … and many others (e.g., Mapping Operators, Schema Evolution, Data Cleaning, ...) 4 VLDB 2016 - The iBench Integration Metadata Generator [Box: Integration Scenario – Metadata: Schemas, Constraints, Correspondences, Mappings; Data: Source Instance, Target Instance]
  5. 5. Empirical Evaluation • What are important input parameters? – Ability to create input where each parameter is varied systematically as an independent variable – What are realistic values for these parameters? • How is output performance measured? – Measures of scalability (throughput, response time) – Measures of quality • Given input, what is the best output (gold standard)? – Many quality metrics measure “deviation from gold standard” 5 VLDB 2016 - The iBench Integration Metadata Generator
  6. 6. Running Example • Alberto has designed a new mapping inversion algorithm called mapping restoration – Not all mappings have a perfect restoration • A partial restoration leads to a mapping that is close to an inverse but not equivalent – He has implemented his algorithm – Alberto would like to publish (deadline in 3 weeks) • Alberto would like to do a comprehensive evaluation to test … – Hypothesis 1: many mappings do have a restoration – Hypothesis 2: the quality of the partial inverses is good – Hypothesis 3: although the algorithm has exponential complexity he believes that some heuristic optimizations will make it feasible to compute inverses for mappings over large schemas • Alberto’s Dilemma – How to test his hypotheses (in a short time)? 6 VLDB 2016 - The iBench Integration Metadata Generator Given (S, T, ΣST) find (T, S, ΣTS)
  7. 7. State-of-the-art • How are integration systems typically evaluated? • Small real-world integration scenarios – Advantages: • Realistic ;-) – Disadvantages: • Not possible to scale (schema-size, data-size, …) • Not possible to vary parameters (e.g., mapping complexity) • Ad-hoc synthetic scenarios – Advantages: • Can influence scale and characteristics – Disadvantages: • Often not very realistic metadata • Diversity requires huge effort 7 VLDB 2016 - The iBench Integration Metadata Generator
  8. 8. Requirements • Alberto needs a tool to generate realistic metadata! – Scalability • Generate large integration scenarios efficiently • Requires low user effort – Control over the characteristics of the generated metadata • Size • Structure • … – Generate inputs as well as gold standard outputs – Generate data that matches the generated metadata • Data should obey integrity constraints – Promote reproducibility • Enable other researchers to regenerate metadata to repeat an experiment • Support researchers in understanding the characteristics of generated metadata • Enable researchers to reuse generated integration scenarios 8 VLDB 2016 - The iBench Integration Metadata Generator
  9. 9. Related Work • STBenchmark [Alexe et al. PVLDB ‘08] – Pioneered the primitive approach • Generate metadata by combining typical micro scenarios – No support for controlling reuse among primitives – No support for flexible value invention, scaling real-world scenarios, … • Data generators – PDGF, Myriad 9 VLDB 2016 - The iBench Integration Metadata Generator
  10. 10. Outline 1) Introduction 2) The Metadata Generation Problem 3) The iBench Metadata Generator 4) Experiments 5) Conclusions and Future Work 10 VLDB 2016 - The iBench Integration Metadata Generator
  11. 11. Integration Scenarios • Integration Scenario – M = (S, T, ΣS, ΣT, Σ, I, J, 𝛕) – Source schema S with instance I – Target schema T with instance J – Source constraints ΣS and target constraints ΣT • Instance I fulfills ΣS and instance J fulfills ΣT – Schema mapping Σ • Instances (I, J) fulfill Σ – Transformations 𝛕 11 VLDB 2016 - The iBench Integration Metadata Generator [Figure: scenario diagram relating I, S, ΣS to J, T, ΣT via Σ and 𝛕]
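A minimal sketch of an integration scenario as a plain data structure may help fix the notation; the field names below are purely illustrative, not iBench's internal representation:

    from dataclasses import dataclass, field

    # Illustrative container for the components of M = (S, T, ΣS, ΣT, Σ, I, J, 𝛕).
    @dataclass
    class IntegrationScenario:
        source_schema: dict        # S: relation name -> list of attribute names
        target_schema: dict        # T: relation name -> list of attribute names
        source_constraints: list   # ΣS: constraints over S (keys, foreign keys, ...)
        target_constraints: list   # ΣT: constraints over T
        mappings: list             # Σ: logical mappings relating S and T (e.g., s-t tgds)
        source_instance: dict = field(default_factory=dict)  # I, must fulfill ΣS
        target_instance: dict = field(default_factory=dict)  # J, must fulfill ΣT; (I, J) must fulfill Σ
        transformations: list = field(default_factory=list)  # 𝛕: executable transformations producing J from I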
  12. 12. MD Task • Metadata generation task (MD task) – Scenario parameters Π • Number of source relations • Number of attributes of target relations – MD task Γ • Specifies min/max constraints over some 𝛑 ∊ Π • E.g., source relations should have between 3 and 10 attributes • Solutions – An integration scenario M is a solution for an MD task Γ if M fulfills the constraints of Γ 12 VLDB 2016 - The iBench Integration Metadata Generator
  13. 13. Example - MD Task • MD Task (parameter 𝛑: source / target) – Number of Relations: 2-4 / 1-3 – Number of Attributes: 2-10 / 2-10 – Number of Join Attributes: 1-2 / 1-2 – Number of Existentials: 0-3 (target only) • Example solution (mappings) • S1(A,B,C), S2(C,D,E) —> T(A,E) • S3(A,B,C,D), S4(E,A,B) —> ∃X,Y,Z T1(A,X,X), T2(A,Y,C), T3(C,B,Y,Z) • Limited usefulness in practice – Can we generate “realistic” scenarios? 13 VLDB 2016 - The iBench Integration Metadata Generator
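As a hedged illustration (the names are ours, not iBench code), the min/max constraints of this MD task can be checked mechanically against the second example solution:

    # The MD task from this slide as min/max ranges over scenario parameters.
    md_task = {
        "relations_source": (2, 4),
        "relations_target": (1, 3),
        "attributes_source": (2, 10),   # per source relation
        "attributes_target": (2, 10),   # per target relation
        "join_attributes": (1, 2),      # per mapping
        "existentials": (0, 3),         # per mapping (target side)
    }

    # Observed values for S3(A,B,C,D), S4(E,A,B) -> ∃X,Y,Z T1(A,X,X), T2(A,Y,C), T3(C,B,Y,Z).
    observed = {
        "relations_source": [2],
        "relations_target": [3],
        "attributes_source": [4, 3],     # |S3|, |S4|
        "attributes_target": [3, 3, 4],  # |T1|, |T2|, |T3|
        "join_attributes": [2],          # A and B are shared by S3 and S4
        "existentials": [3],             # X, Y, Z
    }

    def is_solution(task, obs):
        # Every observed value must lie inside the [min, max] range of its parameter.
        return all(lo <= v <= hi for name, (lo, hi) in task.items() for v in obs[name])

    assert is_solution(md_task, observed)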
  14. 14. Mapping Primitives • Mapping Primitives – Template micro-scenarios that encode typical schema mapping/evolution operations • e.g., vertically partitioning a source relation – Used as building blocks for generating scenarios • Comprehensive Set of Primitives – Schema Evolution Primitives • Mapping Adaptation [Yu, Popa VLDB05] • Mapping Composition [Bernstein et al. VLDBJ08] – Schema Mapping Primitives • STBenchmark [Alexe, Tan, Velegrakis PVLDB08] – First to propose parameterized primitives 14 VLDB 2016 - The iBench Integration Metadata Generator
  15. 15. Scenario Primitives 15 VLDB 2016 - The iBench Integration Metadata Generator
  16. 16. Primitive Parameters • What are primitive parameters? – For a given set of primitive types • 𝛉p = number of instances of primitive p that have to be included in a solution for an MD task – Primitives are templates • 𝛑 parameters affect instances of primitives – e.g., number of attributes per source table • 𝛌 parameters that determine primitive specific settings – Specified as min/max constraints – e.g., number of partitions in vertical partitioning 16 VLDB 2016 - The iBench Integration Metadata Generator
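As an illustration only (the names below are ours, not iBench's configuration syntax), the three kinds of parameters could be written down as:

    # 𝛉: how many instances of each primitive a solution must contain.
    theta = {"copy": 1, "vertical_partitioning": 1}
    # 𝛌: primitive-specific min/max settings ('lam' because 'lambda' is a Python keyword).
    lam = {"vertical_partitioning": {"fragments": (2, 5)}}
    # 𝛑: scenario-wide min/max constraints, as in the MD task above.
    pi = {"attributes_per_source_relation": (3, 10)}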
  17. 17. MD task with Primitives • Solutions – Contain requested number 𝛉p of instances of primitive p • 𝛌 parameter constraints are obeyed – 𝛑 parameters constraints are fulfilled 17 VLDB 2016 - The iBench Integration Metadata Generator
  18. 18. Sharing among Primitives • Primitives occur in the real world • However: – Assuming independence of schema elements across primitive instances is unrealistic • Solution – Scenario parameters that control sharing among primitives • e.g., the percentage of source (or target) relations reused across primitive instances 18 VLDB 2016 - The iBench Integration Metadata Generator
  19. 19. Analysis • Complexity – Finding a solution for an MD task is NP-hard • Unsatisfiability – There are MD tasks that have no solutions (e.g., requesting a vertical-partitioning instance with 5 fragments while allowing at most 3 target relations) • Implications – Any PTIME algorithm will have to either be unsound (may not fulfill the MD task) or incomplete (may not always return a solution) – Users may accidentally specify unsatisfiable tasks • How should a generator deal with this? 19 VLDB 2016 - The iBench Integration Metadata Generator
  20. 20. Outline 1) Introduction 2) The Metadata Generation Problem 3) The iBench Metadata Generator 4) Experiments 5) Conclusions and Future Work 20 VLDB 2016 - The iBench Integration Metadata Generator
  21. 21. iBench Overview • MDGen Algorithm – PTIME greedy algorithm solving the metadata generation problem • Relaxes scenario parameters if necessary to avoid backtracking and guarantee that a solution can be generated • Input configuration – A configuration file (1-2 pages of text) – Scenario and Primitive Parameters – What types of metadata to produce? – Generate Data? • Output – An XML file storing the metadata – Data as CSV files (optional) 21 VLDB 2016 - The iBench Integration Metadata Generator
  22. 22. MDGen Algorithm • Instantiate Primitives to fulfill primitive parameters – If necessary relax 𝛑 constraints to fulfill 𝛌 and 𝛉 constraints • e.g., the user requests 3 target relations (𝛑), one vertical partitioning primitive instance (𝛉) with 5 fragments (𝛌) – We would ensure that 5 fragments are created • Generate random mappings to fulfill 𝛑 constraints if necessary – e.g., not enough source relations have been created • Customize generated scenario – Adapt value invention – Generate random constraints • Generate Data 22 VLDB 2016 - The iBench Integration Metadata Generator
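The relaxation rule in the first step can be shown with a toy, runnable sketch (our names, not the actual MDGen code); it reproduces the example above, where one vertical-partitioning instance with 5 fragments overrides a request for 3 target relations:

    # Widen a (min, max) scenario constraint so that a primitive's requirement fits,
    # instead of backtracking over earlier generation choices.
    def relax(pi_range, required):
        lo, hi = pi_range
        return (min(lo, required), max(hi, required))

    pi_target_relations = (3, 3)   # 𝛑: the user asks for 3 target relations
    fragments = 5                  # 𝛉/𝛌: one vertical partitioning with 5 fragments
    print(relax(pi_target_relations, fragments))   # (3, 5): keep the fragments, relax 𝛑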
  23. 23. The iBench Metadata Generator • Primitive Generation Example – I want 1 copy and 1 vertical partitioning 23 VLDB 2016 - The iBench Integration Metadata Generator [Figure: source schema Cust(Name, Addr), Emp(Name, Company); target schema Customer(Name, Addr, Loyalty), Person(Id, Name, WorksAt), EmpRec(Firm, Id)]
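One plausible reading of the mappings behind this example, written as source-to-target tgds (a sketch only; the exact correspondences and invented values depend on iBench's value-invention settings): the copy primitive maps Cust into Customer with an invented Loyalty value, and vertical partitioning splits Emp into Person and EmpRec linked by an invented key:

    \mathrm{Cust}(n, a) \rightarrow \exists L\; \mathrm{Customer}(n, a, L)
    \mathrm{Emp}(n, c) \rightarrow \exists I\; \mathrm{Person}(I, n, c) \wedge \mathrm{EmpRec}(c, I)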
  24. 24. Outline 1) Introduction 2) The Metadata Generation Problem 3) The iBench Metadata Generator 4) Experiments 5) Conclusions and Future Work 24 VLDB 2016 - The iBench Integration Metadata Generator
  25. 25. Evaluation Goals • Scalability of iBench – Can we generate large and complex integration scenarios in reasonable time? • Effectiveness – Can we enable new types of evaluations that would be hard or impossible without iBench? 25 VLDB 2016 - The iBench Integration Metadata Generator
  26. 26. iBench Performance 26 VLDB 2016 - The iBench Integration Metadata Generator [Plot legend: red = no sharing/constraints; green = 25% constraints; blue = 25% source sharing; turquoise = 25% target sharing] • No sharing: 1M attributes in seconds • Sharing: 1M attributes in 2-3 minutes • Non-linear trend and high standard deviation (12-14 s) due to best-effort sharing
  27. 27. Case Study - Data Exchange 27 VLDB 2016 - The iBench Integration Metadata Generator • Compare two data exchange systems • Mapping generation time • Data exchange performance • Target instance quality (redundancy) • Clio - Toronto/IBM Almaden collaboration • creates universal solutions • [Popa, Velegrakis, Miller, Hernandez, Fagin VLDB02] • MapMerge - UC Santa Cruz/IBM Almaden collaboration • mapping correlation to remove redundancy in Clio mappings • [Alexe, Hernandez, Popa, Tan VLDBJ12] • Original MapMerge Evaluation • two real-life scenarios with 14 mappings • one synthetic: one source relation, up to 256 mappings • Our goal: • test the original experimental observations over more diverse and complex metadata scenarios
  28. 28. Case Study – Results 28 VLDB 2016 - The iBench Integration Metadata Generator [Plots: result quality, mapping generation, data exchange] • New insights • Larger improvement for ontology scenarios (Ont) that exhibit more incompleteness • Comparable mapping execution time (data exchange) with slight benefits for MapMerge over larger scenarios
  29. 29. Success stories • Value Invention in Data Exchange – [Arocena, Glavic, Miller, SIGMOD 2013] – Generate a large number of mappings (12.5 M) • Vagabond – [us] – Use iBench to scale real-world scenarios and evaluate the effectiveness and performance of our approach • Mapping Discovery – [Miller, Getoor, Memory] – Use iBench to generate schemas and data as input for learning mappings between schemas using statistical techniques and compare against iBench ground-truth mappings • Functional Dependencies Unleashed for Scalable Data Exchange – [Bonifati, Ileana, Linardi - arXiv preprint arXiv:1602.00563, 2016] – Used iBench to compare a new chase-based data exchange algorithm to the SQL-based exchange algorithm of ++Spicy • Approximation Algorithms for Schema-Mapping Discovery from Data – [ten Cate, Kolaitis, Qian, Tan AMW 2015] – Approximate the Gottlob-Senellart notion of schema-mapping discovery – Kun Qian is currently using iBench to evaluate the effectiveness of the approximation • Comparative Evaluation of Chase Engines – [Università della Basilicata, University of Oxford] – Using iBench to generate schemas and constraints 29 VLDB 2016 - The iBench Integration Metadata Generator
  30. 30. Outline 1) Introduction 2) The Metadata Generation Problem 3) The iBench Metadata Generator 4) Experiments 5) Conclusions and Future Work 30 VLDB 2016 - The iBench Integration Metadata Generator
  31. 31. Conclusions • Goals – Improve empirical evaluation of integration systems • Help researchers evaluate their systems • iBench – Comprehensive metadata generator – Produces inputs and outputs (gold standards) for a variety of integration tasks – Control over metadata characteristics 31 VLDB 2016 - The iBench Integration Metadata Generator
  32. 32. Future Work • Give more control over data generation • Orchestration of mappings – Serial and parallel • Improve control over naming of schema elements – e.g., to better support schema matching • Incorporate implementations of quality measures 32 VLDB 2016 - The iBench Integration Metadata Generator
  33. 33. Questions? • Homepage: Boris: http://www.cs.iit.edu/~glavic/ Radu: http://ws.isima.fr/~ciucanu/ Patricia: http://dblab.cs.toronto.edu/~prg/ Renee: http://dblab.cs.toronto.edu/~miller/ • iBench Project Webpage: http://dblab.cs.toronto.edu/project/iBench/ Code: https://bitbucket.org/ibencher/ibench/ Public Scenario Repo: https://bitbucket.org/ibencher/ibenchconfigurationsandscenarios 33 VLDB 2016 - The iBench Integration Metadata Generator
