Data Integration with CloverETL


Published in: Technology
  • Global schema (mediated schema): it is called global because it unifies a number of local schemata. In some cases, this global schema can itself serve as a local schema for another II system. Wrappers: wrappers make sources accessible by transforming data from the source's native format into something acceptable to the mediator. Mediators: translate queries and combine the answers of wrappers and other mediators.
  • The approach is completely virtual: we never create a database that conforms to the global schema.
  • How to pose queries? Simply unfold the user query by substituting the view definitions for the global schema relations.
  • Corresponds to the process from sources to Global Schema
  • CloverETL Designer is a member of the family of CloverETL software products developed by Javlin. It is a powerful Java-based standalone application for data extraction, transformation, and loading, built on the extensible Eclipse platform. Working with CloverETL Designer is much simpler than writing code for data parsing: its graphical user interface makes creating and running graphs easier and more comfortable.
    CloverETL Designer can also be used with CloverETL Server; the two products are fully integrated. From the Designer you can connect to CloverETL Server and create projects, graphs, and all other resources on the server in the same way as when working with the Designer locally.
    CloverETL Server provides:
        – Statistics
        – Monitoring
        – Centralized ETL job management
        – Integration into enterprise workflows
        – Multi-user environment
        – Parallel execution of graphs
        – Tracking of graph executions
        – Task scheduling
        – Clustering and distributed execution of graphs
        – Launch services
        – Load balancing and failover
  • Transformation graphs are created in CloverETL Designer from graph elements and executed by CloverETL Engine. The most important graph elements are components (nodes). They all serve to process data. Most of them have ports through which they can receive data and/or send the processed data out. Most components work only when edges are connected to these ports. Each edge in a graph connected to some port must have metadata assigned to it. Metadata describes the structure of data flowing through the edge from one component to another.
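The note above can be illustrated with a minimal Python sketch. This is not the CloverETL API: the `Edge` class, the `reader`, `age_filter`, and `writer` functions, and the sample rows are all hypothetical names chosen for illustration. The sketch only shows the idea that components exchange records over edges, and that each edge carries metadata describing the record structure.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class Edge:
    """Hypothetical edge: connects two components and checks records against its metadata."""
    metadata: tuple  # field names every record on this edge must have, in order

    def carry(self, records: Iterable[dict]) -> Iterator[dict]:
        for record in records:
            if tuple(record) != self.metadata:
                raise ValueError(f"record {record} does not match metadata {self.metadata}")
            yield record


def reader(rows):
    """Extract component: emits records on its output port."""
    yield from rows


def age_filter(records, min_age):
    """Transform component: passes on only records with age > min_age."""
    return (r for r in records if r["age"] > min_age)


def writer(records, sink):
    """Load component: consumes records from its input port."""
    sink.extend(records)


# Hypothetical sample data; wire reader -> filter -> writer through two edges.
rows = [{"sid": "bz1", "age": 24}, {"sid": "tr1", "age": 21}]
edge_in = Edge(metadata=("sid", "age"))
edge_out = Edge(metadata=("sid", "age"))
sink = []
writer(edge_out.carry(age_filter(edge_in.carry(reader(rows)), 22)), sink)
```

Using generators keeps the components streaming: a record flows through the whole graph before the next one is read, which loosely mirrors how data flows along edges.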
  • Demo scenario in general
  • Data sources have different schemata and formats. We assume that the student sets of the universities are disjoint, and that student IDs are unique within Italy.
  • GAV Mapping
  • As for the third query, we assume that the nationality attribute is strong enough to be an identifier. GROUP BY is not expressible by conjunctive queries (CQs).
  • Logical plan = what to do, or what you want (declarative). We unfold the query here.
  • Execution plan = how to do it (procedural). The way a statement can be physically executed is called an execution plan or a query plan. An execution plan is composed of primitive operations, for example reading a table completely, using an index, or performing a nested loop or hash join.
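As an illustration of one primitive operation named above, here is a minimal hash join sketch in Python. The function name `hash_join`, its parameters, and the sample rows are assumptions made for this example; the rows follow the shape of the global schema relations Student(sid, sname, age, nationality) and Country(cid, cname, currency).

```python
from collections import defaultdict


def hash_join(left, right, left_key, right_key):
    """Join two lists of tuples on equal keys using a build phase and a probe phase."""
    # Build phase: index the right input by its join key.
    buckets = defaultdict(list)
    for row in right:
        buckets[right_key(row)].append(row)
    # Probe phase: look up each left row's key and emit the concatenated matches.
    return [l + r for l in left for r in buckets.get(left_key(l), [])]


# Hypothetical sample rows.
students = [("bz1", "Anna", 24, "Italy"), ("mi1", "Lena", 23, "Germany")]
countries = [("c1", "Italy", "EUR"), ("c2", "Germany", "EUR")]

joined = hash_join(students, countries,
                   left_key=lambda s: s[3],   # Student.nationality
                   right_key=lambda c: c[1])  # Country.cname
```

A nested loop join would compare every student with every country instead; the hash join reads each input once, at the cost of holding one side in memory.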

    1. The Practical Side of Information Integration with CloverETL
        Fariz Darari (FU Bolzano)
    2. Outline
        1. Information Integration
        2. CloverETL
        3. Demo
            – Global Schema
            – Data Sources
            – Queries
    3. INFORMATION INTEGRATION
    4. Information Integration
        Information integration (II) aims to provide uniform access to data stored in a number of autonomous and heterogeneous sources.
    5. Challenges
        • Different data models (structured, semi-structured, text)
        • Different schemata
        • Differences in the representation of
            – values (km vs. miles, USD vs. EUR)
            – entities (addresses, dates, etc.)
        • Inconsistencies among the data
    6. Components
        • Consists of:
            1. Global Schema: the unifying schema among local schemata.
            2. Wrappers: make sources accessible.
            3. Mediators: translate queries, combine answers of wrappers and other mediators.
    7. Information Integration - GAV
        • An approach to mapping between the source schemata and the global schema
        • GAV (global-as-view) = relations in the global schema are views of the sources
        • Views are virtual relations; the global schema describes a virtual DB
    8. Information Integration - GAV
    9. Information Integration - ETL
    10. Information Integration - ETL Products
    11. CLOVER ETL
    12. CloverETL
        • An open-source-based platform for information integration.
        • Data can be:
            – extracted from any number of sources
            – validated and modified along the way
            – written to one or more destinations.
    13. CloverETL - Company
    14. CloverETL - Architecture
    15. CloverETL - Designer
    16. CloverETL - Designer
        • Transformation graphs are created in CloverETL Designer.
        • Transformation graphs are divided into:
            – Extract (Green)
            – Transformation (Yellow)
            – Load (Blue)
        • The edges correspond to the data flows from data sources to data targets.
    17. DEMO
    18. Global Schema
    19. Global Schema - Example
        • Student(sid, sname, age, nationality)
        • Country(cid, cname, currency)
    20. Data Sources
        • Unibz (Bolzano), from Relational DB
            – StudentBZ(id, name, sex, age, nationality, address)
        • Unitr (Trento), from XML
            – StudentTR(id, full_name, age, nationality)
        • Unimi (Milan), from CSV
            – StudentMI(student_id, name, gender, age, citizenship)
        • UN (United Nations), from Excel
            – CountryUN(id, country_name, population, capital, currency)
    21. Data Sources - Mapping
        • Student(sid, sname, age, nationality) :- StudentBZ(sid, sname, _, age, nationality, _)
        • Student(sid, sname, age, nationality) :- StudentTR(sid, sname, age, nationality)
        • Student(sid, sname, age, nationality) :- StudentMI(sid, sname, _, age, nationality)
        • Country(cid, cname, currency) :- CountryUN(cid, cname, _, _, currency)
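The GAV mapping above can be sketched in Python as view functions that project each source relation onto the global schema. The sample rows, the constant names (`STUDENT_BZ` and so on), and the function names `student_view` and `country_view` are hypothetical; in the demo the sources are a relational DB, XML, CSV, and Excel, which wrappers would first turn into rows like these.

```python
from typing import NamedTuple


class Student(NamedTuple):
    sid: str
    sname: str
    age: int
    nationality: str


class Country(NamedTuple):
    cid: str
    cname: str
    currency: str


# Hypothetical rows, shaped like the source relations on the Data Sources slide.
STUDENT_BZ = [("bz1", "Anna", "F", 24, "Italy", "Via Roma 1")]  # id, name, sex, age, nationality, address
STUDENT_TR = [("tr1", "Marco Rossi", 21, "Italy")]              # id, full_name, age, nationality
STUDENT_MI = [("mi1", "Lena", "F", 23, "Germany")]              # student_id, name, gender, age, citizenship
COUNTRY_UN = [("c1", "Italy", 59000000, "Rome", "EUR")]         # id, country_name, population, capital, currency


def student_view() -> list:
    """Student(...) as the union of the three source views; '_' positions are dropped."""
    rows = [Student(sid, sname, age, nat) for sid, sname, _, age, nat, _ in STUDENT_BZ]
    rows += [Student(sid, sname, age, nat) for sid, sname, age, nat in STUDENT_TR]
    rows += [Student(sid, sname, age, nat) for sid, sname, _, age, nat in STUDENT_MI]
    return rows


def country_view() -> list:
    """Country(cid, cname, currency) :- CountryUN(cid, cname, _, _, currency)."""
    return [Country(cid, cname, cur) for cid, cname, _, _, cur in COUNTRY_UN]
```

Because the views are plain functions evaluated on demand, nothing conforming to the global schema is ever stored, which matches the virtual character of the approach.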
    22. Queries
        1. All students with their information.
            q(sid, sname, age, nationality) :- Student(sid, sname, age, nationality).
        2. All students whose age is more than 22.
            q(sid, sname) :- Student(sid, sname, age, nationality), age > 22.
        3. All students with their nationality’s currency.
            q(sid, sname, age, nationality, currency) :- Student(sid, sname, age, nationality), Country(cid, nationality, currency).
        4. The number of students per country.
            SELECT nationality, count(sid) FROM Student GROUP BY nationality
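To make queries 2 to 4 concrete, here is a minimal Python sketch that evaluates them over rows already expressed in the global schema. The sample rows and the function names are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical global-schema rows.
STUDENTS = [  # Student(sid, sname, age, nationality)
    ("bz1", "Anna", 24, "Italy"),
    ("tr1", "Marco", 21, "Italy"),
    ("mi1", "Lena", 23, "Germany"),
]
COUNTRIES = [("c1", "Italy", "EUR"), ("c2", "Germany", "EUR")]  # Country(cid, cname, currency)


def older_than(students, min_age):
    """Query 2: q(sid, sname) :- Student(sid, sname, age, nationality), age > min_age."""
    return [(sid, sname) for sid, sname, age, _ in students if age > min_age]


def with_currency(students, countries):
    """Query 3: join Student.nationality with Country.cname to attach the currency."""
    currency_of = {cname: currency for _, cname, currency in countries}
    return [(sid, sname, age, nat, currency_of[nat])
            for sid, sname, age, nat in students if nat in currency_of]


def students_per_country(students):
    """Query 4: SELECT nationality, count(sid) FROM Student GROUP BY nationality."""
    return Counter(nat for _, _, _, nat in students)
```

Query 3 joins on the country name, which relies on the assumption that nationality identifies a country. Query 4 needs aggregation, so it is written as SQL on the slide rather than as a conjunctive query.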
    23. Demo
        • Query:
            q(sid, sname) :- Student(sid, sname, age, nationality), age > 22.
        • Logical Plans:
            q(sid, sname) :- StudentBZ(sid, sname, _, age, nationality, _), age > 22.
            q(sid, sname) :- StudentTR(sid, sname, age, nationality), age > 22.
            q(sid, sname) :- StudentMI(sid, sname, _, age, nationality), age > 22.
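The unfolding step that produces these logical plans can be sketched in Python as follows. The representation (atoms as `(predicate, arguments)` tuples, the `MAPPING` dictionary, and the `unfold` and `render` helpers) is an assumption made for this example, and it only handles views whose body is a single source atom, as in the demo mapping.

```python
# GAV mapping: global relation -> list of (head variables, source atom).
STUDENT_HEAD = ("sid", "sname", "age", "nationality")
MAPPING = {
    "Student": [
        (STUDENT_HEAD, ("StudentBZ", ("sid", "sname", "_", "age", "nationality", "_"))),
        (STUDENT_HEAD, ("StudentTR", ("sid", "sname", "age", "nationality"))),
        (STUDENT_HEAD, ("StudentMI", ("sid", "sname", "_", "age", "nationality"))),
    ],
}


def unfold(atom):
    """Replace a global-schema atom by one source atom per view definition."""
    pred, args = atom
    plans = []
    for head_vars, (src_pred, src_args) in MAPPING[pred]:
        # Map the view's head variables to the query's arguments; '_' stays anonymous.
        rename = dict(zip(head_vars, args))
        plans.append((src_pred, tuple(rename.get(v, v) for v in src_args)))
    return plans


def render(head, atom, condition):
    """Format one unfolded rule in the notation used on the slide."""
    pred, args = atom
    return f"{head} :- {pred}({', '.join(args)}), {condition}."


query_atom = ("Student", ("sid", "sname", "age", "nationality"))
logical_plans = [render("q(sid, sname)", a, "age > 22") for a in unfold(query_atom)]
```

Each entry of `logical_plans` is one rule over a single source; the answer to the original query is the union of the answers to these rules.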
    24. Demo - Execution Plan