Data Quality Integration (ETL) Open Source


Published in: Technology
  • Data Profiling: the process of examining the data that exists in the source systems and collecting statistics and information about it. Data Cleansing: the process of detecting and correcting corrupt, inconsistent or erroneous data. Data Integrity: the process of analyzing the consistency of the data and the relationships between the different data sets. Data Validation: the process of applying validation rules to the data, based on data dictionaries and/or business rules. Master Data Management: the set of processes, policies, standards and tools used to manage an organization's master data (usually non-transactional information). Data Auditing: the process of managing how well the data fits the purposes defined by the organization. The necessary policies must be established. Act + monitor. Data Governance: a concept that encompasses all the previous processes and gives an organization reliable information.
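The Data Validation definition above (rules driven by a data dictionary and/or business rules) can be illustrated with a minimal Python sketch. The field names, the allowed country codes and the `validate` helper are all hypothetical, chosen only for illustration; they are not part of the slides or of any tool mentioned in them.

```python
from typing import Callable

# Hypothetical data dictionary: field name -> rule returning True when the value is valid.
RULES: dict[str, Callable[[object], bool]] = {
    "customer_id": lambda v: isinstance(v, int) and v > 0,   # business rule
    "country": lambda v: v in {"ES", "BR", "PT"},            # allowed-values dictionary
    "email": lambda v: isinstance(v, str) and "@" in v,      # very loose format check
}

def validate(rows, rules=RULES):
    """Split rows into valid rows and a list of (row_index, field) violations."""
    valid, violations = [], []
    for i, row in enumerate(rows):
        # Collect every field of this row that breaks its rule.
        bad = [f for f, rule in rules.items() if not rule(row.get(f))]
        if bad:
            violations.extend((i, f) for f in bad)
        else:
            valid.append(row)
    return valid, violations

rows = [
    {"customer_id": 1, "country": "ES", "email": "a@b.com"},
    {"customer_id": -5, "country": "FR", "email": "x@y"},
    {"customer_id": 3, "country": "BR", "email": None},
]
valid, violations = validate(rows)
```

Keeping the rules in a dictionary separate from the loop means new fields can be covered by adding an entry rather than changing the validation code.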
  • Data Quality Integration (ETL) Open Source

    1. Data Integration & Data Quality. Your open source based BI solution!! by
    2. Table of contents. Introduction to Data Quality: What is Data Quality? Why Data Quality? Concepts; Data Quality advantages. Data Quality & Business Intelligence: BI tenets; Data integration; Best practices. Open Source & Data Quality: Data Quality & Pentaho Data Integration (PDI); PDI / ETLs / Integrity / Validation; Data Cleaner. Integration: Data Cleaner and PDI.
    3. Initial Contact
    4. Customer Successes: Private Sector, Public Sector
    5. Introduction to Data Quality
    6. Introduction. What is Data Quality? Non-standard definition: “The processes and technologies involved in ensuring the conformance of data values to business requirements and acceptance criteria”. Attributes sought in data: Accuracy, Consistency, Integrity, Validity.
    7. Introduction. Why Data Quality?
    8. Introduction. Concepts
    9. Introduction. Data governance. Better and faster strategic decision making. Managing data quality: a critical issue. Data Quality tasks must be performed in the data integration stage.
    10. Introduction. Data Quality benefits: suitable customer segmentation → customer satisfaction; avoid processing unreliable data → cost reduction; trustworthy and valuable information; improved business processes; increased profits.
    11. Data Quality & Business Intelligence
    12. Data Quality & Business Intelligence. What is Business Intelligence (BI)? The ability to apprehend the interrelationships of presented facts in such a way as to guide action towards a desired goal. Business Intelligence tenets: visual tools for optimal and simple analysis; robust and trustworthy data. Processes involved: data integration; efficient usage of company information.
    13. Data Quality & Business Intelligence. Data Integration: key for any BI project. ETL = Extract, Transform and Load. The data integration process involves moving data from different sources, transforming it and storing it in unified databases: data warehouse / data marts. Main tasks: extract data from multiple sources; ensure clean, consistent data; combine data; load data into a DW. Systems shown in the diagram: CRM, ERP, BPM, CMS.
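The four main tasks on this slide (extract from multiple sources, ensure clean and consistent data, combine, load into a DW) can be sketched with the Python standard library alone. This is only an illustration under stated assumptions: the two inline CSV extracts, the `dim_customer` table and the cleansing rules are hypothetical, and an in-memory SQLite database stands in for a real data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical extracts from two source systems, inlined as CSV text.
CRM_CSV = "id,name,city\n1, Ana ,madrid\n2,Luis,\n"
ERP_CSV = "id,name,city\n3,MARTA,Barcelona\n1, Ana ,madrid\n"

def extract(text):
    """Extract: read one CSV source into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: trim whitespace, normalize case, drop rows without a city, de-duplicate by id."""
    seen, clean = set(), []
    for r in rows:
        rid = int(r["id"])
        name = r["name"].strip().title()
        city = r["city"].strip().title()
        if not city or rid in seen:
            continue
        seen.add(rid)
        clean.append((rid, name, city))
    return clean

def load(conn, rows):
    """Load: write the cleaned rows into the (stand-in) warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dim_customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
    )
    conn.executemany("INSERT INTO dim_customer VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
# Combine both sources before cleaning so duplicates across systems are caught.
load(conn, transform(extract(CRM_CSV) + extract(ERP_CSV)))
```

Combining the sources before the transform step is a design choice: it lets the de-duplication see records that appear in more than one system.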
    14. Data Quality & Business Intelligence. Data Integration. Challenges: heterogeneous data sources; large data volumes; improving operational efficiency; data source synchronization; scalability. Data integration and Data Quality are closely related concepts.
    15. Data Quality & Business Intelligence. Data integration. The Data Quality process can be performed in different ways: Manual → ad-hoc queries, file searching, etc.; Automated → included in the data integration process. Both are complementary, though: Data Quality tasks as a part of the Data Integration process (ETL).
    16. Data Quality & Business Intelligence. Best ETL practices. Centralize procedures: ensure homogeneity and consistency of data from a great variety of sources. Avoid redundant calculations: if a value has already been computed, do not repeat the operation; this improves performance and avoids possible inconsistencies. Establish “quality control” points: verify the execution of the process at key points and record tracking data for future audits. Implement information reloading processes: useful to avoid initial loading issues/failures. Use intermediate structures: eases monitoring of the process.
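The “quality control” points practice can be sketched as a small audit log that records row counts at each stage. The `AuditLog` class, the `run_stage` helper and the sample rows are hypothetical names introduced for this illustration; a real ETL tool would typically write such entries to a logging table rather than keep them in memory.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class AuditLog:
    """Collects one entry per checkpoint so a later audit can trace row counts."""
    entries: list = field(default_factory=list)

    def checkpoint(self, stage, rows_in, rows_out):
        self.entries.append({
            "stage": stage,
            "rows_in": rows_in,
            "rows_out": rows_out,
            "rejected": rows_in - rows_out,
            "at": datetime.now(timezone.utc).isoformat(),
        })

def run_stage(log, stage, rows, keep):
    """Apply one filtering stage and record a checkpoint for it."""
    kept = [r for r in rows if keep(r)]  # intermediate structure, easy to inspect
    log.checkpoint(stage, len(rows), len(kept))
    return kept

log = AuditLog()
rows = [{"amount": 10}, {"amount": None}, {"amount": -3}]
rows = run_stage(log, "not_null", rows, lambda r: r["amount"] is not None)
rows = run_stage(log, "non_negative", rows, lambda r: r["amount"] >= 0)
```

Each stage returns a new list instead of mutating its input, which mirrors the "intermediate structures" practice: the state between stages can be inspected or persisted.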
    17. Data Quality & Business Intelligence. Best ETL practices. Centralized and standardized processes, checkpoints and logs, and intermediate structures allow: applying BI techniques to the data quality process; analyzing and making the most of data quality results.
    18. Open Source & Data Quality
    19. Open Source & Data Quality. ETL tools and Data Quality. Main ETL Open Source solutions: Pentaho Data Integration, Talend Open Studio. Data Quality Open Source solutions: DataCleaner, Talend Data Quality, Google Refine.
    20. Open Source & Data Quality. Data Quality & Pentaho Data Integration. Intuitive ETL tool based on jobs and transformations. Freedom to decide where and how to perform tasks (profiling, cleansing, integrity, validation), based on metadata. Data Quality oriented components are available in PDI transformations. Not a pure profiling tool; however, DataCleaner can be integrated. Plug-in architecture that allows its functionality to be extended.
    21. Open Source & Data Quality. Data Quality & Pentaho Data Integration. Component variety: cleansing, scripting (SQL, JavaScript), validation, statistics, etc.
    22. Open Source & Data Quality. Data Quality & Pentaho Data Integration. An accurate ETL divided into several phases is essential: (1) preparation process, (2) data receipt, (3) data processing, (4) final load, (5) result reports, (6) activity control. This approach allows: standardizing processes in an organization; scaling better as the number of sources increases; centralized control of process results.
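The six phases listed on this slide can be sketched as a plain Python runner that executes them in order and keeps a record of each outcome. This is not how PDI jobs are defined; the phase identifiers, the `run_etl` function and the failing step are hypothetical, used only to show the idea of fixed phases with centralized control of results.

```python
import time

# Phase names taken from the slide; the identifiers themselves are an assumption.
PHASES = ["preparation", "data_receipt", "data_processing",
          "final_load", "result_reports", "activity_control"]

def run_etl(steps):
    """Run the phases in order; stop at the first failure and return the activity record."""
    activity = []
    for name in PHASES:
        start = time.perf_counter()
        try:
            steps[name]()
            status = "ok"
        except Exception as exc:  # record the failure instead of hiding it
            status = f"failed: {exc}"
        activity.append((name, status, time.perf_counter() - start))
        if status != "ok":
            break  # later phases must not run on top of a failed one
    return activity

def missing_source():
    raise ValueError("source file missing")

steps = {name: (lambda: None) for name in PHASES}
steps["data_receipt"] = missing_source
activity = run_etl(steps)
```

Because every source goes through the same phase list, adding a source means supplying new step functions rather than designing a new process.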
    23. Open Source & Data Quality. Data Cleaner. Profiling tool recommended by Pentaho. Alternative tools: desktop tools, web tools, PDI plugin.
    24. Open Source & Data Quality. Data Cleaner Desktop. Functionalities: data cleansing; data dictionaries definition; search for patterns, duplicates, null check, etc.; monitoring; complete execution stats; etc.
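To make the profiling checks named on this slide concrete (patterns, duplicates, null check), here is a minimal Python sketch for a single column. It does not use or imitate the DataCleaner API; `pattern_of`, `profile` and the sample postal codes are hypothetical.

```python
import re
from collections import Counter

def pattern_of(value):
    """Reduce a string to its shape: letters -> 'A', digits -> '9', other characters kept."""
    return re.sub(r"\d", "9", re.sub(r"[A-Za-z]", "A", value))

def profile(values):
    """Return null count, duplicated values and pattern frequencies for one column."""
    present = [v for v in values if v not in (None, "")]  # treat None and "" as null
    counts = Counter(present)
    return {
        "nulls": len(values) - len(present),
        "duplicates": sorted(v for v, n in counts.items() if n > 1),
        "patterns": Counter(pattern_of(v) for v in present),
    }

postal_codes = ["28046", "28046", "08015", "2804A", None, ""]
report = profile(postal_codes)
```

A rare pattern in the report, such as one value that does not match the shape of the others, is usually the first place to look for a data entry error.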
    25. Open Source & Data Quality. Data Cleaner Monitor (web). Functionalities: centralized monitoring; smart visualization; scheduled execution of Data Cleaner and PDI jobs; custom metrics creation; etc.
    26. Open Source & Data Quality. Integration Data Cleaner / PDI. After installing the PDI Data Cleaner plug-in, there are two usage possibilities. Option A: profile data using a PDI step.
    27. Open Source & Data Quality. Integration Data Cleaner / PDI. After installing the PDI Data Cleaner plug-in, there are two usage possibilities. Option B: execute a Data Cleaner job.
    28. References: International Association for Information and Data Quality; Pentaho Data Integration; Data Cleaner.
    29. About us. More information: Tel.: 91.788.34.10. Madrid: Pº de la Castellana, 164, 1º. Barcelona: C/ Valencia, 63. Brazil: Av. Paulista, 37, 4 andar.