1. Data Integration & Data Quality
Your open source based BI solution!!
2. Table of contents
Introduction to Data Quality
• What is Data Quality?
• Why Data Quality?
• Concepts
• Data Quality advantages
Data Quality & Business Intelligence
• BI Tenets
• Data integration
• Best practices
Open Source & Data Quality
• Data Quality & Pentaho Data Integration (PDI)
• PDI / ETLs / Integrity / Validation
• DataCleaner
• Integration of DataCleaner and PDI
5. Introduction to Data Quality
http://optimizeyourdataquality.wordpress.com/
6. Introduction
What is Data Quality?
There is no standard definition. One working definition:
“The processes and technologies involved in
ensuring the conformance of data values to
business requirements and acceptance criteria”
Attributes sought in the data:
Accuracy
Consistency
Integrity
Validity
http://unitar.org
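The four attributes above can be illustrated with a few rule checks. This is a minimal sketch, not a method from the slides: the records, field names, reference list and regular expression are all hypothetical, and the "accuracy" rule is only a plausibility proxy, since true accuracy needs comparison against the real world.

```python
import re
from datetime import date

# Hypothetical sample data; field names are illustrative only.
customers = [
    {"id": 1, "email": "ana@example.com", "birth": date(1980, 5, 1), "country": "ES"},
    {"id": 2, "email": "not-an-email", "birth": date(2090, 1, 1), "country": "ES"},
    {"id": 3, "email": "bob@example.com", "birth": date(1975, 3, 9), "country": "XX"},
]
orders = [{"order_id": 10, "customer_id": 1}, {"order_id": 11, "customer_id": 9}]

VALID_COUNTRIES = {"ES", "FR", "DE"}                  # assumed reference list
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # deliberately simple format rule


def check_customer(row, today):
    """Return the names of the attributes this row fails."""
    problems = []
    if not EMAIL_RE.match(row["email"]):
        problems.append("validity")     # value does not match the expected format
    if row["birth"] > today:
        problems.append("accuracy")     # plausibility proxy: birth date in the future
    if row["country"] not in VALID_COUNTRIES:
        problems.append("consistency")  # value disagrees with the reference list
    return problems


def orphan_orders(order_rows, customer_rows):
    """Integrity: orders whose customer_id matches no known customer."""
    known = {c["id"] for c in customer_rows}
    return [o["order_id"] for o in order_rows if o["customer_id"] not in known]


today = date(2024, 1, 1)
report = {c["id"]: check_customer(c, today) for c in customers}
orphans = orphan_orders(orders, customers)
```

Here `report` maps each customer id to the attributes it fails, and `orphans` lists orders that break referential integrity.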
9. Data governance
Improved and faster strategic decision making
Managing data quality: a critical issue
Data Quality tasks must be performed in the data integration stage
10. Data Quality benefits
Suitable customer segmentation → customer satisfaction
Avoiding the processing of unreliable data → cost reduction
Improving business processes → increased profits
Trustworthy and valuable information
12. Data Quality & Business Intelligence
Business Intelligence Tenets
What is Business Intelligence (BI)?
“The ability to apprehend the interrelationships of presented facts in such a way as to guide action towards a desired goal”
Processes involved:
• Data integration
• Efficient usage of company information
Visual tools for optimal and simple analysis
Robust and trustworthy data
13. Data Integration
Key for any BI project
ETL = Extract, Transform and Load
The data integration process moves data from different sources, transforms it, and stores it in unified databases: a data warehouse or data marts.
Main tasks:
Extract data from multiple sources
Ensure clean, consistent data
Combine data
Load data into a data warehouse (DW)
http://blog.bootstraptoday.com
Example source systems: CRM, ERP, BPM, CMS
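The four main tasks (extract, clean, combine, load) can be sketched end to end in a few lines. This is an illustration under stated assumptions, not how any particular ETL tool works: the two in-memory lists stand in for CRM and ERP extracts, the table name `dim_customer` is hypothetical, and an in-memory SQLite database plays the role of the data warehouse.

```python
import sqlite3

# Hypothetical extracts from two source systems; every value arrives as text.
crm_rows = [{"customer_id": "1", "name": " Ana "}, {"customer_id": "2", "name": "BOB"}]
erp_rows = [{"customer_id": "1", "total": "100.50"}, {"customer_id": "2", "total": "20"}]


def clean_name(name):
    """Cleansing rule: trim whitespace and normalise capitalisation."""
    return name.strip().title()


def run_etl(crm, erp, conn):
    # Transform: convert types, clean values, and combine both sources on customer_id.
    totals = {int(r["customer_id"]): float(r["total"]) for r in erp}
    combined = []
    for r in crm:
        cid = int(r["customer_id"])
        # Customers with no ERP total default to 0.0 (an assumption of this sketch).
        combined.append((cid, clean_name(r["name"]), totals.get(cid, 0.0)))
    # Load: write the unified rows into the warehouse table.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dim_customer "
        "(id INTEGER PRIMARY KEY, name TEXT, total REAL)"
    )
    conn.executemany("INSERT INTO dim_customer VALUES (?, ?, ?)", combined)
    conn.commit()
    return len(combined)


conn = sqlite3.connect(":memory:")
loaded = run_etl(crm_rows, erp_rows, conn)
rows = conn.execute("SELECT id, name, total FROM dim_customer ORDER BY id").fetchall()
```

After the run, `rows` holds one cleaned, typed record per customer, which is the shape a reporting layer would query.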
14. Data Integration
Challenges:
Heterogeneous data sources
Large data volumes
Improving operational efficiency
Data source synchronization
Scalability
Data integration and Data Quality are closely related concepts
15. Data Integration
The Data Quality process can be performed in different ways:
Manual: ad-hoc queries, file searching, etc.
Automated: included in the data integration process
The two approaches are complementary.
Data Quality tasks as a part of the data integration process (ETL)
16. Best ETL practices
Centralize procedures: ensure homogeneity and consistency of data from a great variety of sources.
Avoid redundant calculations: if a value has already been computed, do not repeat the operation. This improves performance and avoids possible inconsistencies.
Establish “quality control” checkpoints: verify the execution of the process at key points and record tracking data for future audits.
Implement information reloading processes: useful for handling initial loading issues or failures.
Use intermediate structures: they ease process monitoring.
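Two of these practices, quality-control checkpoints and avoiding redundant calculations, are easy to show in code. The following is a rough sketch with hypothetical names (`checkpoint`, `audit_log`, `vat_rate`); a real process would write the audit trail to a table or log file, and the rate table is only a stand-in for an expensive lookup.

```python
from functools import lru_cache

audit_log = []  # stand-in for an audit table; one entry per checkpoint


def checkpoint(stage, rows, min_rows=1):
    """Record an audit entry and stop the process if the stage looks wrong."""
    ok = len(rows) >= min_rows
    audit_log.append({"stage": stage, "rows": len(rows), "ok": ok})
    if not ok:
        raise RuntimeError(f"quality control failed at stage {stage!r}")
    return rows


@lru_cache(maxsize=None)
def vat_rate(country):
    """Illustrative lookup, cached so it runs only once per country."""
    return {"ES": 0.21, "FR": 0.20}.get(country, 0.0)


extracted = checkpoint(
    "extract",
    [{"country": "ES", "net": 100.0}, {"country": "ES", "net": 50.0}],
)
# Intermediate structure: the transformed rows are kept separately from the
# extract, so each stage can be inspected on its own.
staged = checkpoint(
    "transform",
    [dict(r, gross=round(r["net"] * (1 + vat_rate(r["country"])), 2)) for r in extracted],
)
```

Because both rows share a country, `vat_rate` is computed once and served from the cache the second time, and `audit_log` ends up with one entry per stage for later audits.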
17. Best ETL practices
Centralized and standardized processes
Checkpoints and logging
Intermediate structures
These allow you to:
Apply BI techniques to the data quality process
Analyze and make the most of data quality results
19. Open Source & Data Quality
ETL tools and Data Quality
Main open source ETL solutions:
Pentaho Data Integration
Talend Open Studio
Open source Data Quality solutions:
DataCleaner
Talend Data Quality
Google Refine
20. Data Quality & Pentaho Data Integration
Intuitive ETL tool based on jobs and transformations.
Freedom to decide where and how to perform tasks (profiling, cleansing, integrity, validation), based on metadata.
Data Quality-oriented components are available in PDI transformations.
Not a pure profiling tool; however, DataCleaner can be integrated.
Plug-in architecture that allows its functionality to be extended.
21. Data Quality & Pentaho Data Integration
Component variety:
Cleansing
Scripting (SQL, JavaScript)
Validation
Statistics
Etc.
22. Data Quality & Pentaho Data Integration
A well-defined ETL process divided into several phases is essential:
1. Preparation process
2. Data receipt
3. Data processing
4. Final Load
5. Result reports
6. Activity control
This approach allows:
Standardizing processes in an organization
Scaling better as the number of sources increases
Centralized control of process results
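The six phases can be sketched as one small function that runs them in order and keeps an activity log. This is only an outline under assumptions: the lists stand in for staging and target tables, the `amount` rule is a hypothetical validation, and in PDI these phases would typically be separate jobs or transformations rather than one function.

```python
from datetime import datetime, timezone


def run_phases(raw_rows):
    activity = []  # 6. Activity control: one entry per completed phase

    def track(phase, detail):
        activity.append({
            "phase": phase,
            "detail": detail,
            "at": datetime.now(timezone.utc).isoformat(),
        })

    # 1. Preparation: set up the (here in-memory) staging and target areas.
    staging, target = [], []
    track("preparation", "staging and target ready")

    # 2. Data receipt: copy incoming rows into staging untouched.
    staging.extend(raw_rows)
    track("receipt", f"{len(staging)} rows received")

    # 3. Data processing: split rows into valid and rejected by a simple rule.
    valid, rejected = [], []
    for row in staging:
        amount = row.get("amount")
        if amount is not None and amount >= 0:
            valid.append(row)
        else:
            rejected.append(row)
    track("processing", f"{len(valid)} valid, {len(rejected)} rejected")

    # 4. Final load: only valid rows reach the target.
    target.extend(valid)
    track("load", f"{len(target)} rows loaded")

    # 5. Result report: summary figures for the run.
    report = {"received": len(staging), "loaded": len(target), "rejected": len(rejected)}
    track("report", "summary built")

    return target, report, activity


target, report, activity = run_phases([{"amount": 10}, {"amount": None}, {"amount": -5}])
```

Keeping the report and the activity log as return values is what makes centralized control possible: every source can run through the same function and be compared on the same figures.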
23. DataCleaner
Profiling tool recommended by Pentaho
Available as:
Desktop tool
Web tool
PDI plug-in
24. DataCleaner Desktop
Features:
Data cleansing
Definition of data dictionaries
Search for patterns, duplicates, null checks, etc.
Monitoring
Complete execution statistics
Etc.
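To make the profiling features concrete (pattern search, duplicates, null checks), here is a small plain-Python sketch of the same ideas. It does not use or imitate the DataCleaner API; the function names and the sample phone numbers are hypothetical.

```python
import re
from collections import Counter


def to_pattern(value):
    """Mask a string: uppercase letters become 'A', lowercase 'a', digits '9'."""
    value = re.sub(r"[A-Z]", "A", value)
    value = re.sub(r"[a-z]", "a", value)
    return re.sub(r"[0-9]", "9", value)


def profile(values):
    """Collect basic statistics about one column of string values."""
    non_null = [v for v in values if v not in (None, "")]
    counts = Counter(non_null)
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "duplicates": sorted(v for v, n in counts.items() if n > 1),
        "patterns": Counter(to_pattern(v) for v in non_null),
    }


phones = ["555-1234", "555-1234", None, "5559876", "", "555-0000"]
result = profile(phones)
```

For this sample, the pattern counts show that most values follow `999-9999` while one follows `9999999`, which is the kind of finding that would feed a cleansing rule.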
25. DataCleaner Monitor (web)
Features:
Centralized monitoring
Smart visualization
Scheduled execution of DataCleaner and PDI jobs
Creation of custom metrics
Etc.
26. DataCleaner / PDI integration
After installing the DataCleaner plug-in for PDI, there are two ways to use it:
Option A: Profile data using a PDI step
27. DataCleaner / PDI integration
After installing the DataCleaner plug-in for PDI, there are two ways to use it:
Option B: Execute a DataCleaner job
28. References
International Association for Information and Data Quality:
http://iaidq.org/
Pentaho Data Integration:
http://www.pentaho.com/explore/pentaho-data-integration/
DataCleaner:
http://datacleaner.org/
Data Profiling: the process of examining the data that exists in the source systems and collecting statistics and information about it.
Data Cleansing: the process of detecting and correcting corrupt, inconsistent, or erroneous data.
Data Integrity: the process of analyzing the consistency of the data and the relationships between the different data sets.
Data Validation: the process of applying validation rules to the data, based on data dictionaries and/or business rules.
Master Data Management: the set of processes, policies, standards, and tools used to manage an organization's master data (usually non-transactional information).
Data Auditing: the process of managing how well the data fits the purposes defined by the organization. The necessary policies must be established. Act + monitor.
Data Governance: a concept that encompasses all the previous processes and allows an organization to have reliable information.