Automating Data Quality Processes at Reckitt


Reckitt is a fast-moving consumer goods company with a portfolio of famous brands and over 30k employees worldwide. At that scale, small projects can quickly grow into big datasets, and processing and cleaning all that data can become a challenge. To solve that challenge we have created a metadata driven ETL framework that orchestrates data transformations through parametrised SQL scripts. It allows us to create various paths for our data and to version control them easily. The approach of standardising incoming datasets and creating reusable SQL processes has proven to be a winning formula: it has helped simplify complicated landing/stage/merge processes and allowed them to be self-documenting.
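A minimal sketch of the parametrised SQL idea, assuming a Databricks/PySpark environment: a SQL template held in the repository is filled in from a parameter dictionary and the result is registered as a temporary view. The view names, columns and parameter values below are invented for illustration and are not taken from the Reckitt implementation.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # predefined as `spark` on Databricks

# A tiny stand-in for a landed source, registered as a temporary view.
spark.createDataFrame(
    [("S001", "2021-11-01", "EMEA"), ("S002", "2021-11-02", "LATAM")],
    ["store_id", "visit_date", "region"],
).createOrReplaceTempView("crm_visits_tmp")

# A SQL script as it might be stored in the repository, with named placeholders.
sql_template = """
    SELECT {select_cols}
    FROM {source_view}
    WHERE region = '{region}'
"""

# Parameters supplied at run time; changing them reroutes the same script.
sql_dict = {"select_cols": "store_id, visit_date",
            "source_view": "crm_visits_tmp",
            "region": "EMEA"}

spark.sql(sql_template.format(**sql_dict)) \
     .createOrReplaceTempView("harmonised_visits_emea")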



But this is only half the battle: we also want to create data products, that is, documented, quality-assured data sets that are intuitive to use. As we move to a CI/CD approach and increase the frequency of deployments, keeping documentation and data quality assessments up to date becomes increasingly challenging. To solve this problem, we have expanded our ETL framework to include SQL processes that automate data quality activities. Using the Hive metastore as a starting point, we have leveraged this framework to automate the maintenance of a data dictionary and to reduce documenting, model refinement, data quality testing and the filtering out of bad data to a box-filling exercise. In this talk we discuss our approach to maintaining high quality data products and share examples of how we automate data quality processes.
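As a rough sketch of how the Hive metastore can seed a data dictionary, the snippet below reads column names, types and comments with DESCRIBE TABLE and persists them as a dictionary table that documentation, tests and deployments can read from. The table names silver.visits and governance.data_dictionary are assumptions made for this example, not names from the actual project.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

table_name = "silver.visits"  # assumed table name

# DESCRIBE TABLE exposes the column names, data types and comments held in the
# metastore; the filter drops blank rows and '#'-prefixed section markers.
schema_df = (spark.sql(f"DESCRIBE TABLE {table_name}")
                  .filter("col_name != '' AND col_name NOT LIKE '#%'"))

# Persist the result as a data dictionary table.
(schema_df.selectExpr(f"'{table_name}' AS table_name",
                      "col_name AS column_name",
                      "data_type",
                      "comment AS description")
          .write.mode("overwrite")
          .saveAsTable("governance.data_dictionary"))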


Automating Data Quality Processes at Reckitt
Karol Sawicz, IT Business Analyst
Richard Chadwick, Data Engineer
Agenda
§ Who are Reckitt?
§ Project background
§ Project architecture
§ Reducing complexity with a metadata driven ETL framework
§ Turning a data set into a data product
§ Demo: Data Quality Processes
Who are Reckitt?
§ FMCG company with a global presence
§ 43,000+ employees
§ Wide data landscape
▪ Various systems in every region
▪ 50+ sales CRMs worldwide
▪ 1000s of sales representatives
▪ Many global & local reporting platforms
▪ 100s of data lakes
Karol Sawicz
§ IT Business Analyst at RB
§ Builds end to end reporting solutions
§ Manages the rollout of global reporting platforms
§ B.Eng in Computer Science from the Polish-Japanese Academy of Information Technology
§ Builds electric bikes in his spare time
karol.sawicz@rb.com
Project background – Sales Execution Reporting
To enable a solid reporting base and analytics capability for Pharmacy & Medical data globally.
• Challenges
▪ Data siloed locally
▪ Analysts work on basics and don’t reuse across regions
▪ New sales CRMs render existing reports obsolete
• Harmonization goals
▪ Automate data ingestion
▪ Make data mapping & cleansing easy
▪ Sustainably check data quality
▪ No need to rebuild for every new CRM
• Project deliverables
▪ Standardized data from many systems
▪ Clean data for analysis
▪ Tools to maintain data mapping and quality checking
• Next level analytics
▪ Enabling future data science projects by having reliable datasets
▪ Encouragement of ad-hoc analysis
▪ Cross dataset analysis
Project architecture
Built on the Azure platform and Databricks
Bronze
• Separate common archive environment
• Source to all DEV/QA/PROD environments
• Configuration driven ingestion from Salesforce
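A hedged sketch of the configuration driven ingestion pattern named above, assuming PySpark on Databricks; the mount path, reader options and view name are hypothetical. The point is that the landing code never changes, only the configuration entry does.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One hypothetical configuration entry for a Salesforce extract; in practice
# these entries live in a metadata table rather than in code.
config = {
    "file_path": "/mnt/bronze/salesforce/visits/2021-11-01/",
    "file_type": "parquet",
    "view_name": "sf_visits_tmp",
    "options": {"mergeSchema": "true"},
}

# The same few lines land any configured source as a temporary view.
(spark.read.format(config["file_type"])
      .options(**config["options"])
      .load(config["file_path"])
      .createOrReplaceTempView(config["view_name"]))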
Project architecture
Built on the Azure platform and Databricks
Silver
• Metadata driven ETL process
• Data harmonized into a single data model
• Local to global value mapping done during the pipeline
• Data quality checks performed using a rule-based data quality framework
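One way a rule-based data quality framework can work is to store checks as SQL predicates and evaluate each one as a boolean flag column. The sketch below assumes a hypothetical silver.visits table and invented rules; it illustrates the pattern rather than the project's actual framework.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical rules expressed as SQL predicates; in a rule-based framework
# these would be read from a configuration table rather than hard-coded.
rules = {
    "has_store_id": "store_id IS NOT NULL",
    "visit_not_in_future": "visit_date <= current_date()",
}

df = spark.table("silver.visits")  # assumed harmonised silver table

# Each rule becomes a boolean column; a row passes only if every rule passes.
for name, predicate in rules.items():
    df = df.withColumn(name, F.expr(predicate))
df = df.withColumn("dq_passed", F.expr(" AND ".join(rules)))

df.write.mode("overwrite").saveAsTable("silver.visits_dq")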
Project architecture
Built on the Azure platform and Databricks
Gold
• Materialized data with no downtime
• Bad quality data is filtered out through views on top of the silver data
• Reporting views are defined on this layer
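Continuing the sketch from the previous section, filtering bad quality data through a view on top of the silver layer could look like the following; the gold.visits and silver.visits_dq names are assumed, not taken from the project.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Rows flagged by the quality checks are excluded at read time, so bad data
# never reaches the reporting views defined on this layer.
spark.sql("""
    CREATE OR REPLACE VIEW gold.visits AS
    SELECT store_id, visit_date, sales_rep
    FROM silver.visits_dq
    WHERE dq_passed = true
""")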
Richard Chadwick
§ Data Engineering consultant at Cervello, a Kearney Company.
§ Services the end to end data journey for Sales Execution: archiving, ETL, validation and deployment.
§ BSc in Mathematics, The University of Edinburgh.
§ Previously worked as a professional poker player.
rchadwick@mycervello.com
Metadata driven ETL framework

Land Data
▪ Metadata configuration table:
▪ File path
▪ File type
▪ Partition structure
▪ Schema
▪ Spark options
▪ Land all data as temporary views
land = LandData(config_table)
land.land_data(land_key, p_dict)

SQL Process
• SQL scripts sourced from repository
• Parametrized SQL scripts
• SQL process configuration table
• All transformations applied to temporary views
sql_p = SQLProcess(config_table)
sql_p.run_sql(sql_key, sql_dict, branch)

ETL
• Mix and match any number of land and SQL processes to create an executable end to end ETL plan
• Configure branch, partition and SQL dictionary arguments for entire ETL plan
• Single Databricks notebook supports executing any ETL process
etl = ETL(config_table)
etl.run_etl(etl_key, p_dict, sql_dict, branch)
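For illustration, one configuration entry of the kind listed under Land Data might be stored as below; the table name etl.land_config and its columns are assumptions, not the schema used in the linked repository. Because the entries are data, they double up as documentation of every source the framework can land.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

rows = [
    # one row per source file that the framework knows how to land
    ("sf_visits",                                      # land key used when landing
     "/mnt/bronze/salesforce/visits/{load_date}/",     # file path (partitioned)
     "parquet",                                        # file type
     "load_date",                                      # partition structure
     "store_id string, visit_date date, sales_rep string",  # schema
     '{"mergeSchema": "true"}'),                       # Spark options as JSON
]
cols = ["land_key", "file_path", "file_type",
        "partition_structure", "schema", "spark_options"]

spark.createDataFrame(rows, cols).write.mode("overwrite") \
     .saveAsTable("etl.land_config")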
Benefits of a metadata driven ETL framework
• Configurations for ETL processes double up as documentation.
• Low code execution enables a wide range of stakeholders to contribute to ETL processes.
• Reduces the complexity of developing and executing lots of similar transformation processes.
• Reduces the complexity of orchestration pipelines (Azure Data Factory).
Turning a data set into a data product
• Accurate data dictionary.
• Data objects have correct naming conventions, data types and a sensible ordering to columns.
• Local market values conformed to global ones, e.g. translation or category mapping.
• Data quality tested against expectations.
• Strategy for data that fails data quality tests.
• Adhere to any service-level agreements.
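A minimal sketch, under assumed names, of how the naming convention and data type checks could be automated against a maintained data dictionary: the actual schema is read from the metastore and compared with the expected entries. gold.visits and governance.data_dictionary are the hypothetical names from the earlier sketches, and the dictionary is assumed to hold an entry for the table being checked.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Expected columns and types as recorded in the (assumed) data dictionary.
expected = {
    r["column_name"]: r["data_type"]
    for r in spark.table("governance.data_dictionary")
                  .filter("table_name = 'gold.visits'")
                  .collect()
}

# Actual schema taken from the metastore.
actual = {f.name: f.dataType.simpleString()
          for f in spark.table("gold.visits").schema.fields}

# Flag anything that drifted: missing columns, unexpected columns, changed types.
missing = set(expected) - set(actual)
unexpected = set(actual) - set(expected)
wrong_type = {c for c in expected if c in actual and actual[c] != expected[c]}
assert not (missing or unexpected or wrong_type), (missing, unexpected, wrong_type)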
Demo
github.com/richchad/data_quality_databricks
Feedback
Your feedback is important to us. Don’t forget to rate and review the sessions.
