Data architecture principles to accelerate your data strategy
Jan Slechta
Senior Data Engineer @ CloverDX
Breaking down complex processes
Avoiding duplicate functionalities
Consistency
Data quality
Documentation
Key principles
Maintenance over time
o Development team productivity
o Cost-effectiveness
Trust in process and in data
o Transparency
o Completeness of the process
Why do these matter?
Break down complex processes into simple elements
Data pipelines maintainable in the long term
Completeness of the process
Development team productivity
Better test coverage → robust solution
Trust in process
Why is this important?
Maintainability
o Our stored procedures are too complex, and the author left the company.
Efficiency
o Our team of four developers is slow and cannot work in parallel.
Completeness
o We forgot to implement auditing and we don’t know how to add it to the existing process.
Trust
o Often, after deploying a new feature, our pipelines break unexpectedly.
Real world issues
Large jobs are a common sign of bad architecture
How to break the job into smaller pieces?
Diagram: Transfer files to cloud, Load into Snowflake, Build Models
Identify individual components of data pipelines
Each job should deal with a single task
How to break the job into smaller pieces?
Diagram: Ingest, Validate, Transform, Deliver, each paired with a Log step; Transfer files to cloud, Load into Snowflake, Build Models
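The idea of one task per job, with logging around every step, can be sketched in a few lines of Python. This is a minimal illustration only: the step names follow the slide (ingest, validate, transform, deliver), but the record layout, the in-memory source and target, and the `run_pipeline` helper are assumptions made for the example, not CloverDX functionality.

```python
import logging
from typing import Callable

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

Record = dict
Step = Callable[[list[Record]], list[Record]]


def ingest(_: list[Record]) -> list[Record]:
    # Hypothetical source: a real job would read the transferred files here.
    return [{"id": 1, "amount": "10.5"}, {"id": 2, "amount": ""}]


def validate(records: list[Record]) -> list[Record]:
    # Single task: keep only records whose amount is populated.
    return [r for r in records if r["amount"]]


def transform(records: list[Record]) -> list[Record]:
    # Single task: convert the amount from text to a number.
    return [{**r, "amount": float(r["amount"])} for r in records]


def deliver(records: list[Record]) -> list[Record]:
    # Hypothetical target: a real job would load into the warehouse here.
    return records


def run_pipeline(steps: list[Step]) -> list[Record]:
    # Run each small step in order and log after every one of them.
    records: list[Record] = []
    for step in steps:
        records = step(records)
        log.info("step=%s records_out=%d", step.__name__, len(records))
    return records


result = run_pipeline([ingest, validate, transform, deliver])
```

Because each step does one thing, it can be tested, replaced, or assigned to a different developer without touching the others.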
Ask questions
o What is the purpose of the process, and what is its business impact?
o What interfaces are you going to use?
o How would you like to automate the process?
o What are the weak points?
o How to handle errors?
Identify patterns
o Repeatable and configurable code sections
o Logging, monitoring, automation, …
How to break the job into smaller pieces?
Reuse functionality
Avoid copy-paste by building and reusing generic jobs for common tasks
Logger
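One way to picture a generic, reusable logger is a wrapper that every job shares instead of copying logging code into each one. The sketch below uses a Python decorator for this; the `logged` name, the `audit` logger, and the `parse_data_file` job are hypothetical and only illustrate the reuse pattern.

```python
import functools
import logging

audit_log = logging.getLogger("audit")


def logged(job):
    """Wrap a job so start, success, and failure are logged the same way everywhere."""
    @functools.wraps(job)
    def wrapper(*args, **kwargs):
        audit_log.info("job=%s status=started", job.__name__)
        try:
            result = job(*args, **kwargs)
        except Exception:
            # Log the failure with its traceback, then let the caller handle it.
            audit_log.exception("job=%s status=failed", job.__name__)
            raise
        audit_log.info("job=%s status=finished", job.__name__)
        return result
    return wrapper


@logged
def parse_data_file(lines):
    # Hypothetical job: split comma-separated lines into fields.
    return [line.split(",") for line in lines]


rows = parse_data_file(["a,b", "c,d"])
```

A change to the logging format is then made once, in `logged`, rather than in every job.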
Beauty of keeping it small
Fewer than 10 steps in each design makes the process easier to understand and maintain
Parse Data File
Avoid duplicating functionality
Standardize process
Increased developer productivity
Faster turnaround
Increased trust
Reduced cost of business processes
Why avoid duplicating functionality?
Productivity
o Implementing a single change to our core process required updates to nearly 80 jobs.
Consistency
o During an internal audit, we realized that our auditing components do not log at the same level of detail.
Real world issues
Avoid duplication by modular design
Source Business logic Target
New Source
No additional cost of adapting a new source
Avoid duplication by modular design
Source, Other Business logic, Other Target
Build a new pipeline with the same source
Functional reusability in CloverDX
Pipeline 1 – PII detection
Pipeline 2 – Publishing to web
Shared Source
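The source / business logic / target split can be illustrated with three interchangeable pieces that are wired together at the last moment. The following is a sketch under stated assumptions: the two sources, the `mask_email` logic, and the in-memory target are invented for the example and do not represent the CloverDX pipelines named on the slide.

```python
from typing import Callable, Iterable, Iterator

Record = dict


def file_source() -> Iterator[Record]:
    # Hypothetical existing source.
    yield {"name": "Ada", "email": "ada@example.com"}


def api_source() -> Iterator[Record]:
    # Hypothetical new source that emits the same record shape.
    yield {"name": "Grace", "email": "grace@example.com"}


def mask_email(records: Iterable[Record]) -> Iterator[Record]:
    # Shared business logic: written once, reused by every pipeline.
    for r in records:
        user, _, domain = r["email"].partition("@")
        yield {**r, "email": f"{user[0]}***@{domain}"}


def list_target(sink: list) -> Callable[[Iterable[Record]], int]:
    # Hypothetical target: appends to a list and reports how many rows were written.
    def write(records: Iterable[Record]) -> int:
        before = len(sink)
        sink.extend(records)
        return len(sink) - before
    return write


def build_pipeline(source, logic, target):
    # Compose the three parts; swapping any one leaves the other two untouched.
    return lambda: target(logic(source()))


out: list = []
written_first = build_pipeline(file_source, mask_email, list_target(out))()
written_second = build_pipeline(api_source, mask_email, list_target(out))()
```

Adding the second source required no change to the business logic or the target, which is the point of the modular design.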
Aim for consistency
Helps the team understand each other's jobs
Prevents data issues
Makes errors easier to identify → helps meet SLAs
Why strive for consistency?
Data quality
o Some data fields are not populated although the data is in the source.
Team productivity
o We don’t have a good approach to change management. Before each release, we spend days fixing conflicts when all teams deliver their work.
Consistency
o Each developer approaches the task differently, and the jobs are difficult to monitor in production.
Real world issues
Naming conventions
Documentation conventions
Development conventions
o Break down where customization is expected
o Versioning and teamwork related conventions
Set expectations and provide training
o Training increases productivity (data integration platform, version control, etc.)
Define conventions
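A naming convention is easier to keep when it can be checked automatically. The sketch below assumes a purely hypothetical convention of `<layer>_<subject>_<action>` in lowercase; the layer and action names are illustrative and should be replaced by whatever the team actually agrees on.

```python
import re

# Hypothetical convention: <layer>_<subject>_<action>, lowercase, e.g. "stg_orders_load".
JOB_NAME = re.compile(r"(stg|core|mart)_[a-z0-9]+_(ingest|validate|transform|load)")


def nonconforming(job_names):
    """Return the job names that do not follow the convention."""
    return [name for name in job_names if not JOB_NAME.fullmatch(name)]


bad_names = nonconforming(["stg_orders_load", "LoadOrdersFinal2", "core_sales_transform"])
```

A check like this could run before each release so that deviations are caught early rather than discovered in production monitoring.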
Manage data quality
Bad data = Cost
o Correction
o Penalties
o Lost business
Accurate data to support business
Efficient data process
Adaptability and recoverability from data issues
Why does data quality matter?
Distorted data reports
o Because we did not check data set quality, we not only had to build another complicated cleanup process, but we were also running our business based on incorrect sales results.
Unable to deliver
o We have identified an issue in the pipeline, but we can’t fix the data because we do not store delta sources from our transactional systems. We can’t implement our new use case.
Data quality check is too slow
o Profiling the source helps us deliver better data, but the process is too slow and we cannot meet our SLA. Do we remove data quality checks?
Real world issues
Always expect poor data quality
Validate early to meet SLAs and reduce downstream burden
Avoid unnecessary validation
Reuse validation rules for consistency
Data quality basic principles
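Validating early with a shared set of rules might look like the following sketch. The rule names and the `customer_id` and `amount` fields are assumptions made for illustration; the point is that every pipeline imports the same rule set instead of re-implementing its own checks.

```python
from typing import Callable

Rule = Callable[[dict], bool]

# Shared rule set, defined once and reused by every pipeline (field names are hypothetical).
RULES: dict[str, Rule] = {
    "customer_id present": lambda r: bool(r.get("customer_id")),
    "amount is non-negative": lambda r: isinstance(r.get("amount"), (int, float))
    and r["amount"] >= 0,
}


def validate(records, rules=RULES):
    """Split records into accepted ones and rejects paired with the names of failed rules."""
    accepted, rejected = [], []
    for record in records:
        failed = [name for name, rule in rules.items() if not rule(record)]
        if failed:
            rejected.append((record, failed))
        else:
            accepted.append(record)
    return accepted, rejected


good, bad = validate([
    {"customer_id": "C1", "amount": 20},
    {"customer_id": "", "amount": -5},
])
```

Rejected records carry the reason they failed, so they can be routed to review right after ingestion instead of surfacing as problems downstream.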
Fixing the data may require the original source and human review
Keep the source data in a staging environment
Delta records might be sufficient
Prioritize business-critical data in storage
Keep source data
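When keeping full source copies is too expensive, storing only delta records may be enough to replay or repair a load. The sketch below shows one simple way to derive a delta between two snapshots; the `id` key and the snapshot layout are assumptions, and deleted records are deliberately not covered.

```python
def compute_delta(previous, current, key="id"):
    """Return records that are new or changed since the previous snapshot."""
    seen = {r[key]: r for r in previous}
    # A record belongs to the delta if its key is unseen or its content differs.
    return [r for r in current if seen.get(r[key]) != r]


yesterday = [{"id": 1, "status": "open"}, {"id": 2, "status": "open"}]
today = [
    {"id": 1, "status": "open"},
    {"id": 2, "status": "closed"},
    {"id": 3, "status": "open"},
]
delta = compute_delta(yesterday, today)
```

Only the changed and new records would then need to be written to the staging area for that day.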
Provide documentation
Data processes evolve over time
People forget or leave
Quickly understand the process
Maintain more effectively over many years
Why is documentation important?
Job design is documentation too – smaller jobs are easier to understand
Document wisely and to the point
Pay special attention to interfaces and reused jobs
Set documentation conventions
Documentation
Quick recap
Key principles
Break down complex processes
Avoid duplicating functionality
Aim for consistency
Maintain data quality
Documentation
Upcoming Webinars
5 Characteristics of modern data architecture that drive innovation
March 23rd
Q&A
www.cloverdx.com/webinars

Data architecture principles to accelerate your data strategy

  • 1. Data architecture principles to accelerate your data strategy Jan Slechta Senior Data Engineer @ CloverDX
  • 2. Breaking down complex processes Avoiding duplicate functionalities Consistency Data quality Documentation Key principles
  • 3. Maintenance over time o Development team productivity o Cost-effectiveness Trust in process and in data o Transparency o Completeness of the process Why do these matter?
  • 5. Data pipelines maintainable in long-term Completeness of the process Development team productivity Better test coverage  Robust solution Trust in process Why is this important?
  • 6. Maintainability o Our stored procedures are too complex, and the author left the company. Real world issues
  • 7. Maintainability o Our stored procedures are too complex, and the author left the company. Efficiency o Team of four developers is slow and cannot work in parallel. Real world issues
  • 8. Maintainability o Our stored procedures are too complex, and the author left the company. Efficiency o Team of four developers is slow and cannot work in parallel. Completeness o We forgot to implement auditing and we don’t know how to add it to the existing process. Real world issues
  • 9. Maintainability o Our stored procedures are too complex, and the author left the company. Efficiency o Team of four developers is slow and cannot work in parallel. Completeness o We forgot to implement auditing and we don’t know how to add it to the existing process. Trust o Often after deployment of new feature, our pipelines unexpectedly break. Real world issues
  • 10. Large jobs are common sign of bad architecture How to break the job into smaller pieces? Transfer files to cloud Load into Snowflake Build Models
  • 11. Identify individual components of data pipelines Each job should deal with a single task How to break the job into smaller pieces? Log Ingest Log Log Log Validate Transform Deliver Transfer files to cloud Load into Snowflake Build Models
  • 12. Ask questions o What is the purpose of the process, and what is its business impact? o What interfaces are you going to use? o How would you like to automate the process? o What are the weak points? o How to handle errors? How to break the job into smaller pieces?
  • 13. Ask questions o What is the purpose of the process, and what is its business impact? o What interfaces are you going to use? o How would you like to automate the process? o What are the weak points? o How to handle errors? Identify patterns o Repeatable and configurable code sections o Logging, monitoring, automation, … How to break the job into smaller pieces?
  • 14. Reuse functionality Avoid copy-paste by building and reusing generic jobs for common tasks Logger
  • 15. Beauty of keeping it small <10 steps in each design helps understand and maintain the process Parse Data File
  • 17. Standardize process Increased developer productivity Faster turnaround Increased trust Reduced cost of business processes Why avoid duplicating functionality?
  • 18. Productivity o Implementing a single change to our core process required updates to nearly 80 jobs. Real world issues
  • 19. Productivity o Implementing a single change to our core process required updates to nearly 80 jobs. Consistency o During internal audit, we realized that auditing components do not log at the same level of detail. Real world issues
  • 20. Avoid duplication by modular design Source Business logic Target
  • 21. Avoid duplication by modular design Source Business logic Target New Source No additional cost of adapting new source
  • 22. Avoid duplication by modular design Source Other Business logic Other Target Build new pipeline with the same source
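The three modular-design slides can be sketched as a pipeline assembled from interchangeable parts. Everything below is an assumption made for illustration: `build_pipeline`, the two source functions, and `ListTarget` are hypothetical names and sample data, not part of the original deck.

```python
from typing import Callable, Iterable

Source = Callable[[], Iterable[dict]]
Logic = Callable[[Iterable[dict]], Iterable[dict]]


class ListTarget:
    """In-memory stand-in for a real target such as a table or a file."""

    def __init__(self) -> None:
        self.rows: list = []

    def __call__(self, records: Iterable[dict]) -> None:
        self.rows.extend(records)


def build_pipeline(source: Source, logic: Logic, target: ListTarget) -> Callable[[], None]:
    # The pipeline only knows the three roles, not their implementations.
    def run() -> None:
        target(logic(source()))

    return run


def crm_source() -> list:
    return [{"name": "Ada", "donation": 50}, {"name": "Bob", "donation": 5}]


def new_crm_source() -> list:
    return [{"name": "Cy", "donation": 70}]


def large_donations(records: Iterable[dict]) -> list:
    return [r for r in records if r["donation"] >= 10]


def names_only(records: Iterable[dict]) -> list:
    return [{"name": r["name"]} for r in records]


# Original pipeline.
reports = ListTarget()
build_pipeline(crm_source, large_donations, reports)()

# New source, same business logic: only the source argument changes.
migrated = ListTarget()
build_pipeline(new_crm_source, large_donations, migrated)()

# Same source, other business logic and other target.
directory = ListTarget()
build_pipeline(crm_source, names_only, directory)()
```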
  • 23. Functional reusability in CloverDX Pipeline 1 – PII detection Pipeline 2 – Publishing to web Shared Source
  • 25. Helps you understand the jobs across the team Prevents data issues Helps you identify errors more easily → Helps meet SLAs Why strive for consistency?
  • 26. Data quality o Some data fields are not populated although the data is in the source. Real world issues
  • 27. Data quality o Some data fields are not populated although the data is in the source. Team productivity o We don’t have a good approach for team collaboration. Before each release we spend days fixing the conflicts when all teams deliver their work. Real world issues
  • 28. Data quality o Some data fields are not populated although the data is in the source. Team productivity o We don’t have a good approach for change management. Before each release we spend days fixing the conflicts when all teams deliver their work. Consistency o Each developer approaches the task differently and the jobs are difficult to monitor in production. Real world issues
  • 29. Naming conventions Documentation conventions Development conventions o Break down where customization is expected o Versioning and teamwork related conventions Set expectations and provide training o Training will increase productivity (data integration platform, version control, etc.) Define conventions
  • 31. Bad data = Cost o Correction o Penalties o Lost business Accurate data to support business Efficient data process Adaptability and recoverability from data issues Why does data quality matter?
  • 32. Distorted data reports o Because we did not check data set quality, we not only had to build another complicated clean-up process, but we were also running our business based on wrong sales results. Real world issues
  • 33. Distorted data reports o Because we did not check data set quality, we not only had to build another complicated clean-up process, but we were also running our business based on wrong sales results. Unable to deliver o We have identified an issue in the pipeline, but we can’t fix the data as we do not store delta sources from our transactional systems. We can’t implement our new use case. Real world issues
  • 34. Distorted data reports o Because we did not check data set quality, we not only had to build another complicated clean-up process, but we were also running our business based on wrong sales results. Unable to deliver o We have identified an issue in the pipeline, but we can’t fix the data as we do not store delta sources from our transactional systems. We can’t implement our new use case. Data quality check is too slow o Profiling the source helps us deliver better data, but the process is too slow, and we cannot meet our SLA. Do we remove data quality checks? Real world issues
  • 35. Always expect poor data quality Validate early to keep SLA and reduce downstream burden Avoid unnecessary validation Reuse validation rules for consistency Data quality basic principles
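As a sketch of "reuse validation rules for consistency", the rules can live in one shared table that every pipeline validates against, with bad records split off early. The rule names, the record fields, and the `validate` helper are hypothetical and chosen only for this example.

```python
from typing import Callable

Rule = Callable[[dict], bool]

# One shared rule set, so every pipeline checks the same things the same way.
RULES: dict[str, Rule] = {
    "has_id": lambda r: r.get("id") is not None,
    "amount_non_negative": lambda r: isinstance(r.get("amount"), (int, float))
    and r["amount"] >= 0,
}


def validate(records: list, rules: dict[str, Rule] = RULES):
    """Split records into valid ones and rejected ones with the names of failed rules."""
    valid, rejected = [], []
    for record in records:
        failed = [name for name, rule in rules.items() if not rule(record)]
        if failed:
            rejected.append((record, failed))
        else:
            valid.append(record)
    return valid, rejected


valid, rejected = validate(
    [{"id": 1, "amount": 5}, {"id": None, "amount": -2}, {"amount": "x"}]
)
```

Rejected records keep the list of rules they failed, which gives downstream review something concrete to act on.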
  • 36. Fixing the data may require original source and human review Keep the source data in a staging environment Delta records might be sufficient Prioritize business critical data in storage Keep source data
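A rough sketch of keeping delta records in staging so they can be replayed after a fix. It assumes a local directory stands in for the staging environment and JSON Lines as the storage format; the function names, the batch identifier, and the sample record are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def stage_delta(records: list, staging_dir: Path, batch_id: str) -> Path:
    """Write the untouched source records to staging before any processing."""
    staging_dir.mkdir(parents=True, exist_ok=True)
    path = staging_dir / f"{batch_id}.jsonl"
    with path.open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")
    return path


def replay(path: Path) -> list:
    """Read a staged batch back, e.g. to rebuild history after a pipeline fix."""
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle]


# A temporary directory stands in for the real staging area.
staging = Path(tempfile.mkdtemp()) / "staging"
delta = [{"order_id": 7, "total": 19.9}]
saved = stage_delta(delta, staging, "batch-0001")
restored = replay(saved)
```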
  • 38. Data processes evolve over time People forget or leave Quickly understand the process Maintain more effectively over many years Why is documentation important?
  • 39. Job design is documentation too – smaller jobs are easier to understand Document wisely and to the point Pay special attention to interfaces and reused jobs Set documentation conventions Documentation
  • 40. Quick recap Key principles Break down complex processes Avoid duplicating functionality Aim for consistency Maintain data quality Documentation
  • 41. Upcoming Webinars 5 Characteristics of modern data architecture that drive innovation March 23rd Q&A www.cloverdx.com/webinars

Editor's Notes

  1. Maintainability: your process will become extensible. Completeness: you will not forget about other critical elements of the process. Efficient development process: enables teamwork, shorter development phase, smaller code base.
  2. Split responsibilities between components
  3. Split responsibilities between components
  4. Ideal pipeline has up to 15 components One job should not do multiple things
  5. Ideal pipeline has up to 15 components One job should not do multiple things
  6. Multi-layer architecture Abstraction with possibilities to drill-down to more details
  7. Removes redundancy → Smaller code base Standardize process → Increased transparency and trust Shorter time to deliver updates → Saves time and costs Easier scalability
  8. Process reusability – framework Configuration in DB or ERP, CRM etc. Pipeline reusability Subprocess (e.g. Data staging) Functional reusability Single unit / function reused in pipelines of different purpose etc.
  9. Three levels of reusability Process reusability (i.e., set of pipelines configured via external configuration) Pipeline reusability (e.g., sub-process reusability) Functionality reusability (e.g., logger, notifier, transformer, formatter, encryptor,…) Process reusability – framework Set of pipelines configured via external configuration Configuration in DB or ERP, CRM etc. Pipeline reusability Subprocess reusability (e.g. Data staging) Functional reusability Logger, notifier, transformer, formatter, encryptor,… Single unit / function reused in pipelines of different purpose etc.
  10. Modular design: you can easily change parts of the process without affecting the rest.
  11. For example, you can replace the source with a new one (e.g. you replace your CRM with a different product, you switch cloud providers, etc.). With good modular design you only implement the source change and won’t have to touch the rest of the pipeline → time and cost savings.
  12. Or you can use individual parts of your pipelines elsewhere. For example, here I’m using the Source from the previous pipeline in a new one, but it’s the same source.
  13. What does it look like in a product like CloverDX? Here you can see the same source, called DonationsReader, being used in two different pipelines.
  14. Prevent issues: in dynamic transformations
  15. Data quality SILENT error, automatic mapping issue Code review Automated built-in checks etc.
  16. Data quality SILENT error, automatic mapping issue Code review Automated built-in checks etc.
  17. Data quality SILENT error, automatic mapping issue Code review Automated built-in checks etc.
  18. Naming conventions for files, processes, …
  19. Ask yourself what the data means to your business and why you collect it → it is worth checking data quality. Poor data quality → inaccurate reporting → wrong business decisions. RWI: Incomplete records. Process fails. Missing alternative path? Do you back up source delta records to rebuild history in case of an error?
  20. Efficient data process * Not spending too much time on something that is not worth it
  21. Validate sooner: file type check, profile data. Are you expecting an XML file? Check that it is an XML file first. Profile data (if necessary) before you start individual record validation. Unnecessary validation: big data profiling may lead to unnecessary read operations (a few lines might be enough, or leave it for later). Create libraries or custom components and reuse them as often as possible. Handle exceptions.
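The "check it is an XML file first" advice might look like the sketch below, which fails fast on a payload that is not well-formed XML before any record-level validation runs. The `record` element name, the `id` attribute, and the `process` function are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def process(payload: str) -> list:
    # Cheap structural check first: reject anything that is not well-formed XML.
    try:
        root = ET.fromstring(payload)
    except ET.ParseError as error:
        raise ValueError("expected an XML document") from error

    # Record-level validation only runs once the file type is confirmed.
    return [
        dict(record.attrib)
        for record in root.iter("record")
        if "id" in record.attrib
    ]


accepted = process('<data><record id="1"/><record/></data>')
```

A CSV file sent by mistake is rejected with a single clear error rather than producing one validation failure per row.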
  22. Back up data that you will not be able to retrieve again, especially data that is business critical. Typically, this would be data from transactional systems and third-party systems.
  23. Efficient team work too…
  24. Document wisely: Notes in a pipeline should only deal with the code in the pipeline
  25. Document wisely: Notes in a pipeline should only deal with the code in the pipeline