Data architecture
principles to accelerate
your data strategy
Jan Slechta
Senior Data Engineer @ CloverDX
Breaking down complex processes
Avoiding duplicate functionalities
Consistency
Data quality
Documentation
Key principles
Maintenance over time
o Development team productivity
o Cost-effectiveness
Trust in process and in data
o Transparency
o Completeness of the process
Why do these matter?
Break down complex processes
into simple elements
Data pipelines maintainable in the long term
Completeness of the process
Development team productivity
Better test coverage → Robust solution
Trust in process
Why is this important?
Maintainability
o Our stored procedures are too complex, and the author left the company.
Efficiency
o Team of four developers is slow and cannot work in parallel.
Completeness
o We forgot to implement auditing and we don’t know how to add it to the existing process.
Trust
o Often after deployment of a new feature, our pipelines unexpectedly break.
Real world issues
Large jobs are a common sign of bad architecture
How to break the job into smaller pieces?
Transfer files to cloud → Load into Snowflake → Build Models
Identify individual components of data pipelines
Each job should deal with a single task
How to break the job into smaller pieces?
Ingest → Validate → Transform → Deliver (each step writes to Log)
Transfer files to cloud → Load into Snowflake → Build Models
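The idea of one task per job can be sketched in a few lines of Python. This is only an illustration under assumptions, not the presenter's implementation: the step names follow the slide (ingest, validate, transform, deliver), while the sample rows, the `run_pipeline` helper, and the in-memory `log` list are hypothetical.

```python
from typing import Callable

Record = dict
Step = Callable[[list], list]


def ingest(_: list) -> list:
    # Stand-in for reading source files; the rows are made up for illustration.
    return [{"id": "1", "amount": "10.5"}, {"id": "", "amount": "3"}]


def validate(records: list) -> list:
    # Single task: drop records that have no id.
    return [r for r in records if r["id"]]


def transform(records: list) -> list:
    # Single task: convert text fields to typed values.
    return [{"id": int(r["id"]), "amount": float(r["amount"])} for r in records]


def deliver(records: list) -> list:
    # Stand-in for loading into the target system.
    return records


def run_pipeline(steps: list, log: list) -> list:
    # Run the steps in order and record one log line after each step.
    records: list = []
    for step in steps:
        records = step(records)
        log.append(f"{step.__name__}: {len(records)} records")
    return records


log: list = []
result = run_pipeline([ingest, validate, transform, deliver], log)
```

Because each step does one thing, a step can be tested, replaced, or assigned to a different developer without touching the others.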
Ask questions
o What is the purpose of the process, and what is its business impact?
o What interfaces are you going to use?
o How would you like to automate the process?
o What are the weak points?
o How to handle errors?
Identify patterns
o Repeatable and configurable code sections
o Logging, monitoring, automation, …
How to break the job into smaller pieces?
Reuse functionality
Avoid copy-paste by building and reusing generic jobs for common tasks
Logger
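A generic logger that many jobs share could look roughly like the decorator below. It is a sketch using Python's standard `logging` module; the logger name "pipeline" and the two example jobs are assumptions made for illustration and do not come from the slides.

```python
import functools
import logging


def logged(job):
    """Wrap any job so that start, finish, and failure are logged the same way."""
    @functools.wraps(job)
    def wrapper(*args, **kwargs):
        logger = logging.getLogger("pipeline")  # hypothetical shared logger name
        logger.info("start %s", job.__name__)
        try:
            result = job(*args, **kwargs)
        except Exception:
            logger.exception("failed %s", job.__name__)
            raise
        logger.info("finish %s", job.__name__)
        return result
    return wrapper


@logged
def parse_data_file(text: str) -> list:
    # Split comma-separated lines into fields, skipping empty lines.
    return [line.split(",") for line in text.splitlines() if line]


@logged
def count_rows(rows: list) -> int:
    return len(rows)
```

Changing the log format or level of detail then means editing `logged` once instead of editing every job.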
Beauty of keeping it small
Fewer than 10 steps in each design makes the process easier to understand and maintain
Parse Data File
Avoid duplicating functionality
Standardize process
Increased developer productivity
Faster turnaround
Increased trust
Reduced cost of business processes
Why avoid duplicating functionality?
Productivity
o Implementing a single change to our core process
required updates to nearly 80 jobs.
Consistency
o During internal audit, we realized that auditing
components do not log at the same level of detail.
Real world issues
Avoid duplication by modular design
Source Business logic Target
New Source
No additional cost when adapting to a new source
Avoid duplication by modular design
Source → Other Business logic → Other Target
Build a new pipeline with the same source
Functional reusability in CloverDX
Pipeline 1 – PII detection
Pipeline 2 – Publishing to web
Shared Source
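The Source / Business logic / Target split with a shared source can be illustrated in plain Python. This is not CloverDX code or its API; `build_pipeline`, the sample customer records, and the naive "@" check used as PII detection are all hypothetical stand-ins.

```python
from typing import Callable, Iterable

Record = dict
Source = Callable[[], Iterable[Record]]
Logic = Callable[[Iterable[Record]], Iterable[Record]]
Target = Callable[[Iterable[Record]], None]


def build_pipeline(source: Source, logic: Logic, target: Target) -> Callable[[], None]:
    # A pipeline is just source -> business logic -> target, each replaceable.
    def run() -> None:
        target(logic(source()))
    return run


def customer_source() -> Iterable[Record]:
    # Shared source; the rows are made up for illustration.
    yield {"name": "Ada", "email": "ada@example.com"}
    yield {"name": "Bob", "email": ""}


def detect_pii(records: Iterable[Record]) -> Iterable[Record]:
    # Naive placeholder check: treat a non-empty email address as PII.
    for r in records:
        yield {**r, "has_pii": "@" in r["email"]}


def publishable(records: Iterable[Record]) -> Iterable[Record]:
    # Keep only the fields that are safe to publish.
    for r in records:
        yield {"name": r["name"]}


pii_report: list = []
web_rows: list = []
pipeline_1 = build_pipeline(customer_source, detect_pii, pii_report.extend)
pipeline_2 = build_pipeline(customer_source, publishable, web_rows.extend)
pipeline_1()
pipeline_2()
```

Both pipelines reuse `customer_source` unchanged, so adding a third pipeline on the same data only requires new business logic and a target.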
Aim for consistency
Helps everyone on the team understand the jobs
Prevents data issues
Helps you identify errors more easily → Helps meet SLAs
Why strive for consistency?
Data quality
o Some data fields are not populated although the data is in the source.
Team productivity
o We don’t have a good approach for change management. Before each release
we spend days fixing the conflicts when all teams deliver their work.
Consistency
o Each developer approaches the task differently and the jobs are difficult to
monitor in production.
Real world issues
Naming conventions
Documentation conventions
Development conventions
o Break down where customization is expected
o Versioning and teamwork related conventions
Set expectations and provide training
o Training increases productivity (data integration platform, version control, etc.)
Define conventions
Manage data quality
Bad data = Cost
o Correction
o Penalties
o Lost business
Accurate data to support business
Efficient data process
Adaptability and recoverability from data issues
Why does data quality matter?
Distorted data reports
o Because we did not check data set quality, we not only had to build another
complicated cleanup process, but we were also running our business based on
wrong sales results.
Unable to deliver
o We have identified an issue in the pipeline, but we can’t fix the data as we do not
store delta sources from our transactional systems. We can’t implement our new
use case.
Data quality checks are too slow
o Profiling the source helps us deliver better data, but the process is too slow and we
cannot meet our SLA. Should we remove data quality checks?
Real world issues
Always expect poor data quality
Validate early to meet SLAs and reduce downstream burden
Avoid unnecessary validation
Reuse validation rules for consistency
Data quality basic principles
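Reusable validation rules applied early in a pipeline might be sketched as follows. The rule builders `required` and `positive_number`, the `ORDER_RULES` list, and the order fields are assumptions chosen for illustration; the slides do not prescribe a specific rule format.

```python
from typing import Callable, Optional

Record = dict
Rule = Callable[[Record], Optional[str]]  # returns an error message, or None if OK


def required(field: str) -> Rule:
    def rule(record: Record) -> Optional[str]:
        return None if record.get(field) else f"{field} is missing"
    return rule


def positive_number(field: str) -> Rule:
    def rule(record: Record) -> Optional[str]:
        try:
            ok = float(record.get(field, "")) > 0
        except (TypeError, ValueError):
            ok = False
        return None if ok else f"{field} must be a positive number"
    return rule


# Defined once and shared by every job that handles orders.
ORDER_RULES = [required("order_id"), positive_number("amount")]


def validate(records: list, rules: list) -> tuple:
    # Split records into accepted and rejected before any expensive processing.
    accepted, rejected = [], []
    for record in records:
        errors = [msg for rule in rules if (msg := rule(record)) is not None]
        if errors:
            rejected.append((record, errors))
        else:
            accepted.append(record)
    return accepted, rejected


orders = [
    {"order_id": "A1", "amount": "19.99"},
    {"order_id": "", "amount": "-5"},
]
accepted, rejected = validate(orders, ORDER_RULES)
```

Rejected records keep their error messages, so they can be routed to human review rather than silently dropped.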
Fixing the data may require the original source and human review
Keep the source data in a staging environment
Delta records might be sufficient
Prioritize business-critical data in storage
Keep source data
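One possible way to keep delta records in staging so that a fixed pipeline can be re-run is sketched below. The JSON Lines layout, the batch ids, and the function names `stage_delta` and `replay` are hypothetical; a real staging area would more likely be a database or object storage.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def stage_delta(staging_dir: Path, batch_id: str, records: list) -> Path:
    # Write the raw delta records untouched, one JSON object per line.
    staging_dir.mkdir(parents=True, exist_ok=True)
    path = staging_dir / f"{batch_id}.jsonl"
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    return path


def replay(staging_dir: Path, transform: Callable[[dict], dict]) -> list:
    # Re-run a (corrected) transformation over every staged batch, in batch order.
    out: list = []
    for path in sorted(staging_dir.glob("*.jsonl")):
        with path.open(encoding="utf-8") as f:
            out.extend(transform(json.loads(line)) for line in f if line.strip())
    return out


staging = Path(tempfile.mkdtemp()) / "staging"
stage_delta(staging, "batch-001", [{"sku": "A", "qty": 2}])
stage_delta(staging, "batch-002", [{"sku": "B", "qty": 5}])

# Later, after fixing a bug in the transformation, reprocess from the kept source.
fixed = replay(staging, lambda r: {**r, "qty": r["qty"] * 10})
```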
Provide documentation
Data processes evolve over time
People forget or leave
Quickly understand the process
Maintain more effectively over many years
Why is documentation important?
Job design is documentation too – smaller jobs are easier to understand
Document wisely and to the point
Pay special attention to interfaces and reused jobs
Set documentation conventions
Documentation
Quick recap
Key principles
Break down complex processes
Avoid duplicating functionality
Aim for consistency
Maintain data quality
Documentation
Upcoming Webinars
5 Characteristics of modern data
architecture that drive innovation
March 23rd
Q&A
www.cloverdx.com/webinars

Data architecture principles to accelerate your data strategy

  • 1. Data architecture principles to accelerate your data strategy Jan Slechta Senior Data Engineer @ CloverDX
  • 2. Breaking down complex processes Avoiding duplicate functionalities Consistency Data quality Documentation Key principles
  • 3. Maintenance over time o Development team productivity o Cost-effectiveness Trust in process and in data o Transparency o Completeness of the process Why do these matter?
  • 5. Data pipelines maintainable in long-term Completeness of the process Development team productivity Better test coverage  Robust solution Trust in process Why is this important?
  • 6. Maintainability o Our stored procedures are too complex, and the author left the company. Real world issues
  • 7. Maintainability o Our stored procedures are too complex, and the author left the company. Efficiency o Team of four developers is slow and cannot work in parallel. Real world issues
  • 8. Maintainability o Our stored procedures are too complex, and the author left the company. Efficiency o Team of four developers is slow and cannot work in parallel. Completeness o We forgot to implement auditing and we don’t know how to add it to the existing process. Real world issues
  • 9. Real world issues. Maintainability: Our stored procedures are too complex, and the author left the company. Efficiency: A team of four developers is slow and cannot work in parallel. Completeness: We forgot to implement auditing and we don’t know how to add it to the existing process. Trust: Often, after deployment of a new feature, our pipelines unexpectedly break.
  • 10. How to break the job into smaller pieces? Large jobs are a common sign of bad architecture. (Diagram: Transfer files to cloud, Load into Snowflake, Build Models)
  • 11. How to break the job into smaller pieces? Identify individual components of data pipelines. Each job should deal with a single task. (Diagram: Ingest, Validate, Transform, Deliver, with Log steps; Transfer files to cloud, Load into Snowflake, Build Models)
  • 13. How to break the job into smaller pieces? Ask questions: What is the purpose of the process, and what is its business impact? What interfaces are you going to use? How would you like to automate the process? What are the weak points? How to handle errors? Identify patterns: repeatable and configurable code sections; logging, monitoring, automation, …
  • 14. Reuse functionality: avoid copy-paste by building and reusing generic jobs for common tasks (example: Logger)
  • 15. Beauty of keeping it small: fewer than 10 steps in each design helps you understand and maintain the process (example: Parse Data File)
  • 17. Why avoid duplicating functionality? Standardize process; increased developer productivity; faster turnaround; increased trust; reduced cost of business processes
  • 19. Real world issues. Productivity: Implementing a single change to our core process required updates to nearly 80 jobs. Consistency: During an internal audit, we realized that auditing components do not log at the same level of detail.
  • 20. Avoid duplication by modular design (diagram: Source, Business logic, Target)
  • 21. Avoid duplication by modular design: no additional cost of adapting a new source (diagram: New Source, Business logic, Target)
  • 22. Avoid duplication by modular design: build a new pipeline with the same source (diagram: Source, Other Business logic, Other Target)
  • 23. Functional reusability in CloverDX: Pipeline 1 (PII detection) and Pipeline 2 (Publishing to web) use a shared source
  • 25. Why strive for consistency? Helps the team understand each other’s jobs; prevents data issues; makes errors easier to identify → helps meet SLAs
  • 28. Real world issues. Data quality: Some data fields are not populated although the data is in the source. Team productivity: We don’t have a good approach for change management. Before each release we spend days fixing the conflicts when all teams deliver their work. Consistency: Each developer approaches the task differently and the jobs are difficult to monitor in production.
  • 29. Define conventions: naming conventions; documentation conventions; development conventions (break down where customization is expected; versioning and teamwork-related conventions). Set expectations and provide training: training will increase productivity (data integration platform, version control, etc.).
  • 31. Why does data quality matter? Bad data = cost (correction, penalties, lost business); accurate data to support business; efficient data process; adaptability and recoverability from data issues
  • 34. Real world issues. Distorted data reports: Because we did not check data set quality, we not only had to build another complicated clean-up process, but we were also running our business based on wrong sales results. Unable to deliver: We have identified an issue in the pipeline, but we can’t fix the data as we do not store delta sources from our transactional systems. We can’t implement our new use case. Data quality check is too slow: Profiling the source helps us deliver better data, but the process is too slow and we cannot meet our SLA. Do we remove data quality checks?
  • 35. Data quality basic principles: always expect poor data quality; validate early to keep SLA and reduce downstream burden; avoid unnecessary validation; reuse validation rules for consistency
  • 36. Keep source data: fixing the data may require the original source and human review; keep the source data in a staging environment; delta records might be sufficient; prioritize business-critical data in storage
  • 38. Why is documentation important? Data processes evolve over time; people forget or leave; quickly understand the process; maintain more effectively over many years
  • 39. Documentation: job design is documentation too – smaller jobs are easier to understand; document wisely and to the point; pay special attention to interfaces and reused jobs; set documentation conventions
  • 40. Quick recap of key principles: break down complex processes; avoid duplicating functionality; aim for consistency; maintain data quality; documentation
  • 41. Upcoming Webinars: 5 Characteristics of modern data architecture that drive innovation, March 23rd. Q&A. www.cloverdx.com/webinars
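
The breakdown on slides 11 and 20 to 22 (Ingest, Validate, Transform, Deliver; Source, Business logic, Target) can be sketched in plain Python. This is only an illustration of the idea, not CloverDX code: the `Pipeline` class, the two sources, and the sample records are all hypothetical names invented for the example.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable, List

Record = dict  # one row of data; a plain dict keeps the sketch simple


@dataclass
class Pipeline:
    """Ingest -> Validate -> Transform -> Deliver, one single-purpose callable per stage."""
    source: Callable[[], Iterable[Record]]
    validate: Callable[[Record], bool]
    transform: Callable[[Record], Record]
    deliver: Callable[[List[Record]], None]
    log: Callable[[str], None] = print  # one shared logger, reused by every stage

    def run(self) -> int:
        records = list(self.source())
        self.log(f"ingest: {len(records)} records")
        valid = [r for r in records if self.validate(r)]
        self.log(f"validate: {len(records) - len(valid)} rejected")
        out = [self.transform(r) for r in valid]
        self.log(f"transform: {len(out)} records")
        self.deliver(out)
        self.log(f"deliver: {len(out)} records")
        return len(out)


# Hypothetical sources; real ones would read from a CRM, files, and so on.
def crm_source():
    return [{"name": "Ada", "amount": "10"}, {"name": "", "amount": "5"}]


def file_source():
    return [{"name": "Bob", "amount": "7"}]


# Business logic shared by every pipeline that needs it.
def has_name(record):
    return bool(record["name"])


def amount_to_int(record):
    return {**record, "amount": int(record["amount"])}


messages, target = [], []
pipeline = Pipeline(crm_source, has_name, amount_to_int, target.extend, messages.append)
delivered = pipeline.run()

# Swapping the source touches nothing else in the pipeline.
delivered_from_file = replace(pipeline, source=file_source).run()
```

Because each stage is passed in as its own small function, a new source or a different target is a one-argument change, and the logger is written once instead of being copied into every job.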

Editor's Notes

  1. Maintainability: your process will become extensible. Completeness: you will not forget about other critical elements of the process. Efficient development process: enables teamwork, shorter development phase, smaller code base.
  2. Split responsibilities between components
  4. Ideal pipeline has up to 15 components. One job should not do multiple things.
  6. Multi-layer architecture: abstraction with the possibility to drill down into more detail.
  7. Removes redundancy → smaller code base. Standardized process → increased transparency and trust. Shorter time to deliver updates → saves time and costs. Easier scalability.
  8. Process reusability (framework): configuration in DB or ERP, CRM, etc. Pipeline reusability: subprocess (e.g. data staging). Functional reusability: single unit / function reused in pipelines of different purposes, etc.
  9. Three levels of reusability. Process reusability (framework): a set of pipelines configured via external configuration, kept in a DB or ERP, CRM, etc. Pipeline reusability: sub-process reusability (e.g. data staging). Functional reusability: a single unit / function (e.g. logger, notifier, transformer, formatter, encryptor, …) reused in pipelines of different purposes.
  10. Modular design – you can easily change parts of the process without affecting the rest.
  11. For example, you can replace the source with a new source (e.g. you replace your CRM with a different product, you switch cloud providers, etc.). With good modular design you only implement the source change and won’t have to touch the rest of the pipeline → time and cost savings.
  12. Or you can use individual parts of your pipelines elsewhere – for example, here I’m using the Source from the previous pipeline in a new one – but it’s the same source.
  13. What does it look like in a product like CloverDX? Here you can see the same source, called DonationsReader, being used in two different pipelines.
  14. Prevent issues: in dynamic transformations
  15. Data quality: SILENT error, automatic mapping issue. Code review. Automated built-in checks, etc.
  18. Naming conventions for files, processes, …
  19. Ask yourself what the data means to your business and why you collect it → it is worth checking data quality. Poor data quality → inaccurate reporting → wrong business decisions. RWI: Incomplete records. Process fails. Missing alternative path? Do you back up source delta records to rebuild history in case of an error?
  20. Efficient data process: not spending too much time on something that is not worth it.
  21. Sooner: file type check, profile data. Are you expecting an XML file? Check that it is an XML file first. Profile data (if necessary) before you start individual record validation. Unnecessary validation: profiling big data may lead to unnecessary read operations (a few lines might be enough, or leave it for later). Create libraries or custom components and reuse them as often as possible. Handle exceptions -
  22. Back up data that you will not be able to retrieve again, especially data that is business critical. Typically, this would be data from transactional systems and third-party systems.
  23. Efficient team work too…
  25. Document wisely: Notes in a pipeline should only deal with the code in the pipeline
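
The advice in note 21 (check the file type first, then validate individual records with reusable rules) might look roughly like the sketch below. It uses Python's standard `xml.etree.ElementTree`; the `donation` element, its `donor` and `amount` fields, and the two rules are hypothetical and stand in for whatever the real feed contains.

```python
import xml.etree.ElementTree as ET


def parse_xml(text):
    """Cheap file-level check: return the root element, or None if this is not XML."""
    try:
        return ET.fromstring(text)
    except ET.ParseError:
        return None


# Reusable rules: rule name -> predicate over one record (dict of field -> text).
# Defining them once lets every pipeline apply the same checks.
RULES = {
    "donor is present": lambda r: bool(r.get("donor")),
    "amount is a positive number": lambda r: (
        r.get("amount", "").replace(".", "", 1).isdigit() and float(r["amount"]) > 0
    ),
}


def failed_rules(record, rules=RULES):
    """Names of the rules this record breaks (empty list means it is valid)."""
    return [name for name, check in rules.items() if not check(record)]


def validate_payload(text):
    root = parse_xml(text)
    if root is None:
        # Fail early: no record-level work is done on a file of the wrong type.
        return {"accepted": [], "rejected": [], "error": "not well-formed XML"}
    accepted, rejected = [], []
    for element in root.iter("donation"):
        record = {child.tag: (child.text or "") for child in element}
        failures = failed_rules(record)
        if failures:
            rejected.append((record, failures))
        else:
            accepted.append(record)
    return {"accepted": accepted, "rejected": rejected, "error": None}
```

Rejected records are returned together with the names of the rules they broke, so a later step can log them or route them for human review instead of dropping them silently.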