What are the data architecture principles you should be applying to your project design to ensure a successful outcome?
In this session (see the link to the full webinar at the bottom) we walk through some of the basic elements of data architecture and some of the common patterns we've seen in projects, and we'll show you how to make your projects easier to maintain and improve as your data needs evolve.
Some of the key principles include:
Data validation at the point of data entry – how to ensure your projects aren’t derailed by bad data
Consistency – how and why you should be documenting your architecture and development practices
Avoiding duplication – how you should be thinking about reusing code to improve project maintainability
Watch the full webinar at https://www.cloverdx.com/webinars/data-architecture-principles-to-accelerate-data-strategy
Key principles
Breaking down complex processes
Avoiding duplicate functionality
Consistency
Data quality
Documentation
Why do these matter?
Maintenance over time
o Development team productivity
o Cost-effectiveness
Trust in the process and in the data
o Transparency
o Completeness of the process
Why is this important?
Data pipelines that stay maintainable in the long term
Completeness of the process
Development team productivity
Better test coverage
A robust solution
Trust in the process
Real world issues
Maintainability
o "Our stored procedures are too complex, and the author left the company."
Efficiency
o "Our team of four developers is slow and cannot work in parallel."
Completeness
o "We forgot to implement auditing and we don't know how to add it to the existing process."
Trust
o "Often, after deploying a new feature, our pipelines unexpectedly break."
How to break the job into smaller pieces?
Large jobs are a common sign of bad architecture.
[Diagram: one monolithic job – transfer files to cloud, load into Snowflake, build models]
How to break the job into smaller pieces?
Identify the individual components of your data pipelines.
Each job should deal with a single task (see the sketch below).
[Diagram: the same job split into stages – Ingest, Validate, Transform, Deliver (transfer files to cloud, load into Snowflake, build models) – with logging around each stage]
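To make the single-task idea concrete, here is a minimal sketch in plain Python (not CloverDX) of the monolithic job above broken into single-task stages, each with its own logging. All function names, file paths and transformations are illustrative.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def ingest(path: str) -> list[str]:
    """Single task: read raw lines from the source file."""
    log.info("ingest: %s", path)
    with open(path) as f:
        return f.readlines()

def validate(rows: list[str]) -> list[str]:
    """Single task: keep only non-empty rows."""
    log.info("validate: %d rows in", len(rows))
    return [r for r in rows if r.strip()]

def transform(rows: list[str]) -> list[str]:
    """Single task: normalize each row."""
    log.info("transform: %d rows", len(rows))
    return [r.strip().lower() for r in rows]

def deliver(rows: list[str], target: str) -> None:
    """Single task: write the result to its destination."""
    log.info("deliver: %d rows -> %s", len(rows), target)
    with open(target, "w") as f:
        f.writelines(row + "\n" for row in rows)

def run(source: str, target: str) -> None:
    # Each stage does one thing and logs it; any stage can be
    # replaced, tested or monitored in isolation.
    deliver(transform(validate(ingest(source))), target)
```

Because every stage has one responsibility, adding something you forgot (such as auditing) later means wrapping stages, not rewriting a monolith.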
How to break the job into smaller pieces?
Ask questions
o What is the purpose of the process, and what is its business impact?
o What interfaces are you going to use?
o How would you like to automate the process?
o What are the weak points?
o How do you handle errors?
Identify patterns (see the sketch below)
o Repeatable and configurable code sections
o Logging, monitoring, automation, …
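One such repeatable pattern, sketched below in plain Python rather than any particular platform: a single configurable wrapper that gives every pipeline step the same logging, timing and error handling, so the pattern is written once and reused everywhere. The step name load_into_snowflake echoes the example above; the rest is illustrative.

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)

def monitored(step_name: str):
    """Wrap any pipeline step with uniform logging, timing and error reporting."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            log = logging.getLogger(step_name)
            start = time.monotonic()
            log.info("started")
            try:
                result = fn(*args, **kwargs)
            except Exception:
                log.exception("failed")  # errors are handled once, here
                raise
            log.info("finished in %.2fs", time.monotonic() - start)
            return result
        return wrapper
    return decorator

@monitored("load_into_snowflake")  # illustrative step from the example above
def load_into_snowflake(rows: list) -> None:
    ...  # the actual load logic would go here
```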
Why avoid duplicating functionality?
A standardized process
Increased developer productivity
Faster turnaround
Increased trust
Reduced cost of business processes
Real world issues
Productivity
o "Implementing a single change to our core process required updates to nearly 80 jobs."
Consistency
o "During an internal audit, we realized that our auditing components do not log at the same level of detail."
Why strive for consistency?
Helps the whole team understand each other's jobs
Prevents data issues
Makes errors easier to identify
Helps you meet SLAs
Real world issues
Data quality
o "Some data fields are not populated although the data is in the source."
Team productivity
o "We don't have a good approach to change management. Before each release we spend days fixing conflicts when all the teams deliver their work."
Consistency
o "Each developer approaches the task differently and the jobs are difficult to monitor in production."
Define conventions
Naming conventions (a simple automated check is sketched below)
Documentation conventions
Development conventions
o Break down where customization is expected
o Versioning and teamwork-related conventions
Set expectations and provide training
o Training will increase productivity (data integration platform, version control, etc.)
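Conventions stick best when a machine can check them. Below is a minimal sketch assuming a hypothetical job-naming pattern of layer_source_action; your own convention will differ, but the point is that names can be validated automatically rather than by review alone.

```python
import re

# Hypothetical convention: <layer>_<source>_<action>, all lowercase
JOB_NAME = re.compile(r"(ingest|staging|transform|deliver)_[a-z0-9]+_[a-z0-9_]+")

def check_job_name(name: str) -> bool:
    """Return True if the job name follows the agreed convention."""
    return JOB_NAME.fullmatch(name) is not None

assert check_job_name("ingest_crm_daily_contacts")
assert not check_job_name("MyJob_final_v2_REAL")
```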
Why does data quality matter?
Bad data = cost
o Correction
o Penalties
o Lost business
Accurate data to support the business
An efficient data process
Adaptability and recoverability from data issues
Real world issues
Distorted data reports
o "Because we did not check the quality of the data set, we not only had to build another complicated clean-up process, but we were also running our business on wrong sales results."
Unable to deliver
o "We have identified an issue in the pipeline, but we can't fix the data because we do not store delta sources from our transactional systems. We can't implement our new use case."
Data quality checks are too slow
o "Profiling the source helps us deliver better data, but the process is too slow and we cannot meet our SLA. Should we remove the data quality checks?"
Data quality basic principles
Always expect poor data quality
Validate early to keep your SLA and reduce the downstream burden
Avoid unnecessary validation
Reuse validation rules for consistency (see the sketch below)
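To illustrate the last point, a small sketch of shared validation rules: they live in one module and every pipeline composes them, so all teams validate the same way. The field names and rule set are made up for the example.

```python
from collections.abc import Callable

Rule = Callable[[dict], bool]

def required(field: str) -> Rule:
    """Rule: the field must be present and non-empty."""
    return lambda rec: rec.get(field) not in (None, "")

def max_length(field: str, limit: int) -> Rule:
    """Rule: the field must not exceed the given length."""
    return lambda rec: len(str(rec.get(field, ""))) <= limit

# One shared rule set, imported by every pipeline that touches customers.
CUSTOMER_RULES: list[Rule] = [
    required("id"),
    required("email"),
    max_length("name", 100),
]

def validate(record: dict, rules: list[Rule]) -> bool:
    return all(rule(record) for rule in rules)

print(validate({"id": 1, "email": "a@b.com", "name": "Ann"}, CUSTOMER_RULES))  # True
print(validate({"id": 2, "name": "Bob"}, CUSTOMER_RULES))                      # False: no email
```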
Keep source data
Fixing the data may require the original source and human review
Keep the source data in a staging environment (see the sketch below)
Delta records might be sufficient
Prioritize business-critical data in storage
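A minimal sketch of the staging idea, assuming a hypothetical /data/staging location: copy the untouched input to a dated folder before any processing starts, so a bad run can be replayed from the original.

```python
import shutil
from datetime import date
from pathlib import Path

STAGING = Path("/data/staging")  # assumed staging location

def stage_source(source_file: Path) -> Path:
    """Archive the raw source file before the pipeline touches it."""
    target_dir = STAGING / date.today().isoformat()
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source_file.name
    shutil.copy2(source_file, target)  # copy2 preserves file metadata
    return target

# Example: stage_source(Path("/inbox/transactions_delta.csv"))
```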
Why is documentation important?
Data processes evolve over time
People forget, or leave
Documentation helps you quickly understand the process
…and maintain it more effectively over many years
Documentation
Job design is documentation too – smaller jobs are easier to understand
Document wisely and to the point
Pay special attention to interfaces and reused jobs
Set documentation conventions
Maintainability
o Your process will become extensible
Completeness
o You will not forget about other critical elements of the process
Efficient development process
o Enables teamwork
o Shorter development phase
o Smaller code base
Split responsibilities between components
o An ideal pipeline has up to 15 components
o One job should not do multiple things
Multi-layer architecture
Abstraction, with the possibility to drill down into more detail
Removes redundancy
Smaller code base
Standardized processes
Increased transparency and trust
Shorter time to deliver updates
Saves time and costs
Easier scalability
Three levels of reusability
Process reusability – framework
o A set of pipelines configured via external configuration
o Configuration in a DB, or in an ERP, CRM, etc. (see the sketch below)
Pipeline reusability
o Sub-process reusability (e.g. data staging)
Functional reusability
o A single unit or function (logger, notifier, transformer, formatter, encryptor, …) reused in pipelines with different purposes
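A rough sketch of the process level: one generic pipeline driven entirely by external configuration. A dict stands in for the configuration store here; in practice it would live in a database or come from your ERP or CRM. All keys and paths are illustrative.

```python
# Configuration would normally live in a DB / ERP / CRM; a dict stands in.
CONFIGS = {
    "sales": {"source": "/inbox/sales.csv", "target_table": "SALES", "delimiter": ","},
    "hr":    {"source": "/inbox/hr.csv",    "target_table": "HR",    "delimiter": ";"},
}

def run_pipeline(config: dict) -> None:
    # The same code serves every dataset; only the configuration differs.
    print(f"loading {config['source']} into {config['target_table']} "
          f"(delimiter {config['delimiter']!r})")

for name, cfg in CONFIGS.items():
    run_pipeline(cfg)
```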
Modular design means you can easily change parts of the process without affecting the rest. For example, you can replace the source with a new one (you replace your CRM with a different product, you switch cloud providers, etc.). With good modular design you only implement the source change and won't have to touch the rest of the pipeline – a time and cost saving. A minimal sketch follows.
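The sketch below uses a small interface so the rest of the pipeline never learns which concrete source it reads from. The class names are illustrative, not a real CRM API.

```python
from typing import Iterable, Protocol

class Source(Protocol):
    def read(self) -> Iterable[dict]: ...

class OldCrmSource:
    def read(self) -> Iterable[dict]:
        yield {"id": 1, "name": "Ann", "system": "old-crm"}

class NewCrmSource:
    def read(self) -> Iterable[dict]:
        yield {"id": 1, "name": "Ann", "system": "new-crm"}

def pipeline(source: Source) -> None:
    # Downstream steps are untouched when the source implementation changes.
    for record in source.read():
        print(record)

pipeline(OldCrmSource())
pipeline(NewCrmSource())  # the only change is which source we pass in
```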
Or you can use individual parts of your pipelines elsewhere – here, the Source from the previous pipeline is used in a new one, but it's the same source.
What does this look like in a product like CloverDX? Here you can see the same source, called DonationsReader, being used in two different pipelines.
Prevent issues in dynamic transformations
Data quality
o Silent errors, automatic mapping issues (see the sketch below)
Code review
Automated built-in checks, etc.
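To show why such errors stay silent, a small sketch: an auto-mapper that matches fields by identical name quietly drops anything that doesn't match, and a built-in completeness check is what turns the silent error into a loud one. The schema is illustrative.

```python
EXPECTED_FIELDS = {"id", "email", "amount"}  # illustrative target schema

def auto_map(record: dict) -> dict:
    # Maps by identical field name only - an upstream rename
    # ("e-mail" instead of "email") silently disappears here.
    return {k: v for k, v in record.items() if k in EXPECTED_FIELDS}

def map_checked(record: dict) -> dict:
    mapped = auto_map(record)
    missing = EXPECTED_FIELDS - mapped.keys()
    if missing:  # the built-in check: fail loudly instead of losing data
        raise ValueError(f"unmapped fields: {sorted(missing)}")
    return mapped

try:
    map_checked({"id": 1, "e-mail": "a@b.com", "amount": 10})
except ValueError as err:
    print(err)  # unmapped fields: ['email'] - caught, not silently lost
```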
Naming conventions for files, processes, …
Ask yourself what the data means to your business and why you collect it – that tells you whether it is worth checking its quality.
Poor data quality leads to inaccurate reporting, which leads to wrong business decisions.
Real world issues:
o Incomplete records
o The process fails – is an alternative path missing?
o Do you back up source delta records so you can rebuild history in case of an error?
Efficient data process
Don't spend too much time on something that is not worth it.
Validate sooner (see the sketch below):
o File type check – are you expecting an XML file? Check that it is an XML file first.
o Profile the data (if necessary) before you start validating individual records.
Avoid unnecessary validation:
o Profiling big data may lead to unnecessary read operations (a few lines might be enough, or leave it for later).
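A sketch of "validate sooner", using the XML example above: confirm the input is XML at all before spending time on record-level validation, and profile only a sample rather than the whole file. File handling details are illustrative.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

def looks_like_xml(path: Path) -> bool:
    """Cheap first check: can we read the opening element at all?"""
    try:
        next(ET.iterparse(str(path), events=("start",)))
        return True
    except (ET.ParseError, StopIteration):
        return False

def profile_sample(path: Path, lines: int = 100) -> list[str]:
    """Profile only a sample - a few lines are often enough."""
    with open(path) as f:
        return [line for _, line in zip(range(lines), f)]

# Usage: reject early, before any record-level validation runs.
# if not looks_like_xml(Path("/inbox/orders.xml")):
#     raise ValueError("expected an XML file")
```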
Create libraries or custom components and reuse them as often as possible.
Handle exceptions.
Back up data that you will not be able to retrieve again, especially data that is business critical. Typically, this would be data from:
o Transactional systems
o Third-party systems
Efficient teamwork too…
Document wisely: Notes in a pipeline should only deal with the code in the pipeline