What are the data architecture principles you should be applying to your project design to ensure a successful outcome?
In this session (see the link to the full webinar at the bottom) we walk through some of the basic elements of data architecture and some of the common patterns we've seen in projects, and we'll show you how to make your projects easier to maintain and improve as your data needs evolve.
Some of the key principles include:
Data validation at the point of data entry – how to ensure your projects aren’t derailed by bad data
Consistency – how and why you should be documenting your architecture and development practices
Avoiding duplication – how you should be thinking about reusing code to improve project maintainability
Watch the full webinar at https://www.cloverdx.com/webinars/data-architecture-principles-to-accelerate-data-strategy
Key principles
Breaking down complex processes
Avoiding duplicate functionality
Consistency
Data quality
Documentation
Why do these matter?
Maintenance over time
o Development team productivity
o Cost-effectiveness
Trust in the process and in the data
o Transparency
o Completeness of the process
Why is this important?
Data pipelines that stay maintainable in the long term
Completeness of the process
Development team productivity
Better test coverage
A robust solution
Trust in the process
Real world issues
Maintainability
o "Our stored procedures are too complex, and the author left the company."
Efficiency
o "Our team of four developers is slow and cannot work in parallel."
Completeness
o "We forgot to implement auditing and we don't know how to add it to the existing process."
Trust
o "Often, after deploying a new feature, our pipelines unexpectedly break."
How to break the job into smaller pieces?
Large jobs are a common sign of bad architecture.
[Diagram: one monolithic job – transfer files to cloud, load into Snowflake, build models]
How to break the job into smaller pieces?
Identify the individual components of your data pipelines.
Each job should deal with a single task (see the sketch below).
[Diagram: the same job split into stages – Ingest, Validate, Transform, Deliver (transfer files to cloud, load into Snowflake, build models) – with logging around each stage]
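To make the single-task idea concrete, here is a minimal sketch in plain Python (not CloverDX) of the monolithic job above broken into single-task stages, each with its own logging. All function names, file paths and transformations are illustrative.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def ingest(path: str) -> list[str]:
    """Single task: read raw lines from the source file."""
    log.info("ingest: %s", path)
    with open(path) as f:
        return f.readlines()

def validate(rows: list[str]) -> list[str]:
    """Single task: keep only non-empty rows."""
    log.info("validate: %d rows in", len(rows))
    return [r for r in rows if r.strip()]

def transform(rows: list[str]) -> list[str]:
    """Single task: normalize each row."""
    log.info("transform: %d rows", len(rows))
    return [r.strip().lower() for r in rows]

def deliver(rows: list[str], target: str) -> None:
    """Single task: write the result to its destination."""
    log.info("deliver: %d rows -> %s", len(rows), target)
    with open(target, "w") as f:
        f.writelines(row + "\n" for row in rows)

def run(source: str, target: str) -> None:
    # Each stage does one thing and logs it; any stage can be
    # replaced, tested or monitored in isolation.
    deliver(transform(validate(ingest(source))), target)
```

Because every stage has one responsibility, adding something you forgot (such as auditing) later means wrapping stages, not rewriting a monolith.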
How to break the job into smaller pieces?
Ask questions
o What is the purpose of the process, and what is its business impact?
o What interfaces are you going to use?
o How would you like to automate the process?
o What are the weak points?
o How do you handle errors?
Identify patterns (see the sketch below)
o Repeatable and configurable code sections
o Logging, monitoring, automation, …
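One such repeatable pattern, sketched below in plain Python rather than any particular platform: a single configurable wrapper that gives every pipeline step the same logging, timing and error handling, so the pattern is written once and reused everywhere. The step name load_into_snowflake echoes the example above; the rest is illustrative.

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)

def monitored(step_name: str):
    """Wrap any pipeline step with uniform logging, timing and error reporting."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            log = logging.getLogger(step_name)
            start = time.monotonic()
            log.info("started")
            try:
                result = fn(*args, **kwargs)
            except Exception:
                log.exception("failed")  # errors are handled once, here
                raise
            log.info("finished in %.2fs", time.monotonic() - start)
            return result
        return wrapper
    return decorator

@monitored("load_into_snowflake")  # illustrative step from the example above
def load_into_snowflake(rows: list) -> None:
    ...  # the actual load logic would go here
```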
Why avoid duplicating functionality?
A standardized process
Increased developer productivity
Faster turnaround
Increased trust
Reduced cost of business processes
Real world issues
Productivity
o "Implementing a single change to our core process required updates to nearly 80 jobs."
Consistency
o "During an internal audit, we realized that our auditing components do not log at the same level of detail."
Why strive for consistency?
Helps the whole team understand each other's jobs
Prevents data issues
Makes errors easier to identify
Helps you meet SLAs
Real world issues
Data quality
o "Some data fields are not populated although the data is in the source."
Team productivity
o "We don't have a good approach to change management. Before each release we spend days fixing conflicts when all the teams deliver their work."
Consistency
o "Each developer approaches the task differently and the jobs are difficult to monitor in production."
Define conventions
Naming conventions (a simple automated check is sketched below)
Documentation conventions
Development conventions
o Break down where customization is expected
o Versioning and teamwork-related conventions
Set expectations and provide training
o Training will increase productivity (data integration platform, version control, etc.)
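Conventions stick best when a machine can check them. Below is a minimal sketch assuming a hypothetical job-naming pattern of layer_source_action; your own convention will differ, but the point is that names can be validated automatically rather than by review alone.

```python
import re

# Hypothetical convention: <layer>_<source>_<action>, all lowercase
JOB_NAME = re.compile(r"(ingest|staging|transform|deliver)_[a-z0-9]+_[a-z0-9_]+")

def check_job_name(name: str) -> bool:
    """Return True if the job name follows the agreed convention."""
    return JOB_NAME.fullmatch(name) is not None

assert check_job_name("ingest_crm_daily_contacts")
assert not check_job_name("MyJob_final_v2_REAL")
```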
Why does data quality matter?
Bad data = cost
o Correction
o Penalties
o Lost business
Accurate data to support the business
An efficient data process
Adaptability and recoverability from data issues
Real world issues
Distorted data reports
o "Because we did not check the quality of the data set, we not only had to build another complicated clean-up process, but we were also running our business on wrong sales results."
Unable to deliver
o "We have identified an issue in the pipeline, but we can't fix the data because we do not store delta sources from our transactional systems. We can't implement our new use case."
Data quality checks are too slow
o "Profiling the source helps us deliver better data, but the process is too slow and we cannot meet our SLA. Should we remove the data quality checks?"
Data quality basic principles
Always expect poor data quality
Validate early to keep your SLA and reduce the downstream burden
Avoid unnecessary validation
Reuse validation rules for consistency (see the sketch below)
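To illustrate the last point, a small sketch of shared validation rules: they live in one module and every pipeline composes them, so all teams validate the same way. The field names and rule set are made up for the example.

```python
from collections.abc import Callable

Rule = Callable[[dict], bool]

def required(field: str) -> Rule:
    """Rule: the field must be present and non-empty."""
    return lambda rec: rec.get(field) not in (None, "")

def max_length(field: str, limit: int) -> Rule:
    """Rule: the field must not exceed the given length."""
    return lambda rec: len(str(rec.get(field, ""))) <= limit

# One shared rule set, imported by every pipeline that touches customers.
CUSTOMER_RULES: list[Rule] = [
    required("id"),
    required("email"),
    max_length("name", 100),
]

def validate(record: dict, rules: list[Rule]) -> bool:
    return all(rule(record) for rule in rules)

print(validate({"id": 1, "email": "a@b.com", "name": "Ann"}, CUSTOMER_RULES))  # True
print(validate({"id": 2, "name": "Bob"}, CUSTOMER_RULES))                      # False: no email
```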
Keep source data
Fixing the data may require the original source and human review
Keep the source data in a staging environment (see the sketch below)
Delta records might be sufficient
Prioritize business-critical data in storage
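A minimal sketch of the staging idea, assuming a hypothetical /data/staging location: copy the untouched input to a dated folder before any processing starts, so a bad run can be replayed from the original.

```python
import shutil
from datetime import date
from pathlib import Path

STAGING = Path("/data/staging")  # assumed staging location

def stage_source(source_file: Path) -> Path:
    """Archive the raw source file before the pipeline touches it."""
    target_dir = STAGING / date.today().isoformat()
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source_file.name
    shutil.copy2(source_file, target)  # copy2 preserves file metadata
    return target

# Example: stage_source(Path("/inbox/transactions_delta.csv"))
```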
Why is documentation important?
Data processes evolve over time
People forget, or leave
Documentation helps you quickly understand the process
…and maintain it more effectively over many years
Documentation
Job design is documentation too – smaller jobs are easier to understand
Document wisely and to the point
Pay special attention to interfaces and reused jobs
Set documentation conventions
Maintainability
o Your process will become extensible
Completeness
o You will not forget about other critical elements of the process
Efficient development process
o Enables teamwork
o Shorter development phase
o Smaller code base
Split responsibilities between components
o An ideal pipeline has up to 15 components
o One job should not do multiple things
Multi-layer architecture
Abstraction, with the possibility to drill down into more detail
Removes redundancy
Smaller code base
Standardized processes
Increased transparency and trust
Shorter time to deliver updates
Saves time and costs
Easier scalability
Three levels of reusability
Process reusability – framework
o A set of pipelines configured via external configuration
o Configuration in a DB, or in an ERP, CRM, etc. (see the sketch below)
Pipeline reusability
o Sub-process reusability (e.g. data staging)
Functional reusability
o A single unit or function (logger, notifier, transformer, formatter, encryptor, …) reused in pipelines with different purposes
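A rough sketch of the process level: one generic pipeline driven entirely by external configuration. A dict stands in for the configuration store here; in practice it would live in a database or come from your ERP or CRM. All keys and paths are illustrative.

```python
# Configuration would normally live in a DB / ERP / CRM; a dict stands in.
CONFIGS = {
    "sales": {"source": "/inbox/sales.csv", "target_table": "SALES", "delimiter": ","},
    "hr":    {"source": "/inbox/hr.csv",    "target_table": "HR",    "delimiter": ";"},
}

def run_pipeline(config: dict) -> None:
    # The same code serves every dataset; only the configuration differs.
    print(f"loading {config['source']} into {config['target_table']} "
          f"(delimiter {config['delimiter']!r})")

for name, cfg in CONFIGS.items():
    run_pipeline(cfg)
```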
Modular design means you can easily change parts of the process without affecting the rest. For example, you can replace the source with a new one (you replace your CRM with a different product, you switch cloud providers, etc.). With good modular design you only implement the source change and won't have to touch the rest of the pipeline – a time and cost saving. A minimal sketch follows.
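The sketch below uses a small interface so the rest of the pipeline never learns which concrete source it reads from. The class names are illustrative, not a real CRM API.

```python
from typing import Iterable, Protocol

class Source(Protocol):
    def read(self) -> Iterable[dict]: ...

class OldCrmSource:
    def read(self) -> Iterable[dict]:
        yield {"id": 1, "name": "Ann", "system": "old-crm"}

class NewCrmSource:
    def read(self) -> Iterable[dict]:
        yield {"id": 1, "name": "Ann", "system": "new-crm"}

def pipeline(source: Source) -> None:
    # Downstream steps are untouched when the source implementation changes.
    for record in source.read():
        print(record)

pipeline(OldCrmSource())
pipeline(NewCrmSource())  # the only change is which source we pass in
```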
Or you can use individual parts of your pipelines elsewhere – here, the Source from the previous pipeline is used in a new one, but it's the same source.
What does this look like in a product like CloverDX? Here you can see the same source, called DonationsReader, being used in two different pipelines.
Prevent issues in dynamic transformations
Data quality
o Silent errors, automatic mapping issues (see the sketch below)
Code review
Automated built-in checks, etc.
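To show why such errors stay silent, a small sketch: an auto-mapper that matches fields by identical name quietly drops anything that doesn't match, and a built-in completeness check is what turns the silent error into a loud one. The schema is illustrative.

```python
EXPECTED_FIELDS = {"id", "email", "amount"}  # illustrative target schema

def auto_map(record: dict) -> dict:
    # Maps by identical field name only - an upstream rename
    # ("e-mail" instead of "email") silently disappears here.
    return {k: v for k, v in record.items() if k in EXPECTED_FIELDS}

def map_checked(record: dict) -> dict:
    mapped = auto_map(record)
    missing = EXPECTED_FIELDS - mapped.keys()
    if missing:  # the built-in check: fail loudly instead of losing data
        raise ValueError(f"unmapped fields: {sorted(missing)}")
    return mapped

try:
    map_checked({"id": 1, "e-mail": "a@b.com", "amount": 10})
except ValueError as err:
    print(err)  # unmapped fields: ['email'] - caught, not silently lost
```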
Naming conventions for files, processes, …
Ask yourself what the data means to your business and why you collect it – that tells you whether it is worth checking its quality.
Poor data quality leads to inaccurate reporting, which leads to wrong business decisions.
Real world issues:
o Incomplete records
o The process fails – is an alternative path missing?
o Do you back up source delta records so you can rebuild history in case of an error?
Efficient data process
Don't spend too much time on something that is not worth it.
Validate sooner (see the sketch below):
o File type check – are you expecting an XML file? Check that it is an XML file first.
o Profile the data (if necessary) before you start validating individual records.
Avoid unnecessary validation:
o Profiling big data may lead to unnecessary read operations (a few lines might be enough, or leave it for later).
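A sketch of "validate sooner", using the XML example above: confirm the input is XML at all before spending time on record-level validation, and profile only a sample rather than the whole file. File handling details are illustrative.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

def looks_like_xml(path: Path) -> bool:
    """Cheap first check: can we read the opening element at all?"""
    try:
        next(ET.iterparse(str(path), events=("start",)))
        return True
    except (ET.ParseError, StopIteration):
        return False

def profile_sample(path: Path, lines: int = 100) -> list[str]:
    """Profile only a sample - a few lines are often enough."""
    with open(path) as f:
        return [line for _, line in zip(range(lines), f)]

# Usage: reject early, before any record-level validation runs.
# if not looks_like_xml(Path("/inbox/orders.xml")):
#     raise ValueError("expected an XML file")
```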
Create libraries or custom components and reuse them as often as possible.
Handle exceptions.
Back up data that you will not be able to retrieve again, especially data that is business critical. Typically, this would be data from:
o Transactional systems
o Third-party systems
Efficient teamwork too…
Document wisely: Notes in a pipeline should only deal with the code in the pipeline