Characteristics of modern data architecture that drive innovation
Cullen Patel
Solutions Engineer @ CloverDX
Automation, Simplicity, Data Quality, Scalability, and Cost Savings
These matter because
o Keep up with changes
o Keep a competitive edge
Real-world solution – CloverDX project
5 Key characteristics
Automation
Speed up data processing
Reduce errors
Reduce time and effort required
Ensure data is accurate and consistent
Why is automation important?
Think about your automation requirements
o Should it go beyond time-based automation
o Should have required level of granularity
o Should be able to orchestrate multiple data jobs intelligently
Automate handling of bad data, not just the good
o Should notify or even fix errors/re-try automatically
o Provide it on time to the right person
o Should provide the details to address the error
Ensure tools have intelligent automation
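The deck stops at the requirements; as a rough illustration (not from the presentation), here is a minimal Python sketch of automation that goes beyond a time-based trigger: retries, escalation to the right person, and enough detail in the notification to address the error. The job name and the notify helper are hypothetical.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("orchestrator")

def notify(recipient, message):
    # Placeholder: in a real setup this could be email, chat, or a ticketing system.
    log.error("NOTIFY %s: %s", recipient, message)

def run_with_retry(job, retries=3, delay_seconds=60, owner="data-team@example.com"):
    """Run a data job, retry transient failures, and escalate with enough detail to act on."""
    for attempt in range(1, retries + 1):
        try:
            return job()
        except Exception as exc:  # broad on purpose: this is only an illustration
            log.warning("%s failed on attempt %d/%d: %s", job.__name__, attempt, retries, exc)
            if attempt == retries:
                notify(owner, f"{job.__name__} failed after {retries} attempts: {exc}")
                raise
            time.sleep(delay_seconds)

def ingest_orders():
    ...  # hypothetical data job

# run_with_retry(ingest_orders)
```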
Simplicity
Data pipelines maintainable in the long term
Development team productivity
Build the process in pieces
Trust in process
Why is simplicity important?
How to break the job into smaller pieces?
Transfer files to cloud → Load into Snowflake → Build Models
Identify individual components of data pipelines
Each job should deal with a single task
How to break the job into smaller pieces?
Ingest → Validate → Transform → Deliver, with a Log step after each
Transfer files to cloud → Load into Snowflake → Build Models
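To make the "one job, one task" idea concrete, here is a hedged sketch (not from the deck) of a pipeline split into single-purpose steps that an orchestrator simply chains; the step names mirror the Ingest → Validate → Transform → Deliver flow above.

```python
# Illustrative only: each function does exactly one thing, and the
# orchestration is just a chain of those single-task steps.
def ingest(path):
    """Read raw files, e.g. from an SFTP drop or a cloud bucket."""
    ...

def validate(records):
    """Flag or reject records that break the rules (more in the data quality section)."""
    ...

def transform(records):
    """Apply the business logic / mapping."""
    ...

def deliver(records):
    """Load into the target, e.g. Snowflake."""
    ...

def pipeline(path):
    records = ingest(path)
    records = validate(records)
    records = transform(records)
    deliver(records)
```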
Ask questions
o What is the purpose of the process, and what is its business impact?
o What interfaces are you going to use?
o How would you like to automate the process?
o What are the weak points?
o How to handle errors?
How to break the job into smaller pieces?
Ask questions
o What is the purpose of the process, and what is its business impact?
o What interfaces are you going to use?
o How would you like to automate the process?
o What are the weak points?
o How to handle errors?
Identify patterns
o Repeatable and configurable code sections
o Logging, monitoring, automation, …
How to break the job into smaller pieces?
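One way to picture the "repeatable and configurable code sections" pattern is a shared wrapper that gives every step the same logging and monitoring; a minimal Python sketch, purely illustrative:

```python
import functools
import logging
import time

log = logging.getLogger("pipeline")

def logged(step):
    """Reusable pattern: the same logging/monitoring wrapper around every step."""
    @functools.wraps(step)
    def wrapper(*args, **kwargs):
        start = time.time()
        log.info("START %s", step.__name__)
        try:
            result = step(*args, **kwargs)
            log.info("OK %s (%.1fs)", step.__name__, time.time() - start)
            return result
        except Exception:
            log.exception("FAILED %s", step.__name__)
            raise
    return wrapper

@logged
def validate(records):
    ...  # hypothetical step; ingest, transform, deliver would be wrapped the same way
```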
Data Quality
Good data is great for business but bad data can be ruinous
Ensure data is accurate and consistent
Design for bad data
Why is data quality important?
Data profiling – Analyzing the data and looking at its statistics
Data validation – This process involves verifying that data is accurate and consistent, and it can even involve business rules
Data cleansing – This process involves removing or correcting any errors or inconsistencies in the data
Do it as soon as you can!
Have it always on!
How to ensure high quality data?
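A hedged sketch of the "design for bad data" idea (not CloverDX functionality, just plain Python): validate each record, cleanse what can be fixed automatically, and route the rest to an error path instead of silently dropping it. The validation rules and field names are hypothetical.

```python
def cleanse(row):
    # Correct what can be corrected automatically, e.g. stray whitespace.
    return {key: value.strip() for key, value in row.items()}

def is_valid(row):
    # Hypothetical rules: amount must be numeric, customer_id must be present.
    try:
        float(row["amount"])
    except (KeyError, ValueError):
        return False
    return bool(row.get("customer_id"))

def process(rows):
    good, bad = [], []
    for row in rows:
        row = cleanse(row)
        (good if is_valid(row) else bad).append(row)
    # Bad rows go to an error log or manual review instead of being silently dropped.
    return good, bad
```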
Scalability and extensibility
Data is essential for innovation
Data volume is increasing
Businesses need to scale without sacrificing performance/experience
Why is scalability and extensibility important?
A lot of factors to consider
We will focus on:
o How and when to scale hardware
o How and when to scale your data pipelines/jobs
How to scale
Vertical
o increasing the RAM, CPU, or storage capacity of a single server
o often used in traditional on-premise data centers
Horizontal
o adding more nodes to a system and distributing the load
o this approach is often used in distributed systems – in the cloud
Both
o often the approach in the real world
o can be deployed in the cloud or on prem
How to scale hardware – vertical vs horizontal
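As a rough, hedged illustration of the horizontal idea, the sketch below spreads jobs across more workers instead of making one worker bigger; in practice the workers would be cluster nodes or cloud instances rather than local processes.

```python
from concurrent.futures import ProcessPoolExecutor

def run_job(job_id):
    ...  # hypothetical data job for one partition, client, or file

def run_all(job_ids, workers=4):
    # Horizontal scaling in miniature: spread the load across more workers
    # instead of making a single worker bigger (vertical scaling).
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_job, job_ids))
```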
Many/large data jobs
o Often as business grows, volume of data/jobs grows
o Environment and tools need to be stable
o Smart automation helps
Many unique data jobs
o Much more challenging to solve
o Proper tooling and development methodology is key
o Your software should allow you to use dev time efficiently
o Your software should leverage templates/reusable parts
How to handle process scaling
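One common way to keep many near-identical jobs manageable is a single templated job driven by per-client configuration; a minimal sketch under that assumption (the config fields are hypothetical, not a CloverDX API):

```python
from dataclasses import dataclass

@dataclass
class JobConfig:
    client: str
    source_path: str
    delimiter: str = ","
    target_table: str = "STAGING"

def ingest(config: JobConfig):
    # One reusable template; per-client differences live in the config,
    # not in a copy-pasted variant of the job.
    ...

# ingest(JobConfig(client="acme", source_path="sftp://acme/in", delimiter=";"))
```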
Cost Savings
Two types of costs: variable and fixed
Cost in the cloud can be unpredictable –
o Providers offer tools to track and even estimate costs
o Estimates can be hard
o You need good discipline and monitoring of costs -> FinOps
Costs can vary – even with good FinOps
o Seasonal changes
o Changes to workflows
Cost
Capacity vs consumption
Consumption is the most common, especially in the cloud
Consumption can be hard to estimate
Capacity makes it easier to plan and budget
Pricing models
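A toy comparison, with entirely made-up numbers, of why capacity pricing is easier to budget than consumption pricing:

```python
# Toy numbers only; both the prices and the volumes are made up.
capacity_cost_per_year = 60_000                # flat subscription: known up front

price_per_tb_processed = 40.0                  # consumption pricing
monthly_tb = [90, 95, 100, 105, 100, 110,
              115, 120, 130, 180, 220, 200]    # note the seasonal jump late in the year
consumption_cost_per_year = price_per_tb_processed * sum(monthly_tb)

print(capacity_cost_per_year, round(consumption_cost_per_year))
# Consumption tracks usage month to month, so a busy season moves the bill;
# capacity stays flat, which makes planning and budgeting simpler.
```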
Real-world example
Scalability
Automation
Data Quality
Simplicity
Cost Savings
5 Key characteristics
The company is having trouble scaling properly
Each new client they onboard takes too much time
Each has its own variations and unique requirements
Onboarding costs make the time to ROI much longer
Real project: Highly Scalable Ingestion Framework
In cloud — On premise — Hybrid
CloverDX Data Integration Platform
Automation of data workloads from A to Z
One place for solving the mundane and the complex
Productivity and trust for the enterprise
Architecture example
Source SFTP Servers, Amazon S3 Bucket, and Target SFTP Servers exchange data with a central CloverDX Server; GitHub holds the codebase, configurations feed the server, and an SMTP Server sends the email report
Architecture example
Data staging → Preprocessing → Validations → Postprocessing → Data delivery, with Logging at every step and a Configuration input feeding the pipeline
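The deck doesn't show the implementation, but the configuration-driven idea can be sketched as follows: onboarding a new client means adding a configuration entry, not writing a new pipeline. The clients, fields, and step functions below are hypothetical stand-ins for the framework's stages.

```python
import json

# Hypothetical per-client configuration: a new client is a new entry, not new code.
CONFIG = json.loads("""
{
  "acme":   {"source": "sftp://acme/in", "date_format": "%Y-%m-%d", "validations": ["schema", "dedupe"]},
  "globex": {"source": "s3://globex/in", "date_format": "%d/%m/%Y", "validations": ["schema"]}
}
""")

def stage(source):      print("staging data from", source)
def preprocess(fmt):    print("preprocessing with date format", fmt)
def validate(rule):     print("running validation:", rule)
def deliver(client):    print("delivering output for", client)

def run_ingestion(client):
    cfg = CONFIG[client]
    stage(cfg["source"])                 # data staging
    preprocess(cfg["date_format"])       # preprocessing
    for rule in cfg["validations"]:      # validations
        validate(rule)
    deliver(client)                      # data delivery (postprocessing and logging omitted)

run_ingestion("acme")
```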
Upcoming webinar
What’s new in CloverDX 6 – April 6
Want to know more?
Request a demo: cloverdx.com/demo
Get a trial: cloverdx.com/trial
Watch past webinars: cloverdx.com/webinars
Q&A


Editor's Notes

  • #3 The five key principles we will be talking about are: Automation, Simplicity, Data Quality, Scalability, and Cost Savings. After that we can talk about what this all looks like in the real world, and we will show how you can take advantage of these characteristics in CloverDX. Competitive edge: gives your company the ability to stay ahead and helps drive more success at a lower cost. Change rapidly: allows you to see changes/trends so you can have a well-informed response. These are the 5 most important characteristics for your data architecture.
  • #4 First up, let's talk about automation. It is one of the most important characteristics of modern data architecture that drive innovation. Automation refers to the use of software tools and technologies to streamline data processing and management tasks, reduce manual labor, and improve efficiency and accuracy.
  • #5 One of the main benefits of automation in data architecture is that it can help organizations to speed up data processing tasks and reduce errors. For example, instead of manually preparing and cleaning data for analysis, organizations can use tools to automate these tasks. This can greatly reduce the time and effort required to prepare data for analysis, and ensure that the data is accurate and consistent and delivered to the targets in a timely manner. Otherwise the data can back up and lead to delays or incomplete or incorrectly processed data. Talk a little about automating the error handling, but we will talk about that more in the data quality section.
  • #6 Automating data pipelines – define what automation is. For us it is automating the run of a job, but each job is an orchestration so that steps follow the right order. If something fails, it doesn't just stop, but takes steps to start correcting it. Think about what it is you need. Should you have multiple methods of automation: time-based, monitoring the environment, etc.? Should have granularity in options so you can run things based on the conditions that make sense for you, even if it is only time-based and not … Should work within your environment, from the oldest tool/server like a mainframe or flat file to the newest one like distributed or non-relational databases and APIs. Whatever is actively being used in your data environment should be automated. Your automation should also focus on bad data. We live in the real world where bad data is unfortunately common. Automation can cause anxiety if you aren't sure errors/issues are handled. Should notify or even fix errors/re-try automatically. Provide it on time to the right person. Should provide the details to address the error. Regardless of whether your environment is in the cloud or on-prem, you need an automated environment to really benefit from modern data architecture.
  • #7 Next let's talk about simplicity. So what do we mean by simplicity? Well, we want to keep our data environment as simple as possible while still getting accurate, cleansed, and transformed data. This involves things like breaking logic into pieces so the flow of data is easy to follow, and reusing those blocks of logic for repetitive tasks.
  • #8 Data pipelines maintainable in the long term: no more siloed knowledge or points of failure where only one member of the team knows how to fix an issue. Development team productivity: doing something simple is much easier and faster than something complex. You can also build the processes in pieces and add additional workflows and changes gradually. Trust in the process, because it is always easier to trust things you understand.
  • #10 Split responsibilities between components
  • #11 Ideal pipeline has up to 15 components. One job should not do multiple things.
  • #12 Ideal pipeline has up to 15 components. One job should not do multiple things.
  • #13 The next characteristic of your modern data architecture is data quality
  • #14 Your data architecture should be designed for bad data! Your solution should handle common errors and at least handle/notify you of uncommon/unexpected errors
  • #15 Data profiling: This involves analyzing data to identify any anomalies or inconsistencies. Data profiling helps to identify errors, gaps, and inconsistencies in data, which can then be corrected before the data is used for analysis. Data validation: This process involves verifying that data is accurate, complete, and consistent. Data validation checks can be automated, using rules or algorithms, or can be carried out manually. Data cleansing: This process involves removing or correcting any errors or inconsistencies in the data. Data cleansing can involve a range of techniques, such as using logic to correct errors, looking up missing values, or even manually dumping them to an error log to be corrected.
  • #17 Scalability is one of the key characteristics of modern data architecture that is essential for driving innovation. In today's data-driven world, organizations are collecting, storing, and processing vast amounts of data every day. As the volume of data continues to grow, it becomes increasingly important for data architectures to be scalable, i.e., capable of handling the increasing load without sacrificing performance. You want to avoid unexpected costs or unexpected downtime, which requires lots of manual intervention.
  • #18 Scalability can be achieved through various means, such as horizontal and vertical scaling. Vertical scaling involves adding more resources to a single node, such as increasing the RAM, CPU, or storage capacity of a server. This approach is typically used in traditional on-premise data centers. Horizontal scaling refers to adding more nodes to a system, which allows it to handle more load by distributing the load across multiple nodes. This approach is often used in distributed systems, such as Hadoop clusters, Apache Spark, or cloud-based data warehouses. Of course both of these approaches are available in the cloud, but they are also both available in an on-premise environment as well. Vertical is easy to do in an on-premise environment, but with the proper tool, such as a virtual machine tool, you can also do horizontal scaling.
  • #19 Scalability can be achieved through various means, such as horizontal and vertical scaling. Vertical scaling involves adding more resources to a single node, such as increasing the RAM, CPU, or storage capacity of a server. This approach is typically used in traditional on-premise data centers.
  • #20 Many/large jobs: often as the business grows, the volume of data/jobs grows – you query a report more often, insert more transactions, etc. If the jobs are very large, you need stability; otherwise a job could fail partway through, and if you can't restart from where you left off, you might never complete it. If you have many jobs running, then parallelization is key, and your environment and tooling are going to be important in this scenario. This problem is relatively simple to fix, and the easiest way to do so is simply throwing more hardware at the problem or re-writing jobs to be more performant. Many unique data jobs: much more challenging to solve. If you design with simplicity, it can help you deal with the challenge. Proper tooling and development methodology is key; otherwise the problem can quickly cost you a lot. Buying more hardware is one thing, but hiring or getting more dev time is much harder and more costly, so your software should allow you to use dev time efficiently – make it easy to keep things simple where you can, but be extensible and customizable where needed. Not an easy challenge to solve, but as we go through the other characteristics we will talk a little more on this; first let's talk about something I already mentioned, which is automation. This is a little hard to conceptualize, so we will show you an example of what this looks like in the real world.
  • #21 All companies want to save money, and if you aren't taking steps to reduce overhead, others will have a competitive advantage. Cost predictability – normally more data means more cost. Clover – flexible deployment in cloud, on prem, or both. Clover offers "flat rate"/predictable pricing based on core count and number of developers.
  • #22 Cost in the cloud can be unpredictable – providers offer tools to track and even estimate in many cases, but estimates can be hard. How do you estimate GB/records/connections/etc. per month for a new process that is not yet implemented? Need to have good discipline and monitoring for the cost -> FinOps. Even with good practices it can happen that the cost jumps up in a certain month – e.g. before the high season (like Xmas for retailers). Need to track costs over multiple years and expect that kind of change.
  • #23 There are many pricing categories, each with its own nuances, but the main two we will talk about are capacity vs consumption. We do not price based on consumption, but rather capacity; a yearly subscription makes cost estimates trivial. Costs only go up on request – the way we… Still need to track costs for "hardware" – the compute, DB, etc. if in the cloud. However, these stay static for very long periods of time and only change when the deployment is made bigger (new server etc.). We at CloverDX are capacity based, so we charge based on the number of developers and the number of cores needed on your server.
  • #24 So we talked about 5 key characteristics of Modern Data Architecture, but what does this look like in the real world?
  • #25 Just as a reminder, here are the 5 characteristics we are discussing. We already talked about the cost savings aspect
  • #26 So in the example
  • #28 As you can see, even a simple environment can have lots of dependencies. You have some target and source systems, but you also have email endpoints for notifications and a repository to store your code. In the middle of it all we have CloverDX managing each of these within our data environment. Automate it.
  • #29 In this real world example, we helped a client create a data ingestion framework. What they wanted was to add new clients very quickly and efficiently. Even looking at this high level view, we can see some of the characteristics we talked about. 1. The job is split up so it is easier to follow along, test, and implement, both now and for any changes in the future. 2. We also have a configuration step that feeds information into the first five steps. This allows you to modify the logic in each step for the individual client's needs. This allows you to scale, because you don't have to modify the pipelines every time a new client gets onboarded; instead you simply provide a new configuration file for the new client. 3. You can see that each of the steps also involves logging. This allows us to ensure high data quality and ensures that if anything does go wrong, someone will know about it. 4. Automation is built in with how the project is designed. The pipeline itself is kicked off at regular intervals, but because of how it is designed, it orchestrates many processes together each time it runs, so you not only process the data, you also pre-process the data, validate it, complete additional post-processing steps if required, and finally deliver the data, all while logging and notifying the appropriate people if there is an issue. 5. Cost savings is somewhat rolled up into all of them. Keeping it simple allows us to build on it to improve the process without re-engineering everything, which would cost more. We are also alerted to issues and can resolve them quickly, which can save customers and ensure you meet your SLAs. The scalability allows you to onboard new clients quickly and simply. What is missing? Well, it