Automating Data Reconciliation, Data Observability, and Data Quality Check After Each Data Load, read more: https://medium.com/@nihar.rout_analytics/automatic-data-reconciliation-data-quality-and-data-observability-3eeca4650cd
Over the last several years, with the rise of cloud data warehouses and lakes such as Snowflake, Redshift, and Databricks, data load processes have become increasingly distributed and complex. Organizations are investing more capital in ingesting data from multiple internal and external data sources. As companies' dependency on data grows and business users rely on it every day for critical business decisions, ensuring high data quality is a top requirement in any data analytics platform.
As data is processed every day through various pipelines, it can break for hundreds of reasons, from code changes to business process changes. With limited team sizes and multiple competing priorities, data engineers are often unable to reconcile all data (or any data) every day. As a result, business users frequently find out about data issues before the data engineering team does, and by then it is too late to rebuild trust.
How can we proactively learn about data issues before users tell us? What if we automatically reconciled data after each load, every day, and alerted data engineers whenever there is a data issue? Is there an architecture or solution that can help?
Yes. Let's review a solution called 4DAlert that automates data reconciliation, data quality, and data observability, and see how it can identify issues automatically before bad data reaches the downstream reports and dashboards used by many users.
Scenario 1 — Reconcile data between source and target

Almost all data platforms load data from multiple source systems, and for one reason or another the data in the source and the target often doesn't match. Data teams spend manual effort every day reconciling numerous data sources.

The 4DAlert solution connects to diverse data sources and automatically reconciles data between source and target. It uses its own AI engine to detect reconciliation issues and alerts the appropriate stakeholders through multiple channels, including email, text, and Slack.
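To make the idea concrete, here is a minimal sketch of a source-versus-target reconciliation check. It is not 4DAlert's implementation; the connection URLs, table, and column names are hypothetical, and a real solution would drive these from configuration.

```python
# Minimal source-vs-target reconciliation sketch (illustrative only; not 4DAlert code).
# Connection URLs, table, and column names are hypothetical assumptions.
from sqlalchemy import create_engine, text

SOURCE_URL = "hana://user:pass@source-host:30015/SALES"    # hypothetical source
TARGET_URL = "snowflake://user:pass@account/analytics"     # hypothetical target
CHECK_SQL = "SELECT COUNT(*) AS row_count, SUM(net_amount) AS total FROM sales_orders"

def fetch_metrics(url: str) -> tuple[int, float]:
    """Run the reconciliation query and return (row_count, total)."""
    engine = create_engine(url)
    with engine.connect() as conn:
        row = conn.execute(text(CHECK_SQL)).one()
    return int(row.row_count), float(row.total or 0.0)

def reconcile(tolerance_pct: float = 0.5) -> bool:
    src_count, src_total = fetch_metrics(SOURCE_URL)
    tgt_count, tgt_total = fetch_metrics(TARGET_URL)
    count_ok = src_count == tgt_count
    total_ok = abs(src_total - tgt_total) <= abs(src_total) * tolerance_pct / 100
    if not (count_ok and total_ok):
        # In a real setup this is where an email/Slack/text alert would be sent.
        print(f"Reconciliation FAILED: counts {src_count} vs {tgt_count}, "
              f"totals {src_total:.2f} vs {tgt_total:.2f}")
        return False
    print("Reconciliation passed.")
    return True

if __name__ == "__main__":
    reconcile()
```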
Scenario 2 — Data reconciliation within the analytics platform

Sometimes connecting to source systems is not possible, for example because they are owned by different groups that do not allow external connections, or because they are too rigid to support one. In that scenario, 4DAlert's AI engine reconciles incoming new data against historical trends to detect data anomalies and reconciliation issues.
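As a rough illustration of the idea (not the vendor's AI engine), one simple way to flag an anomalous load is to compare today's metric against the mean and standard deviation of recent loads:

```python
# Simple trend-based anomaly check: flag today's value if it is more than
# k standard deviations away from the recent historical mean.
# Illustrative sketch only; a production engine would be far more sophisticated.
from statistics import mean, stdev

def is_anomalous(history: list[float], today: float, k: float = 3.0) -> bool:
    if len(history) < 5:          # not enough history to judge
        return False
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) > k * sigma

# Example: daily row counts from the last two weeks vs. today's load.
daily_row_counts = [10_120, 9_980, 10_340, 10_050, 10_210, 9_870, 10_400,
                    10_150, 10_090, 10_280, 9_950, 10_310, 10_020, 10_180]
print(is_anomalous(daily_row_counts, today=6_200))   # True: likely a partial load
print(is_anomalous(daily_row_counts, today=10_230))  # False: within normal range
```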
Scenario 3 — Data compare across systems

In most organizations, multiple systems consume the same data, so keeping the data in sync across systems is a continuous challenge. 4DAlert's flexible architecture allows it to connect to diverse source systems and check key data points across them.
Scenario 4 — Checking numbers across layers in an analytics platform

Often, the same data is stored in different layers and different objects. As multiple pipelines and loads run daily, it becomes difficult to verify that the numbers are the same across layers. The 4DAlert solution checks the numbers across layers and alerts when the data doesn't match.
A solution that connects to diverse data sources

4DAlert is a web API based AI solution that connects to most databases (such as Snowflake, Redshift, Synapse, HANA, SQL Server, Oracle, Postgres, and many more) and reconciles data between source and target on a periodic schedule.

The solution is designed to connect source and target databases even when they are built on different database technologies. For example, the source could be an SAP HANA system and the target could be Snowflake or Redshift, or the source could be a data lake in Azure or AWS S3 and the target a Snowflake or Redshift database; 4DAlert can reconcile the data without any issue.
Write your own SQL to detect anomalies and check data quality

Users can write custom SQL queries to pinpoint particular anomalies and override the tolerance limit. For example, sales varying by 10% may be acceptable, but varying by 60% is not. When users don't define a tolerance, 4DAlert uses statistical variance and anomaly detection methods to detect outliers and alert as appropriate.
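A hypothetical rule of this kind might pair a custom SQL query with a user-defined tolerance; the SQL, table, and threshold below are illustrative assumptions, not 4DAlert syntax.

```python
# Hypothetical custom rule: compare today's sales to yesterday's and apply a
# user-defined tolerance. SQL, table name, and threshold are illustrative.
from sqlalchemy import create_engine, text

RULE_SQL = """
    SELECT
        SUM(CASE WHEN sales_date = CURRENT_DATE     THEN amount END) AS today_sales,
        SUM(CASE WHEN sales_date = CURRENT_DATE - 1 THEN amount END) AS yesterday_sales
    FROM sales
"""
TOLERANCE_PCT = 10.0   # a 10% swing is acceptable; anything larger is flagged

def check_sales_variance(warehouse_url: str) -> bool:
    engine = create_engine(warehouse_url)
    with engine.connect() as conn:
        row = conn.execute(text(RULE_SQL)).one()
    today, yesterday = float(row.today_sales or 0), float(row.yesterday_sales or 0)
    if yesterday == 0:
        return True   # nothing to compare against
    variance_pct = abs(today - yesterday) / yesterday * 100
    if variance_pct > TOLERANCE_PCT:
        print(f"Sales variance {variance_pct:.1f}% exceeds tolerance of {TOLERANCE_PCT}%")
        return False
    return True
```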
Data Observability

A data platform may contain hundreds or thousands of tables. Every day, multiple pipelines run and load objects: some objects are loaded daily (sometimes several times a day), others weekly, monthly, or yearly, and still others on demand on an ad-hoc basis. It is very hard to keep track of how fresh the data is, and users constantly ask about the last load date.

4DAlert checks vital statistics of each object on a regular basis and labels each object by its freshness. This information can be broadcast to users so that they know how fresh each dataset is.
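A bare-bones freshness check of the kind described above might just look at each table's last load timestamp and label it. The table list and the load_timestamp column below are hypothetical; many platforms read this from warehouse metadata views instead.

```python
# Minimal freshness-labeling sketch. Table names and the load-timestamp column
# are hypothetical assumptions, not part of any particular product.
from datetime import datetime, timezone
from sqlalchemy import create_engine, text

TABLES = ["dwh.fact_sales", "dwh.dim_customer", "dwh.fact_inventory"]  # hypothetical

def freshness_label(hours_since_load: float) -> str:
    if hours_since_load <= 24:
        return "fresh"
    if hours_since_load <= 24 * 7:
        return "stale"
    return "outdated"

def report_freshness(warehouse_url: str) -> dict[str, str]:
    engine = create_engine(warehouse_url)
    labels = {}
    with engine.connect() as conn:
        for table in TABLES:
            # Assumes load_timestamp comes back as a timezone-aware datetime.
            last_load = conn.execute(
                text(f"SELECT MAX(load_timestamp) FROM {table}")
            ).scalar()
            age_hours = (datetime.now(timezone.utc) - last_load).total_seconds() / 3600
            labels[table] = freshness_label(age_hours)
    return labels
```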
Auto Quality Score

In an analytics platform, objects need to be loaded on a regular basis, sometimes with a predefined SLA. Whenever data is loaded, users expect it to arrive without quality or load issues. However, some objects have frequent problems with load timing or data quality. A data observability platform such as 4DAlert tracks these failure points and provides a detailed performance scorecard for each object. Scores for each object are published as a dashboard to data engineers, the enterprise data team, data scientists, and sometimes end users, for greater transparency.
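One naive way to turn load history into a per-object score is to weight load success, SLA adherence, and quality-check pass rate. The formula and weights below are purely illustrative assumptions, not 4DAlert's scoring model.

```python
# Naive quality-score sketch: weight load success, SLA adherence, and quality checks.
# The weights and the scoring formula are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class LoadStats:
    total_loads: int
    failed_loads: int
    late_loads: int               # loads that missed their SLA
    failed_quality_checks: int
    total_quality_checks: int

def quality_score(stats: LoadStats) -> float:
    """Return a 0-100 score; higher means fewer load and quality problems."""
    load_success = 1 - stats.failed_loads / max(stats.total_loads, 1)
    sla_adherence = 1 - stats.late_loads / max(stats.total_loads, 1)
    check_pass = 1 - stats.failed_quality_checks / max(stats.total_quality_checks, 1)
    return round(100 * (0.4 * load_success + 0.3 * sla_adherence + 0.3 * check_pass), 1)

print(quality_score(LoadStats(total_loads=30, failed_loads=1, late_loads=3,
                              failed_quality_checks=2, total_quality_checks=150)))  # 95.3
```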
Multiple keys and multiple metrics for any dataset

A dataset often contains more than one key metric. For example, a dataset could have revenue, quantity sold, discount, and cost of goods sold, and any of these metrics could go wrong. A solution should therefore be able to scan more than one metric simultaneously when looking for abnormalities.
Key quality metrics (e.g., row count, null count, distinct count, max value, min value)

4DAlert comes with many predefined metrics that are applied automatically to detect anomalies in the data. For example, the material number in inventory data should not be null, the distinct list of countries in a dataset cannot run into the millions, and the maximum PO amount should not exceed 10,000. These rules are predefined, come out of the box, and datasets are checked against them.
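As an illustration of such profile metrics, the sketch below computes row count, null count, distinct count, and min/max for a few columns and compares them against simple expectations; the table, columns, and limits are hypothetical examples.

```python
# Illustrative column-profile check: row count, null count, distinct count, min, max.
# Table/column names and the expectations are hypothetical examples.
from sqlalchemy import create_engine, text

PROFILE_SQL = """
    SELECT
        COUNT(*)                        AS row_count,
        COUNT(*) - COUNT(material_no)   AS null_count,
        COUNT(DISTINCT country)         AS distinct_countries,
        MIN(po_amount)                  AS min_po,
        MAX(po_amount)                  AS max_po
    FROM inventory
"""

def profile_checks(warehouse_url: str) -> list[str]:
    engine = create_engine(warehouse_url)
    with engine.connect() as conn:
        row = conn.execute(text(PROFILE_SQL)).one()
    issues = []
    if row.null_count > 0:
        issues.append(f"{row.null_count} rows have a NULL material number")
    if row.distinct_countries > 300:          # a plausible upper bound, not millions
        issues.append(f"Suspicious country cardinality: {row.distinct_countries}")
    if row.max_po is not None and row.max_po > 10_000:
        issues.append(f"PO amount {row.max_po} exceeds the 10,000 limit")
    return issues
```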
Enumerated value check
Often the data team wants to restrict certain field values to predefined value sets. For example, currencies should come from a predefined currency list, and the same applies to plants, countries, regions, and so on. 4DAlert can check incoming values against these predefined lists and flag anything outside them.
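A simple enumerated-value check against a reference list might look like the sketch below; the allowed currency set, table, and column are hypothetical examples.

```python
# Enumerated-value check sketch: flag currency codes outside an allowed list.
# The allowed set and the query are hypothetical examples.
from sqlalchemy import create_engine, text

ALLOWED_CURRENCIES = {"USD", "EUR", "GBP", "INR", "JPY"}

def unexpected_currencies(warehouse_url: str) -> set[str]:
    engine = create_engine(warehouse_url)
    with engine.connect() as conn:
        rows = conn.execute(text("SELECT DISTINCT currency FROM sales")).all()
    found = {r[0] for r in rows if r[0] is not None}
    return found - ALLOWED_CURRENCIES   # anything left violates the enumeration
```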
Seasonality: month-end, quarter-end, or year-end spikes

Data often spikes at month-end, quarter-end, year-end, or at other particular periods of the year. An AI-enabled solution such as 4DAlert takes this seasonality into account as it tries to identify anomalies.
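A very rough way to respect seasonality when flagging spikes is to compare a period against the same period in previous cycles rather than against the immediately preceding days. The function and figures below are purely illustrative; an AI-based engine would model seasonality far more formally.

```python
# Seasonality-aware comparison sketch: compare this month-end value with the
# same month-end in prior years instead of with ordinary weekdays.
from statistics import mean

def seasonal_anomaly(current: float, same_period_history: list[float],
                     tolerance_pct: float = 25.0) -> bool:
    """Flag `current` if it deviates from the historical same-period average
    by more than `tolerance_pct` percent."""
    if not same_period_history:
        return False
    baseline = mean(same_period_history)
    if baseline == 0:
        return current != 0
    return abs(current - baseline) / abs(baseline) * 100 > tolerance_pct

# December month-end order counts from prior years vs. this year's value.
print(seasonal_anomaly(58_000, [52_000, 54_500, 56_200]))  # False: normal seasonal high
print(seasonal_anomaly(20_000, [52_000, 54_500, 56_200]))  # True: unusually low
```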
Custom metrics
If the predefined metrics are not all you need, you should be able to add your own. 4DAlert allows you to write your own SQL query, check the values, and detect anomalies.
This post was written by Nihar Rout, Managing Partner and Lead Architect at 4DAlert.
Want to try schema compare features that will help you continuously deploy changes with zero errors? Request a demo with one of our experts at https://4dalert.com/
Resource: https://medium.com/@nihar.rout_analytics/automatic-data-reconciliation-data-quality-and-data-observability-3eeca4650cd