Kensu | Confidential | All rights reserved | Copyright, 2022
(Data) Observability = Best Practices
Examples in Pandas, Scikit-Learn, PySpark, DBT
My 20 years of scars with data 🩹
First 10 years
- Software Engineer in Geospatial - Map/Coverage/Catalog (Java, C++, YUI 🩹)
- (Satellite) Images and Vector Data Miner (Java/Python/R/Fortran/Scala)
Next 5 years
- Spark evangelist, teaching and consulting in Big Data/AI in Silicon Valley
- Creator of Spark-Notebook (pre-Jupyter): open-source (3100+ ✪) and community-driven (20K+)
Last 5 years
- Brainstorming on how to bring quality and monitoring DevOps best practices to data (aka DODD)
- Founded Kensu: helping data teams embrace best practices and build trust in their deliveries
Meanwhile, “serial author”
- “What is Data Governance”, O’Reilly, 2020
- “What is Data Observability”, O’Reilly, 2021
- “Fundamentals of Data Observability”, O’Reilly, 2023
Agenda
1. Observability & Data
2. DO @ The Source - Pandas, PySpark, Scikit-Learn
3. Showcase
Observability & Data
1
So… Observability - 6 Areas ∋ Data Observability
In IT, “Observability” is the capability of an IT system to generate behavioral information that allows external observers to model its internal state.
NOTE: an observer cannot interact with the system while it is functioning!
[Diagram: the six areas of observability, Infrastructure among them]
Observability - Log, Metrics, Traces (examples)
Log | Metrics | Traces
Syslog | `top` | `route`
App log |  | OpenTracing/OpenTelemetry traces
Security log |  |
Audit log |  | `git blame`
What are the questions we want to answer quickly?
Common questions we struggle with
Question during analysis | Observations needed | Channel
How is data used? | Usages (purposes, users, …) | Lineage, Log
Why do users feel it is wrong? | Expectations (perf, quality, …) | Rule
Where is the data? | Location (server, file path, …) | Log
What does it represent? | Structure metadata (fields, …) | Log
What does/did the data look like? | Content metadata (metrics, KPIs, …) | Metrics
Has anything changed, and when? | Historical metadata | Metrics
What data was used to create it? | Data lineage | Lineage
How is the data created? | Application (data) lineage | Lineage
“Data Observability” ⁉️
Data Observability is the component of an observable system that generates information on how data influences the behavior of the system, and conversely.
Mapped onto the three observability pillars:
- LOGS (Infra, Apps, User): Application & Project info; Metadata (fields, …)
- METRICS (Data Metrics, i.e. profiling; sketched below): Freshness; Completeness; Distribution
- TRACES ((Apps & Data) Lineage): Data sources; Data fields; Application (pipeline)
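To make the METRICS pillar concrete, here is a minimal profiling sketch in plain pandas (not the Kensu agent; the function name and metric choices are illustrative) computing freshness, completeness, and distribution for a file-based data source:

import os
import time
import pandas as pd

def profile(path: str) -> dict:
    """Illustrative data metrics: freshness, completeness, distribution."""
    df = pd.read_csv(path)
    return {
        # Freshness: seconds since the file was last modified
        "freshness_s": time.time() - os.path.getmtime(path),
        # Completeness: share of non-null values per column
        "completeness": (1 - df.isna().mean()).to_dict(),
        # Distribution: summary statistics of the numeric columns
        "distribution": df.describe().to_dict(),
    }

metrics = profile("campaign/orders.csv")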
Introducing DO @ the Source
Examples
2
How to make data pipelines observable?
[Pipeline diagram: orders CSV and customers CSV are read by load.py into an “orders & cust…” table; dbt builds a mart and a view over orders, cust…, and contact; train.py produces a pickle; predict.py writes a result CSV]
Wrong answer: the Kraken Anti-Pattern
[Same pipeline diagram, with one central observability process pulling from every component from the outside, annotated: Compute Resources $$, Maintenance $$$, “Found a background gate 🥳”]
The answer: the “At the Source” Pattern
[Same pipeline diagram, with a “compute” step inside each application and a central “Aggregate” step collecting their observations]
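A minimal sketch of the pattern in Python (illustrative only, not Kensu's implementation): wrap the I/O calls the application already makes, so every read also emits an observation, while the pipeline code itself stays untouched.

import logging
import pandas as pd

log = logging.getLogger("observations")
_read_csv = pd.read_csv  # keep a reference to the original

def observed_read_csv(path, *args, **kwargs):
    # Intercept the read: extract location, schema, and a basic metric,
    # then hand the DataFrame back unchanged.
    df = _read_csv(path, *args, **kwargs)
    log.info("input %s schema=%s rows=%d",
             path, dict(df.dtypes.astype(str)), len(df))
    return df

pd.read_csv = observed_read_csv  # applications keep calling pd.read_csv as before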
Pandas
Without observability:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://host.pg:5432/db')
customers = pd.read_csv("campaign/customer_list.csv")
customers.to_sql('customers', engine, index=False)
orders = pd.read_csv("campaign/orders.csv")
orders = orders.rename(columns={'id': "customer_id"})
orders.to_sql('orders', engine, index=False)

With the Kensu agent (only the first lines change):

from kensu.utils.kensu_provider import KensuProvider  # added: needed for init (import path per kensu-py)
KensuProvider().initKensu(input_stats=True)
import kensu.pandas as pd  # drop-in replacement intercepting pandas I/O
from sqlalchemy import create_engine

engine = create_engine('postgresql://host.pg:5432/db')
customers = pd.read_csv("campaign/customer_list.csv")
customers.to_sql('customers', engine, index=False)
orders = pd.read_csv("campaign/orders.csv")
orders = orders.rename(columns={'id': "customer_id"})
orders.to_sql('orders', engine, index=False)
Callouts: the reads are the Inputs, the to_sql calls the Outputs. An interceptor wraps each call; the logger extracts location, schema, and metrics (summary), and connects inputs to outputs as lineage.
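For illustration, an observation record for the orders output could look like the following (the field names and values are assumptions for this talk, not Kensu's actual wire format):

observation = {
    "location": "postgresql://host.pg:5432/db/orders",
    "schema": {"customer_id": "int64"},                     # plus the other columns
    "metrics": {"nrows": 1000, "customer_id.nullrows": 0},  # hypothetical values
    "lineage": ["campaign/orders.csv"],                     # the input it was derived from
}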
PySpark (& dbt)
Without observability:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MyApp").getOrCreate()
all_assets = (spark.read.option("inferSchema", "true")
              .option("header", "true")
              .csv("monthly_assets.csv"))
apptech = all_assets[all_assets['Symbol'] == 'APCX']
buzzfeed = all_assets[all_assets['Symbol'] == 'ENFA']
buzz_report = buzzfeed.withColumn('Intraday_Delta',
                                  buzzfeed['Adj Close'] - buzzfeed['Open'])
apptech_report = apptech.withColumn('Intraday_Delta',
                                    apptech['Adj Close'] - apptech['Open'])
kept_values = ['Open', 'Adj Close', 'Intraday_Delta']
final_report_buzzfeed = buzz_report[kept_values]
final_report_apptech = apptech_report[kept_values]
final_report_buzzfeed.write.mode('overwrite').csv("report_bf.csv")
final_report_apptech.write.mode('overwrite').csv("report_af.csv")

With the Kensu agent, only the session setup changes:

from pyspark.sql import SparkSession
from kensu.pyspark import init_kensu_spark  # agent init (import path per kensu-py)

spark = (SparkSession.builder.appName("MyApp")
         .config("spark.driver.extraClassPath", "kensu-spark-agent.jar")
         .getOrCreate())
init_kensu_spark(spark, input_stats=True)
# ...the rest of the job is identical to the version above...
Callouts: one input, two filters, two computations, a select, two outputs. The interceptor hooks into the Spark session; the logger extracts the input/output DataFrames, their location, schema, metrics, and the lineage straight from Spark's execution DAG.
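The lineage is available because Spark keeps the full plan of every DataFrame. You can see the raw material yourself with the standard explain API; an agent extracts the input relation, filters, and projected columns from exactly this plan:

# Prints the parsed, analyzed, and optimized logical plans plus the
# physical plan: the input file, filters, and selected columns all appear.
final_report_buzzfeed.explain(extended=True)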
Scikit-Learn: 🚂 (training)

Without observability:

import pickle
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

data = pd.read_csv("orders.csv")
df = data[['total_qty', 'total_basket']]
X = df.drop('total_basket', axis=1)
y = df['total_basket']
X_train, X_test, y_train, y_test = train_test_split(X, y)
model = LinearRegression().fit(X_train, y_train)
with open('model.pickle', 'wb') as f:
    pickle.dump(model, f)

With the Kensu agent, only the initialization and the imports change:

from kensu.utils.kensu_provider import KensuProvider  # added: needed for init (import path per kensu-py)
k = KensuProvider().initKensu(input_stats=True)
import kensu.pickle as pickle
import kensu.pandas as pd
from kensu.sklearn.model_selection import train_test_split
from kensu.sklearn.linear_model import LinearRegression
# ...the rest of the script is identical to the version above...
Callouts: input → select → filter → transformation → output. Interceptors around pandas, scikit-learn, and pickle let the logger extract location, schema, data metrics, and model metrics, accumulating the connections as lineage.
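The “Model Metrics” callout can be as simple as scoring the model on the held-out split right after training; a plain scikit-learn sketch (the agent's actual metric set may differ):

# R² on the test split: one candidate model metric to log next to
# the data metrics of the training inputs.
r2 = model.score(X_test, y_test)
print(f"R^2 on test split: {r2:.3f}")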
Scikit-Learn: 🔮 (prediction)

Without observability:

import pickle
import pandas as pd

data = pd.read_csv("second_campaign/orders.csv")
with open('model.pickle', 'rb') as f:
    model = pickle.load(f)
df = data[['total_qty']]
pred = model.predict(df)
df = data.copy()
df['model_pred'] = pred
df.to_csv('model_results.csv', index=False)

With the Kensu agent, again only the initialization and the imports change:

from kensu.utils.kensu_provider import KensuProvider  # added: needed for init (import path per kensu-py)
k = KensuProvider().initKensu(input_stats=True)
import kensu.pandas as pd
import kensu.pickle as pickle
# ...the rest of the script is identical to the version above...
Callouts: two inputs (the orders CSV and the pickled model) → select → computation → transformation → output. Interceptors accumulate the connections as lineage; the logger extracts location, schema, data metrics, and possibly model metrics (note the “?”).
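At prediction time there is usually no ground truth yet, hence the “?” on model metrics. What can still be observed at the source is the distribution of the predictions themselves, which makes drift between runs visible; a minimal illustration:

# Summary statistics of the scored column: comparable across runs,
# so a shift in the predictions can be flagged even without labels.
print(df["model_pred"].describe())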
Showcase
4
Hey! Where is 3?
Let it run ➡ let it rain
Example code:
https://github.com/kensuio-oss/kensu-public-examples
All you need is Docker Compose
Example using a Free Platform for the Data Community:
https://sandbox.kensuapp.com/
All you need is a Google account
Thank YOU!
Try it yourself: https://sandbox.kensuapp.com
- Connect with Google in 10 seconds 😊
- 🆓 Free to use
- 🚀 Get started with examples in Python, Spark, DBT, SQL, …
Ping me: @noootsab - LinkedIn - andy.petrella@kensu.io
Chapters 1 & 2 ☑️ | Chapter 3 👷‍♂️
https://kensu.io