1. Kensu | Confidential | All rights reserved | Copyright, 2022
(Data) Observability=Best Practices
Examples in Pandas, Scikit-Learn, PySpark, DBT
2. Kensu | Confidential | All rights reserved | Copyright, 2022
My 20 years of scars with data 🩹
2
First 10 years
- Software Engineer in Geospatial - Map/Coverage/Catalog (Java, C++, YUI 🩹)
- (Satellite) Images and Vector Data Miner
(Java/Python/R/Fortran/Scala)
Next 5 years
- Spark evangelist, teaching, and consulting in Big Data/AI in the Silicon Valley
- Creator of Spark-Notebook (pre-jupyter): open-source (3100+ ✪) and community-drive (20K+)
Last 5 years
- Brainstorming on how to bring quality and monitoring DevOps best practices to data (aka DODD)
- Founded Kensu: easing data teams to embrace best practices and create trust in deliveries
Meanwhile, “serial author”
- “What is Data Governance”, O’Reilly, 2020
- “What is Data Observability”, O’Reilly, 2021
- “Fundamentals of Data Observability”, O’Reilly, 2023
3. Kensu | Confidential | All rights reserved | Copyright, 2022
Agenda
3
1. Observability & Data
2. DO @ The Source - Pandas, PySpark, Scikit-Learn
3. Showcase
4. Kensu | Confidential | All rights reserved | Copyright, 2022
Observability & Data
1
4
5. Kensu | Confidential | All rights reserved | Copyright, 2022
So… Observability - 6 Areas ∋ Data Observability
In IT, “Observability” is the
capability of an IT system to
generate behavioral
information to allow external
observers to modelize its
internal state.
NOTE: an observer cannot interact with the system while it is functioning!
Infrastructure
5
6. Kensu | Confidential | All rights reserved | Copyright, 2022
Observability - Log, Metrics, Traces (examples)
Infrastructure
Syslog | `top` | `route`
App Log | Opentracing/telemetry
Security log | traces
Audit log | `git blame`
What are the questions
we want to answer quickly?
6
7. Kensu | Confidential | All rights reserved | Copyright, 2022
Common questions we struggle with
7
Questions during analysis Observations needed Channel
How is data used? Usages (purposes, users, …) Lineage|Log
Why do users feel it is wrong? Expectations (perf, quality, …) Rule
Where is the data? Location (server, file path, …) Log
What does it represent? Structure metadata (fields, …) Log
What does/did the data look like? Content metadata (metric, kpis, …) Metrics
Has anything changed & when? Historical metadata Metrics
What data was used to create? Data Lineage Lineage
How is the data created? Application (data) lineage Lineage
8. Kensu | Confidential | All rights reserved | Copyright, 2022
“Data Observability” ⁉️
Data Observability is the component of an observable system that
generates information on how data influences the behavior of the
system and conversely.
8
Infra, Apps, User
LOGS
Data Metrics (profiling)
METRICS
(Apps & Data) Lineage
TRACES
● Application & Project info
● Metadata (fields, …)
● Freshness
● Completeness
● Distribution
● Data sources
● Data fields
● Application (pipeline)
9. Kensu | Confidential | All rights reserved | Copyright, 2022
Introducing DO @ the Source
Examples
2
9
10. Kensu | Confidential | All rights reserved | Copyright, 2022
How to make data pipelines observable?
10
orders
CSV
customers
CSV
train.py
load.py
orders
&cust
…
predict.py
pickle
result
CSV
orders
cust…
contact
dbt:mart dbt:view
11. Kensu | Confidential | All rights reserved | Copyright, 2022
Wrong answer: the Kraken Anti-Pattern
11
orders
CSV
customers
CSV
train.py
load.py
orders
&cust
…
predict.py
pickle
result
CSV
orders
cust…
contact
dbt:mart dbt:view
Compute
Resources $$
Maintenance $$$
Found a
background
gate 🥳
12. Kensu | Confidential | All rights reserved | Copyright, 2022
The answer: the “At the Source” Pattern
12
orders
CSV
customers
CSV
train.py
load.py
orders
&cust
…
predict.py
pickle
result
CSV
orders
cust…
contact
dbt:mart dbt:view
Aggregate
compute compute
compute compute compute
15. Kensu | Confidential | All rights reserved | Copyright, 2022
k = KensuProvider().initKensu(input_stats=True)
Import kensu.pickle as pickle
from kensu.sklearn.model_selection import train_test_split
import kensu.pandas as pd
data = pd.read_csv("orders.csv")
df=data[['total_qty', 'total_basket']]
X = df.drop('total_basket',axis=1)
y = df['total_basket']
X_train, X_test, y_train, y_test = train_test_split(X, y)
from kensu.sklearn.linear_model import LinearRegression
model=LinearRegression().fit(X_train,y_train)
with open('model.pickle', 'wb') as f:
pickle.dump(model,f)
Scikit-Learn: 🚂
import pickle as pickle
from sklearn.model_selection import train_test_split
import pandas as pd
data = pd.read_csv("orders.csv")
df=data[['total_qty', 'total_basket']]
X = df.drop('total_basket',axis=1)
y = df['total_basket']
X_train, X_test, y_train, y_test = train_test_split(X, y)
from sklearn.linear_model import LinearRegression
model=LinearRegression().fit(X_train,y_train)
with open('model.pickle', 'wb') as f:
pickle.dump(model,f)
15
Filter
Transformation
Output
Input
Select
Logge
r
Interceptors
Extract:
- Location
- Schema
- Data Metrics
- Model Metrics
Accumulate
connections as
Lineage
16. Kensu | Confidential | All rights reserved | Copyright, 2022
k = KensuProvider().initKensu(input_stats=True)
import kensu.pandas as pd
import kensu.pickle as pickle
data = pd.read_csv("second_campaign/orders.csv")
with open('model.pickle', 'rb') as f:
model=pickle.load(f)
df=data[['total_qty']]
pred = model.predict(df)
df = data.copy()
df['model_pred']=pred
df.to_csv('model_results.csv', index=False)
Scikit-Learn: 🔮
import pandas as pd
import pickle as pickle
data = pd.read_csv("second_campaign/orders.csv")
with open('model.pickle', 'rb') as f:
model=pickle.load(f)
df=data[['total_qty']]
pred = model.predict(df)
df = data.copy()
df['model_pred']=pred
df.to_csv('model_results.csv', index=False)
2 Inputs
Output
Transformation
Select
Computation
Logge
r
Interceptors
Accumulate
connections as
Lineage
Extract:
- Location
- Schema
- Data Metrics
- Model Metrics?
17. Kensu | Confidential | All rights reserved | Copyright, 2022
Showcase
4
17
Hey! Where is 3?
18. Kensu | Confidential | All rights reserved | Copyright, 2022
Let it run ➡ let it rain
Example code:
https://github.com/kensuio-oss/kensu-public-examples
All you need is Docker Compose
Example using a Free Platform for the Data Community:
https://sandbox.kensuapp.com/
All you need is a Google Account/Mail
18
19. Kensu | Confidential | All rights reserved | Copyright, 2022
Thank YOU!
Try it by yourself: https://sandbox.kensuapp.com
Powered by
- Connect with Google in 10 seconds 😊.
- 🆓 of use.
- 🚀 Get started with examples in
Python, Spark, DBT, SQL, …
Ping me: @noootsab - LinkedIn - andy.petrella@kensu.io
19
Chapters 1&2 ☑️
Chapter 3 👷
♂️
https://kensu.io