SlideShare a Scribd company logo
1 of 19
Kensu | Confidential | All rights reserved | Copyright, 2022
(Data) Observability=Best Practices
Examples in Pandas, Scikit-Learn, PySpark, DBT
Kensu | Confidential | All rights reserved | Copyright, 2022
My 20 years of scars with data 🩹
2
First 10 years
- Software Engineer in Geospatial - Map/Coverage/Catalog (Java, C++, YUI 🩹)
- (Satellite) Images and Vector Data Miner
(Java/Python/R/Fortran/Scala)
Next 5 years
- Spark evangelist, teaching, and consulting in Big Data/AI in the Silicon Valley
- Creator of Spark-Notebook (pre-jupyter): open-source (3100+ ✪) and community-drive (20K+)
Last 5 years
- Brainstorming on how to bring quality and monitoring DevOps best practices to data (aka DODD)
- Founded Kensu: easing data teams to embrace best practices and create trust in deliveries
Meanwhile, “serial author”
- “What is Data Governance”, O’Reilly, 2020
- “What is Data Observability”, O’Reilly, 2021
- “Fundamentals of Data Observability”, O’Reilly, 2023
Kensu | Confidential | All rights reserved | Copyright, 2022
Agenda
3
1. Observability & Data
2. DO @ The Source - Pandas, PySpark, Scikit-Learn
3. Showcase
Kensu | Confidential | All rights reserved | Copyright, 2022
Observability & Data
1
4
Kensu | Confidential | All rights reserved | Copyright, 2022
So… Observability - 6 Areas ∋ Data Observability
In IT, “Observability” is the
capability of an IT system to
generate behavioral
information to allow external
observers to modelize its
internal state.
NOTE: an observer cannot interact with the system while it is functioning!
Infrastructure
5
Kensu | Confidential | All rights reserved | Copyright, 2022
Observability - Log, Metrics, Traces (examples)
Infrastructure
Syslog | `top` | `route`
App Log | Opentracing/telemetry
Security log | traces
Audit log | `git blame`
What are the questions
we want to answer quickly?
6
Kensu | Confidential | All rights reserved | Copyright, 2022
Common questions we struggle with
7
Questions during analysis Observations needed Channel
How is data used? Usages (purposes, users, …) Lineage|Log
Why do users feel it is wrong? Expectations (perf, quality, …) Rule
Where is the data? Location (server, file path, …) Log
What does it represent? Structure metadata (fields, …) Log
What does/did the data look like? Content metadata (metric, kpis, …) Metrics
Has anything changed & when? Historical metadata Metrics
What data was used to create? Data Lineage Lineage
How is the data created? Application (data) lineage Lineage
Kensu | Confidential | All rights reserved | Copyright, 2022
“Data Observability” ⁉️
Data Observability is the component of an observable system that
generates information on how data influences the behavior of the
system and conversely.
8
Infra, Apps, User
LOGS
Data Metrics (profiling)
METRICS
(Apps & Data) Lineage
TRACES
● Application & Project info
● Metadata (fields, …)
● Freshness
● Completeness
● Distribution
● Data sources
● Data fields
● Application (pipeline)
Kensu | Confidential | All rights reserved | Copyright, 2022
Introducing DO @ the Source
Examples
2
9
Kensu | Confidential | All rights reserved | Copyright, 2022
How to make data pipelines observable?
10
orders
CSV
customers
CSV
train.py
load.py
orders
&cust
…
predict.py
pickle
result
CSV
orders
cust…
contact
dbt:mart dbt:view
Kensu | Confidential | All rights reserved | Copyright, 2022
Wrong answer: the Kraken Anti-Pattern
11
orders
CSV
customers
CSV
train.py
load.py
orders
&cust
…
predict.py
pickle
result
CSV
orders
cust…
contact
dbt:mart dbt:view
Compute
Resources $$
Maintenance $$$
Found a
background
gate 🥳
Kensu | Confidential | All rights reserved | Copyright, 2022
The answer: the “At the Source” Pattern
12
orders
CSV
customers
CSV
train.py
load.py
orders
&cust
…
predict.py
pickle
result
CSV
orders
cust…
contact
dbt:mart dbt:view
Aggregate
compute compute
compute compute compute
Kensu | Confidential | All rights reserved | Copyright, 2022
Pandas
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine('postgresql://host.pg:5432/db’)
customers = pd.read_csv("campaign/customer_list.csv")
customers.to_sql('customers', engine, index=False)
orders = pd.read_csv("campaign/orders.csv")
orders = orders.rename(columns={'id':"customer_id"})
orders.to_sql('orders', engine, index=False)
KensuProvider().initKensu(input_stats=True)
import kensu.pandas as pd
from sqlalchemy import create_engine
engine = create_engine('postgresql://host.pg:5432/db’)
customers = pd.read_csv("campaign/customer_list.csv")
customers.to_sql('customers', engine, index=False)
orders = pd.read_csv("campaign/orders.csv")
orders = orders.rename(columns={'id':"customer_id"})
orders.to_sql('orders', engine, index=False)
13
Input
Output
Lineage
Logge
r
Interceptor
Extract:
- Location
- Schema
- Metrics (summary)
Connect
as
Lineage
Kensu | Confidential | All rights reserved | Copyright, 2022
PySpark (& dbt)
spark = SparkSession.builder.appName("MyApp").getOrCreate()
all_assets = spark.read.option("inferSchema","true")
.option("header","true")
.csv("monthly_assets.csv")
apptech = all_assets[all_assets['Symbol'] == 'APCX']
Buzzfeed = all_assets[all_assets['Symbol'] == 'ENFA']
buzz_report = Buzzfeed.withColumn('Intraday_Delta',
Buzzfeed['Adj Close'] - Buzzfeed['Open'])
apptech_report = apptech.withColumn('Intraday_Delta',
apptech['Adj Close'] - apptech['Open'])
kept_values = ['Open','Adj Close','Intraday_Delta']
final_report_buzzfeed = buzz_report[kept_values]
final_report_apptech = apptech_report[kept_values]
final_report_buzzfeed.write.mode('overwrite').csv("report_bf.csv")
final_report_apptech.write.mode('overwrite').csv("report_afcsv")
spark = SparkSession.builder.appName("MyApp")
.config("spark.driver.extraClassPath",
"kensu-spark-
agent.jar").getOrCreate()
init_kensu_spark(spark, input_stats=True)
all_assets = spark.read.option("inferSchema","true")
.option("header","true")
.csv("monthly_assets.csv")
apptech = all_assets[all_assets['Symbol'] == 'APCX']
Buzzfeed = all_assets[all_assets['Symbol'] == 'ENFA']
buzz_report = Buzzfeed.withColumn('Intraday_Delta',
Buzzfeed['Adj Close'] - Buzzfeed['Open'])
apptech_report = apptech.withColumn('Intraday_Delta',
apptech['Adj Close'] - apptech['Open'])
kept_values = ['Open','Adj Close','Intraday_Delta']
final_report_buzzfeed = buzz_report[kept_values]
final_report_apptech = apptech_report[kept_values]
final_report_buzzfeed.write.mode('overwrite').csv("report_bf.csv")
final_report_apptech.write.mode('overwrite').csv("report_afcsv")
14
Input
Filters
Computations
Select
2 Outputs
Interceptor
Logger
Extract from DAG:
- DataFrames
(I/O)
- Location
- Schema
- Metrics
- Lineage
Kensu | Confidential | All rights reserved | Copyright, 2022
k = KensuProvider().initKensu(input_stats=True)
Import kensu.pickle as pickle
from kensu.sklearn.model_selection import train_test_split
import kensu.pandas as pd
data = pd.read_csv("orders.csv")
df=data[['total_qty', 'total_basket']]
X = df.drop('total_basket',axis=1)
y = df['total_basket']
X_train, X_test, y_train, y_test = train_test_split(X, y)
from kensu.sklearn.linear_model import LinearRegression
model=LinearRegression().fit(X_train,y_train)
with open('model.pickle', 'wb') as f:
pickle.dump(model,f)
Scikit-Learn: 🚂
import pickle as pickle
from sklearn.model_selection import train_test_split
import pandas as pd
data = pd.read_csv("orders.csv")
df=data[['total_qty', 'total_basket']]
X = df.drop('total_basket',axis=1)
y = df['total_basket']
X_train, X_test, y_train, y_test = train_test_split(X, y)
from sklearn.linear_model import LinearRegression
model=LinearRegression().fit(X_train,y_train)
with open('model.pickle', 'wb') as f:
pickle.dump(model,f)
15
Filter
Transformation
Output
Input
Select
Logge
r
Interceptors
Extract:
- Location
- Schema
- Data Metrics
- Model Metrics
Accumulate
connections as
Lineage
Kensu | Confidential | All rights reserved | Copyright, 2022
k = KensuProvider().initKensu(input_stats=True)
import kensu.pandas as pd
import kensu.pickle as pickle
data = pd.read_csv("second_campaign/orders.csv")
with open('model.pickle', 'rb') as f:
model=pickle.load(f)
df=data[['total_qty']]
pred = model.predict(df)
df = data.copy()
df['model_pred']=pred
df.to_csv('model_results.csv', index=False)
Scikit-Learn: 🔮
import pandas as pd
import pickle as pickle
data = pd.read_csv("second_campaign/orders.csv")
with open('model.pickle', 'rb') as f:
model=pickle.load(f)
df=data[['total_qty']]
pred = model.predict(df)
df = data.copy()
df['model_pred']=pred
df.to_csv('model_results.csv', index=False)
2 Inputs
Output
Transformation
Select
Computation
Logge
r
Interceptors
Accumulate
connections as
Lineage
Extract:
- Location
- Schema
- Data Metrics
- Model Metrics?
Kensu | Confidential | All rights reserved | Copyright, 2022
Showcase
4
17
Hey! Where is 3?
Kensu | Confidential | All rights reserved | Copyright, 2022
Let it run ➡ let it rain
Example code:
https://github.com/kensuio-oss/kensu-public-examples
All you need is Docker Compose
Example using a Free Platform for the Data Community:
https://sandbox.kensuapp.com/
All you need is a Google Account/Mail
18
Kensu | Confidential | All rights reserved | Copyright, 2022
Thank YOU!
Try it by yourself: https://sandbox.kensuapp.com
Powered by
- Connect with Google in 10 seconds 😊.
- 🆓 of use.
- 🚀 Get started with examples in
Python, Spark, DBT, SQL, …
Ping me: @noootsab - LinkedIn - andy.petrella@kensu.io
19
Chapters 1&2 ☑️
Chapter 3 👷
♂️
https://kensu.io

More Related Content

What's hot

Data Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshData Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to Mesh
Jeffrey T. Pollock
 
Data Architecture for Data Governance
Data Architecture for Data GovernanceData Architecture for Data Governance
Data Architecture for Data Governance
DATAVERSITY
 

What's hot (20)

Data Mesh in Practice - How Europe's Leading Online Platform for Fashion Goes...
Data Mesh in Practice - How Europe's Leading Online Platform for Fashion Goes...Data Mesh in Practice - How Europe's Leading Online Platform for Fashion Goes...
Data Mesh in Practice - How Europe's Leading Online Platform for Fashion Goes...
 
Intuit's Data Mesh - Data Mesh Leaning Community meetup 5.13.2021
Intuit's Data Mesh - Data Mesh Leaning Community meetup 5.13.2021Intuit's Data Mesh - Data Mesh Leaning Community meetup 5.13.2021
Intuit's Data Mesh - Data Mesh Leaning Community meetup 5.13.2021
 
Data Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshData Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to Mesh
 
How to Move from Monitoring to Observability, On-Premises and in a Multi-Clou...
How to Move from Monitoring to Observability, On-Premises and in a Multi-Clou...How to Move from Monitoring to Observability, On-Premises and in a Multi-Clou...
How to Move from Monitoring to Observability, On-Premises and in a Multi-Clou...
 
Data Governance Best Practices
Data Governance Best PracticesData Governance Best Practices
Data Governance Best Practices
 
Data Architecture for Data Governance
Data Architecture for Data GovernanceData Architecture for Data Governance
Data Architecture for Data Governance
 
Data Mesh
Data MeshData Mesh
Data Mesh
 
Modern Data architecture Design
Modern Data architecture DesignModern Data architecture Design
Modern Data architecture Design
 
Data Quality Best Practices
Data Quality Best PracticesData Quality Best Practices
Data Quality Best Practices
 
Data Catalog for Better Data Discovery and Governance
Data Catalog for Better Data Discovery and GovernanceData Catalog for Better Data Discovery and Governance
Data Catalog for Better Data Discovery and Governance
 
Emerging Trends in Data Architecture – What’s the Next Big Thing?
Emerging Trends in Data Architecture – What’s the Next Big Thing?Emerging Trends in Data Architecture – What’s the Next Big Thing?
Emerging Trends in Data Architecture – What’s the Next Big Thing?
 
Time to Talk about Data Mesh
Time to Talk about Data MeshTime to Talk about Data Mesh
Time to Talk about Data Mesh
 
Modern Data Architecture
Modern Data Architecture Modern Data Architecture
Modern Data Architecture
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Landing Self Service Analytics using Microsoft Azure & Power BI
Landing Self Service Analytics using Microsoft Azure & Power BILanding Self Service Analytics using Microsoft Azure & Power BI
Landing Self Service Analytics using Microsoft Azure & Power BI
 
Data modelling 101
Data modelling 101Data modelling 101
Data modelling 101
 
Activate Data Governance Using the Data Catalog
Activate Data Governance Using the Data CatalogActivate Data Governance Using the Data Catalog
Activate Data Governance Using the Data Catalog
 
Data Governance Trends - A Look Backwards and Forwards
Data Governance Trends - A Look Backwards and ForwardsData Governance Trends - A Look Backwards and Forwards
Data Governance Trends - A Look Backwards and Forwards
 
Evolution from EDA to Data Mesh: Data in Motion
Evolution from EDA to Data Mesh: Data in MotionEvolution from EDA to Data Mesh: Data in Motion
Evolution from EDA to Data Mesh: Data in Motion
 

Similar to Data Observability Best Pracices

Architecting for change: LinkedIn's new data ecosystem
Architecting for change: LinkedIn's new data ecosystemArchitecting for change: LinkedIn's new data ecosystem
Architecting for change: LinkedIn's new data ecosystem
Yael Garten
 
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Yael Garten
 
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Shirshanka Das
 
Eat whatever you can with PyBabe
Eat whatever you can with PyBabeEat whatever you can with PyBabe
Eat whatever you can with PyBabe
Dataiku
 
Scalable frequent itemset mining using heterogeneous computing par apriori a...
Scalable frequent itemset mining using heterogeneous computing  par apriori a...Scalable frequent itemset mining using heterogeneous computing  par apriori a...
Scalable frequent itemset mining using heterogeneous computing par apriori a...
ijdpsjournal
 
"Lessons learned using Apache Spark for self-service data prep in SaaS world"
"Lessons learned using Apache Spark for self-service data prep in SaaS world""Lessons learned using Apache Spark for self-service data prep in SaaS world"
"Lessons learned using Apache Spark for self-service data prep in SaaS world"
Pavel Hardak
 
Lessons Learned Using Apache Spark for Self-Service Data Prep in SaaS World
Lessons Learned Using Apache Spark for Self-Service Data Prep in SaaS WorldLessons Learned Using Apache Spark for Self-Service Data Prep in SaaS World
Lessons Learned Using Apache Spark for Self-Service Data Prep in SaaS World
Databricks
 

Similar to Data Observability Best Pracices (20)

Data Science in the Elastic Stack
Data Science in the Elastic StackData Science in the Elastic Stack
Data Science in the Elastic Stack
 
Architecting for change: LinkedIn's new data ecosystem
Architecting for change: LinkedIn's new data ecosystemArchitecting for change: LinkedIn's new data ecosystem
Architecting for change: LinkedIn's new data ecosystem
 
Strata 2016 - Architecting for Change: LinkedIn's new data ecosystem
Strata 2016 - Architecting for Change: LinkedIn's new data ecosystemStrata 2016 - Architecting for Change: LinkedIn's new data ecosystem
Strata 2016 - Architecting for Change: LinkedIn's new data ecosystem
 
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
 
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
 
FMK2019 being an optimist in a pessimistic world by vincenzo menanno
FMK2019 being an optimist in a pessimistic world by vincenzo menannoFMK2019 being an optimist in a pessimistic world by vincenzo menanno
FMK2019 being an optimist in a pessimistic world by vincenzo menanno
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache Spark
 
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
 
Eat whatever you can with PyBabe
Eat whatever you can with PyBabeEat whatever you can with PyBabe
Eat whatever you can with PyBabe
 
PayPal datalake journey | teradata - edge of next | san diego | 2017 october ...
PayPal datalake journey | teradata - edge of next | san diego | 2017 october ...PayPal datalake journey | teradata - edge of next | san diego | 2017 october ...
PayPal datalake journey | teradata - edge of next | san diego | 2017 october ...
 
Polyalgebra
PolyalgebraPolyalgebra
Polyalgebra
 
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
 
Key projects Data Science and Engineering
Key projects Data Science and EngineeringKey projects Data Science and Engineering
Key projects Data Science and Engineering
 
Key projects Data Science and Engineering
Key projects Data Science and EngineeringKey projects Data Science and Engineering
Key projects Data Science and Engineering
 
GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...
GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...
GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...
 
Scalable frequent itemset mining using heterogeneous computing par apriori a...
Scalable frequent itemset mining using heterogeneous computing  par apriori a...Scalable frequent itemset mining using heterogeneous computing  par apriori a...
Scalable frequent itemset mining using heterogeneous computing par apriori a...
 
Leveraging Quandl
Leveraging Quandl Leveraging Quandl
Leveraging Quandl
 
"Lessons learned using Apache Spark for self-service data prep in SaaS world"
"Lessons learned using Apache Spark for self-service data prep in SaaS world""Lessons learned using Apache Spark for self-service data prep in SaaS world"
"Lessons learned using Apache Spark for self-service data prep in SaaS world"
 
Lessons Learned Using Apache Spark for Self-Service Data Prep in SaaS World
Lessons Learned Using Apache Spark for Self-Service Data Prep in SaaS WorldLessons Learned Using Apache Spark for Self-Service Data Prep in SaaS World
Lessons Learned Using Apache Spark for Self-Service Data Prep in SaaS World
 
Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra
Real-Time Anomaly Detection  with Spark MLlib, Akka and  CassandraReal-Time Anomaly Detection  with Spark MLlib, Akka and  Cassandra
Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra
 

More from Andy Petrella

More from Andy Petrella (20)

How to Build a Global Data Mapping
How to Build a Global Data MappingHow to Build a Global Data Mapping
How to Build a Global Data Mapping
 
Interactive notebooks
Interactive notebooksInteractive notebooks
Interactive notebooks
 
Governance compliance
Governance   complianceGovernance   compliance
Governance compliance
 
Data science governance and GDPR
Data science governance and GDPRData science governance and GDPR
Data science governance and GDPR
 
Data science governance : what and how
Data science governance : what and howData science governance : what and how
Data science governance : what and how
 
Scala: the unpredicted lingua franca for data science
Scala: the unpredicted lingua franca  for data scienceScala: the unpredicted lingua franca  for data science
Scala: the unpredicted lingua franca for data science
 
Agile data science with scala
Agile data science with scalaAgile data science with scala
Agile data science with scala
 
Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser...
Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser...Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser...
Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser...
 
What is a distributed data science pipeline. how with apache spark and friends.
What is a distributed data science pipeline. how with apache spark and friends.What is a distributed data science pipeline. how with apache spark and friends.
What is a distributed data science pipeline. how with apache spark and friends.
 
Towards a rebirth of data science (by Data Fellas)
Towards a rebirth of data science (by Data Fellas)Towards a rebirth of data science (by Data Fellas)
Towards a rebirth of data science (by Data Fellas)
 
Distributed machine learning 101 using apache spark from a browser devoxx.b...
Distributed machine learning 101 using apache spark from a browser   devoxx.b...Distributed machine learning 101 using apache spark from a browser   devoxx.b...
Distributed machine learning 101 using apache spark from a browser devoxx.b...
 
Spark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scaleSpark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scale
 
Leveraging mesos as the ultimate distributed data science platform
Leveraging mesos as the ultimate distributed data science platformLeveraging mesos as the ultimate distributed data science platform
Leveraging mesos as the ultimate distributed data science platform
 
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
 
Spark meetup london share and analyse genomic data at scale with spark, adam...
Spark meetup london  share and analyse genomic data at scale with spark, adam...Spark meetup london  share and analyse genomic data at scale with spark, adam...
Spark meetup london share and analyse genomic data at scale with spark, adam...
 
Distributed machine learning 101 using apache spark from the browser
Distributed machine learning 101 using apache spark from the browserDistributed machine learning 101 using apache spark from the browser
Distributed machine learning 101 using apache spark from the browser
 
Liège créative: Open Science
Liège créative: Open ScienceLiège créative: Open Science
Liège créative: Open Science
 
BioBankCloud: Machine Learning on Genomics + GA4GH @ Med at Scale
BioBankCloud: Machine Learning on Genomics + GA4GH  @ Med at ScaleBioBankCloud: Machine Learning on Genomics + GA4GH  @ Med at Scale
BioBankCloud: Machine Learning on Genomics + GA4GH @ Med at Scale
 
What is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkWhat is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache Spark
 
Spark devoxx2014
Spark devoxx2014Spark devoxx2014
Spark devoxx2014
 

Recently uploaded

Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
gajnagarg
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
gajnagarg
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
gajnagarg
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Klinik kandungan
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
nirzagarg
 
Call Girls in G.T.B. Nagar (delhi) call me [🔝9953056974🔝] escort service 24X7
Call Girls in G.T.B. Nagar  (delhi) call me [🔝9953056974🔝] escort service 24X7Call Girls in G.T.B. Nagar  (delhi) call me [🔝9953056974🔝] escort service 24X7
Call Girls in G.T.B. Nagar (delhi) call me [🔝9953056974🔝] escort service 24X7
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
nirzagarg
 
Abortion pills in Doha {{ QATAR }} +966572737505) Get Cytotec
Abortion pills in Doha {{ QATAR }} +966572737505) Get CytotecAbortion pills in Doha {{ QATAR }} +966572737505) Get Cytotec
Abortion pills in Doha {{ QATAR }} +966572737505) Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 

Recently uploaded (20)

Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
 
Identify Customer Segments to Create Customer Offers for Each Segment - Appli...
Identify Customer Segments to Create Customer Offers for Each Segment - Appli...Identify Customer Segments to Create Customer Offers for Each Segment - Appli...
Identify Customer Segments to Create Customer Offers for Each Segment - Appli...
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
 
Call Girls in G.T.B. Nagar (delhi) call me [🔝9953056974🔝] escort service 24X7
Call Girls in G.T.B. Nagar  (delhi) call me [🔝9953056974🔝] escort service 24X7Call Girls in G.T.B. Nagar  (delhi) call me [🔝9953056974🔝] escort service 24X7
Call Girls in G.T.B. Nagar (delhi) call me [🔝9953056974🔝] escort service 24X7
 
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
 
Oral Sex Call Girls Kashmiri Gate Delhi Just Call 👉👉 📞 8448380779 Top Class C...
Oral Sex Call Girls Kashmiri Gate Delhi Just Call 👉👉 📞 8448380779 Top Class C...Oral Sex Call Girls Kashmiri Gate Delhi Just Call 👉👉 📞 8448380779 Top Class C...
Oral Sex Call Girls Kashmiri Gate Delhi Just Call 👉👉 📞 8448380779 Top Class C...
 
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
 
Abortion pills in Doha {{ QATAR }} +966572737505) Get Cytotec
Abortion pills in Doha {{ QATAR }} +966572737505) Get CytotecAbortion pills in Doha {{ QATAR }} +966572737505) Get Cytotec
Abortion pills in Doha {{ QATAR }} +966572737505) Get Cytotec
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 

Data Observability Best Pracices

  • 1. Kensu | Confidential | All rights reserved | Copyright, 2022 (Data) Observability=Best Practices Examples in Pandas, Scikit-Learn, PySpark, DBT
  • 2. Kensu | Confidential | All rights reserved | Copyright, 2022 My 20 years of scars with data 🩹 2 First 10 years - Software Engineer in Geospatial - Map/Coverage/Catalog (Java, C++, YUI 🩹) - (Satellite) Images and Vector Data Miner (Java/Python/R/Fortran/Scala) Next 5 years - Spark evangelist, teaching, and consulting in Big Data/AI in the Silicon Valley - Creator of Spark-Notebook (pre-jupyter): open-source (3100+ ✪) and community-drive (20K+) Last 5 years - Brainstorming on how to bring quality and monitoring DevOps best practices to data (aka DODD) - Founded Kensu: easing data teams to embrace best practices and create trust in deliveries Meanwhile, “serial author” - “What is Data Governance”, O’Reilly, 2020 - “What is Data Observability”, O’Reilly, 2021 - “Fundamentals of Data Observability”, O’Reilly, 2023
  • 3. Kensu | Confidential | All rights reserved | Copyright, 2022 Agenda 3 1. Observability & Data 2. DO @ The Source - Pandas, PySpark, Scikit-Learn 3. Showcase
  • 4. Kensu | Confidential | All rights reserved | Copyright, 2022 Observability & Data 1 4
  • 5. Kensu | Confidential | All rights reserved | Copyright, 2022 So… Observability - 6 Areas ∋ Data Observability In IT, “Observability” is the capability of an IT system to generate behavioral information to allow external observers to modelize its internal state. NOTE: an observer cannot interact with the system while it is functioning! Infrastructure 5
  • 6. Kensu | Confidential | All rights reserved | Copyright, 2022 Observability - Log, Metrics, Traces (examples) Infrastructure Syslog | `top` | `route` App Log | Opentracing/telemetry Security log | traces Audit log | `git blame` What are the questions we want to answer quickly? 6
  • 7. Kensu | Confidential | All rights reserved | Copyright, 2022 Common questions we struggle with 7 Questions during analysis Observations needed Channel How is data used? Usages (purposes, users, …) Lineage|Log Why do users feel it is wrong? Expectations (perf, quality, …) Rule Where is the data? Location (server, file path, …) Log What does it represent? Structure metadata (fields, …) Log What does/did the data look like? Content metadata (metric, kpis, …) Metrics Has anything changed & when? Historical metadata Metrics What data was used to create? Data Lineage Lineage How is the data created? Application (data) lineage Lineage
  • 8. Kensu | Confidential | All rights reserved | Copyright, 2022 “Data Observability” ⁉️ Data Observability is the component of an observable system that generates information on how data influences the behavior of the system and conversely. 8 Infra, Apps, User LOGS Data Metrics (profiling) METRICS (Apps & Data) Lineage TRACES ● Application & Project info ● Metadata (fields, …) ● Freshness ● Completeness ● Distribution ● Data sources ● Data fields ● Application (pipeline)
  • 9. Kensu | Confidential | All rights reserved | Copyright, 2022 Introducing DO @ the Source Examples 2 9
  • 10. Kensu | Confidential | All rights reserved | Copyright, 2022 How to make data pipelines observable? 10 orders CSV customers CSV train.py load.py orders &cust … predict.py pickle result CSV orders cust… contact dbt:mart dbt:view
  • 11. Kensu | Confidential | All rights reserved | Copyright, 2022 Wrong answer: the Kraken Anti-Pattern 11 orders CSV customers CSV train.py load.py orders &cust … predict.py pickle result CSV orders cust… contact dbt:mart dbt:view Compute Resources $$ Maintenance $$$ Found a background gate 🥳
  • 12. Kensu | Confidential | All rights reserved | Copyright, 2022 The answer: the “At the Source” Pattern 12 orders CSV customers CSV train.py load.py orders &cust … predict.py pickle result CSV orders cust… contact dbt:mart dbt:view Aggregate compute compute compute compute compute
  • 13. Kensu | Confidential | All rights reserved | Copyright, 2022 Pandas import pandas as pd from sqlalchemy import create_engine engine = create_engine('postgresql://host.pg:5432/db’) customers = pd.read_csv("campaign/customer_list.csv") customers.to_sql('customers', engine, index=False) orders = pd.read_csv("campaign/orders.csv") orders = orders.rename(columns={'id':"customer_id"}) orders.to_sql('orders', engine, index=False) KensuProvider().initKensu(input_stats=True) import kensu.pandas as pd from sqlalchemy import create_engine engine = create_engine('postgresql://host.pg:5432/db’) customers = pd.read_csv("campaign/customer_list.csv") customers.to_sql('customers', engine, index=False) orders = pd.read_csv("campaign/orders.csv") orders = orders.rename(columns={'id':"customer_id"}) orders.to_sql('orders', engine, index=False) 13 Input Output Lineage Logge r Interceptor Extract: - Location - Schema - Metrics (summary) Connect as Lineage
  • 14. Kensu | Confidential | All rights reserved | Copyright, 2022 PySpark (& dbt) spark = SparkSession.builder.appName("MyApp").getOrCreate() all_assets = spark.read.option("inferSchema","true") .option("header","true") .csv("monthly_assets.csv") apptech = all_assets[all_assets['Symbol'] == 'APCX'] Buzzfeed = all_assets[all_assets['Symbol'] == 'ENFA'] buzz_report = Buzzfeed.withColumn('Intraday_Delta', Buzzfeed['Adj Close'] - Buzzfeed['Open']) apptech_report = apptech.withColumn('Intraday_Delta', apptech['Adj Close'] - apptech['Open']) kept_values = ['Open','Adj Close','Intraday_Delta'] final_report_buzzfeed = buzz_report[kept_values] final_report_apptech = apptech_report[kept_values] final_report_buzzfeed.write.mode('overwrite').csv("report_bf.csv") final_report_apptech.write.mode('overwrite').csv("report_afcsv") spark = SparkSession.builder.appName("MyApp") .config("spark.driver.extraClassPath", "kensu-spark- agent.jar").getOrCreate() init_kensu_spark(spark, input_stats=True) all_assets = spark.read.option("inferSchema","true") .option("header","true") .csv("monthly_assets.csv") apptech = all_assets[all_assets['Symbol'] == 'APCX'] Buzzfeed = all_assets[all_assets['Symbol'] == 'ENFA'] buzz_report = Buzzfeed.withColumn('Intraday_Delta', Buzzfeed['Adj Close'] - Buzzfeed['Open']) apptech_report = apptech.withColumn('Intraday_Delta', apptech['Adj Close'] - apptech['Open']) kept_values = ['Open','Adj Close','Intraday_Delta'] final_report_buzzfeed = buzz_report[kept_values] final_report_apptech = apptech_report[kept_values] final_report_buzzfeed.write.mode('overwrite').csv("report_bf.csv") final_report_apptech.write.mode('overwrite').csv("report_afcsv") 14 Input Filters Computations Select 2 Outputs Interceptor Logger Extract from DAG: - DataFrames (I/O) - Location - Schema - Metrics - Lineage
  • 15. Kensu | Confidential | All rights reserved | Copyright, 2022 k = KensuProvider().initKensu(input_stats=True) Import kensu.pickle as pickle from kensu.sklearn.model_selection import train_test_split import kensu.pandas as pd data = pd.read_csv("orders.csv") df=data[['total_qty', 'total_basket']] X = df.drop('total_basket',axis=1) y = df['total_basket'] X_train, X_test, y_train, y_test = train_test_split(X, y) from kensu.sklearn.linear_model import LinearRegression model=LinearRegression().fit(X_train,y_train) with open('model.pickle', 'wb') as f: pickle.dump(model,f) Scikit-Learn: 🚂 import pickle as pickle from sklearn.model_selection import train_test_split import pandas as pd data = pd.read_csv("orders.csv") df=data[['total_qty', 'total_basket']] X = df.drop('total_basket',axis=1) y = df['total_basket'] X_train, X_test, y_train, y_test = train_test_split(X, y) from sklearn.linear_model import LinearRegression model=LinearRegression().fit(X_train,y_train) with open('model.pickle', 'wb') as f: pickle.dump(model,f) 15 Filter Transformation Output Input Select Logge r Interceptors Extract: - Location - Schema - Data Metrics - Model Metrics Accumulate connections as Lineage
  • 16. Kensu | Confidential | All rights reserved | Copyright, 2022 k = KensuProvider().initKensu(input_stats=True) import kensu.pandas as pd import kensu.pickle as pickle data = pd.read_csv("second_campaign/orders.csv") with open('model.pickle', 'rb') as f: model=pickle.load(f) df=data[['total_qty']] pred = model.predict(df) df = data.copy() df['model_pred']=pred df.to_csv('model_results.csv', index=False) Scikit-Learn: 🔮 import pandas as pd import pickle as pickle data = pd.read_csv("second_campaign/orders.csv") with open('model.pickle', 'rb') as f: model=pickle.load(f) df=data[['total_qty']] pred = model.predict(df) df = data.copy() df['model_pred']=pred df.to_csv('model_results.csv', index=False) 2 Inputs Output Transformation Select Computation Logge r Interceptors Accumulate connections as Lineage Extract: - Location - Schema - Data Metrics - Model Metrics?
  • 17. Kensu | Confidential | All rights reserved | Copyright, 2022 Showcase 4 17 Hey! Where is 3?
  • 18. Kensu | Confidential | All rights reserved | Copyright, 2022 Let it run ➡ let it rain Example code: https://github.com/kensuio-oss/kensu-public-examples All you need is Docker Compose Example using a Free Platform for the Data Community: https://sandbox.kensuapp.com/ All you need is a Google Account/Mail 18
  • 19. Kensu | Confidential | All rights reserved | Copyright, 2022 Thank YOU! Try it by yourself: https://sandbox.kensuapp.com Powered by - Connect with Google in 10 seconds 😊. - 🆓 of use. - 🚀 Get started with examples in Python, Spark, DBT, SQL, … Ping me: @noootsab - LinkedIn - andy.petrella@kensu.io 19 Chapters 1&2 ☑️ Chapter 3 👷 ♂️ https://kensu.io