data secrets
from a data scientist
ML engineer
platform engineer
tl;dr
When we deploy notebooks to production, data scientists become SREs
Dr. Rebecca Bilbro
Co-Founder/CTO @ Rotational Labs
Creator of Yellowbrick (scikit-yb)
Applied Text Analysis with Python (O’Reilly)
Apache Hudi: The Definitive Guide (O’Reilly)
2018 → 2025
Why are my queries taking so long?
Why ORMs are slow
sqlalchemy (~126 iterations/sec)
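A minimal illustration of where that gap comes from, assuming SQLAlchemy 1.4+ and an in-memory SQLite database; the table and row count are made up, and your numbers will differ from the ~126 iterations/sec above:

import time
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class Record(Base):
    __tablename__ = "records"
    id = Column(Integer, primary_key=True)
    name = Column(String)

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

# ORM path: one tracked object per row, flushed through the unit of work.
start = time.perf_counter()
with Session(engine) as session:
    for i in range(10_000):
        session.add(Record(name=f"row-{i}"))
    session.commit()
print("ORM inserts:", time.perf_counter() - start, "sec")

# Core path: a single bulk INSERT, skipping per-object bookkeeping.
start = time.perf_counter()
with engine.begin() as conn:
    conn.execute(Record.__table__.insert(), [{"name": f"row-{i}"} for i in range(10_000)])
print("Core bulk insert:", time.perf_counter() - start, "sec")

The gap comes from the ORM's per-object change tracking and flush, not from the database itself.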
Why is entity resolution so hard?
from itertools import combinations

def compare(self, batch):
    # create pairs from batch and perform pairwise comparison
    for first, second in combinations(batch.values(), 2):
        if compare_pair(first, second):
            # Check if the reference has a canonical and *UNIQUE* id
            first_canonical = self.canonical_ids.get(first["guid"])
            second_canonical = self.canonical_ids.get(second["guid"])
            smallest_id = int(min(first_canonical, second_canonical))
            # Update the guids for both references
            self.canonical_ids[first["guid"]] = smallest_id
            self.canonical_ids[second["guid"]] = smallest_id
            # Update other references that resolve to the same guid
            for reference, guid in self.canonical_ids.items():
                if guid in [first["guid"], second["guid"]]:
                    self.canonical_ids[reference] = smallest_id
Turned out it was actually a data warehouse, not a database.
No UUIDs!
Follow the white rabbit
DATA WAREHOUSE
(Redshift, Snowflake, BigQuery, etc.)
Columnar data
“Relational experience”
Optimized for analytics & BI
Fast queries on big data
Some support for transactions
But…
It’s not actually a database
RELATIONAL DATABASE
(RDS, PostgreSQL, MySQL, etc.)
Structured data
Transactions
Updates and deletes
Data integrity
Low cost
But…
Queries can get slow, esp. for wide tables
Diagram (from Applied Text Analysis with Python): HTML → Paras → Sents → Tokens → Tags; Raw Corpus → Preprocessed Corpus → LLM.
Deeper down the rabbit hole
A distributed system is a network of computers that appear to the end-user as a single computer.
The computers are peers. Working together, they can make the system more tolerant to failures like earthquakes and outages.
They also make it more available, that is, responsive even to many simultaneous client requests, or to very geographically distributed requests.
Diagram: five peer nodes (alpha, bravo, charlie, delta, echo).
They have to synchronize independent, potentially concurrent requests to update data…
set key K to value V
set key K to value Q
They may become inconsistent: one peer ends up with K : V while another ends up with K : Q.
Diagram: peers alpha and bravo receiving the two conflicting writes.
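A toy sketch of that divergence (which peer applies which write first is made up):

# Two peers receive the same two writes, but in different orders.
alpha, bravo = {}, {}

alpha["K"] = "V"; alpha["K"] = "Q"   # alpha applies V, then Q
bravo["K"] = "Q"; bravo["K"] = "V"   # bravo applies Q, then V

print(alpha["K"], bravo["K"])  # Q V -> the replicas have diverged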
Too many models never make it to production
Keep going
There is no batch
When you think of data, how many of you picture something tabular?
Of course you do!
But it isn’t, actually.
How platform engineers think about data
● Is it in flight or at rest (for encryption, serialization, and compression reasons)?
● Is it raw (SQL, Avro) or has it been processed (CSV, ORC, Parquet, JSON)? (See the sketch after this list.)
● Does it make more sense to scale horizontally or vertically (adding more peers vs. more partitions)?
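A minimal sketch of the at-rest vs. in-flight and raw vs. processed distinctions, assuming pyarrow is installed; the field names and file path are made up, and Avro is left out for brevity:

import gzip
import json
import pyarrow as pa
import pyarrow.parquet as pq

records = [{"user": "alpha", "latency_ms": 42}, {"user": "bravo", "latency_ms": 17}]

# At rest, processed: columnar Parquet with compression, good for analytics.
pq.write_table(pa.Table.from_pylist(records), "events.parquet", compression="snappy")

# In flight: row-oriented JSON, gzipped before it goes over the wire.
payload = gzip.compress(json.dumps(records).encode("utf-8"))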
I know Kung Fu!
My first deployed model
- A few dozen classifiers (trained, serialized, stored to S3).
- A Kafka queue for data ETL. The subscriber gets a new document, calls .predict(), adds the classifier predict_proba arrays as new metadata, and stores the result to Elastic.
- A web application that renders the data for users, where metadata enable crosstab filtering.
… learned that Clojure doesn’t speak Numpy :)
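A minimal sketch of that pipeline shape, assuming kafka-python, a joblib-serialized scikit-learn pipeline (vectorizer + classifier), and the 8.x Elasticsearch client; the topic, index, and model path are hypothetical:

import json
import joblib
from kafka import KafkaConsumer
from elasticsearch import Elasticsearch

model = joblib.load("classifiers/topic_clf.joblib")  # pulled down from S3 in practice
es = Elasticsearch("http://localhost:9200")
consumer = KafkaConsumer("new-documents", bootstrap_servers="localhost:9092",
                         value_deserializer=json.loads)

for message in consumer:
    doc = message.value
    # Attach class probabilities as metadata the web app can filter on.
    doc["topic_probs"] = model.predict_proba([doc["text"]])[0].tolist()
    es.index(index="documents", document=doc)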
Failure is the norm
import numpy as np
import tensorflow as tf

# Example timestamped checkpoint path (defined here so the snippet runs standalone)
CHECKPOINT_STORE = "checkpoints/110720240005.ckpt"

# Generate some sample data
x_train = np.random.random((100, 10))
y_train = np.random.random((100,))

# Create a simple model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(10,)),
    tf.keras.layers.Dense(1)
])

# Compile the model
model.compile(optimizer='adam', loss='mse')

# Create a checkpoint callback
checkpointer = tf.keras.callbacks.ModelCheckpoint(
    filepath=CHECKPOINT_STORE,
    save_weights_only=True,
    verbose=1
)

# Train the model with checkpointing
model.fit(x_train, y_train, epochs=10, callbacks=[checkpointer])

# Retrieve that checkpoint
model.load_weights("checkpoints/110720240005.ckpt")
Handle exceptions and failures gracefully.
Periodically save computation state so we can recover.
Replicate data and services across multiple nodes.
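For the first of those, a minimal retry-with-backoff helper; the exception types, attempt count, and delays are illustrative:

import random
import time

def with_retries(fn, attempts=5, base_delay=0.5):
    """Call fn(), retrying transient failures with exponential backoff and jitter."""
    for attempt in range(attempts):
        try:
            return fn()
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure
            time.sleep(base_delay * (2 ** attempt) + random.random() * 0.1)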
“Use Case Engineer”
5 data projects you could do to help the platform team
The Observability Plane
Build the platform team a custom data dashboard using Prometheus and Grafana.
Aggregate alerts and logs from different systems into a single data observability plane.
Put the code into a shared git repository that others can add onto.
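A hedged starting point, assuming the prometheus_client library; the metric name, port, and value are placeholders for whatever the platform team actually needs to watch:

import random
import time
from prometheus_client import Gauge, start_http_server

# Hypothetical metric: rows waiting in the ingestion queue.
QUEUE_DEPTH = Gauge("ingest_queue_depth", "Rows waiting to be loaded")

start_http_server(8000)  # Prometheus scrapes http://localhost:8000/metrics
while True:
    QUEUE_DEPTH.set(random.randint(0, 500))  # stand-in for a real measurement
    time.sleep(15)

Point a Grafana panel at that metric and the dashboard has its first series.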
The Downtime Predictor
Build a model using historical incidents, logs, and performance metrics to forecast system downtimes or failures before they occur.
Run regular summary reports to send to the team to help them proactively maintain infrastructure and prevent service disruptions.
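A minimal sketch with scikit-learn on synthetic data; the four features and the incident label are stand-ins for whatever your incident history actually contains:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical features: CPU %, memory %, error rate, deploys in the last 24h.
rng = np.random.default_rng(42)
X = rng.random((1_000, 4))
# Synthetic label: "incident within 24h", loosely driven by error rate and CPU.
y = (X[:, 2] + 0.5 * X[:, 0] + rng.normal(0, 0.1, 1_000) > 1.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)
print("holdout accuracy:", clf.score(X_test, y_test))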
The Resource Allocation Copilot
Train a model to predict demand on resources based on historical usage patterns, seasonal trends, and other factors like time of day, workload type, or application usage.
Attempt to use reinforcement learning to optimize allocation (e.g. by balancing performance requirements with cost).
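A sketch of the demand-forecasting half only (the reinforcement-learning piece is beyond a slide-sized example), using scikit-learn on synthetic usage data:

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical features: hour of day, day of week, workload type (integer-encoded).
rng = np.random.default_rng(0)
hours = rng.integers(0, 24, 2_000)
days = rng.integers(0, 7, 2_000)
workload = rng.integers(0, 3, 2_000)
X = np.column_stack([hours, days, workload])
# Synthetic demand with a daily cycle plus a workload-type effect.
demand = 50 + 30 * np.sin(hours / 24 * 2 * np.pi) + 10 * workload + rng.normal(0, 5, 2_000)

model = GradientBoostingRegressor().fit(X, demand)
print(model.predict([[14, 2, 1]]))  # expected demand at 2pm on a Wednesday for workload type 1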
Custom Anomaly Detector
Finetune an autoencoder on your organization’s application logs.
See if you can use it to detect anomalies that line up with reported failures, security breaches, or performance issues.
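A minimal sketch with tf.keras, trained from scratch on vectorized "normal" logs rather than finetuned; the feature width and error threshold are arbitrary:

import numpy as np
import tensorflow as tf

# Hypothetical input: each log line already vectorized into 32 features.
normal_logs = np.random.random((5_000, 32)).astype("float32")

autoencoder = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(32,)),
    tf.keras.layers.Dense(8, activation="relu"),   # bottleneck
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(32),
])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(normal_logs, normal_logs, epochs=5, verbose=0)

# Log vectors the model reconstructs poorly are candidate anomalies.
def is_anomaly(batch, threshold=0.15):
    errors = np.mean((batch - autoencoder.predict(batch, verbose=0)) ** 2, axis=1)
    return errors > threshold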
Test Data LLM
Finetune an LLM to produce synthetic test data that the platform team can use for benchmarking or integration testing.
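A sketch of the generation half only, assuming the Hugging Face transformers library; "gpt2" stands in for whatever model you actually finetune, and the prompt is made up:

from transformers import pipeline

# A finetuned checkpoint path would replace "gpt2" here.
generator = pipeline("text-generation", model="gpt2")
prompt = "Generate a JSON record for a fake customer order:\n"
result = generator(prompt, max_new_tokens=80, num_return_sequences=1)
print(result[0]["generated_text"])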
Remember… déjà vu is a glitch in the matrix
slow queries
re-wrangling
OOMs
connection timeouts
data drift
model graveyards
irreproducibility
thank you for being curious
