data secrets
from a data scientist
ML engineer
platform engineer
tl;dr
When we deploy notebooks to production, data scientists become SREs
Dr. Rebecca Bilbro
Co-Founder/CTO @ Rotational Labs
Creator of Yellowbrick (scikit-yb)
Applied Text Analysis with Python (O’Reilly)
Apache Hudi: The Definitive Guide (O’Reilly)
2018 → 2025
Why are my queries taking so long?
Why ORMs are slow
sqlalchemy (~126 iterations/sec)
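A minimal illustration of where that gap comes from, assuming SQLAlchemy 1.4+ and an in-memory SQLite database; the table and row count are made up, and your numbers will differ from the ~126 iterations/sec above:

import time
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class Record(Base):
    __tablename__ = "records"
    id = Column(Integer, primary_key=True)
    name = Column(String)

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

# ORM path: one tracked object per row, flushed through the unit of work.
start = time.perf_counter()
with Session(engine) as session:
    for i in range(10_000):
        session.add(Record(name=f"row-{i}"))
    session.commit()
print("ORM inserts:", time.perf_counter() - start, "sec")

# Core path: a single bulk INSERT, skipping per-object bookkeeping.
start = time.perf_counter()
with engine.begin() as conn:
    conn.execute(Record.__table__.insert(), [{"name": f"row-{i}"} for i in range(10_000)])
print("Core bulk insert:", time.perf_counter() - start, "sec")

The gap comes from the ORM's per-object change tracking and flush, not from the database itself.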
Why is entity resolution so hard?
from itertools import combinations

def compare(self, batch):
    # create pairs from batch and perform pairwise comparison
    for first, second in combinations(batch.values(), 2):
        if compare_pair(first, second):
            # Check if the reference has a canonical and *UNIQUE* id
            first_canonical = self.canonical_ids.get(first["guid"])
            second_canonical = self.canonical_ids.get(second["guid"])
            smallest_id = int(min(first_canonical, second_canonical))
            # Update the guids for both references
            self.canonical_ids[first["guid"]] = smallest_id
            self.canonical_ids[second["guid"]] = smallest_id
            # Update other references that resolve to the same guid
            for reference, guid in self.canonical_ids.items():
                if guid in [first["guid"], second["guid"]]:
                    self.canonical_ids[reference] = smallest_id
Turned out it was actually a data warehouse, not a database.
No UUIDs!
Follow the white rabbit
DATA WAREHOUSE
(Redshift, Snowflake, BigQuery, etc.)
Columnar data
“Relational experience”
Optimized for analytics & BI
Fast queries on big data
Some support for transactions
But…
It’s not actually a database
RELATIONAL DATABASE
(RDS, PostgreSQL, MySQL, etc.)
Structured data
Transactions
Updates and deletes
Data integrity
Low cost
But…
Queries can get slow, esp. for wide tables
Diagram (from Applied Text Analysis with Python): HTML → Paras → Sents → Tokens → Tags; Raw Corpus → Preprocessed Corpus → LLM.
Deeper down the rabbit hole
A distributed system is a network of computers that appear to the end-user as a single computer.
The computers are peers. Working together, they can make the system more tolerant to failures like earthquakes and outages.
They also make it more available, that is, responsive even to many simultaneous client requests, or to very geographically distributed requests.
Diagram: five peer nodes (alpha, bravo, charlie, delta, echo).
They have to synchronize independent, potentially concurrent requests to update data…
set key K to value V
set key K to value Q
They may become inconsistent: one peer ends up with K : V while another ends up with K : Q.
Diagram: peers alpha and bravo receiving the two conflicting writes.
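A toy sketch of that divergence (which peer applies which write first is made up):

# Two peers receive the same two writes, but in different orders.
alpha, bravo = {}, {}

alpha["K"] = "V"; alpha["K"] = "Q"   # alpha applies V, then Q
bravo["K"] = "Q"; bravo["K"] = "V"   # bravo applies Q, then V

print(alpha["K"], bravo["K"])  # Q V -> the replicas have diverged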
Too many models never make it to production
Keep going
There is no batch
When you think of data, how many of you picture something tabular?
Of course you do!
But it isn’t, actually.
How platform engineers think about data
● Is it in flight or at rest (for encryption, serialization, and compression reasons)?
● Is it raw (SQL, Avro) or has it been processed (CSV, ORC, Parquet, JSON)? (See the sketch after this list.)
● Does it make more sense to scale horizontally or vertically (adding more peers vs. more partitions)?
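A minimal sketch of the at-rest vs. in-flight and raw vs. processed distinctions, assuming pyarrow is installed; the field names and file path are made up, and Avro is left out for brevity:

import gzip
import json
import pyarrow as pa
import pyarrow.parquet as pq

records = [{"user": "alpha", "latency_ms": 42}, {"user": "bravo", "latency_ms": 17}]

# At rest, processed: columnar Parquet with compression, good for analytics.
pq.write_table(pa.Table.from_pylist(records), "events.parquet", compression="snappy")

# In flight: row-oriented JSON, gzipped before it goes over the wire.
payload = gzip.compress(json.dumps(records).encode("utf-8"))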
I know Kung Fu!
My first deployed model
- A few dozen classifiers (trained, serialized, stored to S3).
- A Kafka queue for data ETL. The subscriber gets a new document, calls .predict(), adds the classifier predict_proba arrays as new metadata, and stores the result to Elastic.
- A web application that renders the data for users, where metadata enable crosstab filtering.
… learned that Clojure doesn’t speak Numpy :)
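A minimal sketch of that pipeline shape, assuming kafka-python, a joblib-serialized scikit-learn pipeline (vectorizer + classifier), and the 8.x Elasticsearch client; the topic, index, and model path are hypothetical:

import json
import joblib
from kafka import KafkaConsumer
from elasticsearch import Elasticsearch

model = joblib.load("classifiers/topic_clf.joblib")  # pulled down from S3 in practice
es = Elasticsearch("http://localhost:9200")
consumer = KafkaConsumer("new-documents", bootstrap_servers="localhost:9092",
                         value_deserializer=json.loads)

for message in consumer:
    doc = message.value
    # Attach class probabilities as metadata the web app can filter on.
    doc["topic_probs"] = model.predict_proba([doc["text"]])[0].tolist()
    es.index(index="documents", document=doc)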
Failure is the norm
import numpy as np
import tensorflow as tf

# Example timestamped checkpoint path (defined here so the snippet runs standalone)
CHECKPOINT_STORE = "checkpoints/110720240005.ckpt"

# Generate some sample data
x_train = np.random.random((100, 10))
y_train = np.random.random((100,))

# Create a simple model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(10,)),
    tf.keras.layers.Dense(1)
])

# Compile the model
model.compile(optimizer='adam', loss='mse')

# Create a checkpoint callback
checkpointer = tf.keras.callbacks.ModelCheckpoint(
    filepath=CHECKPOINT_STORE,
    save_weights_only=True,
    verbose=1
)

# Train the model with checkpointing
model.fit(x_train, y_train, epochs=10, callbacks=[checkpointer])

# Retrieve that checkpoint
model.load_weights("checkpoints/110720240005.ckpt")
Handle exceptions and failures gracefully.
Periodically save computation state so we can recover.
Replicate data and services across multiple nodes.
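For the first of those, a minimal retry-with-backoff helper; the exception types, attempt count, and delays are illustrative:

import random
import time

def with_retries(fn, attempts=5, base_delay=0.5):
    """Call fn(), retrying transient failures with exponential backoff and jitter."""
    for attempt in range(attempts):
        try:
            return fn()
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure
            time.sleep(base_delay * (2 ** attempt) + random.random() * 0.1)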
“Use Case Engineer”
5 data projects you could do to help the platform team
The Observability Plane
Build the platform team a custom data dashboard using Prometheus and Grafana.
Aggregate alerts and logs from different systems into a single data observability plane.
Put the code into a shared git repository that others can add onto.
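A hedged starting point, assuming the prometheus_client library; the metric name, port, and value are placeholders for whatever the platform team actually needs to watch:

import random
import time
from prometheus_client import Gauge, start_http_server

# Hypothetical metric: rows waiting in the ingestion queue.
QUEUE_DEPTH = Gauge("ingest_queue_depth", "Rows waiting to be loaded")

start_http_server(8000)  # Prometheus scrapes http://localhost:8000/metrics
while True:
    QUEUE_DEPTH.set(random.randint(0, 500))  # stand-in for a real measurement
    time.sleep(15)

Point a Grafana panel at that metric and the dashboard has its first series.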
The Downtime Predictor
Build a model using historical incidents, logs, and performance metrics to forecast system downtimes or failures before they occur.
Run regular summary reports to send to the team to help them proactively maintain infrastructure and prevent service disruptions.
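A minimal sketch with scikit-learn on synthetic data; the four features and the incident label are stand-ins for whatever your incident history actually contains:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical features: CPU %, memory %, error rate, deploys in the last 24h.
rng = np.random.default_rng(42)
X = rng.random((1_000, 4))
# Synthetic label: "incident within 24h", loosely driven by error rate and CPU.
y = (X[:, 2] + 0.5 * X[:, 0] + rng.normal(0, 0.1, 1_000) > 1.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)
print("holdout accuracy:", clf.score(X_test, y_test))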
The Resource Allocation Copilot
Train a model to predict demand on resources based on historical usage patterns, seasonal trends, and other factors like time of day, workload type, or application usage.
Attempt to use reinforcement learning to optimize allocation (e.g. by balancing performance requirements with cost).
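A sketch of the demand-forecasting half only (the reinforcement-learning piece is beyond a slide-sized example), using scikit-learn on synthetic usage data:

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical features: hour of day, day of week, workload type (integer-encoded).
rng = np.random.default_rng(0)
hours = rng.integers(0, 24, 2_000)
days = rng.integers(0, 7, 2_000)
workload = rng.integers(0, 3, 2_000)
X = np.column_stack([hours, days, workload])
# Synthetic demand with a daily cycle plus a workload-type effect.
demand = 50 + 30 * np.sin(hours / 24 * 2 * np.pi) + 10 * workload + rng.normal(0, 5, 2_000)

model = GradientBoostingRegressor().fit(X, demand)
print(model.predict([[14, 2, 1]]))  # expected demand at 2pm on a Wednesday for workload type 1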
Custom Anomaly Detector
Finetune an autoencoder on your organization’s application logs.
See if you can use it to detect anomalies that line up with reported failures, security breaches, or performance issues.
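A minimal sketch with tf.keras, trained from scratch on vectorized "normal" logs rather than finetuned; the feature width and error threshold are arbitrary:

import numpy as np
import tensorflow as tf

# Hypothetical input: each log line already vectorized into 32 features.
normal_logs = np.random.random((5_000, 32)).astype("float32")

autoencoder = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(32,)),
    tf.keras.layers.Dense(8, activation="relu"),   # bottleneck
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(32),
])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(normal_logs, normal_logs, epochs=5, verbose=0)

# Log vectors the model reconstructs poorly are candidate anomalies.
def is_anomaly(batch, threshold=0.15):
    errors = np.mean((batch - autoencoder.predict(batch, verbose=0)) ** 2, axis=1)
    return errors > threshold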
Test Data LLM
Finetune an LLM to produce synthetic test data that the platform team can use for benchmarking or integration testing.
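A sketch of the generation half only, assuming the Hugging Face transformers library; "gpt2" stands in for whatever model you actually finetune, and the prompt is made up:

from transformers import pipeline

# A finetuned checkpoint path would replace "gpt2" here.
generator = pipeline("text-generation", model="gpt2")
prompt = "Generate a JSON record for a fake customer order:\n"
result = generator(prompt, max_new_tokens=80, num_return_sequences=1)
print(result[0]["generated_text"])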
Remember… déjà vu is a glitch in the matrix
slow queries
re-wrangling
OOMs
connection timeouts
data drift
model graveyards
irreproducibility
thank you for being curious
