- How do we currently think about Data Science?
- Why is infrastructure important to our field?
- Two tools we've built on Sailthru's Data Science team to deal with these problems are "Stolos" and "Relay.Mesos".
Jeremy Stanley, EVP/Data Scientist, Sailthru, at MLconf NYC
Cost Effectively Scaling Machine Learning Systems in the Cloud: E-commerce and publishing clients use Sailthru to personalize billions of digital experiences for their customers weekly. Earlier this year, Sailthru launched Sightlines to allow clients to predict the future behavior of individual users. In this talk we cover how we scaled Sightlines cost effectively in the cloud by combining inexpensive computing resources with an efficient architecture and an implementation that is easy to maintain and evolve.
To access computing resources cost effectively, we utilize Amazon spot instances and Apache Mesos to pool together large quantities of CPU and memory. This approach can be orders of magnitude more cost effective than traditional deployments, but requires sophisticated automation and orchestration tools, and a fine-grained fault tolerant application architecture.
Given cost effective resources, the next challenge was to design the application to be efficient. Simple sampling and data pre-processing techniques significantly limit the computational requirements without adversely impacting model performance. Further, by controlling how often we run various components of the pipeline, we minimize cost while keeping models up to date.
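The sampling idea in this abstract can be illustrated with reservoir sampling, which keeps a fixed-size uniform sample of an arbitrarily large event stream in O(k) memory. This is a generic sketch of the technique, not necessarily what Sailthru used; the event stream and sample size are made up for illustration:

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Algorithm R: a uniform random sample of k items from a stream
    of unknown length, using O(k) memory."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)          # fill the reservoir first
        else:
            j = rng.randint(0, i)        # replace with decreasing probability
            if j < k:
                sample[j] = item
    return sample

# Sample 1,000 events from a million-event stream before training a model.
events = range(1_000_000)
subset = reservoir_sample(events, k=1000)
```

Capping the sample size this way bounds both memory and downstream training cost regardless of how large the raw stream grows.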
The final challenge is to make such a system maintainable and easy to evolve. This includes removing single points of failure, automating infrastructure management, building distributed logging and monitoring capabilities, and running identical A/B production environments to enable aggressive, iterative changes to the code base and architecture in production.
We hope to demonstrate that the challenges faced in scaling a complex machine learning system in the cloud are at least as interesting as the science behind it, and to provide some insight into modern tools and methods for addressing these scalability challenges.
Video and slides synchronized, mp3 and slide download available at http://bit.ly/2lGNybu.
Stefan Krawczyk discusses how his team at StitchFix uses the cloud to enable over 80 data scientists to be productive. He also talks about prototyping ideas, algorithms and analyses; how they set up and keep schemas in sync between Hive, Presto, Redshift and Spark; and how they make access easy for their data scientists. Filmed at qconsf.com.
Stefan Krawczyk is Algo Dev Platform Lead at StitchFix, where he’s leading development of the algorithm development platform. He spent formative years at Stanford, LinkedIn, Nextdoor & Idibon, working on everything from growth engineering, product engineering, data engineering, to recommendation systems, NLP, data science and business intelligence.
Nubank is the leading fintech in Latin America. Using bleeding-edge technology, design, and data, the company aims to fight complexity and empower people to take control of their finances. We are disrupting an outdated and bureaucratic system by building a simple, safe and 100% digital environment.
In order to succeed, we need to constantly make better decisions at the speed of insight, and that is what we aim for when building Nubank’s Data Platform. In this talk we explore and share the guiding principles behind an automated, scalable, declarative and self-service platform with more than 200 contributors, mostly non-technical, who build 8,000 distinct datasets ingesting data from 800 databases, leveraging Apache Spark’s expressiveness and scalability.
The topics we want to explore are:
– Making data-ingestion a no-brainer when creating new services
– Reducing the cycle time to deploy new Datasets and Machine Learning models to production
– Closing the loop and leveraging knowledge processed in the analytical environment to take decisions in production
– Providing the perfect level of abstraction to users
You will get from this talk:
– Our love for ‘The Log’ and how we use it to decouple databases from their schemas and distribute the work of keeping schemas up to date across the entire team.
– How we made data ingestion so simple using Kafka Streams that teams stopped using databases for analytical data.
– The huge benefits of relying on the DataFrame API to create datasets, which made it possible to have end-to-end tests verifying that the 8,000 datasets work without even running a Spark job, and much more.
– The importance of creating the right amount of abstractions and restrictions to have the power to optimize.
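The claim that thousands of datasets can be verified without executing any job can be illustrated with a schema-level contract check: validate that each dataset's declared inputs and outputs line up before any data is touched. Everything below (the function, the customers/purchases example) is a hypothetical sketch of the idea, not Nubank's implementation:

```python
def check_dataset(inputs, required, produced):
    """Verify a dataset definition against its input schemas without
    running any job: every required input column must be provided by
    some source. `inputs` maps source name -> set of columns it provides."""
    available = set().union(*inputs.values())
    missing = required - available
    if missing:
        raise ValueError(f"missing input columns: {sorted(missing)}")
    return {"columns": sorted(produced)}

# Hypothetical dataset joining customers and purchases:
schema = check_dataset(
    inputs={"customers": {"customer_id", "signup_date"},
            "purchases": {"customer_id", "amount"}},
    required={"customer_id", "amount"},
    produced={"customer_id", "total_spent"},
)
```

Because the check operates on declared schemas rather than data, a test suite can run it across every dataset definition in milliseconds, which is the spirit of testing 8,000 datasets without a Spark job.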
Introduction to Artificial Intelligence and Machine Learning (bigdata trunk)
A workshop to introduce Artificial Intelligence and Machine Learning for beginners. It starts with the basics, terminology and concepts of machine learning, and compares them with deep learning and artificial intelligence. It highlights ML and AI offerings like Jupyter Notebook, Azure ML, Amazon SageMaker, TensorFlow, etc.
Doing Analytics Right - Building the Analytics Environment (Tasktop)
Implementing analytics for development processes is challenging. As discussed in the previous webinars, the right analytics are determined by the goals of the organization, not by the available data. So implementing your analytics solutions will require an efficient analytics and data architecture, including the ability to combine and stage data from heterogeneous sources. An architecture that excludes the ability to gain access to the necessary data will create a barrier to deploying your newly designed analytics program, and will force you back into the “light is brighter here” anti-pattern.
This webinar will describe the technical considerations of implementing the data architecture for your analytics program, and explain how Tasktop can help.
Projects failing because of “communication issues” is something I hear quite frequently. But how agile can we be in project communication? I will share my experience by overviewing the main lessons learned in the areas of:
Work planning/scheduling insights and hidden risks;
Tips on communication among team members and with outside stakeholders;
Tools & techniques for organizing effective and transparent communication;
Change requests and project information management: why and how?
An illustrated guide to microservices (ploneconf, 10-21-2016), Ambassador Labs
A (simpler) Microservices Definition
A Microservice is a unit of business logic.
A Microservice application is a distributed composition of business logic via services.
This presentation has slides from a talk that I gave at the annual Experimental Biology meeting, 2015, on our curriculum for Big Data Analytics in the Inland Empire.
The sole purpose of sharing these slides is to educate beginners in IT and Computer Science/Engineering. Credit should go to the referenced material and also to CICRA campus, Colombo 4, Sri Lanka, where I taught these in 2017.
Maximize Big Data ROI via Best of Breed Patterns and Practices (Jeff Bertman)
Abstract:
Not long ago the question was whether your organization had big data. Did you have the volume, the velocity, the technology? Now those basics are largely a given for most of the people attending this event. The path to success is still fuzzy, however, with so many technologies to choose from, and so many ways to use them.
This presentation triangulates in a holistic manner on the modern business dilemma: how can we leverage technology to improve revenue, profit, market share, and numerous other success criteria? That said, this is not about the analytics or KPIs, although it is about measurable improvement. It’s about lining up the right technologies and using them in effective, proven ways to maximize Return on Investment (ROI). Since the slant here is holistic, we’ll show how to blend infrastructure, tools, methods, and talent to avoid and constantly trim technical debt, and to produce success stories that are consistently repeatable, not a byproduct of individual heroics.
Introduction talk at the University of Strathclyde (Scotland) Algorithms Workshop, providing a quick overview of the fundamental and practical reasons why algorithms are/are not technical black boxes. (This talk does not address issues of trade secret or other business reasons for lack of transparency). The presentation was given to an audience of academics and students at the law department.
From prototype to production - The journey of re-designing SmartUp.io (Máté Lang)
A talk about the journey of a small tech team re-designing SmartUp.io from scratch, and the technical path from MVP to production.
A high-level overview of architecture and tech-stack decisions, best practices and culture.
(SPOT205) 5 Lessons for Managing Massive IT Transformation Projects (Amazon Web Services)
Choice Hotels is undertaking a multiyear, $20 million project to recreate our core business engines on AWS. In trying to approach this complex undertaking, we determined that the project itself is a system too. You can apply principles of good architecture and design work in how you approach the project structure and management. Come to this talk by Choice Hotels’ CTO to learn five key lessons and 20 concrete takeaways that you can implement today to help your AWS projects succeed.
Data Engineer's Lunch #85: Designing a Modern Data Stack (Anant Corporation)
What are the design considerations that go into architecting a modern data warehouse? This presentation will cover some of the requirements analysis, design decisions, and execution challenges of building a modern data lake/data warehouse.
How we integrate Machine Learning Algorithms into our IT Platform at Outfittery (OUTFITTERY)
Outfittery's mission is to provide relevant fashion to men. In the past, it was our stylists who put together the best outfits for our customers. But about a year ago we started to rely more on intelligent algorithms to augment our human experts.
This transition to become a data driven company has left its marks on our IT landscape:
In the beginning we just did simple A/B tests. Then we wanted to use more complex logic, so we added a generic data enrichment layer. Later we also provided easy configurability to steer processes. And this in turn enabled us to orchestrate our machine learning algorithms as self-contained Docker containers within a Kubernetes cluster. All in all it's a nice setup that we are pretty happy with.
It then took us some time to realise that we had actually built a delivery platform that can deliver any pure function our data scientists come up with, directly into our microservice landscape. We have just now started to use it that way: we put their R&D experiments directly into production... :-)
This talk will guide you through this journey, explain how this platform is built, and what we do with it.
ML in the Browser: Interactive Experiences with TensorFlow.js (C4Media)
Video and slides synchronized, mp3 and slide download available at https://bit.ly/39SddUL.
Victor Dibia provides a friendly introduction to machine learning and covers concrete steps on how front-end developers can create their own ML models and deploy them as part of web applications. He discusses his experience building Handtrack.js, a library for prototyping real-time hand-tracking interactions in the browser. Filmed at qconsf.com.
Victor Dibia is a Research Engineer with Cloudera’s Fast Forward Labs. Prior to this, he was a Research Staff Member at the IBM TJ Watson Research Center, New York. His research interests are at the intersection of human computer interaction, computational social science, and applied AI.
This presentation was given at one of the DSATL Meetups in March 2018, in partnership with Southern Data Science Conference 2018 (www.southerndatascience.com).
4. Talk Outline
Part 1:
● What is Data Science?
● Where should we spend our time as data scientists?
Part 2:
● How we balance infrastructure, optimization and problem formulation at Sailthru.
19. Components of a Solid Infrastructure
● Lots of Machinery: VMs, Containers
● Machines require coordination, redundancy and fault tolerance: CAP Theorem
20. Components of a Solid Infrastructure
● Resource Allocation: Fair Scheduling, Bin Packing
● Control strategies: Auto Scaling, Feedback, PID
● Communication algorithms: Gossip, Paxos, ...
● Configuration: Dynamic Persistence, Namespaces
● Monitoring: Anomaly Detection, Visualization
● Data Storage: Relational, Graph, Key-Value
● SO MANY TOOLS!
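One of the topics on this slide, bin packing for resource allocation, can be sketched with the classic first-fit-decreasing heuristic: place the largest demand first, into the first machine that still has room. The function and the memory figures below are illustrative, not code from any particular scheduler:

```python
def first_fit_decreasing(demands, capacity):
    """Pack task resource demands into as few fixed-capacity bins
    (machines) as the first-fit-decreasing heuristic allows."""
    bins = []  # each bin: [remaining_capacity, [placed demands]]
    for d in sorted(demands, reverse=True):
        if d > capacity:
            raise ValueError(f"demand {d} exceeds machine capacity")
        for b in bins:
            if b[0] >= d:          # first machine with room wins
                b[0] -= d
                b[1].append(d)
                break
        else:
            bins.append([capacity - d, [d]])  # open a new machine
    return [b[1] for b in bins]

# Tasks needing this many GB of RAM, on machines with 8 GB each:
print(first_fit_decreasing([5, 4, 3, 2, 2], capacity=8))
```

The same shape of problem (fitting CPU and RAM demands onto pooled machines) is what a Mesos-style resource allocator solves, just with multiple dimensions and live offers instead of a static list.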
21. So What is Data Science?
● Problem Formulation
● Infrastructure
● Optimization
23. As a Data Scientist, ...
...when do I:
○ build infrastructure that supports my ideas
○ optimize my existing models and problems
○ find new problems to work on
27. ● Sailthru is a personalization platform.
● We help our clients communicate with their customers.
● Our goal is to maximize the lifetime value of these customers so that our clients do well, customers are happy, and Sailthru is successful.
29. Sightlines - Example Use Cases
● Incentivize users with low chance of purchasing
● Personalize discounts above expected order value
● Suppress users likely to opt out of messages
● Engage users unlikely to open on other channels
34. What problem does it solve?
A Directed Acyclic Multi-Graph task dependency scheduler designed to simplify complex, distributed pipelines.
It creates application queues that can be consumed from in any order.
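The core behavior of a DAG task-dependency scheduler, running each task only after all of its upstream dependencies complete, can be sketched with a Kahn-style topological sort. This is a generic illustration of the idea, not Stolos code, and the pipeline tasks are hypothetical:

```python
from collections import defaultdict, deque

def runnable_order(deps):
    """deps maps task -> set of upstream tasks it depends on.
    Returns a valid execution order, or raises on a cycle."""
    indegree = {t: len(ups) for t, ups in deps.items()}
    downstream = defaultdict(list)
    for t, ups in deps.items():
        for u in ups:
            downstream[u].append(t)
    ready = deque(t for t, d in indegree.items() if d == 0)
    order = []
    while ready:
        t = ready.popleft()           # task whose dependencies are all done
        order.append(t)
        for d in downstream[t]:
            indegree[d] -= 1
            if indegree[d] == 0:      # last dependency just finished
                ready.append(d)
    if len(order) != len(deps):
        raise ValueError("cycle detected in task graph")
    return order

# A toy ML pipeline: extract -> clean -> {features, labels} -> train
pipeline = {
    "extract": set(),
    "clean": {"extract"},
    "features": {"clean"},
    "labels": {"clean"},
    "train": {"features", "labels"},
}
print(runnable_order(pipeline))
```

In a distributed setting, the `ready` queue is what gets exposed to workers: any task in it can be consumed in any order, which matches the queue-based consumption the slide describes.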
38. What problem does it solve?
Relay actively minimizes the difference between a measured signal and a target signal.
Relay.Mesos plugs Relay into a tool called Mesos.
→ Lets us auto-scale consumers of queued Stolos jobs
42. The PID Algorithm
e_t = SP − PV_t
MV_t = Kp · e_t + Ki · Σ e_τ · Δt
PV = Process Variable (Signal)
SP = Set Point (Target)
MV = Manipulated Variable (Output)
t = index on timesteps
**The “D” in PID is excluded here
43. The PID Algorithm
MV_t = Kp · e_t + Ki · Σ e_τ · Δt + Kd · Δe_t / Δt
(same notation as above; the derivative term Kd · Δe_t / Δt is now included)
48. Sightlines - On Mesos
(diagram: CPU units and RAM allocated across the Mesos cluster)