ThinkFast: Scaling Machine Learning to Modern Demands

ThinkFast: Scaling Machine Learning
to Modern Demands
Hristo Paskov

The Genomic Data Deluge
• Precision Medicine
Initiative: sequence
1,000,000 genomes
– $215 Million in 2015
– Pilot study
– Outputs 10-50 GB/person
How do we analyze all of this data to drive
progress?

Massive Data Sources
News
eCommerce
Bioinformatics
100K Genomes
Social Media

The Analysis Refinement Cycle
⨂
Data
1
2
𝑦 − 𝑋𝑤 2
2
+
𝜆
2
𝑤 2
2
Model
𝑥+
= 𝑥 − 𝛼𝑀𝛻𝑓 𝑥
Solver
Model
captures
data
nuance?
Solver
exists, is
fast
enough?
Yes? Proceed
! No? Quit
Increase time, money, experience, resources

More Than Just Training Models
• Regularization paths
• Model risk assessment
• Interpretability
ModelCoefficient
Regularization Parameter

Brief History of Statistical Learning
Interpretability & Statistical Guarantees
Scalability
Ease of
Use
Simple
Models
Kernel
Methods
Trees &
Ensembles
Structured
Regularization

Structured Regularization
Losses
Regression
Classification
Ranking
Motif Finding
Matrix Factorization
Feature Embedding
Data Imputation
…
Regularizers
Sparsity
Spatial/ Temporal /
Manifold Structure
Group Structure
Hierarchical Structure
Structured & Unstructured
Multitask Learning
…
min
𝛽∈ℝ 𝑑
𝐿 𝑋𝛽 + 𝜆𝑅 𝛽

The Lasso’s Combinatorial Side
min
𝛽∈ℝ 𝑑
𝐿 𝑦 − 𝑋𝛽 + 𝜆 𝛽 1
𝜆
0
3
2
1
4
ModelCoefficient

The Database Perspective
min
𝛽∈ℝ 𝑑
𝐿 𝑦 − 𝑋𝛽 + 𝜆 𝛽 1
−𝑋 𝑇
𝜕 𝑦−𝑋𝛽 𝐿 𝑦 − 𝑋𝛽 + 𝜆𝜕 𝛽 𝛽 1

−𝑋 𝑇
𝜕 𝑦−𝑋𝛽 𝐿 𝑦 − 𝑋𝛽 + 𝜆𝜕 𝛽 𝛽 1
Feature & label storage

−𝑋 𝑇
𝜕 𝑦−𝑋𝛽 𝐿 𝑦 − 𝑋𝛽 + 𝜆𝜕 𝛽 𝛽 1
Data access operations
𝑢 = 𝑦 − 𝑋𝛽
𝑣 = 𝜕 𝑢 𝐿 𝑢
𝑤 = 𝑋 𝑇 𝑣

−𝑋 𝑇
𝜕 𝑦−𝑋𝛽 𝐿 𝑦 − 𝑋𝛽 + 𝜆𝜕 𝛽 𝛽 1
𝑣 = 𝜕 𝑢 𝐿 𝑢
𝑤 = 𝑋 𝑇 𝑣
ML “Query Language” min
𝛽∈ℝ 𝑑
𝐿 𝑦 − 𝑋𝛽 + 𝜆 𝛽 1

min
𝛽1,𝛽2,𝛽3∈ℝ 𝑑
𝑡=1
3
𝐿 𝑡 𝑦𝑡 − 𝑋𝑡 𝛽𝑡 + 𝜆 𝑡 𝑅𝑡 𝛽𝑡
+𝜔 𝛽1 𝛽2 𝛽3 ∗

Feature, label and
model storage
𝑣 = 𝜕 𝑢 𝐿 𝑢
𝑤 = 𝑋 𝑇
𝑣
ML “Query Language” min
𝛽∈ℝ 𝑑
𝐿 𝑦 − 𝑋𝛽 + 𝜆 𝛽 1
𝑀1
𝑀2
𝑀1
𝑀2
𝑀3
𝑀1
𝑀2

𝑣 = 𝜕 𝑢 𝐿 𝑢
𝑤 = 𝑋 𝑇
𝑣
min
𝛽∈ℝ 𝑑
𝐿 𝑦 − 𝑋𝛽 + 𝜆 𝛽 1
𝑀1
𝑀2
𝑀1
𝑀2
𝑀3
𝑀1
𝑀2
Processing Memory
Mathematical
Structure

“Query Language” Optimization
• Static analysis
𝑦 − 𝑋𝑤 2
2
+ 𝑤 2
2
𝑦 − 𝑋𝑤 2
2
+ 𝑤 1
?
𝑦 − 𝑋𝑤 2
2
+
1
2
𝑤 2
2
+ 𝑤 1

• Static analysis
𝑦 − 𝑋𝑤 2
2
+ 𝑤 2
2
𝑦 − 𝑋𝑤 2
2
+ 𝑤 1
𝑦 − 𝑋𝑤 2
2
+
1
2
𝑤 2
2
+ 𝑤 1
?
𝜀 𝑦 − 𝑋𝑤 +
1
2
𝑤 2
2
+ 𝑤 1

• Static analysis
• Runtime analysis

Some Bioinformatics Applications
• Personalized medicine, Memorial Sloan
Kettering Cancer Center
– 35% accuracy improvement over state-of-the-art
• Metagenomic binning and DNA quality
assessment, Stanford School of Medicine
– Previously unsolved problem
• Toxicogenomic analysis, Stanford University
– Improved on state-of-the-art results

Upcoming
• Massive scale character level sentiment and
text analysis on Amazon data
– Billions of features, hours to solve a model
– Efficient multitask learning
• Characterize the global limitations of learning
word structure
– Devise provably more efficient regularizers for
uncovering structure

ThinkFast: Scaling Machine Learning to Modern Demands

More Related Content

What's hot

Viewers also liked

Similar to ThinkFast: Scaling Machine Learning to Modern Demands

More from Domino Data Lab

Recently uploaded

ThinkFast: Scaling Machine Learning to Modern Demands

Editor's Notes