ThinkFast: Scaling Machine Learning to Modern Demands
Hristo Paskov
The Genomic Data Deluge
• Precision Medicine Initiative: sequence 1,000,000 genomes
  – $215 million in 2015
  – Pilot study
  – Outputs 10–50 GB/person
How do we analyze all of this data to drive progress?
Massive Data Sources
• News
• eCommerce
• Bioinformatics (100K Genomes)
• Social Media
The Analysis Refinement Cycle
Data ⟶ Model ⟶ Solver
  Model: $\tfrac{1}{2}\|y - Xw\|_2^2 + \tfrac{\lambda}{2}\|w\|_2^2$
  Solver: $x^+ = x - \alpha M \nabla f(x)$
Does the model capture the data's nuance? Does a solver exist, and is it fast enough?
• Yes? Proceed.
• No? Quit, or increase time, money, experience, and resources and iterate.
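
To make the model/solver pairing concrete, here is a minimal numpy sketch (mine, not code from the talk) of the update $x^+ = x - \alpha M \nabla f(x)$ applied to the ridge objective above, with M assumed to be a diagonal preconditioner:

    import numpy as np

    def ridge_gradient_step(w, X, y, lam, alpha, M=None):
        """One preconditioned gradient step on f(w) = 0.5*||y - Xw||^2 + 0.5*lam*||w||^2."""
        grad = -X.T @ (y - X @ w) + lam * w      # gradient of the ridge objective
        if M is None:
            M = np.ones_like(w)                  # identity preconditioner by default
        return w - alpha * M * grad              # x+ = x - alpha * M * grad f(x)

    # Tiny usage example on synthetic data
    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 10))
    y = X @ rng.standard_normal(10) + 0.1 * rng.standard_normal(100)
    w = np.zeros(10)
    for _ in range(200):
        w = ridge_gradient_step(w, X, y, lam=1.0, alpha=1e-3)

With a small enough step size (below 2 divided by the largest eigenvalue of XᵀX + λI), these iterates converge to the ridge solution.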
More Than Just Training Models
• Regularization paths
• Model risk assessment
• Interpretability
[Figure: regularization path, model coefficients plotted against the regularization parameter]
Brief History of Statistical Learning
[Chart: model families compared on interpretability & statistical guarantees, scalability, and ease of use]
• Simple models
• Kernel methods
• Trees & ensembles
• Structured regularization
Structured Regularization
Losses: regression, classification, ranking, motif finding, matrix factorization, feature embedding, data imputation, …
Regularizers: sparsity; spatial/temporal/manifold structure; group structure; hierarchical structure; structured & unstructured multitask learning; …
$$\min_{\beta \in \mathbb{R}^d} \; L(X\beta) + \lambda R(\beta)$$
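
To ground the template, two standard instantiations (textbook examples, not taken from the slides), where the loss absorbs the labels $y$: a squared loss with an $\ell_1$ penalty gives the lasso, and replacing the penalty with a sum of group norms gives the group lasso.

$$
L(X\beta) = \tfrac{1}{2}\|y - X\beta\|_2^2,\quad R(\beta) = \|\beta\|_1 \;\;\text{(lasso)};
\qquad
R(\beta) = \sum_{g \in \mathcal{G}} \|\beta_g\|_2 \;\;\text{(group lasso)} .
$$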
The Lasso’s Combinatorial Side
$$\min_{\beta \in \mathbb{R}^d} \; L(y - X\beta) + \lambda \|\beta\|_1$$
[Figure: lasso regularization path, model coefficients plotted against λ]
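
For intuition about the path, here is a small sketch (my own illustration with an ISTA-style solver, not the system described in the talk) that computes lasso solutions over a decreasing grid of λ values with warm starts; the set of nonzero coefficients changes as λ shrinks, which is one way to read the combinatorial aspect above.

    import numpy as np

    def soft_threshold(z, t):
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    def lasso_reg_path(X, y, lams, iters=300):
        """Solve the lasso over a decreasing grid of lambdas, warm-starting each solve."""
        alpha = 1.0 / np.linalg.norm(X, 2) ** 2   # safe ISTA step size: 1 / ||X||_2^2
        beta = np.zeros(X.shape[1])
        path = []
        for lam in sorted(lams, reverse=True):
            for _ in range(iters):
                beta = soft_threshold(beta + alpha * (X.T @ (y - X @ beta)), alpha * lam)
            path.append(beta.copy())              # one column of the regularization path
        return np.array(path)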
The Database Perspective
• ML “query language”: $\min_{\beta \in \mathbb{R}^d} \; L(y - X\beta) + \lambda \|\beta\|_1$, whose subgradient is $-X^T \partial_{y - X\beta} L(y - X\beta) + \lambda\, \partial_\beta \|\beta\|_1$
• Feature & label storage
• Data access operations: $u = y - X\beta$, $\; v = \partial_u L(u)$, $\; w = X^T v$
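
A minimal sketch (mine, not the author's implementation) of how a lasso solver touches the data only through the three access operations above: one multiplication by X, one elementwise loss derivative, and one multiplication by Xᵀ per iteration. ISTA is assumed as the example solver.

    import numpy as np

    def soft_threshold(z, t):
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    def lasso_solve(X, y, lam, iters=500):
        """ISTA for min_b 0.5*||y - X b||^2 + lam*||b||_1, phrased via the three access ops."""
        alpha = 1.0 / np.linalg.norm(X, 2) ** 2   # safe step size: 1 / ||X||_2^2
        beta = np.zeros(X.shape[1])
        for _ in range(iters):
            u = y - X @ beta                      # access op 1: u = y - X beta
            v = u                                 # access op 2: v = dL/du for L(u) = 0.5*||u||^2
            w = X.T @ v                           # access op 3: w = X^T v
            beta = soft_threshold(beta + alpha * w, alpha * lam)   # model update
        return beta

Everything else in the loop is cheap vector arithmetic, which is why the storage and access layer dominates the cost.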
The Database Perspective
Multitask example with three coupled models:
$$\min_{\beta_1,\beta_2,\beta_3 \in \mathbb{R}^d} \; \sum_{t=1}^{3} \Big[ L_t(y_t - X_t\beta_t) + \lambda_t R_t(\beta_t) \Big] + \omega \,\big\| [\beta_1 \;\; \beta_2 \;\; \beta_3] \big\|_*
$$
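
If the trailing penalty $\omega\,\|[\beta_1\;\beta_2\;\beta_3]\|_*$ is read as a nuclear (trace) norm on the stacked coefficient matrix (my assumption), then the only new ingredient a proximal solver needs beyond the per-task operations is the nuclear-norm prox, i.e. singular value thresholding. A numpy sketch:

    import numpy as np

    def prox_nuclear_norm(B, tau):
        """Proximal operator of tau * ||B||_* (singular value thresholding)."""
        U, s, Vt = np.linalg.svd(B, full_matrices=False)
        s_shrunk = np.maximum(s - tau, 0.0)       # soft-threshold the singular values
        return (U * s_shrunk) @ Vt

    # Example: couple three task coefficient vectors stacked as columns of B
    B = np.column_stack([np.random.randn(50) for _ in range(3)])
    B_low_rank = prox_nuclear_norm(B, tau=2.0)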
The Database Perspective
• Feature, label, and model storage
• Data access operations: $u = y - X\beta$, $\; v = \partial_u L(u)$, $\; w = X^T v$
• ML “query language”: $\min_{\beta \in \mathbb{R}^d} \; L(y - X\beta) + \lambda \|\beta\|_1$
[Diagram: models $M_1$, $M_2$, $M_3$ mapped onto processing, memory, and mathematical structure]
Efficient Feature Storage
“Query Language” Optimization
• Static analysis: recognizing how related objectives combine, e.g.
  – Ridge: $\|y - Xw\|_2^2 + \|w\|_2^2$
  – Lasso: $\|y - Xw\|_2^2 + \|w\|_1$
  – Elastic net: $\|y - Xw\|_2^2 + \tfrac{1}{2}\|w\|_2^2 + \|w\|_1$ ?
  – $\varepsilon(y - Xw) + \tfrac{1}{2}\|w\|_2^2 + \|w\|_1$ ?
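
One structural fact that static analysis could exploit here (a standard convex-optimization result, not a claim about this particular system): the mixed penalty $\lambda\|w\|_1 + \tfrac{\gamma}{2}\|w\|_2^2$ still has a closed-form proximal operator, soft-thresholding followed by uniform shrinkage, so a proximal lasso solver extends to the elastic-net-style objectives with almost no extra work. A numpy sketch:

    import numpy as np

    def prox_elastic_net(z, step, lam, gamma):
        """Prox of step * (lam*||w||_1 + 0.5*gamma*||w||_2^2): soft-threshold, then shrink."""
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # l1 part: soft-threshold
        return w / (1.0 + step * gamma)                           # l2 part: uniform shrinkage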
“Query Language” Optimization
• Static analysis
• Runtime analysis
Some Bioinformatics Applications
• Personalized medicine, Memorial Sloan Kettering Cancer Center
  – 35% accuracy improvement over the state of the art
• Metagenomic binning and DNA quality assessment, Stanford School of Medicine
  – Previously unsolved problem
• Toxicogenomic analysis, Stanford University
  – Improved on state-of-the-art results
Upcoming
• Massive-scale character-level sentiment and text analysis on Amazon data
  – Billions of features, hours to solve a model
  – Efficient multitask learning
• Characterize the global limitations of learning word structure
  – Devise provably more efficient regularizers for uncovering structure

Editor's Notes

• #3 [Tons of data, show graph?] [Models are not good] [How do we quickly iterate with different models?] [Memory $$$]