This presentation discusses scaling machine learning to meet modern demands for analyzing massive datasets. It cites the Precision Medicine Initiative, which is expected to generate tens of gigabytes of genomic data per person, and argues that statistical learning with structured regularization can help analyze such data. It presents machine learning as a "query language" that can be optimized like database queries by exploiting the mathematical structure of learning problems and efficient feature storage. It closes with bioinformatics applications in which these techniques improved on state-of-the-art results.
2. The Genomic Data Deluge
• Precision Medicine Initiative: sequence 1,000,000 genomes
– $215 Million in 2015
– Pilot study
– Outputs 10-50 GB/person
How do we analyze all of this data to drive progress?
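As a rough sense of scale, the slide's own figures can be multiplied out. The sketch below assumes decimal units (1 PB = 10^6 GB) and that every one of the 1,000,000 genomes falls inside the quoted 10-50 GB range; the function name is illustrative only.

```python
# Back-of-the-envelope total for the figures quoted on this slide.
GENOMES = 1_000_000
GB_PER_PERSON_LOW, GB_PER_PERSON_HIGH = 10, 50
GB_PER_PB = 1_000_000  # assumption: decimal units, 1 PB = 10^6 GB


def total_petabytes(genomes, gb_per_person):
    """Total storage in petabytes for `genomes` people at `gb_per_person` GB each."""
    return genomes * gb_per_person / GB_PER_PB


low = total_petabytes(GENOMES, GB_PER_PERSON_LOW)
high = total_petabytes(GENOMES, GB_PER_PERSON_HIGH)
print(f"{low:.0f}-{high:.0f} PB")  # prints "10-50 PB"
```

Under these assumptions the full cohort comes to roughly 10-50 petabytes of raw output.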
4. The Analysis Refinement Cycle
• Data
• Model: $\frac{1}{2}\|y - Xw\|_2^2 + \frac{\lambda}{2}\|w\|_2^2$
• Solver: $x^{+} = x - \alpha M \nabla f(x)$
• Model captures data nuance?
• Solver exists, is fast enough?
– Yes? Proceed
– No? Quit
• Increase time, money, experience, resources
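The model and solver on this slide can be put together in a few lines. The following is a minimal sketch, not the presenter's implementation: it minimizes the ridge objective $\frac{1}{2}\|y - Xw\|_2^2 + \frac{\lambda}{2}\|w\|_2^2$ with the update $x^{+} = x - \alpha M \nabla f(x)$. The slide does not say what $M$ is, so a diagonal (Jacobi) preconditioner is assumed here, and the step size $\alpha$, the synthetic data, and all function names are likewise assumptions made for illustration.

```python
import numpy as np


def ridge_gradient(w, X, y, lam):
    """Gradient of 0.5*||y - Xw||^2 + 0.5*lam*||w||^2 with respect to w."""
    return X.T @ (X @ w - y) + lam * w


def preconditioned_gd(X, y, lam, n_iter=500):
    """Iterate w+ = w - alpha * M * grad f(w) with an assumed diagonal M."""
    n_features = X.shape[1]
    H = X.T @ X + lam * np.eye(n_features)  # Hessian of the ridge objective
    m = 1.0 / np.diag(H)                    # assumption: M = diag(H)^-1 (Jacobi)
    # Step size: inverse of the largest eigenvalue of M^(1/2) H M^(1/2),
    # which keeps every error component from growing.
    s = np.sqrt(m)
    alpha = 1.0 / np.linalg.eigvalsh(s[:, None] * H * s[None, :]).max()
    w = np.zeros(n_features)
    for _ in range(n_iter):
        w = w - alpha * m * ridge_gradient(w, X, y, lam)
    return w


# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
w_true = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=200)
lam = 1.0

w_gd = preconditioned_gd(X, y, lam)
# Closed-form ridge solution, used only to check the iterative solver.
w_closed = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)
print(np.max(np.abs(w_gd - w_closed)))
```

On a problem this small the closed-form solve is the obvious choice; the iterative form is shown because it is the one written on the slide, and it needs only matrix-vector products with X.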
5. More Than Just Training Models
• Regularization paths
• Model risk assessment
• Interpretability
[Figure: regularization path. Axes: model coefficient vs. regularization parameter]
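A regularization path traces how each model coefficient changes as the regularization parameter varies. The sketch below is one way to compute such a path for the ridge objective $\frac{1}{2}\|y - Xw\|_2^2 + \frac{\lambda}{2}\|w\|_2^2$; it is an illustration under assumptions, not the method used in the talk. The data are synthetic, the grid of $\lambda$ values is arbitrary, and `ridge_path` is a hypothetical helper name.

```python
import numpy as np


def ridge_path(X, y, lambdas):
    """Ridge coefficients for each lambda, reusing a single thin SVD of X.

    With X = U diag(s) V^T, the minimizer is
    w(lam) = V diag(s / (s^2 + lam)) U^T y.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Uty = U.T @ y
    return np.array([Vt.T @ (s / (s**2 + lam) * Uty) for lam in lambdas])


# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([2.0, -1.0, 0.0, 0.5]) + 0.1 * rng.normal(size=100)

lambdas = np.logspace(-2, 4, 7)     # 0.01, 0.1, ..., 10000
path = ridge_path(X, y, lambdas)    # one row of coefficients per lambda

for lam, w in zip(lambdas, path):
    print(f"lambda={lam:10.2f}  w={np.round(w, 3)}")
```

Plotting each column of `path` against `lambdas` on a log axis gives the kind of coefficient-versus-parameter curve the figure describes; the coefficients shrink toward zero as the parameter grows.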
6. Brief History of Statistical Learning
[Figure: families of methods compared on interpretability & statistical guarantees, scalability, and ease of use]
• Simple Models
• Kernel Methods
• Trees & Ensembles
• Structured Regularization
20. Some Bioinformatics Applications
• Personalized medicine, Memorial Sloan Kettering Cancer Center
– 35% accuracy improvement over state-of-the-art
• Metagenomic binning and DNA quality assessment, Stanford School of Medicine
– Previously unsolved problem
• Toxicogenomic analysis, Stanford University
– Improved on state-of-the-art results
21. Upcoming
• Massive-scale character-level sentiment and text analysis on Amazon data
– Billions of features, hours to solve a model
– Efficient multitask learning
• Characterize the global limitations of learning word structure
– Devise provably more efficient regularizers for uncovering structure
Editor's Notes
[Tons of data, show graph?]
[Models are not good]
[How do we quickly iterate with different models?]
[Memory $$$]