1. Translating a Trillion Points of Data into
Diagnostics, Therapies and New Insights
in Health and Disease
atul.butte@ucsf.edu
@atulbutte
Atul Butte, MD, PhD
Director, Institute for Computational
Health Sciences
University of California, San Francisco
5. The Cancer Genome Atlas
• 14 thousand cases
• 39 types of cancers
• 13 types of data: molecular, clinical, sequencing
6. 227 million substances x
1.3 million assays
More than a billion measurements
within a grid of 300 trillion cells
71 million meet Lipinski's rule of five
1.2 million active substances
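The figures on this slide can be checked with a few lines of arithmetic. The sketch below treats the slide's rounded counts as exact and reads "more than a billion" as a lower bound of one billion. Both are assumptions, and the variable names are illustrative only.

```python
# Rounded counts taken from the slide; treated as exact for illustration.
substances = 227_000_000        # substances
assays = 1_300_000              # assays
measurements = 1_000_000_000    # "more than a billion" (lower bound, assumption)
lipinski_ok = 71_000_000        # substances meeting Lipinski's rule of five
active = 1_200_000              # active substances

# Size of the substance-by-assay grid, if every pair had been measured.
grid_cells = substances * assays

# Share of grid cells that actually hold a measurement (lower bound).
fill_fraction = measurements / grid_cells

# Shares of all substances that are rule-of-five compliant or active.
lipinski_share = lipinski_ok / substances
active_share = active / substances

print(f"grid cells: {grid_cells:,}")
print(f"filled: at least {fill_fraction:.6%}")
print(f"rule-of-five share: {lipinski_share:.1%}")
print(f"active share: {active_share:.2%}")
```

The product comes to about 295 trillion cells, which the slide rounds to 300 trillion. On these numbers only a tiny fraction of the grid is filled.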
10. Preeclampsia: a major cause of maternal and fetal death
• Incidence
• 5-8% of all pregnancies in the U.S. and worldwide
• 4.1 million births in the U.S. in 2009
• Up to 300K cases of preeclampsia annually in the U.S.
• Mortality
• Responsible for 18% of all maternal deaths in the U.S.
• Maternal death in 56 out of every 100,000 live births in the U.S.
• Neonatal death in 71 out of every 100,000 live births in the U.S.
• Cost
• $20 billion in direct costs in the U.S. annually
• Average hospital stay of 3.5 days
Linda Liu
Bruce Ling
Matt Cooper
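As a rough consistency check on the incidence figures above, the sketch below applies the 5-8% range to the 2009 birth count. It assumes births can stand in for pregnancies, which the slide does not state, and the variable names are illustrative.

```python
# Figures from the slide.
births_2009 = 4_100_000           # U.S. births in 2009
pct_low, pct_high = 5, 8          # incidence range, in percent of pregnancies

# Assumption: use births as a proxy for pregnancies.
# Integer arithmetic keeps the results exact.
cases_low = births_2009 * pct_low // 100
cases_high = births_2009 * pct_high // 100

print(f"estimated annual cases: {cases_low:,} to {cases_high:,}")
```

Under that assumption the range is roughly 205K to 328K cases a year, the same order as the slide's "up to 300K". The slide's own basis or rounding may differ.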
12. New blood markers for preeclampsia
Linda Liu
Bruce Ling
Matt Cooper
@MarchofDimes
bit.ly/preeclamp
13. Need a diagnostic for preeclampsia
Public big data available
March of Dimes Center for Prematurity Research
Data analyzed, diagnostic designed
SPARK grant ($50k)
Life Science Angels, other seed investors ($2 million)
@CarmentaBio
progenity.com
bit.ly/carm_prog
16. Cancer Discovery 2013, 3:1.
Psychiatric Drug Imipramine Shows Significant Activity
Against Small Cell Lung Cancer
Figure panels: vehicle control vs. imipramine
p53/Rb/p130 triple-knockout model of SCLC
Mice dosed after tumor formation
Joel Dudley
Nadine Jahchan
Julien Sage
Alejandro Sweet-Cordero
Joel Neal
@NuMedii
17. Bin Chen
Wei Wei
Li Ma
Bin Yang
Mei-Sze Chua
Samuel So
Gastroenterology, 2017
18. Need more drugs for more diseases
Public big data available
NIH funding
Data analyzed, method designed
Company launched, ARRA, StartX, Stanford license, first deal
Claremont Creek, Lightspeed ($3.5 million)
@NuMedii
19. The next big open data: clinical trials
Download 100+ studies today
Drug repositioning, new patient subsets,
digital comparative effectiveness, more!
immport.org
Sanchita Bhattacharya
Elizabeth Thomson
25. Clinical Data Warehouse
A Big UC Healthcare Data Analytics Platform
Combining healthcare data from across the
six University of California medical schools and systems
29. ML lessons I’ve learned over 20 years
• Get the question right
– Solve the problems that health care professionals need solved: don't just guess
– And verify good questions and good unmet needs with more than one doc
• Build a great diagnostic vs. understanding the biology
– Perfectly lassoed variables may miss the big-picture biology
– Biologists and medical professionals really love explanations over black boxes
• Watch out for input-limited models
– Patients might not type in the right codes for their symptoms
– They barely enter their own race/ethnicity
– And docs?
• Learn what IRB, HIPAA, BAA are. Learn what ICD-10 and CPT codes are. CLIA and CAP.
– And learn patience.
– Not all of us are cloud-allergic.
• Not everything needs deep learning
• Having all data on everyone is super rare: genomics, images, and longitudinal EHR data?
• Health care inefficiency is not about friction
• Data integration and harmonization can happen if there is a business reason for it
• Platforms and their companies are seemingly commoditized
– Come to us with more medical knowledge and background. Convince us you care about this vertical.
– Show us that we are going to learn more from you, than we are going to have to teach you.
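The warning above about input-limited models can be made concrete by auditing how often each input field is actually filled in before any model is trained. The following is a minimal sketch on invented records. The field names, values, and the `missing_rates` helper are all hypothetical and not from the talk.

```python
from collections import Counter

# Hypothetical patient-entered records; fields and values are invented.
records = [
    {"age": 34, "race_ethnicity": None, "symptom_code": "R51"},
    {"age": 58, "race_ethnicity": "Asian", "symptom_code": ""},
    {"age": None, "race_ethnicity": "", "symptom_code": "R07.9"},
    {"age": 41, "race_ethnicity": None, "symptom_code": "R10.9"},
]

def missing_rates(rows):
    """Return, per field, the fraction of rows where it is None or empty."""
    missing = Counter()
    fields = set()
    for row in rows:
        for field, value in row.items():
            fields.add(field)
            if value is None or value == "":
                missing[field] += 1
    return {field: missing[field] / len(rows) for field in sorted(fields)}

rates = missing_rates(records)
for field, rate in rates.items():
    print(f"{field}: {rate:.0%} missing")
```

A field that is mostly empty, like the invented `race_ethnicity` column here, limits what any model built on it can learn, however sophisticated the method.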
30. Open challenges
• Can’t teach a computer with half the game. Need the start and finish of medical “stories”
– In medical care, need primary and tertiary care in the same database
– Compound data might be available, but trials data is not
– This means data integration and harmonization are a rate limiter
• Need the right diversity in data
– Otherwise might be extrapolating beyond what was learned
– Need enough data: large amounts of it, including true-positive cases
• Need methods to handle complicated multi-modal data
• What does validation mean?
– Drug discovery: pre-clinical success? Or Phase 2 success?
– Clinical accuracy? Or clinical utility?
• Career fear, uncertainty, and doubt (FUD) in some circles
– Tech recruitment
– Too many startups?
31. UC Clinical Data Warehouse Team
Executive Team
• Atul Butte
• Joe Bengfort
• Michael Pfeffer
• Tom Andriola
• Chris Longhurst
Steering Committee
• Irfan Chaudhry
• Mohammed Mahbouba
• Lisa Dahm
• David Dobbs
• Kent Andersen
• Ralph James
• Jennifer Holland
• Eugene Lee
ETL Team
• Albert Dugan
• Tony Choe
• Michael Sweeney
• Timothy Satterwhite
• Ayan Patel
• Niranjan Wagle
• Ralph James
• Joseph Dalton
Data Harmonization
• Dana Ludwig
• Daniella Meeker
Data Quality
• Momeena Ali
• Jodie Nygaard
Epic
• Kevin Ames
• Ben Jenkins
• Steve Gesualdo
Business Analyst
• Ankeeta Shukla
Hardware
• Sandeep Chandra
• Jeff Love
• Scott Bailey
• Kwong Law
• Pallav Saxena
Support
• Jack Stobo
• Michael Blum
• Sam Hawgood
32. Support
Admin and Tech Staff
• Mary Lyall
• Mounira Kenaani
• Kevin Kaier
• Boris Oskotsky
• Mae Moredo
• Ada Chen
• University of California, San Francisco
• Priscilla Chan and Mark Zuckerberg
• NIH: NIAID, NLM, NIGMS, NCI, NHLBI, OD, NIDDK, NHGRI, NIA, NCATS
• March of Dimes
• Juvenile Diabetes Research Foundation
• Hewlett Packard
• Howard Hughes Medical Institute
• California Institute for Regenerative Medicine
• Luke Evnin and Deann Wright (Scleroderma Research Foundation)
• Clayville Research Fund
• PhRMA Foundation
• Stanford Cancer Center, Bio-X, SPARK
• Tarangini Deshpande
• Kimayani Butte
• Sam Hawgood and Keith Yamamoto
• Isaac Kohane