Preparing For Genomic Medicine
September 2016: CXO Forum
Chris Dwan
Director, IT Architecture and Strategy
cdwan@broadinstitute.org @fdmts
Conclusions
• We are still in the early days of genomic medicine.
• Organizations that are effective at collaboration and
integrative data analysis will lead the next decade of
biomedical delivery*
• Privacy, security, and ensuring appropriate access to data
will be necessary challenges
• Technology and technologists will be a key differentiator
* Increasingly, you must be competent at these things to even participate.
Coming soon, to a patient near you
This will be the new normal and
public expectation within 5 years
The Broad Institute
• Non-profit biomedical research
institute founded in 2004
• Fifty core faculty members,
from MIT and Harvard, plus
hundreds of associate
members.
• ~1000 directly affiliated
personnel
• 2,400+ associated researchers
Programs and Initiatives
focused on specific disease or biology areas
• Cancer
• Genome Biology
• Cell Circuits
• Psychiatric Disease
• Metabolism
• Medical and Population Genetics
• Infectious Disease
• Epigenomics
Platforms
focused on technological innovation and application
• Genomics
• Data Sciences
• Therapeutics
• Imaging
• Metabolite Profiling
• Proteomics
• Genetic Perturbation
“This generation has a historic opportunity and responsibility
to transform medicine by using systematic approaches in the
biological sciences to dramatically accelerate the
understanding and cure of disease”
Genomic Data Production @ Broad
~140 Whole Genome
Sequences / day
~1PB data / month raw (> 40PB total holdings)
~15k cores (hybrid cloud) dedicated to primary analysis
100Gb/sec link to Internet2
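To put the figures above in proportion, here is a rough back-of-the-envelope sketch. It assumes decimal units (1 PB = 10^15 bytes), a 30-day month, and a link running at its full line rate with no protocol overhead; none of those assumptions come from the slides, and the function and constant names are illustrative only.

```python
# Back-of-the-envelope check of the production figures on this slide.
# Assumptions (not from the slides): decimal units, 30-day month,
# full line rate with no protocol overhead.

PETABYTE_BYTES = 10**15               # 1 PB in bytes (decimal)
LINK_BITS_PER_SEC = 100 * 10**9       # 100 Gb/sec link
SECONDS_PER_MONTH = 30 * 24 * 3600    # assumed 30-day month


def transfer_hours(n_bytes, bits_per_sec):
    """Hours needed to move n_bytes over a link of the given speed."""
    return n_bytes * 8 / bits_per_sec / 3600


def sustained_gbps(n_bytes, seconds):
    """Average rate in Gb/sec needed to move n_bytes within `seconds`."""
    return n_bytes * 8 / seconds / 10**9


hours_per_pb = transfer_hours(PETABYTE_BYTES, LINK_BITS_PER_SEC)
monthly_rate = sustained_gbps(PETABYTE_BYTES, SECONDS_PER_MONTH)

print(f"1 PB at full line rate: {hours_per_pb:.1f} hours")
print(f"1 PB spread over a month: {monthly_rate:.2f} Gb/sec sustained")
```

Under these assumptions, a month of raw output would take a little under a day to move at line rate, and averages out to a few percent of the link's capacity.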
Genomic Data Production in Context
We have learned a
vast amount in the
last decade
The question is no longer “can we do
this?” but “what shall we do?”
People @ Broad
“The future is already here – it’s just not very well
distributed”
William Gibson
The right side of history
• Applications are containerized (Docker)
• Data is accessed RESTfully (S3 standard)
• Identity management is federated (OAuth2)
• Analytics are ubiquitous (HDFS / Spark)
• Public clouds (AWS, GCS, Azure) provide flexible commodity
infrastructure and surge capacity
• Data flow operations adopt serverless architectures (Lambda)
• Technologists are embedded in project teams (DevOps)
This is a multi-year
journey. Start today.
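As one illustration of the "RESTful data access" and "federated identity" bullets above, the sketch below builds (but does not send) an HTTP GET for a single object, carrying an OAuth2-style bearer token. The endpoint, bucket, key, and token are all hypothetical placeholders. Note that Amazon S3 itself signs requests with AWS Signature Version 4 rather than bearer tokens, so this shows the general pattern, not the S3 wire protocol.

```python
# Sketch: object storage addressed as a plain URL, with identity carried
# in a standard Authorization header. All names below are hypothetical.
from urllib.parse import quote
from urllib.request import Request


def build_object_request(endpoint, bucket, key, token):
    """Build (without sending) a GET request for one stored object."""
    # quote() leaves "/" unescaped by default, so nested keys keep their path.
    url = f"{endpoint.rstrip('/')}/{quote(bucket)}/{quote(key)}"
    return Request(
        url,
        method="GET",
        headers={"Authorization": f"Bearer {token}"},
    )


req = build_object_request(
    "https://objects.example.org",   # hypothetical endpoint
    "genomes",                       # hypothetical bucket
    "sample-001/reads.cram",         # hypothetical object key
    "example-token",                 # placeholder token
)
print(req.get_method(), req.full_url)
```

The point of the pattern is that any tool able to speak HTTP can reach the data, and access decisions travel with the request rather than living in a particular filesystem or application.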
Transition to Public Clouds
Genomes on the Cloud (April 2016)
Testing the genome analysis pipeline
“Go-live”
3rd Party Companies Fill Cloud Feature Gaps
CloudHealth dashboard atop the billing API
Storage $$
Network $$
Governance remains critical
$$ !!
The new normal
You move towards and become like that which you
think about.
The Big Data Healthcare Feeding Frenzy
• “If we sequence X new patients with condition Y every year,
the sequencing data alone will take up ALL THE
EXABYTES”*
• The data storage and analysis needs of precision /
personalized / genomic medicine are not unreasonable by
comparison with major, data-driven industries.
• We can compensate by being thoughtful about what data we
store, how we store it, and how we share it.
* If you multiply a number by a sufficiently large number the product is a large number.
We’re starting to get a handle on
the basics
• Reduced footprint for genomic data
– 30X WGS: 200GB ==> 40GB
• Increasingly standardized and well integrated variant
calling and annotation tools
• Powerful public reference sets and tools
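A quick worked example of what the reduced footprint means at scale. It reuses the ~140 genomes/day production rate quoted earlier in this deck and assumes decimal units (1 GB = 10^9 bytes) and a 365-day year; the names in the code are illustrative, not from the slides.

```python
# Storage arithmetic for the 200GB ==> 40GB reduction on this slide.
# Assumptions: decimal units, 365-day year, ~140 genomes/day
# (the production rate quoted earlier in the deck).

GB = 10**9
PB = 10**15
EB = 10**18

GENOMES_PER_DAY = 140
FOOTPRINT_BEFORE = 200 * GB   # 30X WGS, original footprint
FOOTPRINT_AFTER = 40 * GB     # 30X WGS, reduced footprint


def yearly_petabytes(bytes_per_genome, genomes_per_day=GENOMES_PER_DAY, days=365):
    """Storage added per year, in petabytes."""
    return bytes_per_genome * genomes_per_day * days / PB


def genomes_per_exabyte(bytes_per_genome):
    """How many genomes of this size fit into one exabyte."""
    return EB // bytes_per_genome


print(f"Per year at 200GB each: {yearly_petabytes(FOOTPRINT_BEFORE):.2f} PB")
print(f"Per year at 40GB each:  {yearly_petabytes(FOOTPRINT_AFTER):.2f} PB")
print(f"Genomes per exabyte at 40GB each: {genomes_per_exabyte(FOOTPRINT_AFTER):,}")
```

At the reduced footprint, a year of production at this rate comes to about 2 PB, and a single exabyte would hold 25 million genomes, which is why the "all the exabytes" argument depends so heavily on what is stored and how.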
Fear, Uncertainty, and Doubt
… people who had nothing to do with the design and execution of the study …
… use another group’s data for their own ends …
… even use the data to try to disprove what the original investigators had posited …
… some researchers have characterized as “research parasites”
The regulatory framework
• Under current law and practice, there is very limited
organizational upside to sharing PHI and EMR.
• Data use policies: Financial risk
• Research participation: Risk to privacy
• De-identification (AKA data mutilation) is not a viable long-term
strategy in the age of analytics
• Also, the compliance process, even in lightweight versions,
is killing our ability to innovate.
“To be without method is deplorable, but to depend
entirely on method is worse.”
The Mustard Seed Garden Manual of Painting, 1679
Appropriate Usage: A new framework
Any person
Should have appropriate access to any and all data
Necessary to correctly answer appropriate questions
This looks almost nothing like our current regulatory framework.
What we need
• Incentive structures that reward making data accessible
and useful
– All indicators except the benefit of the patient lead to suboptimal behavior
– This will require courage.
• National / global scale data repositories, standards,
and toolkits
– Death to walled gardens, monolithic systems, and GUIs.
– Life to APIs built for a global community (cf. Amazon, 2002)
• Open, fearless conversation about data protection vs.
appropriate use
– Genomic data is inherently personally identifiable and should be treated as such
– “Appropriate usage” goes well beyond legal conformity
Standards are needed for genomic data
“The mission of the Global Alliance for Genomics
and Health is to accelerate progress in human
health by helping to establish a common framework
of harmonized approaches to enable effective and
responsible sharing of genomic and clinical data,
and by catalyzing data sharing projects that drive
and demonstrate the value of data sharing.”
Regulatory Issues
Ethical Issues
Technical Issues
This stuff is important
We have an opportunity to change lives and health
outcomes, and to realize the gains of genomic medicine, not
in an indefinite future, but this year.
We also have an opportunity to waste vast amounts of
money (very rapidly) and still not really help anybody.
I would like to work together with you to build a better future.
cdwan@broadinstitute.org
The right side of history
• Applications are containerized (Docker)
• Data is accessed RESTfully (S3 standard)
• Identity management is federated (OAuth2)
• Analytics are ubiquitous (HDFS / Spark)
• Public clouds (AWS, GCS, Azure) provide flexible commodity
infrastructure and surge capacity
• Data flows and transforms adopt serverless architectures (Lambda)
• Technologists are embedded in project teams (DevOps)
This is a multi-year
journey. Start today.
Thank You
