Separating Signal from Noise in the Age of 
Genomics & Big Data: 
A Public Health Approach 
Muin J. Khoury MD, PhD 
CDC Office of Public Health Genomics 
NCI Epidemiology & Genomics Research Program
Outline 
• Big Data & Causation in the Age of Genomics
• Promises of Genomics & Big Data
• Challenges of Genomics & Big Data
• A Public Health Approach to Realize Potential of Genomics & Big Data
A Case Study: Searching for Needles in the 
Haystack- The CDC HuGE Navigator 
http://www.hugenavigator.net/HuGENavigator/home.do
Text Mining Tool To Find HuGE Articles 
in Published Literature 
• PubMed signal-to-noise ratio is very low
• Support Vector Machine (SVM) tool generated in 2008
• Based on >3800 words in text, extensively validated
• Sensitivity & specificity >97%
• Since 2008, genetic epidemiology literature has changed considerably
• Performance of SVM model was significantly reduced (60%)
• In 2014, a retrained SVM using >4500 words pushed sensitivity and specificity to >90%
Yu W et al. BMC Bioinformatics, 2008
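The sensitivity and specificity figures quoted for the classifier can be made concrete with a minimal sketch. The labels and predictions below are made-up toy data, not HuGE Navigator results, and `sensitivity_specificity` is a hypothetical helper name introduced only for illustration.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels coded 1 (relevant) / 0 (not relevant)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)  # share of truly relevant articles that were flagged
    specificity = tn / (tn + fp)  # share of irrelevant articles that were correctly rejected
    return sensitivity, specificity


# Toy data (assumption): 4 relevant and 6 irrelevant articles.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Re-running the same calculation on freshly labeled articles is one simple way to notice the kind of performance drift the slide describes.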
Application of Data Mining in the Prediction 
of Type 2 Diabetes in the United States 
• 1999-2004 National Health and Nutrition Examination Survey
• Developed and validated SVM models for diabetes, undiagnosed diabetes & prediabetes using numerous variables in survey
• Discriminative ability: area under the ROC curve of 84% and 73%
• Validated known risk factors for diabetes
• Not clear which models and variables are best, or how applicable the results are to other populations
• Proof of concept only
Yu W et al. BMC Medical Informatics 2010
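The area under the ROC curve reported above can be read as the probability that a randomly chosen case receives a higher score than a randomly chosen non-case. The sketch below computes it by direct pairwise comparison (the Mann-Whitney formulation); the labels and scores are made-up toy values, not NHANES data, and `roc_auc` is a hypothetical helper name.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (case, non-case) pairs ranked correctly; ties count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (assumption): 3 cases and 4 non-cases with model scores.
y_true = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.8, 0.3, 0.2, 0.1]
auc = roc_auc(y_true, scores)
print(f"AUC = {auc:.3f}")
```

The pairwise form is quadratic in sample size, so it suits illustration rather than survey-scale data.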
The IOM Ecological Model & the Need for 
Multilevel Analysis of “Causation”
Obesity Example NEJM 2007;357:404-7 
IOM Ecological Model
Genomics & Big Data 
The Genome is Just the Beginning 
“We will all be surrounded by a personal cloud of billions of data points” L. Hood (ISB)
Big Data: From Association to Prediction 
How about Causation? 
• Association
• Replication
• Classification
• Prediction
• ?CAUSATION
• Does Big Data care about “Causation”?
• Intervention is based on cause-effect relationships
The Promises of Genomics & Big Data 
The Economist
The Promises of Genomics & Big Data 
• Workup of Rare & Familial Diseases
NEJM June 2014
The Promises of Genomics & Big Data
• Improved Disease Classification
The Promises of Genomics & Big Data
• Improved Measurement of the “Environment”
http://www.niehs.nih.gov/research/programs/geh/geh_newsletter/2014/4/spotlight/index.cfm
The Promises of Genomics & Big Data 
• Better Understanding of Natural History
G Ginsburg
The Promises of Genomics & Big Data
• Stratified Prevention (One size does not fit All)
No one is average: “population medicine: let’s get over it” (E. Topol)
The Promises of Genomics & Big Data
• Precision Medicine
The Promises of Genomics & Big Data
• Pathogen Genomics
The Promises of Genomics & Big Data 
• Public Health Practice
“As cholera swept through London in the mid-19th century, a physician named John Snow painstakingly drew a paper map indicating clusters of homes where the deadly waterborne infection had struck. In an iconic feat in public health history, he implicated the Broad Street pump as the source of the scourge - a founding event in modern epidemiology. Today, Snow might have crunched GPS information and disease prevalence data and solved the problem within hours”
http://www.hsph.harvard.edu/news/magazine/big-datas-big-visionary/?utm_source=SilverpopMailing&utm_medium=email&utm_campaign=Kiosk%2009.25.14_academic%20(1)&utm_content
Some Promises of Genomics & Big Data 
• Workup of Rare & Familial Diseases
• Improved Disease Classification
• Improved Measurement of the “Environment”
• Better Understanding of Disease Natural History
• Stratified Prevention
• Precision Medicine
• Pathogen Genomics
• Public Health Practice
The Challenges of Genomics & Big Data 
• Problems of Study Designs & Hidden Biases
“…claims are based upon complex (and we believe flawed) analyses…there are far simpler alternative explanations for the patterns they observed. We believe that the authors have not excluded important alternative explanations” G. Breen
“Schizophrenia is Eight Different Diseases Not One” USA Today (9/15/2014)
“Eight types of schizophrenia? Not so fast” Genomes Unzipped (9/30/2014)
Am J Psychiatry Sep 2014
The Challenges of Genomics & Big Data 
• Analytic Issues: Dealing with Complexity
Prediction of LDL cholesterol response to statin using transcriptomic and genetic variation. Kyungpil Kim et al. Genome Biology, Sep 2014
The Challenges of Genomics & Big Data 
• Reproducibility
Lots of input variables
Molecularly defined disease subsets & precursors
Millions of genetic variants
Am J Clin Nutrition 2013
The Challenges of Genomics & Big Data 
• Causation, Ecologic Fallacies & Hubris
‘The Scientific Method Itself is Growing Obsolete.’ (A. Butte, Sep 2014)
“..implicit assumption that big data are a substitute for, rather than a supplement to, traditional data collection and analysis.”
http://blogs.kqed.org/science/audio/how-big-data-is-changing-medicine/
Garbage In, Garbage Out (GIGO)
The Challenges of Genomics & Big Data 
• Beyond Prediction: From Validity to Utility
The Challenges of Genomics & Big Data
• Challenges of Population Stratification & Precision Medicine
Some Challenges of Genomics & Big Data 
• Problems of Study Designs & Hidden Biases
• Analytic Issues: Dealing with Complexity
• Reproducibility and Replication
• Causation vs Association: Ecologic Fallacies & Hubris
• Translation: from Validity into Utility and Implementation
• Challenges of Population Stratification & Personalized Medicine
A Public Health Translation Framework for Genomics & Big Data
[Figure: translation framework with phases T0, T1, T2, T3 and T4 and a Knowledge Integration component. Labels: Discovery; Development; Application; Evaluation; Evidence-based Recommendation or Policy; Implementation Science; Health Care & Prevention Programs; Effectiveness & Outcomes Research (CER, PCOR), Economics, ELSI; Population Health; Basic, Clinical & Population Sciences]
Khoury MJ et al, AJPH, 2012
A Public Health Approach to Realizing 
Promises of Genomics & Big Data 
• 1. Use a Strong Epidemiologic Foundation
• The study of distribution and determinants of disease occurrence and outcomes in populations, and using resulting knowledge to improve health and prevent disease
• Fundamental science of medicine and public health
• Human Genome Epidemiology (HuGE): Beyond Gene Discovery
• New Brand of “Big Data Epidemiology” 2010
Epidemiologic Cohort Studies: 
The NCI Cohort Consortium 
• Investigators responsible:
– 40+ high-quality cohorts
– 4+ million people
• Coordinated, interdisciplinary approach
• Tackle important scientific questions, economies of scale, and opportunities to quicken the pace of research
• Focused so far mostly on etiology, but adapting to include outcomes
• Major role in identifying specific carcinogenic environmental agents
▫ Asbestos – Lung
▫ Benzene – Leukemia
▫ Smoking – many diseases
• Exposure/risk factor assessment prior to onset of disease
▫ Overcome recall/selection biases
• Permit absolute measures of risks/incidence rates
▫ Relevant for public health policies
• Valuable resource for studying repeated measures and multiple outcomes
Epidemiology Data Sharing & Harmonization 
Nature, August 27, 2014
A Public Health Approach to Realizing 
Promises of Genomics & Big Data 
• 2. Develop a Robust Knowledge Integration Process
Components of Knowledge Integration 
• Knowledge Management: Integration of knowledge from disparate sources & disciplines
• Knowledge Synthesis: Systematic synthesis of scientific findings
▫ Accumulating evidence on a cancer outcome: minimize waste in repeat funding
▫ Identify scientific gaps: inform research priorities
• Knowledge Translation
▫ Stakeholder engagement
▫ Evidence-based information
▫ Decision support tools
Interpretation 
“The Bottleneck for Realizing Personalized Medicine”
(Good et al. Genome Biology Sep 2014)
The NIH BD2K Initiative Can Help
A Public Health Approach to Realizing 
Promises of Genomics & Big Data 
• 3. Use (and not avoid) Principles of Evidence-based Medicine and Population Screening
Guidelines We Can Trust (IOM, 2011)
Guidelines We Can Trust in Genomic Medicine 
(Schully S et al. Genetics in Medicine 2014)
CDC-Sponsored 
EGAPP Working Group 
• Independent, multidisciplinary, non-federal panel established in 2004
• Established a systematic, evidence-based process to assess validity & utility of genomic tests & family health history applications
• New methods for evidence synthesis and modeling in 2013, including next generation sequencing and stratified cancer screening based on family history
• 10 recommendation statements to date:
• Colorectal cancer, breast cancer, heart disease, clotting disorders, depression, prostate cancer, diabetes, and more
• Clinical Validity vs Clinical Utility
• Uncovered evidence gaps that require additional research
• Principles can be applied to other “Big Data”
Evidence-based Classification of Genomic 
Applications in Practice 
Tier 1 
Tier 2 
Tier 3 
http://www.cdc.gov/genomics/gtesting/tier.htm
Evidence-based Binning of the Genome 
Genetics in Medicine 2011
A Public Health Approach to Realizing 
Promises of Genomics & Big Data 
• 4. Develop a Robust T2+ Translational Research Agenda
Limited Translational Research in Genomics 
Beyond the Bedside 
T0 ↔ T1 ↔ T2 ↔ T3 ↔ T4
Discovery to Application; Application to Guideline; Guideline to Practice; Practice to Population Health Impact
Khoury MJ, 2007; Schully, 2012; Clyne M, 2014
<1% of published genomics research in T2–T4
Multiple clinical and population scientific disciplines involved
Cancer Genomics Research Funding T2+ 
Public Health Genomics 2010
A MultiDisciplinary T2+ Research Agenda 
• Comparative Effectiveness Research
• Patient-centered Outcomes Research
• Behavioral, Social & Communication Sciences
• Economic Studies
• Surveillance & Population Monitoring
A Public Health Approach to Realizing 
Promises of Genomics & Big Data 
• Use a Strong Epidemiologic Foundation
• Develop a Robust Knowledge Integration Process
• Use (and not avoid) Principles of Evidence-based Medicine and Population Screening
• Develop a Robust T2+ Research Agenda (Learning Health Systems, Consumer Involvement, etc.)
In Summary 
 ā€œBig Dataā€ is agnostic to disease causation 
 Numerous promises for health impact of genomics 
& Big Data- Leading edge in genomics in Big Data 
beginning to be applied 
 But numerous challenges face genomics & Big 
Data. So we should not overpromise & under 
deliver 
 A ā€œPublic Healthā€ translational approach Is needed 
to realize potential of genomics & Big Data

Khoury ashg2014
