Vice President Discusses Pharmacogenomics and Big Data

Gerry Higgins, Ph.D., M.D.
Vice President, Pharmacogenomic Science
AssureRx Health, Inc.

AssureRx Health, Inc. CONFIDENTIAL

» The Human Genome
» Explosive Growth in Sequence Data
» The ‘Big Data’ Problem
» The ‘Diminishing Discovery’ Problem
» Human Genome Variation and Pharmacogenomics
» Evolution of next generation sequencing (NGS) technology
» Future Trends


The Human Genome

• ~3.2 billion base pairs¹

• 22,500 ± 2,000 genes² (= ~1.3% of genome)

• 100,000 – 500,000 proteins, depending on tissue³

¹International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature 2004, 431, 931-945.
²Pertea M and Salzberg SL. Between a chicken and a grape: estimating the number of human genes. Genome Biology 2010, 11:206.
³Ramsköld D et al. An abundance of ubiquitously expressed genes revealed by tissue transcriptome sequence data. PLoS Computational Biology 2009 5(12).


The Human Genome - Regulation


Example: Alternative splicing of mRNAs
[Figure: Mechanisms vs. percentage of alternatively-spliced genes¹ (values shown: 48%, 16%, 16%)]

¹Yeo G et al. Variation in alternative splicing across human tissues. Genome Biology 2004, 5:R74.


Example: Brain-specific methylation patterns¹
• As determined by Methylated DNA immunoprecipitation (MeDIP) – genome-wide methylation analysis
• CpG Islands (CGI) tend to be the most highly methylated regions of the genome – GC-rich promoters of genes tend to be the most hypo-methylated GC sequences
• The most methylated regions of the genome are related to genes involved in brain development – BDNF, CACNA1A and CACNA1F (calcium-channel genes involved in neuronal growth and development and controlling the release of neurotransmitters), and GRIK5 (a receptor for the excitatory neurotransmitter glutamate).
[Figure: unsupervised hierarchical cluster analysis (a statistical measure of the difference between values) of cerebral cortex, cerebellum and blood samples]

¹Davies M et al. Functional annotation of the human brain methylome identifies tissue-specific epigenetic variation across brain and blood. Genome Biology 2012, 13:R43.

Example: Interactome – Variants in Genes in the Same Pathway Predict Susceptibility to Disease¹,²
Major Depressive Disorder:
GENE SNP
PDE6C rs7903947
BDNF rs7927728
GHRHR rs2228078
PSMD9 rs1168658
HSD3B1 rs2208382

¹Wong M-L et al. Prediction of susceptibility to major depression by a model of interactions of multiple functional genetic variants and environmental factors. Molecular Psychiatry, 2012 17:624-633.
²Barrenas F et al. Highly interconnected genes in disease-specific networks are enriched for disease-associated polymorphisms. Genome Biology 2012, 13:R46.


Explosive Growth in Sequence Data

As the cost of DNA sequencing falls, the growth of human genome data becomes exponential.


The ‘Big Data’ Problem

Lee Hood, IOM February 27, 2012

The ‘Big Data’ Problem
“The world is shifting to an
innovation economy and nobody
does innovation better than
America.”
—President Obama, 12/6/2011

• Pillars of Bioeconomy R&D:
1) Synthetic Biology
2) Proteomics
3) Information Technology: Bioinformatics & Computational Biology


The ‘Diminishing Discovery’ Problem


FDA’s Solution: Adaptation in the Pre-Competitive Space
SCREENING TRIAL
Investigational drugs & associated PGx markers -> Achieve surrogate end point predictive of clinical outcome -> Promising drug candidate & associated PGx marker

CONFIRMATORY TRIAL
Promising drug candidate & associated PGx marker -> Replicate surrogate end point -> Achieve clinical outcome (regulatory standard for FDA approval)

FDA APPROVAL
Accelerated drug approval with approval of PGx biomarker -> Full drug approval

*Slide adapted, with permission, from Janet Woodcock and Issam Zineh, CDER, FDA


Pre-Competitive Collaboration: Solution for Pharma

• Share use cases/questions – gaps in current tools
• Identify common solutions & options
• Share development risk/costs
• Build interoperability standards into platforms
• Publicly share experiences - good & bad
• PPP (public-private-partnership) infrastructure
• Build portable talent base/experts across sites
• Compile innovations from participating groups
• Follow European model – share trial participants
• Faster path for FDA drug approval


tranSMART: Bioinformatics & shared data analytics platform
• tranSMART is an open source informatics software platform that allows
pharmaceutical, diagnostic and medical device companies to share “pre-competitive”
data and a set of common tools for analysis of data. The license protects the
intellectual property of all stakeholders.
• Dr. Eric Perakslis, now CIO and Chief Scientist (Informatics) at the FDA, originally
developed tranSMART when he served as a research scientist at Johnson &
Johnson. tranSMART is based on the i2b2 informatics platform.
• tranSMART has been adopted more broadly in Europe than in the U.S. An example
of a study where “pre-competitive” data were shared (KM: Knowledge
Management):

U-BIOPRED (Unbiased BIOmarkers in PREDiction of respiratory disease outcomes)¹

¹Bel EH et al. Diagnosis and definition of severe refractory asthma: an international consensus statement from the Innovative Medicine Initiative (IMI). Thorax. 2011 66(10):910

One Mind Integrative Informatics Platform
Genome Proteome Signaling Phenome Disease

Integrative Analyses Managed Thru Cloud-Based Portal

One Mind Portal™: builds off of the tranSMART Data Knowledge Management System


Human Genome Variation as determined by NGS
“The ability of sequencing to detect a site that is segregating in the population is dominated by two
factors:
1. Whether the non-reference allele is present among the individuals chosen for sequencing, and;
2. The number of high quality and well mapped reads that overlap the variant site in individuals who
carry it.
Simple models show that for a given total amount of sequencing, the number of variants discovered is
maximized by sequencing many samples at low coverage. This is because high coverage of a few
genomes, while providing the highest sensitivity and accuracy in genotyping a single individual, involves
considerable redundancy and misses variation not represented by those samples.”¹
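
The trade-off described in the quotation can be illustrated with a toy model. The sketch below is not the model used in the cited paper; it assumes that each haplotype carries the variant independently, that the number of reads covering one haplotype is Poisson-distributed with mean equal to half the per-sample coverage, and that a carrier is detected once at least `min_reads` reads show the allele. The budget of 400 "fold x samples" and the 1% allele frequency are hypothetical values chosen for illustration.

```python
import math


def poisson_at_least(k: int, mean: float) -> float:
    """Return P(X >= k) for X ~ Poisson(mean)."""
    below = sum(math.exp(-mean) * mean ** i / math.factorial(i) for i in range(k))
    return 1.0 - below


def discovery_probability(n_samples: int, coverage: float,
                          allele_freq: float, min_reads: int = 2) -> float:
    """Chance that a variant is detected in at least one of 2 * n_samples haplotypes.

    Illustrative assumptions (not taken from the cited paper):
    - each haplotype carries the variant independently with probability allele_freq
    - reads covering one haplotype at the site ~ Poisson(coverage / 2)
    - a carrier haplotype is detected when it has >= min_reads supporting reads
    """
    p_detect = poisson_at_least(min_reads, coverage / 2.0)
    p_hit = allele_freq * p_detect
    return 1.0 - (1.0 - p_hit) ** (2 * n_samples)


TOTAL_COVERAGE = 400.0  # hypothetical fixed budget: samples x fold coverage

for n in (10, 25, 50, 100, 200):
    c = TOTAL_COVERAGE / n
    p = discovery_probability(n, c, allele_freq=0.01)
    print(f"{n:4d} samples at {c:5.1f}x -> P(discover 1% variant) = {p:.3f}")
```

Under these assumptions, spreading the same budget over 100 samples at 4x finds a 1% variant more often than 10 samples at 40x, but the benefit reverses once coverage becomes too thin to call the allele at all.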

Genome variants of different types, determined by low coverage sequencing of individuals, trios (e.g., mother, father and daughter) and exons. These data are derived from the 1000 Genomes Project.¹
• Note that they did not attempt to resolve Copy Number Variants (CNVs) or Variable Number of Tandem Repeats (VNTRs), which convey inter-individual variation.
• Note the large percentage of novel SNPs that were discovered by NGS.
[Figure: bar chart of known versus novel variants (0% to 100%) for transposons, duplications, deletions, insertions and SNPs]

¹Durbin et al. A map of human genome variation from population-scale sequencing. 2010. Nature 467: 1061-1073.

Genome Variation and Pharmacogenomics
Some important points about Single Nucleotide Polymorphisms (SNPs):
• All methods to determine human genome variation contain error.
• So-called “common” SNPs, with a frequency of >0.5%, have yielded modest effects in genome-wide association scans (GWAS) of complex diseases.
• Early results from pharmacogenomic GWAS appear to indicate a greater ability to discover SNPs with substantial effect size. Nevertheless, they do not explain the full extent of human genome variation and drug response. Pharmacogenomic GWAS are limited in power by small cohort sizes.¹
• Although each human genome may have ~3 M SNPs, only some of these variants are deleterious.
• SNPs have been the easiest genomic variant to measure, but other variants, such as Copy Number Variants (CNVs), may be more important determinants of drug response.²
• Most variants that impact individual drug response have not yet been identified.³*

¹Guessous, I., Gwinn, M. & Khoury, M.J. Genome-wide association studies in pharmacogenomics: untapped potential for translation. Genome Med 1, 46 (2009); Group, S.C. et al. SLCO1B1 variants and statin-induced myopathy - a genomewide study. N Engl J Med 359, 789-799 (2008); Sato, Y. et al. A new statistical screening approach for finding pharmacokinetics related genes in genome-wide studies. Pharmacogenomics J 9, 137-146 (2009); Crowley, J.J., Sullivan, P.F. & McLeod, H.L. Pharmacogenomic genome-wide association studies: lessons learned thus far. Pharmacogenomics 10, 161-163 (2009).
²Rasmussen H B et al. Genome-wide identification of structural variants in genes encoding drug targets: possible implications for individualized drug therapy. Pharmacogenetics and Genomics. July 2012. 22 (7): 471-483.
³Durbin et al. A map of human genome variation from population-scale sequencing. 2010. Nature 467: 1061-1073. *FDA.


Allele-Specific PCR cannot accurately detect SNPs¹:

[Figure: two panels, each labelled “Unknown SNP”]

¹Favis, R. Applying next generation sequencing to pharmacogenomics studies in clinical trials.


High throughput genotyping platforms cannot accurately resolve allelic variants of the CYP2D6 superfamily¹:
Genome-wide arrays, including some that are specifically configured to examine pharmacogene variants, were poor at discriminating CYP2D6 alleles.

¹Gamazon ER et al. The limits of genome-wide methods for pharmacogenomics testing. Pharmacogenetics and Genomics. 2012. 22:261–272.


Some important points about Next Generation Sequencing (NGS):
• All methods to determine human genome variation contain error.
• All ‘short read’ NGS methods rely on the use of a “reference genome” as ground truth, when the various reference genomes have been shown to have unusual variation.¹
• Short read NGS technology is fraught with errors, and thus either requires 60-100 fold coverage for a single individual, or low coverage whole genome sequence data from a large population.² The most accurate results have been obtained from sequencing the whole genomes of closely-related individuals, along with inclusion of other data related to family medical history.¹,³
• Short read NGS technology is especially poor at calling variants in GC-rich regions of the genome such as CpG islands.
• The real value is provided by long read technology, which has been implemented by Complete Genomics, but they have a backlog of genomes to sequence under contract (~27,354 as of 6/12).
• So-called ‘clinical’ or bench-top sequencers, such as Illumina’s MiSeq or Life Technologies’ Ion Torrent, manifest all the problems associated with short read technology, including extensive pre-processing of tissue samples and complex data analysis.

¹Dewey et al. Phased whole-genome genetic risk in a family quartet using a major allele reference sequence. PLoS Genet. 2011 September; 7(9): e1002280.
²Durbin et al. A map of human genome variation from population-scale sequencing. 2010. Nature 467: 1061-1073.
³Patel C J et al. Data-driven integration of epidemiological and toxicological data to select candidate interacting genes and environmental factors in association with disease. Bioinformatics. 2012 Jun 15;28(12):i121-i126.
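
To put the 60-100 fold coverage figure in perspective, mean coverage is simply the total number of sequenced bases divided by the genome size. The sketch below uses the ~3.2 billion base pair genome size and the ~35 base short-read length quoted elsewhere in this deck; the function names are illustrative, and the 4x case is a hypothetical low-coverage example.

```python
GENOME_SIZE_BP = 3.2e9  # approximate human genome size quoted in this deck


def mean_coverage(n_reads: float, read_length_bp: float,
                  genome_size_bp: float = GENOME_SIZE_BP) -> float:
    """Average fold coverage = total sequenced bases / genome size."""
    return n_reads * read_length_bp / genome_size_bp


def reads_needed(target_coverage: float, read_length_bp: float,
                 genome_size_bp: float = GENOME_SIZE_BP) -> float:
    """Number of reads required to reach a target mean coverage."""
    return target_coverage * genome_size_bp / read_length_bp


for fold in (4, 60, 100):
    n = reads_needed(fold, read_length_bp=35)
    print(f"{fold:3d}x with 35 bp reads needs about {n:.2e} reads")
```

This is an average only: it ignores mapping failures and uneven coverage, which is one reason real projects sequence deeper than the nominal target.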


Whole genome sequencing & analysis has been able to resolve pharmacogene variation on a genome-wide level, including the various alleles of the CYP2D6 superfamily¹:

Allele  Effect on Metabolism    Allele  Effect on Metabolism    Allele  Effect on Metabolism
*1      Fully functional        *14     Null                    *33     Fully functional
*2      Fully functional        *14A    Null                    *35     Fully functional
*3      Null                    *14B    Null                    *36     pseudogene
*4      Null                    *15     Null                    *37     Reduced activity
*5      Null                    *16     Null                    *38     Null
*6      Null                    *17     Reduced activity        *39     pseudogene
*7      Null                    *18     Null                    *40     Null
*8      Null                    *19     Null                    *41     Reduced activity
*9      Reduced activity        *20     Null                    *42     Null
*10     Reduced activity        *25     pseudogene              *43     pseudogene
*10AB   Reduced activity        *26     pseudogene              *44     Null
*11     Reduced activity        *29     Reduced activity        *45     Reduced activity
*12     Null                    *30     pseudogene              *46     Reduced activity
*13     Null                    *31     pseudogene              *56     Reduced activity

¹Black JL et al. Frequency of undetected CYP2D6 hybrid genes in clinical samples: Impact on phenotype prediction. Drug Metab Dispos June 2012 40:1238; Patents: United States Patent Application 20120088247.
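
One way such a table might be used downstream is as a lookup from star allele to functional class, combined into a per-patient score. The sketch below copies a handful of rows from the table above; the numeric weights (1 for fully functional, 0.5 for reduced activity, 0 for null) are an assumption borrowed from the activity-score convention common in pharmacogenetics and are not stated on the slide. The function name is hypothetical.

```python
# Subset of the allele table above (star allele -> effect on metabolism).
ALLELE_FUNCTION = {
    "*1": "Fully functional",
    "*2": "Fully functional",
    "*4": "Null",
    "*5": "Null",
    "*10": "Reduced activity",
    "*17": "Reduced activity",
    "*41": "Reduced activity",
}

# Assumed per-allele weights; not taken from the slide.
ACTIVITY = {"Fully functional": 1.0, "Reduced activity": 0.5, "Null": 0.0}


def activity_score(diplotype: str) -> float:
    """Sum the assumed activity weights for a diplotype such as '*1/*4'."""
    alleles = diplotype.split("/")
    if len(alleles) != 2:
        raise ValueError(f"expected two alleles separated by '/': {diplotype!r}")
    return sum(ACTIVITY[ALLELE_FUNCTION[allele]] for allele in alleles)


for diplotype in ("*1/*2", "*1/*4", "*10/*41", "*4/*5"):
    print(diplotype, activity_score(diplotype))
```

The score is only as good as the allele calls behind it, which is why the resolution of CYP2D6 alleles matters for phenotype prediction.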

Trends in Next Generation Sequencing
Attribute | 2010: 2nd Generation NGS | 2013: 3rd Generation NGS
Fundamental technology | SBS (sequencing by synthesis) or degradation | Direct physical inspection of the DNA molecule using nanopore, high speed camera and/or silicon chip technology
Resolution | Averaged across many copies of the DNA molecule being sequenced | Single-molecule resolution
Raw read accuracy | High, with >60-fold coverage | High, missed variant calls: 1 in 500kb – 1M bases
Read length | Short, ~35 bases, generally much shorter than Sanger sequencing | Long, 10,000 bp and longer
Throughput | High | Highest
Current cost | Low cost per base | Lowest cost per base
RNA-sequencing | cDNA sequencing | Direct RNA sequencing and cDNA sequencing
Start-to-Finish | Days | One hour per whole genome
Sample preparation | Complex, library and PCR amplification required | Very simple
Data analysis | Complex because of large data volumes and because short reads complicate assembly and alignment algorithms | Complex because of large data volumes; however, those can be solved by new high speed camera and chip technologies
Primary results | Base calls with quality values | Base calls with quality values, other base information such as kinetics, structural variants and phased haplotypes
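
As a rough sense of scale for the 3rd generation accuracy figure, a rate of one missed variant call per 500 kb to 1 Mb can be converted into expected missed calls per genome. This is a back-of-the-envelope sketch that assumes the rate applies uniformly across a ~3.2 billion base pair genome; the function name is illustrative.

```python
GENOME_SIZE_BP = 3.2e9  # approximate human genome size quoted in this deck


def expected_missed_calls(bases_per_miss: float,
                          genome_size_bp: float = GENOME_SIZE_BP) -> float:
    """Expected missed variant calls if one call is missed every bases_per_miss bases."""
    return genome_size_bp / bases_per_miss


for bases_per_miss in (500_000, 1_000_000):
    missed = expected_missed_calls(bases_per_miss)
    print(f"1 miss per {bases_per_miss:,} bases -> about {missed:,.0f} per genome")
```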


Trends in Next Generation Sequencing
2nd Generation NGS - Short read archive:
• Hardware and Service Companies – Market Share – Illumina and Complete Genomics sequenced over 90% of all genomes as of 10/1/11¹
[Figure: pie chart, “Percentage of Whole Human Genomes Sequenced”, with segments for Illumina, Complete Genomics, Life Technologies and Others]

• Concordance of variant calls – Illumina versus Complete Genomics short read¹

Concordance between platforms (one individual, 76-fold coverage, ~3.7M SNPs):
SNPs    Indels
88.1%   26.5%

¹Lam HL et al. Performance comparison of whole-genome sequencing platforms. Nature Biotech. 2012. 30: 78-82.
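
For readers unfamiliar with the metric, concordance between two platforms can be computed as the share of all called variants that both platforms report identically. The sketch below uses that intersection-over-union definition as an assumption; the exact definition used in the cited comparison may differ, and the variant calls shown are hypothetical toy data.

```python
def concordance(calls_a: set, calls_b: set) -> float:
    """Fraction of all variants called by either platform that both agree on."""
    union = calls_a | calls_b
    if not union:
        return 1.0  # nothing called by either platform: trivially concordant
    return len(calls_a & calls_b) / len(union)


# Hypothetical calls as (chromosome, position, reference, alternate).
platform_a = {
    ("chr1", 1000, "A", "G"),
    ("chr1", 2000, "C", "T"),
    ("chr2", 500, "G", "GA"),   # indel placed at position 500
}
platform_b = {
    ("chr1", 1000, "A", "G"),
    ("chr1", 2000, "C", "T"),
    ("chr2", 501, "G", "GA"),   # same indel reported one base to the right
}

print(concordance(platform_a, platform_b))  # 2 shared calls out of 4 distinct
```

The toy indel illustrates one reason indel concordance tends to be lower than SNP concordance: the same event can be reported at slightly different coordinates, so an exact-match comparison counts it as a disagreement.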


Next Generation Sequencing – Update 6/12
Company | Product(s) | Tech | Problems | Prognosis
[logo] | • HiSeq • MiSeq clinical sequencer* (*FDA-approved Type III device) | 2nd generation – short read | Too expensive; should have taken buyout from Roche; dominate market – believe they can do the same in molecular diagnostics | Will eventually be acquired at bargain price, or merge – best candidate for M&A is BGI
[logo] | Sequencing-as-a-service | 2nd generation – short read (75% of business); 3rd generation (25%) | Just laid off 55 employees – restructuring so as to only focus on clinical markets – no more life sciences research. Need to switch to long read technology ASAP – but can’t because of sequence backlog. | Long read technology is very accurate, but have “over-committed”, including Mayo, ARUP, INOVA, Partners, etc. Will survive …
[logo] | • Personal genome • Exome machine | 2nd generation – short read | Tiny market share; already pushed back dates on Ion Torrent Exome to 9/12 | Company is diversified enough to subsidize sequencing hardware
[logo] | • GridION and MinION | 3rd generation – long read – licensed from Winters-Hilt | No credibility; USB mini-pore can only sequence one genome in closed system – expensive. | Long read technology is accurate; company has over $150M funding – who knows?
[logo] | Not named yet | 3rd generation – long read – licensed from Winters-Hilt | “Still working on the chemistry”. CEO won’t discuss status of company… | Long read technology is very accurate, represents optimal solution – will survive.

NGS – Complete Genomics, Inc.


NGS – Long Read Nanopore Solutions
Complete Genomics
1. Extract and fragment DNA
2. Each base (A, C, G, T) tagged with a different fluorochrome
3. Multi-planar graphene array
4. High-speed CCD camera – can capture every base per pixel with DNA traveling at ~10 base pairs per second.

Their most recent technology involves combining a very high speed CCD (charge-coupled device) camera with each DNA base tagged with a fluorochrome coming through a nanopore.

• They have achieved 500Kb read lengths, claim error rate is “1 missed base call variant every 500Kb” – Lee Hood.
• They have been able to resolve phased maternal and paternal chromosomes
• They can resolve distributed repeats (e.g. pseudogenes)
• However, their in-house pre- and post-processing steps are very complex and time-consuming, their turnaround time for a human genome with a coverage of 10-fold is 72 days, and they now have a backlog of 25,000 genomes.

NGS – Long Read Nanopore Solutions
Ideal System¹
1. Extract DNA.
2. Pass “naked” DNA through graphene nanopore array.
3. High bandwidth CMOS pre-amplifier positioned under every pore.
4. Solid state silicon nitride membrane chip mounted in the fluid cell.

The latest device from Rosenstein et al.¹ can accurately sequence 1 million base pairs of double-stranded DNA without error.
• Most researchers using nanopores to sequence DNA directly, such as Oxford Nanopore Technologies, slow the DNA velocity in the nanopore translocation stage by adding an enzyme ratchet to accommodate the low bandwidths available. These researchers instead used complementary metal-oxide semiconductor (CMOS) processing and integrated circuits technology.
• They have been able to redesign their system to increase the bandwidth above 50MHz, with very low noise, to sequence an entire human genome with very little sample preparation in 20 minutes.

¹Rosenstein JK et al. Integrated nanopore sensing platform with sub-microsecond temporal resolution. Nature Methods. 2012. 9 (5): 487-492.

WGA – Clinical Interpretation Software
Whole Genome Analysis - “The $1,000 genome and the $1M interpretation.”

3 major approaches:

• Filter data followed by complex analysis – Used by Cypher Genomics and Illumina

• Apply proprietary natural language processing algorithms against whole
genome or whole exome data – Used by Silicon Valley Biosystems

• Genomic best linear unbiased prediction (GBLUP) method to evaluate predictive ability by cross-validation. GBLUP approaches take into account the covariance structure inferred from the genomic data. Reported to give the best predictive accuracy.¹,²

¹Ober U et al. Using whole-genome sequence data to predict quantitative trait phenotypes in Drosophila melanogaster. PLoS Genetics. May 2012. 8 (5): 1-14.
²Jones B. Predicting phenotypes. Nature Reviews Genetics. 2012. 13. doi:10.1038/nrg3267
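
The GBLUP idea in the third approach can be sketched in a few dozen lines. This is a minimal illustration on simulated data, not the pipeline of the cited studies: the genomic relationship matrix is taken as G = ZZᵀ/m with column-centred genotypes Z (a simplified scaling), the shrinkage parameter `lam` (the ratio of residual to genetic variance) is fixed by assumption rather than estimated, and all names, sample sizes and the random seed are hypothetical.

```python
import random


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def genomic_relationship(genotypes):
    """G = Z Z^T / m, where Z is the column-centred genotype matrix."""
    n, m = len(genotypes), len(genotypes[0])
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    z = [[row[j] - means[j] for j in range(m)] for row in genotypes]
    return [[sum(z[i][k] * z[j][k] for k in range(m)) / m for j in range(n)]
            for i in range(n)]


def gblup_predict(g, y, train, test, lam=1.0):
    """Predict test phenotypes: mu + G[test, train] (G[train, train] + lam I)^-1 (y - mu)."""
    mu = sum(y[i] for i in train) / len(train)
    lhs = [[g[i][j] + (lam if i == j else 0.0) for j in train] for i in train]
    alpha = solve(lhs, [y[i] - mu for i in train])
    return [mu + sum(g[t][j] * a for j, a in zip(train, alpha)) for t in test]


def cross_validate(g, y, folds=5):
    """Return out-of-fold predictions for every individual."""
    n = len(y)
    pred = [0.0] * n
    for f in range(folds):
        test = [i for i in range(n) if i % folds == f]
        train = [i for i in range(n) if i % folds != f]
        for i, p in zip(test, gblup_predict(g, y, train, test)):
            pred[i] = p
    return pred


def correlation(u, v):
    """Pearson correlation between two equal-length sequences."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / (var_u * var_v) ** 0.5


# Simulated data: 60 individuals, 30 markers coded 0/1/2 (hypothetical sizes).
random.seed(0)
genotypes = [[random.choice((0, 1, 2)) for _ in range(30)] for _ in range(60)]
effects = [random.gauss(0.0, 1.0) for _ in range(30)]
y = [sum(x * e for x, e in zip(row, effects)) + random.gauss(0.0, 1.0)
     for row in genotypes]

g = genomic_relationship(genotypes)
pred = cross_validate(g, y)
print("cross-validated predictive correlation:", round(correlation(pred, y), 3))
```

Predictive ability is summarised here as the correlation between out-of-fold predictions and observed phenotypes; in practice `lam` would be estimated from the data (for example by REML) rather than fixed.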


Whole Genome Analysis - Example from Cypher Genomics


Lab & Technology Operations

Lab
• Results delivered within one business day of receipt of a patient’s DNA sample
• CLIA certified
• CAP accredited
• NY State Department of Health certified

Technology
• Advanced bioinformatics
• World-class data center operations
• Secure Internet protocols
• HIPAA compliant architecture
• Data integration with Facility Health Information Management Systems

