Victoria López Alonso, PhD. Medical Bioinformatics Area, Instituto de Salud Carlos III, Spain. Bioinformatics challenges in a personalized medicine pipeline. Workshop INBIOMEDvision, MIE 2011.
Bridging gaps between Bioinformatics and Medical Informatics (MI). Biomedical Informatics (BMI) deals with the integrative management and synergistic exploitation of the wide and inter-related scope of information that is generated and needed in healthcare settings, biomedical research institutions and health-related industry.
Overview: Personalized medicine in current practice. Bioinformatics challenges for personalized medicine: 1- Processing large-scale genomic data 2- Interpretation of functional effect of genomic variation 3- Integration of systems data 4- Translation into medical practice
Personalized medicine in current practice. Translational bioinformatics uses computational tools to analyze large biological databases and to fully comprehend disease mechanisms, not only by understanding the genetics and the proteomics but also by associating them with clinical data.
Advances of molecular science
Human Genome Project in 2003: Finishing the euchromatic sequence of the human genome. Nature 2004; 431(7011): 931-945.
HapMap project, Phase I in 2005, then Phase II and Phase III: A haplotype map of the human genome. Nature 2005; 437(7063): 1299-1320.
Encyclopedia of DNA Elements (ENCODE) project in 2007: Identification and analysis of functional elements in 1% of the human genome. Nature 2007; 447(7146): 799-816.
1000 Genomes Project in 2008: DNA sequences. A plan to capture human diversity in 1000 genomes. Science 2008; 319(5863): 395.
$1000 genome in … 2013?
Personalized medicine in current practice. Chemotherapy medications: trastuzumab and imatinib (Gambacorti-Passerini, 2008; Hudis, 2007). A targeted pharmacogenetic dosing algorithm is used for warfarin (International Warfarin Pharmacogenetics Consortium et al., 2009). Incidence of adverse events for the drugs abacavir, carbamazepine and clozapine (Dettling et al., 2007; Ferrell and McLeod, 2008, 2002). The inclusion of genetics in EHRs will provide risk assessment. Clinical assessment incorporating a personal genome: Ashley et al., Lancet (2010).
Personalized medicine in current practice. Today a patient's genetics are consulted for only a few diagnoses and treatments, and only in certain medical centers (cystic fibrosis, breast cancer). With easy access to a well-annotated human genome, an individual could acquire a genetic health profile, including risk and resistance factors, that could be used to guide medical decisions. Bentley D. “Genomes for Medicine”. (2004). Nature Insight 429, p440-446.
Bioinformatics challenges for personalized medicine: 1- Processing large-scale genomic data 2- Interpretation of functional effect of genomic variation 3- Integration of systems data 4- Translation into medical practice. Different informatics challenges must be addressed to create the tools that tailor medical care to each individual genome and to realize the potential of personalized medicine.
1- Processing large-scale genomic data. SNPs (single nucleotide polymorphisms) are key enablers in realizing the concept of personalized medicine. SNP: a variant whose frequency in the human population is higher than 1%. Sequencing technologies are becoming accessible: a whole genome in < 2 weeks. At 1 error per 100 kb, a genome yields about 30,000 erroneous variant calls. The error rate of these technologies is a source of significant challenges in applications, including the discovery of novel variants.
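The figure of 30,000 erroneous calls follows from simple arithmetic. A minimal sketch, assuming a genome of roughly 3 billion base pairs (the slide does not state the genome size; the function name is illustrative):

```python
def expected_erroneous_calls(genome_size_bp: int, bases_per_error: int) -> float:
    """Expected number of erroneous base calls for a uniform error rate."""
    return genome_size_bp / bases_per_error


# Assumption: a human genome of about 3 billion base pairs.
HUMAN_GENOME_BP = 3_000_000_000

# 1 error per 100 kb, as quoted on the slide.
print(expected_erroneous_calls(HUMAN_GENOME_BP, 100_000))  # 30000.0
```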
1- Processing large-scale genomic data. Variant discovery is a “needle in a haystack” problem: between 100,000 and 300,000 previously undiscovered SNPs. Novel variants must be verified because of the false positive rate. In addition, there are other important classes of variation for clinical applications: short insertion–deletion variants (indels), copy number variants (CNVs) and structural variants (SVs). New algorithms are needed to detect these variations from sequencing data.
1- Processing large-scale genomic data. High-quality sequence reads must be placed into their genomic context to identify variants. The challenge is to develop new algorithms that make de novo assembly computationally feasible: de novo assembly is slow and complicated by repetitive elements. Sequences are mapped to a genomic reference sequence: BLAST has traditionally been used for this, but its execution speed depends on the genome size.
1- Processing large-scale genomic data. New mapping and alignment algorithms: BLAT uses an indexed version of the genome (Kent, 2002); Burrows-Wheeler Aligner (BWA) (Li and Homer, 2010). Mapping is ideally performed on a cluster or by using cloud computing. The program must allow for mismatches without producing false alignments. Improved quality control metrics: ratios of base transitions to transversions, Mendelian inheritance errors (MIE), relative quality scores…
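BWA takes its name from the Burrows-Wheeler transform (BWT), which allows a pattern to be counted in a text by backward search over the transformed string. The following is a minimal sketch of that idea only: it builds the transform naively by sorting rotations, rescans the string instead of using precomputed occurrence tables, and handles exact matches only. It assumes the text does not contain the sentinel character `$`.

```python
from collections import Counter


def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted rotations (naive construction)."""
    text += "$"  # sentinel, assumed absent from the input
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_text: str, pattern: str) -> int:
    """Count exact occurrences of pattern using backward search."""
    # first_row[c] = number of characters in the text smaller than c
    first_row = {}
    total = 0
    for char, n in sorted(Counter(bwt_text).items()):
        first_row[char] = total
        total += n

    lo, hi = 0, len(bwt_text)  # current half-open range of sorted rotations
    for char in reversed(pattern):
        if char not in first_row:
            return 0
        lo = first_row[char] + bwt_text[:lo].count(char)
        hi = first_row[char] + bwt_text[:hi].count(char)
        if lo >= hi:
            return 0
    return hi - lo


transformed = bwt("banana")
print(transformed)                            # annb$aa
print(count_occurrences(transformed, "ana"))  # 2
```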
2- Interpretation of functional effect. After genomic data have been processed, the functional effect and the impact of the genetic variation must be analyzed. Genome-wide association studies (GWAS) have been used to assess the statistical associations of SNPs with many important common diseases. GWAS provide new insights, but only a limited number of variants have been characterized, and understanding the functional relationship between variants and phenotypes remains a challenge. https://www.wtccc.org.uk
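The statistical association that a GWAS assesses for a single SNP can be sketched as an allelic chi-square test on a 2x2 table of allele counts in cases and controls. The counts below are invented for illustration; a real study would also correct for the very large number of SNPs tested.

```python
import math


def allelic_chi_square(case_minor: int, case_major: int,
                       control_minor: int, control_major: int) -> tuple:
    """Pearson chi-square (1 degree of freedom) on a 2x2 allele-count table."""
    a, b, c, d = case_minor, case_major, control_minor, control_major
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("every row and column needs at least one observation")
    chi2 = n * (a * d - b * c) ** 2 / denominator
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical allele counts: 200 case alleles and 200 control alleles.
chi2, p_value = allelic_chi_square(60, 140, 40, 160)
print(f"chi2 = {chi2:.3f}, p = {p_value:.4f}")
```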
2- Interpretation of functional effect. Important issues for predicting the effect of SNPs are data management, retrieval and quality control. SNP databases:
The dbSNP database (20 million validated SNPs)
The Human Gene Mutation Database (HGMD) (SNPs associated with diseases)
SwissVar
Online Mendelian Inheritance in Man (OMIM) database
PharmGKB database
Catalogue of Somatic Mutations in Cancer (COSMIC)
Number of known SNPs: Fernald et al. 2011
2- Interpretation of functional effect. Computational methods to predict mSNPs: empirical rules (Ng and Henikoff, 2003; Ramensky et al., 2002), Hidden Markov Models (HMMs) (Thomas and Kejariwal, 2004), Neural Networks (Bromberg et al., 2008; Ferrer-Costa et al., 2005), Decision Trees (Dobson et al., 2006; Krishnan and Westhead, 2003), Random Forests (Li et al., 2009; Wainreb et al., 2010) and Support Vector Machines (Calabrese et al., 2009). The input features of the prediction algorithms include amino acid sequence, protein structure and evolutionary information.
2- Interpretation of functional effect. New algorithms that build on evolutionary information and include knowledge-based information are being developed to predict the effect of SNPs:
PANTHER uses a library of protein family HMMs. http://www.pantherdb.org/
PolyPhen uses different sequence-based features. http://genetics.bwh.harvard.edu/pph
MutPred evaluates the probabilities of gain or loss of structure and function upon mutation using a random forest. http://mutdb.org/mutpred
SIFT uses a multiple sequence alignment of homologous proteins. http://sift.jcvi.org
SNAP: Sequence. http://rostlab.org/services/snap
SNPEffect. http://snpeffect.vib.be
SNPs3D: structure-based SVM predictor. http://www.snps3d.org
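The intuition behind alignment-based predictors can be sketched with a toy conservation heuristic: a substitution that never appears at a position across homologous proteins is more suspicious than one that does. This is an illustration of the idea only, not the scoring used by SIFT or any other tool listed here; the alignment, the function names and the 0.05 threshold are all hypothetical.

```python
from collections import Counter


def substitution_frequency(alignment: list, column: int, new_residue: str) -> float:
    """Fraction of aligned homologs that carry new_residue at this column."""
    residues = [seq[column] for seq in alignment if seq[column] != "-"]
    if not residues:
        return 0.0
    return Counter(residues)[new_residue] / len(residues)


def looks_deleterious(alignment: list, column: int, new_residue: str,
                      threshold: float = 0.05) -> bool:
    """Flag substitutions that are rarely or never seen among homologs."""
    return substitution_frequency(alignment, column, new_residue) < threshold


# Hypothetical alignment of four homologous protein fragments.
ALIGNMENT = ["MKVLA", "MKILA", "MRVLA", "MKVLS"]

print(substitution_frequency(ALIGNMENT, 2, "I"))  # 0.25
print(looks_deleterious(ALIGNMENT, 2, "I"))       # False
print(looks_deleterious(ALIGNMENT, 2, "P"))       # True
```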
2- Interpretation of functional effect. Experimental tests are required to validate genetic predictions. There is a need for fast and accurate methods for gene prioritization (Eleftherohorinou et al., 2010). Currently the most effective strategy relies on genes that are linked to the biological process of interest. The input data for gene prioritization are functional annotation, protein–protein interactions, biological pathways and literature.
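Prioritizing genes by their link to a biological process of interest can be sketched as ranking candidates by how much their functional annotation overlaps with that of the process. The sketch below uses Jaccard similarity on annotation terms; the gene names and terms are placeholders, and real tools combine many more data sources.

```python
def prioritize(candidates: dict, process_terms: set) -> list:
    """Rank candidate genes by Jaccard overlap with the process annotations."""
    scored = []
    for gene, terms in candidates.items():
        union = terms | process_terms
        score = len(terms & process_terms) / len(union) if union else 0.0
        scored.append((gene, score))
    # Highest score first; ties broken alphabetically for reproducibility.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Hypothetical annotation terms for a process of interest.
PROCESS_TERMS = {"blood coagulation", "vitamin K metabolism", "drug metabolism"}

# Hypothetical candidate genes with their annotation terms.
CANDIDATES = {
    "GENE_A": {"blood coagulation", "vitamin K metabolism"},
    "GENE_B": {"drug metabolism", "lipid transport"},
    "GENE_C": {"cell cycle"},
}

for gene, score in prioritize(CANDIDATES, PROCESS_TERMS):
    print(f"{gene}: {score:.2f}")
```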
2- Interpretation of functional effect. Gene prioritization tools:
SUSPECT: sequence features, gene expression data, functional terms…
ToppGene: mouse phenotype data with human gene annotations and literature
MedSim: human disease genes with mouse genes
ENDEAVOUR: genes involved in a known biological process
G2D and PolySearch: data mining on biological databases
MimMiner: text mining comparing the human phenome and disease phenotypes
PhenoPred: uses protein sequence and function
GeneMANIA: uses functional assays
The Gene Prioritization Portal provides comprehensive descriptions of available predictors: http://homes.esat.kuleuven.be/~bioiuser/gpp/index.php
2- Interpretation of functional effect Last year, the first edition of the Critical Assessment of Genome Interpretation (CAGI) was organized to assess the available methods for predicting phenotypic impact of genomic variation and to  stimulate future research.  http://genomeinterpretation.org/
3- Integration of systems data. There is concern that pharmacogenomic GWAS are themselves susceptible to many limitations: insufficient sample size, selection biases for genetic variants, and environmental interactions that may affect the outcome measures. Multiple gene–gene interactions may underlie unexplained. HapMap Project, 2004
3- Integration of systems data. Model selection methods have been successful in disease and trait GWAS, using selection techniques to choose multifactorial models that balance the false positive rate, statistical power and computational requirements of the search. Dimensionality reduction methods: Principal Components Analysis, Information Gain and Multifactor Dimensionality Reduction (e.g. hypertension and familial amyloid polyneuropathy type I). Ritchie and Motsinger, 2010
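Information gain, one of the measures named on this slide, quantifies how much knowing a genotype reduces uncertainty about the phenotype. A minimal sketch with made-up genotype and phenotype vectors (genotypes coded as the number of minor alleles, which is an assumption of this example):

```python
import math
from collections import Counter


def entropy(labels: list) -> float:
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(genotypes: list, phenotypes: list) -> float:
    """Reduction in phenotype entropy after splitting by genotype."""
    total = len(phenotypes)
    conditional = 0.0
    for genotype in set(genotypes):
        subset = [p for g, p in zip(genotypes, phenotypes) if g == genotype]
        conditional += len(subset) / total * entropy(subset)
    return entropy(phenotypes) - conditional


# Hypothetical data: four individuals, two cases and two controls.
phenotypes = ["case", "case", "control", "control"]
informative_snp = [0, 0, 1, 1]    # genotype separates cases from controls
uninformative_snp = [0, 1, 0, 1]  # genotype is unrelated to phenotype

print(information_gain(informative_snp, phenotypes))    # 1.0
print(information_gain(uninformative_snp, phenotypes))  # 0.0
```

Ranking SNPs by a score of this kind is one way to shrink the search space before fitting multifactorial models.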
3- Integration of systems data. No external knowledge source informs about the biology behind the interactions. Systems biology and network approaches address the problem of complexity by integrating molecular data at multiple levels of biology, including genomes, transcriptomes, metabolomes, proteomes and functional and regulatory networks. Naylor and Chen, 2010
3- Integration of systems data The simple “one SNP, one phenotype” approach is insufficient. Most medically relevant phenotypes are thought to be the result of gene–gene and gene–environment  interactions Adeyemo et al., 2010
3- Integration of systems data. Drug response often depends on multiple pharmacokinetic and pharmacodynamic interactions. Some success: studies of warfarin have linked the majority of variation in response to two genes, CYP2C9 and VKORC1. Improved dosing algorithm. Limdi and Veenstra, 2008
3- Integration of systems data. Combining disparate data sources can result in novel associations: disease–gene networks (Goh et al., 2007); chemical structures, diseases and protein sequences; epigenetic data and drug phenotypes; pathways and gene sets. Methods: Gene Set Enrichment Analysis (GSEA), the SNP Ratio Test and the Prioritizing Risk Pathways method. Assumptions must also be examined carefully!
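The core of Gene Set Enrichment Analysis is a running-sum statistic over a ranked gene list. The sketch below implements the unweighted form of that statistic only (every hit contributes equally); the published method also supports weighting by the ranking metric and assesses significance by permutation, neither of which is shown. Gene names are placeholders.

```python
def enrichment_score(ranked_genes: list, gene_set: set) -> float:
    """Unweighted running-sum enrichment score of gene_set in ranked_genes."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_misses = len(ranked_genes) - n_hits
    if n_hits == 0 or n_misses == 0:
        raise ValueError("the ranked list needs genes inside and outside the set")

    running = 0.0
    best = 0.0
    for is_hit in hits:
        # Step up on a gene in the set, step down otherwise.
        running += 1 / n_hits if is_hit else -1 / n_misses
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical ranking, most strongly associated gene first.
RANKED = ["G1", "G2", "G3", "G4", "G5", "G6"]

print(enrichment_score(RANKED, {"G1", "G2"}))  # 1.0  (set concentrated at the top)
print(enrichment_score(RANKED, {"G5", "G6"}))  # -1.0 (set concentrated at the bottom)
```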
4- Translation into medical practice. Much of this research has yet to be translated to the clinic for improved patient care. One of the areas where bioinformatics can have the greatest clinical impact is pharmacogenomics, improving drug prescription and dosing. Pharmacogenomic prescription and dosing algorithms need to be accessible to physicians. Warfarin dosing could save up to 60% of the cost and reduce possible adverse events. Martin-Sanchez et al. 2006
4- Translation into medical practice. Medical practice needs to be updated to include routine pharmacogenetic testing, education and training of physicians in personalized medicine, and further clinical trials to prove the efficacy of predictions. Bioinformatics also translates discoveries to the clinic by disseminating them through curated, searchable databases:
Pharmacogenetics-Cell line database (PACdb): http://pacdb.org/
The Pharmacogenomics Knowledge Base (PharmGKB): http://www.pharmgkb.org/
The database of Genotypes and Phenotypes (dbGaP): www.ncbi.nlm.nih.gov/gap
The Adverse Event Reporting System (AERS): www.fda.gov/Drugs/
4- Translation into medical practice. Biologically and medically focused text mining algorithms can speed the collection of this structured data, for example methods that use sentence syntax and natural language processing to derive drug–gene and gene–gene interactions from scientific literature. Opportunities for bioinformatics to integrate with the electronic medical record (EMR): the BioBank system at Vanderbilt (www.mc.vanderbilt.edu/) and PhenX, from RTI International with NHGRI (www.phenx.org/).
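The simplest form of literature mining for drug–gene interactions is sentence-level co-occurrence: a drug and a gene named in the same sentence are recorded as a candidate pair. This is a deliberately crude sketch, far simpler than the syntax-based methods mentioned above; the example text and the drug and gene lists are supplied by hand for illustration.

```python
import re
from collections import Counter
from itertools import product


def cooccurring_pairs(text: str, drugs: list, genes: list) -> Counter:
    """Count (drug, gene) pairs that are mentioned in the same sentence."""
    pairs = Counter()
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        words = set(re.findall(r"[A-Za-z0-9]+", sentence.lower()))
        found_drugs = [d for d in drugs if d.lower() in words]
        found_genes = [g for g in genes if g.lower() in words]
        for pair in product(found_drugs, found_genes):
            pairs[pair] += 1
    return pairs


# Hand-written example text; not taken from any real abstract.
TEXT = (
    "Warfarin response is associated with variants in CYP2C9 and VKORC1. "
    "Imatinib is used in chemotherapy. "
    "VKORC1 genotype influences warfarin dose."
)

pairs = cooccurring_pairs(TEXT, ["warfarin", "imatinib"], ["CYP2C9", "VKORC1"])
for (drug, gene), count in sorted(pairs.items()):
    print(drug, gene, count)
```

Co-occurrence yields many false positives (for example negated statements), which is one reason syntax-aware methods are preferred for curated databases.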
Thanks! Medical Bioinformatics Area, Instituto de Salud Carlos III. http://biotic.isciii.es/ http://www.inbiomedvision.eu/

INBIOMEDvision Workshop at MIE 2011. Victoria López

Editor's Notes

  • #4 This talk outlines recent developments in sequencing technologies and genome analysis methods for application in personalized medicine. New methods are needed in four areas to realize the potential of personalized medicine: (i) processing large-scale robust genomic data; (ii) interpreting the functional effect and the impact of genomic variation; (iii) integrating systems data to relate complex genetic interactions with phenotypes; and (iv) translating these discoveries into medical practice.