The Role of Statistician in Personalized Medicine: An Overview of Statistical Methods in Bioinformatics
Invited lecture given at Teknik Fisika, Institut Teknologi Sepuluh Nopember, Surabaya.


  1. The Role of Statisticians in Personalized Medicine: An Overview of Statistical Methods in Bioinformatics
     Setia Pramana
     Teknik Fisika, Fakultas Teknik Industri, Institut Teknologi Sepuluh Nopember, Surabaya, 12 March 2014
  2. Educational Background
     • Universitas Brawijaya Malang, FMIPA, Statistics department, 1995-1999.
     • Hasselt Universiteit, Belgium, MSc in Applied Statistics, 2005-2006.
     • Hasselt Universiteit, Belgium, MSc in Biostatistics, 2006-2007.
     • Hasselt Universiteit, Belgium, PhD in Statistical Bioinformatics, 2007-2011.
     • Medical Epidemiology and Biostatistics Dept., Karolinska Institutet, Sweden, Postdoctoral, 2011-2014.
  3. Now?
     • Lecturer and Researcher at Sekolah Tinggi Ilmu Statistik, Jakarta.
     • Adjunct Faculty at Medical Epidemiology and Biostatistics Dept., Karolinska Institutet, Stockholm.
  4. Outline
     • Personalized Medicine
     • Central Dogma
     • Microarray Data Analysis
     • Next Generation Sequencing
     • Summary
  5. Personalized Medicine
     • Drug Development:
       - Takes 10-15 years
       - Costs millions of USD
     • Who: pharmaceutical, biotechnology, and device companies, universities, and government research agencies
     • Regulatory: the US Food and Drug Administration (FDA)
     • Evaluate:
       - Safety: can people take it?
       - Efficacy: does it do anything in humans?
       - Effectiveness: is it better than, or at least as good as, what is currently available?
       - Do the benefits outweigh the risks?
  6. Personalized Medicine
     • Drug Development Stages:
       - Drug Discovery
       - Pre-clinical Development
       - Clinical Development (4 phases)
     • Statisticians are involved in all stages
     • Stages are highly regulated
     • Results are based on what works for most patients
     • But patients are all different!
  7. Patient Heterogeneity
  8. Patient Heterogeneity
     • We are all different in
       - Physiological and demographic characteristics
       - Medical history
       - Genetic/genomic characteristics
     • What works for a patient with one set of characteristics might not work for another!
  9. Patient Heterogeneity
     • "One size does not fit all"
     • Use a patient's characteristics to determine the best treatment for him/her
     • Genomic information has great potential here, leading to personalized medicine: "The right treatment for the right patient at the right time"
  10. Subgroup identification and targeted treatment
      • Determine subgroups of patients who share certain characteristics and would get better on a particular treatment
      • Discover biomarkers which can identify the subgroup
      • Focus on finding and treating a subgroup
  11. Subgroup identification and targeted treatment
      Genotype → Phenotype → Intervention → Outcome
      • Genotype: mutations/SNPs, gene/protein expression, epigenetics
      • Phenotype: diseases, disability, etc.
      • Intervention: drug regimes
      • Outcome: personalized medicine
  12. Advanced Biomedical Technologies
      • High-throughput microarrays and molecular imaging to monitor SNPs and gene and protein expression
      • Next-Generation Sequencing
  13. Microarrays
  14. Central Dogma
      [Figure: the central dogma. Source: http://compbio.pbworks.com]
  15. Gene
      • The full DNA sequence of an organism is called its genome
      • A gene is a segment that specifies the sequence of one or more proteins.
  16. Genomics
      • The study of all the genes of a cell or tissue at:
        - the DNA level (genotype), e.g., GWAS, SNP, CNV, etc.
        - the mRNA level (transcriptomics), i.e., gene expression
        - the protein level (proteomics)
      • Functional genomics: the study of the function of specific genes, their relation to diseases, their associated proteins, and their participation in biological processes.
  17. Microarray
      • DNA microarrays are a biotechnology that allows the expression of thousands of genes to be monitored at once.
  18. Applications
      • Drugs with high efficacy and few or no side effects
      • Disease-related genes
      • Biological discovery
        - new and better molecular diagnostics
        - new molecular targets for therapy
        - finding and refining biological pathways
      • Molecular diagnosis of leukemia, breast cancer, etc.
      • Appropriate treatment for a given genetic signature
      • Potential new drug targets
  19. Microarray
      Overview of the process of generating high-throughput gene expression data using microarrays.
  20. The Pipeline
      • Experiment design → Lab work → Image processing
      • Signal summarization (RMA, GCRMA)
      • Normalization
      • Data analysis:
        - Differentially expressed genes
        - Clustering
        - Classification
        - Etc.
      • Network / pathways (GSEA, etc.)
      • Biological interpretation
  21. Microarray Data Structure
  22. Preprocessed Data

      Genes    C1    C2    C3    T1    T2    T3
      G8522   6.78  6.55  6.37  6.89  6.78  6.92
      G8523   6.52  6.61  6.72  6.51  6.59  6.46
      G8524   5.67  5.69  5.88  7.43  7.16  7.31
      G8525   5.64  5.91  5.61  7.41  7.49  7.41
      G8526   4.63  4.85  5.72  5.71  5.47  5.79
      G8528   7.81  7.58  7.24  7.79  7.38  8.60
      G8529   4.26  4.20  4.82  3.11  4.94  3.08
      G8530   7.36  7.45  7.31  7.46  7.53  7.35
      G8531   5.30  5.36  5.70  5.41  5.73  5.77
      G8532   5.84  5.48  5.93  5.84  5.73  5.75
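To make the gene-by-sample layout concrete, here is a minimal sketch that ranks the ten genes from the slide by a two-sample statistic. It assumes that C1-C3 are control arrays and T1-T3 are treated arrays (the slide does not say so explicitly), and it uses a plain Welch t-statistic as a stand-in; the function names are illustrative and this is not the method used in the talk.

```python
import math
from statistics import mean, variance

# Values copied from the preprocessed-data slide. Assumption: the first
# three columns (C1-C3) are controls and the last three (T1-T3) are treated.
EXPRESSION = {
    "G8522": [6.78, 6.55, 6.37, 6.89, 6.78, 6.92],
    "G8523": [6.52, 6.61, 6.72, 6.51, 6.59, 6.46],
    "G8524": [5.67, 5.69, 5.88, 7.43, 7.16, 7.31],
    "G8525": [5.64, 5.91, 5.61, 7.41, 7.49, 7.41],
    "G8526": [4.63, 4.85, 5.72, 5.71, 5.47, 5.79],
    "G8528": [7.81, 7.58, 7.24, 7.79, 7.38, 8.60],
    "G8529": [4.26, 4.20, 4.82, 3.11, 4.94, 3.08],
    "G8530": [7.36, 7.45, 7.31, 7.46, 7.53, 7.35],
    "G8531": [5.30, 5.36, 5.70, 5.41, 5.73, 5.77],
    "G8532": [5.84, 5.48, 5.93, 5.84, 5.73, 5.75],
}


def welch_t(control, treated):
    """Return (mean difference, Welch t-statistic) for treated vs. control."""
    diff = mean(treated) - mean(control)
    se = math.sqrt(variance(control) / len(control)
                   + variance(treated) / len(treated))
    return diff, diff / se


def rank_genes(expression, n_control=3):
    """Rank genes by absolute t-statistic, largest first."""
    results = []
    for gene, values in expression.items():
        diff, t = welch_t(values[:n_control], values[n_control:])
        results.append((gene, diff, t))
    return sorted(results, key=lambda r: abs(r[2]), reverse=True)


ranked = rank_genes(EXPRESSION)
for gene, diff, t in ranked[:3]:
    print(f"{gene}: mean difference {diff:+.2f}, t = {t:+.2f}")
```

With only three arrays per group the per-gene variance estimates are very noisy, which is one reason the moderated statistics listed later in the talk (SAM, limma) are usually preferred over a plain t-test.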
  23. Challenges
      • Massive data that are difficult to visualize
      • Too few records (columns/samples), usually < 100
      • Too many rows (genes), usually > 10,000
      • Testing so many genes is likely to produce false positives
      • For exploration, a large set of all relevant genes is desired
      • For diagnostics or identification of therapeutic targets, the smallest possible set of genes is needed
      • The model needs to be explainable to biologists
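The false-positive problem is usually handled with a multiple testing adjustment. The slides do not name a specific procedure, so the sketch below uses the Benjamini-Hochberg false discovery rate adjustment as one common choice; the p-values are hypothetical and the function name is illustrative.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is the
    # minimum of p * m / rank over its own rank and all larger ranks.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical raw p-values for four genes (illustration only).
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
print(adjusted)
```

Genes whose adjusted p-value falls below the chosen threshold (for example 0.05) are declared significant while keeping the expected proportion of false discoveries at or below that threshold.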
  24. Microarray Data Analysis Types
      • Gene Selection
        - find genes for therapeutic targets
      • Classification (Supervised)
        - identify disease (biomarker study)
        - predict outcome / select best treatment
      • Clustering (Unsupervised)
        - find new biological classes / refine existing ones
        - understand regulatory relationships/pathways
        - exploration
  25. Gene Selection
      • Modified t-test
      • Significance Analysis of Microarrays (SAM)
      • Limma (linear models for microarrays)
      • Linear mixed model
      • Lasso (least absolute shrinkage and selection operator)
      • Elastic net
      • Etc.
  26. Visualization
      • Dimensionality reduction
      • PCA (Principal Component Analysis)
      • Biplot
      • Heatmap
      • Multidimensional scaling
      • Etc.
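As an illustration of dimensionality reduction, the following sketch extracts the first principal component with power iteration on the sample covariance matrix. It is a toy, standard-library version for small inputs; the data are hypothetical, and a real analysis with thousands of genes would rely on an SVD routine from a numerical library instead.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (loading vector, scores) of the first principal component.

    rows: observations (e.g. arrays), each a sequence of variables (e.g. genes).
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    # Power iteration; the all-ones start is a simplification and would
    # fail if it happened to be orthogonal to the leading eigenvector.
    v = [1.0] * p
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("data have no variance along the start vector")
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
    return v, scores


# Hypothetical two-variable data lying exactly on the line y = 2x.
loading, scores = first_principal_component([(1, 2), (2, 4), (3, 6), (4, 8)])
print(loading, scores)
```

Plotting the scores of the first two components against each other gives the familiar PCA plot in which arrays from the same condition tend to group together.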
  27. Clustering
      • What to cluster:
        - the genes
        - the arrays/conditions
        - both simultaneously
      • Methods:
        - K-means
        - Hierarchical clustering
        - Biclustering algorithms
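A minimal sketch of k-means (Lloyd's algorithm) applied to clustering the arrays. The six points are the G8524 and G8525 values of arrays C1-C3 and T1-T3 from the preprocessed-data slide; taking the first k points as starting centroids is a simplification made here to keep the example deterministic, not a recommended initialization.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iterations=100):
    """Lloyd's algorithm; returns (labels, centroids)."""
    centroids = [list(p) for p in points[:k]]  # simplistic initialization
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# (G8524, G8525) expression for arrays C1, C2, C3, T1, T2, T3.
arrays = [(5.67, 5.64), (5.69, 5.91), (5.88, 5.61),
          (7.43, 7.41), (7.16, 7.49), (7.31, 7.41)]
labels, centroids = kmeans(arrays, k=2)
print(labels)
```

In practice k-means is run from several random starts, because a poor initialization can leave it in a local optimum.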
  28. Clustering
      • Cluster or classify genes according to tumors
      • Cluster tumors according to genes
  29. Classification
      • Linear Discriminant Analysis
      • K-nearest neighbor
      • Logistic regression
      • L1-penalized logistic regression
      • Neural network
      • Support vector machines
      • Random forest
      • Etc.
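Of the classifiers listed, k-nearest neighbor is the simplest to write down. The sketch below is illustrative only: it trains on four arrays from the preprocessed-data slide (G8524 and G8525 values of C1, C2, T1, T2), and the "control"/"treated" labels are an assumption about what the C and T columns mean.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Training arrays C1, C2, T1, T2 described by (G8524, G8525) expression.
train_points = [(5.67, 5.64), (5.69, 5.91), (7.43, 7.41), (7.16, 7.49)]
train_labels = ["control", "control", "treated", "treated"]

# Held-out arrays C3 and T3 from the same table.
print(knn_predict(train_points, train_labels, (5.88, 5.61)))
print(knn_predict(train_points, train_labels, (7.31, 7.41)))
```

The number of neighbors k is a tuning parameter that is normally chosen by cross-validation rather than fixed in advance.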
  30. Aim: To improve understanding of host protein profiles during disease progression, especially in children.
  31. Classification of Malaria Subtypes
      • Identify a panel of proteins which could distinguish between different subtypes.
      • Implement L1-penalized logistic regression.
  32. Penalized Logistic Regression
      • Logistic regression is a supervised method for binary or multi-class classification.
      • In high-dimensional data (e.g., microarray) there are more variables than observations → classical logistic regression breaks down.
      • Other problems: correlated variables (multicollinearity) and overfitting.
      • Solution: introduce a penalty for complexity in the model.
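  33. Penalized Logistic Regression
      Logistic model:
      $$\log\frac{\pi_i}{1 - \pi_i} = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}, \qquad \pi_i = P(y_i = 1 \mid x_i)$$
      Maximize the log-likelihood:
      $$\ell(\beta) = \sum_{i=1}^{n} \left[ y_i \log \pi_i + (1 - y_i) \log(1 - \pi_i) \right]$$
      • $L_1$-penalization (Lasso): maximize
      $$\ell(\beta) - \lambda \sum_{j=1}^{p} |\beta_j|$$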
  34. L1 Penalized Logistic Regression
      • Shrinks all regression coefficients (β) toward zero and sets some of them exactly to zero.
      • Performs parameter estimation and variable selection at the same time.
      • The choice of λ is crucial; it is chosen via a k-fold cross-validation procedure.
      • The procedure is implemented in an R package called penalized.
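To show how the L1 penalty sets coefficients exactly to zero, here is a minimal sketch that fits an L1-penalized logistic regression by proximal gradient ascent (gradient step followed by soft-thresholding). It is not the algorithm of the R package penalized; it maximizes the log-likelihood divided by n minus λ times the sum of absolute coefficients, so its λ is on a different scale from a fit based on the unscaled log-likelihood. The data, step size, and function names are assumptions for illustration, and the intercept is left unpenalized.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def soft_threshold(value, threshold):
    """Shrink value toward zero; return exactly 0.0 inside the threshold band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_l1_logistic(X, y, lam, step=0.1, iterations=2000):
    """Return (intercept, coefficients) of an L1-penalized logistic fit."""
    n, p = len(X), len(X[0])
    intercept, beta = 0.0, [0.0] * p
    for _ in range(iterations):
        probs = [sigmoid(intercept + sum(b * x for b, x in zip(beta, row)))
                 for row in X]
        residuals = [yi - pi for yi, pi in zip(y, probs)]
        # Gradient of the averaged log-likelihood.
        grad_intercept = sum(residuals) / n
        grad_beta = [sum(r * row[j] for r, row in zip(residuals, X)) / n
                     for j in range(p)]
        intercept += step * grad_intercept
        # Gradient step, then soft-thresholding applies the L1 penalty.
        beta = [soft_threshold(b + step * g, step * lam)
                for b, g in zip(beta, grad_beta)]
    return intercept, beta


# Hypothetical data: the first variable separates the classes,
# the second is low-amplitude noise.
X = [[-2.0, 0.1], [-1.0, -0.1], [-1.5, 0.1],
     [1.0, -0.1], [2.0, 0.1], [1.5, -0.1]]
y = [0, 0, 0, 1, 1, 1]

_, beta_light = fit_l1_logistic(X, y, lam=0.1)
_, beta_heavy = fit_l1_logistic(X, y, lam=10.0)
print(beta_light, beta_heavy)
```

With the light penalty only the informative variable keeps a nonzero coefficient, and with the heavy penalty every coefficient is zero. In practice λ would be picked by k-fold cross-validation, for example by maximizing the cross-validated likelihood over a grid of candidate values.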
  35. Classification of Severe Malaria Anemia vs. Uncomplicated Malaria group
      AUC: 0.86
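The AUC quoted on the slide is the probability that a randomly chosen case receives a higher classifier score than a randomly chosen control. A minimal sketch of that calculation, with hypothetical scores and labels (the study data are not reproduced in the slides):

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    positives = [s for s, label in zip(scores, labels) if label == 1]
    negatives = [s for s, label in zip(scores, labels) if label == 0]
    wins = sum(
        1.0 if pos > neg else 0.5 if pos == neg else 0.0
        for pos in positives
        for neg in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted probabilities; 1 = case, 0 = control.
scores = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
print(round(auc(scores, labels), 3))
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation of the two groups.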
  36. Dose-response Microarray Studies
  37. Dose-response Microarray Studies
      Implemented in the R packages IsoGene and IsoGeneGUI.
  38. Dose-response Microarray Studies
  39. Gene Signature for Prostate Cancer
  40. Gene Signature for Prostate Cancer
  41. Gene Signature for Prostate Cancer
  42. Next Generation Sequencing
  43. Next Generation Sequencing
      Reading the order of bases of DNA fragments
  44. NGS used for:
      • Whole genome re-sequencing
      • Metagenomics
      • Cancer genomics
      • Exome sequencing (targeted)
      • RNA-sequencing
      • ChIP-seq
      • Genomic epidemiology
  45. Next Generation Sequencing
      • Produces massive amounts of data, fast
      • The problem is storage and analysis
  46. RNA-seq Pipeline
      • Align reads to a reference genome using TopHat.
      [Figure: reference genome. Source: Trapnell et al., 2010]
      (Pramana et al., NBBC 2013)
  47. RNA-seq Pipeline
      • Measure gene expression using Cufflinks: FPKM (Fragments Per Kilobase of transcript per Million mapped reads).
      [Figure: isoform/transcript-level FPKM and gene-level FPKM for Transcript 1 and Transcript 2 of a gene across Samples 1-3. Source: Trapnell et al., 2013]
      (Pramana et al., NBBC 2013)
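The FPKM definition can be written out directly. The sketch below implements only the naive textbook formula; Cufflinks itself estimates transcript abundances with a statistical model, so this is an illustration of the unit rather than of the software. The fragment counts and lengths are hypothetical, and summing isoform values to obtain a gene-level value is stated here as an assumption.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


def gene_fpkm(isoform_fpkms):
    """Gene-level value taken as the sum over its isoforms (assumption)."""
    return sum(isoform_fpkms)


# Hypothetical gene with two isoforms in a library of 10 million fragments.
transcript_1 = fpkm(500, 2_000, 10_000_000)
transcript_2 = fpkm(150, 1_500, 10_000_000)
print(transcript_1, transcript_2, gene_fpkm([transcript_1, transcript_2]))
```

Dividing by transcript length and by library size makes values comparable across transcripts of different lengths and across samples sequenced to different depths.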
  48. [Figure]
  49. Subtype-specific Transcripts/Isoforms
      • Breast invasive carcinoma (BRCA) from The Cancer Genome Atlas (TCGA) project.
      • 329 tumor samples.
      • Platform: Illumina
      • Paired-end reads (length 50 bp).
      • 20-100 million reads
  50. Subtype-specific Transcripts/Isoforms
      • To discover transcripts/isoforms that are significantly (high or low) expressed in only one cancer subtype.
      (Pramana et al., NBBC 2013)
  51. Analysis Flow
      • 329 samples (TCGA)
        - Discovery set: 179 samples
        - Validation set: 150 TCGA samples plus external samples
      • Classification into molecular subtypes
        - Use Swedish microarray data as training data
        - Based on gene-level FPKM
        - Median and variance normalization
        - K-nearest neighbor
        - Classifier gene selection
      • Subtype-specific transcripts
        - Transcript-level FPKM of all genes
        - For each transcript: robust contrast tests
        - Multiple testing adjustment
      (Pramana et al., NBBC 2013)
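The slides do not spell out what "median and variance normalization" means in this analysis. One plausible reading, sketched below purely as an assumption, is to center each gene's values at their median and scale them to unit variance, so that the microarray training data and the RNA-seq FPKM values are on a comparable scale before the k-nearest neighbor step. The input values are hypothetical.

```python
from statistics import median, stdev


def median_variance_normalize(values):
    """Center values at their median and scale to unit sample variance."""
    center = median(values)
    spread = stdev(values)
    if spread == 0:
        raise ValueError("cannot scale values with zero variance")
    return [(v - center) / spread for v in values]


# Hypothetical expression values of one gene across five samples.
normalized = median_variance_normalize([1.0, 2.0, 3.0, 4.0, 10.0])
print(normalized)
```

Centering at the median rather than the mean keeps a single extreme sample, like the last value here, from shifting the center of the others.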
  52. Subtype-specific Transcripts/Isoforms
  53. Subtype-specific Transcripts/Isoforms
  54. Subtype-specific Transcripts/Isoforms
  55. Software?
      • R is growing rapidly, especially in bioinformatics
        - Statistics, data analysis, machine learning
        - Free
        - High quality
        - Open source
        - Extendable (you can submit and publish your own package!)
        - Can be integrated with other languages (C/C++, Java, Python)
        - Large, active user community
        - Command-line based (a drawback)
  56. Summary
      • Statistics plays an important role in developing personalized medicine
      • It is a multidisciplinary field → collaboration with experts from different disciplines is needed.
      • Bioinformatician is one of the sexiest jobs
      • Big Data in medicine: numerous opportunities to be explored and discovered.
  57. Thank you for your attention.
