The Role of Statistician in Personalized Medicine: An Overview of Statistical Methods in Bioinformatics

Setia Pramana
Teknik Fisika
Fakultas Teknik Industri
Institut Teknologi Sepuluh Nopember
Surabaya, 12 March 2014
Setia Pramana 1

Educational Background
• Universitas Brawijaya Malang, FMIPA, Statistics department, 1995-1999.
• Hasselt Universiteit, Belgium, MSc in Applied Statistics, 2005-2006.
• Hasselt Universiteit, Belgium, MSc in Biostatistics, 2006-2007.
• Hasselt Universiteit, Belgium, PhD in Statistical Bioinformatics, 2007-2011.
• Medical Epidemiology and Biostatistics Dept., Karolinska Institutet, Sweden, Postdoctoral, 2011-2014.

Now?
• Lecturer and Researcher at Sekolah Tinggi Ilmu Statistik, Jakarta.
• Adjunct Faculty at Medical Epidemiology and Biostatistics Dept., Karolinska Institutet, Stockholm.

Outline
• Personalized Medicine
• Central Dogma
• Microarray Data Analysis
• Next Generation Sequencing
• Summary

Personalized Medicine
• Drug Development:
– Takes 10-15 years
– Costs millions of USD
• Who: pharmaceutical, biotechnology, and device companies; universities; government research agencies
• Regulator: the US Food and Drug Administration (FDA)
• Evaluate:
– Safety – can people take it?
– Efficacy – does it do anything in humans?
– Effectiveness – is it better than, or at least as good as, what is currently available?
– Do the benefits outweigh the risks?

Personalized Medicine
• Drug Development Stages:
- Drug Discovery
- Pre-clinical Development
- Clinical Development (4 phases)
• Statisticians are involved in all stages
• Stages are highly regulated
• Results are based on what works for most patients
• But... patients are not all alike!

Patient Heterogeneity

• We’re all different in
- Physiological, demographic characteristics
- Medical history
- Genetic/genomic characteristics
• What works for a patient with one set of
characteristics might not work for another!

• “One size does not fit all”
• Use a patient’s characteristics to determine best
treatment for him/her
• Genomic information has great potential
→ Personalized medicine:
“The right treatment for the right patient at the right time”

Subgroup identification and targeted treatment
• Determine subgroups of patients who share certain
characteristics and would get better on a particular
treatment
• Discover biomarkers which can identify the subgroup
• Focus on finding and treating a subgroup

Subgroup identification and targeted treatment
[Diagram: Genotype → Phenotype → Intervention → Outcome]
• Genotype: mutations/SNPs, gene/protein expression, epigenetics
• Phenotype: diseases, disability, etc.
• Intervention: drug regimens
• Outcome: personalized medicine

Advanced Biomedical Technologies
• High-throughput microarrays and molecular imaging
to monitor SNPs, gene and protein expressions
• Next-Generation Sequencing

Central Dogma
[Figure: the central dogma. Source: http://compbio.pbworks.com]

Gene
• The full DNA sequence of an organism is called its
genome
• A gene is a segment of DNA that specifies the sequence of one or more proteins.

Genomics
• The study of all the genes of a cell or tissue at:
– the DNA level (genotype), e.g., GWAS, SNPs, CNVs, etc.
– the mRNA level (transcriptomics), i.e., gene expression
– the protein level (proteomics)
• Functional genomics: the study of the function of specific genes, their relation to diseases, their associated proteins, and their participation in biological processes.

Microarray
• DNA microarrays are a biotechnology that allows the expression of thousands of genes to be monitored simultaneously.

Applications
• Drugs with high efficacy and few or no side effects
• Disease-related genes
• Biological discovery
– new and better molecular diagnostics
– new molecular targets for therapy
– finding and refining biological pathways
• Molecular diagnosis of leukemia, breast cancer, etc.
• Appropriate treatment for a given genetic signature
• Potential new drug targets

Microarray
[Figure: overview of the process of generating high-throughput gene expression data using microarrays.]

The Pipeline
• Experiment design → Lab work → Image processing
• Signal summarization (RMA, GCRMA)
• Normalization
• Data Analysis:
– Differentially expressed genes
– Clustering
– Classification
– Etc.
• Network / Pathways (GSEA, etc.)
• Biological interpretation

Microarray Data Structure

Preprocessed Data
Genes   C1    C2    C3    T1    T2    T3
G8522   6.78  6.55  6.37  6.89  6.78  6.92
G8523   6.52  6.61  6.72  6.51  6.59  6.46
G8524   5.67  5.69  5.88  7.43  7.16  7.31
G8525   5.64  5.91  5.61  7.41  7.49  7.41
G8526   4.63  4.85  5.72  5.71  5.47  5.79
G8528   7.81  7.58  7.24  7.79  7.38  8.60
G8529   4.26  4.20  4.82  3.11  4.94  3.08
G8530   7.36  7.45  7.31  7.46  7.53  7.35
G8531   5.30  5.36  5.70  5.41  5.73  5.77
G8532   5.84  5.48  5.93  5.84  5.73  5.75
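
To make the table concrete, here is a minimal sketch of a gene-by-gene comparison in Python. It assumes that C1-C3 and T1-T3 are two groups of arrays (for example control and treated), which the slide does not state, and it uses a plain Welch t statistic rather than any of the moderated tests named later in the deck. The function names are illustrative only.

```python
from math import sqrt
from statistics import mean, variance

# Expression values copied from the slide table.
# Assumption: the first three columns (C1-C3) form one group and the
# last three (T1-T3) form the other.
DATA = {
    "G8522": [6.78, 6.55, 6.37, 6.89, 6.78, 6.92],
    "G8523": [6.52, 6.61, 6.72, 6.51, 6.59, 6.46],
    "G8524": [5.67, 5.69, 5.88, 7.43, 7.16, 7.31],
    "G8525": [5.64, 5.91, 5.61, 7.41, 7.49, 7.41],
    "G8526": [4.63, 4.85, 5.72, 5.71, 5.47, 5.79],
    "G8528": [7.81, 7.58, 7.24, 7.79, 7.38, 8.60],
    "G8529": [4.26, 4.20, 4.82, 3.11, 4.94, 3.08],
    "G8530": [7.36, 7.45, 7.31, 7.46, 7.53, 7.35],
    "G8531": [5.30, 5.36, 5.70, 5.41, 5.73, 5.77],
    "G8532": [5.84, 5.48, 5.93, 5.84, 5.73, 5.75],
}


def welch_t(a, b):
    """Welch's two-sample t statistic for mean(b) - mean(a)."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(b) - mean(a)) / se


def rank_genes(data, n_first=3):
    """Return (gene, mean difference, t) sorted by |t|, largest first."""
    rows = []
    for gene, values in data.items():
        c, t = values[:n_first], values[n_first:]
        rows.append((gene, mean(t) - mean(c), welch_t(c, t)))
    return sorted(rows, key=lambda r: abs(r[2]), reverse=True)


ranked = rank_genes(DATA)
for gene, diff, t in ranked:
    print(f"{gene}  diff={diff:+.2f}  t={t:+.2f}")
```

With only three arrays per group the variance estimates are unstable, which is one motivation for the modified t-tests listed under Gene Selection.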

Challenges
• Massive data, difficult to visualize
• Too few samples (columns), usually < 100
• Too many genes (rows), usually > 10,000
• Testing so many genes is likely to produce false positives
• For exploration, a large set of all relevant genes is desired
• For diagnostics or identification of therapeutic targets, the smallest set of genes is needed
• The model needs to be explainable to biologists
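
The false-positive problem can be put in numbers: testing 10,000 genes at a 0.05 significance level yields about 10,000 × 0.05 = 500 false positives when no gene is truly differentially expressed. One common remedy is the Benjamini-Hochberg adjustment, which controls the false discovery rate. The sketch below is an illustration in Python; the slides do not say which adjustment was used, and the p-values are made up.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for five genes.
pvals = [0.001, 0.008, 0.039, 0.041, 0.20]
adj = benjamini_hochberg(pvals)
print(adj)
```

Genes whose adjusted p-value falls below the chosen false discovery rate (for example 0.05) are then reported.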

Microarray Data Analysis Types
• Gene Selection
– find genes for therapeutic targets
• Classification (Supervised)
– identify disease (biomarker study)
– predict outcome / select best treatment
• Clustering (Unsupervised)
– find new biological classes / refine existing ones
– understand regulatory relationships/pathways
– exploration

Gene Selection
• Modified t-test
• Significance Analysis of Microarrays (SAM)
• Limma (Linear Models for Microarray Data)
• Linear mixed model
• Lasso (least absolute shrinkage and selection operator)
• Elastic net
• Etc.

Visualization
• Dimensionality reduction
• PCA (Principal Component Analysis)
• Biplot
• Heatmap
• Multidimensional scaling
• Etc.

Clustering
• Cluster the genes
• Cluster the arrays/conditions
• Cluster both simultaneously
• K-means
• Hierarchical
• Biclustering algorithms
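
As an illustration of the first method in the list, here is a minimal k-means (Lloyd's algorithm) sketch in Python. The two-dimensional points stand in for expression profiles and are hypothetical; the naive initialization (first k points as centers) is a simplification chosen to keep the example deterministic.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm. Returns (labels, centers)."""
    # Simplification: use the first k points as starting centers.
    centers = [list(p) for p in points[:k]]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Hypothetical profiles: three low-expression and three high-expression.
PROFILES = [(1.0, 1.1), (5.0, 5.2), (0.9, 1.0), (5.1, 4.9), (1.2, 0.8), (4.8, 5.0)]
labels, centers = kmeans(PROFILES, k=2)
print(labels)
```

The same function clusters genes (rows as points) or arrays (columns as points), depending on how the data matrix is passed in.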

Clustering
• Cluster or classify genes according to tumors
• Cluster tumors according to genes

Classification
• Linear Discriminant Analysis
• K-nearest neighbour
• Logistic regression
• L1-penalized logistic regression
• Neural network
• Support vector machines
• Random forest
• Etc.

Aim: To improve understanding of host protein
profiles during disease progression especially in
children.

Classification of Malaria Subtypes
• Identify a panel of proteins that can distinguish between different subtypes.
• Implement L1-penalized logistic regression

Penalized Logistic Regression
• Logistic regression is a supervised method for binary or multi-class classification.
• In high-dimensional data (e.g., microarray) there are more variables than observations → classical logistic regression does not work.
• Other problems: correlated variables (multicollinearity) and overfitting.
• Solution: introduce a penalty for complexity in the model.

Penalized Logistic Regression
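Logistic model:
$\log\frac{\pi_i}{1-\pi_i} = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}$, where $\pi_i = P(y_i = 1 \mid x_i)$
Maximize the log-likelihood:
$\ell(\beta) = \sum_{i=1}^{n} \left[ y_i \log \pi_i + (1 - y_i) \log(1 - \pi_i) \right]$
• L1-penalization (Lasso): maximize
$\ell(\beta) - \lambda \sum_{j=1}^{p} |\beta_j|$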

L1 Penalized Logistic Regression
• Shrinks all regression coefficients (β) toward zero and sets some of them exactly to zero.
• Performs parameter estimation and variable selection at the same time.
• The choice of λ is crucial; it is chosen via k-fold cross-validation.
• The procedure is implemented in the R package penalized.
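
The shrink-and-select behaviour can be sketched in a few lines of Python. This is not the algorithm of the R package penalized; it is a simple proximal-gradient (soft-thresholding) fit, written under the assumption that the log-likelihood is averaged over observations, so λ here is on a per-observation scale. The data are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become zero."""
    return math.copysign(max(abs(z) - t, 0.0), z)


def fit_l1_logistic(X, y, lam, step=0.1, n_iter=2000):
    """Proximal gradient for L1-penalized logistic regression.

    The intercept is not penalized. Returns (intercept, coefficients).
    """
    n, p = len(X), len(X[0])
    beta0, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        # Residuals: fitted probability minus observed label.
        resid = [
            sigmoid(beta0 + sum(b * x for b, x in zip(beta, row))) - yi
            for row, yi in zip(X, y)
        ]
        grad0 = sum(resid) / n
        grad = [sum(r * row[j] for r, row in zip(resid, X)) / n for j in range(p)]
        beta0 -= step * grad0
        # Gradient step followed by soft-thresholding (the L1 proximal map).
        beta = [soft_threshold(b - step * g, step * lam) for b, g in zip(beta, grad)]
    return beta0, beta


# Hypothetical data: the first variable tracks the class, the others are noise.
X = [[-1.0, 0.2, -0.3], [-0.8, -0.5, 0.1], [-0.6, 0.4, 0.4], [-0.2, -0.1, -0.2],
     [0.1, 0.3, 0.2], [0.3, -0.4, -0.1], [0.7, 0.1, 0.3], [0.9, -0.2, -0.4]]
y = [0, 0, 0, 1, 0, 1, 1, 1]

_, beta_moderate = fit_l1_logistic(X, y, lam=0.05)
_, beta_heavy = fit_l1_logistic(X, y, lam=10.0)
print("lambda=0.05:", [round(b, 3) for b in beta_moderate])
print("lambda=10  :", beta_heavy)
```

In practice the fit would be repeated over a grid of λ values, with k-fold cross-validation picking the value that predicts held-out samples best.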

Classification of Severe Malaria Anemia vs. Uncomplicated Malaria group
AUC: 0.86
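
For readers unfamiliar with the AUC: it is the probability that a randomly chosen case from one group receives a higher predicted score than a randomly chosen case from the other. A minimal Python sketch of that pairwise definition follows; the scores and labels are hypothetical and unrelated to the malaria data.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical predicted probabilities and true classes.
scores = [0.9, 0.8, 0.35, 0.6, 0.4, 0.2]
labels = [1, 1, 1, 0, 0, 0]
example_auc = auc(scores, labels)
print(round(example_auc, 3))
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation.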

Dose-response Microarray Studies
Implemented in the R packages IsoGene and IsoGeneGUI.


Gene Signature for Prostate Cancer

Next Generation Sequencing
Reading the order of bases of DNA fragments

NGS used for:
• Whole genome re-sequencing
• Metagenomics
• Cancer genomics
• Exome sequencing (targeted)
• RNA-sequencing
• ChIP-seq
• Genomic epidemiology

• Produces massive amounts of data, fast
• The problems are storage and analysis

RNA-seq Pipeline
• Align reads to a reference genome using TopHat.
[Figure: reads aligned to a reference. Source: Trapnell et al., 2010]
Pramana et al., NBBC 2013

RNA-seq Pipeline
• Measure gene expression using Cufflinks: FPKM (Fragments Per Kilobase of transcript per Million mapped reads).
[Figure: a reference gene with Transcript 1 and Transcript 2; isoform/transcript FPKM and gene FPKM for Samples 1-3. Source: Trapnell et al., 2013]
Pramana et al., NBBC 2013
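
The FPKM definition above translates directly into arithmetic. The Python sketch below applies the plain formula to hypothetical counts. Cufflinks itself estimates how fragments are shared among isoforms with a statistical model, so this is a simplification, and taking the gene-level value as the sum of its isoform values is stated here as an assumption.

```python
def fpkm(fragments, length_bp, total_mapped):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (length_bp * total_mapped)


# Hypothetical sample with 20 million mapped fragments and one gene
# that has two transcripts (isoforms).
TOTAL_MAPPED = 20_000_000
isoforms = {
    "transcript_1": {"fragments": 500, "length_bp": 2000},
    "transcript_2": {"fragments": 300, "length_bp": 1500},
}

isoform_fpkm = {
    name: fpkm(t["fragments"], t["length_bp"], TOTAL_MAPPED)
    for name, t in isoforms.items()
}
# Assumption: gene-level FPKM is the sum over the gene's isoforms.
gene_fpkm = sum(isoform_fpkm.values())
print(isoform_fpkm, gene_fpkm)
```

Dividing by transcript length and by sequencing depth is what makes values comparable across transcripts and across samples.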

Subtype-specific Transcripts/Isoforms
• Breast invasive carcinoma (BRCA) from The Cancer Genome Atlas (TCGA) project.
• 329 tumor samples.
• Platform: Illumina
• Paired-end reads (length 50 bp).
• 20-100 million reads

• Aim: to discover transcripts/isoforms that are significantly over- or under-expressed in only one cancer subtype.

Analysis Flow
329 TCGA samples, split into:
- Discovery set: 179 samples
- Validation set: 150 TCGA samples, plus external samples
Classification into molecular subtypes
- Use Swedish microarray data as training data.
- Based on gene-level FPKM
- Median and variance normalization
- K-nearest neighbor
- Classifier gene selection
Subtype-specific transcripts
- Transcript-level FPKM of all genes
- For each transcript: robust contrast tests.
- Multiple testing adjustment.
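
The classification step can be sketched as follows in Python. The sketch assumes that "median and variance normalization" means centering each gene at its median and scaling it to unit variance within each data set, so that microarray training values and RNA-seq values land on a comparable scale; the slides do not spell this out. The data and the subtype labels "A" and "B" are hypothetical.

```python
from collections import Counter
from math import dist
from statistics import median, pstdev


def normalize_genes(samples):
    """Center each gene (column) at its median and scale to unit variance."""
    cols = list(zip(*samples))
    meds = [median(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant genes
    return [[(v - m) / s for v, m, s in zip(row, meds, sds)] for row in samples]


def knn_predict(train_x, train_y, x, k=3):
    """Majority vote among the k nearest training samples (Euclidean)."""
    nearest = sorted(zip(train_x, train_y), key=lambda t: dist(t[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Hypothetical training data (microarray-like scale), two classifier genes.
TRAIN = [[2.0, 8.0], [2.2, 7.5], [1.8, 8.4], [6.0, 3.0], [6.3, 2.6], [5.8, 3.3]]
TRAIN_LABELS = ["A", "A", "A", "B", "B", "B"]
# Hypothetical new samples on a different scale (FPKM-like).
NEW = [[20.0, 80.0], [61.0, 29.0], [19.0, 83.0], [60.0, 31.0]]

train_norm = normalize_genes(TRAIN)
new_norm = normalize_genes(NEW)
predictions = [knn_predict(train_norm, TRAIN_LABELS, x) for x in new_norm]
print(predictions)
```

Normalizing each data set separately only works when both contain a similar mix of subtypes, which is a design choice to check before relying on a cross-platform classifier like this.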


Software?
• R is growing, especially in bioinformatics
– Statistics, data analysis, machine learning
– Free
– High quality
– Open source
– Extendable (you can submit and publish your own package!)
– Can be integrated with other languages (C/C++, Java, Python)
– Large, active user community
– Command-based (a drawback)

Summary
• Statistics plays an important role in developing personalized medicine
• Multidisciplinary field → needs collaboration with different experts.
• Bioinformatician is one of the sexiest jobs
• Big Data in medicine: numerous opportunities to be explored and discovered.

Thank you for your attention….