T-BioInfo is designed for processing, analysis and
integration of multi-omics data. The platform is used by
multiple research groups to extract meaningful insights
from large multi-omics datasets. Our current effort
expands to education, enabling more people to
extract meaningful, data-driven insights from omics
datasets with biomedical applications. To learn more
about the platform and its research and educational
features, follow the highlighted links.
T-bio.info | edu.t-bio.info | server.t-bio.info
Modeling Precision Medicine
Machine Learning for Transcriptomics Data: Extracting Meaningful
Insights from High-Throughput Biomedical Data
Clinical Subtypes
Molecular Subtypes
Diagnosis, Prognosis, Response to Treatment
Survival prediction
Treatment Selection
Oncotype DX
PAM50
Daemen et al., 2013, “Modeling precision treatment of breast cancer”: an analysis of over 70 different breast cancer cell lines and over 90 different
therapeutic agents. https://genomebiology.biomedcentral.com/articles/10.1186/gb-2013-14-10-r110
Files we will use in this session
BREAK
Q&A
Part 1:
RNA-Seq Processing
from raw reads to a table of expression
RNA-Seq: overview
[Figure: a genome sequence (.…TCTGAAACAATGCTTCAATCTAACTTATCATTCATTGGGA….) annotated with Gene A, Gene B and Gene C, their transcripts, and the reads obtained from those transcripts]
RNA-Seq: some details
1. Shattering (fragmentation)
2. Adapter ligation
3. PCR amplification
4. “Reading” (sequencing)
RNA-Seq: overview
Preprocessing:
• Adapters removal plus additional
• Removing PCR duplicates
Mapping:
• Mapping on the set of known transcripts
• Mapping on genome (and potential identification of novel transcripts)
• Combined strategy
Quantification of expression levels
RNA-Seq: basic pipeline
Data Processing Practice
Create a pipeline:
1. Upload same SVL files
2. Pre-processing steps: Trimmomatic, PCRclean
3. Mapping on Genome: HiSat2
4. Isoform Construction: Cufflinks
5. GTF Merging: Cuffmerge
6. Mapping on Transcripts: Bowtie2-t
7. Quantification: RSEMExpTable
RNA-Seq: extended pipeline
Expression Table
[Figure: expression table labeled with Sample Name and Gene ID]
What is this number?
Standard Measures of RNA Quantification:
• Counts
• FPKM: fragments per kilobase per million mapped reads:
FPKMg = (number of reads mapped on gene g) /
((total number of mapped reads, in millions) x (length of gene g, in kilobases))
• TPM: transcripts per million
For one sample TPMg = C x FPKMg, where C is selected in such a way that the sum of all
TPM values in the sample equals one million. Constants C are different for different samples.
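The two measures above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the gene counts and lengths are made-up toy values, the function names `fpkm` and `tpm_from_fpkm` are hypothetical, and the total number of mapped reads is taken to be the sum of the per-gene counts.

```python
def fpkm(counts, lengths_bp):
    """FPKM per gene for one sample.

    counts: number of reads mapped on each gene
    lengths_bp: gene lengths in base pairs
    Assumption: total mapped reads = sum of the per-gene counts.
    """
    total_millions = sum(counts) / 1e6
    return [c / (total_millions * (length / 1e3))
            for c, length in zip(counts, lengths_bp)]


def tpm_from_fpkm(fpkm_values):
    """Rescale FPKM so that the values of one sample sum to one million."""
    scale = 1e6 / sum(fpkm_values)  # the per-sample constant C
    return [scale * v for v in fpkm_values]


# Toy sample with three hypothetical genes
counts = [100, 300, 600]
lengths_bp = [1000, 2000, 3000]

fpkm_values = fpkm(counts, lengths_bp)
tpm_values = tpm_from_fpkm(fpkm_values)
```

Because C depends on the sum of FPKM values in a sample, it has to be recomputed for every sample, which is why the constants differ between samples.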
Linear scale vs Log-scale
Relative differences are biologically more meaningful than absolute ones.
Comparisons are simplified if log-scaling is performed:
Log-scaled measure =
log2 (linear-scale measure + shift)
For relatively large values:
a difference of 1 in log-scale is a 2x difference in linear scale;
a difference of 3 in log-scale is an 8x difference in linear scale, etc.;
a difference of -1 in log-scale is a 2x difference in linear scale, but in the opposite direction.
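A small Python sketch of the log-scaling rule, assuming a shift of 1 (the slides do not fix the shift value) and two made-up expression levels that differ 8x in linear scale:

```python
import math


def log_scale(value, shift=1.0):
    """log2(linear-scale measure + shift); shift=1.0 is an assumed choice."""
    return math.log2(value + shift)


# Hypothetical expression levels of one gene in two samples (8x apart)
sample_a = 800.0
sample_b = 100.0

# Difference in log-scale: close to 3, because the values are large
# relative to the shift
log_difference = log_scale(sample_a) - log_scale(sample_b)

# Swapping the samples gives the same magnitude with the opposite sign
reverse_difference = log_scale(sample_b) - log_scale(sample_a)
```

For small values the shift dominates and the correspondence between log differences and fold changes no longer holds, which is why the rule is stated for relatively large values.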
Comparison: the role of preprocessing
Estimates for highly expressed genes can be affected by pre-processing steps such as PCRclean and Trimmomatic
BREAK
Q&A
Error Correction – CORAL, ECHO, RACER, eMER
Different Mappers – HiSat, TopHat, STAR, BWA
Other Sections:
• Differential Expression – CuffDiff, edgeR, DESeq
• Segmentation – BinS
Part 2:
Machine Learning
Data exploration and classification
Unsupervised Machine Learning
[Figure: unlabeled samples (dogs and cats) separated into Group 1, Group 2 and an Outlier]
Unsupervised analysis: PCA
Why use Principal Component Analysis?
• Explore data
• Visualize
Considerations:
• Data Filtering
• Outliers
• Interpretation
Unsupervised analysis: PCA
[Figure: PCA on 7,000 genes vs PCA on PAM50 (35) genes; samples labeled Normal-like, Basal, Claudin-low, Luminal]
Unsupervised analysis: Hierarchical Clustering
Why use clustering?
• Identify groups
• Associate sample to group
Considerations:
• Various methods
• Random selection in some methods
• Interpretation
Unsupervised analysis: Hierarchical Clustering
[Figure: dendrogram cut into 2 clusters, 4 clusters and 8 clusters]
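Cutting a dendrogram at a chosen number of clusters can be illustrated with a minimal agglomerative clustering sketch in Python. The choices here are assumptions made for illustration only: Euclidean distance, average linkage, toy two-gene samples, and the hypothetical function name `hierarchical_clusters`. The slides do not state which distance or linkage the platform uses.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def hierarchical_clusters(points, k):
    """Agglomerative clustering with average linkage, stopped at k clusters.

    Returns a list of clusters, each a list of sample indices.
    """
    clusters = [[i] for i in range(len(points))]

    def linkage(c1, c2):
        # Average distance over all pairs of members of the two clusters
        total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        # Find the closest pair of clusters and merge it
        pairs = ((x, y) for x in range(len(clusters))
                 for y in range(x + 1, len(clusters)))
        a, b = min(pairs,
                   key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Six hypothetical samples described by two genes each
samples = [(0.0, 0.0), (0.5, 0.2), (3.0, 0.1), (3.4, 0.0), (20.0, 0.3), (20.5, 0.0)]

three_groups = hierarchical_clusters(samples, 3)
two_groups = hierarchical_clusters(samples, 2)
```

Asking for fewer clusters corresponds to cutting the dendrogram higher: the two nearby groups are merged first, while the distant group stays separate.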
Unsupervised Analysis Practice
• Remove sample IDs
• Mark Group Names as ID
• Run H-clust
CellLines_ExprData_marked.txt
BREAK
Q&A
Supervised Machine Learning
[Figure: a labeled Training Set (Dogs, Cats) and an unlabeled Test Set (?????)]
Step-wise Linear Discriminant Analysis (swLDA)
Support Vector Machine (SVM) with Linear Kernel
[Figure: linear decision boundary with margin d on each side; unlabeled points marked “?”]
Considerations: Supervised analysis
• Fitting a classifier on the training set and predicting classes on the test set
• Is it possible to tune 7000 coefficients with 52 samples?
• Some algorithms do feature selection: swLDA, random forest
• Other algorithms won’t work if the number of features >> number of samples
• Curse of dimensionality
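The fit-on-training, predict-on-test workflow can be sketched as follows. This sketch uses a nearest-centroid classifier as a simple stand-in; it is not swLDA or SVM, and the samples, the two-gene profiles and the function names are hypothetical. Only the subtype labels are borrowed from the slides.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def fit_centroids(samples, labels):
    """Mean expression profile (centroid) of each class in the training set."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict(centroids, x):
    """Assign x to the class whose centroid is nearest."""
    return min(centroids, key=lambda y: euclidean(centroids[y], x))


# Hypothetical training set: two genes per sample, known subtype labels
train_x = [(1.0, 5.0), (1.2, 4.8), (6.0, 1.0), (5.8, 1.3)]
train_y = ["Basal", "Basal", "Luminal", "Luminal"]

# Hypothetical test set, held out from fitting
test_x = [(0.9, 5.2), (6.1, 0.8)]
test_y = ["Basal", "Luminal"]

model = fit_centroids(train_x, train_y)
predicted = [predict(model, x) for x in test_x]
accuracy = sum(p == t for p, t in zip(predicted, test_y)) / len(test_y)
```

The test samples never enter `fit_centroids`, so the accuracy reflects how the fitted model handles unseen data. With thousands of genes and a few dozen samples, reducing the feature set before fitting becomes the main concern.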
Step-wise Linear Discriminant Analysis (swLDA)
• Extracting 15 highly informative genes from the swLDA classifier
• How other supervised learning algorithms can be applied (e.g., SVM)
• Feature selection can also improve the quality of unsupervised learning analysis
Classification Practice
• Organize the table with 15 genes by sample type
• Color expression (green: low; red: high)
• Which genes stand out?
• Which samples stand out?
• Which groups are hard to detect?
CellLines_15Genes_market.txt
Classification Practice: PCA of 15 gene table
Hierarchical Clustering of 15 gene table
[Figure: dendrogram cut into 4 clusters: N-like, Basal, C-low, Luminal]
BREAK
Q&A
Part 3:
Interpretation
Annotating and Interpreting Gene Expression
Gene annotation: ENSG to Gene Symbols plus GO
Annotation Practice
http://www.oncotarget.com/index.php?journal=oncotarget&page=article&op=view&path[]=23869&path[]=75083
https://www.nature.com/articles/1208329
BREAK
Q&A
Homework:
1. PCA plot using top 15 genes from differential expression analysis
2. New Datasets: separation of samples from various sources (TCGA and PDX)
Part 1: Conventional Machine Learning Approaches for Next
Generation Sequencing
Rapid RNA-seq processing for expression quantification, applying
logical pipeline construction and pre-processing considerations.
In hands-on exercises, participants will explore the expression
data using conventional unsupervised machine learning methods and
supervised classifiers with and without feature extraction. Using the
T-BioInfo platform, participants will learn about the logic and
considerations of applying such methods and be prepared for
independent downstream analysis and visualization of data with
downloaded R scripts produced by the system. The
produced/downloaded code will be reviewed, customized and […]
subsequent session.
T-bio.info | edu.t-bio.info (FREE) | server.t-bio.info (14 days DEMO)
Required installations:
R >= 3.4
RStudio
gplots
ggfortify
ggplot2
ggpubr
e1071
mda
MASS
klaR
Part 2: Combining custom software with R to
streamline analysis workflows and visualize ‘Omics
data insights.
Differential Gene Expression, Gene Set Enrichment
Analysis
R visualization from scratch: utilize the same dataset for
basic data exploration and visualization in R.
This session will strengthen the participants’ ability to
transition to script-based workflows in RNA-seq
downstream analysis and visualization. Participants will
learn about downstream capabilities of R-based workflow
to transform and manipulate tables and visualize findings
in a meaningful way.
59
Download and Modify R Scripts
Differential expression analysis
Quantities related to the degree of differential expression:
• Difference between mean expression levels: fold change
(pay attention to the scale, linear or log);
• Statistical significance: p-value, adjusted p-value (e.g., FDR);
• Level of expression (be cautious with low-expressed genes in the
analysis).
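Two of these quantities can be sketched in Python. The assumptions: expression values are already log2-scaled, so the fold change is a difference of means. The p-values are made-up inputs rather than the output of a statistical test. The Benjamini-Hochberg procedure is used as one common way to obtain FDR-adjusted p-values; the slides do not say which adjustment the platform applies.

```python
def log2_fold_change(group_a, group_b):
    """Difference of mean log2 expression levels (inputs already log-scaled)."""
    return sum(group_a) / len(group_a) - sum(group_b) / len(group_b)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical log2 expression of one gene in two groups of samples
claudin_low = [8.0, 8.4, 7.9]
normal_like = [5.1, 5.0, 4.9]
lfc = log2_fold_change(claudin_low, normal_like)  # about 3.1 in log2 scale

# Hypothetical raw p-values for four genes
raw_p = [0.01, 0.04, 0.03, 0.005]
adjusted_p = benjamini_hochberg(raw_p)
```

A log2 fold change of about 3.1 corresponds to a bit more than an 8x difference in linear scale, which is why the scale has to be stated whenever fold changes are reported.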
• Hard to interpret when the number of groups is greater than two, so we can compare two groups, e.g. Claudin-low vs normal-like.
• Differential expression is a natural and easy-to-interpret feature selection procedure.
• Pathway enrichment analysis can be applied to the resulting table.
Gene set / pathway enrichment analysis
• Use only lists (thresholding required): one of the standard tools here is the
Database for Annotation, Visualization and Integrated Discovery, DAVID
(https://david.ncifcrf.gov/home.jsp, https://david-d.ncifcrf.gov/).
• Take into consideration the level of differential expression: GAGE.
[Figures: enriched pathway diagrams, including Regulation of Actin Cytoskeleton and B Cell Receptor Signaling Pathway]
R Studio
71

May 15 workshop
