May 15 workshop

T-BioInfo is designed for processing, analysis and
integration of multi-omics data. The platform is used in
multiple research groups to extract meaningful insights
from large multi-omics datasets. Our current effort
expands to education, by enabling more people to
extract meaningful, data-driven insights from omics
datasets with biomedical applications. To learn more
about the platform and its research and educational
features, follow the highlighted links.
T-bio.info | edu.t-bio.info | server.t-bio.info

Modeling Precision Medicine
Machine Learning for Transcriptomics Data: Extracting Meaningful
insights from high-throughput biomedical data.

Clinical Subtypes Molecular Subtypes

Diagnosis, Prognosis, Response to Treatment
Survival prediction
Treatment Selection
Oncotype DX, PAM50

Daemen et al., 2013, “Modeling precision treatment of breast cancer”: an analysis of over 70 different breast cancer cell lines and over 90 different therapeutic agents. https://genomebiology.biomedcentral.com/articles/10.1186/gb-2013-14-10-r110

Files we will use in this session

Part 1:
RNA-Seq Processing
from raw reads to a table of expression

RNA-Seq: overview
[Figure: a genome sequence (…TCTGAAACAATGCTTCAATCTAACTTATCATTCATTGGGA…) with Gene A, Gene B and Gene C marked, and transcript labels below the genes]

RNA-Seq: overview
[Figure: the genome sequence with Gene A, Gene B and Gene C marked, and sequencing reads]

RNA-Seq: some details
[Figure: the genome sequence with Gene A, Gene B and Gene C marked, and sequencing reads]
1. Shattering
2. Adapter ligation
3. PCR amplification
4. “Reading”

RNA-Seq: overview
Preprocessing:
• Adapter removal plus additional
• Removing PCR duplicates
Mapping:
• Mapping on the set of known transcripts
• Mapping on genome (and potential identification of novel transcripts)
• Combined strategy
Quantification of expression levels

Data Processing Practice
Create a pipeline:
1. Upload same SVL files
2. Pre-processing steps: Trimmomatic, PCRclean
3. Mapping on Genome: HiSat2
4. Isoform Construction: Cufflinks
5. GTF Merging: Cuffmerge
6. Mapping on Transcripts: Bowtie2-t
7. Quantification: RSEMExpTable

Expression Table
[Figure: expression table labeled with Sample Name and Gene ID. What is this number?]

Standard Measures of RNA Quantification:
• Counts
• FPKM – fragments per kilobase per million mapped reads:
FPKM = (number of reads mapped on the gene) / ((total number of mapped reads, in millions) x (gene length, in kilobases))
• TPM – transcripts per million:
For one sample, TPMg = C x FPKMg, where C is selected in such a way that the sum of TPMg over all genes equals one million. The constants C are different for different samples.
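The two measures above can be sketched in a few lines. This is a minimal illustration in Python, not the platform's implementation: the function names, the toy counts and the gene lengths are all hypothetical, and the total number of mapped reads is assumed to be the sum of the per-gene counts.

```python
def fpkm(counts, lengths_bp):
    """FPKM per gene for one sample.

    counts: reads mapped on each gene; lengths_bp: gene lengths in base pairs.
    Assumption: total mapped reads = sum of the per-gene counts.
    """
    total_millions = sum(counts) / 1e6
    return [c / (total_millions * (length / 1e3))
            for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """TPM_g = C * FPKM_g, with C chosen so that the values sum to one million."""
    values = fpkm(counts, lengths_bp)
    c = 1e6 / sum(values)  # sample-specific constant C
    return [c * v for v in values]


# Hypothetical toy sample: three genes with made-up counts and lengths.
counts = [100, 300, 600]
lengths_bp = [1000, 2000, 3000]
sample_fpkm = fpkm(counts, lengths_bp)
sample_tpm = tpm(counts, lengths_bp)
```

Because C is recomputed for every sample, TPM values always sum to one million within a sample, while FPKM sums generally differ from sample to sample.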

Linear scale vs Log-scale
Relative differences are biologically more meaningful than absolute ones.
Comparisons of relative differences are simplified if log-scaling is performed:
Log-scaled measure = log2 (linear-scale measure + shift)
For relatively large values:
• a difference equal to 1 in log-scale is a 2x difference in linear scale;
• a difference equal to 3 in log-scale is an 8x difference in linear scale, etc.;
• a difference equal to -1 in log-scale is a 2x difference in linear scale, but in the opposite direction.
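A small sketch of the log-scaling rule, written in Python for illustration only. The shift of 1 and the example values are assumptions chosen to show why the "2x per unit" reading holds for relatively large values but not for small ones.

```python
import math


def log_scale(value, shift=1.0):
    """Log-scaled measure = log2(linear-scale measure + shift)."""
    return math.log2(value + shift)


# Large values: an 8x difference in linear scale is close to 3 in log-scale.
large_diff = log_scale(8000.0) - log_scale(1000.0)

# Small values: the same 8x ratio gives a noticeably smaller difference,
# because the shift is no longer negligible.
small_diff = log_scale(8.0) - log_scale(1.0)

# The shift keeps zero expression defined: log2(0 + 1) = 0.
zero_value = log_scale(0.0)
```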


Comparison: the role of preprocessing
High expression can be affected by pre-processing steps like PCR-clean and “Trimmomatic”

BREAK
Q&A
Other Sections:
• Error Correction – CORAL, ECHO, RACER, eMER
• Different Mappers – HiSat, TopHat, STAR, BWA
• Differential Expression – Cuffdiff, edgeR, DESeq
• Segmentation – BinS

Part 2:
Machine Learning
Data exploration and classification

Unsupervised Machine Learning
[Figure: examples forming two groups, labeled Dog and Cat]

Unsupervised analysis: PCA
Why use Principal Component Analysis?
• Explore data
• Visualize
Considerations:
• Data Filtering
• Outliers
• Interpretation
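To make the idea of a principal component concrete, here is a minimal Python sketch that finds the first component by power iteration on the covariance matrix and projects the samples onto it. The toy table (rows are samples, columns are genes) and all function names are hypothetical, and power iteration is used only to keep the example dependency-free. A real 7,000-gene table would call for a dedicated library routine instead.

```python
import math


def center(rows):
    """Subtract the column (gene) mean from every value."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(centered):
    """Sample covariance matrix of the centered columns."""
    n = len(centered)
    p = len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]


def first_component(cov, iterations=200):
    """Leading eigenvector of cov, found by power iteration."""
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Hypothetical toy table: 4 samples x 3 genes; the third gene barely varies.
samples = [
    [1.0, 1.1, 5.0],
    [1.2, 0.9, 5.1],
    [4.0, 4.2, 5.0],
    [4.1, 3.9, 4.9],
]
centered = center(samples)
pc1 = first_component(covariance(centered))
# Score of each sample on the first principal component.
scores = [sum(x * w for x, w in zip(row, pc1)) for row in centered]
```

In this toy table the first two samples get scores of one sign and the last two of the other, which is the kind of separation a PCA plot makes visible.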

Unsupervised analysis: PCA
[Figure: two PCA plots, one on 7,000 genes and one on PAM50 (35) genes; samples labeled Normal-like, Basal, Claudin-low, Luminal]

Unsupervised analysis: Hierarchical Clustering
Why use clustering?
• Identify groups
• Associate sample to group
Considerations:
• Various methods
• Random selection in some methods
• Interpretation


Unsupervised analysis: hierarchical clustering
[Figure: dendrogram cut into 2 clusters, 4 clusters and 8 clusters]
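Cutting a dendrogram into a chosen number of clusters can be sketched as agglomerative clustering that stops once that many clusters remain. The Python below is a minimal illustration under stated assumptions: Euclidean distance and average linkage are choices made for the example (the slides do not name a method), and the points and function names are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(points, c1, c2):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def hclust(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(points, clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical toy samples in two dimensions, forming three tight pairs.
points = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9], [9.0, 0.0], [9.1, 0.3]]
three = hclust(points, 3)
two = hclust(points, 2)
```

Asking for fewer clusters only merges groups found at a finer level, which is why the 2, 4 and 8 cluster cuts of one dendrogram are nested.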

Unsupervised Analysis Practice
• Remove sample IDs
• Mark Group Names as ID
• Run H-clust
CellLines_ExprData_marked.txt

Supervised Machine Learning
[Figure: a training set of examples labeled Dogs and Cats, and a test set of examples marked “?????”]

Step-wise Linear Discriminant Analysis (swLDA)

Support Vector Machine (SVM) with Linear Kernel
[Figures: linear SVM illustrations with distances labeled d and points marked “?”]

Considerations: Supervised analysis
• Fitting classifier on training set and predicting classes on the test set
• Is it possible to tune 7000 coefficients by 52 samples?
• Some algorithms do feature selection: swLDA, random forest
• Other algorithms won’t work if number of features >> number of samples
• Curse of dimensionality
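The train/test workflow with feature selection can be sketched as follows. This Python example deliberately swaps in a nearest-centroid classifier with a simple "largest gap between class means" filter. It is not swLDA, SVM or random forest, and the data and function names are hypothetical. The point it illustrates is that both feature selection and fitting use the training set only.

```python
def class_means(rows, labels):
    """Mean expression vector for each class label."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, x in enumerate(row):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def select_features(rows, labels, k):
    """Keep the k features whose class means differ most (two classes)."""
    means = class_means(rows, labels)
    first, second = sorted(means)
    gaps = [abs(a - b) for a, b in zip(means[first], means[second])]
    return sorted(range(len(gaps)), key=lambda i: gaps[i], reverse=True)[:k]


def predict(row, centroids, features):
    """Assign the class whose centroid is closest on the selected features."""
    def dist(label):
        return sum((row[i] - centroids[label][i]) ** 2 for i in features)
    return min(centroids, key=dist)


# Hypothetical toy data: 4 training samples, 3 features, 2 classes.
train_x = [[1.0, 5.0, 0.2], [1.2, 5.1, 0.9], [4.0, 5.0, 0.1], [4.2, 4.9, 0.8]]
train_y = ["A", "A", "B", "B"]
test_x = [[1.1, 5.0, 0.5], [3.9, 5.0, 0.5]]

features = select_features(train_x, train_y, k=1)  # training set only
centroids = class_means(train_x, train_y)          # fit on training set
predictions = [predict(row, centroids, features) for row in test_x]
```

Reducing thousands of genes to a handful before fitting is one way to keep the number of tuned coefficients in proportion to the number of samples.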

Step-wise Linear Discriminant Analysis (swLDA)
• Extracting 15 highly informative genes from the swLDA classifier
• How other supervised learning algorithms can be applied (e.g., SVM)
• Feature selection can also improve quality of unsupervised learning analysis

Classification Practice
• Organize the table with 15 genes by sample type
• Color expression (green – low; red – high)
• Which genes stand out?
• Which samples stand out?
• What groups are hard to detect?
CellLines_15Genes_market.txt

Classification Practice: PCA of 15 gene table

Hierarchical Clustering of 15 gene table
[Figure: dendrogram with 4 clusters: N-like, Basal, C-low, Luminal]

Part 3:
Interpretation
Annotating and Interpreting Gene Expression

Gene annotation: ENSG to Gene Symbols plus GO

http://www.oncotarget.com/index.php?journal=oncotarget&page=article&op=view&path[]=23869&path[]=75083
https://www.nature.com/articles/1208329

Homework:
1. PCA plot using top 15 genes from differential expression analysis

2. New Datasets
Separation of samples from various sources: TCGA and PDX

Part 1: Conventional Machine Learning Approaches for Next Generation Sequencing
Rapid RNA-seq processing for expression quantification applying logical pipeline construction and pre-processing considerations. In hands-on exercises, participants will explore the expression […] using conventional unsupervised machine learning methods and supervised classifiers with and without feature extraction. Using the T-BioInfo platform, participants will learn about the logic and considerations of applying such methods and be prepared for independent downstream analysis and visualization of data […] downloaded R scripts produced by the system. The produced/downloaded code will be reviewed, customized and […] subsequent session.
T-bio.info | edu.t-bio.info (FREE) | server.t-bio.info (14 days DEMO)

Required installations:
R >= 3.4
RStudio
gplots
ggfortify
ggplot2
ggpubr
e1071
mda
MASS
klaR
Part 2: Combining custom software with R to
streamline analysis workflows and visualize ‘Omics
data insights.
Differential Gene Expression, Gene Set Enrichment
Analysis
R visualization from scratch: utilize the same dataset for
basic data exploration and visualization in R.
This session will strengthen participants' ability to
transition to script-based workflows in RNA-seq
downstream analysis and visualization. Participants will
learn about downstream capabilities of R-based workflow
to transform and manipulate tables and visualize findings
in a meaningful way.

Download and Modify R Scripts

Differential expression analysis
Quantities related to the degree of differential expression:
• Difference between mean expression levels – fold change
(please, pay attention to scale);
• Statistical significance – p-value, adjusted p-value (e.g., FDR)
• Level of expression (caution with low-expressed genes; they may need to be excluded from the analysis)
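Two of the quantities above can be sketched in Python. The fold change here is the difference of log2-scaled group means (with an assumed shift of 1), and the adjusted p-values use the Benjamini-Hochberg procedure, one common way to control FDR. The expression values and raw p-values are hypothetical, since computing p-values would require a statistical test that is outside this sketch.

```python
import math


def log2_fold_change(group_a, group_b, shift=1.0):
    """Difference of log2-scaled mean expression: group_b relative to group_a."""
    mean_a = sum(group_a) / len(group_a)
    mean_b = sum(group_b) / len(group_b)
    return math.log2(mean_b + shift) - math.log2(mean_a + shift)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical linear-scale expression of one gene in two groups.
lfc = log2_fold_change([10.0, 12.0, 11.0], [40.0, 44.0, 48.0])

# Hypothetical raw p-values for four genes.
fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
```

A fold change reported as a log2 difference and one reported as a linear ratio describe the same change on different scales, which is why the scale needs attention when applying thresholds.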

• Hard to interpret when number of groups is greater than two, so we can use Claudin-low vs normal-like groups.
• Differential Expression is a natural and easy to interpret feature selection procedure.
• Pathway enrichment analysis can be applied to the resulting table


Gene set / pathway enrichment analysis
• Use only lists (thresholding required): one of the standard tools here is the Database for Annotation, Visualization and Integrated Discovery – DAVID (https://david.ncifcrf.gov/home.jsp, https://david-d.ncifcrf.gov/).
• GAGE – takes into consideration level of differential expression
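The list-based approach can be illustrated with a hypergeometric over-representation test: given a thresholded list of genes, how surprising is its overlap with one gene set? The Python below is a generic sketch of that idea and is not a description of how DAVID or GAGE compute their statistics. The function name and the gene counts are hypothetical.

```python
from math import comb


def enrichment_p_value(n_background, n_in_set, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn at random
    from n_background genes, of which n_in_set belong to the gene set."""
    total = comb(n_background, n_selected)
    upper = min(n_in_set, n_selected)
    tail = sum(
        comb(n_in_set, k) * comb(n_background - n_in_set, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20 background genes, a 5-gene pathway,
# 4 genes passing the threshold, 3 of them in the pathway.
p = enrichment_p_value(n_background=20, n_in_set=5, n_selected=4, n_overlap=3)
```

The result depends on where the threshold is drawn, which is the limitation that methods using the level of differential expression aim to avoid.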
[Figures: Regulation of Actin Cytoskeleton; B Cell Receptor Signaling Pathway]
