1
T-BioInfo is designed for processing, analysis and
integration of multi-omics data. The platform is used in
multiple research groups to extract meaningful insights
from large multi-omics datasets. Our current effort
expands to education, by enabling more people to
extract meaningful, data-driven insights from omics
datasets with biomedical applications. To learn more
about the platform and its research and educational
features, follow the highlighted links.
T-bio.info | edu.t-bio.info | server.t-bio.info
2
3
LBRN Summer Program
June 1
Program
Announcement
August 15
Independent
Projects
Part 1:
RNA-Seq Processing
from raw reads to a table of expression
4
5
Processing NGS Data (RNA-seq)
1. Pre-processing:
• Adaptor Trimming
• PCR duplicates
2. Alignment:
• Aligning reads to exons
• Exon Junctions
• GTF updates
3. Quantification
• Table of Expression
• Normalization Techniques
4. Differential Gene Expression
• Student’s t-test
• P-value
• False Discovery Rate (FDR)
• Fold Change
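Two of the quantities listed under step 4 can be sketched in a few lines of Python. This is a minimal illustration, not the platform's implementation: the p-values are made up, the `shift` pseudocount is an assumption, and the Benjamini-Hochberg procedure is used here as one common way to control the False Discovery Rate.

```python
import math

def log2_fold_change(mean_a, mean_b, shift=1.0):
    """log2 ratio of group B over group A; `shift` (assumed) avoids log of zero."""
    return math.log2((mean_b + shift) / (mean_a + shift))

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (FDR) in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical per-gene p-values, e.g. from a t-test between two groups.
p_values = [0.01, 0.04, 0.03, 0.20]
fdr = benjamini_hochberg(p_values)
fold = log2_fold_change(3.0, 7.0)  # (7 + 1) / (3 + 1) = 2, so log2 is 1
```

A gene is usually reported as differentially expressed only when both the adjusted p-value and the absolute fold change pass chosen thresholds.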
6
RNA-seq and Precision
Medicine
7
Modeling Precision Medicine
Machine Learning for Transcriptomics Data: Extracting Meaningful
insights from high-throughput biomedical data.
8
Clinical Subtypes Molecular Subtypes
9
Diagnosis, Prognosis, Response to Treatment
10
Survival prediction
Treatment Selection
Oncotype DX, PAM50
Daemen et al., 2013, “Modeling precision treatment of breast cancer”: an analysis of over 70 different Breast Cancer cell lines and over 90 different
therapeutic agents. https://genomebiology.biomedcentral.com/articles/10.1186/gb-2013-14-10-r110
        Sample 1  Sample 2  Sample 3  Sample 4
gene 1         4         3         3         7
gene 2         6         5         5         8
gene 3         6         6         6         6
gene 4         1         2         1         2
gene 5         9        10         1         5
gene 6        12         4         0         5
gene 7         1         7         9         8
gene 8         4         8         3        10
11
12
Cell line / Transcriptional subtype
21MT1 Basal
21NT Basal
21PT Basal
HCC1143 Basal
HCC1569 Basal
HCC1806 Basal
HCC1937 Basal
HCC1954 Basal
HCC3153 Basal
HCC70 Basal
JIMT1 Basal
MX1 Basal
SUM149PT Basal
SUM229PE Basal
BT549 Claudin-low
HCC1395 Claudin-low
HCC38 Claudin-low
HS578T Claudin-low
MDAMB231 Claudin-low
SUM1315MO2 Claudin-low
600MPE Luminal
AU565 Luminal
BT474 Luminal
BT483 Luminal
CAMA1 Luminal
EFM192A Luminal
EFM192B Luminal
EFM192C Luminal
HCC1419 Luminal
HCC1428 Luminal
HCC202 Luminal
LY2 Luminal
MCF7 Luminal
MDAMB134VI Luminal
MDAMB175VII Luminal
MDAMB361 Luminal
MDAMB453 Luminal
SKBR3 Luminal
SUM225CWN Luminal
SUM52PE Luminal
T47D Luminal
T47D_KBluc Luminal
UACC812 Luminal
UACC893 Luminal
ZR751 Luminal
ZR7530 Luminal
ZR75B Luminal
184A1 Normal-like
184B5 Normal-like
MCF10A Normal-like
MCF10F Normal-like
MCF12A Normal-like
13
MCF10A (Normal-like) HCC38 (Claudin-low)
Files we will use in this session
14
15
.fastq, .fq: RAW READS (Read Sequence File)
.fa
.gtf, .gff: General Transfer Format
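To make the raw-read format concrete, here is a minimal sketch of reading FASTQ records in Python. It assumes the simple four-lines-per-record layout (header, sequence, `+` separator, quality string); the read names and quality strings are invented, and the function `read_fastq` is a hypothetical helper rather than part of any library.

```python
import io

def read_fastq(handle):
    """Yield (read_id, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError(f"bad FASTQ header: {header!r}")
        yield header[1:], sequence, quality

# Two made-up records held in memory instead of a real .fastq file.
example = io.StringIO(
    "@read1\nTCTGAAACAATG\n+\nIIIIIIIIIIII\n"
    "@read2\nCTTCAATCTAAC\n+\nIIIIIIIIIIII\n"
)
records = list(read_fastq(example))
```

For a real file, the same function can be given an open file handle in place of the in-memory example.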
BREAK
16
Q&A
Part 1:
RNA-Seq Processing
from raw reads to a table of expression
17
18
RNA TRANSCRIPTION
1. DNA is transcribed into RNA
2. RNA (pre-mRNA) is processed into mRNA
3. mRNA is translated into proteins
19
GTF file
FASTQ file
Mapping reads: Genes and Isoforms
20
Unmapped reads
Mapped reads
Mapping on Junctions
21
Introns vs. Exons
1. Exons from different genes spliced together
2. Highly conserved exons from different genes
3. Introns are sometimes retained!
What is in an intron
1. Contains information on how to combine exons
2. Includes species markers (nucleotide sequences)
3. Non-coding regions (microRNAs, repeats)
Gene Fusion
1. Chromosome aberration
2. Gene fusion (terminator region mutation)
Exceptions to RNA TRANSCRIPTION
Technical Considerations
22
RNA-Seq: technical overview
.…TCTGAAACAATGCTTCAATCTAACTTATCATTCATTGGGA….
Genome
23
Gene A Gene B Gene C
Transcr. A  Transcript A  Transcr. A  Transcript C
24
.…TCTGAAACAATGCTTCAATCTAACTTATCATTCATTGGGA….
Gene A Gene B Gene C
Transcr. A  Transcript A  Transcr. A  Transcript C
Reads
RNA-Seq: overview
25
.…TCTGAAACAATGCTTCAATCTAACTTATCATTCATTGGGA….
Gene A Gene B Gene C
Transcr. A  Transcript A  Transcr. A  Transcript C
Reads
RNA-Seq: some details
1. Shattering (fragmentation)
2. Adapter ligation
3. PCR amplification
4. “Reading” (sequencing)
Preprocessing:
• Adapters removal plus additional
• Removing PCR duplicates
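The two preprocessing steps can be illustrated with a deliberately naive Python sketch. It assumes exact adapter matches and treats reads with identical sequences as PCR duplicates; dedicated tools also handle mismatches, partial adapters, base quality, and mapping positions. The adapter string and the reads are assumptions chosen for illustration.

```python
def trim_adapter(read, adapter):
    """Cut the read at the first exact occurrence of the adapter, if any."""
    position = read.find(adapter)
    return read if position == -1 else read[:position]

def remove_duplicates(reads):
    """Keep the first copy of each distinct sequence, preserving order."""
    seen = set()
    unique = []
    for read in reads:
        if read not in seen:
            seen.add(read)
            unique.append(read)
    return unique

# Assumed adapter prefix, used here only as an example.
ADAPTER = "AGATCGGAAGAGC"

# Made-up reads: two carry the adapter, and two share the same insert.
raw_reads = [
    "TCTGAAACAATGAGATCGGAAGAGC",
    "TCTGAAACAATG",
    "CTTCAATCTAACAGATCGGAAGAGC",
]
trimmed = [trim_adapter(read, ADAPTER) for read in raw_reads]
deduplicated = remove_duplicates(trimmed)
```

Collapsing identical reads also removes genuine repeats from highly expressed genes, which is one reason preprocessing choices can change high expression values.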
26
Quantification of expression levels
Mapping
• Mapping on the set of known transcripts
• Mapping on genome (and potential
identification of novel transcripts)
• Combined strategy
RNA-Seq: overview
Server.t-bio.info
27
Making an RNA-seq pipeline
28
RNA-Seq: basic (and fastest) pipeline
29
RNA-Seq: extended pipeline
30
RNA-Seq: extended pipeline (detail)
31
Pipeline Results
https://server.t-bio.info/pipelines/3875323
32
Expression Table (FPKM or TPM)
Sample Name
Gene ID What is this number?
Standard Measures of RNA Quantification:
• Counts
• FPKM – fragments per kilobase per million mapped reads:
FPKM = (number of reads mapped on the gene) /
((total number of mapped reads, in millions) x (gene length in kilobases))
• TPM – transcripts per million:
For one sample, TPMg = C x FPKMg, where C is selected so that the sum of TPMg over all
genes equals one million. The constant C differs from sample to sample.
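The two normalized measures can be computed from raw counts as in the following Python sketch. The counts and gene lengths are invented, and the total number of mapped reads is assumed to equal the sum of the per-gene counts, which is a simplification.

```python
def fpkm(counts, lengths_bp):
    """FPKM per gene: reads / (total mapped reads in millions * gene length in kb)."""
    total_millions = sum(counts.values()) / 1e6
    return {
        gene: counts[gene] / (total_millions * (lengths_bp[gene] / 1e3))
        for gene in counts
    }

def tpm_from_fpkm(fpkm_values):
    """TPM_g = C * FPKM_g, with C chosen so the TPM values sum to one million."""
    c = 1e6 / sum(fpkm_values.values())
    return {gene: c * value for gene, value in fpkm_values.items()}

# Hypothetical read counts and gene lengths (in base pairs) for one sample.
counts = {"gene 1": 500, "gene 2": 1500, "gene 3": 2000}
lengths_bp = {"gene 1": 1000, "gene 2": 3000, "gene 3": 2000}

fpkm_values = fpkm(counts, lengths_bp)
tpm_values = tpm_from_fpkm(fpkm_values)
```

In this example gene 2 has three times the reads of gene 1 but is three times longer, so both end up with the same FPKM and the same TPM.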
33
Linear scale vs Log-scale
Relative differences are biologically more meaningful than absolute ones.
Comparisons are simplified if log-scaling is performed:
Log-scaled measure =
log2 (linear-scale measure + shift)
For relatively large values:
a difference of 1 in log-scale is a 2x difference in linear scale;
a difference of 3 in log-scale is an 8x difference in linear scale, etc.;
a difference of -1 in log-scale is a 2x difference in linear scale, but in the opposite direction.
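A short Python check of these statements, as a sketch: the expression values are arbitrary, and the default `shift` of 1 is an assumption (the slide does not give a value). The shift is set to 0 for the large values so the differences come out exact.

```python
import math

def log_scale(value, shift=1.0):
    """Log-scaled measure = log2(linear-scale measure + shift)."""
    return math.log2(value + shift)

# Arbitrary linear-scale expression values: 100, then 2x and 8x of it.
a = log_scale(100.0, shift=0.0)
b = log_scale(200.0, shift=0.0)
c = log_scale(800.0, shift=0.0)

two_fold_up = b - a      # about  1: 2x higher in linear scale
eight_fold_up = c - a    # about  3: 8x higher in linear scale
two_fold_down = a - b    # about -1: 2x in the opposite direction

# With a shift of 1, a gene with zero expression maps to 0 instead of -infinity.
zero_expression = log_scale(0.0)
```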
34
Comparison: the role of preprocessing
35
High expression can be affected by pre-processing steps like PCR-clean and “Trimmomatic”
Preprocessing:
• Adapters removal plus additional
• Removing PCR duplicates
36
Quantification of expression levels
Mapping
• Mapping on the set of known transcripts
• Mapping on genome (and potential
identification of novel transcripts)
• Combined strategy
RNA-Seq: overview
37
Exploring Gene Expression
BREAK
38
Q&A
BREAK
39
Q&A
Error Correction – CORAL, ECHO, RACER, eMER
Different Mappers – HISAT, TopHat, STAR, BWA
Other Sections:
• Differential Expression – Cuffdiff, edgeR, DESeq
• Segmentation – BinS
40
41
DESeq2
edgeR
42
Interpretation
Annotating and Interpreting Gene Expression
43
Gene annotation: ENSG to Gene Symbols plus GO
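One way to obtain the ENSG-to-symbol mapping is to read it from the GTF annotation itself, since the attribute column commonly carries both `gene_id` and `gene_name`. The Python sketch below assumes that layout; the source name and coordinates in the example lines are placeholders, and GO terms would need a separate annotation source that is not shown here.

```python
def parse_gtf_attributes(attribute_field):
    """Turn 'key "value"; key "value";' into a dict (no semicolons inside values)."""
    attributes = {}
    for item in attribute_field.strip().split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attributes[key] = value.strip().strip('"')
    return attributes

def ensg_to_symbol(gtf_lines):
    """Map gene_id to gene_name using only the 'gene' feature rows."""
    mapping = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header or comment line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue
        attributes = parse_gtf_attributes(fields[8])
        if "gene_id" in attributes and "gene_name" in attributes:
            mapping[attributes["gene_id"]] = attributes["gene_name"]
    return mapping

# Illustrative GTF lines; coordinates and the source column are placeholders.
example_gtf = [
    '17\texample\tgene\t100\t900\t.\t+\t.\tgene_id "ENSG00000141736"; gene_name "ERBB2";\n',
    '17\texample\texon\t100\t300\t.\t+\t.\tgene_id "ENSG00000141736"; gene_name "ERBB2";\n',
]
symbols = ensg_to_symbol(example_gtf)
```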
44
45
Annotation Practice
46
http://www.oncotarget.com/index.php?journal=oncotarget&page=article&op=view&path[]=23869&path[]=75083
https://www.nature.com/articles/1208329

June 25-26, Workshop

Editor's Notes

  • #11 HER2 – human epidermal growth factor receptor 2 (encoded by the ERBB2 gene). HER2 is a member of the human epidermal growth factor receptor (HER/EGFR/ERBB) family. Amplification or over-expression of this oncogene has been shown to play an important role in the development and progression of certain aggressive types of breast cancer. In recent years the protein has become an important biomarker and target of therapy for approximately 30% of breast cancer patients. Herceptin