Exome breast cancer-edu-tk-sb
1. Modeling precision
treatment of breast cancer
Daemen A, Griffith OL, Heiser LM, et al. Modeling precision
treatment of breast cancer. Genome Biol. 2013;14(10):R110.
doi:10.1186/gb-2013-14-10-r110.
2. Biological Background
• Breast cancer has six previously identified, well-documented subtypes, which respond differently to chemotherapy and
hormone treatments. Currently doctors test for only a handful of molecular signatures, and for over 40% of patients the
cancer does not fit those categories. This study tests over 70 breast cancer cell lines against over 90 therapeutic agents.
The data include SNP array, RNA-seq, exome-seq (exome capture), genome-wide methylation, and RPPA protein abundance
measurements, and the analysis integrates several algorithmic methods for identifying molecular features, including
least squares-support vector machine and random forest algorithms.
• Cell lines are often used in research because they mirror many of the molecular characteristics found in tumors in
clinical studies, so they can serve as pre-clinical models for predictive marker development. This work developed
predictive drug response signatures that future clinical models can build upon. One limitation of the study is that a
cell line panel does not capture features such as the tumor microenvironment, which is critical to understanding tumors.
• Types of cell lines used:
• Luminal: often chemotherapy responsive and endocrine responsive
• Basal: often chemotherapy responsive and endocrine nonresponsive
• Claudin-low: intermediate response to chemotherapy
• Normal and normal-like: breast cell lines that arose after chemical exposure (such as 184A1) and primary cell lines
• Unknown: includes cell lines that are contaminated (such as MT3) and lymphoblastoid cell lines (such as HCC1007)
3. Total List of Cell Lines- Total of 84 in the Study
Basal 21NT Basal* CAL120 Luminal BT474 Luminal ZR7530
Basal BT20 Basal* CAL148 Luminal BT483 Luminal ZR75B
Basal HCC1143 Basal* CAL851 Luminal CAMA1 Luminal MDAMB175VII
Basal HCC1187 Basal* CAL51 Luminal HCC1419 Luminal SUM225CWN
Basal HCC1569 Basal* HDQP1 Luminal HCC1428 Luminal SUM44PE
Basal HCC1806 Basal& COLO824 Luminal HCC202 Luminal UACC893
Basal HCC1937 Claudin-low BT549 Luminal HCC2185 Luminal EFM192A
Basal HCC1954 Claudin-low HCC1395 Luminal LY2 Luminal EFM192B
Basal HCC3153 Claudin-low HCC38 Luminal MCF7 Luminal EFM192C
Basal HCC70 Claudin-low HS578T Luminal MDAMB134VI Luminal HCC2218
Basal MX1 Claudin-low MDAMB157 Luminal MDAMB361 Matched normal HCC1143BL
Basal SUM149PT Claudin-low MDAMB231 Luminal MDAMB415 Matched normal HCC38BL
Basal SUM229PE Claudin-low SUM1315MO2 Luminal MDAMB453 Normal-like 184A1
Basal 21MT1 Claudin-low SUM159PT Luminal ERBB2-amp Normal-like 184B5
Basal MDAMB468 Claudin-low HBL100 Luminal SKBR3 Normal-like MCF10A
Basal 21PT Claudin-low MDAMB436 Luminal SUM185PE Normal-like MCF12A
Basal JIMT1 Luminal* EFM19 Luminal SUM52PE Normal-like MCF10F
Basal SUM102PT Luminal* EVSAT Luminal T47D Normal-like S1
Basal 21MT2 Luminal* MFM223 Luminal T47D_KBluc Normal-like^ PMC42
Basal HCC1599 Luminal 600MPE Luminal UACC812 Unknown# T4
Basal MB157 Luminal AU565 Luminal ZR751 Unknown# HCC1008
Unknown# MT3
The study used a mix of cell lines: basal, basal-like, claudin-low, luminal, normal, normal-like and unknown.
4. Drug List Used for Breast Cancer Study and Their Associated Mean GI50
Compound                   Mean GI50
17-AAG                     7.035
BIBW2992                   6.396
Doxorubicin                6.616
GSK1120212                 5.815
Geldanamycin               7.594
Lestaurtinib (CEP-701)     6.226
Oxaliplatin                5.108
L-779450                   4.745
Topotecan                  6.865
ZM447439                   5.110
Baicalein                  4.292
ERKi II (FR180304)         4.443
GSK1059868                 4.885
Gemcitabine                6.652
MG-132                     6.738
Oxamflatin                 6.053
Rapamycin                  6.697
Tamoxifen                  4.387
5-FU                       3.972
Bortezomib                 7.854
Epirubicin                 6.525
GSK1838705                 5.246
Glycyl H1152               4.894
MLN4924                    6.414
PD98059                    4.432
Vorinostat                 4.123
Temsirolimus               6.013
5-FdUR                     3.970
CGC-11047                  3.964
Erlotinib                  4.695
GSK461364                  7.076
ICRF-193                   4.965
Mebendazole                6.064
PF-2341066                 5.543
SB-3CT                     4.169
Trichostatin A             5.071
AG1478                     4.526
CGC-11144                  6.2563
Etoposide                  5.39
GSK2119563                 6.08
IKK 16                     5.483
Methotrexate               4.668
PF-3084014                 4.646
Ispinesib                  7.154
Tykerb:IGF1R (1:1)         6.209
Sigma AKT1-2 inhibitor     5.460
CPT-11                     5.086
Everolimus                 6.404
GSK2126458                 7.933
Ibandronate sodium salt    4.242
NSC663284                  5.645
PF-3814735                 5.695
Bosutinib                  5.631
VX-680                     5.445
Triciribine                5.593
Carboplatin                4.320
FTase inhibitor I          4.411
GSK2141795                 6.584
Imatinib                   4.713
NU6102                     4.743
PF-4691502                 6.889
Sorafenib                  4.287
Valproic acid              2.768
AS-252424                  4.813
Cisplatin                  5.061
Fascaplysin                6.743
GSK1059615                 6.311
Gefitinib                  5.148
Nelfinavir                 4.989
Paclitaxel                 7.908
Sunitinib Malate           5.210
Velcade                    7.962
AZD6244                    4.705
Disulfiram                 5.700
GSK923295                  7.044
GSK650394                  4.379
Ixabepilone                7.917
Nutlin 3a                  4.687
Pemetrexed                 3.222
TCS PIM-11                 4.090
Vinorelbine                7.549
BEZ235                     5.811
Docetaxel                  8.250
GSK1070916                 5.771
Lapatinib                  5.164
LBH589                     6.948
Olomoucine II              5.294
Purvalanol A               4.128
TCS2312 dihydrochloride    6.248
XRP44X                     5.706
GI50 is the concentration that produces 50% of maximal inhibition of cell proliferation; it is the appropriate measure for
cytostatic (as opposed to cytotoxic) agents. The linked spreadsheet gives the GI50 dichotomization threshold for each
compound, together with the mean GI50 for the 48 core cell lines.
https://www.dropbox.com/s/kjim8g5szr8fwa6/gi50_threshold_48.xlsx?dl=0
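A minimal sketch of how a per-compound dichotomization threshold could be applied to split cell lines into two response groups. The cell line names and response values are hypothetical placeholders, the threshold simply borrows the mean GI50 listed for 17-AAG, and the direction of the comparison is an assumption (larger values taken to mean greater sensitivity, as on a -log10 scale); flip the comparison if the values are raw concentrations.

```python
from typing import Dict


def dichotomize(gi50_by_line: Dict[str, float], threshold: float) -> Dict[str, str]:
    """Label each cell line as sensitive or resistant for one compound.

    Assumption: a larger value means greater sensitivity (e.g. -log10 GI50).
    """
    return {
        line: "sensitive" if value >= threshold else "resistant"
        for line, value in gi50_by_line.items()
    }


# Hypothetical responses of two placeholder cell lines to one compound.
responses = {"line_A": 7.5, "line_B": 6.2}
threshold_17aag = 7.035  # mean GI50 listed for 17-AAG, used here as the threshold

labels = dichotomize(responses, threshold_17aag)
print(labels)
```

The two-group labels produced this way are the kind of binary response that classifiers such as the ones named in the study can be trained to predict.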
5. Data Information
• Exome-Seq -GSE48215
• 75 breast cancer cell lines underwent exome-seq to identify mutations
• Samples by Type: Basal 17, Claudin-low 8, Luminal 29, Non-Malignant 5,
Unknown 16
• Only sorted BAM files were uploaded; reads were aligned to hg19 with BWA (aln and sampe)
• RNA-Seq -GSE48213
• 56 cell lines were profiled in their baseline, unperturbed state.
• Samples by Type: Basal 15, Claudin-low 7, Luminal 32, Non-malignant 6,
Unknown 4
• Agilent Bioanalyzer High Sensitivity chip
• Pipeline: http://use.t-bioinfo.com:3000/pipelines/38717146
• Methylation by Array –GSE42944
• DNA methylation in 55 Breast Cancer samples
• Extraction protocol - DNA was extracted with TNES/PK
(Tris/NaCl/EDTA/SDS/proteinase K) lysis buffer prior to sodium bisulfite
conversion using the Zymo Research EZ DNA Methylation Kit.
• Hybridization protocol - Bisulphite-converted DNA was amplified, fragmented
and hybridized to Illumina Infinium Human Methylation27 BeadChips using
the standard Illumina protocol.
• Affymetrix Array
• DNA copy number array
• EGAS00000000059 + EGAS00001000585
• https://www.ebi.ac.uk/ega/search/site/EGAS00000000059
This study developed candidate response signatures by analyzing associations between biological responses to therapy
and pretreatment omics signatures.
Pretreatment Measurements:
1. mRNA expression (Affymetrix) (56 cell lines)
2. Genome copy number (SNP6) (74 cell lines)
3. Protein expression (RPPA) (49 cell lines)
4. Gene mutation (exome-seq) (75 cell lines)
5. Transcriptome sequencing (RNA-seq) (56 cell lines)
6. Methylation assay (47 cell lines)
The software applies signatures of response developed in vitro to measurements of expression, copy number, and/or
methylation for individual samples and produces a list of recommended treatments ranked according to predicted
probability of response and in vitro GI50 dynamic range.
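The ranking step can be sketched as a simple sort. This is an illustration under stated assumptions, not the study's software: the slides do not say how the two criteria are combined, so the sketch assumes predicted probability comes first and GI50 dynamic range breaks ties. The `Prediction` class, the function name, and all numeric values are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Prediction:
    compound: str
    probability: float    # predicted probability of response, 0..1 (hypothetical)
    dynamic_range: float  # in vitro GI50 dynamic range (hypothetical)


def rank_treatments(predictions: List[Prediction]) -> List[str]:
    """Order compounds by probability, then by dynamic range, both descending."""
    ordered = sorted(
        predictions,
        key=lambda p: (p.probability, p.dynamic_range),
        reverse=True,
    )
    return [p.compound for p in ordered]


# Made-up scores for three compounds from the drug list.
predictions = [
    Prediction("Tamoxifen", 0.40, 1.2),
    Prediction("Lapatinib", 0.85, 2.0),
    Prediction("Docetaxel", 0.85, 3.1),
]
ranking = rank_treatments(predictions)
print(ranking)
```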
6. Breast Cancer Cell Lines
• A total of 84 breast cancer cell lines were assembled, and 70 of them were tested for their response to compounds by growth inhibition assays.
A total of 56 cell lines underwent RNA sequencing and 75 samples underwent exome sequencing. A total of 33 cell lines were included in all data sets.
Cell lines and compounds that showed little variation in response were excluded.
RNA-seq Breast Cell Lines
MCF7 184A1
MDAMB134VI 184B5
MDAMB231 21NT
MDAMB361 600MPE
MDAMB453 AU565
MX1 BT474
SKBR3 BT483
SUM1315MO2 BT549
SUM149PT CAMA1
SUM229PE HCC1143
SUM52PE HCC1395
T47D HCC1419
T47D_KBluc HCC1428
UACC812 HCC1569
ZR751 HCC1806
ZR7530 HCC1937
ZR75B HCC1954
21MT1 HCC202
MCF10F HCC3153
MDAMB175VII HCC38
SUM225CWN HCC70
UACC893 HS578T
21PT LY2
JIMT1 MCF10A
EFM192A MCF12A
EFM192B HCC1599
EFM192C HCC2218
21MT2 MB157
Exome-seq Breast Cell Lines
184A1 SKBR3 CAL51
184B5 SUM1315MO2 EVSAT
21NT SUM149PT HCC1143BL
600MPE SUM159PT HCC2218
AU565 SUM185PE HCC38BL
BT20 SUM229PE HDQP1
BT474 SUM52PE MFM223
BT483 T47D MT3
BT549 T47D_KBluc PMC42
CAMA1 UACC812 EFM192A
HCC1143 ZR751 EFM192B
HCC1187 ZR7530 EFM192C
HCC1395 ZR75B EFM19
HCC1428 21MT1 21MT2
HCC1569 MCF10F MDAMB231
HCC1806 MDAMB175VII MDAMB361
HCC1937 SUM225CWN MDAMB415
HCC1954 SUM44PE MDAMB453
HCC202 MDAMB436 MX1
HCC2185 MDAMB468
HCC3153 UACC893
HCC38 21PT
HCC70 JIMT1
LY2 SUM102PT
MCF10A T4
MCF12A CAL120
MDAMB134VI CAL148
MDAMB157 CAL851
* In red: not included in drug analysis
* Not included in exome sequencing
* Not included in RNA sequencing
Cell lines with all Datasets (RNA-seq, Exome-seq, exon-array, methylation, drug analysis)
600MPE MCF10A
AU565 MCF12A
BT474 MDAMB134VI
BT483 MDAMB231
BT549 MDAMB361
CAMA1 MDAMB453
HCC1143 SKBR3
HCC1428 SUM1315MO2
HCC1569 SUM149PT
HCC1937 SUM52PE
HCC1954 T47D
HCC202 UACC812
HCC3153 ZR751
HCC38 ZR7530
HCC70 ZR75B
LY2 MDAMB175VII
SUM225CWN
*no RPPA availability
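A list like the one above is the intersection of the per-dataset cell line lists. The sketch below shows the idea on a small subset: the RNA-seq and exome-seq names are taken from the lists on this slide, while the drug analysis membership is a hypothetical placeholder, since the slides do not list it explicitly.

```python
# Small subsets of the per-dataset cell line lists.
rna_seq = {"MCF7", "184A1", "BT474", "MDAMB231"}
exome_seq = {"184A1", "BT474", "MDAMB231", "CAL51"}
drug_tested = {"MCF7", "BT474", "MDAMB231"}  # hypothetical membership

datasets = {
    "RNA-seq": rna_seq,
    "Exome-seq": exome_seq,
    "drug analysis": drug_tested,
}

# Cell lines present in every dataset.
in_all = set.intersection(*datasets.values())
print(sorted(in_all))

# For every other line, report which datasets it is missing from.
all_lines = set.union(*datasets.values())
missing = {
    line: sorted(name for name, members in datasets.items() if line not in members)
    for line in sorted(all_lines - in_all)
}
print(missing)
```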
7. RPPA (reverse phase protein array) is a protein array designed as a micro- or nano-scaled dot-blot platform that allows
measurement of protein expression levels in a large number of biological samples.
It can be used to characterize basal protein expression and modification levels, as well as growth factor or ligand induced
effects, to validate therapeutic targets, and to evaluate drug pharmacodynamics.
RPPA assays generally require protein lysate in the picogram to nanogram range, and hundreds of proteins can be
analyzed simultaneously under identical conditions.
Measuring Protein Abundance: Reverse Phase Protein Lysate
1. Lysis and Printing
2. Staining and Measuring
3. Analysis
8. Study Highlights
• The researchers found predictive signatures of response across all levels of the genome.
• The current system for determining treatment uses ER and ERBB2 status, but this study suggests that more
significant features should be included in the treatment decision.
• Using the Patient Response toolbox in R, each patient would get a total of 22 therapeutic compounds ranked
according to the patient's likelihood of response and in vitro GI50 dynamic range.
• Building upon this work, the long-term goal is to select the therapeutic compounds most likely to be
effective in an individual patient.
9. Application
• Building upon the work in this study, more comprehensive genome-wide platforms could be used for discovery and,
once identified, significant features could be migrated to alternative platforms for a lab diagnostic.
11. Expected results:
• Gene expression
• Isoform expression
• Exon expression
Quantile normalization -> PCA pictures
• Cell lines are expected to appear as “clouds” on PCA pictures if the lines are different enough.
• Some cell lines may lie close to each other on the PCA graph while others lie far apart. This may be concordant
with transcriptional subtype + ERBB2 status, and more similar cell lines probably also respond similarly to treatment.
• Gene expression is expected to be less informative than isoform and exon expression. A batch effect may be
found in exon expression but not in gene expression.
• Upregulated and downregulated genes and isoforms can point to meaningful pathways in DAVID, both pathways already
known in breast cancer and ones not yet known. It would be an interesting result if the lines show different
pathways, as different types of cancer do.
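A minimal sketch of the quantile normalization step applied before PCA, written with the standard library only. The sample names and expression values are hypothetical, and ties are broken by order of appearance, which is one simple convention among several. After normalization every sample shares the same value distribution, so PCA run on the result reflects differences in which features are high or low rather than differences in overall scale.

```python
from statistics import mean
from typing import Dict, List


def quantile_normalize(samples: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Give every sample the same distribution: the mean of each rank across samples."""
    n = len(next(iter(samples.values())))
    sorted_columns = [sorted(values) for values in samples.values()]
    # Reference distribution: average of the k-th smallest value over all samples.
    reference = [mean(col[rank] for col in sorted_columns) for rank in range(n)]

    normalized = {}
    for name, values in samples.items():
        # Feature indices ordered from lowest to highest value (stable for ties).
        order = sorted(range(n), key=lambda i: values[i])
        column = [0.0] * n
        for rank, index in enumerate(order):
            column[index] = reference[rank]
        normalized[name] = column
    return normalized


# Hypothetical expression of three features in three placeholder cell lines.
expression = {
    "line_A": [5.0, 2.0, 3.0],
    "line_B": [4.0, 1.0, 6.0],
    "line_C": [3.0, 4.0, 8.0],
}
normalized = quantile_normalize(expression)
for name, values in normalized.items():
    print(name, values)
```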
17. Junk-RNA
Reads that were not mapped to the genome (RSEM output NotMapped reads) will be mapped to the ncRNA database and
the RepBase database.
18. putative RE (kchains)
• Reads that were not mapped to ncRNA+RepBase will be analyzed for putative ncRNA and/or repeats using the
BiClustering procedure.
• Kchain extension and annotation.
19. Expected results:
• RE abundance
• kchains abundance
Quantile normalization -> PCA pictures (points and clouds)
• RE and kchain abundances on the PCA graph can reveal cell-line similarities and differences.
• On PCA graphs, RE and kchain abundances (especially kchain abundances) may separate cell lines more
specifically than gene, isoform and exon expression.
• Positions of TE/REs (rows of the table) can also be analyzed by RE classification: the abundance of some
types of RE can be higher than others in specific cell lines.
• Kchain extension and annotation can reveal additional genes that are up- or down-regulated
in different cell lines.
21. Expected results:
List of prospective mutations (chromosome/position) -> Analysis of probability of every mutation -> Known and new
markers for every cell line
• List of cancer markers (positions of mutations)
• Known and new mutations
• Genome regions with the highest mutation frequency
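One simple way to find the genome regions with the highest mutation frequency is to bin mutation positions into fixed-size windows and count them. The sketch below assumes a 1 Mb window and uses made-up (chromosome, position) pairs; the function name and the window size are illustrative choices, not part of the pipeline described in the slides.

```python
from collections import Counter
from typing import List, Tuple

Region = Tuple[str, int, int]  # (chromosome, window start, window end)


def mutation_hotspots(
    mutations: List[Tuple[str, int]], window: int = 1_000_000, top: int = 3
) -> List[Tuple[Region, int]]:
    """Count mutations per fixed-size window and return the busiest windows."""
    counts = Counter((chrom, pos // window) for chrom, pos in mutations)
    return [
        ((chrom, bin_index * window, (bin_index + 1) * window), n)
        for (chrom, bin_index), n in counts.most_common(top)
    ]


# Made-up mutation calls pooled over several cell lines.
mutations = [
    ("chr1", 1_200_000),
    ("chr1", 1_850_000),
    ("chr1", 5_300_000),
    ("chr2", 1_100_000),
    ("chr1", 1_400_000),
]
hotspots = mutation_hotspots(mutations, top=1)
print(hotspots)
```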
22. BiAssociation
Table 1. Traits (treatment reaction and mutations) by cell line
Trait         HCC1143   HCC1806
17-AAG        6.86      3.76
5-FU…         7.05      4.61
Chr..Pos..    1         0
Chr..Pos..    0         1
Table 2. Expression and abundance (with line-specificity) by cell line
Feature       HCC1143   HCC1806
Genes…        0
Isoforms…
Exons…
REs…
Kchains..
The first table holds the traits: GI50 values (drug response) and mutations, presented as a table with
values such as 0, 0.5 or 1 for every cell line (sample).
The second table holds gene, isoform and exon expression together with RE and kchain abundances (each
feature having its maximum in one of the cell lines).
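To make the two-table layout concrete, the sketch below pairs every trait row with every feature row and scores each pair with a plain Pearson correlation across cell lines. This is a stand-in chosen for illustration, not the BiAssociation procedure itself, whose details the slides do not give. All names and values are hypothetical.

```python
from math import sqrt
from typing import Dict, List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


# Columns are the same four placeholder cell lines in both tables.
traits: Dict[str, List[float]] = {
    "drug_X_GI50": [6.9, 3.8, 6.5, 4.1],  # drug response
    "mutation_1": [1, 0, 1, 0],           # mutation present or absent
}
features: Dict[str, List[float]] = {
    "gene_1": [10.0, 2.0, 9.0, 3.0],      # expression
    "kchain_1": [0.0, 5.0, 1.0, 6.0],     # abundance
}

# Score every (trait, feature) pair.
associations: Dict[Tuple[str, str], float] = {
    (t, f): pearson(t_values, f_values)
    for t, t_values in traits.items()
    for f, f_values in features.items()
}
for pair, r in sorted(associations.items(), key=lambda item: -abs(item[1])):
    print(pair, round(r, 3))
```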
23. Expected results of BiAssociation and P-clustering
• We expect to find associations similar to those found in the initial paper (between breast cancer markers in
different cell lines and their drug response).
• We will probably find more markers among isoforms and exons, and also among REs (known and putative), and
these can also be associated with specific drug responses.
• P-clustering can give modules of co-associated features (drug response, expression, mutations, etc.).
25. Educational tasks:
1) Cell-line (or species- or data set-) specificity by gene, isoform and exon expression and by RE (known and
putative) abundances.
Characterizing the data: PCA visualization, batch effect, kchain extension and annotation, artefacts (probably).
2) Mutations: how to find and annotate them?
3) BiAssociation of data from different kinds of sources
Editor's Notes
Breast cancer is a clinically and genomically heterogeneous disease. Six subtypes were defined approximately a decade ago based on transcriptional characteristics and were designated luminal A, luminal B, ERBB2-enriched, basal-like, claudin-low and normal-like [1,2]. New cancers can be assigned to these subtypes using a 50-gene transcriptional signature designated the PAM50 [1]. However, the number of distinct subtypes is increasing steadily as multiple data types are integrated. Integration of genome copy number and transcriptional profiles defines 10 subtypes [3], and adding mutation status [4], methylation pattern [5], pattern of splice variants [6], protein and phosphoprotein expression [7] and microRNA expression and pathway activity [8] may define still more subtypes. The Cancer Genome Atlas (TCGA) project and other international genomics efforts were founded to improve our understanding of the molecular landscapes of most major tumor types with the ultimate goal of increasing the precision with which individual cancers are managed. One application of these data is to identify molecular signatures that can be used to assign specific treatment to individual patients. However, strategies to develop optimal predictive marker sets are still being explored. Indeed, it is not yet clear which molecular data types (genome, transcriptome, proteome, and so on) will be most useful as response predictors.