Predicting Gene Expression By Copy Number Alterations and
Methylation Data in Cancer
Abstract
Cancer cells harm the body through uncontrolled proliferation, which is caused by genetic and
epigenetic factors, such as copy number alterations and methylation changes,
that alter gene expression. Cancer is driven by
irregular expression of certain oncogenes that alter the cell cycle, cause proliferation, or
inhibit normal function.
To investigate these oncogenes, I created a bioinformatics tool to model
oncogene expression as a linear model predicted by copy number and methylation. The
statistical model identifies oncogenes and tumor-suppressor genes relevant to specific
cancers and which (epi)-genomic factors drive their expression. It extracts the latest live
data and selects only relevant copy number data and methylation clusters in analysis,
increasing accuracy of the expression model.
We used the R programming language to package the model, using data from
The Cancer Genome Atlas (TCGA), a cancer database, to retrieve expression data.
Starting with glioblastoma and rapidly expanding to include all cancers in TCGA, I
developed a program that summarizes the disease by genes most instrumental to its
expression and offers a comprehensive individual analysis per gene. A ranking
algorithm sorts genes by relevance and displays influencing factors. Conclusions about
disease-specific gene expression can be quickly derived with this package.
1 INTRODUCTION
Cancer is a life-threatening disease that kills over 20,000 people every day. Multiple
types of cancers exist and they can strike at any part of the body. Cancer is caused by irregular
gene expression, and genes control all functions of the body.1
Certain cancer-causing genes,
labeled as oncogenes when mutated to inhibit normal function, can lead to unwanted cell
proliferation, cell-cycle irregularities, and tumor progression when activated. It is important to
identify these oncogenes; however, cancer is a very general term for multiple specific diseases
and specific cancers have different oncogenes that control their expression. Our goal is to find a
way to identify a wide array of these oncogenes and their level of impact on a specific cancer. It
is also desired to find which genetic or epigenetic factors control the expression of these
oncogenes.
Throughout the field of cancer biology, researchers search for genes that control
cancerous mechanisms, such as cell cycle acceleration, proliferation, and malignant behavior.
Many advances have been made in the field with regard to finding these oncogenes, and targeted
drug delivery to inhibit the activity of these oncogenes has greatly increased life expectancy for
cancer patients.2
The death rate due to cancer goes down each year due to medicinal advancement and the
discovery of these important oncogenes.3
Bioinformaticians have studied cancer using the wide
variety of data available in databases spanning the world wide web. Many databases, websites,
and critical papers have investigated the role of specific genes in specific cancers, allowing for
medicinal treatment and drug delivery to accommodate for these genes and greatly improve
preventive treatment.4
This data analysis is essential to the field of cancer research, as without targeted gene
analysis new cancer-causing pathways and mechanisms would not be discovered. In light of this
information, we decided to create a comprehensive tool that studies gene expression using
genetic or epigenetic predictors to ultimately provide a comprehensive gene analysis that
pinpoints the most important oncogenes in any given cancer and understand their expression.
Further research can be applied on these genes of high importance in cancer to facilitate cancer
research and improve gene-targeted drug delivery to cure various forms of cancer. We aim to
create a product that can help to identify more of the many genes listed as cancer biomarkers.5
A bioinformatics package was created to model oncogene expression in cancer. Gene
expression was modeled with a linear fit, using copy number alterations and methylation data as
predictors for the model. Copy number alterations refer to gains or losses in the number of copies of
a gene relative to the normal genome. We retrieve these data through a GISTIC analysis of raw copy
number data.6
Methylation is an epigenetic factor that controls gene expression by attaching a
methyl-group to a CpG site, commonly located on the promoter region of the gene, which
compresses the DNA and generally inhibits transcription.7
Raw human methylation data with
general consensus clustering is used to study methylation directly related to these oncogenes.
Methylation can also be modeled using MethylMix, an algorithm which uses differential
and functional methylation states to group methylation into clusters.8
Gene expression can be
mapped using mRNA sequencing data (mRNAseq) or gene expression microarray data, both
standard ways to acquire gene expression data from patients.
The package acquires this wealth of data from The Cancer Genome Atlas (TCGA). The
Cancer Genome Atlas is a source that is home to data from over 30 cancers and compiles genetic
data retrieved from multiple patients. TCGA is where the methylation, copy number, and gene
expression data are retrieved. The use of the R programming language for statistics was essential
in the creation of this package. The finished package will let users specify the cancer
they wish to study and request the types of expression and methylation
data they want. The package allows users to quickly come to conclusions on which oncogenes
drive which cancers, and whether copy number, methylation, or both predictors drive the
expression of these oncogenes. The R package, once completed, will be released on
Bioconductor,9
a home for bioinformaticians to release software and will be available to any user
who quickly wants to come to conclusions on oncogenes in cancers.
For users, the package will run using the following general pipeline: The datasets, with
the specific types of data they desire for methylation and gene expression, will be downloaded
securely and locally to their hard drive, where a series of preprocessing steps will occur,
including batch correction, missing value estimation, and patient matching for multiple datasets.
A pool of genes of interest is extracted from the datasets, applying a variance filter to select a
percent of genes specified by the user. A linear model is applied to these genes, combining only
relevant and statistically significant parts of the model for each gene to ensure the most accurate
predictive model for the gene and its drivers. Copy number alterations are added to the model by
relevance and relevant methylation clusters are combined in a linear combination to produce the
final methylation coefficient. Only relevant components of the expression model of the gene,
whether it is copy number alterations, methylation clusters, or both, are included in the
individual gene model. In this manner, extremely accurate and unique models are created for
each gene of interest for a given cancer.
It was decided to run our software on glioblastoma (GBM), a rare yet well-documented
brain cancer, to compare the results retrieved from our package to established results on the
disease. The package then was quickly expanded to be able to implement the model on all
cancers available through TCGA. The code produces individual gene graphs, presents raw data,
and shows a full disease summary view, indicating the most important genes for specific cancers
and how they are driven. The disease summary allows users to see incredibly accurate models for
expression that are fitted for multiple genes of interest and show how a specific cancer is being
driven as a whole, making it a great tool to find genes in cancers for further research.
Overall, the model was a success in finding genes relevant to the expression of GBM. It
is able to differentiate between genes driven only by copy number alterations, by methylation
clusters, or both. The package is still under development and it will be released soon. In future
updates to the software, more complex, higher dimensional models may be used. 3D graphs may
be used to display these complex gene models, and new predictors may accompany copy number
and methylation as drivers for the expression model. A ground-up, simple approach to creating
a gene overview tool for cancers is needed, and it is now implemented using data from
TCGA and analysis in the R programming language.
2 MATERIALS AND METHODS
The expression model using copy number and methylation as predictors is implemented
as a user package rather than a one-time analysis of genes. As such, the methods outlined here can
be thought of as a pipeline that the package goes through upon its execution rather than a
traditional procedure.
2.1 Downloading Data
A designated R script is used to dynamically download data from the Broad Institute’s
TCGA server. I initially wrote the script using the Broad Institute’s Firehose tool to acquire the
data, but a graduate student updated the script to use curl and wget commands to retrieve data
from Broad Institute URLs; the script can retrieve all the files we need to process and analyze.
There are two types of data the user can select from for gene expression, which are mRNAseq
data and microarray data. Selections for methylation data include raw clustered methylation and
MethylMix data. Copy number data used in the model are raw data run through a GISTIC
analysis. The user decides for which cancer to retrieve data, and by the end of the downloading
phase, expression, copy number, and methylation datasets are locally stored on the user’s
computer.
2.2 Preprocessing
Once the datasets are retrieved, several preprocessing steps must be completed to get the
datasets ready for analysis. The output of each step is stored in a separate dataset so the user
has access to the raw data at every step. A series of scripts written by my mentor processes the
data to move to the next step. The code begins by estimating missing values using the k-nearest-
neighbor method. A 10% threshold of missing values is used to determine whether to eliminate a
gene or patient from the dataset. Standard TCGA batch correction is done on the resulting dataset
to normalize patient data across multiple facilities of data collection. I also had to account
for mismatched patients and genes to ensure modeling is accurate across those multiple datasets.
The algorithm only tests patients with a pure form of the cancer, so I had to remove pan-cancer,
metastatic, and healthy control patients. After these steps, the data are ready for linear modeling
analysis.
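The filtering and matching steps above can be sketched as follows. This is only an illustration in Python, not the package's R code: the data layout (a dictionary mapping each gene to per-patient values, with None marking a missing value) and the function names are assumptions, and the k-nearest-neighbor estimation and batch correction steps are not shown.

```python
# Illustrative sketch; the data layout and function names are hypothetical.
# A dataset maps gene -> {patient_id: value}, with None marking a missing value.

def drop_sparse_genes(dataset, max_missing=0.10):
    """Keep genes whose fraction of missing values is at most max_missing."""
    kept = {}
    for gene, values in dataset.items():
        missing = sum(1 for v in values.values() if v is None)
        if values and missing / len(values) <= max_missing:
            kept[gene] = values
    return kept


def patients_of(dataset):
    """Collect every patient identifier that appears in a dataset."""
    return {p for values in dataset.values() for p in values}


def match_patients_and_genes(*datasets):
    """Restrict each dataset to the genes and patients present in all of them."""
    shared_genes = set.intersection(*(set(d) for d in datasets))
    shared_patients = set.intersection(*(patients_of(d) for d in datasets))
    return [
        {
            g: {p: d[g][p] for p in sorted(shared_patients) if p in d[g]}
            for g in sorted(shared_genes)
        }
        for d in datasets
    ]
```

The same threshold could be applied to patients by swapping the roles of genes and patients. Matching is done after filtering so that a gene removed from one dataset is also removed from the others.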
2.3 Linear Modeling of Genomic and Epigenomic Data
After the data are ready, the program starts analysis by selecting a pool of genes, which
we label as the genes of interest, given a user prompt for the percentage of total genes available.
To select genes of interest, all genes are ranked by a measure of how much their values vary across
patients, such as variance, interquartile range, or mean absolute deviation. The code then selects the
top percent of genes, as set by user input, to be the genes of interest. Individual gene analysis is
performed on these genes. The variance filter is a basic and fast way to take a
percent of genes for analysis.
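A minimal sketch of such a variance filter is shown here in Python for illustration. The function name and the input layout (gene name mapped to a list of expression values, one per patient) are assumptions rather than the package's actual interface.

```python
import math
import statistics


def select_genes_of_interest(expression, percent):
    """Return the top `percent` of genes ranked by sample variance.

    expression: dict mapping gene name -> list of values (hypothetical layout).
    percent: number in (0, 100] supplied by the user.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    # Rank genes from most to least variable across patients.
    ranked = sorted(
        expression,
        key=lambda gene: statistics.variance(expression[gene]),
        reverse=True,
    )
    # Always keep at least one gene; round the count upward.
    n_keep = max(1, math.ceil(len(ranked) * percent / 100))
    return ranked[:n_keep]
```

Replacing statistics.variance with another measure of spread, such as an interquartile range, changes only the key function.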
The linear model for expression (EXP) using copy number (CN) and methylation (MET)
as predictors can be easily interpreted as a linear combination of beta values for copy number
alterations and methylation data for each gene, and can be stated as follows.
\[ EXP_{GENE} = \beta_1 \cdot CN_{GENE} + \beta_2 \cdot MET_{GENE} \]
The linear model executes by splitting itself up into its two-dimensional components and
combining the statistically significant beta values into the above equation. The first step in this
process is to model expression based on copy number and methylation separately:
\[ EXP_{GENE} = \beta_1 \cdot CN_{GENE}, \qquad EXP_{GENE} = \beta_2 \cdot MET_{GENE} \]
The copy number portion of the model is rendered separately, and if the beta values of
this separate regression are significant, this beta is included in the overall model for gene
expression. On the other hand, methylation has another nuance in clusters. The separate model
for expression based on methylation can be modeled as a linear combination of the different n
methylation clusters for each gene as shown here:
\[ EXP_{GENE} = \sum_{C=1}^{n} \beta_{2\text{-Cluster}C} \cdot MET_{GENE\text{-Cluster}C} \]
A linear regression is applied on the clustered methylation model and significant
coefficients are included in a second regression model for methylation, including only significant
beta values. A linear regression is applied on the subsequent equation to retrieve fitted values.
The linear coefficient for these fitted values can be determined as the significant beta value for
methylation as a whole. So, the linear model can now be summed up in the following simple
equation based on significance:
\[ EXP_{GENE} = \beta_{1\text{-signif}} \cdot CN_{GENE} + \beta_{2\text{-signif}} \cdot MET_{GENE} \]
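The staged procedure described above can be sketched as follows. This is a hedged illustration in plain Python, not the package's R implementation. Several details are assumptions: the function names are hypothetical, an intercept is fitted (as R's lm does by default) even though the equations in the text omit it, and p-values use a normal approximation to the t distribution, which is only reasonable for larger sample sizes.

```python
from statistics import NormalDist


def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(predictors, y):
    """Least-squares fit of y on the predictor columns plus an intercept.

    Returns (coefficients, p_values, fitted); index 0 is the intercept.
    Requires more observations than fitted coefficients.
    """
    n = len(y)
    cols = [[1.0] * n] + [[float(v) for v in c] for c in predictors]
    k = len(cols)
    xtx = [[sum(p * q for p, q in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(p * q for p, q in zip(ci, y)) for ci in cols]
    beta = solve(xtx, xty)
    fitted = [sum(beta[j] * cols[j][i] for j in range(k)) for i in range(n)]
    sigma2 = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / (n - k)
    p_values = []
    for j in range(k):
        unit = [1.0 if i == j else 0.0 for i in range(k)]
        var_j = sigma2 * solve(xtx, unit)[j]  # j-th diagonal of sigma2 * (X'X)^-1
        if var_j <= 0:
            p_values.append(0.0)  # perfect fit: treat as significant
        else:
            z = abs(beta[j]) / var_j ** 0.5
            p_values.append(2 * (1 - NormalDist().cdf(z)))  # normal approximation
    return beta, p_values, fitted


def model_gene(exp, cn, met_clusters, alpha=0.05):
    """Return (drivers kept in the final model, final coefficients or None)."""
    predictors, drivers = [], []
    # Copy number is tested on its own.
    if ols([cn], exp)[1][1] < alpha:
        predictors.append(cn)
        drivers.append("CN")
    # Methylation, stage 1: regress on all clusters, keep the significant ones.
    p_met = ols(met_clusters, exp)[1][1:]
    kept = [c for c, p in zip(met_clusters, p_met) if p < alpha]
    if kept:
        # Stage 2: refit on the kept clusters; the fitted values act as MET.
        predictors.append(ols(kept, exp)[2])
        drivers.append("MET")
    coefficients = ols(predictors, exp)[0] if predictors else None
    return drivers, coefficients
```

A gene for which neither driver passes the filter returns an empty driver list and no coefficients, which corresponds to a gene labeled insignificant.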
If an entire driver, whether it be copy number or methylation, is insignificant, the final
model will simply be rendered without that driver. Sometimes both copy number and
methylation data are completely insignificant, suggesting that neither factor explains the gene’s
expression in that cancer. In this way, we have created a tool that lists the most significant genes and how
they are expressed, creating a unique model for each gene.
The beta values are filled in once the regression is applied. To judge how strongly a
gene’s expression is tied to its predictors in a specific cancer, we look at the p-values and r-squared
values of the regression. Throughout the process of evaluation for statistical significance, we use a
p-value filter determined by the user to remove statistically insignificant parts of the model. The
p-value of a coefficient is the probability of observing an association at least as strong as the one
measured if the predictor in fact had no effect on expression. The r-squared value refers to the coverage of the model, or
the fraction of the variation in expression that the fit explains. We look for low p-values and high r-squared
values to determine statistical significance of a gene in a given cancer.
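For reference, the two statistics can be written out using their standard textbook definitions. The notation is an assumption introduced here: i indexes the N patients, \widehat{EXP}_i is the fitted expression value for patient i, \overline{EXP} is the mean expression across patients, and SE denotes the standard error of an estimated coefficient.

```latex
R^2 = 1 - \frac{\sum_{i=1}^{N} \left( EXP_i - \widehat{EXP}_i \right)^2}
               {\sum_{i=1}^{N} \left( EXP_i - \overline{EXP} \right)^2},
\qquad
t = \frac{\hat{\beta}}{SE(\hat{\beta})}
```

The p-value of a coefficient is obtained by comparing its t statistic with a t distribution, so a coefficient that is large relative to its standard error yields a small p-value.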
In interpreting these linear models, one would generally expect positive copy number
coefficients and negative methylation coefficients. Recalling that copy number refers to the
number of copies of the gene and that methylation generally represses transcription of the gene, one can
expect copy number and methylation to correlate positively and negatively, respectively, with gene
expression. However, because the effect of methylation is not fully understood and can vary by context, we
leave this part of the model up to interpretation.
2.4 Data Visualization
Users can see results instantaneously upon visualizing the data generated. Three ways to
visualize the data are available through this package, designed to work together to understand
gene expression in any given cancer.
All raw data are shown, and users can see whether genes are driven by copy number,
methylation, or both. The user can also see graphs of individual gene-by-gene analysis, which
shows the linear fit line across copy number and methylation and displays correlation statistics of
the gene. A disease summary allows the user to see how genes are being driven across the entire
disease. It also shows the genes with the highest statistical correlation in the dataset.
I created these data visualization tools using the R package ggplot2, a graphics engine
that allows for several types of data visualization.
3 RESULTS
All results displayed here are run on the brain cancer glioblastoma using microarray data
for gene expression, GISTIC data for copy number alterations, and MethylMix for methylation
clusters. However, the pipeline can be run on any disease, with options for types of data in
expression and methylation.
3.1 Individual Gene Analysis
Graphs are printed of the linear regression of individual genes. Plotting the regression is
the fastest way to visualize the model for a gene. This example shows a model for
expression based on only methylation for the gene FABP5, which controls fatty acids and
regulates control of growth factors.10 Without proper gene expression, the cell’s growth factor
communication may lose control, resulting in cancerous tumor growth.
Figure 1: Modeling expression by methylation in FABP5. Noticing the 59% coverage as indicated by the r-squared
value, we can infer that this gene is being downregulated by methylation in glioblastoma patients.
EGFR encodes a growth factor receptor involved in cell division and proliferation, and its expression tends to be
highly correlated with copy number in many cancers, including glioblastoma.11
Figure 2: Individual gene analysis of EGFR in glioblastoma. This gene holds a statistical correlation for copy
number and MethylMix methylation. As the two separate models (blue) are combined (red), we can see that this
gene is driven by copy number alteration.
PDLIM4 is a gene that is heavily methylated in several cancers, and its repression due to this
hypermethylation is listed as a biomarker for several cancers.
Figure 3: Individual gene analysis of PDLIM4, an actin protein coding gene, in glioblastoma. This gene holds a
statistical correlation for copy number and MethylMix methylation. As the two separate models (blue) are combined
(red), we can see that this gene is driven by methylation.
3.2 Raw Linear Model Data
The raw data show the accuracy and scope of the expression model with relation to copy
number alterations, methylation data, or both predictors. In this chart, the raw data have
been analyzed to see whether genes are driven by copy number, methylation, or both factors. All of these genes
have both copy number and methylation rendered as significant, that is, with p-values below the p-value filter
(the default filter is p < 0.05).
Figure 4: Linear Model Data – Users are given all the raw data for genes of interest in glioblastoma. Combining
copy number and methylation into one model facilitates analysis of drivers for gene expression. (0* indicates value
< 1e-10)
3.3 Disease Summary
The disease summary describes in a holistic view how the disease is being driven. A pie
chart displays the genes driven by copy number, methylation, or both factors, and a percent of
genes will be labeled insignificant. Sorting the genes of interest by relevance can show the most
important genes for any cancer. Because of the selection for only relevant clusters, we can
eliminate a number of genes as insignificant, and we can accurately and precisely model the
expression of important genes, as shown in the disease summary.
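The tallying and ranking behind such a summary can be sketched as follows. This Python sketch is illustrative only: the input layout (each gene mapped to its set of significant drivers and its r-squared value) and the function name are assumptions, and it produces the numbers behind the charts rather than the charts themselves.

```python
from collections import Counter


def summarize_disease(gene_models, top_n=10):
    """Return (percent of genes per driver category, top genes by r-squared).

    gene_models: dict mapping gene -> {"drivers": set of "CN" and/or "MET",
    "r_squared": float}. This layout is hypothetical.
    """
    def label(drivers):
        if drivers == {"CN", "MET"}:
            return "both"
        if drivers == {"CN"}:
            return "copy number"
        if drivers == {"MET"}:
            return "methylation"
        return "insignificant"

    total = len(gene_models)
    if total == 0:
        return {}, []
    counts = Counter(label(set(m["drivers"])) for m in gene_models.values())
    shares = {category: 100 * n / total for category, n in counts.items()}
    # Only genes with at least one significant driver are ranked.
    significant = [g for g, m in gene_models.items() if m["drivers"]]
    ranked = sorted(
        significant, key=lambda g: gene_models[g]["r_squared"], reverse=True
    )
    return shares, ranked[:top_n]
```

The shares correspond to the slices of the pie chart, and the ranked list corresponds to the table of top genes, with relevance measured by r-squared as in the text.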
Figure 5: Pie chart showing drivers for significant genes of a pool of genes of interest in glioblastoma.
By keeping only the most significant coefficients, the model screens out genes that are not significant,
allowing for a wider spread of genes. 73% of genes have unique models, as shown here.
Figure 6: Top genes listed in a disease summary for glioblastoma. At a glance, users can see genes sorted by
relevance for any given cancer. Relevance is denoted using the r-squared value.
4 DISCUSSION
Through the three visualization tools of the linear model software, we can find the most
important genes in a cancer, visualize them separately, determine the dominant factor in these
oncogenes, and get a glimpse of how the disease is driven overall.
When the model uses only methylation or only copy number values, we can easily see how the gene is
being driven. For example, methylation in FABP5 is downregulating gene expression, as
expected under the traditional view of methylation blocking transcription. We can then infer
that, because FABP5 has a large 59% coverage and a very low p-value, methylation in
FABP5 could be instrumental to the regulation of GBM and warrants further research.
EGFR and PDLIM4 in glioblastoma are examples of linear models that incorporate both
copy number alteration and methylation clusters. We can see that EGFR has a strong correlation
in copy number and methylation, but its relevance is much higher with copy number than with
methylation. If EGFR were studied with just copy number or just methylation, we would not be able
to see the copy number dominate the expression of EGFR as in the combined model. Similarly,
in PDLIM4, we see methylation as the ultimate driver for expression after the copy number and
methylation models are combined.
The separate copy number and methylation models can also work together and promote
synergy, as shown in the raw data for PDGFRA. We can see the relevance of the model increase
from 41% in methylation and 22% in copy number to 55% overall in the combined model. This
synergy shows an inherent strength in the model to combine components and increase accuracy
and relevance of the model.
Summarizing the disease based on its drivers allows us to see the overall expression of
the disease. It also shows the package’s efficacy in combining models, producing 25% of models
as combined models with copy number and methylation. By selecting only for relevant parts of
the model, the model also inherently selects for the relevant genes, allowing the code to
filter out irrelevant genes and permitting the user to focus on the genes that have
statistical relevance to the cancer.
5 CONCLUSIONS AND FUTURE WORK
Overall, the linear model package is a simple yet effective way to model gene expression
in cancer. It allows us to come to striking conclusions by analyzing data available on TCGA in
real time. The package provides a comprehensive view of genes with a strong correlation to other
specific factors, which allows users to find and continue research on unexplored genes that are
found using this tool. Mechanisms related to these unexplored genes with high correlation to
copy number, methylation, or both factors can be explored to understand what drives specific
cancers and, ultimately, to counteract those mechanisms and improve cancer treatment.
Simple and comprehensive as the linear model is, it has its faults. For example,
exponential, logarithmic, polynomial, and otherwise nonlinear relationships between gene
expression and copy number alterations or methylation data can go unnoticed. In the future, these
complex relationships can be added to find more genes that have strong correlations to cancer.
We intend to add 3D modeling to gene-by-gene graphs to visualize the linear fit as a
plane over the copy number and methylation axes. More predictors of
gene expression, such as miRNA, may be included in the model in the future to account for more
factors in gene expression and make the model more accurate.13 These more complex higher-dimensional
models allow more places for strong statistical correlations between gene expression
and its predictors, and allow multiple statistically significant predictors to work together in
synergy.
Creating the software package from scratch was a great way to explore the field of
bioinformatics. By providing data analysis tools, I learned more about the biological and
statistical concepts of this field. My lab provided the introduction and the framework necessary
for my understanding on how to create this product, and through this background, I built a
powerful product that can serve the bioinformatics field.
In summation, we modeled gene expression as a linear model using copy number and
methylation, and the result is a fast, elegant, and comprehensive way to find important
oncogenes and analyze their behavior and impact on any given cancer.

Mining Gene Expression Data Focusing Cancer Therapeutics: A Digest
KaashivInfoTech Company
 
Mining of Important Informative Genes and Classifier Construction for Cancer ...
Mining of Important Informative Genes and Classifier Construction for Cancer ...Mining of Important Informative Genes and Classifier Construction for Cancer ...
Mining of Important Informative Genes and Classifier Construction for Cancer ...
ijsc
 
MINING OF IMPORTANT INFORMATIVE GENES AND CLASSIFIER CONSTRUCTION FOR CANCER ...
MINING OF IMPORTANT INFORMATIVE GENES AND CLASSIFIER CONSTRUCTION FOR CANCER ...MINING OF IMPORTANT INFORMATIVE GENES AND CLASSIFIER CONSTRUCTION FOR CANCER ...
MINING OF IMPORTANT INFORMATIVE GENES AND CLASSIFIER CONSTRUCTION FOR CANCER ...
ijsc
 
Biomed central
Biomed centralBiomed central
Interrogating differences in expression of targeted gene sets to predict brea...
Interrogating differences in expression of targeted gene sets to predict brea...Interrogating differences in expression of targeted gene sets to predict brea...
Interrogating differences in expression of targeted gene sets to predict brea...
Enrique Moreno Gonzalez
 
APPLICATION OF NEXT GENERATION SEQUENCING (NGS) IN CANCER TREATMENT
APPLICATION OF  NEXT GENERATION SEQUENCING (NGS)  IN CANCER TREATMENTAPPLICATION OF  NEXT GENERATION SEQUENCING (NGS)  IN CANCER TREATMENT
APPLICATION OF NEXT GENERATION SEQUENCING (NGS) IN CANCER TREATMENT
Dinie Fariz
 
Clustering Approaches for Evaluation and Analysis on Formal Gene Expression C...
Clustering Approaches for Evaluation and Analysis on Formal Gene Expression C...Clustering Approaches for Evaluation and Analysis on Formal Gene Expression C...
Clustering Approaches for Evaluation and Analysis on Formal Gene Expression C...
rahulmonikasharma
 
Advances-in-cancer-research.pdf
Advances-in-cancer-research.pdfAdvances-in-cancer-research.pdf
Advances-in-cancer-research.pdf
ssuser54a9d9
 
Developing a framework for for detection of low frequency somatic genetic alt...
Developing a framework for for detection of low frequency somatic genetic alt...Developing a framework for for detection of low frequency somatic genetic alt...
Developing a framework for for detection of low frequency somatic genetic alt...
Ronak Shah
 
Next-Generation Sequencing and Data Analysis.pptx
Next-Generation Sequencing and Data Analysis.pptxNext-Generation Sequencing and Data Analysis.pptx
Next-Generation Sequencing and Data Analysis.pptx
SwetaTripathi13
 
EFFICACY OF NON-NEGATIVE MATRIX FACTORIZATION FOR FEATURE SELECTION IN CANCER...
EFFICACY OF NON-NEGATIVE MATRIX FACTORIZATION FOR FEATURE SELECTION IN CANCER...EFFICACY OF NON-NEGATIVE MATRIX FACTORIZATION FOR FEATURE SELECTION IN CANCER...
EFFICACY OF NON-NEGATIVE MATRIX FACTORIZATION FOR FEATURE SELECTION IN CANCER...
IJDKP
 
EFFICACY OF NON-NEGATIVE MATRIX FACTORIZATION FOR FEATURE SELECTION IN CANCER...
EFFICACY OF NON-NEGATIVE MATRIX FACTORIZATION FOR FEATURE SELECTION IN CANCER...EFFICACY OF NON-NEGATIVE MATRIX FACTORIZATION FOR FEATURE SELECTION IN CANCER...
EFFICACY OF NON-NEGATIVE MATRIX FACTORIZATION FOR FEATURE SELECTION IN CANCER...
IJDKP
 
Genomics and proteomics in drug discovery and development
Genomics and proteomics in drug discovery and developmentGenomics and proteomics in drug discovery and development
Genomics and proteomics in drug discovery and development
SuchittaU
 
Clinical Assessment In Incorporating a Personal Genome
Clinical Assessment In Incorporating a Personal GenomeClinical Assessment In Incorporating a Personal Genome
Clinical Assessment In Incorporating a Personal Genome
Diego Herrera
 
coad_machine_learning
coad_machine_learningcoad_machine_learning
coad_machine_learningFord Sleeman
 
overview on Next generation sequencing in breast csncer
overview on Next generation sequencing in breast csnceroverview on Next generation sequencing in breast csncer
overview on Next generation sequencing in breast csncer
Seham Al-Shehri
 
John Boikov Personalised Medicine Essay, Mark - 95 out of 100
John Boikov Personalised Medicine Essay, Mark - 95 out of 100John Boikov Personalised Medicine Essay, Mark - 95 out of 100
John Boikov Personalised Medicine Essay, Mark - 95 out of 100John Boikov
 
Role of biotechnology in cancer control
Role of biotechnology in cancer controlRole of biotechnology in cancer control
Role of biotechnology in cancer control
Janani Gopalarethinam
 

Similar to Geveart Lab SIMR Paper (20)

Mining Gene Expression Data Focusing Cancer Therapeutics: A Digest
Mining Gene Expression Data Focusing Cancer Therapeutics: A DigestMining Gene Expression Data Focusing Cancer Therapeutics: A Digest
Mining Gene Expression Data Focusing Cancer Therapeutics: A Digest
 
Biotech2012spring 1-overview 0
Biotech2012spring 1-overview 0Biotech2012spring 1-overview 0
Biotech2012spring 1-overview 0
 
Mining of Important Informative Genes and Classifier Construction for Cancer ...
Mining of Important Informative Genes and Classifier Construction for Cancer ...Mining of Important Informative Genes and Classifier Construction for Cancer ...
Mining of Important Informative Genes and Classifier Construction for Cancer ...
 
MINING OF IMPORTANT INFORMATIVE GENES AND CLASSIFIER CONSTRUCTION FOR CANCER ...
MINING OF IMPORTANT INFORMATIVE GENES AND CLASSIFIER CONSTRUCTION FOR CANCER ...MINING OF IMPORTANT INFORMATIVE GENES AND CLASSIFIER CONSTRUCTION FOR CANCER ...
MINING OF IMPORTANT INFORMATIVE GENES AND CLASSIFIER CONSTRUCTION FOR CANCER ...
 
Biomed central
Biomed centralBiomed central
Biomed central
 
Interrogating differences in expression of targeted gene sets to predict brea...
Interrogating differences in expression of targeted gene sets to predict brea...Interrogating differences in expression of targeted gene sets to predict brea...
Interrogating differences in expression of targeted gene sets to predict brea...
 
APPLICATION OF NEXT GENERATION SEQUENCING (NGS) IN CANCER TREATMENT
APPLICATION OF  NEXT GENERATION SEQUENCING (NGS)  IN CANCER TREATMENTAPPLICATION OF  NEXT GENERATION SEQUENCING (NGS)  IN CANCER TREATMENT
APPLICATION OF NEXT GENERATION SEQUENCING (NGS) IN CANCER TREATMENT
 
Clustering Approaches for Evaluation and Analysis on Formal Gene Expression C...
Clustering Approaches for Evaluation and Analysis on Formal Gene Expression C...Clustering Approaches for Evaluation and Analysis on Formal Gene Expression C...
Clustering Approaches for Evaluation and Analysis on Formal Gene Expression C...
 
Advances-in-cancer-research.pdf
Advances-in-cancer-research.pdfAdvances-in-cancer-research.pdf
Advances-in-cancer-research.pdf
 
Developing a framework for for detection of low frequency somatic genetic alt...
Developing a framework for for detection of low frequency somatic genetic alt...Developing a framework for for detection of low frequency somatic genetic alt...
Developing a framework for for detection of low frequency somatic genetic alt...
 
Next-Generation Sequencing and Data Analysis.pptx
Next-Generation Sequencing and Data Analysis.pptxNext-Generation Sequencing and Data Analysis.pptx
Next-Generation Sequencing and Data Analysis.pptx
 
EFFICACY OF NON-NEGATIVE MATRIX FACTORIZATION FOR FEATURE SELECTION IN CANCER...
EFFICACY OF NON-NEGATIVE MATRIX FACTORIZATION FOR FEATURE SELECTION IN CANCER...EFFICACY OF NON-NEGATIVE MATRIX FACTORIZATION FOR FEATURE SELECTION IN CANCER...
EFFICACY OF NON-NEGATIVE MATRIX FACTORIZATION FOR FEATURE SELECTION IN CANCER...
 
EFFICACY OF NON-NEGATIVE MATRIX FACTORIZATION FOR FEATURE SELECTION IN CANCER...
EFFICACY OF NON-NEGATIVE MATRIX FACTORIZATION FOR FEATURE SELECTION IN CANCER...EFFICACY OF NON-NEGATIVE MATRIX FACTORIZATION FOR FEATURE SELECTION IN CANCER...
EFFICACY OF NON-NEGATIVE MATRIX FACTORIZATION FOR FEATURE SELECTION IN CANCER...
 
Genomics and proteomics in drug discovery and development
Genomics and proteomics in drug discovery and developmentGenomics and proteomics in drug discovery and development
Genomics and proteomics in drug discovery and development
 
Clinical Assessment In Incorporating a Personal Genome
Clinical Assessment In Incorporating a Personal GenomeClinical Assessment In Incorporating a Personal Genome
Clinical Assessment In Incorporating a Personal Genome
 
RapportHicham
RapportHichamRapportHicham
RapportHicham
 
coad_machine_learning
coad_machine_learningcoad_machine_learning
coad_machine_learning
 
overview on Next generation sequencing in breast csncer
overview on Next generation sequencing in breast csnceroverview on Next generation sequencing in breast csncer
overview on Next generation sequencing in breast csncer
 
John Boikov Personalised Medicine Essay, Mark - 95 out of 100
John Boikov Personalised Medicine Essay, Mark - 95 out of 100John Boikov Personalised Medicine Essay, Mark - 95 out of 100
John Boikov Personalised Medicine Essay, Mark - 95 out of 100
 
Role of biotechnology in cancer control
Role of biotechnology in cancer controlRole of biotechnology in cancer control
Role of biotechnology in cancer control
 

Geveart Lab SIMR Paper

  • 1. Predicting Gene Expression By Copy Number Alterations and Methylation Data in Cancer Abstract Cancer cells harm the body through proliferation, caused by (epi)-genetic mutations or factors, such as copy number alterations and methylation changes influencing genetic activity through differential gene expression. Cancer is driven by irregular expression of certain oncogenes that alter the cell cycle, cause proliferation, or inhibit normal function. To investigate these oncogenes, I created a bioinformatics tool to model oncogene expression as a linear model predicted by copy number and methylation. The statistical model identifies oncogenes and tumor-suppressor genes relevant to specific cancers and which (epi)-genomic factors drive their expression. It extracts the latest live data and selects only relevant copy number data and methylation clusters in analysis, increasing accuracy of the expression model. We used the R programming language to package the model, using data from The Cancer Genome Atlas (TCGA), a cancer database, to retrieve expression data. Starting with glioblastoma and rapidly expanding to include all cancers in TCGA, I developed a program that summarizes the disease by genes most instrumental to its expression and offers a comprehensive individual analysis per gene. A ranking algorithm sorts genes by relevance and displays influencing factors. Conclusions about disease-specific gene expression can be quickly derived with this package.
  • 2. 1 INTRODUCTION Cancer is a life-threatening disease that kills over 20,000 people every day. Multiple types of cancer exist, and they can strike any part of the body. Cancer is caused by irregular gene expression, and genes control all functions of the body.1 Certain cancer-causing genes, labeled as oncogenes when mutated to inhibit normal function, can lead to unwanted cell proliferation, cell-cycle irregularities, and tumor progression when activated. It is important to identify these oncogenes; however, cancer is a very general term for multiple specific diseases, and specific cancers have different oncogenes that control their expression. Our goal is to find a way to identify a wide array of these oncogenes and their level of impact on a specific cancer. We also aim to find which genetic or epigenetic factors control the expression of these oncogenes. Throughout the field of cancer biology, researchers search for genes that control cancerous mechanisms, such as cell cycle acceleration, proliferation, and malignant behavior. Many advances have been made in finding these oncogenes, and targeted drug delivery to inhibit their activity has greatly increased life expectancy for cancer patients.2 The death rate due to cancer goes down each year due to medicinal advancement and the discovery of these important oncogenes.3 Bioinformaticians have studied cancer using the wide variety of data available in online databases. Many databases, websites, and critical papers have investigated the role of specific genes in specific cancers, allowing medicinal treatment and drug delivery to account for these genes and greatly improve preventive treatment.4 This data analysis is vital to the field of cancer research, as without targeted gene analysis new cancer-causing pathways and mechanisms would not be discovered.
  • 3. In light of this information, we decided to create a comprehensive tool that studies gene expression using genetic or epigenetic predictors, ultimately providing a gene analysis that pinpoints the most important oncogenes in any given cancer and explains their expression. Further research can be applied to these genes of high importance to facilitate cancer research and improve gene-targeted drug delivery to treat various forms of cancer. We aim to create a product that can help identify more of the many genes listed as cancer biomarkers.5 A bioinformatics package was created to model oncogene expression in cancer. Gene expression was modeled with a linear fit, using copy number alterations and methylation data as predictors. Copy number alterations refer to changes in the number of copies of a gene, which can in turn change its expression. We retrieve these data through a GISTIC analysis of raw copy number data.6 Methylation is an epigenetic factor that controls gene expression by attaching a methyl group to a CpG site, commonly located in the promoter region of the gene, which compresses the DNA and generally inhibits transcription.7 Raw human methylation data with general consensus clustering are used to study methylation directly related to these oncogenes. Methylation can also be modeled using MethylMix, an algorithm that uses differential and functional methylation states to group methylation into clusters.8 Gene expression can be measured using mRNA sequencing data (mRNAseq) or gene expression microarray data, both standard ways to acquire gene expression data from patients. The package acquires these data from The Cancer Genome Atlas (TCGA), which hosts data from over 30 cancers and compiles genetic data retrieved from multiple patients. TCGA is the source of the methylation, copy number, and gene expression data.
The R programming language for statistics was essential in creating this package.
  • 4. The finished package will let users enter the cancer they want to study and request the types of expression and methylation data they want. The package allows users to quickly determine which oncogenes drive which cancers, and whether copy number, methylation, or both drive the expression of these oncogenes. The R package, once completed, will be released on Bioconductor,9 a platform where bioinformaticians release software, and will be available to any user who wants to reach conclusions quickly about oncogenes in cancers. For users, the package will run using the following general pipeline. The datasets, with the specific types of methylation and gene expression data the user requests, are downloaded securely to the user’s local hard drive, where a series of preprocessing steps occurs, including batch correction, missing value estimation, and patient matching across datasets. A pool of genes of interest is extracted from the datasets by applying a variance filter that selects a percentage of genes specified by the user. A linear model is applied to these genes, combining only the relevant and statistically significant parts of the model for each gene to give the most accurate predictive model for the gene and its drivers. Copy number alterations are added to the model by relevance, and relevant methylation clusters are combined in a linear combination to produce the final methylation coefficient. Only relevant components of the expression model of the gene, whether copy number alterations, methylation clusters, or both, are included in the individual gene model. In this manner, a tailored model is created for each gene of interest for a given cancer. We chose to run our software on glioblastoma (GBM), a rare yet well-documented brain cancer, to compare the results from our package to established results on the disease.
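The variance filter step of this pipeline can be illustrated with a minimal sketch. It is written in Python for brevity rather than in the package's R, and the function name `genes_of_interest`, the dictionary layout, and the toy values are all assumptions made for illustration, not the package's actual interface.

```python
import math
from statistics import pvariance


def genes_of_interest(expression, percent):
    """Return the top `percent` of genes ranked by expression variance.

    `expression` maps a gene name to its expression values across patients
    (a hypothetical layout chosen for this sketch).
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    # Rank genes from most to least variable across patients.
    ranked = sorted(expression, key=lambda g: pvariance(expression[g]), reverse=True)
    # Always keep at least one gene; round the cut-off up.
    keep = max(1, math.ceil(len(ranked) * percent / 100))
    return ranked[:keep]


# Toy data: four made-up genes measured in five patients.
toy = {
    "GENE_A": [1.0, 1.1, 0.9, 1.0, 1.0],  # nearly constant
    "GENE_B": [0.0, 4.0, 8.0, 2.0, 6.0],  # highly variable
    "GENE_C": [2.0, 3.0, 2.5, 3.5, 2.0],
    "GENE_D": [5.0, 5.0, 5.0, 5.0, 5.0],  # constant
}
selected = genes_of_interest(toy, 50)
print(selected)
```

Swapping `pvariance` for another measure of dispersion would change only the `key` function, which is why the ranking and the cut-off are kept separate here.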
The package was then quickly expanded to run the model on all cancers available through TCGA.
  • 5. The code produces individual gene graphs, presents raw data, and shows a full disease summary view, indicating the most important genes for specific cancers and how they are driven. The disease summary lets users see expression models fitted for multiple genes of interest and shows how a specific cancer is driven as a whole, making it a useful tool for finding genes for further research. Overall, the model succeeded in finding genes relevant to the expression of GBM. It is able to differentiate between genes driven only by copy number alterations, only by methylation clusters, or by both. The package is still under development and will be released soon. In future updates to the software, more complex, higher-dimensional models may be used. 3D graphs may be used to display these complex gene models, and new predictors may accompany copy number and methylation as drivers for the expression model. A ground-up, simple approach to creating a gene overview tool for cancers is necessary and is now implemented using data from TCGA and analysis in the R programming language. 2 MATERIALS AND METHODS The expression model using copy number and methylation as predictors is implemented as a user package rather than a one-time analysis of genes. As such, the methods outlined here can be thought of as a pipeline that the package runs through upon execution rather than a traditional procedure. 2.1 Downloading Data A designated R script is used to dynamically download data from the Broad Institute’s TCGA server. I initially wrote the script using the Broad Institute’s Firehose tool to acquire the data, but a graduate student updated the script to use curl and wget commands to retrieve data from Broad Institute URLs; the script can retrieve all the files we need to process and analyze. There are two types of data the user can select from for gene expression, which are mRNAseq
  • 6. data and microarray data. Selections for methylation data include raw clustered methylation and MethylMix data. Copy number data used in the model are raw data run through a GISTIC analysis. The user decides for which cancer to retrieve data, and by the end of the downloading phase, expression, copy number, and methylation datasets are stored locally on the user’s computer. 2.2 Preprocessing Once the datasets are retrieved, several preprocessing steps are needed to get them ready for analysis. The output of each step is stored in a separate dataset so the user has access to the data at every stage. A series of scripts written by my mentor processes the data to move to the next step. The code begins by estimating missing values using the k-nearest-neighbor method. A 10% threshold of missing values is used to determine whether to eliminate a gene or patient from the dataset. Standard TCGA batch correction is done on the resulting dataset to normalize patient data across multiple data collection facilities. I also had to account for mismatched patients and genes to ensure modeling is accurate across the multiple datasets. The algorithm only tests patients with a pure form of the cancer, so I had to remove pan-cancer, metastatic, and healthy control patients. After these steps, the data are ready for linear modeling analysis. 2.3 Linear Modeling of Genomic and Epigenomic Data After the data are ready, the program starts analysis by selecting a pool of genes, which we label the genes of interest, given a user prompt for the percentage of total genes available. To select genes of interest, a filter ranks all genes by a measure of dispersion, such as variance, interquartile range, or mean absolute deviation. The code selects a percentage of all genes to be the genes of interest based on user input. Individual gene analysis is performed on these genes.
  • 7. The variance filter is a basic and fast way to take a percentage of genes for analysis. The linear model for expression (EXP) using copy number (CN) and methylation (MET) as predictors can be interpreted as a linear combination of copy number alterations and methylation data for each gene, weighted by beta coefficients, and can be stated as follows: $\mathrm{EXP}_{\mathrm{GENE}} = \beta_1 \cdot \mathrm{CN}_{\mathrm{GENE}} + \beta_2 \cdot \mathrm{MET}_{\mathrm{GENE}}$. The model is fitted by splitting it into its two single-predictor components and combining the statistically significant beta values into the above equation. The first step in this process is to model expression based on copy number and methylation separately: $\mathrm{EXP}_{\mathrm{GENE}} = \beta_1 \cdot \mathrm{CN}_{\mathrm{GENE}}$ and $\mathrm{EXP}_{\mathrm{GENE}} = \beta_2 \cdot \mathrm{MET}_{\mathrm{GENE}}$. The copy number portion of the model is fitted separately, and if the beta value of this separate regression is significant, this beta is included in the overall model for gene expression. Methylation, on the other hand, has the added nuance of clusters. The separate model for expression based on methylation can be written as a linear combination of the n methylation clusters for each gene: $\mathrm{EXP}_{\mathrm{GENE}} = \sum_{C=1}^{n} \beta_{2,\mathrm{Cluster}_C} \cdot \mathrm{MET}_{\mathrm{GENE},\mathrm{Cluster}_C}$. A linear regression is applied to the clustered methylation model, and the significant coefficients are carried into a second regression model for methylation that includes only the significant beta values. A linear regression is applied to this second equation to retrieve fitted values, and the linear coefficient for these fitted values is taken as the significant beta value for methylation as a whole. The linear model can therefore be summed up in the following equation based on significance: $\mathrm{EXP}_{\mathrm{GENE}} = \beta_{1,\mathrm{signif}} \cdot \mathrm{CN}_{\mathrm{GENE}} + \beta_{2,\mathrm{signif}} \cdot \mathrm{MET}_{\mathrm{GENE}}$.
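The driver-selection logic described above can be sketched as follows. This is a simplified illustration in Python, not the package's R code, and it rests on several assumptions: each gene has a single methylation value per patient (the cluster step is omitted), every fit includes an intercept, and the p-value uses a large-sample normal approximation to the usual t-test so that the sketch needs only the standard library. The names `fit_one`, `fit_two`, and `model_gene` are hypothetical, and the data are synthetic.

```python
import math
from statistics import NormalDist, fmean


def fit_one(x, y):
    """Least-squares fit y ~ a + b*x; returns (slope, p_value, r_squared).

    The p-value is a normal approximation to the t-test on the slope
    (an assumption of this sketch, reasonable for large patient counts).
    """
    n = len(x)
    mx, my = fmean(x), fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    sse = max(syy - slope * sxy, 0.0)  # residual sum of squares
    se = math.sqrt(sse / (n - 2) / sxx)
    z = abs(slope) / se if se > 0 else math.inf
    p_value = 2 * (1 - NormalDist().cdf(z))
    return slope, p_value, 1 - sse / syy


def fit_two(x1, x2, y):
    """Least-squares fit y ~ a + b1*x1 + b2*x2; returns (b1, b2, r_squared)."""
    m1, m2, my = fmean(x1), fmean(x2), fmean(y)
    c1 = [v - m1 for v in x1]
    c2 = [v - m2 for v in x2]
    cy = [v - my for v in y]
    s11 = sum(a * a for a in c1)
    s22 = sum(b * b for b in c2)
    s12 = sum(a * b for a, b in zip(c1, c2))
    s1y = sum(a * t for a, t in zip(c1, cy))
    s2y = sum(b * t for b, t in zip(c2, cy))
    syy = sum(t * t for t in cy)
    det = s11 * s22 - s12 * s12  # zero if the two predictors are collinear
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    sse = sum((t - b1 * a - b2 * b) ** 2 for a, b, t in zip(c1, c2, cy))
    return b1, b2, 1 - sse / syy


def model_gene(cn, met, expr, alpha=0.05):
    """Keep only the drivers whose separate fit is significant at `alpha`."""
    b_cn, p_cn, r2_cn = fit_one(cn, expr)
    b_met, p_met, r2_met = fit_one(met, expr)
    if p_cn < alpha and p_met < alpha:
        b1, b2, r2 = fit_two(cn, met, expr)
        return {"drivers": "both", "beta_cn": b1, "beta_met": b2, "r2": r2}
    if p_cn < alpha:
        return {"drivers": "copy number", "beta_cn": b_cn, "beta_met": None, "r2": r2_cn}
    if p_met < alpha:
        return {"drivers": "methylation", "beta_cn": None, "beta_met": b_met, "r2": r2_met}
    return {"drivers": "none", "beta_cn": None, "beta_met": None, "r2": 0.0}


# Synthetic data for 40 hypothetical patients (not TCGA values).
cn = [float(i % 5 - 2) for i in range(40)]
met = [((i * 7) % 10) / 10 for i in range(40)]
noise = [((i * 13) % 7 - 3) * 0.005 for i in range(40)]
expr_both = [c - 4.0 * m + e for c, m, e in zip(cn, met, noise)]
expr_cn = [2.0 * c + e for c, e in zip(cn, noise)]

both = model_gene(cn, met, expr_both)
cn_only = model_gene(cn, met, expr_cn)
print(both["drivers"], cn_only["drivers"])
```

In the first synthetic gene both separate fits pass the filter, so the combined two-predictor fit is reported; in the second, methylation fails the filter and the model falls back to copy number alone, mirroring the rule that an insignificant driver is left out of the final equation.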
  • 8. If an entire driver, whether copy number or methylation, is insignificant, the final model is simply fitted without that driver. Sometimes both copy number and methylation are insignificant, which we interpret as the gene having little to no role in the cancer. In this way, we have created a tool that lists the most significant genes and how they are expressed, creating a separate model for every gene. The beta values are filled in once the regression is applied. To judge how strongly a gene is associated with a specific cancer, we use p-values and r-squared values, two statistics associated with linear regression. Throughout the evaluation of statistical significance, we use a p-value filter set by the user to remove statistically insignificant parts of the model. The p-value indicates how likely it is that an association at least as strong as the one observed would arise by chance if the predictor had no real effect on expression. The r-squared value refers to the coverage of the model, that is, the fraction of the variation in expression that the fit explains. We look for low p-values and high r-squared values to determine the statistical significance of a gene in a given cancer. In interpreting these linear models, one would look for positive copy number coefficients and negative methylation coefficients. Recalling that copy number refers to the number of copies of the gene and that methylation generally represses the gene, one expects copy number and methylation to have positive and negative correlations with gene expression, respectively. However, because the epigenetic role of methylation is not fully understood, we leave this part of the model up to interpretation. 2.4 Data Visualization Users can see results immediately by visualizing the data generated. Three ways to visualize the data are available through this package, designed to work together to help understand gene expression in any given cancer.
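For reference, the two statistics can be written out using the standard definitions for a least-squares fit. These are textbook formulas rather than anything quoted from the package, and the symbols are introduced here for illustration: $y_i$ is the observed expression for patient $i$ of $n$, $\hat{y}_i$ the fitted value, $\bar{y}$ the mean expression, and $\hat{\beta}_j$ an estimated coefficient with standard error $\mathrm{SE}(\hat{\beta}_j)$.

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},
\qquad
t_j = \frac{\hat{\beta}_j}{\mathrm{SE}(\hat{\beta}_j)}
```

Assuming the fit includes an intercept and $k$ predictors, the p-value for $\hat{\beta}_j$ is the two-sided tail probability of $t_j$ under a t-distribution with $n - k - 1$ degrees of freedom. A coefficient passes the user's filter when this p-value falls below the chosen threshold.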
  • 9. All raw data are shown, and users can see whether genes are driven by copy number, methylation, or both. The user can also see graphs of individual gene-by-gene analysis, which show the linear fit line across copy number and methylation and display correlation statistics for the gene. A disease summary allows the user to see how genes are driven across the entire disease. It also shows the genes with the highest statistical correlation in the dataset. I created these data visualization tools using the R package ggplot2, a graphics engine that allows for several types of data visualization. 3 RESULTS All results displayed here were obtained on the brain cancer glioblastoma using microarray data for gene expression, GISTIC data for copy number alterations, and MethylMix for methylation clusters. However, the pipeline can be run on any disease, with options for the types of expression and methylation data.
  • 10. 3.1 Individual Gene Analysis Graphs of the linear regression are printed for individual genes. Plotting the regression is the fastest way to visualize a gene. This example shows a model for expression based only on methylation for the gene FABP5, which controls fatty acids and regulates growth factors.10 Without proper expression of this gene, the cell’s growth factor communication may lose control, resulting in cancerous tumor growth. Figure 1: Modeling expression by methylation in FABP5. Given the 59% coverage indicated by the r-squared value, we can infer that this gene is being downregulated by methylation in glioblastoma patients.
  • 11. EGFR is a growth factor receptor involved in cell division and proliferation, and it tends to have a high correlation with copy number in many cancers, including glioblastoma.11 Figure 2: Individual gene analysis of EGFR in glioblastoma. This gene shows a statistically significant correlation with both copy number and MethylMix methylation. When the two separate models (blue) are combined (red), we can see that this gene is driven by copy number alteration.
  • 12. PDLIM4 is a gene that is heavily methylated in several cancers, and its repression due to this hypermethylation is listed as a biomarker for several cancers. Figure 3: Individual gene analysis of PDLIM4, an actin-associated protein-coding gene, in glioblastoma. This gene shows a statistically significant correlation with both copy number and MethylMix methylation. When the two separate models (blue) are combined (red), we can see that this gene is driven by methylation.
  • 13. 3.2 Raw Linear Model Data The raw data show the accuracy and scope of the expression model in relation to copy number alterations, methylation data, or both predictors. In this chart, the raw data have been analyzed to see whether genes are driven by copy number, methylation, or both factors. All of these genes have both copy number and methylation rendered as significant, that is, with p-values below the p-value filter (the default filter is p < 0.05). Figure 4: Linear Model Data. Users are given all the raw data for genes of interest in glioblastoma. Combining copy number and methylation into one model facilitates analysis of drivers for gene expression. (0* indicates a value < 1e-10.) 3.3 Disease Summary The disease summary gives a holistic view of how the disease is being driven. A pie chart displays the share of genes driven by copy number, methylation, or both factors, and a percentage of genes will be labeled insignificant. Sorting the genes of interest by relevance can show the most important genes for any cancer. Because only relevant clusters are selected, we can eliminate a number of genes as insignificant, and we can accurately and precisely model the expression of important genes, as shown in the disease summary.
  • 14. Figure 5: Pie chart showing drivers for significant genes in a pool of genes of interest in glioblastoma. By keeping only the most significant coefficients, the model screens out as many insignificant genes as possible, leaving a clearer spread of relevant genes. 73% of genes have their own models, as shown here.
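The bookkeeping behind such a disease summary can be sketched in a few lines. This Python example is an illustration under assumptions rather than the package's R implementation: the function name `disease_summary`, the result layout, and the gene names and numbers are all made up, and no plotting is attempted.

```python
from collections import Counter


def disease_summary(models, top=3):
    """Return (percent of genes per driver category, top genes by r-squared)."""
    counts = Counter(m["drivers"] for m in models.values())
    shares = {k: 100 * v / len(models) for k, v in counts.items()}
    # Rank only the genes that kept at least one significant driver.
    significant = [g for g, m in models.items() if m["drivers"] != "none"]
    ranked = sorted(significant, key=lambda g: models[g]["r2"], reverse=True)
    return shares, ranked[:top]


# Hypothetical per-gene results; names and numbers are invented for the sketch.
toy_models = {
    "GENE_A": {"drivers": "copy number", "r2": 0.42},
    "GENE_B": {"drivers": "methylation", "r2": 0.59},
    "GENE_C": {"drivers": "both", "r2": 0.55},
    "GENE_D": {"drivers": "none", "r2": 0.0},
}
shares, top_genes = disease_summary(toy_models, top=2)
print(shares, top_genes)
```

The `shares` dictionary holds the proportions a pie chart would display, and `top_genes` corresponds to a list of genes sorted by relevance, with relevance taken to be the r-squared value.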
  • 15. Figure 6: Top genes listed in a disease summary for glioblastoma. At a glance, users can see genes sorted by relevance for any given cancer. Relevance is denoted using the r-squared value. 4 DISCUSSION Through the three visualization tools of the linear model software, we can find the most important genes in a cancer, visualize them separately, determine the dominant factor in these oncogenes, and get a glimpse of how the disease is driven overall. When a model uses only methylation or only copy number values, we can easily see how the gene is being driven. For example, methylation in FABP5 is downregulating gene expression, as expected under the traditional view of methylation blocking transcription. Because FABP5 has a large 59% coverage and a very low p-value, we can infer that methylation in FABP5 could be instrumental to the regulation of GBM and warrants further research. EGFR and PDLIM4 in glioblastoma are examples of linear models that incorporate both copy number alteration and methylation clusters. We can see that EGFR has a strong correlation with both copy number and methylation, but its relevance is much higher with copy number than with methylation. If EGFR were studied with just copy number or just methylation, we would not be able to see copy number dominate the expression of EGFR as in the combined model.
  • 16. Similarly, in PDLIM4, we see methylation as the ultimate driver for expression after the copy number and methylation models are combined. The separate copy number and methylation models can also work together and show synergy, as shown in the raw data for PDGFRA. We can see the relevance of the model increase from 41% in methylation and 22% in copy number to 55% overall in the combined model. This synergy shows an inherent strength of the model: combining components increases its accuracy and relevance. Summarizing the disease based on its drivers allows us to see the overall expression of the disease. It also shows the package’s efficacy in combining models, with 25% of models being combined models with both copy number and methylation. By selecting only for relevant parts of the model, the model also inherently selects for the relevant genes, allowing the code to screen out as many irrelevant genes as possible and permitting the user to focus on the genes that have statistical relevance to the cancer. 5 CONCLUSIONS AND FUTURE WORK Overall, the linear model package is a simple yet effective way to model gene expression in cancer. It allows us to draw conclusions by analyzing data available on TCGA in real time. The package provides a comprehensive view of genes with a strong correlation to specific factors, which allows users to find and continue research on unexplored genes found using this tool. Mechanisms related to these undiscovered genes with high correlation to copy number, methylation, or both can be explored to study the causes of specific cancers and ultimately to reverse the mechanism and treat cancer more effectively.
  • 17. Simple and comprehensive as the linear model is, it has its faults. For example, exponential, logarithmic, polynomial, and otherwise nonlinear relationships between gene expression and copy number alterations or methylation data can go unnoticed. In the future, these complex relationships can be added to find more genes that have strong correlations to cancer. We intend to add 3D modeling to gene-by-gene graphs to visualize the linear fit as a plane spanning the copy number and methylation axes. More predictors of gene expression, such as miRNA, may be included in the model in the future to account for more factors and make the model more accurate.13 These more complex, higher-dimensional models allow more room for strong statistical correlations between gene expression and its predictors, and allow multiple statistically significant predictors to work together in synergy. Creating the software package from scratch was a great way to explore the field of bioinformatics. By providing data analysis tools, I learned more about the biological and statistical concepts of this field. My lab provided the introduction and the framework I needed to understand how to create this product, and with this background, I built a product that can serve the bioinformatics field. In summary, we modeled gene expression as a linear model using copy number and methylation, and the result is a fast and comprehensive way to find important oncogenes and analyze their behavior and impact on any given cancer.