RAnalysis
1. GEO DATASET GDS4145: Subcutaneous Interferon-beta-1b treatment in relapsing-remitting multiple sclerosis (U133 A): peripheral mononuclear blood cells
GPL96 (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL96): [HG-U133A] Affymetrix Human Genome U133A Array
Primary Ref: Goertsches RH, Hecker M, Koczan D, Serrano-Fernandez P et al. Long-term genome-wide blood RNA expression profiles yield novel molecular response candidates for IFN-beta-1b treatment in relapsing remitting MS. Pharmacogenomics 2010 Feb;11(2):147-61. PMID: 20136355
By Boshika Tara
2. Introduction/Background
The data come from 25 relapsing-remitting multiple sclerosis patients
whose transcriptional profiles were followed longitudinally over 2
years of rIFN-beta administration.
Post-therapy initiation, the authors identified 42 (day 2), 175 (month
1), 103 (month 12) and 108 (month 24) differentially expressed genes.
For this analysis I chose three timepoints, after the 12 month IFN-beta
injection_chipB. The reason was to simplify the dataset that I was
working with, and also because this was the midpoint of the two-year
study.
4. Distribution of expression levels
The first task was to see how the expression levels are distributed for
this dataset.
- The figure on the left is the distribution of all expression levels
within this timepoint.
- The figure on the right is for probes whose expression is greater
than 10,000.
- Since the dataset has dataset_channel_count=1, the data for this
single channel appear to be normalized; in other words, they are
absolute measurements of mRNA abundance.
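The two views described above (all expression levels, and only probes above 10,000) can be sketched as follows. This is a minimal illustration, not the code behind the original figures: the function names and intensity values are made up, and binning by order of magnitude is an assumed choice that suits a strongly skewed distribution.

```python
import math
from collections import Counter

def log10_histogram(values):
    """Count values per order of magnitude: bin k holds 10**k <= v < 10**(k+1)."""
    counts = Counter()
    for v in values:
        if v > 0:  # log10 is undefined for non-positive intensities
            counts[math.floor(math.log10(v))] += 1
    return dict(sorted(counts.items()))

def above(values, threshold=10_000):
    """Keep only the values strictly greater than the threshold."""
    return [v for v in values if v > threshold]

# Hypothetical intensities for illustration, not values from GDS4145
expression = [12.5, 85.0, 140.2, 950.0, 2300.0, 15000.0, 42000.0]

hist = log10_histogram(expression)   # all expression levels, by decade
high = above(expression)             # only probes above 10,000

for decade, count in hist.items():
    print(f"10^{decade} to 10^{decade + 1}: {count} probes")
print("probes above 10,000:", high)
```

Counting per decade keeps the long right tail readable, which a linear-scale histogram would squash into a single bar.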
5. Correlation matrix
Next I created a correlation matrix to see how each sample's
expression profile correlated with the others, and to look for
duplicates and outliers.
Each sample has a Pearson's correlation of 1 with itself, hence the 0
down the diagonal.
The sample expression profiles are strongly correlated with one
another, with an average of around 0.97.
The big exception is in the middle of the graph (the big bright red
line).
GSM601870 has the lowest correlation, 0.87, so I removed it from my
dataset.
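The outlier check described above can be sketched as computing every pairwise Pearson correlation and then averaging each sample's correlation with all the others. This is only an illustrative sketch: the sample names and values are invented, and picking the sample with the lowest mean correlation is an assumed rule, not necessarily the criterion used for GSM601870.

```python
import math

def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def mean_correlation_with_others(samples):
    """For each sample, average its correlation with every other sample."""
    names = list(samples)
    result = {}
    for a in names:
        others = [pearson(samples[a], samples[b]) for b in names if b != a]
        result[a] = sum(others) / len(others)
    return result

# Hypothetical expression profiles (same probes in the same order)
samples = {
    "sample_A": [10.0, 200.0, 3000.0, 50.0, 800.0],
    "sample_B": [12.0, 210.0, 2900.0, 55.0, 790.0],
    "sample_C": [11.0, 190.0, 3100.0, 48.0, 820.0],
    "sample_D": [900.0, 150.0, 1000.0, 700.0, 60.0],  # deliberately odd profile
}

mean_r = mean_correlation_with_others(samples)
outlier = min(mean_r, key=mean_r.get)  # sample least like the rest
print(outlier)
```

Excluding the self-correlation from the average matters: including the trivial value of 1 would pull every sample's mean upward and blur the gap between the outlier and the rest.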
6. Histogram of Pearson’s correlation between different
samples
Higher correlation between samples is to be expected. I also checked
which genes were most highly expressed. These genes were very highly
expressed: ACTB, RPL34, RPL19, RPL11. ACTB is beta actin, which
explains why its levels are high, since all cells need beta actin. The
others all seem to be ribosomal in nature.
7. Expression levels of Housekeeping gene
I decided to look at the expression levels of GAPDH, a housekeeping
gene, to see how consistent they were. The graph on the left shows the
absolute levels of GAPDH; one can see that the distribution is uneven,
so I decided to create a histogram in which GAPDH levels are ranked
relative to the other genes. GAPDH is very close to being the highest
ranked gene in every sample, which makes sense.
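The difference between absolute level and relative rank can be sketched as below. The helper name and the expression values are hypothetical; the point is only that a gene's rank within a sample can stay the same while its absolute level differs a lot between samples.

```python
def rank_of_gene(sample, gene):
    """Rank of a gene within one sample; 1 = most highly expressed."""
    ordered = sorted(sample, key=sample.get, reverse=True)
    return ordered.index(gene) + 1

# Hypothetical per-sample expression levels, not values from GDS4145
samples = {
    "sample_A": {"GAPDH": 21000.0, "ACTB": 30000.0, "PRNP": 900.0, "PRND": 40.0},
    "sample_B": {"GAPDH": 12000.0, "ACTB": 18000.0, "PRNP": 1500.0, "PRND": 300.0},
}

gapdh_levels = {name: expr["GAPDH"] for name, expr in samples.items()}
gapdh_ranks = {name: rank_of_gene(expr, "GAPDH") for name, expr in samples.items()}

print(gapdh_levels)  # absolute levels differ between samples
print(gapdh_ranks)   # rank within each sample is the same
```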
8. Expression levels of PRND and PRNP
Next I looked at the expression levels of non-housekeeping genes,
since by contrast a non-housekeeping gene might be more variable in
both absolute level and rank.
I chose PRNP and PRND; both genes play a major role in the central
nervous system and have been known to play a role in
neurodegenerative disorders (REF: Wikigenes).
9. Expression levels of PRND and PRNP
The absolute levels of both PRNP and PRND vary substantially. In
relative rank compared to other genes, PRNP is more consistently
expressed than PRND, whose rank seems to be all over the place. PRNP
also seems to have higher expression levels than PRND, which makes
sense.
10. Principal Component Analysis
I choose a random subset of this dataset to run
my PCA analysis on. I ran PCA to see what
systematic variations were there in this dataset.
The plot is of the first two variations: PC1 vs PC2.
The most interesting aspect, of this graph is that
PC1 is there only to distinguish GSM601870
sample, which is the outlier from the correlational
matrix. This reconfirms that the sample is not
quite right, and should be excluded from the data
set.
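The idea that a single odd sample can dominate PC1 can be sketched with a small hand-rolled PCA. Everything here is an assumption made for illustration: samples are treated as observations and genes as features, the data are invented, and the first component is found by power iteration on the covariance matrix rather than by whatever routine produced the original plot.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (direction, scores, explained fraction) for PC1.

    rows: observations (samples) x features (genes).
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix of the features (p x p)
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    # Power iteration: repeated multiplication converges to the top eigenvector
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p))
                     for i in range(p))
    total_variance = sum(cov[i][i] for i in range(p))
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
    return v, scores, eigenvalue / total_variance

# Hypothetical log-scale expression: three similar samples and one odd one
rows = [
    [8.0, 5.0, 3.0],
    [8.1, 5.2, 2.9],
    [7.9, 4.9, 3.1],
    [4.0, 9.0, 6.0],  # deliberately different profile
]

direction, scores, explained = first_principal_component(rows)
most_extreme = max(range(len(rows)), key=lambda i: abs(scores[i]))
print(most_extreme, round(explained, 3))
```

In this toy data the fourth sample has the largest absolute PC1 score and PC1 carries nearly all of the variance, which mirrors the situation described above where PC1 mostly separates one sample from the rest.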
11. Variance by PC
Overall, the PCA reconfirms the single outlier in the sample set; the
rest of the samples seem to be very similar to each other, hence the
inter-sample correlation of ~0.97 seen in the correlation matrix.
This shows huge stratification in expression levels between different
genes, as also seen in the exponential distribution plotted earlier.
12. Comparison with mRNA-seq data
I decided to compare my dataset to mRNA-seq data from Human BodyMap
2.0. It is important to see whether the data relate in some way to
other data gathered using a different technology.
Since the microarray data were prepared from human blood, I combined
them with the blood FPKMs from Human BodyMap.
Despite some outliers, there is a visible correlation between each
gene's average expression level in the microarray data and its level
in the Human BodyMap 2.0 mRNA-seq data.
Overall there is a reasonably strong correlation between a gene's
average level in the microarray data and its FPKMs in the mRNA-seq
data: Pearson's correlation of r=0.73 and Spearman's rank correlation
of rho=0.83.
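The two correlation measures quoted above differ in what they compare: Pearson's works on the raw values, while Spearman's is Pearson's correlation computed on ranks. The sketch below uses invented microarray and FPKM values, not the actual data, to show how a monotonic but non-linear relationship gives a higher Spearman than Pearson correlation.

```python
import math

def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman's rank correlation: Pearson's correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical per-gene values for illustration only
microarray = [120.0, 450.0, 900.0, 5000.0, 21000.0]
fpkm = [0.5, 2.0, 3.0, 40.0, 900.0]

pearson_r = pearson(microarray, fpkm)
spearman_rho = spearman(microarray, fpkm)
print(round(pearson_r, 3), round(spearman_rho, 3))
```

Because the two technologies report on different scales, a rank-based measure is often the more forgiving comparison, which is consistent with the Spearman value being the higher of the two reported above.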
14. Analysis
Overall, it seems that ribosomal genes are much more highly expressed
than other genes. Another set of genes that seems to be upregulated
belongs to the tyrosine kinase family, such as DDR1, a receptor
tyrosine kinase known to play a role in cell growth and communication
(ref: http://www.genecards.org/cgi-bin/carddisp.pl?gene=DDR1).
Another interesting point is the upregulation of ribosomal protein S6
in MS, and its downregulation after interferon treatment.
It seems logical to see an uptick in some of the ribosomal genes, as
most of the patients in this study seem not to respond fully to the
interferon treatment.