This document compares two approaches for identifying potential gene expression regulators from experimental data: Sub-Network Enrichment Analysis (SNEA) in Pathway Studio and Causal Reasoning in Ingenuity Pathway Analysis (IPA). It analyzes a publicly available dataset on Spinal Muscular Atrophy. SNEA identified biological functions and expression regulators more specifically related to neurogenesis, the key process affected in SMA. It produced results with greater relevance to the disease compared to IPA. Mapping genes to a motor neuron differentiation model also showed SNEA results were more consistent with SMN1 knockout.
Comparing Gene Expression Algorithms to Identify Regulators
Finding the Switch
Comparing gene expression algorithms
for the identification of expression regulators
FOR PHARMA & LIFE SCIENCES
White paper
Executive Summary
By examining experimental gene expression data, researchers can identify
potential upstream regulatory factors that may control key biological
processes. In this paper we examine the effectiveness of two similar approaches
to this type of identification, as implemented by Elsevier's Pathway Studio and
Qiagen's Ingenuity Pathway Analysis, using a publicly available data set from
research done on Spinal Muscular Atrophy.
Introduction
Biological processes are described by complex sequences
of interactions between proteins and intra- as well as
extra-cellular components, including other proteins,
small molecules, and an array of small RNA molecules.
Unraveling these complex webs of interactions is critical to
understanding the biology that drives normal development,
disease progression, and responses to treatments. Often
groups of related processes are controlled by a limited number
of specific upstream regulatory proteins. By carefully examining
gene and protein expression information from
cells under different conditions, such as normal vs tumor,
or pre- and post-treatment, these upstream regulatory
proteins may be identified using specific functions that find
commonalities in complex expression patterns.
There are multiple analytical approaches to identifying these regulators. In this
paper we compare two analytical methods: Sub-Network Enrichment Analysis
(SNEA) as implemented in Elsevier's Pathway Studio [1]
(http://www.elsevier.com/online-tools/pathway-studio), and Causal Reasoning [2]
as implemented in Qiagen's Ingenuity Pathway Analysis
(https://www.ingenuity.com/products/ipa), using publicly available
data from a recent publication: Maeda M, et al. (2014). Transcriptome Profiling of
Spinal Muscular Atrophy Motor Neurons Derived from Mouse Embryonic Stem Cells.
PLoS ONE 9(9): e106818.
The Analytical Approaches
Background
Traditionally gene expression data
analysis has been driven by clustering
individual gene expression data into
groups sharing similar expression
patterns, or by comparing differentially
expressed genes with sets of genes
known to be associated with specific
biological functions or pathways. The
basic approach for functional analysis of
differentially expressed genes requires
the use of predetermined "gene sets":
collections of genes that form the basis
for the experimental data comparison.
These gene sets may be created by
the researchers, or publicly available
gene sets such as those available
from the Broad Institute
(http://www.broadinstitute.org/gsea/msigdb/index.jsp)
may be used. Although these gene
sets are assumed to be carefully selected,
they may not have any specific relevance
to the particular experimental data being
analyzed, nor do researchers applying
them have access to specific literature
evidence describing how those particular
genes were initially selected.
The statistical significance of the overlap
between differentially expressed genes
and a gene set is calculated using either
Fisher's exact test (based on the
hypergeometric distribution) or gene set
enrichment analysis (GSEA). The former
requires the pre-selection of differentially
expressed genes by applying an arbitrary
cutoff on expression value or p-value.
GSEA is considered to be the more
powerful test because it does not require
an arbitrary cutoff to be applied before
analysis, and as a result, it can detect
small but concerted expression changes
in a gene set that would be filtered out
after cutoff application [3].
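The cutoff-based overlap test described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the implementation used by either product; the function names, the |log2 fold change| cutoff, and all gene counts are assumptions chosen for the example.

```python
from math import comb


def select_differential(log2_fc_by_gene, cutoff):
    """Apply an arbitrary cutoff: keep genes whose |log2 fold change| >= cutoff."""
    return {gene for gene, fc in log2_fc_by_gene.items() if abs(fc) >= cutoff}


def overlap_p_value(n_universe, n_gene_set, n_diff_expr, n_overlap):
    """One-sided p-value (hypergeometric upper tail): the probability of seeing
    at least n_overlap gene-set members among n_diff_expr genes drawn at random
    from a universe of n_universe measured genes."""
    total = comb(n_universe, n_diff_expr)
    tail = sum(
        comb(n_gene_set, k) * comb(n_universe - n_gene_set, n_diff_expr - k)
        for k in range(n_overlap, min(n_gene_set, n_diff_expr) + 1)
    )
    return tail / total


# Hypothetical numbers: 20,000 measured genes, a 100-gene set,
# 500 genes passing the cutoff, 10 of which belong to the gene set.
p = overlap_p_value(20000, 100, 500, 10)
print(f"enrichment p-value: {p:.3g}")
```

Changing the cutoff changes both the number of selected genes and the overlap, and therefore the p-value. Genes with small but coordinated changes that fall below the cutoff never enter the test at all.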
While gene set comparisons have
been used successfully by researchers,
newer approaches that leverage specific
information extracted from the scientific
literature about the directionality of
molecular interactions between proteins,
termed "causal" approaches, provide an
additional level of information regarding
pathway and network relationships.
Elsevier's Pathway Studio and Qiagen's
Ingenuity Pathway Analysis implement
related but slightly different approaches
to leveraging molecular interaction data
from the scientific literature to provide
context that can improve the accuracy
of interpretation of gene expression
data. Both approaches generate gene
sets using causal networks but they
differ in the specific methods used for
calculating statistical significance of a
gene set, and in the size of the master
causal network (derived from the
literature) used by the algorithms.
Causal Reasoning in IPA
The Ingenuity database (master
network) contains more than 1.5 million
observed relationships [4] obtained
through manual review and curation of
PubMed abstracts and selected full-text
scientific articles. Causal reasoning uses
Fisher's Exact Test (FET) to calculate
the statistical significance of a gene
set, and therefore relies on an arbitrary
differential expression cutoff that must
be supplied by the end user. While IPA
has implemented four different causal
reasoning algorithms, this paper focuses
on the results obtained by Upstream
Regulator analysis (URA). URA builds
gene sets using causal edges from the
master network that connect differentially
expressed genes with their upstream
direct and indirect regulators. It then
applies a one-sided Fisher's Exact
Test to determine the statistical
significance of the overlap between
differentially expressed genes and all
other targets of the regulator present
in the master network.
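As a rough sketch of the gene-set construction step, the snippet below groups targets by regulator from a list of causal edges and counts how many of each regulator's targets are differentially expressed. The edge list, gene names, and function names are hypothetical, and only direct regulator-to-target edges are considered; this is not IPA's actual implementation, which also follows indirect relationships.

```python
from collections import defaultdict

# Hypothetical causal edges (regulator, target). A real master network
# holds millions of literature-derived relationships.
EDGES = [
    ("REG_A", "G1"), ("REG_A", "G2"), ("REG_A", "G3"),
    ("REG_B", "G3"), ("REG_B", "G4"),
]


def targets_by_regulator(edges):
    """Build one gene set per regulator from its downstream targets."""
    gene_sets = defaultdict(set)
    for regulator, target in edges:
        gene_sets[regulator].add(target)
    return dict(gene_sets)


def overlap_counts(gene_sets, diff_expr):
    """For each regulator: (number of known targets, number that are
    differentially expressed). These counts are the inputs a one-sided
    overlap test would need."""
    return {
        regulator: (len(targets), len(targets & diff_expr))
        for regulator, targets in gene_sets.items()
    }


# Hypothetical list of genes that passed the user-supplied cutoff.
diff_expr = {"G1", "G3"}
counts = overlap_counts(targets_by_regulator(EDGES), diff_expr)
for regulator, (n_targets, n_hit) in sorted(counts.items()):
    print(f"{regulator}: {n_hit} of {n_targets} targets differentially expressed")
```

Each regulator's pair of counts, together with the total number of measured genes and the total number of differentially expressed genes, would then be passed to the significance test.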
Sub-Network Enrichment Analysis
in Pathway Studio
SNEA as implemented in Pathway
Studio uses a master causal network
(database) containing more than 4.9
million relationships derived from more
than 3.7 million full text articles and 24
million PubMed abstracts. This network
is generated by a highly-tuned Natural
Language Processing (NLP) text mining
system to extract relationship data from
the scientific literature, rather than the
manual curation process used by IPA.
The ability to quickly update the
terminologies and linguistics rules used
by NLP systems ensures that new terms
can be captured soon after entering
regular use in the literature.
This approach, when coupled with
the ability to process thousands of
sentences a second, provides users
with the largest and most up-to-date
collection of literature-based molecular
interaction data. This extensive database
of interaction data provides high
levels of confidence when interpreting
experimentally-derived gene expression
data against the background of
previously published results. The use
of NLP to search the literature results
in a master network in Pathway Studio
that is more than three times bigger
than the database in IPA.
SNEA uses the Mann-Whitney rank
test to evaluate the statistical significance
of a gene set generated from the
master network. This test compares the
distribution of expression values within
each gene set against the distribution
of expression values on the entire
microarray and, like GSEA, uses the full
ranked list of genes. Thus, SNEA
does not require the user to specify an
arbitrary cutoff for differential expression,
and is therefore more sensitive to
small but coordinated changes in the
expression levels of groups of genes,
although potentially at the cost of
more "background noise"
in the selected data set.
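The rank-based comparison can be sketched with a small standard-library implementation of the Mann-Whitney U statistic. This is an illustration under stated assumptions, not Pathway Studio's code: the function names and expression values are invented, the background is taken to be the genes outside the gene set, and the p-value uses a normal approximation without tie or continuity correction.

```python
from math import erfc, sqrt


def average_ranks(values):
    """1-based ranks of values, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney(gene_set_values, background_values):
    """Return (U, z, two-sided p) comparing a gene set with the background."""
    n1, n2 = len(gene_set_values), len(background_values)
    ranks = average_ranks(list(gene_set_values) + list(background_values))
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    return u, z, erfc(abs(z) / sqrt(2))


# Hypothetical log2 fold changes: the gene set shows a modest,
# coordinated decrease that a hard cutoff of 1.0 would discard entirely.
gene_set = [-0.6, -0.5, -0.7, -0.4, -0.55]
background = [0.1, -0.1, 0.3, 0.0, -0.2, 0.2, 0.05, -0.05, 0.15, -0.3]
u, z, p_value = mann_whitney(gene_set, background)
print(f"U={u}, z={z:.2f}, p={p_value:.4f}")
```

No gene is excluded before testing: every measured value contributes its rank, which is why modest but consistent shifts across a gene set can still be detected.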
In this paper the SNEA option "Expression
targets" was compared with the URA
algorithm in IPA. The SNEA option
"Proteins/Chemicals Regulating Cell
Processes" was also compared with IPA's
analysis of biological functions networks.
Methods
A publicly available gene expression
dataset, described in the paper
"Transcriptome Profiling of Spinal
Muscular Atrophy Motor Neurons
Derived from Mouse Embryonic Stem
Cells" [5] (Maeda et al., 2014), was used to
compare the IPA and SNEA algorithms.
This RNAseq dataset is available from
the Gene Expression Omnibus (GEO,
http://www.ncbi.nlm.nih.gov/geo/)
under accession number GSE56284.
Spinal Muscular Atrophy (SMA) [6] is a
neurodegenerative disease characterized
by the destruction of motor neurons
(MNs) in the anterior horn of the spinal
cord leading to progressive muscle
weakness and atrophy. Previous reports
indicate that mutations in the Survival
Of Motor Neuron 1, Telomeric (SMN1) [7]
gene are disease-determining. The dataset
chosen for study here examines the
effect of a gene knock-out of SMN1 on
the gene expression in experimentally
derived mouse motor neurons.
Independent of the algorithm used,
the quality of the resulting predictions
will be affected by both the quality and
the comprehensiveness of the literature-
derived interactions in each program's
database. The analysis was run using the
most current version (as of Jan 15, 2015)
of Elsevier's Pathway Studio, and the
results were compared with those
obtained by Maeda et al.
Results
As a first step, the results of the biological
function analysis presented by the authors
in Figures 6A and 6B of Maeda et al.
were examined. The authors performed
separate analyses for up-regulated and
down-regulated genes in SMN1 knock-
out mouse embryonic stem cells (ESC)
as compared with normal ESCs. To obtain
a list of biological functions affected by
most differentially expressed genes
in SMN1 -/- mice, the SNEA algorithm
in Pathway Studio was run with the
option "Proteins/Chemicals Regulating
Cell Processes." The results are shown
below in Table 1, which compares the
original published IPA results from the
Maeda paper (left column) with the
results obtained using SNEA in Pathway
Studio (right column).
Table 1. Biological functions affected by SMN1 knock-out identified by IPA and Pathway Studio.

Published in Maeda article (IPA)           SNEA in Pathway Studio
Cellular Functions and Maintenance         Synaptogenesis
Cell Morphology                            Synaptic Transmission
Cellular Growth and Proliferation          Axon Guidance
Tissue Development                         Neurotransmission
Cell Death and Survival                    Neurogenesis
Cellular Development                       Nerve Cell Differentiation
Embryonic Development                      Innervation
Nervous System Development and Function    Neuron Development
Cell Morphology                            Regulation of Action Potential
Cellular Assembly and Organization         Neuronal Activity
Cellular Function and Maintenance          Central Nervous System Development
Tissue Morphology                          Synaptic Plasticity
Cellular Development                       Neuron Differentiation
The comparison shows that the top 15
biological functions identified by SNEA
in Pathway Studio all correspond directly
to neuronal development, which would
be expected since a knockout of SMN1
should affect neuronal development
in a targeted fashion. In contrast,
although IPA's implementation of Causal
Reasoning identified a number of
biological processes that are generally
tied to cell and tissue development, only
one, "Nervous System Development
and Function," is specifically associated
with the development and degeneration
of motor neuron axons, a process
underlying the phenotype observed
in humans and zebrafish with mutations
in the SMN1 gene. This result shows
that the combination of the more
sensitive SNEA algorithm along with the
larger underlying database of molecular
interactions and processes in Pathway
Studio results in the identification
of more specific and more relevant
biological processes associated with
the differentially expressed genes
in the Maeda data set.
Next, results of the Upstream Regulator
Analysis (URA) in IPA were compared with
results of the equivalent SNEA option,
"Expression targets," in Pathway Studio.
The original URA results obtained by
Maeda et al. are presented in Figure 8A
of the Maeda article and reproduced
in the left-hand column of Table 2 below.
As before, results obtained from Pathway
Studio are shown in the right column.
Table 2. Top expression regulators identified by URA in IPA and by the SNEA option
"Expression targets" in Pathway Studio.

Expression regulators      Expression regulators
in Maeda article           identified by SNEA in PS
CTNNB1                     NEUROG2
ASCL1                      ASCL1
EGR2                       PAX6
AR                         DLX1
FOXC2                      SHH
FOXC1                      PHOX2B
HMGA1                      NEUROG3
POU5F1                     PTF1A
TP53                       PHOX2A
PAX7                       NEUROD1
HIF1A                      NTF3
SOX2                       NTF4
STAT4                      POU4F1
NEUROG2                    SOX10
NEUROG3                    NEUROG1
NANOG                      NEUROD2
NF-kB                      REST
Upon examining the regulators identified
using IPA in the Maeda publication (left
column), it is apparent that many of them
are transcription factors and homeobox
proteins that have a wide range of
activities. Only 6 out of 17 (35%), including
Neurogenin 2 and 3, and ASCL1, are
specifically associated with neurogenesis,
an association that would be
consistent with SMN1's causal relationship
to Spinal Muscular Atrophy. This does not
rule out the other genes identified in the
Maeda paper; it simply means additional
research may be required to identify these
genes' specific roles in neurogenesis.
In contrast, when comparing the results
from the SNEA analysis done in Pathway
Studio with the results from IPA,
a different pattern emerges. Of the
top 17 potential expression regulators
identified, all (100%) have been previously
described in the literature as being
directly implicated in neurogenesis.
One interpretation of this result is that
the much larger database of literature-
derived molecular interaction information
in Pathway Studio produces results with
greater specificity and potentially greater
relevance to a specific disease or process.
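The proportions quoted above, and the overlap between the two columns of Table 2, can be checked with a few lines of Python. The gene symbols are copied from Table 2; the counts of neurogenesis-associated regulators (6 and 17) are taken from the text rather than computed, and the variable and function names are illustrative only.

```python
# Regulators as listed in Table 2 (left: URA in IPA, right: SNEA in Pathway Studio).
ura_regulators = [
    "CTNNB1", "ASCL1", "EGR2", "AR", "FOXC2", "FOXC1", "HMGA1", "POU5F1",
    "TP53", "PAX7", "HIF1A", "SOX2", "STAT4", "NEUROG2", "NEUROG3", "NANOG",
    "NF-kB",
]
snea_regulators = [
    "NEUROG2", "ASCL1", "PAX6", "DLX1", "SHH", "PHOX2B", "NEUROG3", "PTF1A",
    "PHOX2A", "NEUROD1", "NTF3", "NTF4", "POU4F1", "SOX10", "NEUROG1",
    "NEUROD2", "REST",
]


def percent(part, whole):
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


# Regulators reported by both tools.
shared = sorted(set(ura_regulators) & set(snea_regulators))
print("shared regulators:", shared)

# Counts of neurogenesis-associated regulators as stated in the text.
print("URA:", percent(6, len(ura_regulators)), "%")
print("SNEA:", percent(17, len(snea_regulators)), "%")
```

Three regulators (ASCL1, NEUROG2, and NEUROG3) appear in both lists, and these are among the neurogenesis-associated regulators named in the discussion of the IPA results.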
Figure 1. Model for motor neuron differentiation produced manually based on literature
data in Pathway Studio.
Building a Disease Model
To better understand some of the aspects
of the biology which underlies Spinal
Muscular Atrophy, a disease model of
motor neuron differentiation was created
based on the relationships identified
in the Pathway Studio database, as
extracted from the literature. The process
of creating de novo disease models
and pathways from information in the
literature is straightforward in Pathway
Studio, unlike competing products where
this process is difficult, if not impossible.
This provides a unique tool for researchers
using Pathway Studio to visualize
and understand complex, interacting
biological processes. The model is
shown in Figure 1.
Using this model as a framework, the
differentially expressed genes from the
GSE56284 RNAseq dataset were mapped
to the model (Figure 2 below). The colors
are based on gene expression values.
This mapping shows that the expression
of most of the transcription factors
identified in the data set as related to
motor neuron development are down-
regulated. Since most transcriptional
regulators act by increasing gene
expression of their targets, the
observation of down-regulated genes
is highly consistent with the repression
or knock-out of their upstream regulator,
in this case the SMN1 gene.
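The mapping step can be pictured as a lookup from each model node to a color based on the sign of its expression change. The sketch below is illustrative only: the node names and log2 fold-change values are placeholders rather than values from GSE56284, and the function names and the "unmapped" label are assumptions, not part of Pathway Studio.

```python
def expression_color(log2_fc):
    """Blue for down-regulated, red for up-regulated, grey for unchanged."""
    if log2_fc < 0:
        return "blue"
    if log2_fc > 0:
        return "red"
    return "grey"


def color_model(model_genes, log2_fc_by_gene):
    """Assign a color to every node of the model; nodes without a
    measurement in the dataset are labeled 'unmapped'."""
    return {
        gene: expression_color(log2_fc_by_gene[gene])
        if gene in log2_fc_by_gene else "unmapped"
        for gene in model_genes
    }


# Placeholder model nodes and expression values (not real data).
model_genes = ["TF_1", "TF_2", "TF_3", "TF_4"]
log2_fc_by_gene = {"TF_1": -1.8, "TF_2": -0.9, "TF_3": 0.4}

colors = color_model(model_genes, log2_fc_by_gene)
down = [gene for gene, color in colors.items() if color == "blue"]
print(colors)
print(f"{len(down)} of {len(model_genes)} model nodes are down-regulated")
```

Counting the blue nodes gives a quick summary of how much of the model is down-regulated in the dataset.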
Next, to evaluate the results obtained
from the "Upstream Regulator Analysis"
(URA) algorithm in Ingenuity, those
genes identified as up-regulated (red)
and down-regulated (blue) were mapped
to the model described in Figure 1.
Figure 2. Differential expression in genes in SMN1-/- mice from GSE56284 RNAseq dataset,
mapped to the model for motor neuron differentiation from Figure 1. Blue: down-regulated
genes. Red: up-regulated genes. Highlighted in green are the major expression regulators
identified by SNEA responsible for differential expression between SMN1-/- and wild type mice.
The results shown in Figure 3 indicate
that many of the transcription factors
identified by the URA algorithm in IPA
as being involved with motor neuron
differentiation are upregulated. In
addition, only 5 out of 210 regulators
found by URA in IPA were present in
the model (2.4%). Only 3 regulators
from the handpicked list of 17 published
in Figure 8A of the Maeda paper were
present in the model.
In contrast to the Upstream Regulator
Analysis results from Ingenuity Pathway
Analysis, the results obtained using the
Expression Targets option from SNEA
in Pathway Studio show that almost all
of the transcription factors involved
in motor neuron differentiation show down-
regulated activity, which is consistent
with the observation in Figure 2 that
the expression of these proteins is
also down-regulated. This comparison
suggests that in this system, SNEA from
Pathway Studio produces more relevant
results associated with potential upstream
gene regulators than does URA from
Ingenuity Pathway Analysis.
Figure 3. Results from Upstream Regulator Analysis (URA) tool in Ingenuity mapped
to the disease model. Red: activated regulator according to URA. Blue: repressed
regulator according to URA.
Summary
With more than 1 million new scientific
papers published each year, there is a
continual accumulation of new data
and information about genes, proteins,
networks, and pathways from work done
by thousands of scientists in laboratories
across the world. Researchers have
difficulty keeping up with the current
literature in their narrow areas of
research, much less staying current with
larger fields, or more peripheral areas of
interest. Automated systems
that can accurately survey the appropriate
literature and extract the data relevant to
a researcher's work can lead to more accurate
results and better informed decisions.
In this comparison, both programs
identified some of the key genes and
proteins likely to be involved in the
biological processes that drive Spinal
Muscular Atrophy. This result is not
surprising since both applications
implement highly-related causal
reasoning-based algorithms to identify
potential upstream gene expression
regulators. However, since the utility of
causal reasoning algorithms is highly
dependent on the amount of literature-
based information they have access to for
comparison, the size and completeness
of the literature-derived molecular
interaction data can have a profound
effect on results. When combined with
the potential for increased sensitivity to
lower-level but correlated gene expression
changes that can be identified by the
SNEA algorithm, this can lead to a marked
difference in the final results.
In this example, the SNEA results appear to be
much more specific and relevant, and
better connected to specific biological
processes associated with neuronal
development. In this comparison, the
application of SNEA analysis in Pathway
Studio resulted in a more comprehensive
and plausible mechanistic picture of
the potential effects as predicted from
the original experimental data. Based
on similar results obtained by other
researchers (personal communications),
the larger database of molecular
interactions derived from the literature
often results in larger numbers of more
specific candidate genes and proteins
returned for almost any area of human
disease study, leading to better insights
for the interpretation of any researcher's
experimental data.
[1] Pyatnitskiy MA, Shkrob MA, Daraselia ND,
Kotelnikova EA. 2012. Sub-network Enrichment
and Cluster Analysis Reveal Possible Pathways
for Cetuximab Sensitivity. In: From Knowledge
Networks to Biological Models, 151-172.
[2] Chindelevitch L, Ziemek D, Enayetallah A, et al.
2011. Causal Reasoning on Biological Networks:
Interpreting Transcriptional Changes. In: V. Bafna
and S.C. Sahinalp (Eds.): RECOMB 2011, LNBI
6577, pp. 34-37.
[3] Abatangelo L, Maglietta R, Distaso A, et al.
2009. Comparative study of gene
set enrichment methods. BMC Bioinformatics 10:275.
[4] Krämer A, Green J, Pollard J Jr, Tugendreich S.
2014. Causal analysis approaches in Ingenuity
Pathway Analysis. Bioinformatics 30(4): 523-530.
doi:10.1093/bioinformatics/btt703
[5] Maeda M, Harris AW, Kingham BF, Lumpkin
CJ, Opdenaker LM, et al. 2014. Transcriptome
Profiling of Spinal Muscular Atrophy Motor
Neurons Derived from Mouse Embryonic Stem
Cells. PLoS ONE 9(9): e106818.
doi:10.1371/journal.pone.0106818
[6] Crawford TO, Pardo CA. 1996. The neurobiology
of childhood spinal muscular atrophy. Neurobiol
Dis 3: 97-110.
[7] Lefebvre S, Bürglen L, Reboullet S, Clermont
O, Burlet P, et al. 1995. Identification and
characterization of a spinal muscular atrophy-
determining gene. Cell 80: 155-165.
LEARN MORE
To request information or a product demonstration,
please visit elsevier.com/pathwaystudio or email us at
pathwaystudio@elsevier.com.
PATHWAY STUDIO is a registered trademark of Elsevier Inc. Copyright © 2015 Elsevier B.V. All rights reserved.
June 2015