NRNB Annual Report 2018

Overall
B.2. What was accomplished under these goals?
Measuring success
● 131 publications citing NRNB grant in the reporting period
● Over 10,000 visits per week to Cytoscape.org
● 17,000 downloads per month for Cytoscape
● 3600 Cytoscape application launches per day
● 43,930 page views in January 2018 for the Cytoscape App Store
● 1000 downloads per day among 333 Cytoscape apps
● A total of 23 tools supported by NRNB
● 75 new and ongoing collaborations with external investigators on diverse topics
● 20 students trained through Google Summer of Code
● 8 NRNB coordinated training events in 7 locations in 4 countries
● Over 100 users trained on Cytoscape by NRNB staff
● 1,000 views in a typical week at tutorials.cytoscape.org
● 536 members in our Network Biology LinkedIn group
● 3,900 members on our moderated Google groups for Cytoscape
TRD 1: Differential Networks
● We have now developed a modeling framework for network inference to handle
time-resolved data, and have applied it to time-resolved drug response measurements in
melanoma cells. The aim of this task was to improve the perturbation biology method
(developed by Nelander, Molinelli, and Korkut) for a more thorough understanding of protein
networks and their responses to drug perturbations.
● We continue to develop protein-protein interaction network alignment algorithms since
publishing “GreedyPlus: An Algorithm for the Alignment of Interface Interaction
Networks” in 2015, the first such algorithm for protein interaction networks that includes
binding site information. We have designed an interactome portal for the Vidal team’s human
interactome project at http://interactome.baderlab.org/ that will include network alignment
features.
● Dr. Morris continued development of the stringApp for Cytoscape. STRING
(http://www.string-db.org/) is an important public interaction database, well regarded
by mass spec practitioners.

● The stringApp is incorporated into a new tutorial called “Functional Enrichment Analysis”.
Mass spectrometry practitioners and analysts routinely work with network models
constructed from fundamental interaction measurements.
● We revised the idmapper App to support automation and scriptability. Automation is a
key new feature of the Cytoscape architecture in version 3.7. Idmapper served as an
early use case in adding command functionality to apps. The integration of identifier
mapping into Cytoscape as a built-in feature, together with the stringApp, greatly
enhances the user experience for mass spec practitioners as well as many other
users.
TRD 2: Descriptive to Predictive Networks
● We demonstrate the features of the netDx framework by integrating up to six
heterogeneous data types, including clinical variables, DNA methylation, somatic
mutations, mRNA, miRNA and protein expression profiles, for survival prediction in
kidney, lung, ovarian and brain cancer.
● We have made some preliminary progress on the network-constrained regression
methodology during this reporting period. Our starting point is the recently published
GELnet method. Using the GELnet method, we analyzed the Sanger GDSC
pharmacogenomic data for 1080 cell lines across 42 tumor classifications, including:
transcript expression, mutation, and drug response profiling on 297 compounds. The
overall goal of this project is the identification of biomarkers involved in the progression
of cancer and the response to pharmaceutical treatment.
● Using a newly constructed Evolutionarily Selected Pathways Map, we analyzed the
major cancer genome atlases to derive a hierarchical classification of tumor subtypes
linked to characteristic mutated pathways. The resulting new stratification of tumors
corresponded only weakly to known clinical subtypes and variables, or not at all,
suggesting new disease subtypes and pathways worthy of further investigation.
● We built DCell, an interpretable “visible” neural network (VNN) simulating a basic eukaryotic cell.
The structure of this model is formulated from extensive prior knowledge of the cell’s
hierarchy of subsystems documented for the budding yeast Saccharomyces cerevisiae.
● We prepared the first automation workflow example: Advanced Cancer Networks and
Data. We then prepared a workshop around this example and have directly presented it
to over 100 users across 4 events during this reporting period.
TRD 3: Multiscale Networks
● We report the development of a software framework, the Data-driven Ontology Toolkit
(DDOT), which enables the construction and analysis of hierarchical models in a Python
package and their visualization in a web application. To demonstrate the ease of
computational modeling enabled by DDOT, we programmatically constructed a
compendium of data-driven gene ontologies that describe the mechanisms underlying
650 diseases.
● Building on our previous CliXO work, we built a data-driven hierarchy supervised by the
topology of human-curated resources and compared it with the human-curated Gene Ontology
(GO). The resulting “baseline” hierarchy captured most of the conserved biological
processes and protein complexes.
● We developed an efficient parallel algorithm (“Ultra-fast CliXO”) to accelerate the
construction of data-driven ontologies. The central idea of our approach is to replace all
data structures with matrices and reformulate the original algorithmic framework as a
series of matrix operations.
● We have undertaken new research and development work to infer cell-cell interaction
networks, with a focus on refining the workflow that takes single-cell RNA-seq data and
identifies cell types, which can then be analyzed to predict intercellular interaction networks.
Driving Biological Projects
Current mapping of DBP and TRD projects for this reporting period. See the DBP Component
for more details.
New Working Group
During this reporting period, we made a special effort to launch Working Groups within the
NRNB organization to promote internal collaboration. One burgeoning area of common interest
relating to NRNB goals was Single Cell RNA-seq Analysis and Visualization. Gary Bader,
Scooter Morris and Alex Pico organized bi-monthly working group meetings, starting with a
survey of opportunities and needs. As we were developing a research strategy, Dr. Morris
identified a supplemental funding opportunity to cover a new postdoc and provide other
support on a set of aims complementary to our NRNB effort. This is an excellent example of
NRNB catalyzing new, relevant projects in Network Biology. In this new component, we will
report on the aspects of the working group subproject not covered by the independent funding,
which will begin during the next reporting period.
A key component of this project, and the source of its relevance to NRNB, will be the set of
tools we will build and contribute to the Cytoscape network analysis ecosystem, allowing
researchers to leverage dozens of analysis and visualization methods in single-cell analysis
pipelines. This includes the development of methods to group cells based on similarities in
their transcriptional profiles and to identify those groups by comparison with previously
collected profiles with "known" groups.
During our initial survey of the problem space, we noted the current standards in benchmarking
and evaluation of scRNA-seq protocols and statistical methods. The novel assignment of
biologically meaningful labels is clearly lacking and will require additional evaluation. Current
benchmarks involving known cell populations, e.g., as spike-ins, can be extended in our pipeline
benchmark to evaluate the functional labels assigned to the clusters. We rely on established
ontologies and cell fate maps to quantify the accuracy and sensitivity of the label assignments,
including the work of Jill Mesirov in our Collaboration Network to extend MSigDB with curated
cell type-specific signatures. By combining version control, immediate write access and
transparency, we intend the evaluation of the functional label assignments to be an ongoing
activity by the research community, continually approaching consensus.
Evaluation of the utility of the system will leverage existing collaborations ongoing in the Bader
lab (e.g. scRNA-seq human liver samples from Dr. Ian McGilvray and Dr. Sonya MacParland,
mouse brain development) and with Dr. Morris’s scRNA-seq methods work with the DeRisi lab
at UCSF and Dr. Pico’s scRNA-seq analyses through the Gladstone Bioinformatics Core.
These ongoing collaborations will allow us to test algorithms as well as the user interfaces of the
overall system.
Training
The table below lists the events since our last annual report. Additional one-on-one training
requests are tracked as services in our CSP report.

Google Summer of Code
Accepted for our 10th year as a mentoring organization, we gathered ~60 project ideas and ~40
mentors for GSoC 2017. We had our largest and most successful year yet, with 20 of 21
enrolled students completing their projects. Once again, we prepared a Mentor Resource
Packet, a collection of resources designed to help mentors with recruiting students. The packet
includes tips on how and where to recruit, as well as ready-to-use slides, flyers and other
materials. In addition to the technical accomplishments and productivity of our students, we are
also proud of the many important aspects of diversity our students represent in the GSoC
program, including geographical, gender and academic. New this year, we prepared 5 Student
Profiles in interview-style format to highlight some of our best students.
Infrastructure
● We improved the availability, performance and robustness of our Kubernetes cluster by
adding sensors to restart failed services; implementing separate clusters for production,
staging and testing; and defining policies for promotion of services from one cluster to
the next.
● We also added common logging, tracing and metering functionality that can be
leveraged by all services, including both Diffusion and Cytoscape itself.
● We created the CXMate service framework supported by a continuous integration
pipeline comprising CircleCI, GitHub and Docker, which together automate service
deployment and release.
● We demonstrated and further developed CXMate by first upgrading the Diffusion
(Python) service to use it, and then writing new services to perform network alignment
using the C++ based SANA algorithm, network format conversion (from CX to .png), and
image caching (for .png images).
● We continued our development and deployment of CyComponents (formerly CyWidgets)
to support the development of novel network biology web apps (e.g., HiView and DCell
in TRD 2.2 and 3.1).
● We migrated all 21 NRNB virtual machines to the VM cluster. Previously, they were
hosted on a 4-year-old single-processor Dell R420 server whose failure would have been
catastrophic.
● We also established a Jupyter notebook server as a VM (provisioned with 90GB and 4
cores).
● We made significant progress in fleshing out the features needed in Cytoscape from the
standpoint of usability. For example, the user interface for setting user preferences and
configuration information has been reorganized, and graphics capabilities were
supplemented with more line strokes, arrowheads and geometric shapes to fully
support the requirements of pathway analysis.
● We have begun work on a new app called Legend Creator, which adds an additional
control panel to Cytoscape for determining which properties of the graph should be used
to generate the legend and then to create annotations of the graph to represent that
information. Although this is a secondary function of visualization software, it is an
essential part of a finished report, and users have been frustrated by having to manually
construct their legends in outside software and add them to their figures outside of
Cytoscape.
Additional Highlights
Cytoscape tutorials
In the past year, we have moved our tutorial content from our previous Open Tutorials site to
GitHub Pages, reachable via tutorials.cytoscape.org. The Cytoscape tutorials site allows NRNB
to reach our goal of providing tutorial support to a broad and diverse community. Currently,
the site includes tutorials mainly for users. The tutorials have been significantly updated and
are now built using the reveal.js presentation framework, which enables easy creation of
attractive interactive slide decks using HTML. The tutorials and presentations are
modularized, making it easy to reuse and extend content. See the Training component for
more details.

Figure 1. Screenshot of the Cytoscape tutorials main page. Tutorials are organized into 4 main
categories: Data Visualization, Importing Data, Network Analysis, Exporting and Publishing.
Cytoscape automation training
We launched a new repository dedicated to promoting and collecting Cytoscape automation
training materials in the form of Python and R workflows:
https://github.com/cytoscape/cytoscape-automation
During this reporting period we added the following materials:
● Advanced: Cancer Networks and Data (R & Python)
● Basic: Fundamentals (R & Python)
● Basic: iGraph (R)
● Basic: Visual Styles (R)
● Basic: Feature Demos (Python)
● Challenge: PPI Network Pipeline (R) -- community contributed
● Challenge: Visualization of Cluster-Filtered Network (R) -- community contributed
● Challenge: Data Mining and Visualization (Python) -- community contributed
● Challenge: SBML Time Course Data (Python) -- community contributed
Cytoscape App Store
Ongoing maintenance of the site allows it to host 333 apps (an 8% increase over last year)
developed by 563 different developers around the world and to support Cytoscape users
downloading an average of 1,046 apps per day (a 24% increase over the past 12 months). This
adds up to just over 1,100,000 total app downloads since the launch of the App Store.
The top 3 downloaded apps, ClueGO, BiNGO and CluePedia, have accumulated over 192,000
downloads combined. During the month of January 2018, the site received over 43,000 page
views.
F1000Research: Cytoscape App Channel
The F1000Research Cytoscape App Channel now has a total of 36 peer-reviewed articles and 3
posters, with 3 articles revised in the last year. NRNB staff are gearing up for another round as
guest editors in 2018 to attract a new bolus of app articles leveraging Cytoscape automation.

TRD 1: Differential Networks
TRD 1.1: Tools for Inference of Differential Networks from
Protein States and Abundances Over Time
Background: The aim of this task was to improve the perturbation biology method (developed
by Nelander, Molinelli, and Korkut) for a more thorough understanding of protein networks and
their responses to drug perturbations. The perturbation biology method involves inference of
quantitative signaling models from high-throughput drug response data. In recent years, we
solved the network inference problem through implementation of a probabilistic statistical
physics algorithm called belief propagation (BP). In network inference, we also use prior
information extracted from pathway databases to improve model accuracy. The network models
are based on coupled nonlinear ordinary differential equations that represent the temporal
responses to perturbations.
Equation 1:
In Equation 1, x_i^μ are the perturbed and/or measured variables, μ represents the
perturbations, w_ij quantifies the edge strength, the constant α_i is the tendency of the
system to return to the initial state, and the constant ε_i defines the dynamic range of each
variable i. The transfer function Φ ensures that each variable has a sigmoidal temporal
behavior.
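A sketch of Equation 1, assuming the form used in earlier perturbation biology publications. The external perturbation term u_i^μ is an assumption here, since it is not named in the symbol definitions above:

```latex
\frac{dx_i^{\mu}}{dt}
  = \varepsilon_i \,\Phi\!\left( \sum_{j} w_{ij}\, x_j^{\mu} + u_i^{\mu} \right)
  - \alpha_i \, x_i^{\mu}
```

Under this form, a variable with no input relaxes toward zero at rate α_i, and its steady-state magnitude is bounded by ε_i/α_i when Φ is bounded by 1.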
However, our earlier BP approach was developed for static data obtained at a single time point
and was based on steady-state assumptions. We have therefore further developed the network
inference framework to be able to account for time-resolved measurements.
Progress This Year: We have now developed a modeling framework for network inference to
handle time-resolved data. We have used time-resolved drug response measurements in
melanoma cells. The data contain protein measurements as well as phenotypic measurements
(cell death and cell growth) for 54 different drug combinations. To infer model parameters, we
have developed a framework in TensorFlow to solve the ODEs (Equation 1) and thereby find
the optimal values of the parameters (w_ij, α_i, and ε_i) by comparing model simulations with
data for all measured time points. TensorFlow uses automatic differentiation to compute the
gradients of the loss function efficiently, and therefore gives fast optimization with gradient
descent methods, even for large optimization problems. To limit overfitting, we used an L1
norm penalty in the loss function, and thereby found networks with 200-300 edges between
136 nodes.
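A minimal sketch of the kind of fitting objective described above, written in plain Python rather than TensorFlow. It assumes the ODE has the form dx_i/dt = ε_i Φ(Σ_j w_ij x_j + u_i) − α_i x_i with Φ = tanh, integrates it with forward Euler, and scores a parameter set by the squared error at the measured time points plus an L1 penalty on the edge weights. The function names, the perturbation vector u, the step size and the penalty weight lam are illustrative assumptions, and the gradient-based optimization itself is not shown.

```python
import math


def simulate(w, alpha, eps, u, x0, dt=0.01, steps=1000):
    """Forward-Euler integration of the assumed network ODE.

    w[i][j] is the edge weight from node j to node i, u[i] is a constant
    perturbation acting on node i (hypothetical input), x0 is the initial state.
    Returns the list of states, one per time step (including the initial state).
    """
    n = len(x0)
    x = list(x0)
    trajectory = [list(x)]
    for _ in range(steps):
        drive = [sum(w[i][j] * x[j] for j in range(n)) + u[i] for i in range(n)]
        x = [x[i] + dt * (eps[i] * math.tanh(drive[i]) - alpha[i] * x[i])
             for i in range(n)]
        trajectory.append(list(x))
    return trajectory


def l1_penalized_loss(w, alpha, eps, u, x0, observations, lam, dt=0.01):
    """Squared error at measured time points plus lam * sum(|w_ij|).

    observations is a list of (step_index, measured_state) pairs.
    """
    last_step = max(step for step, _ in observations)
    traj = simulate(w, alpha, eps, u, x0, dt=dt, steps=last_step)
    squared_error = sum(
        (traj[step][i] - measured[i]) ** 2
        for step, measured in observations
        for i in range(len(measured))
    )
    l1 = sum(abs(w_ij) for row in w for w_ij in row)
    return squared_error + lam * l1
```

In an automatic-differentiation framework, the same loss would be differentiated with respect to w, alpha and eps; the L1 term is what pushes most edge weights toward zero and keeps the inferred network sparse.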
The network models were developed with the following steps:
1. We used leave-8-out cross-validation to confirm that the data were of high quality, and
that the modeling framework produced models with predictive power. We thus divided
the data into a training set and a test set for each drug, where the test set consisted of
all combinations of a particular drug and all other drugs. The left-out data were predicted
with good accuracy, as exemplified for one of the drugs (MEKi) in Figure 1.
2. The modeling framework was then used to create network models for all available data.
We ran the optimization in TensorFlow 10 times and calculated an average network out
of the networks that were produced (Figure 2). The average network was calculated from
the mean value of each w_ij, weighted by how often the edge was present in the
networks. The obtained average network reproduces known features for drug-protein
interactions (e.g. the node AKTi inhibits the node AKT-S473) as well as protein-protein
interactions (e.g. the node MEK1/2-S217 activates the node ERK1/2-T202), and
also predicts new interactions that potentially can be experimentally tested.
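The averaging in step 2 can be sketched as follows. This is one plausible reading, not the exact procedure used: for each edge, take the mean weight over the runs in which it is present, the fraction of runs in which it is present, and their product as a combined score. The threshold tol for calling an edge "present" and the example weights are hypothetical.

```python
def average_network(runs, tol=1e-6):
    """Combine edge weights from repeated optimization runs.

    runs is a list of dicts mapping (source, target) -> fitted weight.
    Returns a dict mapping each edge to (mean_weight, frequency, score), where
    mean_weight is averaged over runs containing the edge, frequency is the
    fraction of runs containing it, and score = mean_weight * frequency.
    """
    n_runs = len(runs)
    edges = set()
    for run in runs:
        edges.update(edge for edge, weight in run.items() if abs(weight) > tol)

    summary = {}
    for edge in edges:
        present = [run[edge] for run in runs
                   if edge in run and abs(run[edge]) > tol]
        mean_weight = sum(present) / len(present)
        frequency = len(present) / n_runs
        summary[edge] = (mean_weight, frequency, mean_weight * frequency)
    return summary


# Two hypothetical runs; node names follow the examples in the text.
runs = [
    {("AKTi", "AKT-S473"): -1.0, ("MEK1/2-S217", "ERK1/2-T202"): 0.5},
    {("AKTi", "AKT-S473"): -0.5},
]
result = average_network(runs)
```

An edge found in every run keeps its full mean weight, while an edge found in only some runs is down-weighted, which matches the edge thickness described for Figure 2.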
Figures

Figure 1. Leave-8-out cross-validation. An example of the performance of the modeling
framework. MEKi is one of the drugs in the data, and Xi represents any of the other drugs. The
model was created using data for all drug combinations without MEKi, as well as data for MEKi
as a single drug. The left-out combinations (MEKi + Xi) were predicted and compared to the
corresponding data for proteomic nodes (left) and phenotypic nodes (right). Colors represent the
drug combined with MEKi, as stated in the legend. The same analysis was made for all drugs
(not shown). The Pearson correlation coefficient (corrcoef) shows good agreement between
model simulations and left-out data.
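A minimal sketch of the split and the agreement measure described in this caption, assuming each experimental condition is represented as a set of drug names. The drug name "DrugX" and the function names are hypothetical.

```python
import math


def leave_drug_out_split(conditions, held_out_drug):
    """Hold out every combination containing held_out_drug.

    conditions is a list of frozensets of drug names. The held-out drug as a
    single agent stays in the training set, as in the caption above.
    """
    train, test = [], []
    for condition in conditions:
        if held_out_drug in condition and len(condition) > 1:
            test.append(condition)
        else:
            train.append(condition)
    return train, test


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


conditions = [
    frozenset({"MEKi"}),
    frozenset({"AKTi"}),
    frozenset({"MEKi", "AKTi"}),
    frozenset({"MEKi", "DrugX"}),
    frozenset({"AKTi", "DrugX"}),
]
train, test = leave_drug_out_split(conditions, "MEKi")
```

The model would be fitted on train, simulated for the conditions in test, and the simulated and measured values compared with pearson.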
Figure 2. A resulting average network. A graphical representation of the average network
obtained from 10 models generated with the modeling framework. The thickness of an edge is
based on both the value of the parameter w_ij and how often the edge occurs in the obtained
models. Blue nodes represent drug effects, pink nodes represent protein measurements, and
grey nodes represent phenotypic measurements.

TRD 1.2: Protein network alignment algorithm and viewer;
DBP 2: Vidal and Hill
TRD1, Differential networks Aim 2. As previously reported, we continue to develop
protein-protein interaction network alignment algorithms since publishing “GreedyPlus: An
Algorithm for the Alignment of Interface Interaction Networks” in 2015, the first such algorithm
for protein interaction networks that includes binding site information. This work has led us to
design a new technology for ortholog function assessment that simultaneously considers protein
and network evolution, but focuses on two interacting proteins and their immediate interactors.
To support DBP 2 (Vidal and Hill) “Mapping the human interactome and its rewiring by disease
mutations”, we continue to engage in weekly discussions with the Vidal team to consult on the
analysis of their ongoing human interactome project, in particular where their work includes
differential network analysis and consideration of binding sites. We have designed an
interactome portal for their project at http://interactome.baderlab.org/ that will include network
alignment features.
TRD 1.3: Facilitating the interpretation of AP-MS data as
interaction networks
STRING app
This project to enable mass spec practitioners to perform network analysis is successfully
transitioning from Stage 3 to Stage 4, from deployment of public tools to promoting wide
adoption. As such, we report on major updates to the stringApp, a major component of AP-MS
and other proteomics workflows, as well as a concerted effort to build interactive tutorials and
example usage.
The major activity was Dr. Morris’s continued development of the stringApp for Cytoscape.
STRING (http://www.string-db.org/) is an important public interaction database, well regarded
by mass spec practitioners. With input from both mass spec practitioners and the
developers/maintainers of the STRING database, Dr. Morris implemented the app to take full
advantage of all the unique aspects of STRING, as described in the NAR special database
issue for 2017 [3]. The stringApp has been downloaded over 7400 times since its original
release in December of 2015 and is freely available at the Cytoscape App Store:
http://apps.cytoscape.org/apps/stringapp.

Figure 1. Screenshot of STITCH compound-protein network. This is the result of a query for
Coumadin® (warfarin), a common blood thinner used to prevent thrombosis. Queries of
proteins or compounds are supported. The nodes in Cytoscape preserve the signature STRING
style with structures and glass ball effects.
During this reporting period, Dr. Morris implemented critical support for STITCH as a fourth
query option in the stringApp (Figure 1). The STITCH database includes both physical
interactions and functional associations between chemical compounds and proteins
(http://stitch.embl.de). Now, in addition to protein, PubMed and disease queries, Cytoscape
users can select STITCH: protein/compound query and interrogate the STITCH database for
protein-compound interactions. This new dimension of interactions allows researchers to
extend protein networks into compound space or build protein networks from a set of one or
more compounds. This nicely complements the network and protein interaction resource tools
already available in Cytoscape. It is particularly relevant given the growing demand for, and
data deluge from, drug compound screens and metabolomics, fields that include many mass
spectrometry practitioners.
Another feature added to the stringApp during this period is enrichment analysis. This was a
major step in the AP-MS protocol that previously required the installation and operation of a
separate app. Now, upon import of any network via the stringApp, the user can choose to
perform enrichment analysis and obtain Gene Ontology terms and KEGG pathway results.
Enrichment results may be visualized both as a table and as charts on the nodes (Figure 2).
The stringApp provides a number of different chart types, as well as the ability to filter terms to
specific categories and to remove redundant terms (using Jaccard overlap). This is a valuable
addition to workflows that involve STRING, DISEASE, or STITCH networks.
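Redundancy removal by Jaccard overlap can be illustrated with a short sketch. This is not the stringApp implementation; it assumes a greedy rule in which terms are visited from most to least significant and a term is dropped when its gene set overlaps an already kept term beyond a cutoff. The cutoff value, term names and gene sets below are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two collections of gene names."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def remove_redundant_terms(terms, cutoff=0.5):
    """Greedy redundancy filter for enrichment results.

    terms is a list of (name, p_value, genes) tuples. Terms are processed in
    order of increasing p-value; a term is kept only if its Jaccard index with
    every previously kept term is at or below the cutoff.
    """
    kept = []
    for name, p_value, genes in sorted(terms, key=lambda term: term[1]):
        if all(jaccard(genes, kept_genes) <= cutoff for _, _, kept_genes in kept):
            kept.append((name, p_value, genes))
    return kept


terms = [
    ("term A", 0.001, {"APP", "PSEN1", "APOE"}),
    ("term B", 0.010, {"APP", "PSEN1", "APOE", "MAPT"}),  # overlaps term A: 3/4
    ("term C", 0.020, {"TREM2", "SORL1"}),
]
filtered = remove_redundant_terms(terms, cutoff=0.5)
```

A lower cutoff removes more terms; a cutoff of 1.0 removes nothing, since the Jaccard index never exceeds 1.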

Figure 2. Screenshot of stringApp enrichment results. This shows the results of performing
a DISEASE query for the top 10 proteins associated with Alzheimer’s disease. The enrichment
results have been filtered to show only the GO Process category and to remove redundant
terms. The “Split donut” chart (the default) has been used to show the enriched terms for each
protein.
Identifier Mapper Core App
The Id-Mapper app performs a critical role in integrating disparate databases. Proteins and
compounds identified in AP-MS experiments are not generally labeled sufficiently for querying
external resources. Our analysis of the protocol showed that much of the complexity
experienced in mass spec analysis stems from identifier and terminology mismatch. Our
experience has been that each iteration of developing this workflow involves improvements to
the id-mapper core app. During this past year, the app was revised to support automation and
scriptability. Automation is a key new feature of the Cytoscape architecture in version 3.7.
Id-mapper served as an early use case in adding command functionality to apps. Because this
app is an extension of the table browser subsystem, it serves as a forerunner for a series of
features in table processing, a key function planned for the near future on the project roadmap.

WikiPathways as Demonstration of Automation and New Network Search Bar
Another important improvement recently added to Cytoscape to expedite workflows is the
network search bar. Most searches can now be entered directly in the network control panel,
instead of asking the user to find a menu to open a dialog to type a keyword. These seemingly
minor hurdles accumulate into a lot of extra work for users and slow the performance of tasks,
so it is important that the software evolve to reduce repetitive tasks and expedite workflows.
The network search bar lets apps perform enrichment of networks faster than the previous
solution of providing different control panels depending on the database being searched.
The current release of WikiPathways is among the first apps to support commands and
automated execution. The app provides specific functionality for searching and importing from
the WikiPathways web resource. Now that the app can be controlled via CyREST, imports of
pathway networks can be scripted from a variety of languages, including R, Java and Python.
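A minimal sketch of what such a scripted import might look like from Python, using only the standard library. It assumes a local Cytoscape session exposing CyREST at its usual address (http://localhost:1234/v1) and assumes the WikiPathways app registers a command named import-as-pathway in the wikipathways namespace that takes an id argument; these names should be checked against the command help of the installed app version. The pathway identifier WP554 is only an illustrative value.

```python
import json
import urllib.request

CYREST_BASE = "http://localhost:1234/v1"  # assumed default CyREST address


def build_import_request(pathway_id, base_url=CYREST_BASE):
    """Build a POST request for the assumed wikipathways import command."""
    url = f"{base_url}/commands/wikipathways/import-as-pathway"
    body = json.dumps({"id": pathway_id}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def import_pathway(pathway_id, base_url=CYREST_BASE):
    """Send the request; requires a running Cytoscape with the app installed."""
    request = build_import_request(pathway_id, base_url)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request does not contact Cytoscape; import_pathway() does.
request = build_import_request("WP554")
```

Separating request construction from sending keeps the script testable without a running Cytoscape instance.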
Promoting Wide Adoption
The stringApp is incorporated into a new tutorial called “Functional Enrichment Analysis”
(https://cytoscape.github.io/cytoscape-tutorials/protocols/modules/functional-enrichment/#/).
Mass spectrometry practitioners and analysts routinely work with network models constructed
from fundamental interaction measurements. The data inform the biomedical understanding of
host-pathogen interactions, signaling networks and network rewiring in cancer, to name a few
examples. This is a critical field of research for which to provide powerful and accessible
network visualization and analysis technology. This project component is aimed at making
specific improvements and adding new features to Cytoscape to enhance its applicability to,
and adoption by, the mass spec community. The main objectives are to augment Cytoscape to
streamline the typical mass spec analysis pipeline and to provide better access to public mass
spec data and annotation repositories relevant to researchers. This is relevant to mass spec
practitioners who want to view their interaction data in the context of other public interaction
data and other annotations. The integration of identifier mapping into Cytoscape as a built-in
feature, together with the stringApp, greatly enhances the user experience for mass
spec practitioners as well as many other users. See the before/after comparison of the steps
required in the published mass spec Nature Protocol [1]. The simplification goes beyond app
integration and user interface work. For example, rather than requiring the user to explicitly
connect to a database source, the new tool automatically connects to an existing web service
provided by BridgeDb [2]. And rather than requiring the user to explicitly choose a source
identifier type, the new tool infers the identifier type from the values extracted from the column
indicated by the user in the right-click action that initiated the dialog. We also included better
options, based on prior experience with use and training.
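Inferring an identifier type from column values can be sketched with a few regular expressions. This is an illustration only, not the logic used by the Id-Mapper app or BridgeDb: the three patterns are simplified, the candidate list is far from complete, and the min_fraction threshold is a hypothetical parameter.

```python
import re

# Simplified patterns for three common identifier systems (illustrative only).
PATTERNS = [
    ("Ensembl", re.compile(r"^ENS[A-Z]*[GTP]\d{11}$")),
    ("UniProt", re.compile(
        r"^[OPQ][0-9][A-Z0-9]{3}[0-9]$"
        r"|^[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}$")),
    ("Entrez Gene", re.compile(r"^\d+$")),
]


def infer_identifier_type(values, min_fraction=0.8):
    """Return the identifier type matching the largest share of values.

    Returns None when the column is empty or when no pattern matches at
    least min_fraction of the non-empty values.
    """
    cleaned = [v.strip() for v in values if v and v.strip()]
    if not cleaned:
        return None
    best_name, best_fraction = None, 0.0
    for name, pattern in PATTERNS:
        fraction = sum(1 for v in cleaned if pattern.match(v)) / len(cleaned)
        if fraction > best_fraction:
            best_name, best_fraction = name, fraction
    return best_name if best_fraction >= min_fraction else None
```

For example, a column containing P04637, Q9Y6K9 and P38398 would be reported as UniProt, while a column of free-text labels would return None and the user would have to choose the type explicitly.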
References

1. Morris JH, Knudsen GM, Verschueren E, et al. Affinity purification-mass spectrometry and
network analysis to understand protein-protein interactions. Nature Protocols.
2014;9:2539-54.
2. Gao J, Zhang C, van Iersel M, et al. BridgeDb app: unifying identifier mapping services for
Cytoscape. F1000Research. 2014;3:148.
3. Szklarczyk D, Morris JH, Cook H, et al. The STRING database in 2017: quality-controlled
protein–protein association networks, made broadly accessible. Nucleic Acids Research.
2017;45(Database issue):D362-D368.

TRD 2: Descriptive to Predictive Networks
TRD 2.1: Predicting clinical outcome using patient similarity
networks
Patient classification has widespread biomedical and clinical applications, including diagnosis,
prognosis, disease subtyping and treatment response prediction. A general purpose and
clinically relevant prediction algorithm should be accurate, generalizable, be able to integrate
diverse data types (e.g. clinical, genomic, metabolomic, imaging), handle sparse data, be
compatible with patient privacy protection systems and be intuitive to interpret. In our last
report, we described the development of netDx (http://netdx.org/), a supervised patient classification
framework based on patient similarity networks that meets the above criteria. netDx models
input data as patient networks and uses the GeneMANIA machine learning algorithm that we
previously developed for network integration and feature selection. We demonstrate the
features of this framework by integrating up to six heterogeneous data types, including clinical
variables, DNA methylation, somatic mutations, mRNA, miRNA and protein expression profiles,
for survival prediction in kidney, lung, ovarian and brain cancer. We benchmarked netDx
performance as a machine-learning method by predicting binary survival in four tumour types.
netDx ranks at the top for two tumours and within the top 20th percentile for all four,
demonstrating consistently good performance, and we continue to work on improving
it. In comparison to traditional machine learning-based patient classifiers, netDx
results are more interpretable, visualizing the decision boundary in the context of patient
similarity space and identifying biological pathways and other features important for prediction.
By defining patient similarity using pathway-level gene expression, netDx identifies known
molecular correlates of poor survival in kidney cancer, and identifies potentially novel pathways
and biomarkers. Thus, netDx can serve both as a useful classifier and as a tool for discovery of
biological features characteristic of disease. An open-source R/Java implementation of netDx is
available along with sample files and automation workflows packaged as vignettes.
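The idea of classifying patients through a similarity network can be illustrated with a toy sketch. This is not netDx: it builds a single network from Pearson correlation between feature profiles and assigns a new patient to the class whose labelled members are most similar on average, whereas netDx integrates many networks and selects features with GeneMANIA. All patient identifiers, profiles and class labels below are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either profile has no variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def similarity_network(profiles):
    """Return {(a, b): similarity} for every unordered pair of patients."""
    ids = sorted(profiles)
    return {
        (a, b): pearson(profiles[a], profiles[b])
        for i, a in enumerate(ids)
        for b in ids[i + 1:]
    }


def classify(patient, network, labels):
    """Assign the class whose labelled patients are most similar on average."""
    by_class = {}
    for (a, b), similarity in network.items():
        if patient in (a, b):
            other = b if a == patient else a
            if other in labels:
                by_class.setdefault(labels[other], []).append(similarity)
    return max(by_class, key=lambda c: sum(by_class[c]) / len(by_class[c]))


profiles = {
    "p1": [1.0, 2.0, 3.0, 4.0],
    "p2": [2.0, 4.0, 6.0, 8.5],
    "p3": [4.0, 3.0, 2.0, 1.0],
    "p4": [8.0, 6.0, 5.0, 2.0],
    "new": [1.0, 2.0, 2.5, 4.0],
}
labels = {"p1": "good", "p2": "good", "p3": "poor", "p4": "poor"}
network = similarity_network(profiles)
predicted = classify("new", network, labels)
```

Because the prediction is expressed in terms of which patients the new sample resembles, the result can be inspected directly in the network, which is the interpretability argument made above.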
To support DBP 5: Sage Bionetworks: Molecular stratification of colorectal cancer and DREAM
challenges, we have revisited all major DREAM challenges where data are available and where
the challenge experimental design is compatible with netDx’s classification engine (two-class
classification). We have broadly searched for additional similar data to analyze as well. We will
report the results in a publication, such as the netDx paper.

TRD 2.2a: Predicting cellular response to perturbation w/
network-guided regression
Background: The overall goal of this project is the identification of biomarkers involved in the
progression of cancer and the response to pharmaceutical treatment. This goal is to be
accomplished through the use of regression-based methods that are subject to biological
network constraints, so that biomarkers can be understood in the context of regulatory processes.
Current progress: In conjunction with DBP 8, we have made some preliminary progress on
the network-constrained regression methodology during this reporting period. Our starting point
is the recently published GELnet method.
Using the GELnet method, we analyzed the Sanger GDSC pharmacogenomic data for 1080 cell
lines across 42 tumor classifications, including transcript expression, mutation, and drug
response profiling on 297 compounds. Transcript expression included information on 19,562
genes, and gene-level summarized mutations were available for 16,532 genes. The network used
was either 1) the entire Pathway Commons (PC) database (pcAll), covering 22 interaction
databases, or 2) only the CORUM protein complex database (pcCorum). We compared the LASSO and
GELnet regression methodologies to understand the benefits of GELnet over the commonly used
LASSO method.
Figure 1 summarizes our findings comparing the two methods. LASSO tends to
produce models that correlate more accurately with the observed drug response, but there
are combinations of feature sets where the GELnet models outperform it. One example is the
combination of mutation information, the entirety of the Pathway Commons network, and a
starting set of genes that includes 1) genes correlated with the response and 2) genes in
gene sets of interest. In this example, although the GELnet models are
more accurate, they come with a limitation: they tend to be larger than the
corresponding LASSO models. Understanding the properties of these models is ongoing.
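
To make the difference between the two approaches concrete, the sketch below evaluates a regression objective with an L1 (LASSO) penalty and an additional network-smoothness penalty. It is a simplified illustration, not the published GELnet implementation: the graph-Laplacian form of the network penalty, the function names, and the toy data are assumptions. Setting `lam2` to zero recovers a plain LASSO objective.

```python
def l1_penalty(weights):
    """LASSO penalty: sum of absolute feature weights."""
    return sum(abs(w) for w in weights.values())

def network_penalty(weights, edges):
    """Sum of squared weight differences across network edges.

    This equals w^T L w for the graph Laplacian L, so it is small when
    interacting genes receive similar weights.
    """
    return sum((weights[u] - weights[v]) ** 2 for u, v in edges)

def objective(X, y, weights, edges, lam1, lam2):
    """Squared-error loss plus L1 and network penalties."""
    n = len(y)
    loss = 0.0
    for row, target in zip(X, y):
        prediction = sum(weights[g] * row[g] for g in weights)
        loss += (target - prediction) ** 2
    return (loss / (2 * n)
            + lam1 * l1_penalty(weights)
            + 0.5 * lam2 * network_penalty(weights, edges))

# Toy network A - B - C and one sample (hypothetical values).
edges = [("A", "B"), ("B", "C")]
X = [{"A": 1.0, "B": 0.0, "C": 0.0}]
y = [1.0]

smooth = {"A": 1.0, "B": 1.0, "C": 0.0}  # neighbors A and B share a weight
rough = {"A": 1.0, "B": 0.0, "C": 1.0}   # weights alternate along the path

obj_smooth = objective(X, y, smooth, edges, lam1=0.1, lam2=1.0)
obj_rough = objective(X, y, rough, edges, lam1=0.1, lam2=1.0)
```

Both weight vectors fit the toy sample equally well and have the same L1 penalty, so only the network term separates them. The network term favors the solution whose weights agree along interactions.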

Figure 1: Faceted visualization of input feature combinations for LASSO and GELnet models
(see Data section for term descriptions). A) Drug Response Correlations: In all combinations of
network and feature set, models developed with expression data resulted in sets of models
where the median correlation to drug response for the LASSO models was higher than
for the GELnet models. The same held when mutation data was used, except when the
entire Pathway Commons network was used, regardless of feature set. Example drugs where
this was the case included S-trityl-L-cysteine (mitotic progression inhibitor) and romidepsin
(histone deacetylase inhibitor). B) Feature Counts: In cases where mutation data produced
more numerically accurate models, these models tended to have larger feature counts.
TRD 2.2b: Predicting cellular response to perturbation w/
network-guided regression
Part I
The goal of this part is to identify important biomarkers involved in the progression of cancer
and the response to pharmaceutical treatment in the context of biological network constraints. A
very popular approach is heat diffusion, also called network propagation, by which individual
gene mutations in a tumor are diffused across the network. Such diffusion creates “hot” network
neighborhoods of genes proximal to mutated genes. This type of approach could help to define
driver genes and potential drug targets for cancer therapy. It also allows patients to be
clustered into subtypes, because the neighborhoods, unlike individual genes, are commonly
mutated and thus provide a basis for grouping tumors. In this project we find that the informative
genes and pathways driving cancer pathogenesis and subtypes can be remarkably difficult to
identify in the presence of gene interactions irrelevant to cancer. We show that this problem can
be addressed by assembly of a focused Evolutionarily Selected Pathways Map (ESPmap), in
which all interactions are supported by both protein-protein binding and genetic epistasis during
somatic tumor evolution. Our approach creates stringent tumor-specific networks of biophysical
protein interactions, identified by signatures of epistatic selection during tumor evolution. Using
this newly constructed network, we analyzed the major cancer genome atlases to derive a
hierarchical classification of tumor subtypes linked to characteristic mutated pathways. The
resulting new stratification of tumors corresponded only weakly to known clinical subtypes and
variables, or not at all, suggesting new disease subtypes and pathways worthy of further
investigation. These pathways are clinically prognostic and predictive, including the
TP53-AXIN-ARHGEF17 combination in liver and CYLC2-STK11-STK11IP in lung cancer, which
we validate in independent cohorts. Further investigation was carried out in a panel of liver and
lung cancer cell lines characterized in the Genomics of Drug Sensitivity in Cancer (GDSC)
dataset. We found that the pathways we detected were associated with different types of drug
response.
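
The network propagation step mentioned at the start of Part I can be sketched as a random walk with restart. This is a generic illustration rather than the ESPmap pipeline: the restart probability, the degree normalization, the iteration count, and the toy network are assumptions.

```python
def propagate(adjacency, seed_scores, restart=0.5, iterations=50):
    """Diffuse seed scores over an undirected network (random walk with restart).

    adjacency:   {node: set of neighboring nodes}, with no isolated nodes
    seed_scores: {node: initial score}, e.g. 1.0 for a mutated gene
    """
    scores = dict(seed_scores)
    for _ in range(iterations):
        updated = {}
        for node in adjacency:
            # Each neighbor splits its current score equally among its neighbors.
            incoming = sum(scores[nb] / len(adjacency[nb]) for nb in adjacency[node])
            updated[node] = (1 - restart) * incoming + restart * seed_scores[node]
        scores = updated
    return scores

# Toy path network A - B - C - D with a single mutated gene, A (hypothetical).
adjacency = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B", "D"},
    "D": {"C"},
}
seeds = {"A": 1.0, "B": 0.0, "C": 0.0, "D": 0.0}
heat = propagate(adjacency, seeds)
```

After propagation, genes closer to the mutated gene retain more "heat", which is what produces the hot neighborhoods used for clustering tumors.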
Part II
In this project, we built DCell, an interpretable “visible” neural network (VNN) simulating a basic
eukaryotic cell. The structure of this model is formulated from extensive prior knowledge of the
cell’s hierarchy of subsystems documented for the budding yeast Saccharomyces cerevisiae. It
is drawn from either of two sources: the Gene Ontology (GO), a literature-curated reference
database from which we extracted 2526 intracellular components, processes, and functions; or
CliXO, an alternative ontology of similar size inferred from large-scale molecular datasets rather
than literature curation. Such hierarchies form a natural bridge from variations in genotype (at
the scale of nucleotides and genes) to variations in phenotype (at the scale of cells and
organisms).
The function of our new interpretable neural network is learned during a training phase, in which
perturbations to genes propagate through the hierarchy to impact parent subsystems that
contain them. This gives rise to functional changes in protein complexes, biological processes,
organelles and, ultimately, a predicted response at the level of cell growth phenotype. Unlike
a standard neural network model, our interpretable neural network is tied to an extensive
hierarchy of internal biological subsystems with states that can be queried. This ‘visible’
aspect raises the possibility that our approach could be used for in-silico studies of biological
mechanism, of which we focused on four major types: 1) explaining a genotype-phenotype
association, 2) prioritizing the important mechanisms in the determination of phenotype overall,
3) characterizing the genetic logic implemented by a process, and 4) discovering new biological
processes and states.
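
The way gene perturbations propagate upward through a hierarchy, leaving a queryable state at each subsystem, can be sketched as below. This is a deliberately minimal illustration and not the DCell architecture: it uses a single tanh unit per subsystem, untrained placeholder weights, and a hypothetical three-level hierarchy.

```python
from math import tanh

# Hypothetical toy hierarchy: subsystem -> (child subsystems, directly annotated genes).
hierarchy = {
    "complex_1": ([], ["geneA", "geneB"]),
    "process_1": (["complex_1"], ["geneC"]),
    "cell": (["process_1"], []),
}
order = ["complex_1", "process_1", "cell"]  # children are evaluated before parents

def forward(genotype, weights, bias):
    """Compute one state per subsystem from a binary genotype.

    genotype: {gene: 1 if disrupted, else 0}
    weights:  {(subsystem, input name): float}; placeholder values, not trained
    bias:     {subsystem: float}
    """
    states = {}
    for term in order:
        children, genes = hierarchy[term]
        total = bias[term]
        total += sum(weights[(term, g)] * genotype[g] for g in genes)
        total += sum(weights[(term, c)] * states[c] for c in children)
        states[term] = tanh(total)
    return states

weights = {
    ("complex_1", "geneA"): 1.0, ("complex_1", "geneB"): 1.0,
    ("process_1", "complex_1"): 1.0, ("process_1", "geneC"): 1.0,
    ("cell", "process_1"): 1.0,
}
bias = {term: 0.0 for term in order}

wild_type = forward({"geneA": 0, "geneB": 0, "geneC": 0}, weights, bias)
mutant = forward({"geneA": 1, "geneB": 0, "geneC": 0}, weights, bias)
```

Because every intermediate value is attached to a named subsystem, the effect of disrupting `geneA` can be traced from `complex_1` through `process_1` to the top-level output, which is the sense in which such a model is "visible".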
TRD 2.3: Cytoscape Apps for accessible network analysis of
genetic variant data
The initial work on this subproject was focused on surveying the current capabilities of
Cytoscape for visualizing genetic variant data and performing diffusion analysis and subnetwork
generation based on variant data. To this end, we prepared the first automation workflow
example: Advanced Cancer Networks and Data.
https://github.com/cytoscape/cytoscape-automation/blob/master/for-scripters/R/advanced-cancer-networks-and-data-rcy3.Rmd
We then prepared a workshop around this example and have presented it directly to over 100
users across 4 events during this reporting period (see Training). We will expand on
these materials as we implement new tools per the original timeline for this subproject.

TRD 3: Multiscale Networks
TRD 3.1: Data-driven assembly & refinement of gene
ontologies from networks
We report the development of a software framework, the Data-driven Ontology Toolkit (DDOT),
which enables the construction and analysis of hierarchical models in a Python package and
their visualization in a web application. In contrast to existing tools for studying hierarchical
models and literature-curated ontologies, DDOT supports general hierarchies known as directed
acyclic graphs (instead of trees) and focuses on the analysis of data-driven structure (instead of
the semantic relations emphasized by traditional ontology software).
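
The distinction between a tree and a general directed acyclic graph can be illustrated with a small hierarchy in which one term has two parents. The sketch below uses plain Python data structures and does not use or represent the DDOT API; the term names, gene identifiers, and helper functions are hypothetical.

```python
# Hypothetical toy hierarchy: each term lists its parent terms.
parents = {
    "root": [],
    "term_A": ["root"],
    "term_B": ["root"],
    "term_C": ["term_A", "term_B"],  # two parents, so this is a DAG, not a tree
}
# Genes annotated directly to each term (placeholder identifiers).
direct_genes = {
    "root": set(),
    "term_A": {"g1"},
    "term_B": {"g2"},
    "term_C": {"g3", "g4"},
}

def ancestors(term):
    """All terms reachable by following parent links upward."""
    found = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents[current])
    return found

def propagated_genes():
    """Each term inherits the genes of all of its descendants."""
    genes = {term: set(own) for term, own in direct_genes.items()}
    for term, own in direct_genes.items():
        for ancestor in ancestors(term):
            genes[ancestor] |= own
    return genes

is_tree = all(len(p) <= 1 for p in parents.values())
term_genes = propagated_genes()
```

In a tree, the genes of `term_C` would be counted under only one of `term_A` or `term_B`. In a DAG they contribute to both, which matches the multi-parent relations described above.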
To demonstrate the ease of computational modeling enabled by DDOT, we programmatically
constructed a compendium of data-driven gene ontologies that describe the mechanisms
underlying 650 diseases. These ontologies were based on two types of input data: a set of
known gene associations for each disease and a gene-gene similarity network based on several
‘omics datasets. Because it relies on calls to DDOT functions, the software pipeline was very
concise, consisting of 16 lines of code for loading input data and setting parameters and 8 lines for modeling in a
single Python script. For each disease, its ontology suggests an association with new genes as
well as new cellular subsystems, many of which are not found in the literature-curated Gene
Ontology. For example, we studied Fanconi Anemia (FA), a rare genetic disorder that is
associated with bone marrow failure, myeloid dysplasia, and increased cancer risk. The
resulting ontology consisted of 194 genes arranged in a hierarchy of 74 cellular subsystems,
which includes the known association with DNA repair pathways but also proposes a previously
unknown role of orthogonal subsystems such as splicing and telomere maintenance.
A major component of DDOT is the HiView web application, which provides an interactive
visualization of two major features of a data-driven ontology: (1) the hierarchical structure
relating genes and subsystems and (2) the data supporting the inference of each subsystem. To
provide the first feature, HiView allows the user to interactively zoom between more expansive
views of the entire hierarchy and more focused views of particular subsystems. It also provides
techniques for visualizing multi-parent and multi-children relations in ontologies. To provide the
second feature, HiView also visualizes the network of gene-gene interactions that were used to
support the inference of each subsystem in the ontology (Figure 1).

Figure 1. Screenshot of the HiView application to visualize ontologies (http://hiview.ucsd.edu)
TRD 3.2: Functionalized gene ontologies as a hierarchy of
functional prediction
Part I. We used previous CliXO work to build a data-driven hierarchy supervised by the topology
of human-curated resources and compared it with the human-curated Gene Ontology (GO). The
resulting “baseline” hierarchy captured most of the conserved biological processes and protein
complexes. As a starting point to study context-dependent interactions, we integrated
cell-line-specific PPIs measured by AP-MS (Krogan lab) into our model, and aimed to interpret these
interactions by comparing the iterated model to the baseline model. We are also working on
improving CliXO by optimizing it to produce significantly fewer false positive subsystems. The
resulting human data-driven ontology is much more robust and contains fewer false positive
subsystems in comparison to our previous version.
Part II. In this project, we developed an efficient parallel algorithm (“Ultra-fast CliXO”) to
accelerate the construction of data-driven ontologies. The central idea of our approach is to
replace all data structures with matrices and reformulate the original algorithmic framework as a
series of matrix operations. This reformulation has two major advantages, relating to
both hardware architecture and software development cost: 1) faster transfer of data from
RAM to CPU/GPU; 2) “embarrassingly parallel” computation. Experimental results confirm that
the new implementation significantly outperforms the original data-driven ontology construction
algorithm under various settings. We achieved a practical 200-fold decrease in runtime when
applying it to representative yeast and human molecular datasets. Importantly, each individual
component of the pipeline, including clique detection, clique growing, clique merging, as well as
the novel data structure for dynamic sparse matrices, can be extended for use in other
applications in the biological, physical and social sciences.
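
As a small illustration of expressing clique operations as matrix arithmetic, the sketch below checks clique membership and finds clique-growing candidates using only sums over an adjacency matrix. It is not the Ultra-fast CliXO code: the function names and the toy graph are assumptions, and plain Python lists stand in for the dense or sparse matrix types a real implementation would use.

```python
def is_clique(adj, members):
    """A node set is a clique when its adjacency sub-block is fully connected.

    For k members, the sub-block of a 0/1 symmetric matrix with a zero
    diagonal must sum to k * (k - 1).
    """
    k = len(members)
    return sum(adj[i][j] for i in members for j in members) == k * (k - 1)

def extension_candidates(adj, members):
    """Nodes adjacent to every member, found via a matrix-vector product.

    counts = adj @ mask, where mask marks the current members; a node can
    grow the clique when its count equals the clique size.
    """
    n = len(adj)
    mask = [1 if i in members else 0 for i in range(n)]
    counts = [sum(adj[v][j] * mask[j] for j in range(n)) for v in range(n)]
    return [v for v in range(n) if counts[v] == len(members) and v not in members]

# Toy graph: triangle 0-1-2 plus a pendant edge 2-3 (hypothetical).
adj = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]

grow_from_edge = extension_candidates(adj, {0, 1})
grow_from_triangle = extension_candidates(adj, {0, 1, 2})
```

Every row of the matrix-vector product is independent of the others, which is why this formulation lends itself to parallel execution.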
TRD 3.3: Bridging ligand-receptor networks to cell-cell
communication networks
As previously described, we have undertaken new research and development work to infer
cell-cell interaction networks. In particular, we have extensively used single-cell RNA-seq data to
infer higher-resolution cell-cell networks and have developed applications to cancer stem cell
biology and regenerative medicine (e.g. DBP 9), both areas where cell communication is
important for tumour or normal tissue development. Our main effort in 2017
has been to refine the workflow that takes single-cell RNA-seq data and identifies cell
types, which can then be analyzed to predict intercellular interaction networks.
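
One simple way to turn per-cell-type expression into a cell-cell network is to score each ordered pair of cell types by the ligands expressed in the sender and the matching receptors expressed in the receiver. The sketch below illustrates that idea only. It is not the pipeline described in this report, and the product-based score, the threshold, and the gene and cell type names are assumptions.

```python
def cell_cell_network(mean_expr, lr_pairs, threshold=0.0):
    """Build directed sender -> receiver edges from ligand-receptor pairs.

    mean_expr: {cell type: {gene: mean expression in that cell type}}
    lr_pairs:  list of (ligand gene, receptor gene) tuples
    """
    edges = {}
    for sender, sender_expr in mean_expr.items():
        for receiver, receiver_expr in mean_expr.items():
            # Sum over pairs of ligand level in sender times receptor level in receiver.
            score = sum(sender_expr.get(ligand, 0.0) * receiver_expr.get(receptor, 0.0)
                        for ligand, receptor in lr_pairs)
            if score > threshold:
                edges[(sender, receiver)] = score
    return edges

# Toy input (hypothetical cell types, genes, and values).
mean_expr = {
    "type_A": {"LIG1": 2.0, "REC1": 0.0},
    "type_B": {"LIG1": 0.0, "REC1": 3.0},
}
lr_pairs = [("LIG1", "REC1")]

network = cell_cell_network(mean_expr, lr_pairs)
```

The resulting edges are directed, so a signal from `type_A` to `type_B` is kept distinct from one in the opposite direction.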
To support DBP 9: Engineering blood for regenerative medicine, we are continuing work to
automate our cell-cell interaction network inference pipeline.
