1. Urszula Czerwinska, Laurence Calzone, Emmanuel Barillot and Andrei Zinovyev
Atelier Grands Graphes et Bioinformatique - EGC 2016
DEDAL CYTOSCAPE 3 APP
FOR PRODUCING AND MORPHING
DATA-DRIVEN AND STRUCTURE-DRIVEN
NETWORK LAYOUTS
U900 Computational Systems Biology for Cancer
2. OUTLINE
FROM EXTRACTION OF KNOWLEDGE
TO INTELLIGENT LAYOUT
COMBINING MULTIDIMENSIONAL DATA
AND NETWORK STRUCTURE
SUMMARY
DEDAL CYTOSCAPE 3.0 APP.
4. EXTRACTION OF KNOWLEDGE
IN BIOLOGY
molecular biology interactions -> networks
Knowledge extraction: the creation of knowledge from
structured and unstructured sources;
the resulting knowledge needs to be in a
machine-readable format;
9. THERE IS NOT A SINGLE LAYOUT
1) mapping data on top of a pre-defined
biological network layout
2) identifying subnetworks from a global
network possessing certain properties
computed from the data
3) using biological network structure for
pre-processing the high-throughput data
10. T-test statistics
1. MAPPING DATA ON TOP OF THE NETWORK
Moldovan and D’Andrea 2009
Peri et al., 2004
TCGA, 2012
18. PMS
Pure network structure based
layout
Purely Data Driven Layout
PCA or Elmap
Combination of network
structure and data layout
MORPHING DATA-DRIVEN AND STRUCTURE-BASED LAYOUT
23. ADVANCED FEATURES OF DEDAL
Pre-processing of the data:
o Smoothing
o Double centering
o Quality check
Post-processing of the layout:
o Alignment
o Overlapping
o Missing values
o Outliers
Data-driven layout: PCA or nPMs
Morphing
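The pre-processing, data-driven layout and morphing steps listed above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not DeDaL's actual code: the function names (`double_center`, `pca_layout`, `morph`) are hypothetical, the layout uses plain linear PCA rather than a nonlinear principal manifold, and morphing is assumed to be a linear interpolation of node coordinates controlled by a parameter `t`. Aligning the two layouts before morphing is omitted.

```python
import numpy as np

def double_center(X):
    """Subtract row means and column means, then add back the grand mean."""
    X = np.asarray(X, dtype=float)
    row_means = X.mean(axis=1, keepdims=True)
    col_means = X.mean(axis=0, keepdims=True)
    return X - row_means - col_means + X.mean()

def pca_layout(X):
    """Project each node (one row of X) onto the first two principal components."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0, keepdims=True)        # center every feature
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                          # (n_nodes, 2) coordinates

def morph(structure_xy, data_xy, t):
    """Assumed morphing rule: t=0 gives the structure-based layout, t=1 the data-driven one."""
    structure_xy = np.asarray(structure_xy, dtype=float)
    data_xy = np.asarray(data_xy, dtype=float)
    return (1.0 - t) * structure_xy + t * data_xy

# Toy usage: 6 nodes measured in 4 samples (random numbers, for illustration only).
rng = np.random.default_rng(0)
data = rng.normal(size=(6, 4))
data_xy = pca_layout(double_center(data))
structure_xy = rng.normal(size=(6, 2))            # stand-in for a structure-based layout
halfway_xy = morph(structure_xy, data_xy, 0.5)
```

Intermediate values of `t` give layouts that keep part of the network structure while pulling nodes with similar data profiles together.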
24. NONLINEAR PRINCIPAL MANIFOLDS
Elastic map algorithms
Principal manifolds approximation
Linear PCA vs nonlinear Principal Manifolds for
visualisation of breast cancer microarray data
a) 3D PCA linear manifold.
b) ELMap2D
c) PCA2D
Gorban A.N., Zinovyev A. 2010
32. There is a need to combine networks
and -omics data in biology
Network layout should be adapted
to the analysis
DeDaL – Cytoscape App. performs
o different types of data-driven layouts
o morphing between structure-based and
data-driven layout
o pre-processing of data such as double
centering and network smoothing
SUMMARY
Knowledge extraction is the creation of knowledge from structured (relational databases, XML) and unstructured (text, documents, images) sources. The resulting knowledge needs to be in a machine-readable and machine-interpretable format and must represent knowledge in a manner that facilitates inferencing. Although it is methodically similar to information extraction (NLP) and ETL (data warehouse), the main criterion is that the extraction result goes beyond the creation of structured information or the transformation into a relational schema. It requires either the reuse of existing formal knowledge (reusing identifiers or ontologies) or the generation of a schema based on the source data.
Force-directed graph drawing algorithms are a class of algorithms for drawing graphs in an aesthetically pleasing way. Their purpose is to position the nodes of a graph in two-dimensional or three-dimensional space so that all the edges are of more or less equal length and there are as few crossing edges as possible, by assigning forces among the set of edges and the set of nodes, based on their relative positions, and then using these forces either to simulate the motion of the edges and nodes or to minimize their energy.[1]
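The force model described above can be sketched in a few lines of pure Python. This is a simplified illustration in the style of the Fruchterman-Reingold scheme (repulsion between all node pairs, attraction along edges, step size capped by a cooling temperature). It is not the yFiles or Cytoscape implementation, and the function name, parameter names and constants are assumptions chosen for the example.

```python
import math
import random

def force_directed_layout(nodes, edges, iterations=200, seed=0):
    """Spring-style layout; nodes is a list, edges a list of (u, v) pairs."""
    rng = random.Random(seed)
    pos = {n: [rng.random(), rng.random()] for n in nodes}  # random start in the unit square
    k = math.sqrt(1.0 / len(nodes))   # ideal edge length (assumed constant)
    temperature = 0.1                 # maximum displacement per iteration
    for _ in range(iterations):
        disp = {n: [0.0, 0.0] for n in nodes}
        # Repulsive force k^2 / d between every pair of nodes.
        for i, u in enumerate(nodes):
            for v in nodes[i + 1:]:
                dx = pos[u][0] - pos[v][0]
                dy = pos[u][1] - pos[v][1]
                dist = max(math.hypot(dx, dy), 1e-9)
                force = k * k / dist
                disp[u][0] += dx / dist * force
                disp[u][1] += dy / dist * force
                disp[v][0] -= dx / dist * force
                disp[v][1] -= dy / dist * force
        # Attractive force d^2 / k along every edge.
        for u, v in edges:
            dx = pos[u][0] - pos[v][0]
            dy = pos[u][1] - pos[v][1]
            dist = max(math.hypot(dx, dy), 1e-9)
            force = dist * dist / k
            disp[u][0] -= dx / dist * force
            disp[u][1] -= dy / dist * force
            disp[v][0] += dx / dist * force
            disp[v][1] += dy / dist * force
        # Move each node along its net force, capped by the temperature.
        for n in nodes:
            dx, dy = disp[n]
            length = max(math.hypot(dx, dy), 1e-9)
            step = min(length, temperature)
            pos[n][0] += dx / length * step
            pos[n][1] += dy / length * step
        temperature *= 0.98           # cool down so the layout settles
    return pos

# Toy usage: a four-node path.
layout = force_directed_layout(["a", "b", "c", "d"],
                               [("a", "b"), ("b", "c"), ("c", "d")])
```

Note that only the network structure enters this computation; any measured data plays no role in where the nodes end up.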
The yFiles Organic layout (ORL) is a proprietary closed-source implementation of the force-directed placement paradigm, which combines elements from several layout algorithms to facilitate identification of clusters of tightly connected network modules.
integrating network structure and high-throughput data
The existing layouts are based on network structure
Quantitative data has so far been visualized with node coloring
Toy input example. A toy example of an input problem with two distinct JACSs and with front and back nodes. Both JACSs (circled) are connected in the interaction network and heavy in the similarity graph. Note that the four front nodes in the left JACS form a connected subgraph only after the addition of the back node.
(a) Flowchart of the approach. (b) Example illustrating smoothing of patient somatic mutation profiles over a molecular interaction network. Mutated genes are shown in yellow (patient 1) and blue (patient 2) in the context of a gene interaction network. Following smoothing, the mutational activity of a gene is a continuous value reflected in the intensity of yellow or blue; genes with high scores in both patients appear in green (dashed oval). (c) Clustering mutation profiles using non-negative matrix factorization (NMF) regularized by a network. The input data matrix (F) is decomposed into the product of two matrices: one of subtype prototypes (W) and the other of assignments of each mutation profile to the prototypes (H). The decomposition attempts to minimize the objective function shown, which includes a network influence constraint L on the subtype prototypes. k, predefined number of subtypes. (d) The final tumor subtypes are obtained from the consensus (majority) assignments of each tumor after 1,000 applications of the procedures in b and c to samples of the original data set. A darker blue color in the matrix coincides with higher co-clustering for pairs of patients.
548 patients
Graph Laplacian and dosage balance in interaction networks
Assumption: Interacting molecular entities
A and B should be balanced in their
concentrations (local amounts in space and
time)
Examples from molecular biology:
1) A and B form a complex, A or B alone is toxic
2) A and B form functional complex, production
of A or B is expensive
3) A is a scaffold for B and C, complex of
A:B:C performs a function
4) A regulates B (catalyzes, titrates, ..)
5) A and B compete for a resource,
this competition is decisive for a cell fate
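One way to make the balance assumption quantitative, offered here as an illustrative formalization rather than something stated on the slide, is the Laplacian quadratic form. The notation is introduced for this sketch only: $A$ is the adjacency matrix of an unweighted interaction network with edge set $E$, $D$ is the diagonal degree matrix, and $x$ is a vector with one (suitably normalized) concentration value per node.

```latex
L = D - A, \qquad x^{\top} L \, x \;=\; \sum_{(i,j) \in E} (x_i - x_j)^2
```

Under the simplifying reading that "balanced" means similar normalized levels, the form is small exactly when interacting entities have similar values, and each unbalanced interacting pair adds a positive term.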
The approach is formally based on the spectral decomposition of the gene expression measurements with respect to the gene network seen as a graph, followed by an attenuation of the high-frequency components of the expression vectors with respect to the topology of the graph.
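The spectral smoothing described above can be sketched with NumPy as follows. The sketch assumes an unweighted, undirected network given as an adjacency matrix, uses the combinatorial Laplacian, and attenuates each graph frequency with an exponential factor `exp(-beta * eigenvalue)`. The filter shape, the parameter `beta` and the function name are assumptions made for illustration and may differ from what DeDaL implements.

```python
import numpy as np

def smooth_on_network(adjacency, expression, beta=1.0):
    """Attenuate high-frequency components of expression w.r.t. the graph.

    adjacency  : (n_genes, n_genes) symmetric 0/1 matrix
    expression : (n_genes,) or (n_genes, n_samples) array
    beta       : attenuation strength; 0 leaves the data unchanged
    """
    A = np.asarray(adjacency, dtype=float)
    X = np.asarray(expression, dtype=float)
    was_1d = X.ndim == 1
    X = X.reshape(A.shape[0], -1)

    L = np.diag(A.sum(axis=1)) - A            # graph Laplacian L = D - A
    eigvals, eigvecs = np.linalg.eigh(L)      # eigenvalues act as graph frequencies
    coeffs = eigvecs.T @ X                    # spectral decomposition of the data
    coeffs *= np.exp(-beta * eigvals)[:, None]  # damp high frequencies more strongly
    smoothed = eigvecs @ coeffs               # back to gene space
    return smoothed.ravel() if was_1d else smoothed

# Toy usage: a four-gene path network with an alternating expression pattern.
path = np.array([[0, 1, 0, 0],
                 [1, 0, 1, 0],
                 [0, 1, 0, 1],
                 [0, 0, 1, 0]], dtype=float)
profile = np.array([1.0, -1.0, 1.0, -1.0])
smoothed_profile = smooth_on_network(path, profile, beta=1.0)
```

A profile that is constant over a connected network lies entirely in the zero-frequency component, so it passes through unchanged, while patterns that flip sign between neighbouring genes are damped the most.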
Using DeDaL for visualizing the network and RNA-Seq expression data of tissue-specific genes. An RNA-Seq dataset for 27 healthy human tissues was used to define a subnetwork of the HPRD PPI database enriched in tissue-specific genes (see the text for explanations). Network smoothing followed by computation of a principal manifold was applied to produce the data-driven network layout (DDL). Patterns of gene expression for two selected tissues (brain and spleen) are shown on top of the constructed DDL; red denotes higher expression and green denotes lower expression. The sizes of the nodes are proportional to their connectivity degree in this network. In the top-left panel, the Force Directed layout is shown for comparison. In the bottom-left panel, the results of a quantitative comparison between the multidimensional distance representation in DeDaL and in the Force Directed layout are shown. The most representative distances between the genes in the initial multidimensional space (see [28] for details) are ranked here from the largest to the smallest values.