NetBioSIG2014-Talk by Hyunghoon Cho


NetBioSIG2014 at ISMB in Boston, MA, USA on July 11, 2014

Published in: Science
  • For instance, suppose we are observing an individual over the course of an environmental stress, such as a viral infection or a physical injury.
    In this case we expect to see some group of genes that are only temporarily co-regulated in a specific situation. For example, genes involved in immune response to injuries would be temporarily co-regulated. And this would lead to something like the blue module, which is only active in one of the networks.
    On the other hand, we also expect some housekeeping genes to be always turned on and highly co-regulated. This would lead to something that looks like a red module that is always present in the networks.

    By identifying and functionally characterizing such modules with different patterns of occurrences,
    one can start to reason about the biological processes that are affected or unaffected by the given context of interest.

    With this motivation in mind, the goal of this project was to develop an algorithm
    that takes as input multiple networks from different contexts
    and outputs the overall community structure, together with an activity pattern
    that tells us in which subset of contexts each module appears.

    So how do we go about doing this?

  • 2 mins

    First, it is important to know that there are a large number of module detection algorithms that work on a single network.
    These include link clustering, label propagation, spectral decomposition, stochastic block models, et cetera.

    However, extension of these methods to the multiple network case is not trivial.

    One naïve approach is to combine all the networks into a single representative network
    (for example, by taking the average of the adjacency matrices)
    and to run an existing module detection algorithm on it.

    Once we have a global set of modules,
    we can go back to the individual networks and check whether each identified module
    is active or not.

    While this approach is simple and easy to implement, it suffers from the limitation that
    modules that are only active in a small number of networks
    are more difficult to identify in the combined network.
    This is because the merging process dilutes the signal in the data.
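
The dilution effect can be illustrated with a minimal sketch. The toy data and the helper names (`average_networks`, `module_density`) are assumptions made for illustration and are not part of the talk: a clique that is present in only one of five networks loses most of its edge density once the adjacency matrices are averaged.

```python
import numpy as np

def average_networks(adjs):
    """Combine networks by averaging their adjacency matrices (the naive merge)."""
    return np.mean(np.stack(adjs), axis=0)

def module_density(adj, members):
    """Mean edge weight among the given nodes, ignoring the diagonal."""
    sub = adj[np.ix_(members, members)]
    n = len(members)
    return (sub.sum() - np.trace(sub)) / (n * (n - 1))

# Hypothetical toy data: 5 networks over 6 nodes; a clique on nodes 0-2
# is present only in the first network.
n_nodes, n_networks = 6, 5
members = [0, 1, 2]
adjs = [np.zeros((n_nodes, n_nodes)) for _ in range(n_networks)]
adjs[0][np.ix_(members, members)] = 1.0
np.fill_diagonal(adjs[0], 0.0)

combined = average_networks(adjs)
density_single = module_density(adjs[0], members)     # 1.0 in the one active network
density_combined = module_density(combined, members)  # approximately 0.2 after averaging
```

With more networks, or with noisy background edges, the averaged density of such a rare module moves even closer to the background level.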
    Another naïve approach would be to apply module detection algorithms
    on each network independently to learn the modules, and then try to combine the outputs by matching modules detected from different networks.
    While this approach has no problem identifying modules that are rarely active,
    when the detected boundaries of a module differ between networks
    it is not clear how to resolve such disagreements in a principled manner.
    In this talk, I present a hierarchical Bayesian model named multi-MMSB
    that avoids both of these issues.
    Our model learns a global community structure jointly from all networks,
    while allowing each module to be only present in a subset of networks,
    thereby increasing power to detect rare modules.

    In the following section, I will describe the details of multi-MMSB. Let’s first start with a simple Bayesian model that forms the basis of our model.
  • The stochastic blockmodel is a probabilistic, generative model of random graphs that originates from the social network analysis literature.

    The basic idea is that the adjacency matrix of a graph with a modular pattern has a “blocky” structure, where each block corresponds to a single module.
    So the idea is to cluster the nodes such that we see many edges within each cluster and few edges between different clusters.

    Now we can formalize this model as follows.
    First, we introduce a parameter p_m that represents the connectivity level for each module m.
    Second, p_0 represents the background connectivity between nodes of different modules, which can be thought of as the amount of noise in the data.
    Lastly, for each node i in the network, we have a latent label z_i that represents which module the node belongs to.

    Given these variables, each edge is sampled independently from a Bernoulli distribution with parameter p_m if both nodes belong to module m, and p_0 otherwise.
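
The generative process just described can be sketched in a few lines. This is a minimal illustration rather than the speaker's implementation; the function name `sample_sbm`, the undirected-graph convention, and the example parameter values are assumptions.

```python
import numpy as np

def sample_sbm(z, p, p0, rng):
    """Sample an undirected adjacency matrix from the simple blockmodel.

    z  : length-n integer array, z[i] is the module of node i
    p  : length-M array, p[m] is the edge probability within module m
    p0 : background edge probability between nodes of different modules
    """
    z = np.asarray(z)
    n = len(z)
    same = z[:, None] == z[None, :]                       # True where both nodes share a module
    prob = np.where(same, np.asarray(p)[z][:, None], p0)  # p_m inside a block, p_0 elsewhere
    upper = np.triu(rng.random((n, n)) < prob, k=1)       # one Bernoulli draw per unordered pair
    return (upper | upper.T).astype(int)                  # symmetrize; the diagonal stays zero

# Example values (assumed): two modules of 20 nodes each.
rng = np.random.default_rng(0)
z_example = np.repeat([0, 1], 20)
adj_example = sample_sbm(z_example, p=[0.9, 0.8], p0=0.05, rng=rng)
```

Sorting the nodes by `z` before plotting `adj_example` would show the block pattern along the diagonal.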

    A key limitation of the stochastic blockmodel is that each node can only be assigned to a single module. However, in many applications, modules overlap with each other. This is the motivation behind mixed-membership stochastic blockmodels, or MMSB.

  • In this version of the model, we allow each node to have a fractional membership to modules rather than a hard assignment. This is represented by the vector c_i.

    In addition, we introduce a latent label z_ij for every pair of nodes i and j to represent the conditional membership of node i with respect to node j. Intuitively speaking, this allows each node to be multi-faceted: a node can change its module membership based on the node it is interacting with.

    In this new setup, an edge is sampled with probability p_m when the conditional memberships on both sides agree with each other.
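
A minimal sampling sketch of this mixed-membership version follows. It assumes, by analogy with the simple blockmodel, that pairs whose conditional memberships disagree fall back to the background probability p_0; the function name `sample_mmsb` and the undirected convention are also assumptions.

```python
import numpy as np

def sample_mmsb(c, p, p0, rng):
    """Sample an undirected graph from a mixed-membership blockmodel sketch.

    c  : (n, M) array, row i is the fractional membership vector c_i (sums to 1)
    p  : length-M array of within-module edge probabilities p_m
    p0 : background probability, used when conditional memberships disagree (assumed)
    """
    c = np.asarray(c, dtype=float)
    p = np.asarray(p, dtype=float)
    n, n_modules = c.shape
    adj = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            z_ij = rng.choice(n_modules, p=c[i])  # membership of i when interacting with j
            z_ji = rng.choice(n_modules, p=c[j])  # membership of j when interacting with i
            prob = p[z_ij] if z_ij == z_ji else p0
            adj[i, j] = adj[j, i] = int(rng.random() < prob)
    return adj
```

When every row of `c` is one-hot, the conditional memberships are deterministic and the sketch reduces to the simple blockmodel.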

    While this model has been shown to be effective in a variety of settings, by design it only works on a single network.
  • In order to extend this to the multiple network case we first duplicate the latent variables z_ij across the networks while keeping only a single copy of c_i so that the fractional membership of each node remains identical in every network.

    Furthermore, we introduce another layer of latent variables denoted as d_km, which represents context-specific activity of module m in network k.

    Now, when we sample the edges using p_m, in addition to checking whether the conditional memberships match, we also check whether the module is active in the given network.
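
The multi-network extension can be sketched in the same way. The sketch below treats d_km as a binary indicator and uses the background probability p_0 whenever the memberships disagree or the module is inactive; both choices, and the name `sample_multi_mmsb`, are assumptions made for illustration.

```python
import numpy as np

def sample_multi_mmsb(c, d, p, p0, rng):
    """Sample K undirected networks that share one set of memberships.

    c  : (n, M) fractional memberships c_i, shared by all networks
    d  : (K, M) binary activity, d[k, m] = 1 if module m is active in network k (assumed binary)
    p  : length-M within-module edge probabilities p_m
    p0 : background probability (assumed fallback)
    """
    c = np.asarray(c, dtype=float)
    d = np.asarray(d, dtype=bool)
    p = np.asarray(p, dtype=float)
    n, n_modules = c.shape
    n_networks = d.shape[0]
    adjs = np.zeros((n_networks, n, n), dtype=int)
    for k in range(n_networks):
        for i in range(n):
            for j in range(i + 1, n):
                # Conditional memberships are redrawn for every network.
                z_ij = rng.choice(n_modules, p=c[i])
                z_ji = rng.choice(n_modules, p=c[j])
                # Use p_m only if both sides agree AND the module is active here.
                active_match = z_ij == z_ji and d[k, z_ij]
                prob = p[z_ij] if active_match else p0
                adjs[k, i, j] = adjs[k, j, i] = int(rng.random() < prob)
    return adjs
```

Setting a row of `d` to all ones recovers the single-network mixed-membership sketch for that network.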

    Note that, in practice, we are only given the edges and none of the latent variables.
  • A standard approach to learning a Bayesian model is to optimize the likelihood function given the observed data.

    In this case, since there are variables that are not observed, we want to optimize what’s called the marginal likelihood, which is the complete likelihood with the latent variables integrated out.

    The expectation-maximization algorithm can be used to optimize this objective. But because the posterior distribution over the latent variables is intractable in this case, we need to use variational EM, which makes the simplifying assumption that the latent variables are independent of each other in the approximate posterior.
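
In generic notation, the objective and its variational lower bound can be written as below. The symbols are assumptions introduced here, not taken from the slides: E stands for the observed edges, Z for all latent variables, θ for the model parameters, and q for the factorized approximate posterior. The integral is shorthand and becomes a sum over the discrete latent variables.

```latex
\begin{align}
\log p(E \mid \theta)
  &= \log \int p(E, Z \mid \theta)\, dZ \\
  &\ge \mathbb{E}_{q(Z)}\!\left[\log p(E, Z \mid \theta)\right]
     - \mathbb{E}_{q(Z)}\!\left[\log q(Z)\right],
  \qquad q(Z) = \prod_{v} q_v(Z_v)
\end{align}
```

Variational EM alternates between tightening this bound with respect to q and maximizing it with respect to θ.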

    At the end of this training procedure, what we get is the optimal set of model parameters and our belief over the latent variables, from which we can extract the community structure learned by the model.

    Because this approach is susceptible to local optima, we typically learn the model several times for a given setting and select the run with the highest objective for further analysis.
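
The restart strategy can be sketched generically. Here `fit_once` is a hypothetical callable standing in for one run of the training procedure; it is assumed to take a random generator and return an (objective, model) pair.

```python
import numpy as np

def best_of_restarts(fit_once, n_restarts, seed=0):
    """Run fit_once from several random initializations and keep the best run.

    fit_once(rng) is a hypothetical callable returning (objective, model),
    where a higher objective is better.
    """
    # Independent random streams, one per restart.
    seeds = np.random.SeedSequence(seed).spawn(n_restarts)
    best_obj, best_model = -np.inf, None
    for s in seeds:
        obj, model = fit_once(np.random.default_rng(s))
        if obj > best_obj:
            best_obj, best_model = obj, model
    return best_obj, best_model
```

Spawning child seeds keeps the restarts reproducible while giving each run its own initialization.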
  • Once we learn the model, we need a way to measure the accuracy, assuming the ground truth is available, which is the case in simulated data.

    To quantify the similarity between two community structures, we use a metric called normalized mutual information (NMI), which was developed in the context of network covers by Esquivel and Rosvall in 2012.

    I won’t go into too much detail here, but the basic intuition is as follows. First, we randomly generate a sequence of structural queries.
    [Give example]
    Then we send these queries through both the learned and the true community structures to get two sets of answers. Calculating mutual information between these two answer sets gives us our similarity score.

    In the limiting case, if the two structures are exactly the same then the answers we get would be identical in all cases and this leads to an NMI of 1.

    Note that this procedure does not require us to know the mapping between the modules of the two structures, because mutual information doesn’t change even if we relabel the modules on either side.
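
The relabeling invariance can be checked with a small sketch. Note the simplification: this is the standard NMI for non-overlapping partitions with arithmetic-mean normalization, not the query-based metric for overlapping covers used in the talk, and the function names are assumptions.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in nats) of a hard labeling."""
    _, counts = np.unique(np.asarray(labels).ravel(), return_counts=True)
    q = counts / counts.sum()
    return float(-np.sum(q * np.log(q)))

def mutual_information(a, b):
    """Mutual information (in nats) between two hard labelings of the same nodes."""
    _, a_idx = np.unique(np.asarray(a).ravel(), return_inverse=True)
    _, b_idx = np.unique(np.asarray(b).ravel(), return_inverse=True)
    a_idx, b_idx = a_idx.ravel(), b_idx.ravel()
    joint = np.zeros((a_idx.max() + 1, b_idx.max() + 1))
    np.add.at(joint, (a_idx, b_idx), 1.0)   # contingency table of label pairs
    joint /= joint.sum()
    pa = joint.sum(axis=1, keepdims=True)   # marginal over labels in a
    pb = joint.sum(axis=0, keepdims=True)   # marginal over labels in b
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / (pa @ pb)[nz])))

def nmi(a, b):
    """Normalized mutual information; defined as 1.0 when both labelings are trivial."""
    denom = 0.5 * (entropy(a) + entropy(b))
    return mutual_information(a, b) / denom if denom > 0 else 1.0

true_labels = [0, 0, 0, 1, 1, 1]
relabeled = [5, 5, 5, 2, 2, 2]   # same grouping, different module names
score = nmi(true_labels, relabeled)
```

Because only the contingency table of label pairs enters the computation, renaming the modules on either side leaves `score` unchanged.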

    Now we’ve established everything about the model. In the following section I will present some results on synthetic data.
  • NetBioSIG2014-Talk by Hyunghoon Cho

    1. Identifying context-dependent community structure across multiple networks. Hyunghoon Cho, Gerald Quon, Bonnie Berger, Manolis Kellis. MIT CSAIL. ISMB Network Biology SIG, July 11th, 2014
    2. Modules / communities. Cellular functions are carried out by groups of biomolecules (e.g., proteins, RNA) acting in a coordinated fashion. Problem: how does this structure change under a different condition?
    3. Detecting changes in modules [Figure: modules 1-3 across contexts 1, 2, ..., K]
    4. Approaches to module detection • Many algorithms for detecting modules in a single network – Link clustering [Shi et al. 2013], label propagation [Gregory 2010], tensor decomposition [Anandkumar et al. 2013], mixed-membership stochastic blockmodels [Airoldi et al. 2008], etc. • Not obvious how to extend to the multiple network case – Combine networks, then detect modules: likely to miss rare modules – Detect modules, then combine results: inconsistent module definition – Multi-MMSB: jointly learns modules from all networks, allows each to be present in only a subset of networks
    5. Model description: SB • Note: each node belongs to a single module [Figure: adjacency matrix]
    6. Model description: MMSB [Airoldi et al., 2008]
    7. Model description: Multi-MMSB
    8. Learning the model • Goal: optimize model likelihood • Expectation-Maximization algorithm to deal with latent variables • Need variational approximation • Random restarts to alleviate local optima issue
    9. Performance metric • Normalized mutual information (NMI) [Esquivel and Rosvall, 2012] [Figure: a sequence of structural queries is sent through the learned community structure and the true community structure; mutual information is calculated between the two sets of answers]
    10. Synthetic data: results [Figure axis: normalized mutual information]
    11. Synthetic data: results
    12. Synthetic data: results
    13. Synthetic data: results
    14. Asthma data (GSE19301). Microarray profiling of peripheral blood mononuclear cells from asthma patients at 3 different stages: • quiet: 394 samples • exacerbation: 125 samples • follow-up (2 weeks after exacerbation): 166 samples [Bjornsdottir et al., 2011]
    15. Asthma data: results
    16. RNA decay data (GSE37451). Microarray profiling of 70 lymphoblastoid cell lines at 5 different timepoints after transcription arrest: • 0 hr (before transcription arrest) • 0.5 hr • 1 hr • 2 hr • 4 hr
    17. RNA decay data: results
    18. Summary • We developed Multi-MMSB, a flexible way of learning community structure over multiple networks • Multi-MMSB outperformed naive methods on synthetic data • When applied to real data, Multi-MMSB identified context-specific modules that are biologically plausible
    19. Future directions • Extending the model: – Directed networks – Weighted edges • Application to other types of biological networks: – Regulatory networks – PPI
    20. Acknowledgements • Gerald Quon • Prof. Bonnie Berger • Prof. Manolis Kellis