PMED Undergraduate Workshop - Communities & Classification in Disease Data - Peter Mucha, October 22, 2018

This talk combines two stories about the analysis of data associated with diseases. In the first, we introduce community detection in networks and use network representations of genetic virulence factor similarities between different uropathogenic E. coli strains to identify communities of these strains that are more similar to each other than to the rest of the studied population. We then discuss the clinical differences between these E. coli communities. In the second story, we investigate metabolomic data obtained from stool samples of hospitalized patients. We employ a variety of methods for handling these sparse data to generate a new classifier for the presence of C. difficile in the samples. Working closely with our clinical collaborators, we then obtain a wholly new and surprisingly simple and accurate measurement for detecting the presence of active C. difficile infections.

Published in: Education

  1. 1. Communities & Classification in Disease Data Peter J. Mucha, University of North Carolina at Chapel Hill Acknowledgements: Jeff Henderson, Jim Moody; Kaveri Chaturvedi; William Weir, James Wilson; CDC, JSMF, NIDDK
  2. 2. Binary Biclustering of Data & Community Detection in Networks
     • Binary Biclustering: blocks (rows & columns) in rectangular array that have statistically significantly higher numbers of 1’s
     • Community Detection: working from a network representation of one of the two data types, identify groups that are more highly connected to each other than to the rest of the network
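A minimal sketch of what "statistically significantly higher numbers of 1's" could mean for one candidate block. It assumes a deliberately simple null model that the slides do not specify: every cell is 1 independently with the overall density of the array. The function name `block_enrichment` and the toy matrix are hypothetical, and the p-value ignores the fact that a real biclustering method searches over many blocks.

```python
from math import comb

def block_enrichment(matrix, rows, cols):
    """Return (block density, overall density, binomial upper-tail p-value)."""
    total_cells = sum(len(row) for row in matrix)
    total_ones = sum(sum(row) for row in matrix)
    p = total_ones / total_cells          # background density of 1's (assumed null)
    n = len(rows) * len(cols)             # number of cells in the candidate block
    k = sum(matrix[i][j] for i in rows for j in cols)  # 1's observed in the block
    # Probability of seeing at least k ones in n independent cells of density p.
    p_value = sum(comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(k, n + 1))
    return k / n, p, p_value

# Toy binary array (made up for illustration, not data from the talk).
toy = [
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0],
]
density, background, p_value = block_enrichment(toy, rows=[0, 1, 2], cols=[0, 1])
print(density, background, p_value)
```

A block whose density is well above the background with a small tail probability is a candidate bicluster; a more careful null would also condition on row and column sums.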
  3. 3. Motivating Data Example
  4. 4. Philosophical Disclaimer
     • Jim Moody (paraphrased): “I’ve been accused of turning everything into a network.”
     • PJM (in response): “I’m accused of turning everything into a network and a graph partitioning problem.”
     • “Structure ↔ Function”
     How to extend the notion of modularity in networks to multiple networks between the same actors/units, i.e. how to properly use identity in modularity?
  5. 5. Karate Club Example
     This partition optimizes modularity, which measures the number of intra-community ties (relative to randomness).
     “If your method doesn’t work on this network, then go home.”
  6. 6. Karate Club Example Brought to you by Mason Porter and The Power Law Shop https://www.zazzle.com/the_power_law_shop “If your method doesn’t work on this network, then go home.”
  7. 7. “Cris Moore (left) is the inaugural recipient of the Zachary Karate Club Club prize, awarded on behalf of the community by Aric Hagberg (right). (9 May 2013)”
  8. 8. Community Detection: Null Model & Computational Heuristics
     • GOAL: Assign nodes to communities in order to maximize quality function Q
     • NP-Complete [Brandes et al. 2008] ~ enumerate possible partitions
     • Numerous packages developed/developing
       – e.g. igraph library (R, Python), NetworkX
       – Need appropriate null model
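To make "enumerate possible partitions" concrete, here is a small sketch that scores every partition of a six-node toy graph and keeps the one with the largest Q. It assumes Q is Newman-Girvan modularity with resolution parameter `gamma`; the graph (two triangles joined by one edge) and the helper names `modularity` and `partitions` are hypothetical. Exhaustive search is only feasible for tiny graphs, which is why the packages named on the slide rely on heuristics.

```python
def modularity(edges, labels, gamma=1.0):
    """Modularity of a partition of an undirected, unweighted graph."""
    m = len(edges)
    internal = {}     # edges inside each community
    degree_sum = {}   # total degree of each community
    for u, v in edges:
        for node in (u, v):
            degree_sum[labels[node]] = degree_sum.get(labels[node], 0) + 1
        if labels[u] == labels[v]:
            internal[labels[u]] = internal.get(labels[u], 0) + 1
    return sum(
        internal.get(c, 0) / m - gamma * (d / (2 * m)) ** 2
        for c, d in degree_sum.items()
    )

def partitions(n):
    """Yield each partition of n nodes exactly once, as a list of labels."""
    def extend(labels, used):
        if len(labels) == n:
            yield list(labels)
            return
        for c in range(used + 1):   # reuse an existing label or open a new one
            labels.append(c)
            yield from extend(labels, max(used, c + 1))
            labels.pop()
    yield from extend([], 0)

# Toy graph: triangle 0-1-2 and triangle 3-4-5 joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
best_labels = max(partitions(6), key=lambda labels: modularity(edges, labels))
best_q = modularity(edges, best_labels)
print(best_labels, best_q)
```

Six nodes already give 203 partitions, and the count grows faster than exponentially with the number of nodes.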
  9. 9. Maximizing Modularity (Newman & Girvan, PRE 2004; Newman, PRE 2004, PNAS 2006, PRE 2006)
     • Independent edges, constrained to expected degree sequence same as observed.
     • Requires Pij = f(ki)f(kj), quickly yielding Pij = ki kj / (2m), where m is the total number of edges
     • γ resolution parameter ad hoc (default = 1) (Reichardt & Bornholdt, PRE 2006; Lambiotte et al., arXiv 2008)
     • Resolution limit (Fortunato & Barthelemy, PNAS 2007)
     • Degenerate landscape (Good, de Montjoye & Clauset, PRE 2010)
     • Forces partition (many authors!)
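The step from the factorized null model to the modularity quality function can be written out as follows. This is a sketch in the standard notation, which the transcript itself does not preserve: $A_{ij}$ is the adjacency matrix, $k_i$ the degree of node $i$, $m$ the number of edges, $c_i$ the community of node $i$, and $\gamma$ the resolution parameter.

```latex
\begin{align}
\sum_j P_{ij} &= f(k_i)\sum_j f(k_j) = k_i
  \;\Rightarrow\; f(k) = C\,k, \\
C^2 \sum_j k_j &= C^2\,(2m) = 1
  \;\Rightarrow\; P_{ij} = \frac{k_i k_j}{2m}, \\
Q &= \frac{1}{2m}\sum_{ij}\left(A_{ij} - \gamma\,\frac{k_i k_j}{2m}\right)\delta(c_i, c_j).
\end{align}
```

The first line imposes that each node's expected degree under the null model equals its observed degree, the second fixes the constant $C$, and the third compares observed edges with the null expectation inside communities.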
  10. 10. Motivating Data Example
  11. 11. Roll call as a network? Roll Call Similarities; Scientific Coauthorship
  12. 12. Polarization in Roll Call Networks
  13. 13. Congressional Roll Call (Waugh, Pei, Fowler, PJM & MAP [arXiv 2009-2011])
  14. 14. (Moody & PJM, Network Science 2013)
  15. 15. Binary Biclustering of Data & Community Detection in Networks
      • Binary Biclustering: blocks (rows & columns) in rectangular array that have statistically significantly higher numbers of 1’s
      • Community Detection: working from a network representation of one of the two data types, identify groups that are more highly connected to each other than to the rest of the network
  16. 16. 16 VFs by 337 E. coli isolates: Binary data & network representations
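One way to move from a binary presence/absence array to a network of isolates is to connect isolates whose virulence factor profiles are similar. The sketch below assumes Jaccard similarity and a threshold of 0.5; the slides do not say which similarity measure or cutoff was used, and the isolate names and profiles are made up for illustration.

```python
def jaccard(a, b):
    """Shared 1's divided by positions where either profile has a 1."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0

def similarity_network(profiles, threshold=0.5):
    """Return {(isolate_u, isolate_v): weight} for pairs at or above the threshold."""
    names = sorted(profiles)
    edges = {}
    for i, u in enumerate(names):
        for v in names[i + 1:]:
            weight = jaccard(profiles[u], profiles[v])
            if weight >= threshold:
                edges[(u, v)] = weight
    return edges

# Hypothetical isolates with four hypothetical virulence factors each.
toy_profiles = {
    "isolate_A": [1, 1, 1, 0],
    "isolate_B": [1, 1, 0, 0],
    "isolate_C": [0, 0, 1, 1],
    "isolate_D": [0, 0, 0, 1],
}
network = similarity_network(toy_profiles)
print(network)
```

The resulting weighted edge list can then be handed to any community detection routine; the same construction applied to the transposed array would give a network of virulence factors instead.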
  17. 17. Network analysis reveals E. coli sort into four groups by VF similarities
      (values in percent; the slide's color scale runs from 100% to 0%)

      Gene    Community 1   Community 2   Community 3   Community 4
              n=118         n=98          n=76          n=45
      chuA     98            49           100            98
      fyuA    100            44            99            98
      ompT    100            36           100           100
      tspE     94            51            80            89
      yjaH     94            29            99            87
      usp      92             6            99           100
      capII    54            18            76            80
      iucD     97            24            24             9
      iha      86            13            18             0
      sat      86             6            33             2
      prf       5             7            75            33
      iroN      4            13            72             9
      hlyA      2             0            91             0
      sfa       1             0            80             0
      cnf1      0             0            57             0
      Dr        9             5             0            13
  18. 18. E. coli sort into four groups by VF similarities
  19. 19. VF similarities closely related to siderophores
  20. 20. Sex Differences Between VF-similarity Communities
  21. 21. Antibiotic resistance differs between VF-similarity communities
      Resistance (red) to trimethoprim/sulfamethoxazole/Bactrim and to ciprofloxacin, by community placement:
  22. 22. “Unsupervised” v. “Supervised”
  23. 23. Summary
      • There are many methods for doing unsupervised clustering and supervised classification that are useful in practice.
      • Don’t tie yourself down to one method: good clusters and classes should be robust, and hopefully your story shouldn’t depend on the precise method. If it does, you should try to understand why.
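One simple way to check whether a story depends on the precise method is to measure how much two partitions of the same items agree. The sketch below uses the plain Rand index (the fraction of item pairs that both partitions treat the same way); this choice of measure is an assumption for illustration, not something the talk prescribes, and the two label lists are hypothetical.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of item pairs grouped consistently by both partitions."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Hypothetical community labels for six items from two different methods.
method_1 = [0, 0, 0, 1, 1, 1]
method_2 = [1, 1, 1, 0, 0, 2]   # label names differ; only the grouping matters
score = rand_index(method_1, method_2)
print(score)
```

A score near 1 suggests the two methods tell the same story; a low score is a prompt to understand why they differ. Chance-corrected variants such as the adjusted Rand index are usually preferable when the numbers of groups differ.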
