MongoDB and the Connectivity Map: Making Connections Between Genetics and Disease


The Broad Institute has developed a novel high-throughput gene-expression profiling technology and has used it to build an open-source catalog of over a million profiles that captures the functional states of cells treated with drugs and other types of perturbations. Referred to as the Connectivity Map (or CMap), these data, when paired with pattern-matching algorithms, facilitate the discovery of connections between drugs, genes and diseases. We wished to expose this resource to scientists around the world via an API that is easily accessible to programmers and biologists alike. We required a database solution that could handle a variety of data types and frequent changes to the schema. We realized that a relational database did not fit our needs, and gravitated towards MongoDB for its ease of use, its support for dynamic schema and complex data structures, and its expressive query syntax. In this talk, we’ll walk through how we built the CMap library. We’ll discuss why we chose MongoDB, the schema design iterations and tradeoffs we’ve made, how people are using the API, and what we’re planning for the next generation of biomedical data.

Published in: Technology

  1. MongoDB and the Connectivity Map: making connections between genetics and disease
  2-5. (no text)
  6. Corey Rajiv
  7. Gene Expression: a common language
  8-12. (no text)
  13. ~7,000 experiments. Over 19,000 registered users. Cited by over 1,200 scientific reports.
  14. 2006
  15. 2014
  16. (no text)
  17. CMap-LINCS dataset: 1.4 million gene expression profiles
      3,800 Genes (shRNA & cDNA)
        • Targets/pathways of approved drugs
        • Candidate disease genes
        • Community nominations
      15 Cell types
        • Banked primary cell types
        • Cancer cell lines
        • Primary hTERT-immortalized
        • Patient-derived iPS cells
        • Community nominated
      12,488 Compounds
        • FDA approved drugs
        • Bioactive tool compounds
        • Screening hits
  18. CMap Data: Easy to describe, tough to model
        • Diverse use-cases
        • Users with varying technical expertise
        • Annotations are complex and incomplete
        • Frequent updates
  19. Data Model: An agile philosophy keeps the model tractable
        • Store just what’s needed
        • Refactor frequently
        • Test and use daily
  20. Data Model: An inventory of signatures (siginfo)
  21. Data Model: Shared fields as separate collections (siginfo, cellinfo, pertinfo)
  22. Data Model: Add computed fields and external meta-data (siginfo, cellinfo)
  23. Data Model: Duplicate data to optimize lookups (siginfo, pertinfo)
  24. APIs: Are awesome, we need more of them
      Picked functionality over convention: /siginfo?q={"cell":"A"} vs /siginfo/cell/A
  25. API: MongoDB inspired a rich query syntax
      Function          Example
      Query             /siginfo?q={"cell":"A","name":"B"}
      Field selection   /siginfo?q={}&f={"name":1}
      Document count    /siginfo?q={}&c=true
      Document limit    /siginfo?q={}&l=10
      Skip documents    /siginfo?q={}&l=10&sk=10
      Sort order        /siginfo?q={}&s={"name":-1,"cell":1}
      Distinct values   /siginfo?q={}&d=name
      Aggregation       /siginfo?q={}&g=name
  26. API: Node and Mongoose enable easy API creation
  27. Language Bindings: JSON as a universal format (JavaScript, Python, R)
  28. Analytic Tools: A compute API liberates command line scripts
  29. Compute API: Messaging handled via a capped collection
  30. Input Validation: JSON Schema simplifies validation
  31. Numeric Matrix Data: HDF5 offers efficient storage for large matrices
      GCTX: a binary format based on HDF5
        • Cross platform
        • Multi-language support
        • Efficient I/O
        • Storage size for 30 billion data points is 110 Gb
  32. Sign up at Lincscloud: A platform for easy access to perturbational data. Free for academic use.
  33. Predicting Drug Function: Diverse structures, common activities
  34. Predicting Drug Function: Diverse structures, common activities
        • VEGFR inhibitor
        • PPARG agonist
        • PI3K/MTOR inhibitor
        • ROCK inhibitor
        • Estrogen agonist
  35. Finding Novel Drug Targets: Repurposing failed drugs
      Original target
  36. Finding Novel Drug Targets: Repurposing failed drugs
      Original target: failed in Phase 2 clinical trial due to lack of efficacy
  37. Finding Novel Drug Targets: Repurposing failed drugs
      Original target, Novel Target A, Novel Target B, Novel Target C, Novel Target D
  38. Acknowledgements
      Todd Golub
      Core Team: Analysis & Software
        Arvind Subramanian, Jacob Asiedu, Larson Hogstrom, Ian Smith, David Lahr, Aravind Subramanian, Josh Gould, Ted Natoli, David Wadden
      Core Team: Lab
        John Davis, David Peck, Xiaodong Lu, Melanie Donahue, Daniel Lam, Jackie Rosains (Project Manager)
      Collaborators
        Bang Wong, Steven Corsello (Golub lab), Jake Jaffe (Proteomics), David Takeda (Hahn lab), Pablo Tamayo
      Chemistry & Therapeutics
        Lucienne Ronco, Josh Bittker, Arthur Liberzon, Mathias Wawer, Paul Clemons
      Genetic Perturbation Platform
        John Doench, Federica Piccioni, David Root
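
The query-string syntax listed on slides 24 and 25 (q, f, c, l, sk, s, d, g) could be exercised from Python roughly as follows. This is a minimal sketch, not the CMap client: the base URL and the function names build_query_url and decode_query_url are hypothetical, and only the parameter names and the /siginfo path come from the slides.

```python
import json
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical host; the slides show only relative paths such as /siginfo.
BASE_URL = "https://api.example.org"

# Parameters whose values are JSON documents on slide 25.
JSON_PARAMS = {"q", "f", "s"}


def _compact(doc):
    """Serialize a dict as compact JSON, e.g. {"cell":"A"}."""
    return json.dumps(doc, separators=(",", ":"))


def build_query_url(collection, query=None, fields=None, count=False,
                    limit=None, skip=None, sort=None, distinct=None,
                    group=None):
    """Build a request URL using the parameter names from slide 25."""
    params = {"q": _compact(query or {})}      # q: MongoDB-style query document
    if fields is not None:
        params["f"] = _compact(fields)         # f: field selection
    if count:
        params["c"] = "true"                   # c: document count
    if limit is not None:
        params["l"] = str(limit)               # l: document limit
    if skip is not None:
        params["sk"] = str(skip)               # sk: skip documents
    if sort is not None:
        params["s"] = _compact(sort)           # s: sort order
    if distinct is not None:
        params["d"] = distinct                 # d: distinct values of a field
    if group is not None:
        params["g"] = group                    # g: aggregation by a field
    return f"{BASE_URL}/{collection}?{urlencode(params)}"


def decode_query_url(url):
    """Parse a URL built above back into a dict, decoding JSON-valued parameters."""
    raw = parse_qs(urlsplit(url).query)
    decoded = {}
    for key, values in raw.items():
        value = values[0]
        decoded[key] = json.loads(value) if key in JSON_PARAMS else value
    return decoded


if __name__ == "__main__":
    url = build_query_url("siginfo", {"cell": "A", "name": "B"},
                          fields={"name": 1}, limit=10, skip=10)
    print(url)
    print(decode_query_url(url))
```

Passing the whole query document in q, rather than encoding fields in the path as in /siginfo/cell/A, means one endpoint per collection can serve any combination of filters, which appears to be the "functionality over convention" tradeoff named on slide 24.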