making connections between genetics and disease
MongoDB and the Connectivity Map
.
.
.
.
.
Corey Rajiv
a common language
Gene Expression
.
.
.
.
.
.13
~7,000 experiments
Over 19,000 registered users
Cited by over 1,200 scientific reports
.
2006
.
2014
.16
CMap-LINCS dataset
1.4 million gene expression profiles
3,800 Genes (shRNA & cDNA)
• Targets/pathways of approved drugs
• ...
• Diverse use-cases
• Users with varying technical expertise
• Annotations are complex and
incomplete
• Frequent updates
C...
Store just what’s needed
Refactor frequently
Test and use daily
Data Model!
An agile philosophy keeps the model tractable
Data Model!
An inventory of signatures
siginfo
Data Model!
Shared fields as separate collections
siginfo
cellinfo
pertinfo
Data Model!
Add computed fields and external meta-data
siginfo cellinfo
Data Model!
Duplicate data to optimize lookups
siginfo pertinfo
APIs!
Are awesome, we need more of them
Picked functionality over convention!
/siginfo?q={“cell”:”A”}	
  vs	
  /siginfo/ce...
API!
MongoDB inspired a rich query syntax
Function Example
Query /siginfo?q={“cell:”A”,”name”:”B”}
Field selection /siginf...
API!
Node and Mongoose enable easy API creation
Language Bindings!
JSON as a universal format
Javascript
Python
R
Analytic Tools!
A compute API liberates command line scripts
Compute API!
Messaging handled via a capped collection
Input Validation!
JSON Schema simplifies validation
GCTX : A binary format based on HDF5
Cross platform
Multi-language support
Efficient I/O
Storage size for 30 billion data p...
Sign up at lincscloud.org
Lincscloud!
A platform for easy access to perturbational data
Free for academic use
Predicting Drug Function!
Diverse structures, common activities
Predicting Drug Function!
Diverse structures, common activities
VEGFR inhibitor
PPARG agonist
PI3K/MTOR inhibitor
ROCK inh...
Finding Novel Drug Targets!
Repurposing failed drugs
Original target
Finding Novel Drug Targets!
Repurposing failed drugs
Original target
Failed in Phase 2 clinical trial due to lack of effica...
Finding Novel Drug Targets!
Repurposing failed drugs
Original target
Novel Target A
Novel Target B
Novel Target C
Novel Ta...
Acknowledgements
Todd Golub

Core Team: Analysis & Software
Arvind Subramanian
Jacob Asiedu
Larson Hogstrom
Ian Smith
Davi...
MongoDB and the Connectivity Map: Making Connections Between Genetics and Disease
MongoDB and the Connectivity Map: Making Connections Between Genetics and Disease
MongoDB and the Connectivity Map: Making Connections Between Genetics and Disease
MongoDB and the Connectivity Map: Making Connections Between Genetics and Disease
MongoDB and the Connectivity Map: Making Connections Between Genetics and Disease
MongoDB and the Connectivity Map: Making Connections Between Genetics and Disease
Upcoming SlideShare
Loading in...5
×

MongoDB and the Connectivity Map: Making Connections Between Genetics and Disease

992

Published on

The Broad Institute has developed a novel high-throughput gene-expression profiling technology and has used it to build an open-source catalog of over a million profiles that captures the functional states of cells when treated with drugs and other types of perturbations. Referred to as the Connectivity Map (or CMap), these data when paired with pattern matching algorithms, facilitate the discovery of connections between drugs, genes and diseases. We wished to expose this resource to scientists around the world via an API that is easily accessible to programmers and biologists alike. We required a database solution that could handle a variety of data types and handle frequent changes to the schema. We realized that a relational database did not fit our needs, and gravitated towards MongoDB for its ease of use, support for dynamic schema, complex data structures and expressive query syntax. In this talk, we’ll walk through how we built the CMap library. We’ll discuss why we chose MongoDB, the various schema design iterations and tradeoffs we’ve made, how people are using the API, and what we’re planning for the next generation of biomedical data.

Published in: Technology
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
992
On Slideshare
0
From Embeds
0
Number of Embeds
7
Actions
Shares
0
Downloads
23
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

MongoDB and the Connectivity Map: Making Connections Between Genetics and Disease

  1. 1. making connections between genetics and disease MongoDB and the Connectivity Map
  2. 2. .
  3. 3. .
  4. 4. .
  5. 5. .
  6. 6. . Corey Rajiv
  7. 7. a common language Gene Expression
  8. 8. .
  9. 9. .
  10. 10. .
  11. 11. .
  12. 12. .
  13. 13. .13 ~7,000 experiments Over 19,000 registered users Cited by over 1,200 scientific reports
  14. 14. . 2006
  15. 15. . 2014
  16. 16. .16
  17. 17. CMap-LINCS dataset 1.4 million gene expression profiles 3,800 Genes (shRNA & cDNA) • Targets/pathways of approved drugs • Candidate disease genes • Community nominations 15 Cell types • Banked primary cell types • Cancer cell lines • Primary hTERT-immortalized • Patient-derived iPS cells • Community nominated 12,488 Compounds • FDA approved drugs • Bioactive tool compounds • Screening hits
  18. 18. • Diverse use-cases • Users with varying technical expertise • Annotations are complex and incomplete • Frequent updates CMap Data! Easy to describe, tough to Model
  19. 19. Store just what’s needed Refactor frequently Test and use daily Data Model! An agile philosophy keeps the model tractable
  20. 20. Data Model! An inventory of signatures siginfo
  21. 21. Data Model! Shared fields as separate collections siginfo cellinfo pertinfo
  22. 22. Data Model! Add computed fields and external meta-data siginfo cellinfo
  23. 23. Data Model! Duplicate data to optimize lookups siginfo pertinfo
  24. 24. APIs! Are awesome, we need more of them Picked functionality over convention! /siginfo?q={“cell”:”A”}  vs  /siginfo/cell/A
  25. 25. API! MongoDB inspired a rich query syntax Function Example Query /siginfo?q={“cell:”A”,”name”:”B”} Field selection /siginfo?q={}&f={“name”:1} Document count /siginfo?q={}&c=true Document limit /siginfo?q={}&l=10 Skip documents /siginfo?q={}&l=10&sk=10 Sort order /siginfo?q={}&s={“name”:-­‐1,”cell”:1} Distinct values /siginfo?q={}&d=name Aggregation /siginfo?q={}&g=name
  26. 26. API! Node and Mongoose enable easy API creation
  27. 27. Language Bindings! JSON as a universal format Javascript Python R
  28. 28. Analytic Tools! A compute API liberates command line scripts
  29. 29. Compute API! Messaging handled via a capped collection
  30. 30. Input Validation! JSON Schema simplifies validation
  31. 31. GCTX : A binary format based on HDF5 Cross platform Multi-language support Efficient I/O Storage size for 30 billion data points is 110 Gb Numeric Matrix Data! HDF5 offers efficient storage for large matrices
  32. 32. Sign up at lincscloud.org Lincscloud! A platform for easy access to perturbational data Free for academic use
  33. 33. Predicting Drug Function! Diverse structures, common activities
  34. 34. Predicting Drug Function! Diverse structures, common activities VEGFR inhibitor PPARG agonist PI3K/MTOR inhibitor ROCK inhibitor Estrogen agonist
  35. 35. Finding Novel Drug Targets! Repurposing failed drugs Original target
  36. 36. Finding Novel Drug Targets! Repurposing failed drugs Original target Failed in Phase 2 clinical trial due to lack of efficacy
  37. 37. Finding Novel Drug Targets! Repurposing failed drugs Original target Novel Target A Novel Target B Novel Target C Novel Target D
  38. 38. Acknowledgements Todd Golub
 Core Team: Analysis & Software Arvind Subramanian Jacob Asiedu Larson Hogstrom Ian Smith David Lahr Aravind Subramanian Josh Gould Ted Natoli David Wadden ! Core Team: Lab John Davis David Peck Xiaodong Lu Melanie Donahue Daniel Lam Jackie Rosains (Project Manager) Collaborators Bang Wong Steven Corsello (Golub lab) Jake Jaffe (Proteomics) David Takeda (Hahn lab) Pablo Tamayo ! Chemistry & Therapeutics Lucienne Ronco Josh Bittker Arthur Liberzon Mathias Wawer Paul Clemons ! Genetic Perturbation Platform John Doench Federica Piccioni David Root
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×