nowomics
making science easy to follow

&

mongoDB
Richard Smith	

richard@nowomics.com

www.nowomics.com
OVERVIEW
• About nowomics
• Why mongo?
• Data model
• Aggregation Framework
• A bit about replica sets
Ask lots of questions
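Biomedical data are being generated and published at an unprecedented rate
[Figure labels: model organisms, 1500, literature, biological databases, proteins, pathways, genome annotation, gene expression, interactions, ~20,000 journal articles a week, mutations, diseases]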
THE SOLUTION

BRCA2 (gene) [Follow]
diabetes, type 2 (disease) [Follow]
neuron development (process) [Follow]
HOW NOWOMICS WORKS
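literature & databases
Fetch data every day
nowomics
Work out what’s changed, link to original data source
Personalised News Feed & email alerts
Follow
Users follow what they work on
Organise by gene, disease, process, author, etc.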
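The daily "fetch, then work out what's changed" step could be sketched as a set difference between two snapshots. This is only an illustration under assumptions: the function name `diff_links`, the tuple layout and the sample IDs are hypothetical and are not taken from the nowomics code.

```python
def diff_links(previous, current):
    """Return (added, removed) links between two daily snapshots."""
    prev, curr = set(previous), set(current)
    return sorted(curr - prev), sorted(prev - curr)


# Hypothetical snapshots: (type1, id1, type2, id2) tuples.
yesterday = [("gene", 101, "pub", 201)]
today = [("gene", 101, "pub", 201), ("gene", 101, "pub", 202)]

added, removed = diff_links(yesterday, today)
print(added)    # links that are new today
print(removed)  # links that disappeared since yesterday
```

Only the added links would need to appear in a follower's news feed; how the real service detects changes is not stated in the slides.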
Alpha - summer 2013
Beta - November 2013
CEDAR Enterprise Fellowship - December 2013
TECH
Python: pymongo
MongoDB: ~20GB (data & indexes)
Amazon: EC2, S3, SES
Elasticsearch
WHY MONGO?
• ‘schema less’ - data will change over time
• horizontal scaling
• rich query system
• ease of development
DATA MODEL
Gene - Publication
Gene - Disease
DATA MODEL
Gene - Publication
Gene - Disease

Tracking these relationships
gene: PPARG
disease: diabetes
date: 17 Nov 2013
source: NCBI
DATA MODEL
Gene - Publication
Gene - Disease

Tracking these relationships
gene: PPARG
disease: diabetes
experiment: GWAS
probability: 0.012
date: 17 Nov 2013
source: NCBI
+ more fields ???
+ more types
COLLECTIONS
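collection  count  example document
links       12.5m  {t1: gene, n1: 101, t2: pub, n2: 201, date: 2013-11-17, type: pub}
gene        200k   {id: 101, symbol: PPARG}
pub         1.4m   {id: 201, identifier: 24386954}
disease     12k    {id: 201, name: diabetes}
(links uses short field names)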
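To make the link layout concrete, here is a minimal in-memory sketch of how a `links` document points at documents in the other collections through its type/id pairs (`t1`/`n1`, `t2`/`n2`). The field values are the ones shown on the slide; the `resolve` helper and the dictionary stand-ins for collections are assumptions made for illustration, not the nowomics implementation.

```python
# In-memory stand-ins for the collections; values copied from the slide.
links = [
    {"t1": "gene", "n1": 101, "t2": "pub", "n2": 201,
     "date": "2013-11-17", "type": "pub"},
]
collections = {
    "gene": {101: {"id": 101, "symbol": "PPARG"}},
    "pub": {201: {"id": 201, "identifier": 24386954}},
}


def resolve(link):
    """Look up the two documents a link points at (hypothetical helper)."""
    left = collections[link["t1"]][link["n1"]]
    right = collections[link["t2"]][link["n2"]]
    return left, right


gene, pub = resolve(links[0])
print(gene["symbol"], pub["identifier"])
```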
[Figure: items labelled 1 to 4]

1+4 - precalculate with aggregation framework
3 - wasn’t using correct index, needed hint
2 - uses aggregation framework, doesn’t support hint
AGGREGATION FRAMEWORK I
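• New in 2.2 - alternative to map reduce
• map reduce was slow and complex
• Analogous to SQL group by
• Run a pipeline of commands
db.links.aggregate([
  {$match: {t2: 'pub', t1: 'gene'}},
  {$group: {_id: '$n2', count: {$sum: 1}}}
])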
AGGREGATION FRAMEWORK II

*

*
AGGREGATION FRAMEWORK III

*
AGGREGATION EXAMPLE
Count of genes linked to each publication

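db.links.aggregate([
  {$match: {t2: 'pub', t1: 'gene'}},
  {$group: {_id: '$n2', count: {$sum: 1}}}
])
(actually precalculate for all and store results in collection)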
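For readers unfamiliar with the pipeline stages, the following plain-Python sketch computes the same kind of result (genes per publication) over a few in-memory documents. It is not how MongoDB executes the query; the sample documents and the function name `genes_per_publication` are hypothetical.

```python
from collections import Counter

# Hypothetical sample of link documents using the slide's short field names.
links = [
    {"t1": "gene", "n1": 101, "t2": "pub", "n2": 201},
    {"t1": "gene", "n1": 102, "t2": "pub", "n2": 201},
    {"t1": "gene", "n1": 101, "t2": "pub", "n2": 202},
    {"t1": "gene", "n1": 101, "t2": "disease", "n2": 301},
]


def genes_per_publication(docs):
    """Emulate the $match stage, then the $group stage with {$sum: 1}."""
    matched = (d for d in docs if d["t1"] == "gene" and d["t2"] == "pub")
    return Counter(d["n2"] for d in matched)


counts = genes_per_publication(links)
print(dict(counts))
```

The gene-to-disease link is dropped by the match step, so only publication IDs appear as keys.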
AGGREGATION EXAMPLE
Count of updates per month
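db.links.aggregate([
  {$match: {date: {$gte: new ISODate('2013-02-01')}, 't1': 'gene', 'n1': 530}},
  {$project: {_id: 0, month: {$month: '$date'}, year: {$year: '$date'}}},
  {$group: {'_id': {m: '$month', y: '$year'}, count: {$sum: 1}}}
])
(actually precalculate for all and store results in collection)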
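As a rough illustration of what the per-month pipeline computes, here is a plain-Python equivalent over in-memory documents: filter by gene and start date, derive year and month from the date, then count per (year, month). The sample data and the function name `updates_per_month` are assumptions; only the gene id 530 and the 2013-02-01 cutoff come from the slide.

```python
from collections import Counter
from datetime import datetime

# Hypothetical link documents for illustration.
links = [
    {"t1": "gene", "n1": 530, "date": datetime(2013, 1, 15)},   # before cutoff
    {"t1": "gene", "n1": 530, "date": datetime(2013, 2, 3)},
    {"t1": "gene", "n1": 530, "date": datetime(2013, 2, 20)},
    {"t1": "gene", "n1": 530, "date": datetime(2013, 11, 17)},
    {"t1": "gene", "n1": 101, "date": datetime(2013, 11, 17)},  # other gene
]


def updates_per_month(docs, gene_id, since):
    """Count links for one gene per (year, month), from `since` onwards."""
    counts = Counter()
    for d in docs:
        if d["t1"] == "gene" and d["n1"] == gene_id and d["date"] >= since:
            counts[(d["date"].year, d["date"].month)] += 1
    return counts


monthly = updates_per_month(links, 530, datetime(2013, 2, 1))
print(dict(monthly))
```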
AGGREGATION ISSUES
• No explain() (coming in 2.6)
• Can’t use index hints
• 16MB result limit - run in batches (coming in 2.6)
• Can’t output results to collection (coming in 2.6)
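One way to "run in batches" is to split the IDs being aggregated into chunks and build one pipeline per chunk, so each result set stays small. The sketch below only constructs the pipelines as Python data and does not contact a database. The helper names `chunks` and `pipeline_for`, the batch size and the ID range are hypothetical; `$match`, `$in`, `$group` and `$sum` are standard MongoDB operators.

```python
def chunks(ids, size):
    """Yield consecutive slices of `ids`, each with at most `size` items."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def pipeline_for(batch):
    """Build an aggregation pipeline restricted to one batch of gene ids."""
    return [
        {"$match": {"t1": "gene", "t2": "pub", "n1": {"$in": batch}}},
        {"$group": {"_id": "$n1", "count": {"$sum": 1}}},
    ]


gene_ids = list(range(101, 111))  # ten hypothetical gene ids
pipelines = [pipeline_for(batch) for batch in chunks(gene_ids, 4)]
print(len(pipelines))
```

Each pipeline could then be passed to the driver's `aggregate` call in turn and the partial results stored separately.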
EC2 ARCHITECTURE
[Diagram: mongoDB replica set. Labels: primary (EC2 large), secondary (EC2 large), arbiter (EC2 micro); app (EC2 large)]
EC2 ARCHITECTURE
[Diagram: mongoDB replica set. Labels in order: primary, app, secondary, EC2 large, EC2 large, app, EC2 large, arbiter, EC2 micro]
PERFORMANCE
Indexes & data (20GB) bigger than RAM (8GB)
• main indexes in RAM would be OK
• Loading data uses different indexes
• Slow page loads
PERFORMANCE
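Indexes & data (20GB) bigger than RAM (8GB)
• main indexes in RAM would be OK
• Loading data uses different indexes
• Slow page loads
• ReadPreference.SECONDARY_PREFERRED: send links queries to the secondary
• indexes stay in RAM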
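The slide names pymongo's `ReadPreference.SECONDARY_PREFERRED`. The same preference can also be expressed through the standard MongoDB connection-string options `replicaSet` and `readPreference`. The sketch below only builds such a URI; the host names, the replica set name `rs0` and the helper `replica_set_uri` are hypothetical.

```python
from urllib.parse import urlencode


def replica_set_uri(hosts, replica_set, read_preference="secondaryPreferred"):
    """Build a MongoDB URI that prefers reading from a secondary."""
    options = urlencode({
        "replicaSet": replica_set,
        "readPreference": read_preference,
    })
    return "mongodb://{}/?{}".format(",".join(hosts), options)


# Hypothetical hosts for the data-bearing members of the replica set.
uri = replica_set_uri(["db1.example.com:27017", "db2.example.com:27017"], "rs0")
print(uri)
```

A driver such as pymongo accepts a URI of this form in `MongoClient(uri)`. Data loading still writes to the primary, while page queries read from the secondary when one is available.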
nowomics
making science easy to follow

richard@nowomics.com

www.nowomics.com
Nowomics & MongoDB

Nowomics.com is a website to help biology researchers stay up to date with the latest data and papers relevant to their research. This is a talk given at the Cambridge UK MongoDB User Group about how Nowomics is built on MongoDB.

Published in: Technology, Health & Medicine
Nowomics & MongoDB

  1. nowomics making science easy to follow & mongoDB Richard Smith richard@nowomics.com www.nowomics.com
  2. OVERVIEW • About nowomics • Why mongo? • Data model • Aggregation Framework • A bit about replica sets Ask lots of questions
  3. Biomedical data are being generated and published at an unprecedented rate model organisms 1500 literature biological databases proteins pathways genome annotation gene expression interactions ~20,000 journal articles a week mutations diseases
  4. THE SOLUTION BRCA2 gene Follow diabetes, type 2 disease Follow neuron development process Follow
  5. HOW NOWOMICS WORKS literature & databases Fetch data every day nowomics Work out what’s changed link to original data source Personalised News Feed & email alerts Follow Users follow what they work on Organise by gene, disease, process, author, etc
  6. Alpha - summer 2013 Beta - November 2013 CEDAR Enterprise Fellowship - December 2013
  7. TECH Python pymongo MongoDB Amazon ~20GB (data & indexes) EC2, S3, SES Elasticsearch
  8. WHY MONGO? • ‘schema less’ - data will change over time • horizontal scaling • rich query system • ease of development
  9. DATA MODEL Gene Publication Gene Disease
  10. DATA MODEL Gene Publication Gene Disease Tracking these relationships gene: PPARG disease: diabetes date: 17 Nov 2013 source: NCBI
  11. DATA MODEL Gene Publication Gene Disease Tracking these relationships gene: PPARG disease: diabetes experiment: GWAS probability: 0.012 date: 17 Nov 2013 source: NCBI + more fields ??? + more types
  12. COLLECTIONS links 12.5m short field names {t1: gene, n1: 101, t2: pub, n2: 201, date: 2013-11-17, type: pub } gene 200k {id: 101, symbol: PPARG} pub 1.4m {id: 201, identifier: 24386954} disease 12k {id: 201, name: diabetes}
  13. 1. 2. 3. 4. 1+4 - precalculate with aggregation framework 3 - wasn’t using correct index, needed hint 2 - uses aggregation framework, doesn’t support hint
  14. AGGREGATION FRAMEWORK I • New in 2.2 - alternative to map reduce • map reduce was slow and complex • Analogous to SQL group by • Run a pipeline of commands db.links.aggregate([ {$match: {t2: 'pub', t1: 'gene'}}, {$group: {_id: '$n2', count: {$sum: 1}}} ])
  15. AGGREGATION FRAMEWORK II * *
  16. AGGREGATION FRAMEWORK III *
  17. AGGREGATION EXAMPLE Count of genes linked to each publication db.links.aggregate([ {$match: {t2: 'pub', t1: 'gene'}}, {$group: {_id: '$n2', count: {$sum: 1}}} ]) (actually precalculate for all and store results in collection)
  18. AGGREGATION EXAMPLE Count of updates per month db.links.aggregate([ {$match: {date: {$gte: new ISODate('2013-02-01')}, 't1': 'gene', 'n1': 530}}, {$project: {_id: 0, month: {$month: '$date'}, year: {$year: '$date'}}}, {$group: {'_id': {m: '$month', y: '$year'}, count: {$sum: 1}}} ]) (actually precalculate for all and store results in collection)
  19. AGGREGATION ISSUES • No explain() (coming in 2.6) • Can’t use index hints • 16MB result limit - run in batches (coming in 2.6) • Can’t output results to collection (coming in 2.6)
  20. EC2 ARCHITECTURE primary secondary EC2 large EC2 large app EC2 large arbiter EC2 micro mongoDB replica set
  21. EC2 ARCHITECTURE primary app secondary EC2 large EC2 large app EC2 large arbiter EC2 micro mongoDB replica set
  22. PERFORMANCE Indexes & data (20GB) bigger than RAM (8GB) • main indexes in RAM would be OK • Loading data uses different indexes • Slow page loads
  23. PERFORMANCE Indexes & data (20GB) bigger than RAM (8GB) • main indexes in RAM would be OK • Loading data uses different indexes • Slow page loads • ReadPreference.SECONDARY_PREFERRED send links queries to secondary • indexes stay in RAM
  24. nowomics making science easy to follow richard@nowomics.com www.nowomics.com
