MongoDB for Collaborative Science
Shreyas Cholia, Dan Gunter
LBNL
(MongoDB San Francisco 2013)
About Us
We are Computer Scientists and Engineers at Lawrence Berkeley Lab
We work with science teams to help build software and computing infrastructure for doing awesome SCIENCE
1
Talk overview
Background: community science and data science in general, materials in particular
How (and why) we use MongoDB today
Things we would like to do with MongoDB in the future
Conclusions
2
Science is now a collaborative effort
[Diagram: large teams of people + lots of computational power → Science]
3
Big Data
Science is increasingly data-driven
Computational cycles are cheap
Take an "-omics" approach to science: compute all interesting things first, ask questions later
4
The Materials Project
An open science initiative that makes available a huge database of computed materials properties for all materials researchers
5
Why we care about "materials"
solar PV
electric vehicles
other:
  waste heat recovery (thermoelectrics)
  hydrogen storage
  catalysts / fuel cells
What do we mean by a material?
This is a Material!
https://www.materialsproject.org/materials/24972/
{ "created_at": "2012-08-30T02:55:49.139558",
  "version": { "pymatgen": "2.2.1dev", "db": "2012.07.09", "rest": "1.0" },
  "valid_response": true,
  "copyright": "Copyright 2012, The Materials Project",
  "response": [ {
      "formation_energy_per_atom": -1.7873700829440011,
      "elements": [ "Fe", "O" ],
      "band_gap": null,
      "e_above_hull": 0.0460096124999998,
      "nelements": 2,
      "pretty_formula": "Fe2O3",
      "energy": -132.33005625,
      "is_hubbard": true,
      "nsites": 20,
      "material_id": 542309,
      "unit_cell_formula": { "Fe": 8.0, "O": 12.0 }, ....
8
Business as usual:
"find the needle in a haystack"
[Chart: hours]
The Materials Project is like having an army to search through the haystack
High-throughput computing is like an army
[Diagram: MATERIALS + THEORY + COMPUTERS + WORKFLOW, vs. manually solving
$i\,\frac{d\Psi(\{r_i\};t)}{dt} = \hat{H}\,\Psi(\{r_i\};t)$
for each candidate; outcomes: "Do not synthesize!", "Put on backburner", "Begin further investigation"]
The data is our "haystack"
[Word cloud: frequencies of elements in the database]
13
Users
Materials Project website: http://materialsproject.org/
14
Infrastructure
[Diagram: submitted materials → calculation workflows (codes to run in sequence, atomic positions) → supercomputers → materials properties → materials data]
• Over 10 million CPU hours of calculations in < 6 months
• Over 40,000 successful VASP runs (30,000+ materials)
• Generalizable to other high-throughput codes/analyses
15
The Materials Project + MongoDB
We use MongoDB to store data, provenance, and state
16
Application Stack
Python Django web service
pymatgen and other scientific Python libraries
FireWorks + VASP
pymongo
MongoDB under all these layers
Currently at MongoDB 2.2
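For concreteness, a minimal pymongo sketch of the bottom layer (database and collection names are assumptions):

import pymongo

# Minimal sketch: connect and run one query through pymongo, the layer
# everything above builds on. Names ('mp', 'materials') are assumed.
client = pymongo.MongoClient('mongodb://localhost:27017/')
db = client['mp']
doc = db.materials.find_one({'pretty_formula': 'Fe2O3'})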
17
Powered by MongoDB
Materials Project data is stored in MongoDB:
Core materials properties
User generated data
Workflow and Job data
18
Scalability
We use replica sets in a primary/secondary configuration
No sharding (yet)
But we are doing pretty well with a small MongoDB cluster so far
4 nodes (2 prod, 2 dev)
128 GB RAM, 24 cores per node
Current data volume: 30,000 compounds, about 250 GB
Eventually: millions of compounds, big experimental data
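A hedged sketch of a read-heavy client for such a replica set (host names and set name are placeholders, not the real cluster):

from pymongo import MongoClient

# Sketch: a read-heavy client can spread reads across the secondaries.
client = MongoClient(
    'mongodb://db1.example.gov,db2.example.gov/?replicaSet=mp0',
    readPreference='secondaryPreferred')
db = client['mp']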
19
Why we like MongoDB
Flexibility
Developer-friendliness
JSON
Great for read-heavy patterns
20
Flexibility
Our data changes frequently (it's research)
This was a major pain point for the old SQL schema
The flip side, chaos, has not been a big problem: in reality, all access is programmatic, so the code implies the schema
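A small sketch of what that flexibility buys (the field name is hypothetical): a newly computed property can be written immediately, with no migration step.

import pymongo

db = pymongo.MongoClient()['mp']  # database name assumed

# 'elastic_tensor' is a hypothetical new property; older documents
# simply lack the field, and no schema migration is required.
db.materials.update_one(
    {'material_id': 542309},
    {'$set': {'elastic_tensor': [[1.0, 0.2], [0.2, 1.0]]}})

# Readers can still find documents with (or without) the new field.
updated = db.materials.find({'elastic_tensor': {'$exists': True}})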
21
Developer-friendly
We work in small groups: a mix of scientists and programmers who need something easy to learn and easy to use
Structuring the data is intuitive
Querying the data is intuitive
Querying the data is easy to map to web interfaces
We developed a small search language, "Moogle", that translates pretty cleanly to MongoDB filters
22
Search Example
23
Search as JSON
{
"nelements": {"$gt": 3, "$lt": 5},
"elements": {
"$all": ["Li", "Fe", "O"],
"$nin": ["Na"]
}
}
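That JSON is already a valid pymongo filter; a sketch of using it directly (database and collection names assumed):

import pymongo

db = pymongo.MongoClient()['mp']

# The search-as-JSON filter above passes straight into find().
query = {
    "nelements": {"$gt": 3, "$lt": 5},
    "elements": {"$all": ["Li", "Fe", "O"], "$nin": ["Na"]},
}
for doc in db.materials.find(query):
    print(doc["pretty_formula"])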
24
JSON
JSON is easy (enough) for scientists to read and understand
Direct translation between JSON and the database saves lots of time
  also a nearly-direct mapping to Python dicts
For example: Structured Notation Language, the format for describing the inputs of our computational jobs
Makes it very easy to build an HTTP API on our data: the Materials API
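As an illustration of how directly the response shape shown earlier maps to Python, a hedged sketch of calling such an HTTP API (the endpoint path and API-key header are assumptions, not documentation):

import requests

# Sketch: fetch computed properties over HTTP and read the JSON straight
# into Python. The endpoint path and auth header are assumed.
resp = requests.get(
    'https://www.materialsproject.org/rest/v1/materials/Fe2O3/vasp',
    headers={'X-API-KEY': 'YOUR_API_KEY'})
data = resp.json()
if data['valid_response']:
    for material in data['response']:
        print(material['pretty_formula'],
              material['formation_energy_per_atom'])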
25
Read/Write Patterns
Scientific data is usually generated in large runs and must go through validation first
Core data is only updated in bulk during controlled releases, so it is essentially read-only
User-generated data is read/write, but writes need not complete immediately
Workflow data is read/write and has some write issues, since timing matters
26
Our use of MongoDB
We use it directly, with no MongoMapper or other ORM-like layer
We did write a "QueryEngine" class: specific to our data, not a general ORM
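The talk does not show QueryEngine itself; as a rough sketch of the idea (all names hypothetical, not the actual class), a thin domain-specific wrapper rather than a general ORM:

import pymongo

class QueryEngine(object):
    # Hypothetical sketch, not the real Materials Project QueryEngine.
    def __init__(self, db):
        self.coll = db['materials']

    def get_material(self, material_id):
        return self.coll.find_one({'material_id': material_id})

    def by_elements(self, required, excluded=()):
        return self.coll.find({'elements': {'$all': list(required),
                                            '$nin': list(excluded)}})

qe = QueryEngine(pymongo.MongoClient()['mp'])
fe2o3 = qe.get_material(542309)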
27
Sample collections in DB

Collection             Record size   Count
diffraction_patterns   450 KB        38,621
materials              400 KB        30,758
tasks                  150 KB        71,463
crystals               40 KB         125,966

28
Building materials from tasks
Select the "blessed" material based on domain criteria:

if ndocs == 1:
    # only one successful result; could be GGA or GGA+U
    blessed_doc = docs[0]
    # add task_ids and ntask_ids
    blessed_doc['task_ids'] = [blessed_doc['task_id']]
    blessed_doc['ntask_ids'] = 1
elif ndocs >= 2:
    # multiple GGA and GGA+U runs:
    # select the GGA+U run if it exists,
    # else sort by free_energy_per_atom
    ....

29
Data Examples – Fe2O3
https://www.materialsproject.org/materials/24972/
{ "created_at": "2012-08-30T02:55:49.139558",
  "version": { "pymatgen": "2.2.1dev", "db": "2012.07.09", "rest": "1.0" },
  "valid_response": true,
  "copyright": "Copyright 2012, The Materials Project",
  "response": [ {
      "formation_energy_per_atom": -1.7873700829440011,
      "elements": [ "Fe", "O" ],
      "band_gap": null,
      "e_above_hull": 0.0460096124999998,
      "nelements": 2,
      "pretty_formula": "Fe2O3",
      "energy": -132.33005625,
      "is_hubbard": true,
      "nsites": 20,
      "material_id": 542309,
      "unit_cell_formula": { "Fe": 8.0, "O": 12.0 }, ....
30
Mongo works for us
Adding new properties is common, so we need a flexible schema
Data is nested and structured: more object-like, less relational
Need a flexible query mechanism to pull out the relevant objects
Read-heavy, few writes:
  core data only updated in bulk during releases
  user writes are less critical, so we can survive hiccups
31
Schema-Last
Mongo provides flexible schemas, but the web code expects data in a certain structure
A map/reduce methodology allows us to distill the data into the schema expected by the code
This lets us evolve the schema while maintaining a more rigorous structure for the frontend
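A hedged sketch of that distillation step, using the era-appropriate PyMongo map/reduce API (field names are illustrative; the real pipeline is more involved):

import pymongo
from bson.code import Code

db = pymongo.MongoClient()['mp']

# Roll raw 'tasks' documents up into one document per material.
mapper = Code("""
function () {
    emit(this.material_id, {energy: this.energy, task_ids: [this.task_id]});
}""")
reducer = Code("""
function (key, values) {
    var best = values[0];
    values.slice(1).forEach(function (v) {
        best.task_ids = best.task_ids.concat(v.task_ids);
        if (v.energy < best.energy) best.energy = v.energy;
    });
    return best;
}""")
db.tasks.map_reduce(mapper, reducer, out='materials_distilled')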
32
FireWorks - workflows
We keep all our state in MongoDB:
  raw inputs (crystals) for the calculations
  job specifications, ready and running
  output of finished runs
Outputs are loaded in batches back to MongoDB
Real-time updates for workflow status
MongoDB "write concerns" are important here
33
Lots of stuff needs to be validated
Atomic compound is stable
Volume of lattice is positive
Time to compute was greater than some minimum
"Phase diagram" correct:
  rough agreement across calculation types
  energies agree with experimental results
  .....
34
Validation requirements
Fast enough to use in real time
But not full MongoDB query syntax: too finicky, tricky especially for "or"-type queries
Want other people to be able to use this; we don't want to become a "MongoDB tutor"
Also nice to be general to any document-oriented datastore
35
Our validation workflow
[Diagram: a YAML constraint specification (written by a user; someday, perhaps, generated from ML) is built into a MongoDB query; coll.find() returns records; the tool determines which constraints failed for each record and produces a report.]
36
Example query

_aliases:
  - energy = final_energy_per_atom
materials_dbv2:
  -
    filter:
      - nelements = 4
    constraints:
      - energy > 0
      - spacegroup.number > 0
      - elements size$ nelements

$ mgvv -f mgvv-ex.yaml

Generated MongoDB query (note the constraints are negated, so the query matches records that violate them):

{'nelements': 4,
 '$or': [
   {'spacegroup.number': {'$lte': 0}},
   {'final_energy_per_atom': {'$lte': 0}},
   {'$where': 'this.elements.length != this.nelements'}
 ]}

37
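As a rough illustration of the negation idea (not the actual mgvv implementation), simple "field op value" constraints can be inverted and OR'd so the query returns only violating records:

# Sketch: invert constraints into a query that matches *violations*.
# The real tool also handles aliases and size comparisons such as
# "elements size$ nelements".
NEGATE = {'>': '$lte', '>=': '$lt', '<': '$gte', '<=': '$gt'}

def violations_query(filters, constraints):
    clauses = [{field: {NEGATE[op]: value}}
               for field, op, value in constraints]
    query = dict(filters)       # e.g. {'nelements': 4}
    query['$or'] = clauses      # any failed constraint is a hit
    return query

q = violations_query({'nelements': 4},
                     [('spacegroup.number', '>', 0),
                      ('final_energy_per_atom', '>', 0)])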
Example result
38
Sandboxes
39
Unified View
[Diagram: the Materials Project core database plus private sandbox data, presented as one unified view]
Sandbox implementation
We pre-build the sandboxes the same way we pre-build all the other derived collections
We use collection names with "." separating the components:
  core.<collection>, sandbox.<name>.<collection>
All access is mediated by our Django server: permissions stored in Django map users and groups to sandboxes
No cross-collection querying is needed
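A hedged sketch of how such mediated access could look (names assumed; the real permission checks live in Django):

import pymongo

db = pymongo.MongoClient()['mp']

def find_materials(db, user_sandboxes, query):
    # Query core, then each sandbox the user may read (per Django
    # permissions); merge in Python, so no cross-collection query.
    docs = list(db['core.materials'].find(query))
    for name in user_sandboxes:
        docs.extend(db['sandbox.%s.materials' % name].find(query))
    return docs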
40
Where we (and science) are going with all this..
41
Analytics: smarter, better data mining
42
Share data across disciplines
Use MongoDB as a flexible metadata store that connects data/metadata
Store and search user-generated ontologies
43
Organize and search PBs of data files
Currently stored in file hierarchies:
  /projectX/experimentA/tuesday/run99/foobar.hdf5
Move the metadata to MongoDB:
  {'project': 'X', 'experiment': 'A', 'run': 99,
   'date': '2013-05-10T12:31:56', 'user': 'joe', ....
   'path': '/projectX/d67001A3.hdf5'}
Already doing this for some projects
  one-offs: is a general layer possible?
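A sketch of the payoff (database and collection names assumed): finding one day's runs becomes a query instead of a directory walk.

import pymongo

db = pymongo.MongoClient()['filemeta']  # assumed names

# ISO-8601 date strings, as stored above, compare correctly as strings.
day_runs = db.files.find({'project': 'X',
                          'experiment': 'A',
                          'date': {'$gte': '2013-05-10',
                                   '$lt': '2013-05-11'}})
paths = [doc['path'] for doc in day_runs]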
44
Thank you
Contact info
Shreyas Cholia scholia@lbl.gov
Dan Gunter dkgunter@lbl.gov
Thanks to these people for slide material
Anubhav Jain, Kristin Persson, David Skinner (LBNL)
Thanks to Materials Project contributors
http://materialsproject.org/contributors
45


Editor's Notes

  • Science is now a collaborative effort: not really the lone scientist working in his basement; instead, think large teams collaborating on big data sets while supercomputers crunch the numbers.
  • Big Data: originally "compute everything first"; our materials science colleagues were horrified at the notion of being replaced by machines. "Compute (everything) first, ask questions later" is a very powerful concept, as we'll see in a minute: instead of trying to derive a solution and compute the result, you compute the space of all possibilities and look for the optimal result in there.
  • This is a Material!: show clickable REST link.
  • The data is our "haystack": word cloud showing frequencies of elements in our database. Is this analogy true?
  • Sample collections in DB: NOTE: include a diagram of the collections we store (and possibly some relationships).
  • Data Examples – Fe2O3: show clickable REST link.