MongoDB for Collaborative Science
Shreyas Cholia, Dan Gunter
LBNL
(MongoDB San Francisco 2013)
About Us
We are Computer Scientists and Engineers at Lawrence Berkeley Lab
We work with science teams to help build software and computing infrastructure for doing awesome SCIENCE
1
Talk overview
Background: community science and data science in general, materials in particular
How (and why) we use MongoDB today
Things we would like to do with MongoDB in the future
Conclusions
2
Science is now a collaborative effort
[Diagram: large teams of people + lots of computational power → Science]
3
Big Data
Science is increasingly data-driven
Computational cycles are cheap
Take an "-omics" approach to science: compute all interesting things first, ask questions later
4
The Materials Project
An open science initiative that makes available a huge database of computed materials properties for all materials researchers
5
Why we care about "materials"
solar PV
electric vehicles
other:
  waste heat recovery (thermoelectrics)
  hydrogen storage
  catalysts / fuel cells
What do we mean by a material?
This is a Material!
https://www.materialsproject.org/materials/24972/
{ "created_at": "2012-08-30T02:55:49.139558",
  "version": { "pymatgen": "2.2.1dev", "db": "2012.07.09", "rest": "1.0" },
  "valid_response": true,
  "copyright": "Copyright 2012, The Materials Project",
  "response": [ {
      "formation_energy_per_atom": -1.7873700829440011,
      "elements": [ "Fe", "O" ],
      "band_gap": null,
      "e_above_hull": 0.0460096124999998,
      "nelements": 2,
      "pretty_formula": "Fe2O3",
      "energy": -132.33005625,
      "is_hubbard": true,
      "nsites": 20,
      "material_id": 542309,
      "unit_cell_formula": { "Fe": 8.0, "O": 12.0 }, ....
8
Business as usual:
"find the needle in a haystack"
[Chart: hours]
The Materials Project is like having an army to search through the haystack
High-throughput computing is like an army
[Diagram: MATERIALS + THEORY + COMPUTERS + WORKFLOW, vs. manually solving
$i\,\frac{d\Psi(\{r_i\};t)}{dt} = \hat{H}\,\Psi(\{r_i\};t)$
for each candidate; outcomes: "Do not synthesize!", "Put on backburner", "Begin further investigation"]
The data is our "haystack"
[Word cloud: frequencies of elements in the database]
13
Users
Materials Project website: http://materialsproject.org/
14
Infrastructure
[Diagram: submitted materials → calculation workflows (codes to run in sequence, atomic positions) → supercomputers → materials properties → materials data]
• Over 10 million CPU hours of calculations in < 6 months
• Over 40,000 successful VASP runs (30,000+ materials)
• Generalizable to other high-throughput codes/analyses
15
The Materials Project + MongoDB
We use MongoDB to store data, provenance, and state
16
Application Stack
Python Django web service
pymatgen and other scientific Python libraries
FireWorks + VASP
pymongo
MongoDB under all these layers
Currently at MongoDB 2.2
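For concreteness, a minimal pymongo sketch of the bottom layer (database and collection names are assumptions):

import pymongo

# Minimal sketch: connect and run one query through pymongo, the layer
# everything above builds on. Names ('mp', 'materials') are assumed.
client = pymongo.MongoClient('mongodb://localhost:27017/')
db = client['mp']
doc = db.materials.find_one({'pretty_formula': 'Fe2O3'})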
17
Powered by MongoDB
Materials Project data is stored in MongoDB:
Core materials properties
User generated data
Workflow and Job data
18
Scalability
We use replica sets in a primary/secondary configuration
No sharding (yet)
But we are doing pretty well with a small MongoDB cluster so far
4 nodes (2 prod, 2 dev)
128 GB RAM, 24 cores per node
Current data volume: 30,000 compounds, about 250 GB
Eventually: millions of compounds, big experimental data
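A hedged sketch of a read-heavy client for such a replica set (host names and set name are placeholders, not the real cluster):

from pymongo import MongoClient

# Sketch: a read-heavy client can spread reads across the secondaries.
client = MongoClient(
    'mongodb://db1.example.gov,db2.example.gov/?replicaSet=mp0',
    readPreference='secondaryPreferred')
db = client['mp']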
19
Why we like MongoDB
Flexibility
Developer-friendliness
JSON
Great for read-heavy patterns
20
Flexibility
Our data changes frequently (it's research)
This was a major pain point for the old SQL schema
The flip side, chaos, has not been a big problem: in reality, all access is programmatic, so the code implies the schema
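A small sketch of what that flexibility buys (the field name is hypothetical): a newly computed property can be written immediately, with no migration step.

import pymongo

db = pymongo.MongoClient()['mp']  # database name assumed

# 'elastic_tensor' is a hypothetical new property; older documents
# simply lack the field, and no schema migration is required.
db.materials.update_one(
    {'material_id': 542309},
    {'$set': {'elastic_tensor': [[1.0, 0.2], [0.2, 1.0]]}})

# Readers can still find documents with (or without) the new field.
updated = db.materials.find({'elastic_tensor': {'$exists': True}})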
21
Developer-friendly
We work in small groups: a mix of scientists and programmers who need something easy to learn and easy to use
Structuring the data is intuitive
Querying the data is intuitive
Querying the data is easy to map to web interfaces
We developed a small search language, "Moogle", that translates pretty cleanly to MongoDB filters
22
Search Example
23
Search as JSON
{
"nelements": {"$gt": 3, "$lt": 5},
"elements": {
"$all": ["Li", "Fe", "O"],
"$nin": ["Na"]
}
}
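That JSON is already a valid pymongo filter; a sketch of using it directly (database and collection names assumed):

import pymongo

db = pymongo.MongoClient()['mp']

# The search-as-JSON filter above passes straight into find().
query = {
    "nelements": {"$gt": 3, "$lt": 5},
    "elements": {"$all": ["Li", "Fe", "O"], "$nin": ["Na"]},
}
for doc in db.materials.find(query):
    print(doc["pretty_formula"])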
24
JSON
JSON is easy (enough) for scientists to read and understand
Direct translation between JSON and the database saves lots of time
  also a nearly-direct mapping to Python dicts
For example: Structured Notation Language, the format for describing the inputs of our computational jobs
Makes it very easy to build an HTTP API on our data: the Materials API
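As an illustration of how directly the response shape shown earlier maps to Python, a hedged sketch of calling such an HTTP API (the endpoint path and API-key header are assumptions, not documentation):

import requests

# Sketch: fetch computed properties over HTTP and read the JSON straight
# into Python. The endpoint path and auth header are assumed.
resp = requests.get(
    'https://www.materialsproject.org/rest/v1/materials/Fe2O3/vasp',
    headers={'X-API-KEY': 'YOUR_API_KEY'})
data = resp.json()
if data['valid_response']:
    for material in data['response']:
        print(material['pretty_formula'],
              material['formation_energy_per_atom'])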
25
Read/Write Patterns
Scientific data is usually generated in large runs and must go through validation first
Core data is only updated in bulk during controlled releases, so it is essentially read-only
User-generated data is read/write, but writes need not complete immediately
Workflow data is read/write and has some write issues, since timing matters
26
Our use of MongoDB
We use it directly, with no MongoMapper or other ORM-like layer
We did write a "QueryEngine" class: specific to our data, not a general ORM
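The talk does not show QueryEngine itself; as a rough sketch of the idea (all names hypothetical, not the actual class), a thin domain-specific wrapper rather than a general ORM:

import pymongo

class QueryEngine(object):
    # Hypothetical sketch, not the real Materials Project QueryEngine.
    def __init__(self, db):
        self.coll = db['materials']

    def get_material(self, material_id):
        return self.coll.find_one({'material_id': material_id})

    def by_elements(self, required, excluded=()):
        return self.coll.find({'elements': {'$all': list(required),
                                            '$nin': list(excluded)}})

qe = QueryEngine(pymongo.MongoClient()['mp'])
fe2o3 = qe.get_material(542309)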
27
Sample collections in DB

Collection             Record size   Count
diffraction_patterns   450 KB        38,621
materials              400 KB        30,758
tasks                  150 KB        71,463
crystals               40 KB         125,966

28
Building materials from tasks
Select the "blessed" material based on domain criteria:

if ndocs == 1:
    # only one successful result; could be GGA or GGA+U
    blessed_doc = docs[0]
    # add task_ids and ntask_ids
    blessed_doc['task_ids'] = [blessed_doc['task_id']]
    blessed_doc['ntask_ids'] = 1
elif ndocs >= 2:
    # multiple GGA and GGA+U runs:
    # select the GGA+U run if it exists,
    # else sort by free_energy_per_atom
    ....

29
Data Examples – Fe2O3
https://www.materialsproject.org/materials/24972/
{ "created_at": "2012-08-30T02:55:49.139558",
  "version": { "pymatgen": "2.2.1dev", "db": "2012.07.09", "rest": "1.0" },
  "valid_response": true,
  "copyright": "Copyright 2012, The Materials Project",
  "response": [ {
      "formation_energy_per_atom": -1.7873700829440011,
      "elements": [ "Fe", "O" ],
      "band_gap": null,
      "e_above_hull": 0.0460096124999998,
      "nelements": 2,
      "pretty_formula": "Fe2O3",
      "energy": -132.33005625,
      "is_hubbard": true,
      "nsites": 20,
      "material_id": 542309,
      "unit_cell_formula": { "Fe": 8.0, "O": 12.0 }, ....
30
Mongo works for us
Adding new properties is common, so we need a flexible schema
Data is nested and structured: more object-like, less relational
Need a flexible query mechanism to pull out the relevant objects
Read-heavy, few writes:
  core data only updated in bulk during releases
  user writes are less critical, so we can survive hiccups
31
Schema-Last
Mongo provides flexible schemas, but the web code expects data in a certain structure
A map/reduce methodology allows us to distill the data into the schema expected by the code
This lets us evolve the schema while maintaining a more rigorous structure for the frontend
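A hedged sketch of that distillation step, using the era-appropriate PyMongo map/reduce API (field names are illustrative; the real pipeline is more involved):

import pymongo
from bson.code import Code

db = pymongo.MongoClient()['mp']

# Roll raw 'tasks' documents up into one document per material.
mapper = Code("""
function () {
    emit(this.material_id, {energy: this.energy, task_ids: [this.task_id]});
}""")
reducer = Code("""
function (key, values) {
    var best = values[0];
    values.slice(1).forEach(function (v) {
        best.task_ids = best.task_ids.concat(v.task_ids);
        if (v.energy < best.energy) best.energy = v.energy;
    });
    return best;
}""")
db.tasks.map_reduce(mapper, reducer, out='materials_distilled')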
32
FireWorks - workflows
We keep all our state in MongoDB:
  raw inputs (crystals) for the calculations
  job specifications, ready and running
  output of finished runs
Outputs are loaded in batches back to MongoDB
Real-time updates for workflow status
MongoDB "write concerns" are important here
33
Lots of stuff needs to be validated
Atomic compound is stable
Volume of lattice is positive
Time to compute was greater than some minimum
"Phase diagram" correct:
  rough agreement across calculation types
  energies agree with experimental results
  .....
34
Validation requirements
Fast enough to use in real time
But not full MongoDB query syntax: too finicky, tricky especially for "or"-type queries
Want other people to be able to use this; we don't want to become a "MongoDB tutor"
Also nice to be general to any document-oriented datastore
35
Our validation workflow
[Diagram: a YAML constraint specification (written by a user; someday, perhaps, generated from ML) is built into a MongoDB query; coll.find() returns records; the tool determines which constraints failed for each record and produces a report.]
36
Example query

_aliases:
  - energy = final_energy_per_atom
materials_dbv2:
  -
    filter:
      - nelements = 4
    constraints:
      - energy > 0
      - spacegroup.number > 0
      - elements size$ nelements

$ mgvv -f mgvv-ex.yaml

Generated MongoDB query (note the constraints are negated, so the query matches records that violate them):

{'nelements': 4,
 '$or': [
   {'spacegroup.number': {'$lte': 0}},
   {'final_energy_per_atom': {'$lte': 0}},
   {'$where': 'this.elements.length != this.nelements'}
 ]}

37
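As a rough illustration of the negation idea (not the actual mgvv implementation), simple "field op value" constraints can be inverted and OR'd so the query returns only violating records:

# Sketch: invert constraints into a query that matches *violations*.
# The real tool also handles aliases and size comparisons such as
# "elements size$ nelements".
NEGATE = {'>': '$lte', '>=': '$lt', '<': '$gte', '<=': '$gt'}

def violations_query(filters, constraints):
    clauses = [{field: {NEGATE[op]: value}}
               for field, op, value in constraints]
    query = dict(filters)       # e.g. {'nelements': 4}
    query['$or'] = clauses      # any failed constraint is a hit
    return query

q = violations_query({'nelements': 4},
                     [('spacegroup.number', '>', 0),
                      ('final_energy_per_atom', '>', 0)])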
Example result
38
Sandboxes
39
Unified View
[Diagram: the Materials Project core database plus private sandbox data, presented as one unified view]
Sandbox implementation
We pre-build the sandboxes the same way we pre-build all the other derived collections
We use collection names with "." separating the components:
  core.<collection>, sandbox.<name>.<collection>
All access is mediated by our Django server: permissions stored in Django map users and groups to sandboxes
No cross-collection querying is needed
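A hedged sketch of how such mediated access could look (names assumed; the real permission checks live in Django):

import pymongo

db = pymongo.MongoClient()['mp']

def find_materials(db, user_sandboxes, query):
    # Query core, then each sandbox the user may read (per Django
    # permissions); merge in Python, so no cross-collection query.
    docs = list(db['core.materials'].find(query))
    for name in user_sandboxes:
        docs.extend(db['sandbox.%s.materials' % name].find(query))
    return docs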
40
Where we (and science) are going with all this..
41
Analytics: smarter, better data mining
42
Share data across disciplines
Use MongoDB as a flexible metadata store that connects data/metadata
Store and search user-generated ontologies
43
Organize and search PBs of data files
Currently stored in file hierarchies:
  /projectX/experimentA/tuesday/run99/foobar.hdf5
Move the metadata to MongoDB:
  {'project': 'X', 'experiment': 'A', 'run': 99,
   'date': '2013-05-10T12:31:56', 'user': 'joe', ....
   'path': '/projectX/d67001A3.hdf5'}
Already doing this for some projects
  one-offs: is a general layer possible?
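A sketch of the payoff (database and collection names assumed): finding one day's runs becomes a query instead of a directory walk.

import pymongo

db = pymongo.MongoClient()['filemeta']  # assumed names

# ISO-8601 date strings, as stored above, compare correctly as strings.
day_runs = db.files.find({'project': 'X',
                          'experiment': 'A',
                          'date': {'$gte': '2013-05-10',
                                   '$lt': '2013-05-11'}})
paths = [doc['path'] for doc in day_runs]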
44
Thank you
Contact info
Shreyas Cholia scholia@lbl.gov
Dan Gunter dkgunter@lbl.gov
Thanks to these people for slide material
Anubhav Jain, Kristin Persson, David Skinner (LBNL)
Thanks to Materials Project contributors
http://materialsproject.org/contributors
45


Editor's Notes

  • Science is now a collaborative effort: not really the lone scientist working in his basement; instead, think large teams collaborating on big data sets while supercomputers crunch the numbers.
  • Big Data: originally "compute everything first"; our materials science colleagues were horrified at the notion of being replaced by machines. "Compute (everything) first, ask questions later" is a very powerful concept, as we'll see in a minute: instead of trying to derive a solution and compute the result, you compute the space of all possibilities and look for the optimal result in there.
  • This is a Material!: show clickable REST link.
  • The data is our "haystack": word cloud showing frequencies of elements in our database. Is this analogy true?
  • Sample collections in DB: NOTE: include a diagram of the collections we store (and possibly some relationships).
  • Data Examples – Fe2O3: show clickable REST link.