MongoDB San Francisco 2013: MongoDB for Collaborative Science presented by Dan Gunter, Computer Scientist, LBNL and Shreyas Cholia, Computer Systems Engineer, NERSC/LBNL

Scientific data sets are messy (loose data structures, evolving schemas) and large. MongoDB is becoming increasingly popular in the scientific computing space for precisely these reasons. We discuss the advantages of using MongoDB in scientific computing, and describe how we've built the Scientific Computing infrastructure for The Materials Project using MongoDB. We also discuss "warts" in the MongoDB implementation that affect our choices of how and when to use it.

Published in: Technology

  1. MongoDB for Collaborative Science. Shreyas Cholia, Dan Gunter (LBNL)
  2. About Us: We are computer scientists and engineers at Lawrence Berkeley Lab. We work with science teams to help build software and computing infrastructure for doing awesome SCIENCE.
  3. Talk overview: Background: community science and data science in general, materials in particular; how (and why) we use MongoDB today; things we would like to do with MongoDB in the future; conclusions.
  4. Science is now a collaborative effort: large teams of people; lots of computational power; science.
  5. Big Data: Science is increasingly data-driven. Computational cycles are cheap. Take an "-omics" approach to science: compute all interesting things first, ask questions later.
  6. The Materials Project: An open science initiative that makes available a huge database of computed materials properties for all materials researchers.
  7. Why we care about "materials": solar PV; electric vehicles; other: waste heat recovery (thermoelectrics), hydrogen storage, catalysts/fuel cells.
  8. What do we mean by a material?
  9. This is a Material! https://www.materialsproject.org/materials/24972/
     { "created_at": "2012-08-30T02:55:49.139558",
       "version": { "pymatgen": "2.2.1dev", "db": "2012.07.09", "rest": "1.0" },
       "valid_response": true,
       "copyright": "Copyright 2012, The Materials Project",
       "response": [ {
         "formation_energy_per_atom": -1.7873700829440011,
         "elements": [ "Fe", "O" ], "band_gap": null,
         "e_above_hull": 0.0460096124999998, "nelements": 2,
         "pretty_formula": "Fe2O3", "energy": -132.33005625,
         "is_hubbard": true, "nsites": 20, "material_id": 542309,
         "unit_cell_formula": { "Fe": 8.0, "O": 12.0 }, ….
  10. Business as usual: "find the needle in a haystack"
  11. hours
  12. The Materials Project is like having an army to search through the haystack
  13. High-throughput computing is like an army: MATERIALS THEORY, COMPUTERS, WORKFLOW vs. $i\,\frac{d\Psi(\{r_i\};t)}{dt} = \hat{H}\,\Psi(\{r_i\};t)$. Outcomes: Do not synthesize! Put on backburner. Begin further investigation.
  14. The data is our "haystack" (diagram label: Users)
  15. Materials Project website: http://materialsproject.org/
  16. Infrastructure: (diagram labels: Submitted Materials; Materials Data; Materials Properties; Supercomputers). Over 10 million CPU hours of calculations in < 6 months. Over 40,000 successful VASP runs (30,000+ materials). Generalizable to other high-throughput codes/analyses. Calculation workflows: supercomputers; codes to run (in sequence); atomic positions.
  17. The Materials Project + MongoDB: We use MongoDB to store data, provenance, and state.
  18. Application Stack: Python Django web service; Pymatgen and other scientific Python libraries; Fireworks + VASP; pymongo; MongoDB under all these layers. Currently at Mongo 2.2.
  19. Powered by MongoDB: Materials Project data is stored in MongoDB: core materials properties; user-generated data; workflow and job data.
  20. Scalability: We use replica sets in a master/slave config. No sharding (yet). But we are doing pretty well with a small MongoDB cluster so far: 4 nodes (2 prod, 2 dev); 128 GB, 24 cores per node. Current data volume: 30,000 compounds, 250 GB. Eventually: millions of compounds, big experimental data.
  21. Why we like MongoDB: flexibility; developer-friendliness; JSON; great for read-heavy patterns.
  22. Flexibility: Our data changes frequently (it's research). This was a major pain point for the old SQL schema. The flip side, chaos, has not been a big problem: in reality, all access is programmatic, so the code implies the schema.
  23. Developer-friendly: We work in small groups (a mix of scientists and programmers) and need something easy to learn and easy to use. Structuring the data is intuitive. Querying the data is intuitive. Querying the data is easy to map to web interfaces. Developed a special language, "Moogle", that translates pretty cleanly to MongoDB filters.
  24. Search Example
  25. Search as JSON: {"nelements": {"$gt": 3, "$lt": 5}, "elements": {"$all": ["Li", "Fe", "O"], "$nin": ["Na"]}}
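
To make the semantics of the slide 25 filter concrete, here is a minimal sketch that writes it as a Python dict and evaluates it against in-memory documents. The `matches` helper is hypothetical and covers only the four operators used on the slide; it is not part of MongoDB, pymongo, or the Materials Project code, and the sample documents are made up for illustration.

```python
# The filter from slide 25 as a Python dict (it maps one-to-one to the JSON).
query = {
    "nelements": {"$gt": 3, "$lt": 5},
    "elements": {"$all": ["Li", "Fe", "O"], "$nin": ["Na"]},
}


def matches(doc, query):
    """Hypothetical helper: evaluate $gt, $lt, $all and $nin in memory."""
    for field, conditions in query.items():
        value = doc.get(field)
        # Treat scalars as one-element lists so array operators work uniformly.
        values = value if isinstance(value, list) else [value]
        for op, arg in conditions.items():
            if op == "$gt":
                ok = value is not None and value > arg
            elif op == "$lt":
                ok = value is not None and value < arg
            elif op == "$all":
                # Every requested item must be present in the array.
                ok = all(item in values for item in arg)
            elif op == "$nin":
                # None of the listed items may be present in the array.
                ok = not any(item in arg for item in values)
            else:
                raise ValueError(f"operator not covered by this sketch: {op}")
            if not ok:
                return False
    return True


# Illustrative documents, not real database records.
with_li = {"nelements": 4, "elements": ["Li", "Fe", "P", "O"]}
with_na = {"nelements": 4, "elements": ["Li", "Fe", "Na", "O"]}
print(matches(with_li, query), matches(with_na, query))
```

With pymongo, the same dict would be passed unchanged to a collection's `find()` method, which is the "direct translation" the slides refer to.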
  26. JSON: JSON is easy (enough) for scientists to read and understand. Direct translation between JSON and the database saves lots of time; there is also a nearly direct mapping to a Python dict. For example: Structured Notation Language, the format for describing the inputs of our computational jobs. Makes it very easy to build an HTTP API on our data: the Materials API.
  27. Read/Write Patterns: Scientific data is usually generated in large runs and must go through validation first. Core data is only updated in bulk during controlled releases, so it is essentially read-only. User-generated data is r/w, but writes need not complete immediately. Workflow data is r/w and has some write issues, since timing matters.
  28. Our use of MongoDB: Use it directly, with no mongomapper or other ORM-like layer. Did write a "QueryEngine" class: specific to our data, not a general ORM.
  29. Sample collections in DB:
       Collection            Record size   Count
       diffraction_patterns  450 KB        38621
       materials             400 KB        30758
       tasks                 150 KB        71463
       crystals              40 KB         125966
  30. Building materials from tasks: select the "blessed" material based on domain criteria.
       if ndocs == 1:
           # only one successful result
           # could be GGA or GGA+U
           blessed_doc = docs[0]
           # Add task_ids and ntask_ids
           blessed_doc["task_ids"] = [blessed_doc["task_id"]]
           blessed_doc["ntask_ids"] = 1
       elif ndocs >= 2:
           # multiple GGA and GGA+U runs
           # select the GGA+U run if it exists
           # else sort by free_energy_per_atom
           ....
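
The selection rule outlined on slide 30 could be completed along the following lines. This is a sketch under stated assumptions, not the project's actual builder: the `run_type` field name, the choice of the lowest `free_energy_per_atom` as the winner, and recording every contributing `task_id` in the multi-run case are all guesses made for illustration.

```python
def select_blessed(docs):
    """Pick one 'blessed' task document from a material's successful runs."""
    if not docs:
        raise ValueError("need at least one successful task document")
    if len(docs) == 1:
        # Only one successful result; it could be GGA or GGA+U.
        blessed = dict(docs[0])
    else:
        # Assumption: runs are tagged in a 'run_type' field.
        ggau_runs = [d for d in docs if d.get("run_type") == "GGA+U"]
        candidates = ggau_runs or docs
        # Assumption: "sort by free_energy_per_atom" means lowest energy wins.
        blessed = dict(min(candidates, key=lambda d: d["free_energy_per_atom"]))
    # Assumption: record every task that was considered for this material.
    blessed["task_ids"] = [d["task_id"] for d in docs]
    blessed["ntask_ids"] = len(docs)
    return blessed


# Illustrative task documents, not real database records.
runs = [
    {"task_id": 1, "run_type": "GGA", "free_energy_per_atom": -6.9},
    {"task_id": 2, "run_type": "GGA+U", "free_energy_per_atom": -6.5},
    {"task_id": 3, "run_type": "GGA+U", "free_energy_per_atom": -6.7},
]
print(select_blessed(runs)["task_id"])
```

Copying the chosen document with `dict(...)` keeps the original task records untouched, which fits a workflow where derived collections are rebuilt from tasks.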
  31. Data Examples – Fe2O3: https://www.materialsproject.org/materials/24972/
     { "created_at": "2012-08-30T02:55:49.139558",
       "version": { "pymatgen": "2.2.1dev", "db": "2012.07.09", "rest": "1.0" },
       "valid_response": true,
       "copyright": "Copyright 2012, The Materials Project",
       "response": [ {
         "formation_energy_per_atom": -1.7873700829440011,
         "elements": [ "Fe", "O" ], "band_gap": null,
         "e_above_hull": 0.0460096124999998, "nelements": 2,
         "pretty_formula": "Fe2O3", "energy": -132.33005625,
         "is_hubbard": true, "nsites": 20, "material_id": 542309,
         "unit_cell_formula": { "Fe": 8.0, "O": 12.0 }, ….
  32. Mongo works for us: Adding new properties is common, so we need a flexible schema. Data is nested and structured: more object-like, less relational. Need a flexible query mechanism to be able to pull out relevant objects. Read-heavy, few writes: core data is only updated in bulk during releases, and user writes are less critical, so we can survive hiccups.
  33. Schema-Last: Mongo provides flexible schemas, but web code expects data in a certain structure. M/R methodology allows us to distill the data into the schema expected by the code. This allows us to evolve the schema while maintaining a more rigorous structure for the frontend.
  34. Fireworks - workflows: Keep all our state in MongoDB: raw inputs (crystals) for the calculations; job specifications, ready and running; output of finished runs. Outputs are loaded in batches back to MongoDB. Real-time updates for workflow status. MongoDB "write concerns" are important here.
  35. Lots of stuff needs to be validated: atomic compound is stable; volume of lattice is positive; time to compute was greater than some minimum; "phase diagram" correct; rough agreement across calculation types; energies agree with experimental results; .....
  36. Validation requirements: Fast enough to use in real time. But not full MongoDB query syntax: too finicky, and tricky especially for "or" types of queries. Want other people to be able to use this; we don't want to become a "mongodb tutor". It is also nice to be general to any document-oriented datastore.
  37. Our validation workflow (diagram labels): YAML constraint specification; Build MongoDB query; Determine which constraints failed for each record; Report; coll.find(); Records; User; from ML, someday..
  38. Example query. YAML constraint specification:
       _aliases:
         - energy = final_energy_per_atom
       materials_dbv2:
         -
           filter:
             - nelements = 4
           constraints:
             - energy > 0
             - spacegroup.number > 0
             - elements size$ nelements
      Command:
       $ mgvv -f mgvv-ex.yaml
      MongoDB query:
       {nelements: 4,
        $or: [{spacegroup.number: {$lte: 0}},
              {final_energy_per_atom: {$lte: 0}},
              {$where: this.elements.length != this.nelements}]}
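
The translation on slide 38 works by negating each constraint, so the resulting query finds the records that violate at least one of them. A minimal sketch of that idea follows. It is not the `mgvv` implementation: the function names are hypothetical, only numeric comparisons are handled, and the `size$` constraint (which the slide maps to a `$where` clause) is deliberately left out.

```python
import re

# Operators used as-is for filters, and their negations for constraints:
# a record fails "x > 0" exactly when x <= 0.
DIRECT = {">": "$gt", ">=": "$gte", "<": "$lt", "<=": "$lte"}
NEGATED = {">": "$lte", ">=": "$lt", "<": "$gte", "<=": "$gt", "=": "$ne"}
EXPR = re.compile(r"\s*([\w.]+)\s*(>=|<=|>|<|=)\s*(-?\d+(?:\.\d+)?)\s*")


def parse(expr, aliases):
    """Split 'field op number' and resolve the field through the alias table."""
    m = EXPR.fullmatch(expr)
    if m is None:
        raise ValueError(f"expression not covered by this sketch: {expr!r}")
    field, op, raw = m.groups()
    value = float(raw) if "." in raw else int(raw)
    return aliases.get(field, field), op, value


def build_query(filters, constraints, aliases=None):
    """Build a MongoDB filter matching records that break any constraint."""
    aliases = aliases or {}
    query = {}
    for expr in filters:
        field, op, value = parse(expr, aliases)
        query[field] = value if op == "=" else {DIRECT[op]: value}
    violations = []
    for expr in constraints:
        field, op, value = parse(expr, aliases)
        violations.append({field: {NEGATED[op]: value}})
    if violations:
        query["$or"] = violations
    return query


print(build_query(
    filters=["nelements = 4"],
    constraints=["energy > 0", "spacegroup.number > 0"],
    aliases={"energy": "final_energy_per_atom"},
))
```

Keeping each violated constraint as its own `$or` branch makes it straightforward to re-check a returned record against the branches one at a time and report which constraints it failed.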
  39. Example result
  40. Sandboxes (diagram labels): Unified View; Materials Project Core database; Private Sandbox Data.
  41. Sandbox implementation: We pre-build the sandboxes the same way we pre-build all the other derived collections. We are using collection names with "." to separate components: core.<collection>, sandbox.<name>.<collection>. All access is mediated by our Django server: permissions are stored in Django and map users and groups to access to the sandboxes. No cross-collection querying is needed.
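
The naming convention on slide 41 can be sketched as a small resolver that a web layer might call before querying. Everything here is illustrative: the function names are hypothetical, and the plain dict of user-to-sandbox permissions is only a stand-in for whatever the Django permission tables actually hold.

```python
def collection_name(collection, sandbox=None):
    """Return the dotted collection name for core or sandbox data."""
    if sandbox is None:
        return f"core.{collection}"
    return f"sandbox.{sandbox}.{collection}"


def resolve(collection, user, permissions, sandbox=None):
    """Resolve a collection name, refusing sandboxes the user cannot access.

    `permissions` maps a user name to the set of sandbox names they may read
    (an assumption standing in for the Django-side permission model).
    """
    if sandbox is not None and sandbox not in permissions.get(user, set()):
        raise PermissionError(f"{user} has no access to sandbox {sandbox!r}")
    return collection_name(collection, sandbox)


# Illustrative permissions; user and sandbox names are made up.
permissions = {"joe": {"batteries"}}
print(resolve("materials", "joe", permissions))
print(resolve("materials", "joe", permissions, sandbox="batteries"))
```

Because each request resolves to exactly one pre-built collection, this layout is consistent with the slide's point that no cross-collection querying is needed.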
  42. Where we (and science) are going with all this..
  43. Analytics: Smarter, better data mining
  44. Share data across disciplines: Use MongoDB as a flexible metadata store that connects data/metadata. Store and search user-generated ontologies.
  45. Organize and search PBs of data files: Currently store in file hierarchies: /projectX/experimentA/tuesday/run99/foobar.hdf5. Move the metadata to MongoDB: {project: X, experiment: A, run: 99, date: 2013-05-10T12:31:56, user: joe, .... path: /projectX/d67001A3.hdf5}. Already doing this for some projects; these are one-offs: is a general layer possible?
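
The move described on slide 45, from metadata encoded in directory names to a searchable document, might look like the sketch below. The function name and the fixed five-level layout are assumptions based only on the single example path on the slide; the user, timestamp, and new storage path cannot be derived from the old path, so they are passed in explicitly.

```python
from datetime import datetime
from pathlib import PurePosixPath


def strip_prefix(name, prefix):
    """Remove a required prefix such as 'project' or 'run' from a path part."""
    if not name.startswith(prefix):
        raise ValueError(f"expected {name!r} to start with {prefix!r}")
    return name[len(prefix):]


def metadata_from_path(old_path, stored_path, user, date):
    """Turn /project<P>/experiment<E>/<day>/run<N>/<file> into a metadata dict."""
    parts = PurePosixPath(old_path).parts[1:]  # drop the leading "/"
    if len(parts) != 5:
        raise ValueError(f"layout not covered by this sketch: {old_path!r}")
    project, experiment, _day, run, _filename = parts
    return {
        "project": strip_prefix(project, "project"),
        "experiment": strip_prefix(experiment, "experiment"),
        "run": int(strip_prefix(run, "run")),
        "date": datetime.fromisoformat(date),
        "user": user,
        "path": stored_path,
    }


doc = metadata_from_path(
    "/projectX/experimentA/tuesday/run99/foobar.hdf5",
    stored_path="/projectX/d67001A3.hdf5",
    user="joe",
    date="2013-05-10T12:31:56",
)
print(doc)
```

A document of this shape can be inserted into a MongoDB collection as-is and then searched by project, experiment, run, user, or date instead of by walking the directory tree.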
  46. Thank you. Contact info: Shreyas Cholia, scholia@lbl.gov; Dan Gunter, dkgunter@lbl.gov. Thanks to these people for slide material: Anubhav Jain, Kristin Persson, David Skinner (LBNL). Thanks to Materials Project contributors: http://materialsproject.org/contributors
