How Web APIs and Data Centric Tools Power the Materials Project (PyData SV 2013)


Published in: Technology

  1. 1. DATA AS A SERVICE
     How Web APIs and Data-Centric Tools Power the Materials Project
     Shreyas Cholia, Dan Gunter
     Lawrence Berkeley National Laboratory
     PyData 2013
  2. 2. Outline
     • Data-driven science
     • Materials Project overview
     • Open data and APIs
     • Dropping APIs on your data
     • Things to think about in your API
     • Writing libraries for code AND data (pymatgen REST interface)
     • Science stories to back this up
     • IPython notebook demo
  3. 3. About Us
     • Dan and Shreyas are Computer Scientists/Engineers at Berkeley Lab.
     • We work with science teams to help build software and computing infrastructure that facilitates awesome SCIENCE
  4. 4. Science
     • Science is now a collaborative effort
     • Large teams of people
     • Lots of computational power
  5. 5. The Fourth Paradigm
  6. 6. Big Data
     • Science is increasingly data-driven
     • Computational cycles are cheap
     • Take an –omics approach to science
     • Compute all interesting things first, ask questions later
  7. 7. The –omics approach
     • Instead of trying to derive a solution and compute the results, just compute the space of all possibilities and look for the optimal result in there.
     • So we are generating more data than we know what to do with, but that is OK
     • (and might be a topic for another talk …)
  8. 8. The Materials Project
     An open science initiative that makes available a huge database of computed materials properties for all materials researchers.
     [Figure: word cloud showing frequencies of elements in the Materials Project database, except oxygen, which appears 12,751 times (3.5x as much as the next most frequent, phosphorus)]
  9. 9. The Materials Project
  10. 10. "Need for speed" in new materials
      18 years from creation to commercial manufacture!
      [Figure: timeline, 1950 to 2000, from invention to commercial manufacture for Teflon, titanium, Velcro, polycarbonate, GaAs, diamond-like thin films, and lithium ion (S. Whittingham; Sony)]
      Materials data from: Eagar, T.; King, M. Technology Review (00401692) 1995, 98, 42.
  11. 11. Materials have strategic importance
      2010 "Senkaku Boat Collision Incident"
      • Sept 7, 2010: Japan arrests Chinese boat captain after collision in disputed waters
      • Sept 22, 2010: China blocks shipments of rare earth metals to Japan
      • Sept 24, 2010: Japan releases captain
      • Japan invests in induction motors … coincidence? “Toyota Readying Motors That Don’t Use Rare Earths…” (Jan 14, 2011 1:50 PM PT)
      Content for this slide courtesy Gerbrand Ceder, MIT & Kristin Persson, LBNL
  12. 12. Solution: Computation
      Many materials properties can be computed
      [Figure: computed vs. experimental (literature) voltage (V), on a 0 to 4.5 V axis, across stages I, II, and III]
      ΔH = [ E(X) + E(Y) ] – E(XY)
      Photovoltaics, Thermoelectrics, Energy Storage, Hydrogen, Catalysts, Magnets….
  13. 13. Infrastructure
      [Diagram labels: Submitted Materials; Calculation Workflows (atomic positions, codes to run in sequence); Supercomputers; Materials Data; Materials Properties]
      • Over 10 million CPU hours of calculations in < 6 months
      • Over 40,000 successful VASP runs (30,000+ materials)
      • Generalizable to other high-throughput codes/analyses
  14. 14. Computation
      [Figure: number of runs by state (Failed, Successful, Total) over time, Jul 2011 to Oct 2012, on a 0 to 60,000 axis]
      • Run VASP on NERSC supercomputing resources
      • Use Fireworks to manage large groups of runs
      • Results in … data for MP
      Just sit back and enjoy the automation..
  15. 15. Data Islands
      • Data is still heavily siloed and inaccessible.
      • Data sits on a machine somewhere, and you give people local ssh/DB accounts to access it.
      • Good luck combining multiple datasets
      • Does not scale!
      • This is 2013 – we can do better!
  16. 16. Sharing (your data) is important!
      • The Most Important Scientific Result Published in the Last Year:
        J.M. Wicherts, M. Bakker, and D. Molenaar: Willingness to Share Research Data Is Related to the Strength of the Evidence and the Quality of Reporting of Statistical Results. PLoS ONE, 6(11): e26828, 2011, doi:10.1371/journal.pone.0026828.
      Content for this slide courtesy Greg Wilson, Software Carpentry
  17. 17. Data Sharing
      • Open access to data through programmatic interfaces
      • Sub-select the data on demand rather than pulling down the entire dataset
      • Use your own local tools with centrally managed data
      • Everyone sees the same data – better collaboration
  18. 18. Web portal
      • Materials data stored in a MongoDB database
      • Web portal makes materials data easily accessible
        • Materials Explorer
        • Phase Diagrams
        • Crystal Toolkit
        • Battery Explorer
        • Reaction Calculator
        • Structure Predictor
      • Focus on a highly functional and usable website to query materials data. (We heart Django!)
      • Additionally, we distribute the tools used to compute and analyze the data as an open source library – pymatgen
  19. 19. API access
      • But we quickly found that scientists wanted programmatic access to data
      • E.g., give me property X for all materials with Li and O so that I can pass it through my own codes
      • Lesson – make your data available through an API and people will start to do amazing things
  20. 20. Why Web APIs?
      • Big push towards HTTP APIs across the web.
      • Web APIs give developers programmatic access to data and resources over the web
      • Access to data as well-defined objects allows users to develop their own custom applications and code
      Enables a thriving COMMUNITY built around data.
  21. 21. What is the Materials API?
      • An open platform for accessing Materials Project data over the web.
      • Flexible and scalable, to cater to a large number of collaborators with different access privileges.
      • Simple to use and code agnostic.
  22. 22. HTTP API design
      The Materials API maps URLs to data objects. The URL is made up of:
      • Unique identifier, e.g. a formula (Fe2O3), id (1234) or chemical system (Li-Fe-O)
      • Data type (vasp, exp, etc.)
      • Property
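
The URL pattern on this slide can be sketched as a small helper. This is only an illustration: the base URL (`https://example.org/rest/v1`), the `materials` path segment, and the function name `materials_url` are assumptions, since the transcript does not show the real endpoint.

```python
from urllib.parse import quote

# Hypothetical base URL: the real endpoint is not shown in the transcript.
BASE_URL = "https://example.org/rest/v1"


def materials_url(identifier, data_type, prop, base_url=BASE_URL):
    """Build <base>/materials/<identifier>/<data_type>/<property>.

    identifier: formula ("Fe2O3"), id ("1234") or chemical system ("Li-Fe-O")
    data_type:  e.g. "vasp" or "exp"
    prop:       the property to fetch, e.g. "energy"
    """
    # Percent-encode each segment so unusual characters cannot break the path.
    parts = [quote(str(p), safe="") for p in (identifier, data_type, prop)]
    return "/".join([base_url.rstrip("/"), "materials"] + parts)


print(materials_url("Fe2O3", "vasp", "energy"))
print(materials_url("Li-Fe-O", "exp", "energy"))
```

Keeping the three parts as separate arguments mirrors the slide: the same function serves a formula, an id, or a chemical system without special cases.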
  23. 23. Access via an API key
      • To maintain privileged access, each user has an associated API key (with certain defined access privileges).
      • To get your key, login to and go
      • All MP HTTPS requests must supply the API key as:
        • An x-api-key header, e.g., {'X-API-KEY': 'MYKEY'}, or
        • A GET or POST variable, e.g., {'API_KEY': 'MYKEY'}
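
The two ways of supplying the key might look like this with Python's standard library. This is a sketch under assumptions: the URL and key are placeholders, the helper names are made up for illustration, and the requests are only constructed, never sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical URL and key, used only for illustration.
URL = "https://example.org/rest/v1/materials/Fe2O3/vasp/energy"
API_KEY = "MYKEY"


def request_with_header(url, api_key):
    """Option 1: send the key in an X-API-KEY header."""
    return Request(url, headers={"X-API-KEY": api_key})


def request_with_query(url, api_key):
    """Option 2: send the key as a GET variable named API_KEY."""
    return Request(url + "?" + urlencode({"API_KEY": api_key}))


header_req = request_with_header(URL, API_KEY)
query_req = request_with_query(URL, API_KEY)

# urllib stores header names capitalized, so the lookup key is "X-api-key".
# HTTP header names are case-insensitive, so the server sees the same header.
print(header_req.get_header("X-api-key"))
print(query_req.full_url)
```

The header form keeps the key out of URLs, which tend to end up in logs and browser history; the GET variable is handy for quick experiments.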
  24. 24. Sample JSON output
      GET
      {
        "created_at": "2013-03-17T09:14:58.158081",
        "valid_response": true,
        "version": {
          "pymatgen": "2.5.4",
          "db": "2013.02.25",
          "rest": "1.0"
        },
        "response": [
          {"energy": -132.33005625, "material_id": 542309},
          {"energy": -66.62512425, "material_id": 24972}
        ],
        "copyright": "Copyright 2012, The Materials Project"
      }
      Just the energy and the id of the material
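
A response like the one above is easy to consume with the standard `json` module. The sketch below reuses the sample payload from the slide; the helper name `energies_by_id` is an assumption made for illustration.

```python
import json

# The sample response from the slide, reproduced as a string.
raw = """
{"created_at": "2013-03-17T09:14:58.158081",
 "valid_response": true,
 "version": {"pymatgen": "2.5.4", "db": "2013.02.25", "rest": "1.0"},
 "response": [{"energy": -132.33005625, "material_id": 542309},
              {"energy": -66.62512425, "material_id": 24972}],
 "copyright": "Copyright 2012, The Materials Project"}
"""


def energies_by_id(payload):
    """Return {material_id: energy}; raise if the response is flagged invalid."""
    if not payload.get("valid_response"):
        raise ValueError("API reported an invalid response")
    return {item["material_id"]: item["energy"] for item in payload["response"]}


energies = energies_by_id(json.loads(raw))
for material_id, energy in sorted(energies.items()):
    print(material_id, energy)
```

Checking `valid_response` before touching `response` keeps error handling in one place instead of scattering it through analysis code.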
  25. 25. Getting started – Hello World API
      > pip install flask
  26. 26. Our dirty little secret
      • It involves a certain language that ends with “uby” that we don’t like to talk about in these parts
      • Version 0.0.0 of the Materials Project was coded in Sinatra
      • Sinatra is a microframework much like Flask
      • But it proves that this approach is viable and can be the on-ramp to more amazing things.
  27. 27. Un-considerations
      • Don’t worry too much about pure REST
        • Initially just think of how URLs and verbs can map to functions
      • Don’t worry too much about data formats
        • JSON is easy and a great place to start
        • Feel free to avoid XML unless you really need it
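
"URLs and verbs map to functions" can be shown without any framework. The sketch below is plain standard-library Python, not Flask or Sinatra and not the Materials Project's code; the routes, handler names, and energy values are all placeholders invented for illustration.

```python
import json
import re

# Placeholder numbers standing in for a real database; not real energies.
ENERGIES = {"Fe2O3": -1.0, "Li2O": -2.0}


def list_materials():
    """Handler for GET /materials."""
    return 200, {"materials": sorted(ENERGIES)}


def get_energy(formula):
    """Handler for GET /materials/<formula>/energy."""
    if formula not in ENERGIES:
        return 404, {"error": "unknown material: " + formula}
    return 200, {"formula": formula, "energy": ENERGIES[formula]}


# Each route is (HTTP verb, URL pattern, function).
ROUTES = [
    ("GET", re.compile(r"^/materials$"), list_materials),
    ("GET", re.compile(r"^/materials/(?P<formula>[^/]+)/energy$"), get_energy),
]


def dispatch(verb, path):
    """Call the first handler whose verb and pattern match; reply in JSON."""
    for route_verb, pattern, handler in ROUTES:
        match = pattern.match(path)
        if route_verb == verb and match:
            status, body = handler(**match.groupdict())
            return status, json.dumps(body, sort_keys=True)
    return 404, json.dumps({"error": "no route"})


print(dispatch("GET", "/materials"))
print(dispatch("GET", "/materials/Fe2O3/energy"))
```

Microframeworks wrap this same idea in decorators and add the HTTP plumbing, which is why a first version of an API can stay this small.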
  28. 28. Our Stack
      • Apache + mod_wsgi
      • Django
      • pymatgen
      • pymongo + MongoDB
  29. 29. pymatgen
      • The open source Python library that powers the Materials Project.
      • Defines core Python objects for materials data representation.
      • Provides a well-tested set of structure and thermodynamic analysis tools relevant to many applications.
      • Establishes an open platform for researchers to collaboratively develop sophisticated analyses of materials data obtained both from first principles calculations and experiments.
  30. 30. Integration with pymatgen
      [Diagram labels: The Materials API; Powerful Materials Analytics Tool]
  31. 31. Where we’re going with this
      • Libraries that integrate data with computation!
      • The scientific Python ecosystem has a ton of data analysis tools and libraries
      • Just starting to think about baking datasets directly into these tools
      • Pymatgen allows you to access core MP data directly from the library
  32. 32. Compute + data
      pymatgen has hooks into the materials data, so you can do things like this:
          entries = api.get_entries_in_chemsys(['Li', 'Fe', 'O'])
      It also has computational tools that you can then use to act on the data:
          pd = PhaseDiagram(entries)
  33. 33. Blurring the lines
      • Yes – we are blurring the lines between compute and data
      • But this is not a new idea
      • Think of all the tools built around commercial APIs
      • Twitter, Netflix etc. – Python clients built around the API
  34. 34. Write First Class Science Functions
      • Web APIs are extremely useful, but ultimately you want to encapsulate core science functionality as Python functions so that scientists aren’t worrying about things like "How do I set the X-API-KEY header?"
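
One way to hide the HTTP details behind a science-level function is a thin client class. The sketch below is hypothetical: `MaterialsClient`, `get_energies`, the base URL, and the path layout are assumptions made for illustration, and this is not how pymatgen's `MPRester` is implemented. The transport is injectable so the example runs offline with a canned response.

```python
import json
from urllib.request import Request, urlopen


class MaterialsClient:
    """Hypothetical wrapper that keeps API-key handling out of user code."""

    def __init__(self, api_key, base_url="https://example.org/rest/v1", fetch=None):
        self.api_key = api_key
        self.base_url = base_url.rstrip("/")
        # fetch(url, headers) -> bytes; replaceable so the class works offline.
        self._fetch = fetch or self._http_fetch

    @staticmethod
    def _http_fetch(url, headers):
        with urlopen(Request(url, headers=headers)) as resp:
            return resp.read()

    def _get(self, path):
        # The only place that knows about the X-API-KEY header.
        raw = self._fetch(self.base_url + path, {"X-API-KEY": self.api_key})
        payload = json.loads(raw)
        if not payload.get("valid_response"):
            raise ValueError("API reported an invalid response")
        return payload["response"]

    def get_energies(self, formula):
        """Science-level call: energies of all entries matching a formula."""
        path = "/materials/%s/vasp/energy" % formula
        return [item["energy"] for item in self._get(path)]


def fake_fetch(url, headers):
    # Canned reply with placeholder values, so no network access is needed.
    body = {"valid_response": True, "response": [{"energy": -1.5, "material_id": 1}]}
    return json.dumps(body).encode()


client = MaterialsClient("MYKEY", fetch=fake_fetch)
print(client.get_energies("Fe2O3"))
```

A scientist using such a class calls `get_energies` and never sees headers, URLs, or JSON, which is the point the slide is making.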
  35. 35. Sample use cases
      • Screening for CO2 sorbents (with Clare Grey)
        • Using the Materials API (MAPI) + pymatgen to calculate reaction energies of thousands of oxides with CO2.
      • Calculation of XAFS, XANES and other spectra for clusters of atoms (with Alan Dozier)
        • Alan wrote an io add-on to pymatgen for FEFF input/output.
        • Uses MAPI + pymatgen to extract structures.
      • Defects (with Maciej Haranczyk)
        • Uses MAPI + pymatgen to pull structures and perform Voronoi analysis to find possible interstitial sites.
  36. 36. IPython Notebook Examples
  37. 37. API + pymatgen example
          from import MPRester
          # PhaseDiagram and PDPlotter come from pymatgen's phase diagram
          # module; the import path depends on the pymatgen version.

          # This initializes the REST adaptor. Put your own API key in.
          a = MPRester("YOUR_API_KEY")

          # This gives you the Structure corresponding to material id 2254
          # in the Materials Project.
          structure = a.get_structure_by_material_id(2254)

          # Entries are the basic unit for thermodynamic and other analyses
          # in pymatgen. This gets all entries belonging to the Ca-C-O system.
          entries = a.get_entries_in_chemsys(["Ca", "C", "O"])

          # With entries, you can do many sophisticated analyses,
          # like creating phase diagrams.
          pd = PhaseDiagram(entries)
          plotter = PDPlotter(pd)
  38. 38. Sandboxes
      • A virtual private dataset
      • Useful for
        • Everyone, as a sort of "scratch" space
        • Industry partners who want to use the tools but not share their data
  39. 39. Import format: Structure Notation Language (SNL)
      • Contains a structure/molecule object and provenance
      • The "about" block holds: created_at, authors, projects, references, remarks, data, history
      Another way to remember the acronym..
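
The shape of such a record can be sketched with dataclasses. Only the field names come from the slide; the class names, the field types, and the use of a plain dict for the structure are guesses made for illustration, and this is not pymatgen's own SNL class.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, List


def _now():
    """Current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


@dataclass
class About:
    """Provenance block; field names from the slide, types assumed."""
    authors: List[str]
    created_at: str = field(default_factory=_now)
    projects: List[str] = field(default_factory=list)
    references: str = ""
    remarks: List[str] = field(default_factory=list)
    data: Dict[str, Any] = field(default_factory=dict)
    history: List[Dict[str, Any]] = field(default_factory=list)


@dataclass
class SNLRecord:
    """A structure (here just a dict stand-in) plus its provenance."""
    structure: Dict[str, Any]
    about: About

    def as_dict(self):
        return asdict(self)


record = SNLRecord(
    structure={"formula": "Fe2O3"},  # placeholder for a real structure object
    about=About(authors=["A. Researcher"], projects=["demo"]),
)
print(sorted(record.as_dict()["about"]))
```

Carrying provenance alongside the structure means that who submitted a material, and why, travels with the data through the calculation workflow.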
  40. 40. Fireworks
      • FireWorks is a code for defining, managing, and executing scientific workflows
      • It can be used to automate most types of calculations over arbitrary computing resources, including those that have a queueing system
      • It is very dynamic: FireWorks can beget other FireWorks at runtime
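
The "dynamic" idea, where a job can create more jobs while the workflow is running, can be illustrated with a toy queue. This sketch does not use the FireWorks API; `run_workflow` and the task names are made up for illustration.

```python
from collections import deque


def run_workflow(initial_tasks):
    """Run (name, function) tasks in FIFO order.

    A task function may return a list of new (name, function) tasks,
    which are added to the queue at runtime.
    """
    queue = deque(initial_tasks)
    log = []
    while queue:
        name, func = queue.popleft()
        log.append(name)
        new_tasks = func() or []  # None means "no follow-up work"
        queue.extend(new_tasks)
    return log


def relax():
    # Pretend the relaxation finished and now needs two follow-up jobs.
    return [("static", lambda: None), ("band_structure", lambda: None)]


print(run_workflow([("relax", relax)]))
```

A real workflow manager adds persistence, retries, and job submission to queueing systems on top of this basic loop.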
  41. 41. Pymatgen-db
      • Sick of MongoHub et al.? We were. So we wrote a simple Web UI using prettytable, pymatgen, and Django
      • We proceeded to use it for deep scientific inquiry
  42. 42. We’re not the only ones …
      • Bioinformatics
        • KBase: DOE predictive and systems biology.
      • Astronomy
        • Sloan Digital Sky Survey
      • Spectroscopy
        • Advanced Light Source (ALS), Advanced Photon Source (APS)
      • According to ProgrammableWeb, ~130 others probably many of these are
  43. 43. More information
      • Materials API + pymatgen examples
      • The Materials API wiki
      • Python Materials Genomics
      • Shyue Ping Ong, William Davidson Richard, Anubhav Jain, Geoffroy Hautier, Michael Kocher, Shreyas Cholia, Dan Gunter, Vincent Chevrier, Kristin A. Persson, Gerbrand Ceder. Python Materials Genomics (pymatgen): A Robust, Open-Source Python Library for Materials Analysis. (submitted)
      • These slides:
  44. 44. Takeaways
      • Make scientific data easily available to end-users
      • A friendly, powerful Web UI is a great way to engage, but then..
      • Build APIs around your data to make it easily accessible
      • Write scientific libraries with *both* analysis and data, by hooking them up to APIs.
  45. 45. We’re hiring
      • Talented, science-loving, web-savvy, math-anything Python programming code-slingers who would rather pass a Nobel prize winner on the way to lunch than get free dry-cleaning
        • downside: or even free coffee (groan)
        • upside: some of your tax dollars go towards your own salary!
  46. 46. Contact Us
      • Shreyas Cholia
      • Dan Gunter
      • Materials Project Team