ioChem-BD, a solution for managing Big Data in Computational Chemistry



Carles Bo, of ICIQ, presents ioChem-BD, a data repository for computational chemistry. The goal is to build a database in a standardized way, defining the processes involved, what is stored, and how it is stored.

This presentation was given at TSIUC'14, held at the Universitat Autònoma de Barcelona on 2 December 2014, under the title "Challenges in Big Data at the University and in Research".

Published in: Technology
  1. A solution for managing Big Data in Computational Chemistry. TSIUC'14, Universitat Autònoma de Barcelona, 2-XII-2014. Carles Bo, ICIQ - URV. Computational Chemistry. Slides dated 30/11/14.
  2. Computational Chemistry: taking the experiment to cyberspace. Nobel Prize in Chemistry 2013 (1981, 1998). Popular science background, "Taking the experiment to cyberspace": Chemical reactions occur at lightning speed; electrons jump between atoms hidden from the prying eyes of scientists. The Nobel Laureates in Chemistry 2013 have made it possible to map the mysterious ways of chemistry by using computers. Detailed knowledge of chemical processes makes it possible to optimize catalysts, drugs and solar cells. Chemists all over the world devise and carry out experiments on their computers on a daily basis. With the help of the methods that Martin Karplus, Michael Levitt and Arieh Warshel began to develop in the 1970s, they examined every tiny little step in complex chemical processes that are invisible to the naked eye. In order for you, the reader, to get an idea of how mankind can benefit from this, we begin with an example. Put your lab coat on, because we have a challenge for you: to create artificial photosynthesis. The chemical reaction occurring in green leaves fills the atmosphere with oxygen and is one prerequisite for life on Earth. But it is also interesting from an environmental perspective. If you can mimic the photosynthesis you will be able to create more efficient solar cells. When water molecules are split oxygen is created, but also hydrogen that could be used to power our vehicles. So there is ample reason for you to get engaged in this project. If you succeed, you could contribute to solving the problem with greenhouse effect. Nobel Prize® is a registered trademark of the Nobel Foundation. Figure 1. Today chemists experiment just as much on their computers as they do in their labs. Theoretical results from computers are confirmed by real experiments that yield new clues to how the world of atoms works. Theory and practice cross-fertilize each other. Permanent storage. Certify results. Re-use results.
  3. Our Big Data Problem (1): help researchers in their daily tasks (manage and store results, apps and tools). Our Big Data Problem (2): manage files of former group members.
  4. Our Big Data Problem (3): Supporting Information files. Certify results, reuse results. Yes, Comp Chem is a Big Data problem.
  5. 5★ Open Data (Tim Berners-Lee). OL: open license; OF: open format; LD: linked; RE: readable data; URI: accessible. Present (diagram labels): scientists submit jobs; data collection: manual; reports (pdf files): manual; HPC; files: terabytes, >95% waste; publishers: files; public: information.
  6. Future (diagram labels): scientists submit jobs; workflows; data collection: automated; reports (XML): automated; cloud HPC, HPC on demand; results databases (XML); publishers: information; public: files, information. ioChem-BD (diagram labels): scientists submit jobs; data collection: manual; reports (XML): automated; HPC; results databases (XML); publishers: files; public: files, information.
  7. 5★ Open Data (Tim Berners-Lee). Present; ioChem-BD. Definition: ioChem-BD is a digital repository that manages and stores computational chemistry files (inputs and outputs). It fills the gap between the generation of results and the publication of manuscripts, and raises data to 5★ quality. Created by the fusion of previous projects:
  8. Goals
     • Build a distributed database of computational chemistry results: reduce size and increase value.
     • Set a common data standard among all quantum chemistry legacy formats (XML - CML).
     • Become a daily tool for data management, search and manipulation.
     • Redefine workflows: storing results and publishing, open data.
     • Be open to adding future functionalities for data manipulation and analysis.
     ioChem-BD features
     • Dynamic, independent templates for data extraction and data display
     • Data representation set on top of priorities (XML-CML)
     • Responsive design (any device is able to render our content)
     • Data easily exportable to other formats
     • Secure connections
     • Fully compliant with the latest web standards
  9. Performance of our new extraction library. [Chart: conversion time vs. file size, plain text to CompChem CML. Series: jumbo-converters, jumbo-saxon, jumbo-saxon with keep field. Y axis: parsing time (s). X axis: file size (kB), from 112.73 to 68,328.04. Annotations: ≈14x, ≈4x.] User interfaces (diagram labels): Shell, Web; upload, convert, store; user files (input/output); conversion templates; search; Create & Browse; manage, convert, share, publish.
  10. Workflow steps (1): Create. Results files are uploaded from the user's disk space.
     • Create: shell client
     • Create: web interface
     • Certify results (True Data)
     • Validation (Convergence WF, Geometries)
     Create: shell client
  11. Create: shell client
     Basic commands
       Command           Description
       start-rep-shell   Connect to repository (mandatory)
       exit-rep          Disconnect from repository
       lspro             List current path contents
       pwdpro            Print current path
     Project-related commands
       Command           Description
       catpro            Display project information
       cdpro             Change to project
       cpro              Create a new project
       mpro              Modify a project
       dpro              Delete a project
       findpro           Find project by its name (regex allowed)
     Calculation-related commands
       Command           Description
       loadcalc          Load calculation into repository
       viewcalc          View calculation information
     Create: web interface
  12. Workflow steps (2): Create. The Create module manages results and facilitates advanced data treatment. Create: web interface
     • Manage
       – Post-processing
       – Organize project collections
       – Enrich data: description, keywords, additional files
       – Reports: generate Sup. Info. files (pdf) for publishing
       – Reaction energy paths
       – Consistency (level of theory)
       – Thermodynamic corrections
       – Kinetic analysis (TOF, % e.e.)
       – Molecular descriptors (QSAR)
       – etc.
  13. Workflow steps (3): Browse. Results can then be published and made available for viewing and downloading by the general public in the Browse module. Handle URL generator. Rich XML Supporting Information files. Linked to a published manuscript. Browse: web interface
  14. Current project status
     • Private & Demo servers up ( www.iochem-
     • Supported formats:
       – Gaussian, ADF, VASP
       – Molcas (50%)
     • Testing integrity (user-driven tests)
     • Checking data captured and displayed
     • Reports module (50%)
     • To do: syndicate distributed browsers, links to external databases, …
     Acknowledgements: Moises Álvarez; N. Lopez, F. Maseras, J. M. Poblet, C. De Graaf
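
Slide 8 names XML/CML as the common standard for results coming from legacy quantum chemistry formats. As a rough illustration of what such a conversion involves, the sketch below turns a plain-text block of Cartesian coordinates into a small CML-style `molecule` element using only Python's standard library. The input layout, the function name `xyz_block_to_cml`, the molecule id and the water geometry are assumptions made for this example. It is not ioChem-BD's extraction library, which the slides describe only as template-based (jumbo-converters, jumbo-saxon).

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"  # CML namespace URI


def xyz_block_to_cml(text, mol_id="m1"):
    """Convert lines of 'Element x y z' into a CML-style <molecule> element.

    The input layout is an assumption for this sketch; real program outputs
    (Gaussian, ADF, VASP, ...) need format-specific parsing.
    """
    # Serialize CML elements without a prefix (default namespace).
    ET.register_namespace("", CML_NS)
    molecule = ET.Element(f"{{{CML_NS}}}molecule", {"id": mol_id})
    atom_array = ET.SubElement(molecule, f"{{{CML_NS}}}atomArray")
    for index, line in enumerate(text.strip().splitlines(), start=1):
        symbol, x, y, z = line.split()
        ET.SubElement(
            atom_array,
            f"{{{CML_NS}}}atom",
            {
                "id": f"a{index}",
                "elementType": symbol,
                "x3": f"{float(x):.6f}",
                "y3": f"{float(y):.6f}",
                "z3": f"{float(z):.6f}",
            },
        )
    return molecule


# Illustrative geometry only (approximate water molecule, in angstrom).
water = """
O 0.000000 0.000000 0.117300
H 0.000000 0.757200 -0.469200
H 0.000000 -0.757200 -0.469200
"""
cml = xyz_block_to_cml(water)
print(ET.tostring(cml, encoding="unicode"))
```

Storing the geometry as attributes on `atom` elements is what makes the result searchable and exportable in a uniform way, whichever program produced the original text file.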
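
Slide 10 lists validation (convergence, geometries) as part of the Create step. A minimal pre-upload check could look like the sketch below. It assumes a Gaussian log file and relies on two strings that Gaussian writes to its output, `SCF Done:` and `Normal termination of Gaussian`. The function name `precheck_gaussian_log`, the returned fields and the sample log are hypothetical; the slides do not describe the checks ioChem-BD actually performs.

```python
import re

# Matches lines such as: " SCF Done:  E(RB3LYP) =  -76.4089533236  A.U. after 10 cycles"
SCF_DONE = re.compile(r"SCF Done:\s+E\((\S+)\)\s+=\s+(-?\d+\.\d+)")


def precheck_gaussian_log(text):
    """Return a small report on a Gaussian log before it is uploaded."""
    energies = [float(match.group(2)) for match in SCF_DONE.finditer(text)]
    return {
        "normal_termination": "Normal termination of Gaussian" in text,
        "scf_energies_found": len(energies),
        "last_scf_energy": energies[-1] if energies else None,
    }


# Made-up sample log, for illustration only.
sample_log = """\
 SCF Done:  E(RB3LYP) =  -76.4089533236     A.U. after   10 cycles
 Normal termination of Gaussian 09 at Tue Dec  2 10:00:00 2014.
"""
report = precheck_gaussian_log(sample_log)
print(report)
```

A check like this only catches jobs that stopped early or never reached an SCF energy; validating geometries would need the parsed structure as well.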
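
Slide 12 mentions kinetic analysis (TOF, % e.e.) among the post-processing options. For the enantiomeric excess, a standard transition-state-theory estimate under kinetic control uses the free-energy gap ΔΔG‡ between the two competing transition states: the rate ratio is exp(ΔΔG‡/RT), which gives e.e. = tanh(ΔΔG‡/2RT). The sketch below evaluates that formula. The function name and the example energy gaps are assumptions, and the slides do not say how ioChem-BD computes this quantity.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def enantiomeric_excess(ddg_kcal_mol, temperature_k=298.15):
    """Percent e.e. from the free-energy gap between two competing transition states.

    Assumes kinetic control, so the product ratio equals the rate ratio
    exp(ddG / RT); then e.e. = (ratio - 1) / (ratio + 1) = tanh(ddG / 2RT).
    """
    return 100.0 * math.tanh(ddg_kcal_mol / (2.0 * R_KCAL * temperature_k))


# Example gaps in kcal/mol, chosen only for illustration.
for gap in (0.0, 1.0, 2.0):
    print(f"ddG = {gap:.1f} kcal/mol -> e.e. = {enantiomeric_excess(gap):.1f} %")
```

Because the dependence is exponential, errors of a few tenths of a kcal/mol in the computed barriers change the predicted e.e. noticeably, which is one reason a consistent level of theory across a project matters.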