
A Standard Data Format for Computational Chemistry: CSX


An overview of the Common Standard for eXchange (CSX), a new markup language for storing computational chemistry calculation data. CSX stores publication and molecular system metadata along with calculation data and, optionally, the raw input and output files associated with a calculation. The computational chemistry community is invited to participate in the development of CSX. For more information see http://www.chemicalsemantics.com.

Published in: Science

  1. A Standard Data Format for Computational Chemistry: CSX
       • Stuart J. Chalk (1,2), Neil Ostlund (1), Mirek Sopek (1), Bing Wang (1)
       • (1) Chemical Semantics Inc., Gainesville, FL
       • (2) Department of Chemistry, University of North Florida
       • schalk@unf.edu
       • 249th ACS Meeting, Denver, CO – March 2015
  2. Outline
       • Semantic Annotation of Data
       • Current DOE Project
       • Data Transformations
       • Common Standard for eXchange (CSX)
       • CSX a Standard Data Format
       • The CSX Schema
       • CSX - Publishing Information
       • CSX - Molecular System Information
       • CSX - Calculated Result Information
       • Future Plans
       • Conclusion
  3. Semantic Annotation of Data
       • Create a way to ‘teach’ computers what information means, i.e. contextualize the data
       • Example: what is this? 904-620-1938
       • A computer just sees it as a string
       • By using an appropriate semantic definition in RDF (the Resource Description Framework) we can identify to the computer that the text is a phone number (using the Friend of a Friend (FOAF) specification), i.e.
           <foaf:phone rdf:datatype="#string">904-620-1938</foaf:phone>
       • RDF Specification: http://www.w3.org/RDF/
       • FOAF Specification: http://xmlns.com/foaf/spec/
  4. Semantic Annotation of Data
       • RDF can be used to relate information as well as annotate it
       • The following RDF/XML shows how some information is related (XML is the eXtensible Markup Language)
           <rdf:Description rdf:about="http://example.org/StuartChalk">
             <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Person"/>
             <foaf:knows rdf:resource="http://example.org/NeilOstlund"/>
             <foaf:phone rdf:datatype="…#string">904-620-1938</foaf:phone>
           </rdf:Description>
       • Applying this technology to computational chemistry calculations will allow integration of the calculation and results with data about chemicals from other sources
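The RDF/XML on slide 4 can be read as a set of subject, predicate, object statements. The following is a minimal sketch of that reading using only Python's standard `xml.etree.ElementTree`. It assumes flat `rdf:Description` blocks like the one on the slide, wraps the snippet in an `rdf:RDF` root with namespace declarations (not shown on the slide), and leaves out the `rdf:datatype` attribute because the slide elides its value. The helper name `extract_triples` is illustrative only, and a real application would more likely use a dedicated RDF library.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FOAF_NS = "http://xmlns.com/foaf/0.1/"

# The slide's snippet, wrapped in an rdf:RDF root (an assumption; the slide shows
# only the rdf:Description element). The datatype attribute is omitted on purpose.
RDF_XML = f"""<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:foaf="{FOAF_NS}">
  <rdf:Description rdf:about="http://example.org/StuartChalk">
    <rdf:type rdf:resource="{FOAF_NS}Person"/>
    <foaf:knows rdf:resource="http://example.org/NeilOstlund"/>
    <foaf:phone>904-620-1938</foaf:phone>
  </rdf:Description>
</rdf:RDF>"""


def extract_triples(rdf_xml):
    """Return (subject, predicate, object) tuples from flat rdf:Description blocks."""
    root = ET.fromstring(rdf_xml)
    triples = []
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        subject = desc.get(f"{{{RDF_NS}}}about")
        for prop in desc:
            # ElementTree tags look like "{namespace}local"; dropping the braces
            # yields the full predicate IRI.
            predicate = prop.tag.replace("{", "").replace("}", "")
            resource = prop.get(f"{{{RDF_NS}}}resource")
            # A property either points at another resource or carries a literal.
            obj = resource if resource is not None else (prop.text or "").strip()
            triples.append((subject, predicate, obj))
    return triples


triples = extract_triples(RDF_XML)
for triple in triples:
    print(triple)
```

Each printed tuple corresponds to one statement about the resource `http://example.org/StuartChalk`: its type, who it knows, and its phone number.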
  5. DOE SBIR Grant
       • Chemical Semantics is funded by DOE to create a web portal to collect, organize and make searchable the results output from computational chemistry (CC) calculations
       • This will be freely available and will accept output from all CC software packages
       • The intent is to capture calculation results and:
           • Software used to calculate the results
           • Input parameters used in the calculation
           • Methodology by which the calculation was done
           • Details of the molecular system studied
  6. Data Transformations
       • The approach Chemical Semantics is taking is to:
           1. Add code to software packages to generate an XML file alongside the normal output file, OR parse an existing output file (using a free application) and generate an XML file
           2. Send the XML file to the web portal
           3. Convert the XML file into RDF in Turtle format (TTL)
           4. Finally, ingest the TTL into a triplestore (Virtuoso)
       • All the data in Virtuoso can then be searched using SPARQL (SPARQL Protocol and RDF Query Language)
       • Virtuoso: http://virtuoso.openlinksw.com/
       • SPARQL: http://www.w3.org/TR/sparql11-query/
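Step 3 of the approach on slide 6 (XML to Turtle) can be illustrated with a small sketch. Everything specific here is an assumption made for illustration: the element names in `SAMPLE_XML` are hypothetical and are not taken from the CSX schema, the `BASE` and `VOCAB` namespaces are placeholders, and the actual conversion used by the portal is not described in the slides. The SPARQL query is built only as text to show the kind of search a triplestore could run. It is not sent anywhere.

```python
import xml.etree.ElementTree as ET

# Hypothetical calculation record; these element names are NOT from the CSX schema.
SAMPLE_XML = """<calculation id="calc1">
  <software>ExamplePackage</software>
  <method>HF</method>
  <basisSet>STO-3G</basisSet>
</calculation>"""

BASE = "http://example.org/calc/"    # assumed namespace for subjects
VOCAB = "http://example.org/vocab#"  # assumed namespace for predicates


def xml_to_turtle(xml_text):
    """Serialize each child of the root element as one Turtle triple with a string literal."""
    root = ET.fromstring(xml_text)
    subject = f"<{BASE}{root.get('id')}>"
    lines = [f"@prefix ex: <{VOCAB}> ."]
    for child in root:
        # Escape backslashes and double quotes so the literal stays valid Turtle.
        value = (child.text or "").strip().replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'{subject} ex:{child.tag} "{value}" .')
    return "\n".join(lines)


turtle = xml_to_turtle(SAMPLE_XML)
print(turtle)

# A SPARQL query that a triplestore could run over such triples (text only).
QUERY = f"""PREFIX ex: <{VOCAB}>
SELECT ?calc WHERE {{ ?calc ex:method "HF" . }}"""
print(QUERY)
```

The design choice in this sketch is to map one XML element to one triple, which keeps the conversion easy to follow. A real mapping would need typed literals, nested structures, and a shared vocabulary.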
  7. Intermediate XML File
       • Why XML?
           • Human readable (plain text, UTF-8)
           • Platform neutral
           • Archivable
           • Validatable
       • Why not use CML?
           • Inability to represent complex structures, e.g. residues
           • No standard way to add CC results
  8. Common Standard for eXchange (CSX)
       • A CSX file is a text-based file written in XML
       • It is a structured data container designed to hold CC result data and additional metadata
       • Version 0.x was developed by Neil Ostlund
       • Version 1.0 is the current stable release, developed as part of Phase 1 of the SBIR grant (limited scope)
       • Version 2.0 is currently under development as part of Phase 2 of the SBIR grant
  9. CSX as a Standard Data Format
       • It is well known that the formats in which data are reported in CC output files are:
           • Highly variable (software specific)
           • Sometimes difficult to interpret
       • Standardization would:
           • Allow data from different packages to be more easily compared
           • Open up opportunities for software development to display and reuse data for different applications
       • This mirrors movement in the CC community toward a common driver base for CC software packages
  10. The CSX Schema
       • A schema document is available for the CSX specification; it describes the layout and allowed names of elements and attributes, and the allowed values for both
       • It can be used to:
           • Help new users write valid CSX files (using XML editing applications such as XMLSpy and oXygen XML Editor)
           • Validate existing CSX files using any of a number of XML validators (e.g. Xerces)
           • Understand the structure of the data, especially for less frequently calculated results
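To give a feel for what checking a file against expectations involves, here is a deliberately simplified sketch using only the Python standard library. It is not schema validation: it checks only that a document is well-formed XML and that a few required child elements are present. The names in `REQUIRED_CHILDREN` are hypothetical placeholders. The real element names and rules are defined by the CSX schema document, and validation against it would be done with a schema-aware validator such as Xerces.

```python
import xml.etree.ElementTree as ET

# Hypothetical required children; the real names come from the CSX schema document.
REQUIRED_CHILDREN = ("publication", "molecularSystem", "results")


def check_document(xml_text):
    """Return a list of problems; an empty list means both checks passed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        # Not well-formed XML, so no structural check is possible.
        return [f"not well-formed: {err}"]
    present = {child.tag for child in root}
    return [f"missing element: {name}" for name in REQUIRED_CHILDREN if name not in present]


good = "<doc><publication/><molecularSystem/><results/></doc>"
bad = "<doc><publication/></doc>"
broken = "<doc><publication></doc>"

print(check_document(good))
print(check_document(bad))
print(check_document(broken))
```

A schema goes much further than this, constraining element order, attribute names, and allowed values. That is why the slide recommends a proper XML validator.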
  11. CSX Schema v1.0
  12. CSX Schema v1.0
  13. CSX Schema v1.0
  14. CSX Schema v1.0
  15. CSX – Publication Information
  16. CSX – Molecular System Information
  17. CSX – Calculated Result Information
  18. Future Plans
       • Work on CSX 2.0 is ongoing: expand to multiple systems and sets of calculated results
       • Develop a CSX-focused website with converter functionality, libraries, and documentation
       • Engage CC software users/programmers to get involved with the project
       • Organize a community developer workshop over summer 2015
       • Publish version 2.0 of CSX in Fall 2015
  19. Conclusion
       • CSX started out as a stepping stone to transfer information to the CS portal
       • Having a data standard for CC is an important development in and of itself
       • The CC community can do more with their data:
           • Leverage XML tools to visualize, process, etc.
           • Compare results across CC packages
           • Validate results
           • Reference basis sets (https://bse.pnl.gov/)
  20. Questions?
       • schalk@unf.edu
       • Phone: 904-620-1938
       • Skype: stuartchalk
       • LinkedIn/SlideShare: https://www.linkedin.com/in/stuchalk
       • ORCID: http://orcid.org/0000-0002-0703-7776
       • ResearcherID: http://www.researcherid.com/rid/D-8577-2013
