Avogadro is being rewritten and architected to put semantic chemical meaning at the center of its internal data structures in order to fully support data-centric workflows. Computational and experimental chemistry both suffer when semantic meaning is lost; through the use of expressive formats such as CML, along with lightweight data-exchange formats such as JSON, workflows that previously demanded manual intervention to retain semantic meaning can be used. Integration with projects like JUMBO and Open Babel when conversion is required, coupled with codes such as NWChem where direct support for CML is being added, allow for much richer storage, analysis, and indexing of data. As web-based data sources add more semantic structure to their data, Avogadro will take advantage of those resources.
1. Avogadro, Open Chemistry and Semantics
August 21, 2012
Skolnik Symposium
Marcus D. Hanwell
Kyle Lutz
1
2. Introduction to Kitware
• Founded in 1998: 5 former GE Research employees
• 105 employees: more than 50 PhDs
• Privately held, profitable from creation, no debt
• Rapidly Growing: >30% in 2011, 7M web-visitors/quarter
• Offices
• 2011 Small Business
– Albany, NY Administration’s
– Carrboro, NC Tibbetts Award
– Santa Fe, NM • HPCWire Readers
and Editor’s Choice
– Lyon, France
• Inc’s 5000 List: 2008
– Bangalore, India to 2011
3. Avogadro
• Project began in 2006
• Split into library &
application (plugin-based)
• One of very few open source editors
• Designed to be extensible from the start
• Generates input & reads output from many
codes
• An active and growing community
• Chemistry needs a free, open framework
http://avogadro.openmolecules.net/
3
6. Vision for the Future
• Advancing the state-of-the-art
• Tight integration is needed
• Computational codes
• Clusters/supercomputers
• Data repositories
• Reduce, reuse, recycle!
• Facilitating sharing and
searching of data
• Embracing open data, cheminformatics
6
7. Opening Up Chemistry
• One of the most closed sciences
• Lots of black box proprietary codes
– Only a few have access to the code
– Publishing results from black box codes
– Many file formats in use, little agreement
• More papers should be including data
• Growing need for open standards
• Open tools needed to make that happen
7
8. Introduction to Open Chemistry
• User-friendly integration with
– Computational codes
– HPC/cloud resources
– Database/informatics resources
8
9. Introduction
• An open approach to chemistry software
Build, Test
– Open source frameworks & Package
Community
Review
– Developed openly
– Cross-platform
– Tested, verified
Software
– Contribution model Repository
– Supported by Kitware experts Developers
& Users
• BSD licensed to facilitate research/reuse
9
10. Open Chemistry Development Team
• Assembled an inter-disciplinary team
• Domain specialists: quantum chemistry,
biology, solid-state materials
• Computer scientists: build systems,
queuing, graphics, software process
• Marcus, Kyle, David L., Chris, David C.
10
11. OpenChemistry.org
• New website to promote open chemistry
• Hosts project-specific pages
• Provides an identity for related projects
• Promotes shared ownership of projects
– Website
– Code submission/review
– Testing infrastructure
– Wiki, mailing lists, news, galleries
11
13. Applications Being Developed
• Three independent applications
• Communication handled with local sockets
• Avogadro 2 – structure editing, input
generation, output viewing, and analysis
• MoleQueue – running local and remote
jobs in standalone programs, management
• ChemData – Storage of data, searching,
entry, annotation
13
14. Open Frameworks
• AvogadroLibs – core data structures and
algorithms shared across codes
• OpenQube – a collaboration platform for
quantum data ingestion and visualization
• Chemkit – file I/O, exploration and
chemoinformatics analysis
• VTK – specialized chemistry visualization/
data structures, use of above
14
19. Avogadro2
• Project began 2006
• Split into library &
application (plugin-based)
• One of very few open source editors
• Still using Qt, C++, Eigen, OpenGL
• Uses AvogadroLibs and OpenQube for core data
• Introduces client-server dataflow/patterns
• Includes new, efficient rendering code
• More liberally licensed – from GPL to BSD
19
20. Avogadro: Visualization
• GPU-accelerated rendering
• VTK for advanced visualization
• Support for 2D and 3D data plots
• Optimized data structures
– Large data
– Streaming
• Reworked interface
– Tighter database/workflow integration
20
21. MoleQueue: Job Management
• Tighter integration with remote queues
• Integration with databases
– Retains full log of computational jobs
– Triggers actions on completion
• Plugin-based system
– Easy addition of new codes
– Easy addition of new queuing systems
• Provides a client API for applications
21
23. New CML I/O
• Development of modular CML code
• Allows for multi-pass parsing of CML
• Keeps the CML closer to application
• Much faster, easier to extend and change
• Moving from simple CML to full semantic
documents that can be edited
• Learned from previous work in VTK and
Open Babel
23
24. File Format: CML & HDF5
• Leverages our experience with XDMF
• CML stores semantic data
– Name, formula, atoms, bonds
– Computational code, theory, basis set
• HDF5 used to store heavy data
– Basis set, intermediate data
– Eigenvectors, SCF matrix
– Volumetric data (MOs, electron density)
24
25. Rethinking Input File Generation
• Can we create a CML representation?
– Could be loaded directly by some codes
– Could be translated to input files for others
• Would allow search on input and output
• Could be stored and published
• Makes it easier to set up calculations
• Creates a more uniform experience
25
26. Advanced Impostor Rendering
• Using a scene, vertex buffer objects, and
OpenGL shading language
• Impostor techniques
– Sphere goes from 100s of triangles to 2!
– No artifacts from triangulation
– Scales to millions of spheres on modest GPU
26
28. Building Community
• Community around chemistry
Build, Test
projects & Package
Community
• Using Kitware’s software process Review
– Ensuring quality with continuous
testing
– Code contributions on the web
– Public mailing lists, bug trackers,
code review
• Promoting projects and Software
Repository
participation
– Publications
Developers
– Conferences & Users
– Workshops
28
29. Software Process
• Source code publicly hosted using Git
• Gerrit for code review
• CTest/CDash for testing/summary
– Gerrit can use CDash@Home
• Test proposed changes before merging
• CDash can now provide binaries
– Built nightly, available for direct download
• Wiki, mailing list, bug tracker
29
30. Conclusions
• Real opportunity to make an impact
• Improve research, industry and teaching
• Semantic data at the center of our work
– Storage
– Search
– Interaction with computational codes
– Comparison with experimental data
• Add support for iOS, Android and web
30
31. Acknowledgements
• Google Summer of Code for initial summer funding
• Avogadro developers: Geoffrey R. Hutchison, Donald E. Curtis,
David C. Lonie, Tim Vandermeersch and many more contributors,
users and supporters
• Kitware, Inc. for their unique business model & support
• The Engineer Research and Development Center’s Environmental
Laboratory for recent funding
• Open-source projects, standards and services we build on: Qt, Open
Babel, GLEW, CML, CACTUS Resolver, many, many more projects
• Support of many code developers including MOPAC, NWChem, Q-
Chem and others
• Support from Peter Murray-Rust and the Blue Obelisk
31