Poing: a coder’s take on protein modelling

Poing is a protein structure and folding model, designed to predict the tertiary structure of a protein from its sequence. I’ve been developing Poing for five years, after moving into computational biology from a background in software engineering. I’ve tried to keep the engineering ethos whilst dealing with the vagaries of scientific enquiry. My talk will focus on the engineering aspect, and how I’ve used a combination of C++, Python, various Python libraries, Subversion and server farms to produce a fairly slick workflow for both software engineering and developing and using the protein structure model. I will also talk about what I would have done differently with the benefit of hindsight.


  1. poing: a coder's take on protein modelling
     Benjamin Jefferys, Imperial College London, [email_address]
  2. Talk overview
     • A very short primer on protein folding
     • Inception and philosophy of poing
     • Software engineering of poing
     • Doing science
     • Problems
     • What I'd do differently
     • Releasing the source
  3. DNA → gene → mRNA → amino acid chain → protein structure
  4. Inception and philosophy of poing
     • Initially a toy model for evolving protein-like polymers
        • Key requirement: speed
        • Key requirement: avoid protein structure prediction
     • Found to be too simple
     • Developed to be more sophisticated - no longer a toy
     • Mischievous postdoc suggested use for prediction
        • It did surprisingly well (read: not terribly)
     • Was sucked into structure prediction quagmire
     • Managed to keep it fast
  5. Molecular modelling or CSP?
     • Protein folding looks like a constraint satisfaction problem
     • poing is a physics-based constraint solver
     • Constraints based upon heuristics from literature
     • End result indistinguishable from molecular modelling
        • This is what I tell biologists it is
        • Secretly I think of it as a constraint solver
        • Other people have used CSP solvers to fold proteins
        • Hasn't really worked
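As a rough illustration of what "physics-based constraint solver" can mean, the sketch below treats each distance constraint as a spring and repeatedly nudges both points along it. This is not poing's actual algorithm, which the slides do not describe; the constraint format `(i, j, target_distance)`, the function names and the step size are all assumptions made for the example.

```python
import math
import random

def relax(positions, constraints, steps=2000, step_size=0.05):
    """Nudge 3D points until pairwise distance constraints are roughly met.

    positions:   list of [x, y, z] lists, modified in place
    constraints: list of (i, j, target_distance) tuples (hypothetical format)
    """
    for _ in range(steps):
        for i, j, target in constraints:
            delta = [b - a for a, b in zip(positions[i], positions[j])]
            dist = math.sqrt(sum(d * d for d in delta)) or 1e-9
            # Spring-like correction: positive when the pair is too far apart,
            # negative when it is too close.
            correction = step_size * (dist - target) / dist
            for axis in range(3):
                positions[i][axis] += correction * delta[axis]
                positions[j][axis] -= correction * delta[axis]
    return positions

def distance(a, b):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

random.seed(0)
points = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
# Toy problem: ask three points to form a 3-4-5 triangle.
constraints = [(0, 1, 3.0), (1, 2, 4.0), (0, 2, 5.0)]
relax(points, constraints)
```

Each pass only looks at one constraint at a time, so the loop stays cheap; the price is that conflicting constraints settle into a compromise rather than an exact solution.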
  6. poing software architecture
     • Speed-critical parts in fully OO C++
     • Everything else in Python
     • SWIG to glue them together
        • SWIG is awesome
        • Can even subclass C++ classes with Python classes
     • Amount in C++ reduced over time
        • SWIG makes fantastic interfaces
        • But still have to think: not totally transparent!
        • So good to minimise complexity of interface
  7. poing software development
     • Editor - jEdit - bosh
     • Source control - Subversion
        • Mostly for multi-site working
        • Occasionally for tagging releases
        • Occasionally for branching
        • Very rarely for accessing old versions
     • Bug/issue tracking
        • Initially Bugzilla wrapped in a nice VM
        • Then todoist.com - more flexible
  8. Doing science with poing
     • A new model being actively developed
     • Important to experiment with new ideas, enabled by:
        • Maximising Python part, easing implementation
        • Short run-times, giving rapid feedback
        • An interactive folding viewer, enabling visual analysis
        • Tools for running ensembles, to get good statistics
        • Regular trips to the pub
  9. Doing science with poing
     • Science must be accountable
        • Published work must be repeatable
        • "I've lost the code" is not acceptable
        • "The code has changed" is not acceptable
        • Code and data snapshot for every experiment
     • Lost count of how many farm interfaces I've made
        • Many methods to split work amongst many processors
        • And then gather and analyse the results
        • Fork-style worked for a while...
        • But simple multi-script invocation won in the end
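The "simple multi-script invocation" idea from slide 9 can be sketched as follows: split an ensemble of random seeds into chunks and emit one ordinary command line per farm job, each tagged with the code revision it ran against. The script name `fold.py`, its flags, the output paths and the revision string `r1234` are hypothetical placeholders, not poing's real interface.

```python
import shlex

def split_work(items, n_jobs):
    """Deal items round-robin into n_jobs lists of near-equal size."""
    return [items[k::n_jobs] for k in range(n_jobs)]

def job_commands(seeds, n_jobs, script="fold.py", revision="r1234"):
    """Build one command line per job (script name and flags are hypothetical)."""
    commands = []
    for job_id, chunk in enumerate(split_work(seeds, n_jobs)):
        if not chunk:
            continue  # more jobs than work items: skip the empty ones
        args = [
            "python", script,
            "--revision", revision,  # records which code snapshot was used
            "--out", f"results/job{job_id}.pkl",
            "--seeds", ",".join(str(s) for s in chunk),
        ]
        commands.append(" ".join(shlex.quote(a) for a in args))
    return commands

commands = job_commands(list(range(10)), n_jobs=4)
for line in commands:
    print(line)
```

Because every job is just a command line, the same list can be handed to whatever queueing system the farm uses, and a separate script can gather the per-job result files afterwards.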
  10. Doing science with poing
     • Speed is key
        • Changes the kind of science you can do
        • The only paper so far relied on the speed of poing
     • Many approaches to speed
        • Profiling (but this alters compiler optimisations)
        • Simple time-edit-time
        • Examining assembler output from compiler
     • Two key speed-ups
        • All-all distance matrix cache
        • Get rid of unnecessary casts
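The slides do not say how poing's all-all distance matrix cache works, so the following is only one plausible shape for it: keep every pairwise distance in a matrix and recompute just the rows belonging to points that have moved since the last query. The class name `DistanceCache` and the `recomputed` counter are inventions for this sketch.

```python
import math

class DistanceCache:
    """Cache of all pairwise distances; only rows of moved points are recomputed."""

    def __init__(self, points):
        self.points = [tuple(p) for p in points]
        n = len(self.points)
        self.matrix = [[0.0] * n for _ in range(n)]
        self.dirty = set(range(n))  # indices whose rows are out of date
        self.recomputed = 0         # counter, only here to show the saving

    def move(self, i, new_position):
        """Update one point and mark its distances as stale."""
        self.points[i] = tuple(new_position)
        self.dirty.add(i)

    def _refresh(self):
        for i in self.dirty:
            for j in range(len(self.points)):
                d = math.dist(self.points[i], self.points[j])
                self.matrix[i][j] = self.matrix[j][i] = d
                self.recomputed += 1
        self.dirty.clear()

    def distance(self, i, j):
        """Return the cached distance, refreshing stale rows first."""
        if self.dirty:
            self._refresh()
        return self.matrix[i][j]

cache = DistanceCache([(0, 0, 0), (3, 4, 0), (0, 0, 1)])
first = cache.distance(0, 1)   # fills the whole matrix once
cache.move(2, (0, 0, 2))
second = cache.distance(0, 2)  # recomputes only the row for point 2
```

The saving comes from queries between moves: many constraints can read the same distance without anyone repeating the square root.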
  11. Problems: C++ - Python - SWIG
     • C++ objects fundamentally unpickleable
        • I love pickle and use it all the time
        • Have to bodge an interface to make pickleable objects
     • Advanced functionality in SWIG troublesome
        • Can put bits of glue C++ and Python in .i file
        • Works, but confusing - keep it simple
     • Some surprising things missing from SWIG
        • No automatic interface to C-style arrays!
     • Can be hard to switch context
        • No garbage collection in C++
        • Slightly inelegant bodge to make Python do it
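One common way to "bodge an interface to make pickleable objects" is a Python wrapper that saves only plain data and rebuilds the native object on load, using the standard `__getstate__` and `__setstate__` hooks. The sketch below assumes that approach; `NativeChain` is a pure-Python stand-in for a SWIG-wrapped C++ class, and both class names and their fields are hypothetical.

```python
import pickle

class NativeChain:
    """Stand-in for a SWIG-wrapped C++ object (hypothetical); refuses to pickle."""

    def __init__(self, sequence, coords):
        self.sequence = sequence
        self.coords = coords

    def __reduce__(self):
        raise TypeError("cannot pickle a wrapped C++ object")

class Chain:
    """Python-side wrapper that pickles plain data and rebuilds the native object."""

    def __init__(self, sequence, coords):
        self._native = NativeChain(sequence, list(coords))

    def __getstate__(self):
        # Save only plain Python data, never the native handle itself.
        return {"sequence": self._native.sequence,
                "coords": list(self._native.coords)}

    def __setstate__(self, state):
        # Recreate the native object from the saved data on unpickling.
        self._native = NativeChain(state["sequence"], state["coords"])

chain = Chain("MKV", [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)])
restored = pickle.loads(pickle.dumps(chain))
```

The cost is that every piece of native state worth keeping has to be listed by hand in `__getstate__`, which is where the bodge tends to show.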
  12. Problems: external resources
     • poing relies somewhat on filesystem access
        • Of course!
     • Accessing shared filesystem from farm unreliable
     • Need for multiple levels of error-checking
     • Is this an increasing problem in the future?
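One level of the error-checking slide 12 calls for might look like the sketch below: retry a read with increasing pauses, and treat an unexpectedly short file as a failure too. The function name, its parameters and the size check are assumptions for illustration; the slides do not describe poing's actual checks.

```python
import time

def read_with_retry(path, attempts=5, delay=0.5, expected_size=None):
    """Read a file, retrying on OSError and on unexpectedly short reads."""
    last_error = None
    for attempt in range(attempts):
        try:
            with open(path, "rb") as handle:
                data = handle.read()
            if expected_size is not None and len(data) != expected_size:
                raise OSError(
                    f"short read: got {len(data)} of {expected_size} bytes")
            return data
        except OSError as error:
            last_error = error
            if attempt < attempts - 1:
                # Back off between tries: 0.5 s, 1 s, 2 s, ... by default.
                time.sleep(delay * 2 ** attempt)
    raise OSError(f"giving up on {path} after {attempts} attempts") from last_error
```

A caller on the farm would typically wrap this in a further check of its own, for example verifying that the data parses, since a file can be complete on disk and still be wrong.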
  13. What I'd do differently
     • Use scipy/numpy instead of C++
        • Compiled code is a nuisance
        • scipy/numpy can use SSE without me thinking
     • Or use Java instead of C++/Python
        • Java can be pretty fast
        • Spin-out art/science project impressively fast
        • Almost as nice to code in as Python
        • Eclipse!
     • But it isn't a big issue - my approach has been successful
        • From a software engineering point of view, at least
  14. Releasing poing
     • Preparing the code
        • Clearing out junk code
        • Doxygen for documentation of C++
        • Doxygen for documentation of Python? (suggestions?)
        • Example scripts for typical use
     • Distribution
        • poing.sourceforge.net
        • License - open source
        • Force citation of paper?
        • Co-authorship on derived work?
