Cheminformatics Software Development: Case Studies
CheminformaticsSoftware Development: Case Studies Direct Observations and Informed Opinions Jeremy J Yang PhD Student, IU Cheminformatics Mgr, Systems & Programming, UNM Biocomputing Indiana University School of Informatics and Computing - I571, Intro to Cheminformatics - Fall 2011
My experience1) Daylight (1989 - 2002): Support Coordinator, SoftwareEngineer – support, user education, software engineering, meetings, application science, web apps, QA, databases, methodology research, etc.2) OpenEye (2002 - 2007): Director/VP of Support, Sr.Software Engineer – support, management, software engineering, QA, web apps, methodology research, etc.3) UNM (2002 - ): Mgr., Systems & Programming – software engineering, management, support, computational methodolgy, bioassay screening informatics, biomedical informatics research, etc.
Concept of this talk...Describe direct observations from experiences incheminformatics over last 22 years, relevant todayto understand and navigate complex landscape ofcheminformatics software roles and choices.Avoiding excessive idle reminiscing, include someof the colorful personalities and curious events.Suggest some lessons learned and trendsobserved, in the opinion of the author.
OutlineCase studies included:1) Daylight2) OpenEye3) Symyx a.k.a. MDL Some interesting others also mentioned4) Accelrys briefly.5) OpenBabel(Chosen mostly based on my familiarity.)
Perspectives on scientific software• Developer, programmer• Computational/informatics scientist, scholar• Support, educator, maintainer• Software as scientific publishing• Open source collaboration• Consumer: toolkit user (programmer)• Consumer: app user, scientist• Licensing, intellectual property, legal• Business (commercial or non-commercial)
Case Study #1Daylight Chemical Information Systems, Inc. http://www.daylight.com
Daylight Chemical Information Systems, Inc.l Founded 1987 by Dave Weininger, Art Weininger, Yosi Taitz, in Claremont, CAl Ancestry: Pomona College MedChem program (Corwin Hansch, Al Leo)l Innovations: SMILES, SMARTS, SMIRKS, fingerprints, rigorous syntax/grammar/semanticsl Products: ClogP, Thor, Merlin, C toolkits, Oracle Chemistry Cartridgel Fortran → C ~1990, oop-ish API (Scofields).l DEC-VAX/VMS → Unix → Linux, Windows
Daylight “MUG” meetings known for scientific impact Daylight MUG ‘98, Santa Fe, NM
Daylight Chemical Information Systems, Inc.l Success via appeal to examplars who became advocates at various pharmas.l Published, supported APIs and open transparent system appealed to “hackers, nurds and geeks”.l From research computing to enterprise (IT).l Many other software developers learned and borrowed from Daylight.l Focused on software, to enable science.l Max ~12 employees, $5M annual revenue.
Some idle reminiscing... Daylight Krewe @ Zulu Ball, New Orleans 1992 Dave Weininger ArtCraig James Weininger Jeremy Yang
Daylight Chemical Information Systems, Inc.l Scientific mission: To effectively store, retrieve, process all existing chemical information.l This mission often guided what to do, and also what to not do.l Often that conflicted with profit motives.l Avoided: consulting, macromolecules, biology, Windows, investors, sales, marketingIf I have seen further it is only by standing on the shoulders ofgiants. - Issac NewtonI cant see far because there are giants on my shoulders! - DaveWeininger
Example research project using Daylight tools http://www.daylight.com/meetings/mug01/Yang/gadd/
Example research project using Daylight tools Slide 15
Case Study #2OpenEye Scientific Software, Inc. http://www.eyesopen.com
OpenEye Scientific Software, Inc.l Founded 1997 by Anthony Nicholls in Santa Fe, NM.l Ancestry: Honig lab, biophysics, Columbia.l Innovations: molecular shape and continuum dielectric Poisson-Boltzmann electrostatics via atom centered Gaussians, 3D speed, rigor, accuracy and validation, comprehensive APIsl Products: Rocs, Fred, OEChem, Vidal APIs: C++, Python, Java, many OSs
OpenEye apps Rocs shape overlay Fred docking result Szmap hydrophillic regionsBrood bioisosteric fragments
OpenEye Scientific Software, Inc.l Focused on software and science, sometimes a difficult balance.l High-throughput 3D virtual screening (esp. Rocs) has become standard practice (enterprise-like).l Spectrum of functionality between 2D cheminformatics and 3D high performance computing.l Max 34 employees (current).
OpenEye Example Code (2D & 3D)$ pdb2lig.pyUsage:pdb2lig.py [options] [<infile] [>outfile] --i=<INFILE> --outlig=<OUTLIGFILE> --outpro=<OUTPROFILE> ... output protein or other macromol --multiligfiles ... one output file per ligand --minatoms=<N> ... minimum atomcount cutoff for ligand  --maxatoms=<N> ... maximum atomcount cutoff for ligand  --metal ... disconnected metal ions stay with protein --f ... force processing of non-PDB file --v ... verbose --vv ... very verbose --h ... help
Case Study #3Symyx a.k.a. MDL http://www.symyx.com(since 2010, http://www.accelrys.com)
Symyx a.k.a. MDLl Founded 1978 by Stuart Marson and Todd Wipke (ancestry: Harvard, Princeton, Stanford).l “Molecular Design Ltd”, reflected initial ab initio design goals, renamed “MDL Information Systems” (93)l Products: REACCS (82), MACCS (84), MACCS-3D (88), ISIS (91), Isentris (04), chem +bio database systemsl Invented molfile/SD format (proprietary till 1991), and extensions (query, R-group).l Based Chime on Rasmol source code.l Purchased by Reed Elsevier (97), then Symyx (07), then Accelrys (10).
Symyx a.k.a. MDLl Max ~300 employees. (My guess.)l By 1990, all pharma research companies used MDL software for their compound database.l Then the MBAs, lawyers and admen took over!l And innovation ceased. Slide 30
Case Study #4 Accelrys http://www.accelrys.com
Accelrys A tale of mergers and acquisitions (~1990 - 2011): – MSI (Molecular Simulations) • Biodesign • Cambridge Molecular Design • Polygen • Biocad • Biosym Technologies – Synopsys Scientific Systems – Oxford Molecular – Genetics Computer Group – Synomyx – SciTegic – Symyx – Contur Software AB
Accelrys SELECT hts_ap_archive.runset_number as "RUN", hts_ap_archive.ap_alias as "APlateName", hts_plate.plate_id, hts_plate.alternate_id as "IPlateName", example hts_well.well_no, hts_sample.alternate_id, of hts_result_detail.value_char AS Target, hts_result_type.type_desc AS Result_Type, hts_assay_result.concentration || hts_conc_unit.unit_value AS CONC, AEI hts_assay_result.dilution, hts_assay_result.result_value AS Value SQL: FROM hts_well, hts_plate, hts_sample, hts_conc_unit, ddi_container_master,corporate hts_ap_archive, hts_assay_result,merging hts_result_type, hts_result_detail WHERE hts_ap_archive.ap_alias=213_20110712_135454-1reflected in AND hts_assay_result.sample_id=hts_well.sample_id AND hts_assay_result.sample_id=hts_sample.sample_idschema, AND hts_well.plate_id=hts_plate.plate_id AND hts_assay_result.plate_id=hts_ap_archive.ap_numbertechnology AND hts_plate.alternate_id=ddi_container_master.container_name AND hts_well.plate_id=hts_plate.plate_id AND hts_well.sample_id=hts_sample.sample_id AND hts_assay_result.sample_plate=ddi_container_master.container_id AND hts_result_type.result_type=hts_assay_result.result_type AND hts_assay_result.result_id=hts_result_detail.result_id AND hts_assay_result.conc_unit=hts_conc_unit.unit_id ORDER BY hts_well.well_no,Target,Result_Type
Accelrys Merging reflected in technologyl Chemical cartridges: AEI (v6 & v7), Symyx,new SciTegic cartridgel Cheminformatics? E.g., canonicalize asmiles w/ Accelrys, Scitegic or MDL code.l The Accord Enterprise Informatics (AEI6.2) UNM purchased in 2009 is now a“legacy product”.l Re-branding and re-packagingl May be great technically, but challengingfor customers and Accelrys alike.
Accelryskey product:Pipeline Pilot (Scitegicfounded 1999, acquired byAccelrys 2004 for $21.5M.)
Opinionated Observation :There is a sometimes subtle differencebetween (1) a software product with a supportcontract, and (2) a service contract whichinvolves customizing and installing andconfiguring a software product, requiring orlikely to require ongoing, additional servicecontracts. Accelrys and Symyx (“solutionsproviders”) have generated much of theirrevenue using the latter. In contrast: toolsproviders.
Accelrys (My opinions -- feel free to disagree.)1) M&As reflected US & global businesstrends, but perpetual re-org is stressful andthere are huge costs to employees, customers,and technical progress.2) No coherent scientific mission, onlybusiness growth.3) Lots of excellent software, science andpeople. But all challenged by the chaos. "If in your science you only look for business, then you risk finding neither knowledge nor business." — Haldor Topsøe
OpenBabell Open-source C++ community project based on OELib by Matt Stahl, OpenEye.l Founded 2001, by Geoff Hutchinson et al. Now ~100 credited contributors.l C++ w/ wrappers via SWIG: Python, Perl, Java, Ruby, C#.l 2011 paper in Journal of Cheminformatics*l >160,000 downloads, >400 citations, used by >40 projects*”Open Babel: An open chemical toolbox”, Noel M. OBoyle, Michael Banck, Craig A. James, Chris Morley, TimVandermeersch and Geoffrey R. Hutchison, Journal of Cheminformatics 2011, 3:33doi:10.1186/1758-2946-3-33.
$ obscreenOpenBabel obscreen - screen molecules based on calculated properties OpenBabel v2.2.3 (Dec 4 2010)Example syntax: obscreen [options] options:Code -i <infile> -o <outfile> -otable <output table> -badsmarts <badfile> ... bad smarts file [default:builti -minmwt <MINWT> ... minimum molweight  -maxmwt <MAXWT> ... maximum molweight  -minhbd <MINHBD> ... min Hbond donors  -maxhbd <MAXHBD> ... max Hbond donors  -minhba <MINHBA> ... min Hbond acceptors  -maxhba <MAXHBA> ... max Hbond acceptors  -minrot <MINROT> ... min rotors  -maxrot <MAXROT> ... max rotors  -minchiral <MINCHIRAL> ... min chiral atoms  -maxchiral <MAXCHIRAL> ... max chiral atoms  -minlogp <MINLOGP> ... min logP [-5.0] -maxlogp <MAXLOGP> ... max logP [5.0] -hbasmarts <HBASMARTS> ... default=[#6,#7;R0]=[#8] -hbdsmarts <HBDSMARTS> ... default=[!H0;#7,#8,#9] -n_inmax <N> ... input limit -n_outmax <N> ... output limit -v ... verbose -vv ... very verbose
OpenBabel: active discussion listsindicators of community health
OpenBabel Used by eMolecules, quite successfullyCraig James, CTOformerly of Daylight,Accelrys
OpenBabel (My opinions -- feel free to disagree.)l OB is very good but not yet close to Daylight, OEChem or JChem in comprehensiveness, quality, and other measures.l SMARTS accuracy not state of the art.l OB continually improving thanks to capable and active community of developers and users.l The value reflects the total quality developer years invested. So there is no reason OB cannot catch up, if the community continues to grow.
Some interesting and successful others:1) ChemAxon - Budapest, Marvin, Java, JChem2) Tripos (now Certara) - St. Louis, Sybyl, SGI3) Chem Comp Group - Montreal, MOE, SVL4) PyMol - was Delano, now Schrödinger (OSS)5) Schrodinger - quantum, etc.6) UCSF School of Pharmacy - DOCK, Kuntz & al.7) NIH NCTT - OSS, Java, Tripod, bioassay analysis8) Scitouch - Russia, Indigo, Dingo, Bingo9) CDK - FOSS, Java10) RDKit - FOSS, incl. machine learning11) Silicos NV - OSS & commercial12) Bioreason, VC-funded, bought by Simulations Plus13) Mesa Analytics & Computing14) NextMove Software15) UNM Division of Biocomputing16) IU SOIC CCRG
Some General Conclusionsl Cheminformatics software landscape is very complexl Economics & Freakonomics are drivers but alsopersonalitiesl FOSS can compete with $$$ software.l Landscape includes lots of hidden costs.l “Software engineering” has been an ideal, far fromreality overall.l Software can enable or hinder sciencel Diverse business models, diverse everythingl Plenty of technological changel Some human change too
The EndFeel free to contact me directly with questions orideas!Jeremy J Yangjejyang@indiana.edu