The Seven Deadly Sins of Bioinformatics Professor Carole Goble [email_address] The University of Manchester, UK The myGrid...
Roadmap <ul><li>Sins of BioScience </li></ul><ul><ul><li>With examples </li></ul></ul><ul><li>Why are we like this? </li><...
Intractable Problems in Bioinformatics. Have we sinned? Are these part of the intractable problem?
The traditional sins…. <ul><li>Lust </li></ul><ul><li>Gluttony  </li></ul><ul><li>Greed </li></ul><ul><li>Sloth </li></ul>...
Methodology <ul><li>Email a handful of bioinformaticians. </li></ul><ul><li>Stand well back. </li></ul><ul><li>Collect. </l...
I am grateful to… <ul><li>Phil Lord (University of Newcastle) </li></ul><ul><li>Anil Wipat (University of Newcastle) </li>...
They came up with more than seven. But I beat them into submission. Many are highly inter-related. Hopefully they are all ...
Sins <ul><li>Parochialism and Insularity </li></ul><ul><li>Exceptionalism </li></ul><ul><li>Autonomy or death! </li></ul><...
<ul><li>Parochialism </li></ul><ul><li>“ being provincial, being narrow in scope, or considering only small sections of an...
Reinvention <ul><li>Reinventing the Wheel. Rediscovering the same problems. Rediscovery of techniques & methods. </li></ul...
Comparative Genomics? Tisk! It's Comparative Bioinformatics. Bioinformatics is about mapping one schema to another, one form...
Names and Identity Crisis Q92983 O00275 O00276 O00277 O00278 O00279 O00280 O14865 O14866 P78507 <ul><li>WSL-1 protein </li...
Andy Law's Third Law <ul><li>“The number of unique identifiers assigned to an individual is never less than the number of ...
The Selfish Scientist <ul><li>“ A biologist would rather share their toothbrush than their (gene) names” </li></ul><ul><li...
Some causes of the Identity Crisis <ul><li>Conflation of the ID for a thing, something to call the thing, a description of...
Id Reinvention <ul><li>Global Identity naming mechanism for data objects in the Life Sciences </li></ul><ul><li>LSIDs and ...
Andy Law’s First (Format) Law <ul><li>“ The first step in developing a new genetic analysis algorithm is to decide how to ...
<ul><li>EMBOSS lists more than 20 different sequence formats.  </li></ul><ul><li>“ Nearly every collection of sequences th...
Reinvention of Ontology tools <ul><li>OBO and OWL ? </li></ul><ul><li>OBOEdit and Protégé-OWL ? </li></ul>The Montagues an...
The “Oh No” OBO Pragmatists Aesthetics Philosophers Life  Scientists Capulets Knowledge Representation Montagues A means t...
Yet another database … <ul><li>Organism databases </li></ul><ul><li>Counter example  </li></ul><ul><ul><li>Generic Model O...
BioBabel <ul><li>bioperl  </li></ul><ul><li>biojava  </li></ul><ul><li>biopython  </li></ul><ul><li>bioruby  </li></ul><ul...
Integration <ul><li>Workflows Management Systems </li></ul><ul><li>Counter example </li></ul><ul><li>Taverna   </li></ul>...
<ul><li>Reinvent wheels in creating 'Transcriptional Units' ('genes' derived from ESTs and mRNA), within species and betwe...
Any more ? <ul><li>Another Web 2.0 Web Site? Another Web interface to a database? Another portal? </li></ul><ul><li>Whole ...
Reuse Rocks. Collaboration through  workflow and web services <ul><li>VL-e Project </li></ul><ul><li>“ instant collaborati...
Recycling, Reuse, Repurposing <ul><li>A Trypanosomiasis  in Cattle workflow (by Paul) reused without change  for  Trichuri...
Warning! Reuse is Hard <ul><li>Writing reusable workflows is hard. </li></ul><ul><ul><li>Local services </li></ul></ul><ul...
Bullying and the Borg <ul><li>If a group is working in a field, you get bullied for trying out something different. </l...
Reinvention or Invention? Pre-dating <ul><li>BioMOBY pre-dates (Semantic) Web service revolution  </li></ul><ul><li>OBO an...
A few months in the laboratory (or the computer) can save a few hours in the library (or on Google). Westheimer's Law (wit...
No tool is an island… <ul><li>Assume </li></ul><ul><ul><li>only we will use it, whatever it may be. </li></ul></ul><ul><ul...
I know what it means... <ul><ul><ul><li>A hacker who studied ontology </li></ul></ul></ul><ul><ul><ul><li>Was famed for hi...
Not just bioinformatics  Computer Science is Guilty!
Why don’t biologists modularise OWL ontologies properly? Er, well, like how should we do it “properly” and where are the t...
“ I don't blame them [MGED/PSI community] because to truly comprehend RDF/OWL is not an easy task, it takes not just the u...
Standards are boring (but important) <ul><li>“ Blue collar Science” (John Quackenbush) </li></ul><ul><li>Nobody is going t...
Self promotion <ul><li>Not making shareable reusable software, because we can  publish  every single monolithic software s...
Research – Production Confusion <ul><li>Novelty vs Standards </li></ul><ul><li>Neither the funding nor the social structur...
Trust I don’t trust your code I don’t trust your data I don’t trust you will still be around in 1 year
Sin 2 <ul><li>Exceptionalism </li></ul><ul><li>Biologist exceptionalism </li></ul><ul><li>Biological exceptionalism </li><...
Biologist exceptionalism <ul><li>I know there is already a gene name for that gene, but, I don't like it and it doesn't fi...
Biological exceptionalism <ul><li>“ Biology is all exception.”  </li></ul><ul><li>“ Don’t complicate everyone’s life for t...
We are so much more complex… <ul><li>“ There are proteins, and there are records about proteins. Records come in different...
Other Sciences…. <ul><li>CERN: UML meta-modelling mechanisms in order to migrate models over time without losing data.  </...
Biology Exceptionalism <ul><li>Biology is harder than anything else in the whole wide world because there is lots of it an...
Sin 3 <ul><li>Autonomy or death! </li></ul><ul><li>Combined with churn and indifference to users. </li></ul><ul><li>Compou...
Autonomy is death! <ul><li>Change my interface / format whenever I feel like it, despite the fact I wanted lots of users a...
Lincoln Stein said a while ago… <ul><li>An interface is a contract between data provider and data consumer </li></ul><ul><...
Law's Second Law <ul><li>“Error messages should never be provided” corollary... “If error messages are provided, they shou...
Workflow commodities <ul><li>Workflow published with its paper and its data set. </li></ul><ul><li>So what happens when I ...
The myGrid Semantic Sweatshop <ul><li>Services and Workflows in the wild. </li></ul><ul><li>Curated by experts using an on...
The myGrid Semantic Sweatshop  notice how tired they look Franck Tanoh Katy Wolstencroft
Churn, Churn, Churn <ul><li>“ Stability is more important than Standards or Smartness. Discuss” </li></ul><ul><li>Constant...
Churn, Churn, Churn <ul><li>We expect the content to change, but why does everything else? </li></ul><ul><li>Constant chur...
Sin 4 <ul><li>Vanity </li></ul><ul><li>Pride  </li></ul><ul><li>Narcissism </li></ul><ul><li>conceit, egotism or simple se...
I know it all. <ul><li>Claiming to know everything about biology and everything about computers.  </li></ul><ul><li>This i...
Think like me!  <ul><li>Building interfaces that only you can use. </li></ul><ul><li>Not actually using your tools in the ...
A good User Experience outweighs smart features. Can I use it?  Is the user interface familiar? Does it fit with my needs?
Gain-Pain pay-off <ul><li>Just enough, just in time </li></ul>Gain Pain Very BAD Good, but Unlikely Just right
Sin 5 <ul><li>Monolith Megalomania </li></ul><ul><li>delusions of grandeur.  </li></ul><ul><li>obsession with grandiosity a...
More, more, more! <ul><li>Integration – the more the merrier. No. </li></ul><ul><ul><li>Every link is a potential dead lin...
The trouble with warehouses <ul><li>30% of data migration projects fail (Source: Standish Group) </li></ul><ul><li>50% of ...
More More More  <ul><li>“ Emacs of Biology” </li></ul><ul><li>End-user apps/libraries in bioinformatics workbenches with l...
Mash-Up Data Marshalling <ul><li>Content syndication and feeds </li></ul><ul><li>Emphasis shifts to the user creating spec...
Distributed Annotation System Mash-Up  http://www.biodas.org Reference Server AC003027 AC005122 M10154 Annotation Server A...
Sin 6 <ul><li>Scientific Method Sloth </li></ul><ul><li>It's easier to think of a new name than use someone else’s. </li></...
Ennui <ul><li>Garbage in, garbage out </li></ul><ul><ul><li>Running analysis over the wrong datasets </li></ul></ul><ul><u...
It's black and white <ul><li>Arbitrary cut-offs on rank-ordered result list </li></ul><ul><li>Everything above is absolute ...
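The effect of an arbitrary cut-off on a rank-ordered list can be sketched in a few lines of Python. The gene names, scores and thresholds below are invented purely for illustration and do not come from the talk.

```python
# Hypothetical ranked hits as (name, score) pairs; all values are made up.
hits = [("geneA", 0.97), ("geneB", 0.91), ("geneC", 0.90),
        ("geneD", 0.89), ("geneE", 0.52)]


def above_cutoff(ranked_hits, cutoff):
    """Return the names of hits whose score is at or above the cut-off."""
    return [name for name, score in ranked_hits if score >= cutoff]


# Two thresholds that are equally easy to justify give different answers.
strict = above_cutoff(hits, 0.90)
relaxed = above_cutoff(hits, 0.89)

print("cut-off 0.90:", strict)
print("cut-off 0.89:", relaxed)
```

Here geneD is "true" or "false" depending only on a difference of 0.01 in the threshold, which is the black-and-white reading the slide warns about.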
Quality Delusions <ul><li>The bioinformatics does not have to be sound, because we only trust wet-lab results anyway. </li...
Black Box Science <ul><li>Producing irreproducible bioinformatics analyses </li></ul><ul><ul><li>Not collecting the proven...
“ No experiment is reproducible.”  Wyszowski's Law “ An experiment is reproducible until another laboratory tries to repea...
Sin 7 <ul><li>Instant Gratification </li></ul><ul><li>Greed? Gluttony? </li></ul><ul><li>Always the immediate return. </li...
Hackery <ul><li>Deliver now, pay later </li></ul><ul><ul><li>Producing crap, non-reusable, software because only the biolo...
“ I am sure one could reuse large parts of re-annotation for building transcriptome maps, if they only used workflows and ...
“ Bioinformaticians have reached the standards of the 1980s, while computer scientists are working on the standards of the...
Blind faith in XML  <ul><li>It’s in XML, thus all data integration problems are solved. </li></ul><ul><ul><li>Er…no.  </li...
Blind Faith in Foo. <ul><li>There's a new thing to use. </li></ul><ul><li>we don't understand it yet.  </li></ul><ul><li>s...
Pioneering development methods <ul><li>Development by anecdote </li></ul><ul><ul><li>I heard in the pub that the way to go...
Open Source Blinkers <ul><li>Why does Open source have special merit?  </li></ul><ul><li>Commercial solutions with added s...
Sin Summary Maybe only one “original sin” in bioinformatics. Parochialism and Insularity Exceptionalism Autonomy or death!...
Can we become less sinful?  Why do these sins exist? Are bioinformaticians particularly naughty? No naughtier than Compute...
Why? <ul><li>Selfish Scientist – Self-interested Scientist </li></ul><ul><ul><li>Reputation, need to get results right now...
Luddism? Surely not! <ul><li>Refusing to have biology go beyond a cottage industry.  </li></ul><ul><li>Being scared to do ...
Research – Production Confusion <ul><li>Novelty vs Standards </li></ul><ul><li>Neither the funding nor the social structur...
Practical Steps? <ul><li>Create means to share know-how </li></ul><ul><ul><li>Understanding outside my expertise. e.g. sou...
FaceBook & Bazaar for  Workflow e-Scientists myexperiment.org Trials start  August 2007!
Delivery Bulge
Practical Steps for IT Platforms? <ul><li>Stop building monolithic solutions </li></ul><ul><ul><li>Strong force in busines...
Practical Steps? <ul><li>Presume and design for incremental change  </li></ul><ul><ul><li>Minimise disruption.  </li></ul>...
Web 2.0 Design Patterns <ul><li>http://www.oreillynet.com/pub/a/oreilly/tim/news/2005/09/30/what-is-web-20.html </li></ul>...
Practical Steps? <ul><li>Presume scientific practice naughtiness </li></ul><ul><ul><li>Try to deal with it, or expose it? ...
The Final Word Sin writes histories, goodness is silent.     Thomas Fuller
Published in: Health & Medicine, Technology
  • Ide
  • Identity Stability Social Technical
  • Not sure these all apply So we asked some people
  • An impression from all our panelists from all the papers and application notes they have rejected … Pride! and Sloth? Envy? Insularity. Even though it means more work in the end. 1. creating yet another identity scheme (identity crisis) 2. creating yet another representation mechanism for data (profusion of file formats) 30 different syntaxes for representing DNA / RNA and protein sequences
  • How can the semantic web help? Numerous identity schemes for identifying proteins, metabolites, genes etc.; do we really need any more?
  • Competitive advantage. VO forming; sharing e-Science ideals; May refusing to move data off her disk and copyrighting her workflows. Collaborate when it is necessary in order to gain … competitive advantage. Sharing on HER terms – May’s workflows. Scientists share because: they are compelled to (funding agencies, economies of scale, projects, the nature of the problem, it is the nature of the community); it is in their best interest; there are rewards.
  • W3C Semantic Web Health Care and Life Sciences Interest Group identity wars: Life Science Identifier vs URLs vs PURLs, Web Services vs REST services.
  • You could argue that OBO-edit is reinventing Protege badly. But make sure you are wearing your bullet-proof vest. Some people have argued that LSID reinvents HTTP and DNS badly. "Data Warehouse? More like Data Mortuary" (Anon). You can quote Usama Fayyad from Yahoo! Research! Laboratories! on what they call "Data Tombs": "Our ability to capture and store data far outpaces our ability to process and exploit it. This growing challenge has produced a phenomenon we call the data tombs, or data stores that are effectively write-only; data is deposited to merely rest in peace, since in all likelihood it will never be accessed again. Data tombs also represent missed opportunities." See Communications of the ACM: http://portal.acm.org/citation.cfm?doid=545151.545174 Still with sin 1: EMBOSS lists more than 20 DIFFERENT SEQUENCE FORMATS!!! at http://emboss.sourceforge.net/docs/themes/SequenceFormats.html
  • GMOD is a collection of software tools for creating and managing genome-scale biological databases. You can use it to create a small laboratory database of genome annotations, or a large web-accessible community database. GMOD tools are in use at FlyBase, WormBase, SGD, BeeBase and many other large and small community databases.
  • Or multiple seq
  • Picture of workflow
  • Come to think of it, I am quite sure many people reinvent wheels in creating 'Transcriptional Units' ('genes' derived from ESTs and mRNA), within species, but certainly between species. I think this holds for many genome assembly related stuff: I also doubt whether genome data compilers for E. coli, Drosophila, Plant species, etcetera reuse each other's code. In most cases something new is added, but large parts could have been reused. I should look at some bioinformatics publications for more examples, but also have to prepare our own ISMB demonstration. Why can't time be reinvented? And better this time! To give a recent counter example of our own: text miners generally require synonyms and probably reinvent the wheel to get them in many cases. We recently reached 'instant collaboration' with Martijn Schuemie from Rotterdam through a web service that discloses their protein synonym data. He made that especially after seeing our poster that showed a workflow with our web services: 'collaboration through workflow'. Within VL-e we are now even exchanging services and (sub)workflows with food scientists. Web services make that very easy, although I see that creating web services is still a bottleneck. For quick solutions it is still seen as too much extra trouble. We intend to make Martijn's service part of our ISMB demonstration (on Tuesday 24, after you left :'( ). Tomorrow I may come up with more when I have a look at your presentation (and find the time for it). Troubles with broken networks at home and at my provider (what are the odds? :'( ) prevent me from doing that now (I hope this e-mail goes anywhere).
  • He made that especially after seeing our poster that showed a workflow with our web services: 'collaboration through workflow'. Within VL-e we are now even exchanging services and (sub)workflows with food scientists. Web services make that very easy, although I see that creating web services is still a bottleneck. For quick solutions it is still seen as too much extra trouble. We intend to make Martijn's service part of our ISMB demonstration (on Tuesday 24, after you left :'( ).
  • Confirmed by the biologists. Worm Lady's name is Joanne Pennock and as far as I know she works for Prof. Richard K. Grencis. Description: Trichuris muris, the mouse whipworm, is a useful parasite model of the human parasite Trichuris trichiura. Whipworms derive their name from their characteristic morphology. Adults occupy the large intestine with their anterior ends embedded in the cells lining the intestine. Transmission occurs by ingestion of contaminated material. Jo didn’t know about the tools; she didn’t know how to do it properly. REUSE: identified sex-dependent biological pathways involved in the mouse model. The correlation of sex dependence and the ability of mice to expel the parasite had previously been hypothesised but had not been verified using conventional manual analysis techniques.
  • A kind of exceptionalism and reinvention?
  • Quicker to build it than find it? Quicker to build it than adapt or reuse something else? – designing reusable stuff is HARD.
  • Interfaces to things
  • Yeah? Semantics and formalisms matter 11,800
  • Modularisation is important. The recent exchange on the SWLS email list was great. "Why don't biologists do it properly?" "They don't do it properly because SW people don't know how to do it properly either. Also you don't give us much in the way of tools...." This was all about modularising OWL ontologies -- we don't know the semantics; there are no tools; and all that was on offer were some vague guidelines and the injunction to do it properly. "There are no proper ontologies in biology" -- that is, you don't make any that use all the features of OWL we've invented.... It is all summed up by observing that the agendas of SW technologists and biologists are not the same. SW, at most, is only a means to an end for biologists, but an end in itself for SW techies.
  • One-off, roll your owns. Nature contacted 89 databases listed in the Molecular Biology Database Collection (Nucl. Acids Res. 28, 1−7; 2000) to see how many still have funding five years on. Of these, 51 reported that they are struggling financially. Seven of these have closed; the rest are being updated sporadically in their owners' spare time. (Zeeya Merali and Jim Giles, Nature 435, 1010-1011 (23 June 2005) doi: 10.1038/4351010a) Publication and career driven: easier to get a paper or a promotion by building your own thing. We are to blame too!
  • Oh, the only other thing is that I think some of the sins are caused when research outputs are confused with production products. You require standards in the latter. You require bushiness in the former. However, neither the funding nor the social structures of bioinformatics allow us to treat these two differently in any principled manner - after all, how do you get funding for production SW other than claiming to be researching stuff? How do you get a publication out of a bit of research SW without claiming a potential user-base?
  • Added after the talk.
  • A cause of: don’t be deflected by the edge cases to over-complicate the world. Computer systems are too complicated - fight it. Information resources are worse. He who pays the piper establishes a committee to call the tune. Nucleic acid sequences provide the fundamental starting point for describing and understanding the structure, function, and development of genetically diverse organisms. The GenBank, EMBL, and DDBJ nucleic acid sequence data banks have from their inception used tables of sites and features to describe the roles and locations of higher order sequence domains and elements within the genome of an organism. In February, 1986, GenBank and EMBL began a collaborative effort (joined by DDBJ in 1987) to devise a common feature table format and common standards for annotation practice. 2 Overview of the Feature Table format: The overall goal of the feature table design is to provide an extensive vocabulary for describing features in a flexible framework for manipulating them. The Feature Table documentation represents the shared rules that allow the three databases to exchange data on a daily basis. The range of features to be represented is diverse, including regions which: * perform a biological function, * affect or are the result of the expression of a biological function, * interact with other molecules, * affect replication of a sequence, * affect or are the result of recombination of different sequences, * are a recognizable repeated unit, * have secondary or tertiary structure, * exhibit variation, or have been revised or corrected.
  • It would be better if I wrote the script I need so I know what it does, how it does it and how to modify it later, because I haven’t specified what it was supposed to do in the first place.
  • This is linked to pride
  • When Ensembl was getting going, they had the CERN people over to talk about managing schema change over time. CERN showed some really nice UML meta-modeling stuff that allows them to migrate models over time without losing data. Ewan sent them back to Europe because genes can have more than one transcript which can in turn re-use exons (in the Ensembl data model). The CERN people couldn't see how that was relevant to managing changing data models, but Ewan kept saying "Our data models are complicated - I don't think specifying them will help. We need to understand them instead." Of course, this was a few years ago and my memory is a little hazy.
  • Autonomy and death: Biojava suffered from this over the first 2 releases. We hadn't worked out how to provide stable interfaces to unstable implementations back then, so each minor release tended to break end-user code. And they
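The idea of a stable interface in front of an unstable implementation can be sketched as follows. This is an illustrative Python example, not BioJava code; `SequenceSource`, both implementations and `gc_count` are hypothetical names made up for the sketch.

```python
from abc import ABC, abstractmethod


class SequenceSource(ABC):
    """Hypothetical stable contract that end-user code depends on."""

    @abstractmethod
    def fetch(self, accession: str) -> str:
        """Return the sequence for an accession as a plain string."""


class DictBackedSource(SequenceSource):
    """'Release 1' implementation: sequences held in a dict."""

    def __init__(self, records):
        self._records = dict(records)

    def fetch(self, accession):
        return self._records[accession]


class TupleBackedSource(SequenceSource):
    """'Release 2' implementation: storage changed, contract did not."""

    def __init__(self, records):
        self._rows = tuple(records.items())

    def fetch(self, accession):
        for key, seq in self._rows:
            if key == accession:
                return seq
        raise KeyError(accession)


def gc_count(source: SequenceSource, accession: str) -> int:
    """End-user code, written once against the interface only."""
    return sum(1 for base in source.fetch(accession) if base in "GC")


data = {"X1": "GATTACA", "X2": "GGCC"}
results = [gc_count(impl(data), "X2")
           for impl in (DictBackedSource, TupleBackedSource)]
print(results)
```

Because `gc_count` touches only `fetch`, swapping the implementation between releases does not break it; that is the property the note says was missing in the early releases.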
  • Do you understand crimap’s error messages?
  • Scientist perspective for finding. Machinery perspective for validation. Readable and processable in OWL and RDF.
  • The Ensembl relational schema alters regularly. Often, it's because they are 'fixing' column naming that wasn't done according to their standards in the first place. Sometimes it is to add/remove fields. Since the Perl API sits directly on this, usually the APIs change to track. May be different now, but they didn't use to provide any backwards compatibility glue. http://www.purl.org/ As an example for your 'Churn' slide: when I look for web services with Google I find mostly pages /about/ web services and how things should be approached, rather than actual web services (things are different when you include filetype:wsdl). Another example may be related to the recent URI discussion on HCLS (that I didn't read yet): I think what Andy and I have been doing with upper ontologies is quite relevant, but I feel we are still in the middle of gaining experience with what is available. W3C Semantic Web Health Care and Life Sciences Interest Group identity wars: Life Science Identifier vs URLs vs PURLs, Web Services vs REST services. Impact on everyone else who uses the previous mechanism. A few voices, very loud, vested interest, for their application, win. You know what? Why don’t we stick with something for a while and rally behind it? Or at least figure out the cost of change. Join the debate.
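As a rough illustration of what "backwards compatibility glue" for renamed columns could look like, here is a minimal Python sketch. The column names and the `CompatRow` wrapper are hypothetical; they are not taken from the Ensembl schema or its API.

```python
# Hypothetical mapping from retired column names to their replacements.
RENAMED_COLUMNS = {"gene_id": "gene_identifier", "chrom": "chromosome_name"}


class CompatRow:
    """Wrap a row from the new schema so that old column names still work."""

    def __init__(self, row):
        self._row = row

    def __getitem__(self, column):
        # Translate a retired name to its replacement; pass new names through.
        return self._row[RENAMED_COLUMNS.get(column, column)]


# A row as the renamed schema would return it (values are made up).
new_row = {"gene_identifier": "G0001", "chromosome_name": "7"}
row = CompatRow(new_row)

old_style = row["gene_id"]          # code written before the rename
new_style = row["gene_identifier"]  # code written after the rename
print(old_style, new_style)
```

A thin layer like this lets the provider fix naming while giving consumers time to migrate, which is one way of accounting for the cost of change the note asks about.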
  • Picture.
  • Thinking you are the user. Suits me.
  • Added after the talk.
  • Added after the talk in response to discussions.
  • Find the natural lines of cleavage which minimise the number of “connections”. Standardise the connections. Under More, More, More, you may want to also mention end-user apps/libraries that try to be the 'emacs' of bioinformatics. Not so much of a thing now, but there was a phase of providing bioinformatics workbenches that had loads of crap bundled in, none of it kept up to date, none of it properly integrated.
  • Nobody uses my warehouse. http://research.microsoft.com/towards2020science/ You can quote Usama Fayyad from Yahoo! Research! Laboratories! on what they call "Data Tombs". See Communications of the ACM: http://portal.acm.org/citation.cfm?doid=545151.545174
  • No clue of testing during software development. Differentially expressed genes in microarray analyses. Protein identifications using Mascot scores. There's another one like this: if a group is working in a field, you get shouted at for trying out something different; this especially happens around anything that covers the same space as the OBO crowd. Often, you are actually doing something different, but because you use some words in common... It comes out as "Why do this? It's already been solved by Foo - the massively unwieldy, slow-moving, monolithic, meeting-paralysed international effort for Things Mentioning Foo".
  • (translated embl) Let's fix the quality.
  • UniGene is a good example of irreproducibility I think; at least it was a short two years ago when I looked into it. I asked the creators for a model or flow-chart to learn exactly what is happening during UniGene clustering, but they couldn't give me such. It doesn't seem to exist. 'Human' descriptions of what is done are available (via NCBI), but this is not exact. I was involved in a project that basically reclustered UniGene (leading to the Human Transcriptome Map), and I know many microarray analysts put a lot of effort into re-annotating their clones using genome databases. (Btw I am sure one could reuse large parts of re-annotation for building transcriptome maps, if they only used workflows and ontologies.) Each UniGene entry is a set of transcript sequences that appear to come from the same transcription locus (gene or expressed pseudogene), together with information on protein similarities, gene expression, cDNA clone reagents, and genomic location.
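One small step towards an exact, machine-readable description of an analysis is to record provenance for each run. The sketch below is illustrative Python only; the tool name, version and parameters are hypothetical and have nothing to do with how UniGene is actually built.

```python
import hashlib
import json


def provenance_record(input_bytes, tool, version, params):
    """Describe one analysis step so it can be re-run and compared later."""
    return {
        "input_sha256": hashlib.sha256(input_bytes).hexdigest(),
        "tool": tool,
        "version": version,
        # Sort parameters so that equal settings give an identical record.
        "params": dict(sorted(params.items())),
    }


# Hypothetical input and settings, made up for the example.
fasta = b">seq1\nGATTACA\n"
record = provenance_record(fasta, "cluster_tool", "0.1",
                           {"min_overlap": 40, "identity": 0.95})
repeat = provenance_record(fasta, "cluster_tool", "0.1",
                           {"identity": 0.95, "min_overlap": 40})

print(json.dumps(record, indent=2))
print("same run description:", record == repeat)
```

Publishing a record like this alongside the results gives another laboratory something exact to repeat, rather than a 'human' description.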
  • All kinds of hackery. Instant gratification.
  • Blind faith in ...: I've seen this with nearly every technology going. There's a new thing to use, we don't understand it yet, so it sucks up all the stuff we already know we don't understand, leaving us with a system either side of it free from problems. Lack of appreciation about exactly what the new tech addresses *in itself* before trying to make it work *for us*.
  • Conflicts with reinventing.
  • There is hacking and HACKING
  • Immaturity Build then think. Understanding the problem. But you never will.
  • A sin set
  • Why it's very, very good: Lots of features for project management, file sharing, charting progress, recording “actions”. Web based tool, designed for people split between many locations. Why there was little uptake: Because we are naughty. Because it took time to learn how to use it, so we all thought “OK, OK, I’ll do that later”. Because it had jargon / language which we would have to learn, and we would have to understand how each concept relates to our project. Because it is a pre-designed recipe which might not fit the way we already work. Because the system was particularly slow from Nairobi (possibly the slowness was the “authentication” step – we didn’t solve it, but maybe could have). None of this reflects on Basecamp – it is a widely used tool which fits the needs of multi-site projects – perhaps we underestimated the “activation energy” needed to get this working. It is a solution which might have worked.
  • Experimental object – related to the caData – in the wild. myExperiment makes it really easy for the next generation of scientists to contribute to a pool of scientific workflows, build communities and form relationships. myExperiment enables scientists to share, re-use and repurpose workflows and reduce time-to-experiment, share expertise and avoid reinvention. “Their kids may have got there first but scientists will soon have their very own version of MySpace, where they will be able to share preliminary results, ideas and research tools.” (New Scientist Tech, October 2006.) myExperiment introduces the concept of a workflow bazaar: a collaborative environment where scientists can safely publish their creations, share them with a wider group and find the workflows of others. Workflows can now be swapped, sorted and searched like photos and videos on the web. myExperiment is a Virtual Research Environment which makes it easy for people to share experiments and discuss them. We are currently working with our users to determine exactly how they want this site to work. We had a user meeting at the end of September 2006 to brainstorm myExperiment, and you can read some of the results from this meeting at our portal party wiki. Currently, a lightweight repository of workflows and the Taverna BioService Finder are available. Scientists should be able to swap workflows and publications as easily as citizens can share documents, photos and videos on the Web. myExperiment owes far more to social networking websites such as MySpace and YouTube than to the traditional portals of Grid computing, and is immediately familiar to the new generation of scientists. myExperiment provides a personalised environment which enables users to share, re-use and repurpose experiments, reducing time-to-experiment. We expect to start with focused pilot myExperiment portals based upon case studies for the specific areas of Astronomy, Bioinformatics, Chemistry and Social Science.
  • Add Bernardo. Do not disdain the mundane! The delivery bulge. Cost of really making this work. The cost had better be worth it. And not just the cost of money but people and commitment. So we had better be tackling the right bit of the problem. Papers do not equal usable systems. The devil is in the detail. Practicalities override Niceties. Who are your users? This is just for semantic web service provision. Put in Pinar, software engineers, Chris Wroe, Phil Lord, Mark Wilkinson as a service provider. Each despises the other.
  • Back to Basics But building for other people. Sandy Carter agility of solutions. Making the service to the business process.
  • The end of the black box
  • Workflows
  • The only difference between the saint and the sinner is that every saint has a past, and every sinner has a future. Oscar Wilde
  • Transcript of "The seven-deadly-sins-of-bioinformatics3960"

    1. 1. The Seven Deadly Sins of Bioinformatics Professor Carole Goble [email_address] The University of Manchester, UK The myGrid project OMII-UK
    2. 2. Roadmap <ul><li>Sins of BioScience </li></ul><ul><ul><li>With examples </li></ul></ul><ul><li>Why are we like this? </li></ul><ul><li>The Selfish Scientist? E-Science is me-Science. </li></ul><ul><li>Challenges </li></ul><ul><ul><li>Technical </li></ul></ul><ul><ul><li>Social </li></ul></ul>
    3. 3. Intractable Problems in Bioinformatics. Have we sinned? Are these part of the intractable problem?
    4. 4. The traditional sins…. <ul><li>Lust </li></ul><ul><li>Gluttony </li></ul><ul><li>Greed </li></ul><ul><li>Sloth </li></ul><ul><li>Wrath </li></ul><ul><li>Envy </li></ul><ul><li>Pride </li></ul>http://en.wikipedia.org/wiki/Seven_deadly_sins [Stevens and Lord]
    5. 5. Methodology <ul><li>Email a handful of bioinformaticians. </li></ul><ul><li>Stand well back. </li></ul><ul><li>Collect. </li></ul><ul><li>Edit. </li></ul><ul><li>Therapy on the cheap. </li></ul><ul><li>We all felt better. </li></ul>
    6. 6. I am grateful to… <ul><li>Phil Lord (University of Newcastle) </li></ul><ul><li>Anil Wipat (University of Newcastle) </li></ul><ul><li>Matthew Pocock (University of Newcastle) </li></ul><ul><li>Robert Stevens (University of Manchester) </li></ul><ul><li>Paul Fisher (University of Manchester) </li></ul><ul><li>Duncan Hull (Manchester Centre for Systems Biology) </li></ul><ul><li>Norman Paton (University of Manchester) </li></ul><ul><li>Marco Roos (University of Amsterdam) </li></ul><ul><li>Rodrigo Lopez (EBI) </li></ul><ul><li>Tom Oinn (EBI) </li></ul><ul><li>Andy Law (Roslin Institute) </li></ul><ul><li>Graham Cameron (EBI) </li></ul>
    7. 7. They came up with more than seven. But I beat them into submission. Many are highly inter-related. Hopefully they are all too familiar.
    8. 8. Sins <ul><li>Parochialism and Insularity </li></ul><ul><li>Exceptionalism </li></ul><ul><li>Autonomy or death! </li></ul><ul><li>Vanity: Pride and Narcissism </li></ul><ul><li>Monolith Megalomania </li></ul><ul><li>Scientific method Sloth </li></ul><ul><li>Instant Gratification </li></ul>
    9. 9. <ul><li>Parochialism </li></ul><ul><li>“ being provincial, being narrow in scope, or considering only small sections of an issue.” http://en.wikipedia.org/wiki/Parochialism </li></ul><ul><li>Insularity </li></ul><ul><li>“ a person, group of people, or a community that is only concerned with their limited way of life and not at all interested in new ideas or other cultures.” http://en.wikipedia.org/wiki/Insularity </li></ul>Sin 1
    10. 10. Reinvention <ul><li>Reinventing the Wheel. Rediscovering the same problems. Rediscovery of techniques & methods. </li></ul><ul><li>Creating… </li></ul><ul><li>Yet another identity scheme. Yet another representation mechanism for data. </li></ul><ul><li>Yet another ontology. Yet another data warehouse. </li></ul><ul><li>Yet another integration framework. Yet another query or ontology or workflow language. </li></ul><ul><li>Result? Misery. Or more work for the boys. </li></ul>
    11. 11. Comparative Genomics? Tisk! It's Comparative Bioinformatics. Bioinformatics is about mapping one schema to another, one format to another, one id scheme to another. What a waste of time. What a handy distraction from doing some Real Science™.
    12. 12. Names and Identity Crisis Q92983 O00275 O00276 O00277 O00278 O00279 O00280 O14865 O14866 P78507 <ul><li>WSL-1 protein </li></ul><ul><li>Apoptosis-mediating receptor DR3 </li></ul><ul><li>Apoptosis-mediating receptor TRAMP </li></ul><ul><li>Death domain receptor 3 </li></ul><ul><li>WSL protein </li></ul><ul><li>Apoptosis-inducing receptor AIR </li></ul><ul><li>Apo-3 </li></ul><ul><li>Lymphocyte-associated receptor of death </li></ul><ul><li>LARD </li></ul><ul><li>GENE: Name=TNFRSF25 </li></ul>Q93038 = Tumor necrosis factor receptor superfamily member 25 precursor P78515 Q93036 Q93037 Q99722 Q99830 Q99831 Q9BY86 Q9UME0 Q9UME1 Q9UME5 Annotation history: http://www.expasy.org/uniprot/Q93038
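The identity crisis on this slide can be made concrete with a small lookup table. The following is a minimal sketch, assuming (as the slide layout suggests, but without checking against UniProt) that every accession and name listed refers to the entry Q93038. The function name `resolve` and the in-memory index are illustrative and are not part of any UniProt API.

```python
from typing import Optional

# Primary accession as given on the slide.
PRIMARY = "Q93038"

# Accessions copied from the slide; treating them all as secondary
# accessions of Q93038 is an assumption based on the slide alone.
SECONDARY = """
Q92983 O00275 O00276 O00277 O00278 O00279 O00280 O14865 O14866 P78507
P78515 Q93036 Q93037 Q99722 Q99830 Q99831 Q9BY86 Q9UME0 Q9UME1 Q9UME5
""".split()

# Names copied from the slide.
NAMES = [
    "WSL-1 protein",
    "Apoptosis-mediating receptor DR3",
    "Apoptosis-mediating receptor TRAMP",
    "Death domain receptor 3",
    "WSL protein",
    "Apoptosis-inducing receptor AIR",
    "Apo-3",
    "Lymphocyte-associated receptor of death",
    "LARD",
    "TNFRSF25",
    "Tumor necrosis factor receptor superfamily member 25 precursor",
]

# Case-insensitive index from every known label to the primary accession.
_INDEX = {label.casefold(): PRIMARY for label in [PRIMARY, *SECONDARY, *NAMES]}


def resolve(identifier: str) -> Optional[str]:
    """Return the primary accession for a known accession or name, else None."""
    return _INDEX.get(identifier.strip().casefold())


if __name__ == "__main__":
    for label in ["O00275", "lard", "Apo-3", "P12345"]:
        print(label, "->", resolve(label))
```

Thirty-one labels for one protein is exactly the mapping burden the slide complains about; a real resolver would have to track the annotation history rather than a static list.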
    13. 13. Andy Law's Third Law <ul><li>“The number of unique identifiers assigned to an individual is never less than the number of Institutions involved in the study”... and is frequently many, many more. </li></ul>http://bioinformatics.roslin.ac.uk/lawslaws.html
    14. 14. The Selfish Scientist <ul><li>“ A biologist would rather share their toothbrush than their (gene) names” </li></ul><ul><li>Mike Ashburner </li></ul><ul><li>Professor of Genetics </li></ul><ul><li>University of Cambridge </li></ul><ul><li>UK </li></ul><ul><li>Amongst the many </li></ul>
    15. 15. Some causes of the Identity Crisis <ul><li>Conflation of the ID for a thing, something to call the thing, a description of the thing, with the thing itself (reference/referent) </li></ul><ul><li>Internal vs external IDs </li></ul><ul><li>Opaque vs human-interpretable IDs </li></ul><ul><li>Situation-dependent 'parts' of a resource get different IDs </li></ul><ul><ul><li>e.g. the gene in a disease process vs the disease in a metabolic process </li></ul></ul><ul><li>Annotation attribution and log differentiation </li></ul><ul><ul><li>Two organisations attach annotations to two IDs, state they are referring to the same thing, they now have provenance about which of them asserted which facts </li></ul></ul>[Pocock]
    16. 16. Id Reinvention <ul><li>Global Identity naming mechanism for data objects in the Life Sciences </li></ul><ul><li>LSIDs and URIs and PURLs. WS-Naming and all its friends </li></ul><ul><li>Half the debaters haven’t actually read the LSID or URL or PURL specs. Or provided use cases. </li></ul><ul><li>Web Pages are not Data Assets. </li></ul><ul><li>“ you could do this with HTTP based identifiers given <insert hack>”. </li></ul><ul><li>The debate rages! 124 messages in the last week. </li></ul><ul><li>W3C Semantic Web Health Care and Life Sciences Interest Group [email_address] </li></ul>urn:lsid:uniprot.org:{db}:{id} http://purl.uniprot.org/{db}/{id}
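The two identifier templates at the end of this slide carry the same `{db}` and `{id}` parts, so one can be rewritten into the other mechanically. The sketch below shows that rewriting in Python. It only covers the two templates exactly as written on the slide; the function names and the `uniprot` database value in the usage are assumptions for illustration, and the code does not implement LSID resolution.

```python
import re

# Templates exactly as they appear on the slide.
LSID_TEMPLATE = "urn:lsid:uniprot.org:{db}:{id}"
PURL_TEMPLATE = "http://purl.uniprot.org/{db}/{id}"

# Patterns matching each template; {db} and {id} may not contain ':' or '/'.
_LSID_RE = re.compile(r"^urn:lsid:uniprot\.org:(?P<db>[^:/]+):(?P<id>[^:/]+)$")
_PURL_RE = re.compile(r"^http://purl\.uniprot\.org/(?P<db>[^:/]+)/(?P<id>[^:/]+)$")


def lsid_to_purl(lsid: str) -> str:
    """Rewrite an LSID of the slide's form into the PURL form."""
    match = _LSID_RE.match(lsid)
    if match is None:
        raise ValueError(f"expected {LSID_TEMPLATE}, got {lsid!r}")
    return PURL_TEMPLATE.format(db=match["db"], id=match["id"])


def purl_to_lsid(purl: str) -> str:
    """Rewrite a PURL of the slide's form into the LSID form."""
    match = _PURL_RE.match(purl)
    if match is None:
        raise ValueError(f"expected {PURL_TEMPLATE}, got {purl!r}")
    return LSID_TEMPLATE.format(db=match["db"], id=match["id"])


if __name__ == "__main__":
    # "uniprot" as the {db} value is a placeholder chosen for this example.
    print(lsid_to_purl("urn:lsid:uniprot.org:uniprot:Q93038"))
```

That the mapping is this short is part of the slide's point: the debate is about resolution, semantics and governance, not about string syntax.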
    17. 17. Andy Law’s First (Format) Law <ul><li>“The first step in developing a new genetic analysis algorithm is to decide how to make the input data file format different from all pre-existing analysis data file formats.” </li></ul><ul><li>Different codes to signify the sex of animals. </li></ul><ul><li>crimap uses '0' female and '1' male. </li></ul><ul><li>Keightly algorithm: '1' female and '0' male. </li></ul><ul><li>Knott & Haley QTL analysis algorithm: '1' female and '2' male. </li></ul><ul><li>When they use '3' and '4', then we'll know they're doing it deliberately. </li></ul>http://bioinformatics.roslin.ac.uk/lawslaws.html
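The three sex-code conventions listed on this slide are enough to show what the resulting glue code looks like. This is a minimal sketch: the code tables come straight from the slide, while the tool keys (`crimap`, `keightly`, `knott_haley`) and the function names are labels invented for this example, not identifiers used by those programs.

```python
from enum import Enum


class Sex(Enum):
    FEMALE = "female"
    MALE = "male"


# Code tables as stated on the slide (tool keys follow the slide's spelling).
SEX_CODES = {
    "crimap": {"0": Sex.FEMALE, "1": Sex.MALE},
    "keightly": {"1": Sex.FEMALE, "0": Sex.MALE},
    "knott_haley": {"1": Sex.FEMALE, "2": Sex.MALE},
}


def decode_sex(tool: str, code: str) -> Sex:
    """Interpret a tool-specific code as a Sex value."""
    try:
        return SEX_CODES[tool][code]
    except KeyError:
        raise ValueError(f"unknown sex code {code!r} for tool {tool!r}") from None


def encode_sex(tool: str, sex: Sex) -> str:
    """Produce the tool-specific code for a Sex value."""
    for code, meaning in SEX_CODES[tool].items():
        if meaning is sex:
            return code
    raise ValueError(f"tool {tool!r} has no code for {sex}")


def translate(code: str, source: str, target: str) -> str:
    """Convert a sex code from one tool's convention to another's."""
    return encode_sex(target, decode_sex(source, code))


if __name__ == "__main__":
    # The same character '1' means male, female and female respectively.
    for tool in SEX_CODES:
        print(tool, "'1' ->", decode_sex(tool, "1").value)
```

Passing a file between any two of these tools without a step like `translate` silently swaps or corrupts the sex column, which is why the format law bites.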
    18. 18. <ul><li>EMBOSS lists more than 20 different sequence formats. </li></ul><ul><li>“ Nearly every collection of sequences that dares call itself a database has stored its data in its own format.” </li></ul><ul><li>http://emboss.sourceforge.net/docs/themes/SequenceFormats.html </li></ul>
    19. 19. Reinvention of Ontology tools <ul><li>OBO and OWL ? </li></ul><ul><li>OBOEdit and Protégé-OWL ? </li></ul>The Montagues and The Capulets.. Let me get my bullet-proof vest …
    20. 20. The “Oh No” OBO Pragmatists Aesthetics Philosophers Life Scientists Capulets Knowledge Representation Montagues A means to an end Content providers Theoreticians The end Mechanism providers Spiritual guides The Montagues and The Capulets …SOFG 2004, KCap 2005, Comparative and Functional Genomics 2004 Endurants, Perdurants, Being, Substance, Event
    21. 21. Yet another database … <ul><li>Organism databases </li></ul><ul><li>Counter example </li></ul><ul><ul><li>Generic Model Organism Database Toolkit. </li></ul></ul>FlyBase, WormBase, SGD, BeeBase and many other large and small community databases
    22. 22. BioBabel <ul><li>bioperl </li></ul><ul><li>biojava </li></ul><ul><li>biopython </li></ul><ul><li>bioruby </li></ul><ul><li>biophp </li></ul><ul><li>biosql </li></ul><ul><li>biouml </li></ul><ul><li>biofoo </li></ul><ul><li>biobar </li></ul>
    23. 23. Integration <ul><li>Workflows Management Systems </li></ul><ul><li>Counter example </li></ul><ul><li>Taverna  </li></ul><ul><li>http://www.mygrid.org.uk </li></ul>
    24. 24. <ul><li>Reinvent wheels in creating 'Transcriptional Units' ('genes' derived from ESTs and mRNA), within species and between species. </li></ul><ul><li>This holds for much genome-assembly-related work. </li></ul><ul><li>Genome data compilers for E. coli, Drosophila, Plant species, etcetera reuse each other's code? </li></ul><ul><li>Usually something new is added, but large parts could have been reused. </li></ul>
    25. 25. Any more? <ul><li>Another Web 2.0 Web Site? Another Web interface to a database? Another portal? </li></ul><ul><li>Whole database systems. ACeDB is not a lone case. </li></ul><ul><li>Genome data compilers for E. coli, Drosophila, Plant species, etcetera reuse each other's code? </li></ul><ul><li>Text miners require synonyms and reinvent the wheel to get them in many cases. </li></ul><ul><li>Add your favourite here…. </li></ul>
    26. 26. Reuse Rocks. Collaboration through workflow and web services <ul><li>VL-e Project </li></ul><ul><li>“ instant collaboration” with Martijn Schuemie (Rotterdam) through a web service that discloses their protein synonym data. </li></ul><ul><li>Exchanging services and (sub)workflows with food scientists. </li></ul><ul><li>Web services make that easier. </li></ul>
    27. 27. Recycling, Reuse, Repurposing <ul><li>A Trypanosomiasis in Cattle workflow (by Paul) reused without change for Trichuris muris Infection (by Jo). </li></ul><ul><li>Identified the biological pathways believed to be involved in the ability of mice to expel the parasite. </li></ul><ul><li>Workflows are memes. Scientific commodities. To be exchanged and traded and vetted and mashed. Users add value. </li></ul>
    28. 28. Warning! Reuse is Hard <ul><li>Writing reusable workflows is hard. </li></ul><ul><ul><li>Local services </li></ul></ul><ul><ul><li>Permissions. Licences </li></ul></ul><ul><ul><li>What does it DO? </li></ul></ul><ul><li>Writing reusable services is hard. </li></ul><ul><ul><li>What does it DO? </li></ul></ul><ul><ul><li>Predicting the unknown required by the unknown. </li></ul></ul><ul><li>Finding workflows, services and tools is hard </li></ul><ul><ul><li>Where do you go?? What does it DO?? </li></ul></ul><ul><li>Creating web services is still a bottleneck. For quick solutions it is still seen as too much extra trouble. </li></ul>
    29. 29. Bullying and the Borg <ul><li>If a group is working in a field, you get bullied for trying out something different. </li></ul><ul><ul><li>Can YOU think of an example?? </li></ul></ul><ul><li>You may actually be doing something different, but you use some common words. </li></ul><ul><li>“ Why do this? It's already been solved by Foo - the massively unwieldy, slow-moving, monolithic, meeting paralysed international effort for Things Mentioning Foo”. </li></ul>
    30. 30. Reinvention or Invention? Pre-dating <ul><li>BioMOBY pre-dates (Semantic) Web service revolution </li></ul><ul><li>OBO and OBO-Edit pre-dates OWL and Protégé-OWL </li></ul><ul><ul><li>20 years of Knowledge Representation. </li></ul></ul><ul><li>Taverna pre-dates a reliable Open Source BPEL engine </li></ul><ul><ul><li>20 years of functional programming. </li></ul></ul><ul><li>There ARE features that Bioinformatics needs that other solutions don’t cater for. </li></ul>
    31. 31. A few months in the laboratory (or the computer) can save a few hours in the library (or on Google). Westheimer's Law (with additions).
    32. 32. No tool is an island… <ul><li>Assume </li></ul><ul><ul><li>only we will use it, whatever it may be. </li></ul></ul><ul><ul><li>that it will be freestanding and unlinked to anything else. </li></ul></ul><ul><ul><li>that it will always work and will keep on working. </li></ul></ul><ul><ul><li>That everyone will understand it. </li></ul></ul><ul><li>“Well I know what I mean. And so does my mate. So I don’t need to specify it. Or document it properly. Or keep the metadata up to date.” </li></ul><ul><li>Never mind the interface, just look at my implementation! </li></ul><ul><li>Metadata matters. Models matter. </li></ul><ul><li>Interfaces matter. Services matter. </li></ul>
    33. 33. I know what it means... <ul><ul><ul><li>A hacker who studied ontology </li></ul></ul></ul><ul><ul><ul><li>Was famed for his sense of frivolity </li></ul></ul></ul><ul><ul><ul><li>When his program inferred </li></ul></ul></ul><ul><ul><ul><li>That Clyde ISA Bird † </li></ul></ul></ul><ul><ul><ul><li>He blamed – not his code – but zoology </li></ul></ul></ul><ul><ul><ul><li>† Clyde ISA Elephant </li></ul></ul></ul>“ AI limericks” by Henry Kautz http://www.cs.washington.edu/homes/kautz/misc/limericks.html
    34. 34. Not just bioinformatics Computer Science is Guilty!
    35. 35. Why don’t biologists modularise OWL ontologies properly? Er, well, like how should we do it “properly” and where are the tools to help us? We don’t know and we haven’t got any. But here are some vague guidelines. W3C Semantic Web for Life Sciences mailing list, 2005
    36. 36. “ I don't blame them [MGED/PSI community] because to truly comprehend RDF/OWL is not an easy task, it takes not just the understand of technology itself but more so the vision on how things should and can work in SW.” “ One thing we have to remember is that biologists are building ontologies to do a job of work. They are not produced as some end of CS or SW research” “ Principles are all well and good, but we should know from decades of software engineering that saying "do it properly" isn't a solution. We need tooling and methodologies that do not in themselves hinder a domain specialist. In many cases it is easier to re-develop than re-use or even cut-and-paste from an existing ontology than it is to muck around "doing it properly"” “ There is actually a gap between the view of ontology for CS people and for biological people. The ontology in biologist's eyes are more of a treaty than logical representation, that in CS view is on the reverse of that view. It needs dialog to bring the view to a middle ground and mechanisms to stretch to both directions.”
    37. 37. Standards are boring (but important) <ul><li>“ Blue collar Science” (John Quackenbush) </li></ul><ul><li>Nobody is going to win a Nobel prize for creating a standard schema, ontology or whatever. (Duncan Hull) </li></ul><ul><li>“ Standardise where you need standards, don’t where you don’t. Standardise messages not structures” (Graham Cameron) </li></ul><ul><li>Drive on the left or the right? </li></ul>
    38. 38. Self promotion <ul><li>Not making shareable reusable software, because we can publish every single monolithic software solution. </li></ul><ul><li>And get promoted. </li></ul><ul><li>Applies equally to databases and ontologies. </li></ul><ul><li>Production vs Novelty </li></ul>Not all software and databases are equal.
    39. 39. Research – Production Confusion <ul><li>Novelty vs Standards </li></ul><ul><li>Neither the funding nor the social structures of bioinformatics allow us to treat these two differently in any principled manner </li></ul><ul><li>How do you get funding for production software other than claiming to be researching stuff? </li></ul><ul><li>How do you get a publication out of a bit of research software without claiming a potential user-base? </li></ul>
    40. 40. Trust I don’t trust your code I don’t trust your data I don’t trust you will still be around in 1 year
    41. 41. Sin 2 <ul><li>Exceptionalism </li></ul><ul><li>Biologist exceptionalism </li></ul><ul><li>Biological exceptionalism </li></ul><ul><li>Biology exceptionalism </li></ul><ul><li>A cause of Reinvention Syndrome </li></ul><ul><li>“ Bioinformatics is special” </li></ul><ul><li>“ Domain-specific outcomes require domain-specific approaches and technologies” </li></ul>
    42. 42. Biologist exceptionalism <ul><li>I know there is already a gene name for that gene, but, I don't like it and it doesn't fit in with my schema. </li></ul><ul><li>It would be better if I wrote the script I need so I know what it does, how it does it and how to modify it later because I haven’t specified what it was supposed to do in the first place. </li></ul>I’m different. We are all individuals.
    43. 43. Biological exceptionalism <ul><li>“ Biology is all exception.” </li></ul><ul><li>“ Don’t complicate everyone’s life for the sake of a few esoteric cases”. Cameron’s 5th Commandment of Curation </li></ul><ul><li>Exceptionalism paralysis. </li></ul><ul><li>Gather requirements expansively, prune ruthlessly </li></ul><ul><li>The EMBL/GenBank/DDBJ/Feature Table </li></ul>
    44. 44. We are so much more complex… <ul><li>“ There are proteins, and there are records about proteins. Records come in different formats. If I make a statement using this url, is it about the record? or the protein?” Alan Ruttenberg </li></ul><ul><li>“ [Usually] we have one entry per gene. We have several entries for a single gene when description of variations are too complicated to describe in FT lines (of course, this criteria depends on the annotator). For viruses, it is much more messy, due to ribosomal frame-shifts. Formalise that!” Eric Jain UniProtDB </li></ul><ul><li>er…decomposition and untangling? </li></ul>
    45. 45. Other Sciences…. <ul><li>CERN: UML meta-modelling mechanisms in order to migrate models over time without losing data. </li></ul><ul><li>Ensembl: “Our data models are complicated - I don't think specifying them will help. We need to understand them instead.” </li></ul><ul><li>And? </li></ul><ul><li>Confusing meta-mechanisms with models </li></ul>
    46. 46. Biology Exceptionalism <ul><li>Biology is harder than anything else in the whole wide world because there is lots of it and it's complicated. </li></ul><ul><li>Drawing graphs of data sets over time. </li></ul><ul><li>Physics wipes you off the map. </li></ul><ul><li>The real problem is complexity not scale. </li></ul><ul><li>The number of data sets, their diversity and how they overlap. </li></ul><ul><li>How they change. </li></ul><ul><li>Their Reliability. </li></ul>
    47. 47. Sin 3 <ul><li>Autonomy or death! </li></ul><ul><li>Combined with churn and indifference to users. </li></ul><ul><li>Compounded by the Early Adopter tendency of the community and a monopoly mentality. </li></ul><ul><li>“ Hell is other people’s systems” as Jean-Paul Sartre would have said if he had been a bioinformatician. </li></ul>
    48. 48. Autonomy is death! <ul><li>Change my interface / format whenever I feel like it, despite the fact I wanted lots of users and I have lots of users who depend on this. And I won’t bother to debug either or provide backwards compatibility. </li></ul><ul><ul><li>BioMART changed 4 times in the past year. </li></ul></ul><ul><ul><li>NCBI changes as it fancies. </li></ul></ul><ul><ul><li>Ensembl relational schema. </li></ul></ul><ul><ul><li>Early BioJava. </li></ul></ul><ul><li>This is just unprofessional. </li></ul><ul><li>Stable Metadata matters. Stable Models matter. Stable Interfaces matter. Stable Services matter. </li></ul>
    49. 49. Lincoln Stein said a while ago… <ul><li>An interface is a contract between data provider and data consumer </li></ul><ul><li>Document interface; warn if it is unstable </li></ul><ul><li>Do not make changes lightly </li></ul><ul><ul><li>Even little fiddly changes can break things </li></ul></ul><ul><ul><li>Provide plenty of advance warning </li></ul></ul><ul><li>When possible, maintain legacy interfaces until clients can port their scripts </li></ul><ul><li>Support as many interfaces as you can </li></ul><ul><li>HTML (least desired) </li></ul><ul><li>Text only (better) </li></ul><ul><li>HTTP-XML (even better) </li></ul><ul><li>SOAP-XML (sweet!) </li></ul><ul><li>Easy Interfaces + Power User Interfaces </li></ul>… and he could say it again today.
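The advice on this slide (treat the interface as a contract, warn before changing it, keep legacy interfaces alive until clients can port) can be sketched in a few lines of Python. Everything named here is hypothetical: `get_record` stands in for a provider's current interface, `fetch` for an older one that existing client scripts still call, and the one-entry record table is only there to make the example runnable.

```python
import warnings

# Stand-in data store; a real provider would query a database.
_RECORDS = {
    "Q93038": "Tumor necrosis factor receptor superfamily member 25 precursor",
}


def get_record(accession: str, fmt: str = "text") -> str:
    """Current interface: supports more than one output format."""
    description = _RECORDS[accession]
    if fmt == "text":
        return f"{accession}\t{description}"
    if fmt == "xml":
        return f'<record accession="{accession}">{description}</record>'
    raise ValueError(f"unsupported format {fmt!r}; expected 'text' or 'xml'")


def fetch(accession: str) -> str:
    """Legacy interface, kept so existing client scripts keep working.

    It delegates to the current interface and gives advance warning
    that callers should port to get_record().
    """
    warnings.warn(
        "fetch() is deprecated; use get_record(accession, fmt='text') instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return get_record(accession, fmt="text")


if __name__ == "__main__":
    warnings.simplefilter("always")
    print(fetch("Q93038"))        # old scripts still get the old output
    print(get_record("Q93038", fmt="xml"))
```

The design choice is that the legacy function changes nothing about its output; only the warning channel tells clients that a move is coming.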
    50. 50. Law's Second Law <ul><li>“Error messages should never be provided” corollary... “If error messages are provided, they should be utterly cryptic so as to convey as little information as possible to the end user” </li></ul>
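Turning this law on its head, an error message can say where the problem is, what was found and what was expected. A minimal sketch follows; the exception class, the function, the file name in the usage and the choice of the crimap-style '0'/'1' sex codes are all assumptions made for illustration.

```python
class InputFormatError(ValueError):
    """Error that tells the end user where and why parsing failed."""

    def __init__(self, path, line_no, field, value, expected):
        self.path = path
        self.line_no = line_no
        self.field = field
        self.value = value
        super().__init__(
            f"{path}, line {line_no}: field '{field}' has value {value!r}; "
            f"expected {expected}"
        )


def parse_sex_field(value: str, *, path: str, line_no: int) -> str:
    """Parse a sex code, assuming '0' means female and '1' means male."""
    allowed = {"0": "female", "1": "male"}
    if value not in allowed:
        raise InputFormatError(
            path, line_no, "sex", value, "'0' (female) or '1' (male)"
        )
    return allowed[value]


if __name__ == "__main__":
    try:
        parse_sex_field("3", path="pedigree.txt", line_no=12)
    except InputFormatError as error:
        print(error)
```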
    51. 51. Workflow commodities <ul><li>Workflow published with its paper and its data set. </li></ul><ul><li>So what happens when I want to run this workflow again? </li></ul><ul><li>Is the service dead? </li></ul><ul><li>Is the dataset still there? </li></ul><ul><li>Was it designed to be reproduced or reused in the first place? </li></ul>
    52. 52. The myGrid Semantic Sweatshop <ul><li>Services and Workflows in the wild. </li></ul><ul><li>Curated by experts using an ontology. </li></ul><ul><ul><li>Supplied by service providers (like EMBOSS) in text. </li></ul></ul><ul><ul><li>Or annotations (like BioMOBY, but they aren’t good annotations!) </li></ul></ul><ul><ul><li>Tagged by the Masses. </li></ul></ul><ul><li>Multi-perspective </li></ul><ul><ul><li>Scientist for finding. </li></ul></ul><ul><ul><li>Machinery for validation. </li></ul></ul><ul><li>Hard work. Look how tired they are. </li></ul>Semantic
    53. 53. The myGrid Semantic Sweatshop notice how tired they look Franck Tanoh Katy Wolstencroft
    54. 54. Churn, Churn, Churn <ul><li>“ Stability is more important than Standards or Smartness. Discuss” </li></ul><ul><li>Constant churn and change for change's sake. </li></ul><ul><ul><li>Impact on everyone else who uses the previous mechanism. </li></ul></ul><ul><li>A few voices, very loud, vested interest, for their application, win. </li></ul><ul><li>You know what? Why don’t we stick with something for a while and rally behind it? Or at least figure out the cost of change. </li></ul><ul><li>Maybe this is a sin inherited from Computer Science. </li></ul>
    55. 55. Churn, Churn, Churn <ul><li>We expect the content to change, but why does everything else? </li></ul><ul><li>Constant churn and change for change's sake. </li></ul><ul><li>Maybe this is a sin inherited from Computer Science. </li></ul><ul><li>The W3C Identity War. Web Services vs REST </li></ul><ul><li>Impact on everyone else who uses the previous mechanism. </li></ul><ul><li>A few voices, very loud, vested interest, for their application, win. </li></ul><ul><li>You know what? Why don’t we stick with something for a while and rally behind it? Or at least figure out the cost of change. </li></ul><ul><li>“ Stability is more important than Standards or Smartness. Discuss” </li></ul>
    56. 56. Sin 4 <ul><li>Vanity </li></ul><ul><li>Pride </li></ul><ul><li>Narcissism </li></ul><ul><li>conceit, egotism or simple selfishness. </li></ul><ul><li>Applied to a social group, denotes elitism or an indifference to the plight of others </li></ul>
    57. 57. I know it all. <ul><li>Claiming to know everything about biology and everything about computers. </li></ul><ul><li>This is really irritating to both biologists and computer scientists. </li></ul><ul><li>Even they don’t claim to know everything about biology or computer science. </li></ul><ul><li>Computer scientists do know a lot of stuff. And they publish too. </li></ul><ul><li>“ Biologists are the experts on everything because we produce the data” </li></ul>And what would you suggest, Mr. Smartie Pants?
    58. 58. Think like me! <ul><li>Building interfaces that only you can use. </li></ul><ul><li>Not actually using your tools in the field. </li></ul><ul><li>I understand workflows </li></ul><ul><li>Workflows are for biologists. </li></ul><ul><li>My granny can do workflows... </li></ul><ul><li>Designing good experiments is hard. </li></ul><ul><li>Workflows are computational experimental protocols. Ergo…. </li></ul><ul><li>Writing workflows should be expected to be hard. </li></ul><ul><li>Writing good workflows is really hard. </li></ul><ul><li>Writing good reusable workflows is really really hard. </li></ul>Misunderstanding and disrespecting users
    59. 59. A good User Experience outweighs smart features. Can I use it? Is the user interface familiar? Does it fit with my needs?
    60. 60. Gain-Pain pay-off <ul><li>Just enough, just in time </li></ul>Gain Pain Very BAD Good, but Unlikely Just right
    61. 61. Sin 5 <ul><li>Monolith Megalomania </li></ul><ul><li>delusions of grandeur. </li></ul><ul><li>obsession with grandiosity and extravagance. </li></ul><ul><li>Data mining - “my data is mine, and your data is mine” </li></ul>
    62. 62. More, more, more! <ul><li>Integration – the more the merrier. No. </li></ul><ul><ul><li>Every link is a potential dead link. </li></ul></ul><ul><ul><li>Every dependency can find its way on to your critical path. </li></ul></ul><ul><ul><li>Monolithic solutions always fail. </li></ul></ul><ul><li>Put it all in a warehouse. </li></ul><ul><ul><li>ATLAS, MRS, e-Fungi, GIMS, Medicel Integrator, MIPS, BioMART blah blah blah… </li></ul></ul><ul><ul><li>Toolkits: Information Integrator, GMOD, BioMART, BioWarehouse, blah blah… </li></ul></ul><ul><ul><li>50% of warehouses fail. </li></ul></ul><ul><li>“Uber-tools” and “Uber-databases” </li></ul><ul><ul><li>Biomart, Ensembl, etc etc…. </li></ul></ul>[Cameron]
    63. 63. The trouble with warehouses <ul><li>30% of data migration projects fail (Source: Standish Group) </li></ul><ul><li>50% of data warehousing / Business Intelligence projects fail (Source: NCR) </li></ul><ul><li>“ Warehouses work? Piffle. They never manage to maintain synchrony with the source data. Mostly they fall down of their own weight!” Graham Cameron, EMBL-EBI </li></ul><ul><li>"Our ability to capture and store data far outpaces our ability to process and exploit it. This growing challenge has produced a phenomenon we call the data tombs, or data stores that are effectively write-only; data is deposited to merely rest in peace, since in all likelihood it will never be accessed again. Data tombs also represent missed opportunities." Usama Fayyad Yahoo! Research! Laboratories! </li></ul><ul><li>“We believe that attempts to solve the issues of scientific data management by building large, centralised, archival repositories are both dangerous and unworkable” Microsoft 2020 Science report. </li></ul>
    64. 64. More More More <ul><li>“ Emacs of Biology” </li></ul><ul><li>End-user apps/libraries in bioinformatics workbenches with loads of crap bundled in, none of it kept up to date, none of it properly integrated. </li></ul><ul><li>Keep it simple and modular </li></ul><ul><li>Don’t reinvent Eclipse. </li></ul>
    65. 65. Mash-Up Data Marshalling <ul><li>Content syndication and feeds </li></ul><ul><li>Emphasis shifts to the user creating specific integration by mapping. </li></ul><ul><li>Just in time, just enough design </li></ul><ul><li>On demand integration – or rather, aggregation. </li></ul>Mash Up Application User interface Protocol objects Protocol Protocol
    66. 66. Distributed Annotation System Mash-Up http://www.biodas.org Reference Server AC003027 AC005122 M10154 Annotation Server Annotation Server AC003027 M10154 WI1029 AFM820 AFM1126 WI443 AC005122 Annotation Server
    67. 67. Sin 6 <ul><li>Scientific Method Sloth </li></ul><ul><li>It's easier to think of a new name than use someone else’s. </li></ul><ul><li>I want my own view over data and views are difficult, so I’ll create my own database. </li></ul><ul><li>Leads to Reinvention, Exceptionalism </li></ul><ul><li>Often the result of Instant Gratification </li></ul>
    68. 68. Ennui <ul><li>Garbage in, garbage out </li></ul><ul><ul><li>Running analysis over the wrong datasets </li></ul></ul><ul><ul><li>E.g. Identifying chicken proteins in mouse cells. </li></ul></ul><ul><li>Configuration traditionalism </li></ul><ul><ul><li>Not changing the parameters of BLAST. Ever. </li></ul></ul><ul><li>Top list ennui </li></ul><ul><ul><li>If there is a list only looking at the first one. </li></ul></ul><ul><ul><li>Look no further than the first Blast hit / first Google hit. </li></ul></ul><ul><li>Arbitrary cut-offs on rank-ordered result list </li></ul><ul><ul><li>Absolute truth above, absolute falsehood below </li></ul></ul><ul><ul><li>E.g. differentially expressed genes in microarray analyses. </li></ul></ul>
    69. 69. It’s black and white <ul><li>Arbitrary cut-offs on rank-ordered result list </li></ul><ul><li>Everything above is absolute truth and everything below complete falsehood. </li></ul><ul><ul><li>Sequence similarity when looking for orthologs. </li></ul></ul><ul><ul><li>Protein identifications using Mascot scores. </li></ul></ul><ul><ul><li>Differentially expressed genes in microarray analyses. </li></ul></ul>
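One way to soften the black-and-white cut-off, sketched under stated assumptions: rather than a single threshold, report a borderline band around it so that near-misses stay visible. The function name `triage`, the `margin` parameter and the example scores are all hypothetical; the right band width depends on the scoring method and is not something this sketch can decide.

```python
def triage(scored_hits, cutoff, margin):
    """Split (name, score) pairs into accept / borderline / reject around a cut-off."""
    result = {"accept": [], "borderline": [], "reject": []}
    # Highest score first, so each bucket stays rank-ordered.
    for name, score in sorted(scored_hits, key=lambda hit: hit[1], reverse=True):
        if score >= cutoff + margin:
            bucket = "accept"
        elif score > cutoff - margin:
            bucket = "borderline"
        else:
            bucket = "reject"
        result[bucket].append((name, score))
    return result


# Made-up scores: geneB and geneC sit on either side of the cut-off.
hits = [("geneA", 9.5), ("geneB", 5.1), ("geneC", 4.9), ("geneD", 1.0)]
bands = triage(hits, cutoff=5.0, margin=0.5)
```

With a hard cut-off at 5.0, geneB would be "true" and geneC "false" on a difference of 0.2; here both land in the borderline bucket with their scores intact.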
    70. 70. Quality Delusions <ul><li>The bioinformatics does not have to be sound, because we only trust wet-lab results anyway. </li></ul><ul><li>Worrying about errors in experimental data but believing that derived data is always true. </li></ul><ul><li>Believing TrEMBL is always right. </li></ul><ul><li>Believing computational gene predictions are always correct. </li></ul>
    72. 72. Black Box Science <ul><li>Producing irreproducible bioinformatics analyses </li></ul><ul><ul><li>Not collecting the provenance of the analysis. </li></ul></ul><ul><ul><li>Not testing during software development. </li></ul></ul><ul><li>Try re-running experiments described in the journal Bioinformatics more than 5 years ago. </li></ul><ul><li>UniGene </li></ul><ul><ul><li>What is happening during UniGene clustering? </li></ul></ul><ul><ul><li>‘Human’ descriptions (via NCBI) are not exact. </li></ul></ul><ul><ul><li>The Human Transcriptome Map project and other microarray analysts ended up reclustering UniGene [Marco Roos]. </li></ul></ul>
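A minimal sketch of what "collecting the provenance of the analysis" could look like at the level of a single step: record the step name, its parameters, a checksum of the input and a timestamp alongside the result. The names `run_with_provenance` and `count_motif` are hypothetical and the analysis is a toy; a real record would also need tool and database versions.

```python
import hashlib
import json
from datetime import datetime, timezone


def run_with_provenance(step_name, func, data, **params):
    """Run func(data, **params) and return (result, provenance record)."""
    result = func(data, **params)
    record = {
        "step": step_name,
        "parameters": params,
        # Checksum of the exact input, so a later re-run can verify it.
        "input_sha256": hashlib.sha256(data.encode("utf-8")).hexdigest(),
        "recorded_utc": datetime.now(timezone.utc).isoformat(),
    }
    return result, record


def count_motif(sequence, motif):
    """Toy analysis step: count non-overlapping occurrences of a motif."""
    return sequence.count(motif)


result, record = run_with_provenance(
    "count_motif", count_motif, "ATGATGCC", motif="ATG"
)
# Serialised record that could be stored next to the result.
provenance_json = json.dumps(record, sort_keys=True)
```

The point of the wrapper is that recording happens as a side effect of running the step, so nobody has to remember to write the parameters down afterwards.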
    73. 73. “No experiment is reproducible.” Wyszowski's Law. “An experiment is reproducible until another laboratory tries to repeat it.” Alexander Kohn
    74. 74. Sin 7 <ul><li>Instant Gratification </li></ul><ul><li>Greed? Gluttony? </li></ul><ul><li>Always the immediate return. </li></ul><ul><li>Never investing for the future. </li></ul><ul><li>The quick and dirty fix. </li></ul><ul><li>Refusing to model or abstract. </li></ul><ul><li>Refusing to plan for recording and exchanging. </li></ul><ul><li>Just getting the next quick fix. </li></ul><ul><li>The pressure to deliver now and pay later </li></ul>[Cartoon: www.CartoonStock.com]
    75. 75. Hackery <ul><li>Deliver now, pay later </li></ul><ul><ul><li>Producing crap, non-reusable software because only the biological results matter for publication X. </li></ul></ul><ul><ul><li>Collect! Analyse! Er…now what? </li></ul></ul><ul><li>Spaghetti-ism </li></ul><ul><ul><li>Over-indulgence in Perl </li></ul></ul><ul><ul><li>Over-indulgence in ASCII-art flat files. </li></ul></ul><ul><ul><li>Modelling a system by hacking up XSD fragments on a whiteboard. </li></ul></ul><ul><ul><li>Writing Perl scripts that resemble my high-school BASIC of the 80s. </li></ul></ul>
    76. 76. “I am sure one could reuse large parts of re-annotation for building transcriptome maps, if they only used workflows and ontologies.” Marco Roos, A Biologist and Bioinformatician, VL-e Project, Amsterdam
    77. 77. “Bioinformaticians have reached the standards of the 1980s, while computer scientists are working on the standards of the 2020s, leaving roughly 40 years to bridge.” Marco Roos, A Biologist and Bioinformatician, VL-e Project, Amsterdam
    78. 78. Blind faith in XML <ul><li>It’s in XML, thus all data integration problems are solved. </li></ul><ul><ul><li>Er…no. </li></ul></ul><ul><ul><li>All those vocabularies, e.g. SBML, GenBank XML, etc. </li></ul></ul><ul><li>The good thing about XML is that it is human readable. </li></ul><ul><ul><li>Arrrrgh! </li></ul></ul><ul><li>Insisting that XML is not text. </li></ul><ul><li>Insisting that XML is text. </li></ul>
    79. 79. Blind Faith in Foo. <ul><li>There's a new thing to use, </li></ul><ul><li>we don't understand it yet, </li></ul><ul><li>so it sucks up all the stuff we already know we don't understand. </li></ul><ul><li>Lack of appreciation of exactly what the new technology addresses in itself before trying to make it work for us. </li></ul>
    80. 80. Pioneering development methods <ul><li>Development by anecdote </li></ul><ul><ul><li>I heard in the pub that the way to go was Foo. </li></ul></ul><ul><ul><li>Though I have no idea what Foo is or why it is the way to go. </li></ul></ul><ul><li>Design by hacking </li></ul><ul><ul><li>It would be better if I wrote the script I need so I know what it does, how it does it and how to modify it later because I haven’t specified what it was supposed to do in the first place. </li></ul></ul><ul><ul><li>Hmmm… We call that Extreme Programming or Emergent Semantics or Web 2.0 in CS. </li></ul></ul>
    81. 81. Open Source Blinkers <ul><li>Why does open source have special merit? </li></ul><ul><li>Commercial solutions with added special sauce can rock too. </li></ul><ul><li>Shall I duck? </li></ul>
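    82. 82. Sin Summary Maybe only one “original sin” in bioinformatics. <ul><li>Parochialism and Insularity </li></ul><ul><li>Exceptionalism </li></ul><ul><li>Autonomy or death! </li></ul><ul><li>Vanity: Pride and Narcissism </li></ul><ul><li>Monolith Megalomania </li></ul><ul><li>Scientific Method Sloth </li></ul><ul><li>Instant Gratification </li></ul>Reinvention Churn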
    83. 83. Can we become less sinful? Why do these sins exist? Are bioinformaticians particularly naughty? No naughtier than Computer Scientists. And it’s all very hard. Though they are naughty…
    84. 84. Why? <ul><li>Selfish Scientist – Self-interested Scientist </li></ul><ul><ul><li>Reputation, need to get results right now, win. </li></ul></ul><ul><ul><li>Fear of dependency, fear of being left behind. </li></ul></ul><ul><ul><li>Understand the incentives and barriers to adoption. </li></ul></ul><ul><li>Bioinformatics as it is practiced </li></ul><ul><ul><li>Social and funding structure perpetuates this. </li></ul></ul><ul><ul><li>Production vs Research. </li></ul></ul><ul><li>Real, inherent issues. It is hard. </li></ul><ul><li>Hybrid exhaustion and pressure. </li></ul><ul><ul><li>Biology + Computing + Bioinformatics </li></ul></ul>
    85. 85. Luddism? Surely not! <ul><li>Refusing to have biology go beyond a cottage industry. </li></ul><ul><li>Being scared to do it properly. </li></ul><ul><li>Railing against big science </li></ul><ul><li>The cult of amateurism. </li></ul>[Stevens]
    86. 86. Research – Production Confusion <ul><li>Novelty vs Standards </li></ul><ul><li>Neither the funding nor the social structures of bioinformatics allow us to treat these two differently in any principled manner </li></ul><ul><li>How do you get funding for production software other than claiming to be researching stuff? </li></ul><ul><li>How do you get a publication out of a bit of research software without claiming a potential user-base? </li></ul>
    87. 87. Practical Steps? <ul><li>Create means to share know-how </li></ul><ul><ul><li>Understanding outside my expertise, e.g. sources of error. </li></ul></ul><ul><ul><li>A comprehensive catalogue of web services </li></ul></ul><ul><ul><li>A Facebook for workflow builders. </li></ul></ul><ul><ul><li>Learn from others. Even Computer Science. And other Sciences. </li></ul></ul><ul><ul><li>Try to create a culture of raising quality. Somehow. </li></ul></ul>
    88. 88. Facebook & Bazaar for Workflow e-Scientists myexperiment.org Trials start August 2007!
    89. 89. Delivery Bulge
    90. 90. Practical Steps for IT Platforms? <ul><li>Stop building monolithic solutions </li></ul><ul><ul><li>Strong force in business enterprises </li></ul></ul><ul><li>Component-ise Bioinformatics </li></ul><ul><ul><li>Loosely coupled systems </li></ul></ul><ul><ul><li>Stable APIs, standardised metadata. </li></ul></ul><ul><ul><li>Design to combine. </li></ul></ul><ul><ul><li>Sort out the b***dy naming/id problem </li></ul></ul><ul><ul><li>If you can’t agree, agree on the bridge. </li></ul></ul><ul><li>Raise the level of abstraction </li></ul><ul><ul><li>Less Perl, more workflows  </li></ul></ul><ul><ul><li>Enable users to extract the data they need without hassling you. </li></ul></ul>
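"If you can’t agree, agree on the bridge" can be read as: each group keeps its own identifiers, and all groups agree only on a shared mapping layer. A minimal sketch of that reading follows. The class `IdentifierBridge`, the scheme names `lab_a` and `lab_b`, and every identifier below are hypothetical; real bridges must also handle one-to-many mappings and retired identifiers, which this sketch ignores.

```python
class IdentifierBridge:
    """Maps identifiers from different naming schemes onto shared bridge keys."""

    def __init__(self):
        self._to_bridge = {}    # (scheme, identifier) -> bridge key
        self._from_bridge = {}  # bridge key -> {scheme: identifier}

    def register(self, bridge_key, scheme, identifier):
        """Declare that `identifier` in `scheme` names the thing behind `bridge_key`."""
        self._to_bridge[(scheme, identifier)] = bridge_key
        self._from_bridge.setdefault(bridge_key, {})[scheme] = identifier

    def translate(self, scheme, identifier, target_scheme):
        """Return the equivalent identifier in target_scheme, or None if unmapped."""
        bridge_key = self._to_bridge.get((scheme, identifier))
        if bridge_key is None:
            return None
        return self._from_bridge[bridge_key].get(target_scheme)


# Two made-up labs keep their own names; only the bridge key is shared.
bridge = IdentifierBridge()
bridge.register("bridge:0001", "lab_a", "geneX")
bridge.register("bridge:0001", "lab_b", "GX-17")
```

Here `bridge.translate("lab_a", "geneX", "lab_b")` gives `"GX-17"` and an unknown identifier gives `None`, so neither lab has to rename anything to interoperate.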
    91. 91. Practical Steps? <ul><li>Presume and design for incremental change </li></ul><ul><ul><li>Minimise disruption. </li></ul></ul><ul><li>Presume others use our stuff </li></ul><ul><ul><li>And respect that </li></ul></ul><ul><ul><li>Describe to build Trust </li></ul></ul><ul><li>Presume others add value to our stuff </li></ul><ul><ul><li>Be easily part of loosely coupled systems. Lightweight programming models. </li></ul></ul><ul><ul><li>Presume, and enable, content and function mashing. </li></ul></ul>
    92. 92. Web 2.0 Design Patterns <ul><li>http://www.oreillynet.com/pub/a/oreilly/tim/news/2005/09/30/what-is-web-20.html </li></ul>26/2/2007 | myExperiment | Slide <ul><li>The Long Tail </li></ul><ul><li>Data is the Next Intel Inside </li></ul><ul><li>Users Add Value </li></ul><ul><li>Network Effects by Default </li></ul><ul><li>Some Rights Reserved </li></ul><ul><li>The Perpetual Beta </li></ul><ul><li>Cooperate, Don't Control </li></ul><ul><li>Software Above the Level of a Single Device </li></ul>
    93. 93. Practical Steps? <ul><li>Presume scientific practice naughtiness </li></ul><ul><ul><li>Try to deal with it, or expose it? </li></ul></ul><ul><ul><li>Transparency and accurate collection and reporting. </li></ul></ul><ul><ul><li>Provenance. </li></ul></ul><ul><ul><li>A prerequisite to publication. </li></ul></ul><ul><ul><li>The end of Black Box Science. </li></ul></ul><ul><ul><li>Peer pressure. </li></ul></ul><ul><ul><li>E.g. Workflows, but will a scientist give away their secrets or expose their mistakes? </li></ul></ul>
    94. 94. The Final Word Sin writes histories, goodness is silent.   Thomas Fuller