Keynote talk at Bioinformatics Open Source Conference (BOSC) Special Interest Group at the 15th Annual International Conference on Intelligent Systems for Molecular Biology (ISMB 2007) in Vienna, July 2007 by Carole Goble, University of Manchester.
The Seven Deadly Sins of Bioinformatics Professor Carole Goble [email_address] The University of Manchester, UK The myGrid project OMII-UK
Roadmap
Sins of BioScience
With examples
Why are we like this?
The Selfish Scientist? E-Science is me-Science.
Challenges
Technical
Social
Intractable Problems in Bioinformatics. Have we sinned? Are these part of the intractable problem?
The traditional sins….
Lust
Gluttony
Greed
Sloth
Wrath
Envy
Pride
http://en.wikipedia.org/wiki/Seven_deadly_sins [Stevens and Lord]
Methodology
Email a handful of bioinformaticians.
Stand well back.
Collect.
Edit.
Therapy on the cheap.
We all felt better.
I am grateful to…
Phil Lord (University of Newcastle)
Anil Wipat (University of Newcastle)
Matthew Pocock (University of Newcastle)
Robert Stevens (University of Manchester)
Paul Fisher (University of Manchester)
Duncan Hull (Manchester Centre for Systems Biology)
Norman Paton (University of Manchester)
Marco Roos (University of Amsterdam)
Rodrigo Lopez (EBI)
Tom Oinn (EBI)
Andy Law (Roslin Institute)
Graham Cameron (EBI)
They came up with more than seven. But I beat them into submission. Many are highly inter-related. Hopefully they are all too familiar.
Sins
Parochialism and Insularity
Exceptionalism
Autonomy or death!
Vanity: Pride and Narcissism
Monolith Megalomania
Scientific method Sloth
Instant Gratification
Parochialism
“being provincial, being narrow in scope, or considering only small sections of an issue.” http://en.wikipedia.org/wiki/Parochialism
Insularity
“a person, group of people, or a community that is only concerned with their limited way of life and not at all interested in new ideas or other cultures.” http://en.wikipedia.org/wiki/Insularity
Sin 1
Reinvention
Reinventing the Wheel. Rediscovering the same problems. Rediscovery of techniques & methods.
Creating…
Yet another identity scheme. Yet another representation mechanism for data.
Yet another ontology. Yet another data warehouse.
Yet another integration framework. Yet another query or ontology or workflow language.
Result? Misery. Or more work for the boys.
Comparative Genomics? Tsk! It's Comparative Bioinformatics. Bioinformatics is about mapping one schema to another, one format to another, one id scheme to another. What a waste of time. What a handy distraction from doing some Real Science™.
“The number of unique identifiers assigned to an individual is never less than the number of Institutions involved in the study”... and is frequently many, many more.
http://bioinformatics.roslin.ac.uk/lawslaws.html
The Selfish Scientist
“A biologist would rather share their toothbrush than their (gene) names”
Mike Ashburner
Professor of Genetics
University of Cambridge
UK
Amongst the many
Some causes of the Identity Crisis
Conflation of the ID for a thing, something to call the thing, a description of the thing, with the thing itself (reference/referent)
Internal vs external IDs
Opaque vs human-interpretable IDs
Situation-dependent 'parts' of a resource get different IDs
e.g. the gene in a disease process vs the disease in a metabolic process
Annotation attribution and log differentiation
Two organisations attach annotations to two IDs and state that they refer to the same thing; they now have provenance about which of them asserted which facts
[Pocock]
Id Reinvention
Global Identity naming mechanism for data objects in the Life Sciences
LSIDs and URIs and PURLs. WS-Naming and all its friends
Half the debaters haven’t actually read the LSID or URL or PURL specs. Or provided use cases.
Web Pages are not Data Assets.
“you could do this with HTTP based identifiers given <insert hack>”.
The debate rages! 124 messages in the last week.
W3C Semantic Web Health Care and Life Sciences Interest Group [email_address]
“The first step in developing a new genetic analysis algorithm is to decide how to make the input data file format different from all pre-existing analysis data file formats.”
Different codes to signify the sex of animals.
crimap uses '0' female and '1' male.
Keightley algorithm uses '1' female and '0' male.
Knott & Haley QTL analysis algorithm uses '1' female and '2' male.
Next they'll use '3' and '4' and then we'll know they're doing it deliberately.
http://bioinformatics.roslin.ac.uk/lawslaws.html
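The three conventions quoted above can be reconciled with a lookup table. This is a minimal sketch, assuming the codes are exactly as listed on this slide; the `Sex` enum, the `SEX_CODES` table, its keys and the `normalise_sex` function are hypothetical names made up for illustration and are not part of any of the tools mentioned.

```python
from enum import Enum


class Sex(Enum):
    FEMALE = "female"
    MALE = "male"


# Codes as quoted on the slide; the dictionary keys are illustrative labels only.
SEX_CODES = {
    "crimap": {"0": Sex.FEMALE, "1": Sex.MALE},
    "keightley": {"1": Sex.FEMALE, "0": Sex.MALE},
    "knott_haley": {"1": Sex.FEMALE, "2": Sex.MALE},
}


def normalise_sex(tool: str, code: str) -> Sex:
    """Translate a tool-specific sex code into one shared vocabulary."""
    try:
        return SEX_CODES[tool][code.strip()]
    except KeyError:
        # Fail loudly rather than guess: the same digit means opposite things elsewhere.
        raise ValueError(f"unknown sex code {code!r} for tool {tool!r}") from None
```

The same digit '1' comes back as male for one tool and female for the others, which is the whole problem: the code is meaningless without knowing which format it came from.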
EMBOSS lists more than 20 different sequence formats.
“Nearly every collection of sequences that dares call itself a database has stored its data in its own format.”
The Montagues and The Capulets.. Let me get my bullet-proof vest …
The Montagues and The Capulets (SOFG 2004, KCap 2005, Comparative and Functional Genomics 2004)
[Slide diagram. Labels: The “Oh No” OBO; Pragmatists; Aesthetics; Philosophers; Life Scientists; Capulets; Knowledge Representation; Montagues; A means to an end; Content providers; Theoreticians; The end; Mechanism providers; Spiritual guides; Endurants, Perdurants, Being, Substance, Event]
Yet another database …
Organism databases
Counter example
Generic Model Organism Database Toolkit.
FlyBase, WormBase, SGD, BeeBase and many other large and small community databases
BioBabel
bioperl
biojava
biopython
bioruby
biophp
biosql
biouml
biofoo
biobar
Integration
Workflows Management Systems
Counter example
Taverna
http://www.mygrid.org.uk
Reinvent wheels in creating 'Transcriptional Units' ('genes' derived from ESTs and mRNA), within species and between species.
This holds for much genome-assembly-related work.
Genome data compilers for E. coli, Drosophila, Plant species, etcetera reuse each other's code?
Usually something new is added, but large parts could have been reused.
Any more ?
Another Web 2.0 Web Site? Another Web interface to a database? Another portal?
Whole database systems. ACeDB is not a lone case.
Text miners require synonyms and reinvent the wheel to get them in many cases.
Add your favourite here….
Reuse Rocks. Collaboration through workflow and web services
VL-e Project
“instant collaboration” with Martijn Schuemie (Rotterdam) through a web service that discloses their protein synonym data.
Exchanging services and (sub)workflows with food scientists.
Web services make that easier.
Recycling, Reuse, Repurposing
A Trypanosomiasis in Cattle workflow (by Paul) reused without change for Trichuris muris Infection (by Jo).
Identified the biological pathways believed to be involved in the ability of mice to expel the parasite.
Workflows are memes. Scientific commodities. To be exchanged and traded and vetted and mashed. Users add value.
Warning! Reuse is Hard
Writing reusable workflows is hard.
Local services
Permissions. Licences
What does it DO?
Writing reusable services is hard.
What does it DO?
Predicting the unknown required by the unknown.
Finding workflows, services and tools is hard
Where do you go?? What does it DO??
Creating web services is still a bottleneck. For quick solutions it is still seen as too much extra trouble.
Bullying and the Borg
If a group is working in a field, you get bullied for trying out something different.
Can YOU think of an example??
You may actually be doing something different, but you use some common words.
“Why do this? It's already been solved by Foo - the massively unwieldy, slow-moving, monolithic, meeting-paralysed international effort for Things Mentioning Foo”.
Reinvention or Invention? Pre-dating
BioMOBY pre-dates (Semantic) Web service revolution
OBO and OBO-Edit pre-dates OWL and Protégé-OWL
20 years of Knowledge Representation.
Taverna pre-dates a reliable Open Source BPEL engine
20 years of functional programming.
There ARE features that Bioinformatics needs that other solutions don’t cater for.
A few months in the laboratory (or the computer) can save a few hours in the library (or on Google). Westheimer's Law (with additions).
No tool is an island…
Assume
only we will use it, whatever it may be.
that it will be freestanding and unlinked to anything else.
that it will always work and will keep on working.
that everyone will understand it.
“Well I know what I mean. And so does my mate. So I don’t need to specify it. Or document it properly. Or keep the metadata up to date.”
Never mind the interface, just look at my implementation!
Metadata matters. Models matter.
Interfaces matter. Services matter.
I know what it means...
A hacker who studied ontology
Was famed for his sense of frivolity
When his program inferred
That Clyde ISA Bird †
He blamed – not his code – but zoology
† Clyde ISA Elephant
“AI limericks” by Henry Kautz http://www.cs.washington.edu/homes/kautz/misc/limericks.html
Not just bioinformatics: Computer Science is guilty!
Why don’t biologists modularise OWL ontologies properly? Er, well, like how should we do it “properly” and where are the tools to help us? We don’t know and we haven’t got any. But here are some vague guidelines. W3C Semantic Web for Life Sciences mailing list, 2005
“I don't blame them [MGED/PSI community] because to truly comprehend RDF/OWL is not an easy task, it takes not just the understand of technology itself but more so the vision on how things should and can work in SW.”
“One thing we have to remember is that biologists are building ontologies to do a job of work. They are not produced as some end of CS or SW research”
“Principles are all well and good, but we should know from decades of software engineering that saying "do it properly" isn't a solution. We need tooling and methodologies that do not in themselves hinder a domain specialist. In many cases it is easier to re-develop than re-use or even cut-and-paste from an existing ontology than it is to muck around “doing it properly””
“There is actually a gap between the view of ontology for CS people and for biological people. The ontology in biologist's eyes are more of a treaty than logical representation, that in CS view is on the reverse of that view. It needs dialog to bring the view to a middle ground and mechanisms to stretch to both directions.”
Standards are boring (but important)
“Blue collar Science” (John Quackenbush)
Nobody is going to win a Nobel prize for creating a standard schema, ontology or whatever. (Duncan Hull)
“Standardise where you need standards, don’t where you don’t. Standardise messages not structures” (Graham Cameron)
Drive on the left or the right?
Self promotion
Not making shareable reusable software, because we can publish every single monolithic software solution.
And get promoted.
Applies equally to databases and ontologies.
Production vs Novelty
Not all software and databases are equal.
Research – Production Confusion
Novelty vs Standards
Neither the funding nor the social structures of bioinformatics allow us to treat these two differently in any principled manner
How do you get funding for production software other than claiming to be researching stuff?
How do you get a publication out of a bit of research software without claiming a potential user-base?
Trust
I don’t trust your code
I don’t trust your data
I don’t trust you will still be around in 1 year
Sin 2
Exceptionalism
Biologist exceptionalism
Biological exceptionalism
Biology exceptionalism
A cause of Reinvention Syndrome
“Bioinformatics is special”
“Domain-specific outcomes require domain-specific approaches and technologies”
Biologist exceptionalism
I know there is already a gene name for that gene, but, I don't like it and it doesn't fit in with my schema.
It would be better if I wrote the script I need so I know what it does, how it does it and how to modify it later because I haven’t specified what it was supposed to do in the first place.
I’m different. We are all individuals.
Biological exceptionalism
“Biology is all exception.”
“Don’t complicate everyone’s life for the sake of a few esoteric cases”. Cameron’s 5th Commandment of Curation
Exceptionalism paralysis.
Gather requirements expansively, prune ruthlessly
The EMBL/GenBank/DDBJ Feature Table
We are so much more complex…
“There are proteins, and there are records about proteins. Records come in different formats. If I make a statement using this url, is it about the record? or the protein?” Alan Ruttenberg
“[Usually] we have one entry per gene. We have several entries for a single gene when description of variations are too complicated to describe in FT lines (of course, this criteria depends on the annotator). For viruses, it is much more messy, due to ribosomal frame-shifts. Formalise that!” Eric Jain UniProtDB
er…decomposition and untangling?
Other Sciences….
CERN: UML meta-modelling mechanisms in order to migrate models over time without losing data.
Ensembl: “Our data models are complicated - I don't think specifying them will help. We need to understand them instead.”
And?
Confusing meta-mechanisms with models
Biology Exceptionalism
Biology is harder than anything else in the whole wide world because there is lots of it and it's complicated.
Drawing graphs of data sets over time.
Physics wipes you off the map.
The real problem is complexity not scale.
The number of data sets, their diversity and how they overlap.
How they change.
Their Reliability.
Sin 3
Autonomy or death!
Combined with churn and indifference to users.
Compounded by the Early Adopter tendency of the community and a monopoly mentality.
“Hell is other people’s systems” as Jean-Paul Sartre would have said if he had been a bioinformatician.
Autonomy is death!
Change my interface / format whenever I feel like it, despite the fact I wanted lots of users and I have lots of users who depend on this. And I won’t bother to debug either or provide backwards compatibility.
An interface is a contract between data provider and data consumer
Document interface; warn if it is unstable
Do not make changes lightly
Even little fiddly changes can break things
Provide plenty of advance warning
When possible, maintain legacy interfaces until clients can port their scripts
Support as many interfaces as you can
HTML (least desired)
Text only (better)
HTTP-XML (even better)
SOAP-XML (sweet!)
Easy Interfaces + Power User Interfaces
… and he could say it again today.
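The advice to maintain legacy interfaces until clients can port their scripts can be sketched in a few lines. This is an illustration under stated assumptions, not anyone's actual service: `fetch_sequence`, `get_seq`, the format names and the placeholder record are all hypothetical. The old entry point stays alive as a thin wrapper that warns, so existing scripts keep working while users get advance notice.

```python
import warnings


def fetch_sequence(accession: str, fmt: str = "fasta") -> str:
    """Current interface: return a record in the requested plain-text format."""
    record = {"accession": accession, "sequence": "ACGT"}  # placeholder data
    if fmt == "fasta":
        return f">{record['accession']}\n{record['sequence']}\n"
    if fmt == "tab":
        return f"{record['accession']}\t{record['sequence']}\n"
    # A clear message, not a cryptic one, when the contract is violated.
    raise ValueError(f"unsupported format {fmt!r}; expected 'fasta' or 'tab'")


def get_seq(acc: str) -> str:
    """Legacy interface, kept so that existing client scripts keep working."""
    warnings.warn(
        "get_seq() is deprecated; use fetch_sequence(accession, fmt='fasta')",
        DeprecationWarning,
        stacklevel=2,
    )
    return fetch_sequence(acc, fmt="fasta")
```

Callers of the old name get the same output as before plus a warning that tells them what to switch to, which is cheaper for everyone than a silent change.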
Law's Second Law
“Error messages should never be provided.”
Corollary: “If error messages are provided, they should be utterly cryptic so as to convey as little information as possible to the end user.”
Workflow commodities
Workflow published with its paper and its data set.
So what happens when I want to run this workflow again?
Is the service dead?
Is the dataset still there?
Was it designed to be reproduced or reused in the first place?
The myGrid Semantic Sweatshop
Services and Workflows in the wild.
Curated by experts using an ontology.
Supplied by service providers (like EMBOSS) in text.
Or annotations (like BioMOBY, but they aren’t good annotations!)
Tagged by the Masses.
Multi-perspective
Scientist for finding.
Machinery for validation.
Hard work. Look how tired they are.
Semantic
The myGrid Semantic Sweatshop: notice how tired they look. Franck Tanoh, Katy Wolstencroft
Churn, Churn, Churn
We expect the content to change, but why does everything else?
Constant churn and change for change's sake.
Maybe this is a sin inherited from Computer Science.
The W3C Identity War. Web Services vs REST
Impact on everyone else who uses the previous mechanism.
A few voices, very loud, vested interest, for their application, win.
You know what? Why don’t we stick with something for a while and rally behind it? Or at least figure out the cost of change.
“Stability is more important than Standards or Smartness. Discuss”
Sin 4
Vanity
Pride
Narcissism
conceit, egotism or simple selfishness.
Applied to a social group, denotes elitism or an indifference to the plight of others
I know it all.
Claiming to know everything about biology and everything about computers.
This is really irritating to both biologists and computer scientists.
Even they don’t claim to know everything about biology or computer science.
Computer scientists do know a lot of stuff. And they publish too.
“Biologists are the experts on everything because we produce the data”
And what would you suggest, Mr. Smartie Pants?
Think like me!
Building interfaces that only you can use.
Not actually using your tools in the field.
I understand workflows
Workflows are for biologists.
My granny can do workflows...
Designing good experiments is hard.
Workflows are computational experimental protocols. Ergo….
Writing workflows should be expected to be hard.
Writing good workflows is really hard.
Writing good reusable workflows is really really hard.
Misunderstanding and disrespecting users
A good User Experience outweighs smart features. Can I use it? Is the user interface familiar? Does it fit with my needs?
Gain-Pain pay-off
Just enough, just in time
[Chart of Gain against Pain, with regions labelled: Very BAD; Good, but Unlikely; Just right]
Sin 5
Monolith Megalomania
delusions of grandeur.
obsession with grandiosity and extravagance.
Data mining - “my data is mine, and your data is mine”
More, more, more!
Integration – the more the merrier. No.
Every link is a potential dead link.
Every dependency can find its way on to your critical path.
Toolkits: Information Integrator, GMOD, BioMART, BioWarehouse, blah blah…
50% of warehouses fail.
“Uber-tools” and “Uber-databases”
Biomart, Ensembl, etc etc….
[Cameron]
The trouble with warehouses
30% of data migration projects fail (Source: Standish Group)
50% of data warehousing / Business Intelligence projects fail (Source: NCR)
“Warehouses work? Piffle. They never manage to maintain synchrony with the source data. Mostly they fall down of their own weight!” Graham Cameron, EMBL-EBI
"Our ability to capture and store data far outpaces our ability to process and exploit it. This growing challenge has produced a phenomenon we call the data tombs, or data stores that are effectively write-only; data is deposited to merely rest in peace, since in all likelihood it will never be accessed again. Data tombs also represent missed opportunities." Usama Fayyad Yahoo! Research! Laboratories!
“We believe that attempts to solve the issues of scientific data management by building large, centralised, archival repositories are both dangerous and unworkable” Microsoft 2020 Science report.
More More More
“Emacs of Biology”
End-user apps/libraries in bioinformatics workbenches with loads of crap bundled in, none of it kept up to date, none of it properly integrated.
Keep it simple and modular
Don’t reinvent Eclipse.
Mash-Up Data Marshalling
Content syndication and feeds
Emphasis shifts to the user creating specific integration by mapping.
Just in time, just enough design
On demand integration – or rather, aggregation.
[Diagram: a Mash Up Application with a user interface and protocol objects, each speaking a protocol to a data source]
Distributed Annotation System Mash-Up http://www.biodas.org
[Diagram: one Reference Server and several Annotation Servers, with example entries AC003027, AC005122, M10154, WI1029, AFM820, AFM1126, WI443]
Sin 6
Scientific Method Sloth
It's easier to think of a new name than use someone else’s.
I want my own view over data and views are difficult, so I’ll create my own database.
Leads to Reinvention, Exceptionalism
Often the result of Instant Gratification
Ennui
Garbage in, garbage out
Running analysis over the wrong datasets
E.g. Identifying chicken proteins in mouse cells.
Configuration traditionalism
Not changing the parameters of BLAST. Ever.
Top list ennui
If there is a list, only looking at the first item.
Look no further than the first Blast hit / first Google hit.
Arbitrary cut-offs on rank-ordered result list
Absolute truth above, absolute falsehood below
E.g. differentially expressed genes in microarray analyses.
It's black and white
Arbitrary cut-offs on rank-ordered result list
Everything above is absolute truth and everything below complete falsehood.
sequence similarity when looking for orthologs.
protein identifications using Mascot scores.
differentially expressed genes in microarray analyses.
Quality Delusions
The bioinformatics does not have to be sound, because we only trust wet-lab results anyway.
Worrying about errors in experimental data but believing that derived data is always true.
Believing TrEMBL is always right.
Believing computational gene predictions are always correct.
Black Box Science
Producing irreproducible bioinformatics analyses
Not collecting the provenance of the analysis.
Not testing during software development.
Try re-running experiments described in the journal Bioinformatics more than 5 years ago.
UniGene
What is happening during UniGene clustering?
‘Human’ descriptions (via NCBI) are not exact.
The Human Transcriptome Map project and other microarray analysts ended up reclustering UniGene [Marco Roos].
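Collecting the provenance of an analysis need not be elaborate. A minimal sketch, assuming that recording the tool, its version, the parameters and a checksum of the input is enough to attempt a re-run later; `provenance_record` and its field names are hypothetical, as are the tool name and version in the example.

```python
import hashlib
import json
from datetime import datetime, timezone


def provenance_record(tool: str, version: str, params: dict, input_bytes: bytes) -> dict:
    """Describe one analysis run well enough to try repeating it later."""
    return {
        "tool": tool,
        "version": version,
        # Sorted so that two runs with the same settings give the same record.
        "parameters": dict(sorted(params.items())),
        # A checksum pins down exactly which input data was analysed.
        "input_sha256": hashlib.sha256(input_bytes).hexdigest(),
        "run_at": datetime.now(timezone.utc).isoformat(),
    }


record = provenance_record(
    tool="example-aligner",  # hypothetical tool name
    version="0.0.0",         # hypothetical version
    params={"matrix": "BLOSUM62", "evalue": 1e-5},
    input_bytes=b">seq1\nACGT\n",
)
print(json.dumps(record, indent=2))
```

Stored next to the results, a record like this at least answers "which data, which settings, which version?" when someone tries to repeat the analysis.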
“No experiment is reproducible.” Wyszowski's Law
“An experiment is reproducible until another laboratory tries to repeat it.” Alexander Kohn
Sin 7
Instant Gratification
Greed? Gluttony?
Always the immediate return.
Never investing for the future.
The quick and dirty fix.
Refusing to model or abstract.
Refusing to plan for recording and exchanging.
Just getting the next quick fix.
The pressure to deliver now and pay later
www.CartoonStock.com
Hackery
Deliver now, pay later
Producing crap, non-reusable, software because only the biological results matter for publication X.
Collect! Analyse! Er…now what?
Spaghetti-ism
Over-indulgence in Perl
Over-indulgence in ASCII-art flat files.
Modelling a system by hacking up XSD fragments on a whiteboard.
Writing Perl scripts that resemble my high-school BASIC of the 80s.
“I am sure one could reuse large parts of re-annotation for building transcriptome maps, if they only used workflows and ontologies”. Marco Roos, A Biologist and Bioinformatician, VL-e Project, Amsterdam
“Bioinformaticians have reached the standards of the 1980s, while computer scientists are working on the standards of the 2020s, leaving roughly 40 years to bridge.” Marco Roos, A Biologist and Bioinformatician, VL-e Project, Amsterdam
Blind faith in XML
It’s in XML, thus all data integration problems are solved.
Er…no.
All those vocabularies e.g. SBML, GenBank XML etc
The good thing about XML is that it is human readable.
Arrrrgh!
Insisting that XML is not text.
Insisting that XML is text
XML
Blind Faith in Foo.
There's a new thing to use.
we don't understand it yet.
so it sucks up all the stuff we already know we don't understand.
Lack of appreciation about exactly what the new technology addresses in itself before trying to make it work for us.
Pioneering development methods
Development by anecdote
I heard in the pub that the way to go was Foo.
Though I have no idea what Foo is or why it is the way to go.
Design by hacking
It would be better if I wrote the script I need so I know what it does, how it does it and how to modify it later because I haven’t specified what it was supposed to do in the first place.
Hmmm…..We call that Extreme Programming or Emergent Semantics or Web 2.0 in CS.
Open Source Blinkers
Why does Open source have special merit?
Commercial solutions with added special sauce can rock too.
Shall I duck?
Sin Summary
Maybe only one “original sin” in bioinformatics.
Parochialism and Insularity
Exceptionalism
Autonomy or death!
Vanity: Pride and Narcissism
Monolith Megalomania
Scientific method Sloth
Instant Gratification
Reinvention
Churn
Can we become less sinful? Why do these sins exist? Are bioinformaticians particularly naughty? No naughtier than Computer Scientists. And it's all very hard. Though they are naughty…
Why?
Selfish Scientist – Self-interested Scientist
Reputation, need to get results right now, win.
Fear of dependency, fear of being left behind.
Understand the incentives and barriers to adoption.
Bioinformatics as it is practiced
Social and funding structure perpetuates this.
Production vs Research.
Real, inherent issues. It is hard.
Hybrid exhaustion and pressure.
Biology + Computing + Bioinformatics
Luddism? Surely not!
Refusing to have biology go beyond a cottage industry.
Being scared to do it properly.
Railing against big science
The cult of amateurism.
[Stevens]
Research – Production Confusion
Novelty vs Standards
Neither the funding nor the social structures of bioinformatics allow us to treat these two differently in any principled manner
How do you get funding for production software other than claiming to be researching stuff?
How do you get a publication out of a bit of research software without claiming a potential user-base?
Practical Steps?
Create means to share know-how
Understanding outside my expertise. e.g. sources of error.
A comprehensive catalogue of web services
A Facebook for workflow builders.
Learn from others. Even Computer Science. And other Sciences.
Try and create a culture of raising quality. Somehow.
Facebook & Bazaar for Workflow e-Scientists
myexperiment.org
Trials start August 2007!
Delivery Bulge
Practical Steps for IT Platforms?
Stop building monolithic solutions
Strong force in business enterprises
Component-ise Bioinformatics
Loosely coupled systems
Stable APIs, standardised metadata.
Design to combine.
Sort out the b***dy naming/id problem
If you can’t agree, agree on the bridge.
Raise the level of abstraction
Less Perl, more workflows
Enable users to extract the data they need without hassling you.
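"If you can't agree, agree on the bridge" can be read as: keep your own identifiers, but share a record of which ones denote the same thing. A minimal sketch of such a bridge, assuming identifiers are (scheme, id) pairs and that "same as" is symmetric and transitive; the `IdBridge` class and the scheme names and identifiers in the example are made up for illustration.

```python
class IdBridge:
    """Keeps track of which (scheme, identifier) pairs denote the same thing."""

    def __init__(self):
        self._parent = {}

    def _find(self, key):
        # Follow parent links to the representative of this key's group.
        self._parent.setdefault(key, key)
        while self._parent[key] != key:
            self._parent[key] = self._parent[self._parent[key]]  # path halving
            key = self._parent[key]
        return key

    def same_as(self, a, b):
        """Assert that two identifiers refer to the same thing."""
        self._parent[self._find(a)] = self._find(b)

    def equivalents(self, key):
        """Return every identifier known to be equivalent to `key`, sorted."""
        root = self._find(key)
        return sorted(k for k in list(self._parent) if self._find(k) == root)


# Made-up schemes and identifiers, purely for illustration.
bridge = IdBridge()
bridge.same_as(("scheme_a", "A:0001"), ("scheme_b", "B-42"))
bridge.same_as(("scheme_b", "B-42"), ("scheme_c", "c_7"))
print(bridge.equivalents(("scheme_a", "A:0001")))
```

Nobody has to give up their naming scheme; the bridge only has to be agreed once and can then be shared. A real bridge would also record who asserted each equivalence, which this sketch leaves out.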
Practical Steps?
Presume and design for incremental change
Minimise disruption.
Presume others use our stuff
And respect that
Describe to build Trust
Presume others add value to our stuff
Be easily part of loosely coupled systems. Lightweight programming models.
Presume, and enable, content and function mashing.