Knowledge Management in a Knowledge Based Discipline
1. Knowledge Management in a Knowledge Based Discipline
Robert Stevens
BioHealth Informatics Group
University of Manchester
Robert.Stevens@manchester.ac.uk
2. Introduction
• How do we do (molecular) biology?
• Managing stamp albums
• A knowledge based discipline
• Representing knowledge computationally
• Ontologies that define what entities are in the
domain
• Describing biological knowledge ontologically
• Using ontologies: is it enough?
3. Ernest Rutherford
“All science is either physics or stamp collecting”
Image: http://en.wikipedia.org/wiki/File:Ernest_Rutherford2.jpg
8. Speed of sequencing
• First human genome
– 10+ years to produce
– Cost $500 million
– Huge international effort
• Now done in 10 weeks
– (for $399)
– http://tinyurl.com/genomecost
– http://www.23andme.com
12. What is Knowledge?
• Knowledge – all information
and an understanding to
carry out tasks and to infer
new information
• Information -- data
equipped with meaning
• Data -- un-interpreted
signals that reach our
senses
[Figure: example record, with side labels "ISMB Conf" and "BIOLOGY"]
Name: Michael Ashburner → man
Job: Professor → academic, senior
Institution: University of Cambridge → ancient university, 5 rated
Country: UK → European
→ important figure in biology
13. A Knowledge Based Discipline
• Rather than laws captured in mathematics….
• We have lots of facts: the discipline’s knowledge
• Rather than “calculating” what a protein does, we
investigate and write it down
• Equivalent to writing down the trajectories of all
thrown objects and not doing ballistics!
• To do biology one needs “the knowledge”
14. Heterogeneity
• 28 ways to format the representations of a biological
sequence
• Though one way to represent the bases or amino
acids…
• Different words same concept
• Different concepts same words
• Different and implicit data schema
15. Categories and Category Labels
GO:0000368
U2-type nuclear mRNA 5' splice site recognition
spliceosomal E complex formation
spliceosomal E complex biosynthesis
spliceosomal CC complex formation
U2-type nuclear mRNA 5'-splice site recognition
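One way to read this slide is that several labels all name a single category, which carries one identifier. A minimal Python sketch of that idea follows. The label list is copied from the slide; the `normalise` and `lookup` helpers and the in-memory table are illustrative assumptions, not part of any Gene Ontology tooling.

```python
# Labels copied from the slide; here they are all treated as names
# for the one category identified by GO:0000368.
LABELS = {
    "GO:0000368": [
        "U2-type nuclear mRNA 5' splice site recognition",
        "spliceosomal E complex formation",
        "spliceosomal E complex biosynthesis",
        "spliceosomal CC complex formation",
        "U2-type nuclear mRNA 5'-splice site recognition",
    ],
}


def normalise(label: str) -> str:
    """Lower-case, treat hyphens as spaces, and collapse whitespace,
    so trivial spelling variants of a label compare equal."""
    return " ".join(label.lower().replace("-", " ").split())


# Reverse index: normalised label -> category identifier.
INDEX = {
    normalise(label): category_id
    for category_id, labels in LABELS.items()
    for label in labels
}


def lookup(label: str):
    """Return the identifier for a label, or None if the label is unknown."""
    return INDEX.get(normalise(label))
```

With this table, `lookup("Spliceosomal E Complex Formation")` and `lookup("spliceosomal CC complex formation")` both resolve to the same identifier, which is the point of separating a category from its labels.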
16. An Identity Crisis
• Database entries have identifiers unique within their
database
• The type of entity described in an entry doesn’t have
an identifier
• Different entries about the same type talk about it
differently
• How do we know when an entry in one DB talks
about the same thing as another entry in another
DB?
• That’s the skill of a bioinformatician
17. Why: Society of Biologists
• Doing particle physics necessarily requires central
organisation
• One central place to generate data
• A communitarian attitude
• It is still possible to do biology in the “garden shed”
• Historically less need to organise
• Hence…
19. Biology is Special
• Large quantities of data: No it doesn’t
• Complex data: Yes it does
• Volatile data: Types of data and what is recorded
changes rapidly
• Nothing that special about biology
• …except that it has all the problems, and often to a
large degree
22. Creating Woods, not Trees
[Figure labels: Genes, Proteins, Pathways, Interactions; Literature; Complex Machines; Virtual Organism]
…. from biological facts, we make a system that is some model of a real organism
25. A Biologist’s Skills
• By the time a biologist has finished a Ph.D. he/she is
about ready for action
• They have a comprehensive knowledge of the facts
of a (narrow) domain
• He/she also knows how to do experimentation in that
domain
• There are so many facts, it is difficult to move outside
one’s sub-discipline
• Yet in a systems view such movement is mandatory
26. The Role of Knowledge
• A lot of facts
• Perhaps organised into a system
• No equivalent of “laws of mechanics” – we
can’t do this biology with mathematics
• Or at least not without knowing what the
numbers mean...
• This is why we’ve been using ontologies!
27. What is an Ontology?
• A description of that which exists (in our data)
• What it means to be a member of a category
• What categories of things exist and how do I
recognise that a particular object is a member of a
given category
29. Why develop an ontology?
• To make domain assumptions explicit
– Easier to change domain assumptions
– Easier to understand and update legacy data
• To separate domain knowledge from operational knowledge
– Re-use domain and operational knowledge
separately
• A community reference for applications
• To share a consistent understanding of what information means.
31. Controlled Vocabulary
• An Ontology isn’t a controlled vocabulary, but can be
used to deliver one
• By agreeing upon the categories in a domain and
agreeing upon their labels we are controlling
vocabulary
• Addresses one major problem in biology
• Also forces examination of definitions
• Makes domain assumptions explicit
33. Post-Genomic Biology
• Fly, mouse, yeast, worm all have their own
terminologies
• I want to compare genomes
• How?
• The genomic sequence is easily dealt with
computationally and comparisons are easy
• This is not true of the annotations or knowledge of
those sequences
• Need a common understanding
34. Annotation of Data
• Big effort to create controlled vocabularies using
ontologies
• A huge annotation effort – describe the entities in DB
with terms from ontologies
• The Gene Ontology (http://www.geneontology.org)
• The Open Biomedical Ontologies Consortium
35. Genotype Phenotype
Sequence
Proteins
Gene products Transcript
Pathways
Cell type
BRENDA tissue / enzyme source
Development
Anatomy
Phenotype
Plasmodium life cycle
-Sequence types and features
-Genetic Context
- Molecule role
- Molecular Function
- Biological process
- Cellular component
-Protein covalent bond
-Protein domain
-UniProt taxonomy
-Pathway ontology
-Event (INOH pathway ontology)
-Systems Biology
-Protein-protein interaction
-Arabidopsis development
-Cereal plant development
-Plant growth and developmental stage
-C. elegans development
-Drosophila development
-Human developmental anatomy, abstract version
-Human developmental anatomy, timed version
-Mosquito gross anatomy
-Mouse adult gross anatomy
-Mouse gross anatomy and development
-C. elegans gross anatomy
-Arabidopsis gross anatomy
-Cereal plant gross anatomy
-Drosophila gross anatomy
-Dictyostelium discoideum anatomy
-Fungal gross anatomy
-Plant structure
-Maize gross anatomy
-Medaka fish anatomy and development
-Zebrafish anatomy and development
-NCI Thesaurus
-Mouse pathology
-Human disease
-Cereal plant trait
-PATO attribute and value
-Mammalian phenotype
-Habronattus courtship
-Loggerhead nesting
-Animal natural history and life history
eVOC (Expressed Sequence Annotation for Humans)
38. GO in Analysis
• Microarray analysis one of the original visions for GO
• Modulated genes cluster around the functional
attributes of their proteins
• GO also used in, for example, semantic similarity;
text analysis; etc.
39. Fact Management
• When “stamp collecting” we’re collecting facts
• Biology is a fact management activity
• Knowing what these facts mean is very important
• Science is performed on data and the semantics of
data enable us to do science
• Semantic e-Science
40. Summary
• The nature of modern biology gives it interesting
knowledge (fact) management issues
• It is a knowledge based discipline
• Not unique, but often extreme
• Ontologies seen as one component in management
(but not a panacea)
41. Acknowledgements
• All these people provided slides and input:
• Duncan Hull
• Simon Jupp
• Phil Lord
• Carole Goble
Slide Title: G 2 P
Slide contains two semicircles labelled Genotype and Phenotype
Text says: Classic Biology; Modern Biology
“Data are the uninterpreted signals that reach our senses every minute in time by the zillions… Information is data equipped with meaning… Knowledge is the whole body of data and information that people bring to bear to practical use in action, in order to carry out tasks and to create new information.” (Schreiber et al. 1998)
Slide Title: Literature
Lots of books in a library
Slide Title
Slide contains:
Book on the left with a plus sign
Black and white image, man sat at an old valve-style computer (i.e. Manchester Baby)
Text saying: genes, proteins, interactions, pathways
Mouse on the right
Text below images says:
(left) Literature
(middle) complex machines
(right) Organism
(bottom) “…. from biological facts, we make a system that is some model of a real thing” - Robert Stevens – 2008
Slide Title: Genotype to Pathway
QTL to Pathway workflow
This workflow:
Identifies all the genes, and their Ensembl ids, in a QTL region using BioMart
Cross-references the gene ids to Entrez and Uniprot ids
Entrez and Uniprot ids then map onto KEGG gene ids
The KEGG gene ids are then used to identify KEGG pathways, including a description and an ID
These lists of descriptions and IDs are then returned back to the user
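The chain of identifier mappings in this workflow can be sketched in a few lines of Python. This is only an illustration of the data flow under stated assumptions: the lookup tables stand in for the BioMart, cross-reference, and KEGG steps, and every identifier and description in them is a made-up placeholder rather than a real record. The function name `qtl_to_pathways` is likewise hypothetical.

```python
# Toy stand-ins for the services named in the notes. Every identifier and
# description below is a placeholder, not a real Ensembl, Entrez, UniProt,
# or KEGG record.
GENES_IN_REGION = {"qtl_region_1": ["ENS_A", "ENS_B"]}  # BioMart step
ENSEMBL_TO_XREFS = {  # Ensembl id -> Entrez and UniProt ids
    "ENS_A": ["ENTREZ_1", "UNIPROT_1"],
    "ENS_B": ["ENTREZ_2"],
}
XREF_TO_KEGG_GENE = {  # Entrez / UniProt id -> KEGG gene id
    "ENTREZ_1": "kegg_gene_1",
    "UNIPROT_1": "kegg_gene_1",
    "ENTREZ_2": "kegg_gene_2",
}
KEGG_GENE_TO_PATHWAYS = {  # KEGG gene id -> (pathway id, description)
    "kegg_gene_1": [("path_1", "Example pathway one")],
    "kegg_gene_2": [
        ("path_1", "Example pathway one"),
        ("path_2", "Example pathway two"),
    ],
}


def qtl_to_pathways(region: str):
    """Follow region -> genes -> cross-references -> KEGG genes -> pathways.

    Returns a sorted list of (pathway id, description) pairs, with
    pathways reached through several genes reported only once.
    """
    pathways = {}
    for ensembl_id in GENES_IN_REGION.get(region, []):
        for xref in ENSEMBL_TO_XREFS.get(ensembl_id, []):
            kegg_gene = XREF_TO_KEGG_GENE.get(xref)
            if kegg_gene is None:
                continue  # no KEGG mapping for this cross-reference
            for pathway_id, description in KEGG_GENE_TO_PATHWAYS.get(kegg_gene, []):
                pathways[pathway_id] = description
    return sorted(pathways.items())
```

Each hop in the chain is a place where an identifier can fail to map, which is why the sketch skips unmapped cross-references rather than stopping.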
Slide Title: Pathway to Phenotype
Pathways to PubMed abstracts workflow
This workflow:
Takes in a list of KEGG pathway descriptions
Appends a search string to the end of each description
Searches through PubMed using the NCBI eUtils Web Services
For each article found in PubMed, as a PubMed id, an abstract is returned along with the date of publication
These abstracts are then returned to the user as a single file
Those abstracts, coupled with abstracts from the phenotype, provide evidence linking those pathways to the phenotype
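The search step of this workflow can be sketched as URL construction against the NCBI E-utilities HTTP endpoints (`esearch.fcgi` to find PubMed ids, `efetch.fcgi` to retrieve abstracts). This is a sketch under assumptions: the notes do not say which search string is appended, so `suffix` is a caller-supplied parameter, and the helper names are hypothetical. No network request is made here.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities HTTP interface.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_queries(descriptions, suffix):
    """Append the same search string to the end of each pathway description."""
    return [f"{description} {suffix}".strip() for description in descriptions]


def esearch_url(term, retmax=20):
    """URL for a PubMed search that returns up to `retmax` PubMed ids."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return f"{EUTILS}/esearch.fcgi?" + urlencode(params)


def efetch_url(pmids):
    """URL that fetches plain-text abstracts for the given PubMed ids."""
    params = {
        "db": "pubmed",
        "id": ",".join(pmids),
        "rettype": "abstract",
        "retmode": "text",
    }
    return f"{EUTILS}/efetch.fcgi?" + urlencode(params)
```

A caller would pass each query from `build_queries` to `esearch_url`, read the returned PubMed ids, and then request the abstracts with `efetch_url`.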