The Changing Nature of Biomedical Research: Semantic e-Science
1. The Changing Nature of Biomedical Research: Semantic e-Science
Robert Stevens
BioHealth Informatics Group
University of Manchester
Robert.Stevens@manchester.ac.uk
7. Classic and Modern Biology
Genotype Phenotype
Modern biology
Classic biology
8. Speed of sequencing
• First human genome
– 10+ years to produce
– Cost $500 million
– Huge international effort
• Now done in 10 weeks
– (for $399)
– http://tinyurl.com/genomecost
– http://www.23andme.com
13. Creating Woods, not Trees
Genes
Proteins
Pathways
Interactions
Literature
Complex Machines
Virtual Organism
…. from biological facts, we make a system that is some model of a real organism
18. Bioinformatics Experiments are Data pipelines
Resources/Services
Investigate the evolutionary relationships between proteins
Protein sequences
Multiple sequence alignment
Query
[Peter Li]
My data
My tool
19. Linking together data resources
Hypo Science – the routine for the many
Hyper Science – big projects, big science
20. The In Silico Experiment
• We can mine these data for possible hypotheses
• “what are the genes that are involved in some disease
phenotype?”
• Correlate genes in QTL with differentially regulated genes in
microarray via pathways; query the literature base with these
genes, pathways and phenotype; …
• Resulting facts form some hypothesis: A co-ordinated set of
SNPs increase cholesterol biosynthesis in macrophage, while
delaying apoptosis of these cells; increased super-oxide
production aids tolerance to trypanosomiasis in cattle
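The correlation step in the bullets above can be sketched in a few lines. This is only an illustration, assuming the gene-to-pathway mapping is already available as an in-memory dictionary; the function name, arguments and data layout are hypothetical and not taken from the talk.

```python
from collections import defaultdict


def candidate_pathways(qtl_genes, diff_expressed_genes, gene_to_pathways):
    """Return pathways that contain at least one gene from the QTL region
    and at least one differentially regulated gene from the microarray."""
    qtl_hits = defaultdict(set)
    expr_hits = defaultdict(set)
    for gene in qtl_genes:
        for pathway in gene_to_pathways.get(gene, ()):
            qtl_hits[pathway].add(gene)
    for gene in diff_expressed_genes:
        for pathway in gene_to_pathways.get(gene, ()):
            expr_hits[pathway].add(gene)
    # Only pathways supported by both kinds of evidence are kept.
    shared = sorted(set(qtl_hits) & set(expr_hits))
    return {
        p: {"qtl": sorted(qtl_hits[p]), "expression": sorted(expr_hits[p])}
        for p in shared
    }
```

The surviving pathways, with their supporting genes, are what would then be used to query the literature.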
21. How bioinformatics was done: Integrating data sets
• Slave labour
• Collections of Scripts
• Warehouses
• Applications
– Galaxy
– Gaggle
– Integr8
– Ensembl
– …..
• Workflows!
[Slide image: excerpt of a position-numbered nucleotide sequence listing]
22. Workflows: E. Science laboris
• Data preparation and analysis pipelines.
• Data preparation pipelines
• Data integration pipelines
• Data analysis pipelines
• Data annotation pipelines
• Warehouse population refreshing
• Data and text mining
• Knowledge extraction.
• Parameter sweeps over
simulations/computations
• Model building and verification
• Knowledge management and model
population
• Hypothesis generation and modelling
23. • A workflow is a specification.
• WFmS is the machinery for
coordinating the execution of
(scientific) services and linking
together (scientific) resources.
• Handles cross cutting concerns like:
error handling, service invocation,
data movement, data streaming, data
provenance tracking, process
auditing, execution monitoring,
security access, blah blah…..
• Agile software development
Workflows: E. Science laboris
Enactment Engine
My data
My tool
24. Workflow Execution Engine
Workflow execution engine
Local desktop and remote server
Implicit iteration over large data collections
Nested workflows
Automated data flow
Event history log and data provenance tracking
Within-workflow programming
Extensibility points for plug-ins
Graphical workbench
For Professionals
Plug-in architecture
Incorporate new service without
coding. Services as they are.
Access to local and remote
resources and analysis tools
Re-Design
Rewritten
25. • Comparing resistant vs. susceptible
strains – Microarrays
• Mapping quantitative traits –
Classical genetics QTL
• Integrated Microarray data,
genomic sequences, pathway data,
literature mining.
Trypanosomiasis Study
Paul Fisher et al., Nucleic Acids Research, 2007, 35(16)
28. • Eliminated user bias and premature filtering
• The scale and complexity of data and
literature.
• Systematic data analysis
• Data analysis provenance
• Manageable amount of output data for
biologists to interpret and verify
• Data driven science
“Looking where others hadn’t”
“make sense of this data” -> “does this make sense?”
http://www.youtube.com/watch?v=Y6_Kz5L010g
30. … A Fact Based Discipline
• Rather than laws captured in mathematics….
• We have lots of facts: the discipline’s knowledge
• Rather than “calculating” what a protein does, we
investigate and write it down
• Equivalent to writing down the trajectories of all
thrown objects and not doing ballistics!
• To do biology one needs “the knowledge”
31. Heterogeneity
• 28 ways to format the representations of a biological
sequence
• Though one way to represent the bases or amino
acids…
• Different words same concept
• Different concepts same words
• Different and implicit data schema
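To make the formatting point concrete: the same residues can arrive in quite different wrappers. A minimal sketch, assuming just two of the many formats (FASTA and a position-numbered listing with space-separated blocks); both parser names are hypothetical.

```python
def from_fasta(text):
    """Read a single FASTA record: skip the '>' header line, join the rest."""
    return "".join(
        line.strip()
        for line in text.splitlines()
        if line and not line.startswith(">")
    ).upper()


def from_numbered(text):
    """Read a numbered listing: each line starts with a position,
    followed by blocks of bases separated by spaces."""
    parts = []
    for line in text.splitlines():
        # Drop the leading position number, keep the sequence blocks.
        parts.extend(token for token in line.split() if not token.isdigit())
    return "".join(parts).upper()
```

Both functions reduce their input to the same plain string of bases, which is the one representation the formats agree on.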
32. An Identity Crisis
• Database entries have identifiers unique within their
database
• The type of entity described in an entry doesn’t have
an identifier
• Different entries about the same type talk about it
differently
• How do we know when an entry in one DB talks
about the same thing as another entry in another
DB?
• That’s the skill of a bioinformatician
33. Categories and Category Labels
GO:0000368
U2-type nuclear mRNA 5' splice site recognition
spliceosomal E complex formation
spliceosomal E complex biosynthesis
spliceosomal CC complex formation
U2-type nuclear mRNA 5'-splice site recognition
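One way to picture the split between a category and its labels is a lookup from any label to the single identifier. A small sketch using only the identifier and labels shown on this slide; the normalisation rule and function names are assumptions made for illustration.

```python
# Labels copied from the slide; all of them name the same category.
GO_LABELS = {
    "GO:0000368": [
        "U2-type nuclear mRNA 5' splice site recognition",
        "spliceosomal E complex formation",
        "spliceosomal E complex biosynthesis",
        "spliceosomal CC complex formation",
        "U2-type nuclear mRNA 5'-splice site recognition",
    ],
}


def normalise(label):
    """Lower-case and treat hyphens as spaces so near-identical labels match."""
    return " ".join(label.lower().replace("-", " ").split())


def build_label_index(labels_by_id):
    """Map every normalised label to the identifier of its category."""
    index = {}
    for term_id, labels in labels_by_id.items():
        for label in labels:
            index[normalise(label)] = term_id
    return index


LABEL_INDEX = build_label_index(GO_LABELS)


def resolve(label):
    """Return the category identifier for a label, or None if unknown."""
    return LABEL_INDEX.get(normalise(label))
```

Whichever label a database entry uses, the identifier is what lets two entries be recognised as talking about the same category.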
34. The Role of Knowledge
• A lot of facts
• Perhaps organised into a system
• No equivalent of “laws of mechanics” – we
can’t do this biology with mathematics
• Or at least not without knowing what the
numbers mean...
• This is why we’ve been using ontologies!
36. Post-Genomic Biology
• Fly, mouse, yeast, worm all have their own
terminologies
• I want to compare genomes
• How?
• The genomic sequence is easily dealt with
computationally and comparisons are easy
• This is not true of the annotations or knowledge of
those sequences
• Need a common understanding
37. Annotation of Data
• Big effort to create controlled vocabularies using
ontologies
• A huge annotation effort – describe the entities in DB
with terms from ontologies
• The Gene Ontology (http://www.geneontology.org)
• The Open Biomedical Ontologies Consortium
38.
39. GO in Analysis
• Microarray analysis one of the original visions for GO
• Modulated genes cluster around functional attributes of
their proteins
• GO also used in, for example, semantic similarity;
text analysis; etc.
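A rough sketch of how ontology annotations feed such an analysis, assuming gene annotations are held as a dictionary of term identifiers. The function names are hypothetical, and the Jaccard overlap is only a crude stand-in for the ontology-aware similarity measures used in practice.

```python
from collections import Counter


def term_counts(modulated_genes, annotations):
    """Count how many of the modulated genes carry each ontology term."""
    counts = Counter()
    for gene in modulated_genes:
        # set() guards against a term being listed twice for one gene.
        counts.update(set(annotations.get(gene, ())))
    return counts


def jaccard(terms_a, terms_b):
    """Set-overlap similarity between two genes' term sets (0.0 to 1.0)."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```

Terms with high counts among the modulated genes point at the functional attributes the cluster has in common.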
41. Shield users and applications from service interoperability and incompatibility plumbing.
Turn your app into a service
Service providers
Not only web services
How a bioinformatician assumes stuff should work
42. Pettifer, University of Manchester
inside
A collection of interactive tools for analysing protein sequence and structure
http://utopia.cs.manchester.ac.uk/
43. Semantic Descriptions of All
• Not just bio-entities in data
• The laboratory experiments by which they were
generated
• The protocols for their analysis
• The services for their analysis
44. Semantic Integration
• Same identifiers means integration and interoperation
• Most workflows are hobbled by syntactic and semantic
heterogeneity
• Syntactic integration (Bio2RDF)
• Semantic integration via ontologies and naming
schemes
• Enables better e-Science through semantic science
45. Fact Management
• When “stamp collecting” we’re collecting facts
• Biology is a fact management activity
• Knowing what these facts mean is very important
• Science is performed on data and the semantics of data
enable us to do science
• Semantic e-Science
46. Summary
• The nature of modern biology gives it interesting
knowledge (fact) management issues
• It is a knowledge based discipline
• Not unique, but often extreme
• Ontologies seen as one component in management
(but not a panacea)
• E-Science gives infra-structure for management;
semantics enable analysis
• Actually, very light use of semantics
Title Slide
The Changing Nature of Biomedical Research: Semantic e-Science
Introduction
(Modern bio-molecular) Science
E-Science
Semantics and science
Semantic e-Science
Ernest Rutherford Slide
All science is either physics or stamp collecting
Mathematical Sciences
Lists of formulae
Laws in Biology
Charles Darwin and Origin of Species
Central Dogma
Classic and Modern Biology
Slide contains two semicircles labelled Genotype and Phenotype
Text says: Classic Biology; Modern Biology
Speed of Sequencing
First human genome
10+ years to produce
Cost $500 million
Huge international effort
Now done in 10 weeks
(for $399)
http://tinyurl.com/genomecost
http://www.23andme.com
1000+ Databases
according to Nucleic Acids Research
- Contains a graph of database growth
PubMed: 2 Papers per minute
~700,000 individual papers
Grows at 2 papers per minute
(see http://blogs.bbsrc.ac.uk for details)
Creating Woods not trees
Slide contains:
Book on the left with a plus sign
Black and white image of a man sitting at an old valve-style computer (i.e. the Manchester Baby)
Text saying: genes, proteins, interactions, pathways
Mouse on the right
Text below images says:
(left) Literature
(middle) complex machines
(right) Organism
(bottom) “…. from biological facts, we make a system that is some model of a real thing” - Robert Stevens – 2008
Network of chemicals
Shows a pathway of chemical interactions and compounds
Systems within systems
Shows lots of organs and tissues associated with a person (in centre)
UniProt a database?
Slide seems to contain a database entry with Greek characters on it
Navigating the Web of Knowledge in Bioinformatics
Shows lots of diagrams with numerous bits of bioinformatics on them.
Data pipelines
PL: In bioinformatics, we have services/resources which biologists use in their bioinformatics analyses. Services can be repositories such as the EMBL database which contains gene sequences or analysis programs like ClustalW and Blast, an algorithm which measures the similarity between nucleotide or protein sequences.
These services are often combined into an ‘in silico’ bioinformatics experiment such as the one shown here. The Swiss-Prot database and the ClustalW analysis service can be integrated into an in silico experiment to investigate the evolutionary relationships between proteins.
** NO TITLE FOR SLIDE
Linking together data resources
Hypo Science – the routine for the many
Hyper Science – big projects, big science
The in silico Experiment
We can mine these data for possible hypotheses
“what are the genes that are involved in some disease phenotype?”
Correlate genes in QTL with differentially regulated genes in microarray via pathways; query the literature base with these genes, pathways and phenotype; …
Resulting facts form some hypothesis: A co-ordinated set of SNPs increase cholesterol biosynthesis in macrophage, while delaying apoptosis of these cells; increased super-oxide production aids tolerance to trypanosomiasis in cattle
Interoperating data services / integrating datasets
Slide shows Hannah web page process (mish-mash) and written protocol
Text says:
Slave labour
Collections of Scripts
Warehouses
Applications
Galaxy
Gaggle
Integr8
Ensembl
…..
Workflows!
Workflows: E. Science laboris
Slide shows Taverna workflow
Text says:
Data preparation and analysis pipelines.
Data preparation pipelines
Data integration pipelines
Data analysis pipelines
Data annotation pipelines
Warehouse population refreshing
Data and text mining
Knowledge extraction.
Parameter sweeps over simulations/computations
Model building and verification
Knowledge management and model population
Hypothesis generation and modelling
Workflows: E. Science laboris
A workflow is a specification.
WFmS is the machinery for coordinating the execution of (scientific) services and linking together (scientific) resources.
Handles cross cutting concerns like: error handling, service invocation, data movement, data streaming, data provenance tracking, process auditing, execution monitoring, security access, blah blah…..
Agile software development
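The split between a workflow (the specification) and the machinery that runs it can be shown with a toy example. This is a sketch only, not how Taverna works: `Step` and `Engine` are hypothetical names, plain Python callables stand in for remote services, and only two of the cross-cutting concerns listed above (error handling and provenance tracking) are covered.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class Step:
    name: str
    service: Callable[[Any], Any]  # stands in for a (scientific) service


@dataclass
class Engine:
    # One provenance record per executed step, in execution order.
    provenance: List[dict] = field(default_factory=list)

    def run(self, workflow: List[Step], data: Any) -> Any:
        """Execute the steps in order, passing each output to the next step."""
        for step in workflow:
            record = {"step": step.name, "input": data}
            try:
                data = step.service(data)
            except Exception as exc:
                # Record the failure before reporting which step broke.
                record["error"] = repr(exc)
                self.provenance.append(record)
                raise RuntimeError(f"step {step.name!r} failed") from exc
            record["output"] = data
            self.provenance.append(record)
        return data
```

The workflow is just the list of steps; everything about invoking them, moving data and logging what happened lives in the engine, so the same specification could be handed to a different engine.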
Taverna 2 Re-design
Usability of professional workbench
Seamless integration with myExp and BioCat
Workflows in Production
Taverna Lite Author Workbench. Workflow Player. Application Port. Results browser.
Component maker. Workflow template maker.
“myGrid-in-a-Box”
Virtualised Taverna server deployment and distribution, bundle of myExperiment, BioCatalogue and database/tools components.
Vertical Markets
Taverna4Chemistry, Taverna4Plants, Taverna4Mouse
“Taverna Inside”
Platform, plug-in, integration
Beanshell scripting and XML processing support inside the workflows
Taverna 2:
long running workflows, data reference handling, data streaming and staging, multiple extensibility points.
Complete the Taverna 2 properties
New data reference handling, security management, provenance management, asynchronous processor and data streaming, explicit monitoring and steering support, a new dispatch layer that better supports dynamic service binding and service invocation through a resource broker, improved concurrency handling at the workflow level
Taverna Remote Execution Service (T-REX)
Running workflows on a server
Running workflows inside other applications
Taverna is for informatics people (bioinformaticians, cheminformaticians etc). We need other interfaces for uptake by laboratory scientists and health workers
Trypanosomiasis Study
Identified a pathway whose correlating gene (Daxx) is believed to play a role in trypanosomiasis resistance
A form of Sleeping sickness in cattle – Known as n’gana
Caused by Trypanosoma brucei
Some cattle breeds more resistant than others
What are the differences between resistant and susceptible cattle?
Can we breed cattle resistant to n’gana infection?
References Paul Fisher NAR paper
Genotype to Pathway
QTL to Pathway workflow
This workflow:
Identifies all the genes, and their Ensembl ids, in a QTL region using BioMart
Cross-references the gene ids to Entrez and Uniprot ids
Entrez and Uniprot ids then map onto KEGG gene ids
The KEGG gene ids are then used to identify KEGG pathways, including a description and an ID
These lists of descriptions and IDs are then returned back to the user
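The chain of identifier mappings in this workflow can be sketched as follows. This is a hedged illustration only: the real workflow calls BioMart and KEGG services, whereas here each lookup is a plain dictionary passed in by the caller, and the function name, parameter names and table layout are all hypothetical.

```python
def qtl_to_pathways(
    qtl_region,
    genes_in_region,        # QTL region -> Ensembl gene ids
    ensembl_to_xrefs,       # Ensembl id -> Entrez/UniProt ids
    xref_to_kegg_gene,      # Entrez/UniProt id -> KEGG gene id
    kegg_gene_to_pathways,  # KEGG gene id -> [(pathway id, description)]
):
    """Follow the id mappings from a QTL region to KEGG pathways."""
    results = {}
    for ensembl_id in genes_in_region.get(qtl_region, []):
        for xref in ensembl_to_xrefs.get(ensembl_id, []):
            kegg_gene = xref_to_kegg_gene.get(xref)
            if kegg_gene is None:
                continue  # no KEGG entry known for this cross-reference
            for pathway_id, description in kegg_gene_to_pathways.get(kegg_gene, []):
                results[pathway_id] = description
    # Each pathway is reported once, however many genes led to it.
    return sorted(results.items())
```

Every hop is a place where an identifier may fail to map, which is why genes silently drop out along the way.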
Pathway to Phenotype
Pathways to PubMed abstracts workflow
This workflow:
Takes in a list of KEGG pathway descriptions
Appends a search string to the end of each description
Searches through PubMed using the NCBI eUtils Web Services
For each article found in PubMed, as a PubMed id, an abstract is returned along with the date of publication
These abstracts are then returned to the user as a single file
Those abstracts, coupled with abstracts from the phenotype, provide evidence linking those pathways to the phenotype
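The steps of this workflow can be sketched without touching the network. In this sketch the PubMed search is an injected function supplied by the caller rather than a real call to the NCBI eUtils services; the function names, the tab-separated output layout and the removal of repeated PubMed ids are assumptions for illustration.

```python
def build_queries(pathway_descriptions, search_suffix):
    """Append the same search string to the end of each pathway description."""
    return [f"{d.strip()} {search_suffix}" for d in pathway_descriptions]


def collect_abstracts(queries, search):
    """Run each query and gather the abstracts into one block of text.

    `search(query)` must return an iterable of
    (pubmed_id, abstract, publication_date) tuples.
    """
    seen = set()
    lines = []
    for query in queries:
        for pubmed_id, abstract, date in search(query):
            if pubmed_id in seen:
                continue  # the same article can match several pathways
            seen.add(pubmed_id)
            lines.append(f"{pubmed_id}\t{date}\t{abstract}")
    return "\n".join(lines)
```

Passing the search in as a function keeps the sketch testable offline; a real run would wrap the eUtils calls behind the same interface.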
Looking where others hadn’t
Includes link to youtube video
Text says:
Eliminated user bias and premature filtering
The scale and complexity of data and literature.
Systematic data analysis
Data analysis provenance
Manageable amount of output data for biologists to interpret and verify
Data driven science
“make sense of this data” -> “does this make sense?”
Transferring Characteristics
Lots of wiggly lines with protein names
A fact based discipline
Rather than laws captured in mathematics….
We have lots of facts: the discipline’s knowledge
Rather than “calculating” what a protein does, we investigate and write it down
Equivalent to writing down the trajectories of all thrown objects and not doing ballistics!
To do biology one needs “the knowledge”
Heterogeneity
28 ways to format the representations of a biological sequence
Though one way to represent the bases or amino acids…
Different words same concept
Different concepts same words
Different and implicit data schema
An identity crisis
Database entries have identifiers unique within their database
The type of entity described in an entry doesn’t have an identifier
Different entries about the same type talk about it differently
How do we know when an entry in one DB talks about the same thing as another entry in another DB?
That’s the skill of a bioinformatician
Categories and Category Labels
Shows a GO category and the various labels associated with it
The role of knowledge
A lot of facts
Perhaps organised into a system
No equivalent of “laws of mechanics” – we can’t do this biology with mathematics
Or at least not without knowing what the numbers mean...
This is why we’ve been using ontologies!
Uses of Ontology in Bioinformatics
Shows a spider diagram with “description” in centre and “knowledge acquisition” at top (one of nodes)
Post-genomic biology
Fly, mouse, yeast, worm all have their own terminologies
I want to compare genomes
How?
The genomic sequence is easily dealt with computationally and comparisons are easy
This is not true of the annotations or knowledge of those sequences
Need a common understanding
Annotation of data
Big effort to create controlled vocabularies using ontologies
A huge annotation effort – describe the entities in DB with terms from ontologies
The Gene Ontology (http://www.geneontology.org)
The Open Biomedical Ontologies Consortium
** NO SLIDE TITLE
Lots of lines leading back to metabolism from acetylcholine biosynthesis at bottom
Looks similar to tree diagram with lots of nodes, with “is a” in it, e.g. biosynthesis – is a - metabolism
The Sequence Ontology
GO in Analysis
Microarray analysis one of the original visions for GO
Modulated genes cluster around functional attributes of their proteins
GO also used in, for example, semantic similarity; text analysis; etc.
BioCatalogue screen shots
Shield Users and applications
Slide shows Taverna on top of arrows to services (Jun diagram).
Utopia slide
Utopia is a collection of interactive tools for analysing protein sequence and structure. Up front are user-friendly and responsive visualisation applications, behind the scenes a sophisticated model that allows these to work together and hides much of the tedious work of dealing with file formats and web services.
Workflows under the hood
e-Laboratories (portals)
Systems Biology, e-Health
Web based execution
Running workflows over the web through myExperiment
Visualisation clients that call workflows in the background
Fact Management
When “stamp collecting” we’re collecting facts
Biology is a fact management activity
Knowing what these facts mean is very important
Science is performed on data and the semantics of data enable us to do science
Semantic e-Science
Summary
The nature of modern biology gives it interesting knowledge (fact) management issues
It is a knowledge based discipline
Not unique, but often extreme
Ontologies seen as one component in management (but not a panacea)
Acknowledgments
With all the people who work on myGrid, myExperiment, Taverna … etc etc.