Developments in the Web have changed the way research is done.
Resources now lie mostly outside the researcher's office:
scientific data, knowledge and computational resources are typically distributed over the Internet.
This paradigm is widely known as e-Science.
E-Science is an infrastructure for the systematic development of research methods that involve distributed resources (Web services, data and knowledge resources, and computational resources) and for their application to research.
Bioinformatics: the use of computational and mathematical techniques to store, manage, and analyse the data from molecular biology in order to answer questions about biological phenomena (Lord et al., 2004).
It emerged from molecular biology laboratories,
where an enormous amount of data is produced,
together with various tools (Web services) that operate on that data.
Bioinformaticians typically decompose high-level tasks into simpler modules and choose the most appropriate class of service to accomplish each sub-task using different data resources, many of which are distributed (Wroe et al., 2004).
e.g. Taverna/Feta: only ~15-20% functionally described
backlog – and the number of services is growing
Semantic Descriptions in Bioinformatics Domain
Our approach: mine the literature
Literature: still the largest and most popular source of knowledge.
Hypothesis: the semantic profiles of entities and events can be extracted from the domain literature.
Example: Semantically Annotated Web Service
Annotations combine textual descriptions and ontological mappings
Narrowing down the specific meaning of a concept described by a given term.
For example, in biomedicine, terms can be assigned to classes such as genes, proteins, mRNAs, diseases, etc.
Can help in building controlled vocabularies by classifying instances of specific and focused sub-classes of interest.
Controlled Vocabulary Building – Solution
Building controlled vocabulary from literature
Term classification-driven approach: learn bioinformatics terms from literature
1) get a corpus
2) get all terms
3) get seed examples
4) find relevant ones using term profiling and comparison to seed examples
characterise sentences in which terms appear (nouns, verbs and context-patterns)
Comparing candidate term profiles to the average seed term
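The comparison step can be sketched as follows. This is a minimal illustration only: the slides do not say which similarity measure is used, so cosine similarity is an assumption here, and the feature counts, the term names and the helper names `average_profile` and `cosine` are all hypothetical toy values.

```python
from collections import Counter
from math import sqrt


def average_profile(profiles):
    """Average several feature-count profiles into one centroid profile."""
    total = Counter()
    for profile in profiles:
        total.update(profile)
    n = len(profiles)
    return {feature: count / n for feature, count in total.items()}


def cosine(a, b):
    """Cosine similarity between two sparse feature->weight mappings (assumed measure)."""
    dot = sum(a[f] * b[f] for f in a if f in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy seed profiles (context noun/verb stems with counts); not taken from the corpus.
seed_profiles = [
    Counter({"align": 2, "sequence": 3}),
    Counter({"align": 1, "sequence": 1, "database": 1}),
]
seed_avg = average_profile(seed_profiles)

# Toy candidate terms with their context profiles.
candidates = {
    "alignment tool": Counter({"align": 1, "sequence": 2}),
    "coffee break": Counter({"drink": 1}),
}

# Rank candidates by similarity to the average seed term profile.
ranked = sorted(candidates, key=lambda t: cosine(candidates[t], seed_avg), reverse=True)
```

Candidates that share no features with the seed centroid score zero and fall to the bottom of the ranking; a threshold on the score would then decide which candidates enter the vocabulary.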
Lexical Profile
Term (t): Lexical Profile LP(t)
protein: (1) protein
protein sequence: (1) protein (2) sequence (3) protein sequence
protein sequence alignment
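One way to reproduce the lexical profiles shown for "protein" and "protein sequence" is to list every contiguous word sequence of the term. This is an assumption inferred from those two rows, not a definition stated on the slide, and the function name `lexical_profile` is hypothetical.

```python
def lexical_profile(term):
    """Return all contiguous word sequences of a term, shortest first (assumed definition)."""
    words = term.lower().split()
    profile = []
    for n in range(1, len(words) + 1):          # sequence length
        for i in range(len(words) - n + 1):     # start position
            profile.append(" ".join(words[i:i + n]))
    return profile


two_word_profile = lexical_profile("protein sequence")
# two_word_profile == ["protein", "sequence", "protein sequence"]
```

Under this assumption a three-word term such as "protein sequence alignment" would yield six entries; the slide does not show that row, so this is not confirmed by the source.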
Contextual Profile
Sentence: Genscan program node can produce a list of nucleotide FASTAs of predicted transcripts
Verb Profile: produce
Noun Profile: genscan, program, list, transcript
Left Pattern (LP), Class-Level (LP1): <Term>, produce, <NP>, of
Right Pattern (RP), Class-Level (RP1): of, <NP>
Comparison between Profile based term classification and generic Term Recognition (c-Value method)
Statistics about textual corpus (Full Text Articles)
# of documents: 2,691
# of distinct candidate terms: 113,280
# of candidate term occurrences: 533,418
# of distinct sentences: 294,614
# of distinct context noun stems: ~79,000
# of distinct context verb stems: ~2,500
The Bioinformatics Controlled Vocabulary (Number of Terms)
ATR (C-Value), total number of candidate terms: 113,280
Number of terms with lexical similarity to resource terms: 95,437
Number of terms with context noun similarity to resource terms: 103,104
Number of terms with context verb similarity to resource terms: 73,478
Number of terms with context pattern similarity to resource terms: 21,182
Number of terms with combined contextual similarity (Nouns ∪ Verbs ∪ Patterns): 98,307
2nd Module: Mining Semantic Descriptions from Literature
Predicate-driven rules: each verb associated with the type of “information content” it provides
Function: Associated verbs
Generic functionality / Task specification: applied, access, achieve, align, allow, based, developed, implemented, present, provide, used, is a, called
Inputs, outputs: accept, applied, create, provide, query, retrieve, starts with, take, used, generate
Comparison: outperform, perform, compare
Implementation technique, Programming language: implement(ed)
Composition, subtasks: contain(ed), construct(ed), generate(d)
Availability: available
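The verb table above can be held as a lookup from each verb to the kinds of information content it signals. The sketch below is illustrative only: the dictionary layout, the expansion of forms such as "implement(ed)" into two entries, and the helper name `functions_for` are assumptions, not the authors' implementation.

```python
from collections import defaultdict

# Function category -> associated verbs, copied from the table above.
VERBS_BY_FUNCTION = {
    "generic functionality / task specification": [
        "applied", "access", "achieve", "align", "allow", "based", "developed",
        "implemented", "present", "provide", "used", "is a", "called",
    ],
    "inputs, outputs": [
        "accept", "applied", "create", "provide", "query", "retrieve",
        "starts with", "take", "used", "generate",
    ],
    "comparison": ["outperform", "perform", "compare"],
    "implementation technique, programming language": ["implement", "implemented"],
    "composition, subtasks": [
        "contain", "contained", "construct", "constructed", "generate", "generated",
    ],
    "availability": ["available"],
}

# Invert the table: verb -> list of function categories it may indicate.
FUNCTIONS_BY_VERB = defaultdict(list)
for function, verbs in VERBS_BY_FUNCTION.items():
    for verb in verbs:
        FUNCTIONS_BY_VERB[verb].append(function)


def functions_for(verb):
    """Return the information-content categories associated with a verb."""
    return list(FUNCTIONS_BY_VERB.get(verb.lower(), []))


generate_functions = functions_for("generate")
```

Some verbs are ambiguous: "generate" appears under both "inputs, outputs" and "composition, subtasks", so a rule keyed on the verb alone would need further context to pick one reading.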
Information Extraction
Input Sentence: "Matrix Global Alignment Tool MatGAT generates similarity/identity matrices for DNA or protein sequences"
SC instance (resource): Matrix Global Alignment Tool MatGAT
SC: Application
Task: Generate
Predicted input: DNA or protein sequences
Predicted output: similarity/identity matrices
Descriptors: similarity/identity matrices, DNA or protein sequences
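A toy version of such a predicate-driven rule for the MatGAT sentence might look like the sketch below. The single regular expression stands in for the real rules, which the slides do not spell out; the pattern, the field names and the function name `extract` are hypothetical, and a realistic system would rely on parsing rather than a regex.

```python
import re

# Hypothetical rule: "<resource> generates <output> for <input>"
GENERATES_RULE = re.compile(
    r"^(?P<resource>.+?)\s+generates\s+(?P<output>.+?)\s+for\s+(?P<input>.+?)\.?$"
)


def extract(sentence):
    """Apply the toy 'generates ... for ...' rule; return None if it does not match."""
    match = GENERATES_RULE.match(sentence.strip())
    if match is None:
        return None
    return {
        "resource": match.group("resource"),
        "task": "generate",
        "predicted_input": match.group("input"),
        "predicted_output": match.group("output"),
    }


sentence = (
    "Matrix Global Alignment Tool MatGAT generates similarity/identity "
    "matrices for DNA or protein sequences"
)
result = extract(sentence)
```

For this sentence the rule yields the resource "Matrix Global Alignment Tool MatGAT", the predicted output "similarity/identity matrices" and the predicted input "DNA or protein sequences", matching the slide's example.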
We also show how to incorporate genome-wide protein-DNA binding data from ChIP chip experiments into the GeneClass algorithm, and we use an improved noise model for gene expression data [PMC 1810316].
The GeneClass algorithm for predicting differential gene expression starts with a candidate set of motifs; representing known or putative regulatory element sequence patterns and a candidate set of regulators or parentSS [PMC 1810316].
Target set: We extend the original GeneClass algorithm to use all target genes for which both motif and expression data is available [PMC 1810316].
Evaluation of semantic profiles
Evaluated for their capability to be used for semantic description of a given bioinformatics resource: (0) irrelevant, (1) partially useful, (2) useful
HeatMapper: The HeatMapper tool has already proven to be very useful in several studies
Kalign: To compare Kalign to other MSA programs, the following test sets were used.
Cognitor: To add a new species to the COG system, the annotated protein sequences from the respective genome were compared to the proteins in the COG database by using the BLAST program and assigned to pre-existing COGs by using the COGNITOR program
What Next? (Proposed in BioHackathon2010)
Phylogenetic trees are then generated by the ClustalW program by the neighbour-joining method [PMC1973088].
We also used the CLUSTALW program for multialignment as a control process [PMC434493].
[Diagram: relations extracted from such sentences go into an RDF store (Resource1, Resource2, Resource3) with # Data and # Task nodes: Phylogenetic Tree "generated by" ClustalW Program; ClustalW Program "is used for" Multialignment]
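The two relations in the diagram can be pictured as subject-predicate-object triples. The sketch below uses plain Python tuples rather than a real RDF library, so it only illustrates the data shape; the `Triple` type, the `match` helper and the absence of URIs are simplifying assumptions.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Relations taken from the two example sentences above.
STORE: List[Triple] = [
    Triple("Phylogenetic Tree", "generated by", "ClustalW Program"),
    Triple("ClustalW Program", "is used for", "Multialignment"),
]


def match(store: List[Triple],
          subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[str] = None) -> List[Triple]:
    """Return triples matching every field that is not None (None acts as a wildcard)."""
    return [
        t for t in store
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# Which data items are produced by the ClustalW program?
produced = match(STORE, predicate="generated by", obj="ClustalW Program")
```

In an actual RDF store each name would be a URI and the wildcard lookup would be a SPARQL query; the tuple version keeps the example self-contained.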