Literature Based Framework for Semantic Descriptions of e-Science resources

  • A brief introduction of my recent affiliations
  • This slide can be replaced with James’
  • Mention that this example is taken from myGrid project.
  • The volume of knowledge being generated in different research domains is increasing, with new concepts and terms being added continuously. Automated methods are therefore required to distil information, extract facts, discover implicit links and generate hypotheses relevant to users’ needs. Automatic acquisition of knowledge from unstructured text typically starts with the identification of terminology relevant for a specific domain, topic or task. Terms provide a means of communication, and it is the terms and their relationships that convey knowledge across scientific articles in particular (Krauthammer and Nenadic 2004). Terms are usually structurally organised, not only to help information retrieval and extraction, but also to facilitate the smooth expansion of terminology, where newly discovered terms/concepts are integrated into an existing taxonomy.

Transcript

  • 1. A Literature based framework for semantic descriptions of e-Science resources
    • Hammad Afzal
    • PhD, Computer Science
    • The University of Manchester, UK
    • Seminar at:
    • National University of Sciences and Technology,
    • Islamabad
    • Dated: April, 2010
  • 2. Who am I
    • A former PhD Student at University of Manchester (Finished in Dec, 2009).
    • A former Research Fellow at Digital Enterprise Research Institute (DERI), National University of Ireland (Finished in Dec, 2009)
    • At University of Manchester:
      • Text Mining Group.
      • Worked to automate the semantic description of bioinformatics resources using Natural Language Processing (NLP) techniques on the large amount of relevant literature available online
    • At DERI:
      • Unit of Natural Language Processing.
      • Worked on the development of methods for the semi-automatic and automatic generation of multilingual lexicons for domain ontologies, exploiting Web-based and language resources.
  • 3. e-Science Perspective
    • Developments on the Web have changed the way research is done.
    • The resources are now mostly outside a researcher’s office,
      • Scientific data, knowledge and computational resources are typically distributed over the Internet.
      • This paradigm is largely known as e-Science.
    • E-Science is an infrastructure for systematic development of research methods that involve distributed resources (Web services, data and knowledge resources, and computational resources) and their application to research
  • 4. e-Science Resources
    • The resources involved in e-Science are known as e-Science resources, which can be
      • Scientific literature databases (e.g. PubMed, PubChem).
      • Tool repositories (e.g. bioinformatics tools and services provided by the European Bioinformatics Institute (EBI)).
      • Social-network-like portals where scientists can exchange knowledge and comments (e.g. myExperiment, F1000 Biology).
  • 5. Semantic Web
      • Provides machine understandability by adding machine-processable semantics to the conventional Web infrastructure.
      • Has revolutionised the paradigms of resource sharing and service provision by adding meaning to resources (services, data) through associated semantics (formal descriptions of their meaning).
  • 6. Semantic Web
    • Ontologies
      • Integral part of Semantic Web.
      • The specification of conceptualisations, used to help programs and humans share knowledge (Gruber, 1995).
      • Capture and computationally present knowledge shared by people in a certain community (Hadzic and Chang, 2004).
      • Represent a set of concepts (typically with precise definitions), which are mutually linked through a number of relationships.
      • Examples :
  • 7. Bioinformatics e-Resources
    • Bioinformatics – a pioneer adopter of e-Science
      • use of computational and mathematical techniques to store, manage, and analyse the data from molecular biology in order to answer questions about biological phenomena (Lord et al., 2004).
      • emerged from molecular biology laboratories,
        • enormous amount of data is produced,
        • various tools (Web services) that operate on that data.
      • Bioinformaticians typically decompose high-level tasks into simpler modules and choose the most appropriate class of service to accomplish each sub-task using different data resources, many of which are distributed (Wroe et al., 2004).
  • 8. Bioinformatics e-Resources
  • 9. Semantic Descriptions of Bioinformatics e-Resources
    • A number of bioinformatics tools and resources are available for service use and composition
      • an estimated 3000+ Web services are publicly available
      • how to find a service, what is out there to use?
      • provenance?
    • Efficient use of resources requires making them discoverable by potential users.
      • Their functional capabilities need to be described, so that they are accessible not only to humans but also to machines (resource crawlers, software agents etc).
  • 10. BioCatalogue
    • Beta version at http://beta.biocatalogue.org/
    • Launch: June 2009 at ISMB
  • 11. Semantic Descriptions in Bioinformatics Domain
    • Semantic annotation of bioinformatics services
      • annotate functional capabilities
      • e.g. Taverna, myGrid, myExperiment, EBI
    • Not only services and tools
      • databases, repositories, corpora
    • Manual curation
      • e.g. myGrid, BioCatalogue etc.
      • e.g. Taverna/Feta: only ~15-20% functionally described
      • backlog, and the number of services is growing
  • 12. Our approach – Mine the literature
    • Literature: still the largest and most popular source of knowledge.
    • Hypothesis: the semantic profiles of entities and events can be extracted from the domain literature.
  • 13. Example: Semantically Annotated Web Service
    • Annotations combine textual descriptions and ontological mappings
  • 14. Detailed approach
  • 15. The rest of the talk
    • Methodology
      • A literature based methodology to develop and maintain existing domain knowledge representations, in particular Controlled Vocabularies, Ontologies.
      • An integrated literature based methodology for extraction of resource description profiles
      • Building semantic networks of resources from their descriptions.
    • What next?
  • 16. 1st Module: Building Controlled Vocabulary from Literature
  • 17.  
  • 18. Terminology Building
    • First step towards knowledge acquisition from unstructured text.
    • Structurally organised terms help in
      • Information Retrieval (IR)
      • Information Extraction (IE)
      • Document Summarization, etc.
    • Used in annotation tasks,
      • predefined and authorised terms known as controlled vocabularies (CVs) provide domain-specific tags to enrich data or textual resources
      • Terms provide basis for Ontologies, Controlled Vocabularies, Taxonomies used in Semantic Web
    • Terms are automatically identified in literature using Automatic Term Recognition (ATR) techniques
  • 19. Controlled Vocabulary Building – a challenging task
      • In dynamic domains, new terms representing new domain concepts are continuously introduced.
      • Generic ATR techniques fail to differentiate between terms related to a specific task and generic domain terms in heterogeneous text (in particular scientific articles in cross-disciplinary domains)
  • 20. Controlled Vocabulary Building – Solution
    • Term Classification
      • Assigning terms to domain-specific classes.
      • Narrowing down the specific meaning of a concept described by a given term.
        • For example, in biomedicine, terms can be assigned to classes such as genes, proteins, mRNAs, diseases, etc.
      • Can help in building controlled vocabularies
        • by classifying instances of specific and focused sub-classes of interest.
  • 21. Building controlled vocabulary from literature
  • 22. Term Classification driven approach
    • Learn bioinformatics terms from literature
      • 1) get a corpus
      • 2) get all terms
      • 3) get seed examples
      • 4) find relevant ones using term profiling and comparison to seed examples
  • 23. Bioinformatics terminology
    • Use seed terms to bootstrap
      • e.g. known descriptors used in existing service descriptions, either in literature or service repositories
        • 250 terms identified, manual pruning after automatic term recognition
      • examples of lexical constituents and textual behaviour (pragmatics)
        • lexical profiling
        • contextual profiling
  • 24. Bioinformatics terminology
    • Lexical profiling
      • what is in the name
    • Contextual profiling
      • characterise sentences in which terms appear (nouns, verbs and context-patterns)
    • Comparing candidate term profiles to
      • average seed term
      • best-match
  • 25. Lexical Profile
    • Term (t): protein
      • Lexical Profile LP(t): (1) protein
    • Term (t): protein sequence
      • Lexical Profile LP(t): (1) protein (2) sequence (3) protein sequence
    • Term (t): protein sequence alignment
      • Lexical Profile LP(t):
        • protein
        • sequence
        • alignment
        • protein sequence
        • sequence alignment
        • protein sequence alignment
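
The lexical profiles listed above are consistent with taking every contiguous word sequence inside a term. A minimal sketch under that assumption follows; the function name `lexical_profile` and the lower-casing step are illustrative choices, not something stated on the slides.

```python
from typing import List

def lexical_profile(term: str) -> List[str]:
    """Return every contiguous word sequence (word n-gram) contained in a term."""
    words = term.lower().split()
    profile = []
    for n in range(1, len(words) + 1):            # n-gram length, shortest first
        for start in range(len(words) - n + 1):   # slide a window of n words
            profile.append(" ".join(words[start:start + n]))
    return profile

print(lexical_profile("protein sequence alignment"))
```

For "protein sequence alignment" this yields the same six entries as the slide, in the same order.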
  • 26. Contextual Profile
    • Sentence: Genscan program node can produce a list of nucleotide FASTAs of predicted transcripts
    • Verb Profile: produce
    • Noun Profile: genscan, program, list, transcript
    • Left Pattern (LP), Class-Level (LP 1): <Term>, produce, <NP>, of
    • Right Pattern (RP), Class-Level (RP 1): of, <NP>
  • 27. Profile Comparisons
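
The transcript names two ways of comparing a candidate term with the seed terms (average seed term and best-match) but does not state the similarity measure. The sketch below assumes Jaccard overlap between profiles represented as sets, and reads "average seed term" as the mean similarity over all seeds; both are assumptions, and all function names are hypothetical.

```python
from typing import List, Set

def jaccard(a: Set[str], b: Set[str]) -> float:
    """Set overlap between two profiles (assumed measure, not from the slides)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def best_match_score(candidate: Set[str], seeds: List[Set[str]]) -> float:
    """Similarity to the single most similar seed profile."""
    return max((jaccard(candidate, s) for s in seeds), default=0.0)

def average_seed_score(candidate: Set[str], seeds: List[Set[str]]) -> float:
    """Mean similarity over all seed profiles."""
    if not seeds:
        return 0.0
    return sum(jaccard(candidate, s) for s in seeds) / len(seeds)

# Toy profiles for illustration only.
seeds = [
    {"sequence", "alignment", "sequence alignment"},
    {"protein", "database", "protein database"},
]
candidate = {
    "multiple", "sequence", "alignment",
    "multiple sequence", "sequence alignment", "multiple sequence alignment",
}
print(best_match_score(candidate, seeds), average_seed_score(candidate, seeds))
```

Best-match rewards a candidate that closely resembles any one seed, whereas the average favours candidates that resemble the seed set as a whole.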
  • 28. Bioinformatics terminology
    • Comparison between Profile based term classification and generic Term Recognition (c-Value method)
  • 29. Statistics about textual corpus (Full Text Articles)
    • # of documents: 2,691
    • # of distinct candidate terms: 113,280
    • # of candidate term occurrences: 533,418
    • # of distinct sentences: 294,614
    • # of distinct context noun stems: ~79,000
    • # of distinct context verb stems: ~2,500
  • 30. The Bioinformatics Controlled Vocabulary (number of terms)
    • ATR (C-Value), total number of candidate terms: 113,280
    • Terms with lexical similarity to resource terms: 95,437
    • Terms with context noun similarity to resource terms: 103,104
    • Terms with context verb similarity to resource terms: 73,478
    • Terms with context pattern similarity to resource terms: 21,182
    • Terms with combined contextual similarity (Nouns ∪ Verbs ∪ Patterns): 98,307
  • 31. 2nd Module: Mining Semantic Descriptions from Literature
  • 32.  
  • 33. Mining service descriptions
  • 34.  
  • 35. Semantic classes – myGrid Ontology
    • Informatics concepts
      • general concepts of data, data structures, databases, metadata
    • Bioinformatics concepts
      • domain-specific data sources and algorithms for searching and analysing data
      • e.g. Smith-Waterman algorithm
  • 36. Semantic classes – myGrid Ontology
    • Molecular biology concepts
      • higher level concepts used to describe bioinformatics data types, used as inputs and outputs in services
        • e.g. protein sequence, nucleic acid sequence
    • Task concepts
      • generic tasks a service operation can perform
        • e.g. retrieving, displaying, aligning
  • 37. Semantic classes identification
    • Engineered from MyGrid bioinformatics sub-ontology
    • Semantic class: typical terminological heads
      • Application: application, tool, service, software, system, program
      • Algorithm: algorithm, method, approach, procedure, analysis, alignment
      • Data: data, record, report, sequence, structure
      • Data Resource: resource, database, dataset, repository
  • 38. Resource mentions
    • Named-entity recognition (NER) task
    • Recognition of service mentions using
      • terminological (semantic) heads of automatically recognised terms
        • Apollo2Go Web Service is an Application
        • BIND database is a Data Resource
        • assign the corresponding semantic class
      • Hearst patterns (co-ordinations, appositions, enumerations, etc.)
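
Head-based class assignment, as in the Apollo2Go and BIND examples above, can be sketched as a lookup on the term's head word. The head lists are copied from the semantic class table in the slides; treating the last word as the head and stripping a trailing "s" for plurals are simplifying assumptions, and `semantic_class` is a hypothetical name. Hearst patterns are not covered here.

```python
from typing import Optional

# Terminological heads per semantic class, copied from the slides.
CLASS_HEADS = {
    "Application": ["application", "tool", "service", "software", "system", "program"],
    "Algorithm": ["algorithm", "method", "approach", "procedure", "analysis", "alignment"],
    "Data": ["data", "record", "report", "sequence", "structure"],
    "Data Resource": ["resource", "database", "dataset", "repository"],
}
HEAD_TO_CLASS = {head: cls for cls, heads in CLASS_HEADS.items() for head in heads}

def semantic_class(term: str) -> Optional[str]:
    """Assign a class from the term's head, assumed here to be its last word."""
    words = term.lower().split()
    if not words:
        return None
    head = words[-1]
    # Crude plural handling (assumption): also try the head without a final "s".
    for form in (head, head[:-1] if head.endswith("s") else head):
        if form in HEAD_TO_CLASS:
            return HEAD_TO_CLASS[form]
    return None

print(semantic_class("Apollo2Go Web Service"))
print(semantic_class("BIND database"))
```

Terms whose head is not in any list return `None`, which leaves them for other evidence such as the Hearst patterns mentioned above.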
  • 39. Semantic classes and instances
  • 40. Semantic classes and instances
  • 41.  
  • 42. Extraction/functional rules
    • Manually designed predicate-driven rules: Subject (Arg) – Verb (Predicate) – Object (Arg)
    • Applied on dependency parsed sentences
      • Stanford parser
      • no phrase structures
      • complex sentences
      • information in sub-clause
    “Matrix Global Alignment Tool MatGAT generates similarity/identity matrices for DNA or protein sequences”
    “Term_App generates similarity/identity matrices for DNA or protein sequences”
  • 43. Extraction/functional rules
    • Phrase structures are identified and integrated with the dependencies
    • Predicate-dependent rules applied to extract specific ‘content’ and profile the services
    • Profiles collated for all mentions
      • service name variation
    “Matrix Global Alignment Tool MatGAT generates similarity/identity matrices for DNA or protein sequences”
    “Term_App generates similarity/identity matrices for DNA or protein sequences”
  • 44. Extraction/functional rules
  • 45. Extraction/functional rules
    • Predicate-driven rules: each verb associated with the type of “information content” it provides
    • Function: associated verbs
      • Generic functionality / Task specification: applied, access, achieve, align, allow, based, developed, implemented, present, provide, used, is a, called
      • Inputs, outputs: accept, applied, create, provide, query, retrieve, starts with, take, used, generate
      • Comparison: outperform, perform, compare
      • Implementation technique, Programming language: implement(ed)
      • Composition, subtasks: contain(ed), construct(ed), generate(d)
      • Availability: available
  • 46. Information Extraction
    • Input sentence: “Matrix Global Alignment Tool MatGAT generates similarity/identity matrices for DNA or protein sequences”
    • SC instance (resource): Matrix Global Alignment Tool MatGAT
    • SC: Application
    • Task: Generate
    • Predicted input: DNA or protein sequences
    • Predicted output: similarity/identity matrices
    • Descriptors: similarity/identity matrices, DNA or protein sequences
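
The MatGAT example can be mimicked with a small sketch that starts from an already parsed subject, verb and arguments; it does not call the Stanford parser used in the talk. The verb table is a subset of the one on the slides. `ROLE_RULES` and `build_profile` are hypothetical, and the role rule (direct object as output, the "for" phrase as input) is inferred from this single example, so other verbs such as accept or take would need their own rules.

```python
from typing import Dict, List, Optional

# Verb -> information types, a subset copied from the verb table on the slides.
VERB_FUNCTIONS: Dict[str, List[str]] = {
    "generate": ["Inputs, outputs", "Composition, subtasks"],
    "provide": ["Generic functionality/Task specification", "Inputs, outputs"],
    "implement": ["Implementation technique, Programming language"],
    "outperform": ["Comparison"],
}

# Hypothetical role rule inferred from the MatGAT example only.
ROLE_RULES: Dict[str, Dict[str, str]] = {
    "generate": {"direct_object": "predicted_output", "prep_object": "predicted_input"},
}

def build_profile(resource: str, sem_class: str, verb: str,
                  direct_object: str, prep_object: Optional[str] = None) -> dict:
    """Turn one parsed Subject-Verb-Object mention into a description profile."""
    args = {"direct_object": direct_object, "prep_object": prep_object}
    profile = {
        "resource": resource,
        "semantic_class": sem_class,
        "task": verb,
        "information_types": VERB_FUNCTIONS.get(verb, []),
        "descriptors": [a for a in args.values() if a],
    }
    # Fill input/output slots only for verbs that have a role rule.
    for arg_name, slot in ROLE_RULES.get(verb, {}).items():
        if args[arg_name]:
            profile[slot] = args[arg_name]
    return profile

profile = build_profile(
    "Matrix Global Alignment Tool MatGAT", "Application", "generate",
    "similarity/identity matrices", "DNA or protein sequences",
)
print(profile)
```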
  • 47.  
  • 48. Experiments
    • 2657 BMC Bioinformatics articles
      • full-text articles before March 2008
    • 108 predicates used
    • Semantic class: total # of instances
      • Algorithm: 5,722
      • Application: 2,076
      • Data: 2,662
      • Data Resource: 1,992
      • Total: 12,452
  • 49. Example – GeneClass
    • Resource descriptors
      • 1) Descriptors (frequency of co-occurrence)
        • motif data: 4
        • differential gene expression: 3
        • reliable predictive model: 2
        • genome-wide protein-DNA binding data: 2
        • transcriptional gene regulation: 2
        • gene expression data: 1
      • 2) MyGrid terms: BIND
      • 3) Related resources: Robust GeneClass Algorithm
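
Descriptor frequencies like those above come from collating the profiles of all mentions of a resource, including name variants. A minimal sketch of that collation step follows; the alias table, the function name `collate_descriptors` and the toy mentions are hypothetical and do not reproduce the counts on the slide.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

def collate_descriptors(mentions: Iterable[Tuple[str, List[str]]],
                        aliases: Dict[str, str]) -> Dict[str, Counter]:
    """Count descriptors per resource, merging name variants through `aliases`."""
    profiles: Dict[str, Counter] = {}
    for name, descriptors in mentions:
        canonical = aliases.get(name.lower(), name)  # fall back to the name as seen
        profiles.setdefault(canonical, Counter()).update(descriptors)
    return profiles

# Toy data for illustration only.
aliases = {"geneclass algorithm": "GeneClass", "geneclass": "GeneClass"}
mentions = [
    ("GeneClass algorithm", ["motif data", "gene expression data"]),
    ("GeneClass", ["motif data"]),
]
collated = collate_descriptors(mentions, aliases)
print(collated["GeneClass"].most_common())
```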
  • 50. Example – GeneClass: Functional Content
    • Predicate (Task): predict
    • Subject: GeneClass Algorithm
    • Functional Description: predicting differential gene expression
    • Input/Output: starts with a candidate set of motifs
  • 51. Example – GeneClass
    • Sentences
    • We also show how to incorporate genome-wide protein-DNA binding data from ChIP chip experiments into the GeneClass algorithm , and we use an improved noise model for gene expression data [PMC 1810316].
    • The GeneClass algorithm for predicting differential gene expression starts with a candidate set of motifs, representing known or putative regulatory element sequence patterns, and a candidate set of regulators or parents [PMC 1810316].
    • Target set: We extend the original GeneClass algorithm to use all target genes for which both motif and expression data is available [PMC 1810316].
  • 52. Evaluation of semantic profiles
    • Evaluated for their capability to be used for semantic description of a given bioinformatics resource
      • (0) irrelevant
      • (1) partially useful
      • (2) useful
    • Example sentences
      • HeatMapper: The HeatMapper tool has already proven to be very useful in several studies
      • Kalign: To compare Kalign to other MSA programs, the following test sets were used.
      • Cognitor: To add a new species to the COG system, the annotated protein sequences from the respective genome were compared to the proteins in the COG database by using the BLAST program and assigned to pre-existing COGs by using the COGNITOR program
  • 53. Evaluation of semantic profiles
    • Two experiments:
      • 15 well-known resources with descriptions already available
      • 15 new resources
    • Quality comparison of various components of resource description profiles from the two experiments
  • 54. 3rd Module: Mining Semantic Networks from Literature
  • 55.  
  • 56. What next?
    • Good recall, poor precision
      • context needs a better model
    • Mining parameter values
      • sub-language of parameters
    • Candidate service/resource mentions
      • an entity whose profile looks like a service
      • comparison of semantic profiles
      • network of services [ISMB 2009]
    • Do we have good service ontologies?
  • 57. What Next? (Proposed in BioHackathon2010)
    • Example sentences
      • Phylogenetic trees are then generated by the ClustalW program by the neighbour-joining method [PMC1973088].
      • We also used the CLUSTALW program for multialignment as a control process [PMC434493].
    • Extracted resources (Resource1, Resource2, Resource3): Phylogenetic Tree, ClustalW Program, Multialignment
    • RDF Store (node types: Data, Task)
      • Phylogenetic Tree [Generated by] ClustalW Program
      • ClustalW Program [Is used for] Multialignment
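
The two relations on this slide can be written down as RDF triples. The sketch below serialises them as N-Triples using plain Python; the namespace `http://example.org/escience/` and the predicate names `generatedBy` and `usedFor` are placeholders invented for illustration, not identifiers from the talk or from any existing ontology.

```python
from typing import Iterable, Tuple

BASE = "http://example.org/escience/"  # hypothetical namespace

def iri(label: str) -> str:
    """Turn a label into an IRI under the hypothetical BASE namespace."""
    return "<" + BASE + label.strip().replace(" ", "_") + ">"

def to_ntriples(triples: Iterable[Tuple[str, str, str]]) -> str:
    """Serialise (subject, predicate, object) labels, one N-Triples line each."""
    return "\n".join(f"{iri(s)} {iri(p)} {iri(o)} ." for s, p, o in triples)

triples = [
    ("Phylogenetic Tree", "generatedBy", "ClustalW Program"),
    ("ClustalW Program", "usedFor", "Multialignment"),
]
ntriples = to_ntriples(triples)
print(ntriples)
```

Once the triples share resource IRIs, mentions from different articles (here PMC1973088 and PMC434493) link up through the common ClustalW Program node.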
  • 58. Conclusion
    • Literature mining approach to service description and annotation
    • Aims
      • reduce curation efforts
      • provide semantic synopses of services for the Semantic Web
    • Potential of text mining
      • integration with other annotation approaches
      • extracting the entire service context is still challenging
  • 59. Related Selected Publications
    • Hammad Afzal, James Eales, Robert Stevens, Goran Nenadic (2010): Mining Semantic Networks of Bioinformatics Web Resources from the Literature, Journal of Biomedical Semantics.
    • Hammad Afzal, Robert Stevens, Goran Nenadic (2009): Mining Semantic Descriptions of Bioinformatics Web Resources from the Literature, 6th European Semantic Web Conference (ESWC) on the Semantic Web: Research and Applications. Heraklion, Crete, Greece, Springer-Verlag.
    • Hammad Afzal, Robert Stevens, Goran Nenadic (2008): Towards Semantic Annotation of Bioinformatics Services: Building a Controlled Vocabulary, Third International Symposium on Semantic Mining in Biomedicine (SMBM 2008).
  • 60. Thanks