Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources
Published in: Technology, Education
  • INDUS is a federated, query-centric approach to the problem of knowledge acquisition from distributed, semantically heterogeneous, autonomous data sources. Learning algorithms that can be decomposed into information gathering (obtained by answering queries) and hypothesis generation can be easily linked to INDUS. INDUS makes possible the exchange of data and findings between scientists or institutions working on related problems (e.g., in bioinformatics).
  • Design that is tailored for predictive model building using machine learning algorithms from distributed, semantically heterogeneous, autonomous data sources
  • Transcript

    • 1. Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources
      • Doina Caragea, Jyotishman Pathak, Jie Bao, Adrian Silvescu, Carson Andorf, Drena Dobbs and Vasant Honavar
      • July 26, 2005
    • 2. Semantic Web Vision
    • 3. Background and Motivation
      • Transformation of biology from a data-poor science into a data-rich science
      • Proliferation of autonomous, semantically heterogeneous, distributed data sources (more than 500 data repositories of interest to molecular biologists alone)
      • Needed: Software tools for knowledge acquisition from semantically heterogeneous distributed data sources
      • [Figure: example data sources InterPro, MIPS and Swissprot]
    • 4. INDUS (INtelligent Data Understanding System)
      • Goal: knowledge discovery from large, distributed, semantically heterogeneous data
    • 5. Outline
      • Ontology-Based Information Integration
      • Learning Classifiers from Semantically Heterogeneous Data
      • INDUS: Information Integration and Knowledge Acquisition System
      • Summary and Work in Progress
    • 6. Semantically Heterogeneous Data
      • Data sources need to be made self-describing by specifying the relevant meta data
      • D_1 (Protein ID | Protein Name | Protein Sequence | Prosite Motifs | EC Number)
        • Q12797 | Aspartyl/asparaginyl beta-hydroxylase | MAQRKNAKSS GNSSSSGSGS … | TPR TPR_REGION TPR | 1.14.11.16 Peptide-aspartate beta-dioxygenase
        • P35626 | Beta-adrenergic receptor kinase 2 | MADLEAVLAD VSYLMAMEKS … | RGS PROT_KIN_DOM PH_DOMAIN | 2.7.1.126 Beta-adrenergic receptor kinase
      • D_2 (Accession Number AN | Gene | Length | AA Sequence | Pfam Domains | MIPS Funcat)
        • P07278 | BCY1 | 415 | VSSLPKESQA ELQLFQNEIN … | RIIa | 16.19.01 cyclic nucleotide binding (cAMP, cGMP, etc.)
        • P32589 | SSE1 | 692 | STPFGLDLGN NNSVLAVARN … | HSP70 | 16.01 protein binding
    • 7. Meta Data
      • Schema – structure of data
      • Specification of the attributes of the data and their types
      • Ontology – conceptualization of semantics of data
      • Domains of attributes and relationships between values
      • Schema for protein data in D_1:
        • Protein ID: Swissprot ID
        • Protein Name: String
        • Protein Sequence: AA String
        • Prosite Motifs: Motifs
        • EC Number: EC Hierarchy
    • 8. Attribute value hierarchy
      • An attribute value hierarchy (AVH) is a partial-order ontology over the values of attributes of data
      • Example: MIPS Funcat Hierarchy
    • 9. Making data sources self-describing - Ontology-extended data source
      • Ontology-extended data source = Data + Schema + Ontology
      • Schema: MIPS Funcat: MIPS Hierarchy; Prosite Motifs: Motifs; Length: Positive Integer; Gene: Gene ID; Accession Number: MIPS ID
      • Data:
        • P07278 | BCY1 | 415 | VSSLPKESQA ELQLFQNEIN | RIIa | 16.19.01 cyclic nucleotide binding (cAMP, cGMP, etc.)
        • P32589 | SSE1 | 692 | STPFGLDLGN NNSVLAVARN | HSP70 | 16.01 protein binding
    • 10. User view
      • [Figure: data sources of interest (MIPS, Swissprot) combined with a user schema and a user ontology to form a user view]
      • A user view is given by:
      • a set of ontology-extended data sources that are of interest to the user
      • a user schema and ontology (defining a virtual data source)
      • a set of mappings from data source schemas and ontologies to the user schema and ontology
      • User schema: GO Function: GO Hierarchy; Structural Class: SCOP; Protein: AA String; Source: Species String; PID: Swissprot ID
    • 11. Mappings
      • The interoperation between the schema and ontology associated with a data source and a user schema and ontology is facilitated by specifying mappings at:
        • Schema Level: between attributes in different schemas
        • Ontology Level: between values of the attributes described in different ontologies
    • 12. Mappings at schema level
      • D_1: Protein ID: Swissprot ID; Protein Name: String; Protein Sequence: AA String; Prosite Motifs: AA String; EC Number: EC Hierarchy
      • D_2: Accession No AN: MIPS ID; Gene: Gene ID; AA Sequence: AA String; Length: Pos Integer; MIPS Funcat: MIPS Hierarchy; Pfam Motifs: Motifs
      • D_U: PID: Swissprot ID; Protein: AA String; GO Function: GO Hierarchy; Source: Species String
    • 13. Mappings at schema level (schemas as on slide 12)
      • Protein ID : D_1 ≡ PID : D_U
      • Accession Number AN : D_2 ≡ PID : D_U
    • 14. Mappings at schema level (schemas as on slide 12)
      • Protein ID : D_1 ≡ PID : D_U
      • Accession Number AN : D_2 ≡ PID : D_U
      • Protein Sequence : D_1 ≡ AA Composition : D_U
      • AA Sequence : D_2 ≡ AA Composition : D_U
    • 15. Mappings at schema level (schemas as on slide 12)
      • Protein ID : D_1 ≡ PID : D_U
      • Accession Number AN : D_2 ≡ PID : D_U
      • Protein Sequence : D_1 ≡ AA Composition : D_U
      • AA Sequence : D_2 ≡ AA Composition : D_U
      • EC Number : D_1 ≡ GO Function : D_U
      • MIPS Funcat : D_2 ≡ GO Function : D_U
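The schema-level mappings on slide 15 can be read as a rename table from source attributes to user attributes. The following is a minimal sketch under that reading, not INDUS code: the names `SCHEMA_MAP` and `to_user_view` and the dictionary record layout are assumptions made for illustration.

```python
# Schema-level mappings from slide 15: (source, source attribute) -> user attribute.
# The dictionary layout and all names here are illustrative assumptions.
SCHEMA_MAP = {
    ("D1", "Protein ID"): "PID",
    ("D2", "Accession Number AN"): "PID",
    ("D1", "Protein Sequence"): "AA Composition",
    ("D2", "AA Sequence"): "AA Composition",
    ("D1", "EC Number"): "GO Function",
    ("D2", "MIPS Funcat"): "GO Function",
}


def to_user_view(source, record):
    """Rename the mapped attributes of one record; unmapped attributes are dropped."""
    out = {}
    for attr, value in record.items():
        user_attr = SCHEMA_MAP.get((source, attr))
        if user_attr is not None:
            out[user_attr] = value
    return out


row = {"Protein ID": "P35626", "Protein Name": "Beta-adrenergic receptor kinase 2",
       "EC Number": "2.7.1.126"}
print(to_user_view("D1", row))
```

Only attribute names change at this level. The value `2.7.1.126` is still an EC code after renaming, which is why the ontology-level mappings on the following slides are needed as well.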
    • 16. Mappings at ontology level
      • [Figure: hierarchies of D_1 and D_U]
    • 17. Mappings at ontology level
      • EC 2.7.1.126 : D_1 ≡ GO 0047696 : D_U
    • 18. Mappings at ontology level
      • EC 2.7.1 : D_1 ≥ GO 0047696 : D_U
    • 19. Mappings at ontology level
      • EC 2.7.1.126 : D_1 ≤ GO 0004672 : D_U
    • 20. Integration ontology
      • An ontology (O_U, ≤) is called an integration ontology of a set of data source ontologies O_1, …, O_K if there exist K partial injective mappings Φ_1, …, Φ_K from O_1, …, O_K, respectively, to O_U such that:
        • x ≤_i y implies Φ_i(x) ≤ Φ_i(y), for all x, y ∈ O_i
          • Order preservation
        • if (x : O_i op y : O_U) ∈ IC, then (Φ_i(x) op y), for all x ∈ O_i and y ∈ O_U, where IC is the set of specified semantic correspondences
          • Semantic correspondence preservation
      • We provide user-friendly tools for specifying semantic correspondences that are used to infer mappings semi-automatically
      • The consistency of the set of mappings between data source schemas and ontologies and user schema and ontology can be checked using a reasoner
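The two structural conditions in the definition, injectivity and order preservation, can be checked mechanically for small hierarchies. The sketch below is not the reasoner mentioned on the slide. It assumes each ontology is a tree stored as child -> parent links, and the names `leq`, `is_injective` and `preserves_order`, as well as the toy hierarchies, are hypothetical.

```python
def leq(x, y, parent):
    """True if x equals y or lies below y, following child -> parent links."""
    while x is not None:
        if x == y:
            return True
        x = parent.get(x)
    return False


def is_injective(phi):
    """A partial mapping is injective if no two source terms share an image."""
    return len(set(phi.values())) == len(phi)


def preserves_order(phi, src_parent, user_parent):
    """Check that x <=_i y implies phi(x) <= phi(y) for all mapped x, y."""
    return all(
        leq(phi[x], phi[y], user_parent)
        for x in phi
        for y in phi
        if leq(x, y, src_parent)
    )


# Toy hierarchies (hypothetical): a1 is below a; u1 and u2 are below u.
SRC = {"a1": "a"}
USER = {"u1": "u", "u2": "u"}

good = {"a": "u", "a1": "u1"}
bad = {"a": "u1", "a1": "u"}   # maps the child above its parent's image
print(preserves_order(good, SRC, USER), preserves_order(bad, SRC, USER))
```

A mapping that fails either check cannot be part of an integration ontology as defined above. Checking the semantic correspondence condition would additionally need the set IC, which is not modeled here.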
    • 21. Sample Query
      • Return ALL proteins whose GO function isa nucleotide binding
      • Return ALL proteins whose GO function isa kinase activity OR those that are involved in the GO process phosphate metabolism
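An isa query of this kind amounts to a filter over the user view that walks the hierarchy upward from each protein's GO function. The following is a rough sketch, not the INDUS query engine. The rows are taken from the training data on slide 24, while the single is-a link in `GO_PARENT` is an assumption that stands in for the real chain of intermediate GO terms, and the names `is_a` and `select` are hypothetical.

```python
# Assumed is-a link (child -> parent) in the user's GO hierarchy; one link
# stands in for the real chain of intermediate terms.
GO_PARENT = {"GO 0047696": "GO 0004672"}

# Rows of the user view (PID, GO Function), as listed on slide 24.
PROTEINS = [
    ("P39708", "GO 0016208"),
    ("Q01574", "GO 0005515"),
    ("Q12797", "GO 0004597"),
    ("P35626", "GO 0047696"),
]


def is_a(term, concept, parent=GO_PARENT):
    """True if term equals concept or lies below it in the hierarchy."""
    while term is not None:
        if term == concept:
            return True
        term = parent.get(term)
    return False


def select(concepts, rows=PROTEINS):
    """Return PIDs whose GO function is-a any of the given concepts (an OR query)."""
    return [pid for pid, go in rows if any(is_a(go, c) for c in concepts)]


print(select(["GO 0004672"]))
```

Passing several concepts to `select` gives the disjunctive form of the second sample query, restricted here to GO function terms because the example rows carry no GO process attribute.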
    • 22. Outline
      • Ontology-Based Information Integration
      • Learning Classifiers from Semantically Heterogeneous Data
      • INDUS: Information Integration and Knowledge Acquisition System
      • Summary and Work in Progress
    • 23. Learning classifiers from data
      • Standard learning algorithms assume centralized access to data
      • [Figure: Learning: labeled examples from the data are given to a learner, which outputs a classifier (hypothesis). Classification: the classifier assigns a class to each unlabeled example]
    • 24. Human and yeast protein training data
      • Columns (attributes/features/variables): PID | Source | Structural Classes | Sequence; class/label: GO Function
      • Rows (examples/instances/cases):
        • P39708 | Yeast | Mainly alpha, Alpha beta | VSSLPKESQA ELQLFQNEIN | GO 0016208: AMP binding
        • Q01574 | Yeast | Mainly alpha | STPFGLDLGN NNSVLAVARN | GO 0005515: protein binding
        • Q12797 | Human | Not Known | MAQRKNAKSS GNSSSSGSGS | GO 0004597: peptide-aspartate
        • P35626 | Human | Mainly beta, Few Secondary Structures | MADLEAVLAD VSYLMAMEKS | GO 0047696: beta-adrenergic-receptor kinase activity
    • 25. Probabilistic models for protein function classification
      • Training data (PID | Sequence | GO Function):
        • P39708 | VSSLPKESQA ELQLFQNEIN | GO 0016208: AMP binding
        • Q01574 | STPFGLDLGN NNSVLAVARN | GO 0005515: protein binding
        • Q12797 | MAQRKNAKSS GNSSSSGSGS | GO 0004597: peptide-aspartate
        • P35626 | MADLEAVLAD VSYLMAMEKS | GO 0047696: beta-adrenergic-receptor kinase activity
      • Naïve Bayes Algorithm
        • Very simple algorithm, works surprisingly well in practice
        • Treats every sequence S as a “bag” of amino acids A_1, …, A_n
        • “Gold standard” for evaluating other methods
      • Most probable class c(S) of a sequence S is: c(S) = argmax_{c_j} P(c_j) ∏_{i=1..n} P(A_i | c_j)
    • 26. Learning classifiers from data revisited
      • Learning = Information extraction + Hypothesis generation
      • Information extraction = Sufficient statistics gathering
      • [Figure: the learner, holding a partial hypothesis h_i, formulates a statistical query s(D, h_i -> h_{i+1}) against the data D; the answer s(D, h_i -> h_{i+1}) feeds hypothesis generation, h_{i+1} ← R(h_i, s(D, h_i -> h_{i+1}))]
    • 27. Sufficient statistics for learning classifiers
      • A statistic s(D) is called a sufficient statistic for a parameter θ if s(D) captures all the information about θ contained in the data D. We are interested in minimal sufficient statistics [Casella and Berger, 2001].
      • A statistic s_L(D) is called a sufficient statistic for learning a hypothesis h using a learning algorithm L applied to a data set D if there exists an algorithm that takes s_L(D) as input and outputs h [Caragea et al., 2004a].
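    • 28. Naïve Bayes learning as information gathering and hypothesis generation
      • Sufficient statistics: count(AminoAcid, Class) and count(Class)
      • Naïve Bayes class: c(S) = argmax_{c_j} P(c_j) ∏_i P(a_i | c_j)
      • [Figure: for each amino acid a_i and each class c_j, the Naïve Bayes learner requests counts from the query answering engine, which answers from the data with Counts(a_i | c_j) and Counts(c_j); the learner then computes P(c_j) and P(a_i | c_j)]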
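The split between gathering counts and computing probabilities can be sketched in a few lines. This is an illustrative sketch rather than the authors' implementation: the function names, the 20-letter amino acid alphabet and the Laplace smoothing are assumptions, and the learner only ever sees the two count tables, never the raw sequences.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # assumed 20-letter amino acid alphabet


def sufficient_statistics(examples):
    """Information gathering: count(Class) and count(AminoAcid, Class)."""
    class_counts = Counter()
    aa_counts = Counter()
    for seq, label in examples:
        class_counts[label] += 1
        for aa in seq:
            aa_counts[(aa, label)] += 1
    return class_counts, aa_counts


def classify(seq, class_counts, aa_counts, alphabet=ALPHABET):
    """Hypothesis use: most probable class, with Laplace smoothing (an assumption)."""
    total = sum(class_counts.values())
    best, best_score = None, -math.inf
    for c, n_c in class_counts.items():
        n_aa = sum(aa_counts[(a, c)] for a in alphabet)
        score = math.log(n_c / total)  # log P(c)
        for aa in seq:
            # log P(a | c), smoothed so unseen amino acids do not zero the product
            score += math.log((aa_counts[(aa, c)] + 1) / (n_aa + len(alphabet)))
        if score > best_score:
            best, best_score = c, score
    return best


# Toy training data, invented for illustration only.
stats = sufficient_statistics([("KKKK", "kinase"), ("GGGG", "binding")])
print(classify("KKG", *stats))
```

Working in log space avoids underflow on long sequences, a practical choice that the slide does not discuss.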
    • 29. Learning classifiers from distributed data
      • Information extraction from distributed data + Hypothesis generation
      • [Figure: the learner, holding a partial hypothesis h_i, formulates a statistical query s(D, h_i -> h_{i+1}); the query answering engine decomposes it into sub-queries q_1, q_2, …, q_K for the data sources D_1, D_2, …, D_K and composes their answers into s(D, h_i -> h_{i+1}); hypothesis generation then computes h_{i+1} ← R(h_i, s(D, h_i -> h_{i+1}))]
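For count queries, answer composition can be as simple as addition. The sketch below assumes the data are split horizontally, so that each source holds a disjoint set of complete examples; the slide does not state how the data are fragmented, and the names `local_counts` and `compose` are hypothetical.

```python
from collections import Counter


def local_counts(examples):
    """Answer the count query on one data source."""
    counts = Counter()
    for seq, label in examples:
        counts[("class", label)] += 1   # count(Class)
        for aa in seq:
            counts[(aa, label)] += 1    # count(AminoAcid, Class)
    return counts


def compose(answers):
    """Answer composition: counts add up across horizontally split sources."""
    total = Counter()
    for answer in answers:
        total.update(answer)
    return total


# Toy sources, invented for illustration only.
D1 = [("KKG", "kinase")]
D2 = [("GGK", "binding"), ("KK", "kinase")]
combined = compose([local_counts(D1), local_counts(D2)])
print(combined == local_counts(D1 + D2))
```

Under the horizontal-split assumption the composed counts equal the counts over the pooled data, so a learner driven by them would arrive at the same hypothesis as one with centralized access.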
    • 30. Learning classifiers from semantically heterogeneous data sources
      • [Figure: the learner, holding a partial hypothesis h_i, formulates a statistical query s(D, h_i) in terms of the user ontology O; using the mappings M(O_1 … O_K, O) from O_1 … O_K to O, the query is decomposed into sub-queries q_1, q_2, …, q_K for the ontology-extended data sources (D_1, O_1), (D_2, O_2), …, (D_K, O_K), and their answers are composed into s(D, h_i); hypothesis generation then computes h_{i+1} ← R(h_i, s(D, h_i))]
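With heterogeneous sources, each local answer has to be expressed in the user ontology before answers can be composed. The sketch below shows this for a count(Class) query. The EC entry in `TO_USER` is the mapping from slide 17, while the MIPS entry, the policy of skipping unmapped values, and all function names are assumptions made for illustration.

```python
from collections import Counter

# Ontology-level mappings into the user ontology. The EC entry is the mapping
# shown on slide 17; the MIPS entry is an assumed illustration.
TO_USER = {
    "D1": {"EC 2.7.1.126": "GO 0047696"},
    "D2": {"16.01": "GO 0005515"},
}


def class_counts(source, labels):
    """Answer count(Class) on one source, translated into user-ontology terms."""
    mapping = TO_USER[source]
    # Values with no mapping are skipped here; this policy is an assumption.
    return Counter(mapping[label] for label in labels if label in mapping)


def answer(labels_by_source):
    """Decompose the query per source, then compose the translated counts."""
    total = Counter()
    for source, labels in labels_by_source.items():
        total.update(class_counts(source, labels))
    return total


result = answer({
    "D1": ["EC 2.7.1.126", "EC 2.7.1.126"],
    "D2": ["16.01", "99.99"],
})
print(result)
```

Only equality mappings are handled in this sketch. Values that map to a more general or more specific user term lead to partially specified data, which needs separate treatment.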
    • 31. Outline
      • Ontology-Based Information Integration
      • Learning Classifiers from Semantically Heterogeneous Data
      • INDUS: Information Integration and Knowledge Acquisition System
      • Summary and Work in Progress
    • 32. Ontology-based information integration in INDUS
    • 33. Capabilities of INDUS
      • INDUS provides support for:
      • Specification and update of schemas and ontologies
      • Specification of mappings between ontologies
      • Registration of new data sources
      • Specification of user views
      • Specification and execution of queries across distributed, semantically heterogeneous data sources
        • Learning classifiers from semantically heterogeneous data
    • 34. INDUS Tools
      • Ontology Editor for specifying or modifying ontologies
      • Schema Editor for specifying or modifying data source schemas
      • Mapping Editor for specifying mappings between ontologies and between schemas
      • Data Editor for registering data sources with INDUS
      • View Editor for defining user views
      • Query Interface for formulating queries and displaying results
    • 35. INDUS Users: Domain Ontologists
      • A domain ontologist can:
      • Specify or update ontologies
      • Specify or update schemas
      • Specify or update mappings between ontologies
      • Specify or update mappings between schemas
    • 36. INDUS Users: Data Providers
      • A data provider can:
      • Associate a predefined schema and ontology with a data source
      • Specify data source location, type and access procedures
      • Register a data source
      • Act as a domain ontologist
    • 37. INDUS Users: Domain Experts
      • A domain expert can specify an application view, i.e.,
      • Select data sources of interest in an application domain
      • Select an application specific schema
      • Select an application specific ontology
      • Select relevant mappings
      • A domain expert can serve as
      • Domain ontologist
      • Data provider
    • 38. INDUS Users: Domain Scientists
      • A domain scientist can
      • Select an application view
      • Formulate and execute queries
      • A domain scientist can act as
      • Domain ontologist
      • Data provider
      • Domain expert
    • 39. INDUS
      • Some features of INDUS
      • Clear distinction between structure and semantics of data
      • Data integration from a user perspective - User-specifiable ontologies and mappings (no single global ontology)
      • Data integration on the fly
      • Semantic integrity of queries ensured by means of semantics preserving mappings
    • 40. Related work
      • Information integration : [Sheth and Larson, 1990; Davidson et al. , 2001; Eckman, 2003; Levy, 1998]
      • Biological data integration : SRS [Etzold et al., 2003], K2 [Tannen et al., 2003], Kleisli [Chen et al., 2003], IBM’s DiscoveryLink [Haas et al., 2001], TAMBIS [Stevens et al., 2003], BioMediator [Shaker et al., 2004], etc.
      • Ontology and mappings editors : Protégé [Noy et al., 2000], Clio [Eckman et al., 2002], DAG-Edit etc.
      • Ontology-extended relational algebra : [Bonatti et al. , 2003]
    • 41. Outline
      • Ontology-Based Information Integration
      • Learning Classifiers from Semantically Heterogeneous Data
      • INDUS: Information Integration and Knowledge Acquisition System
      • Summary and Work in Progress
    • 42. Summary
    • 43. Work in progress
      • Ontologies and mappings
      • Support for more expressive ontologies (beyond hierarchies) [Bao et al., 2005]
      • Support for interactive specification of mappings between ontologies, including automated generation of candidate mappings
      • Support for modular ontologies and mappings [Bao and Honavar, 2004]
      • Scalability: efficient mechanisms for storage, manipulation, retrieval and use of large ontologies and mappings
      • More powerful reasoning to ensure the semantic integrity of mappings
      • Support for import, export, and sharing of ontologies and mappings (e.g. OBO and OWL)
    • 44. Work in progress
      • Query Processing
      • Query optimization under access, bandwidth and computational constraints
      • Implementation of data retrieval procedures (iterators) for widely used bioinformatics data sources
      • Support for data caching and data sharing
    • 45. Work in progress
      • Knowledge Acquisition
      • Support for learning classifiers and other predictive models from semantically heterogeneous data [Caragea et al., 2005]
      • Support for statistical queries, including queries over partially specified data [Caragea et al., 2004]
      • Support for annotating and sharing results of knowledge acquisition
    • 46. Work in progress
      • Applications in bioinformatics - data driven discovery of macromolecular sequence-structure-function relationships
        • Prediction of protein function [Andorf et al., 2004]
        • Prediction of protein-protein, protein-DNA and protein-RNA interfaces [Yan et al., 2004]
        • Analysis, visualization, and interpretation of gene expression data [Caragea et al., 2005]
        • Modeling and discovery of gene regulatory networks
      • Usability studies
      • Design of better user interfaces
      • Performance evaluation
    • 47. http://www.cild.iastate.edu/software/indus.html
