
INDUS: A System for Information Integration and Knowledge Acquisition from Autonomous, Distributed and Semantically Heterogeneous Data Sources

  • INDUS is a federated, query-centric approach to knowledge acquisition from distributed, semantically heterogeneous, autonomous data sources. Learning algorithms that can be decomposed into information gathering (by answering queries) and hypothesis generation can be easily linked to INDUS. INDUS makes it possible for scientists or institutions working on related problems (e.g., in bioinformatics) to exchange data and findings.
  • Design that is tailored for predictive model building using machine learning algorithms from distributed, semantically heterogeneous, autonomous data sources

    1. 1. INDUS: A System for Information Integration and Knowledge Acquisition from Autonomous, Distributed and Semantically Heterogeneous Data Sources Jie Bao, Doina Caragea, Jyotishman Pathak, Adrian Silvescu, Carson Andorf, Changhui Yan, Drena Dobbs and Vasant Honavar June 28, 2005
    2. 2. Background and Motivation <ul><li>Transformation of biology from a data-poor science into a data-rich science </li></ul><ul><li>Proliferation of autonomous, semantically heterogeneous, distributed data sources (more than 500 data repositories of interest to molecular biologists alone) </li></ul><ul><li>Needed: Software tools for knowledge acquisition from semantically heterogeneous distributed data sources </li></ul>Example data sources: InterPro, MIPS, Swissprot
    3. 3. Outline <ul><li>INDUS Information Integration System </li></ul><ul><li>INDUS Tools: Technical Details and Demo </li></ul><ul><li>Summary and Work in Progress </li></ul>
    4. 4. Ontology-based information integration in INDUS
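    5. 5. Semantically Heterogeneous Data Sources
       D 1: <table>
       <tr><th>Protein ID</th><th>Protein Name</th><th>Protein Sequence</th><th>Prosite Motifs</th><th>EC Number</th></tr>
       <tr><td>P35626</td><td>Beta-adrenergic receptor kinase 2</td><td>MADLEAVLAD VSYLMAMEKS …</td><td>RGS PROT_KIN_DOM PH_DOMAIN</td><td>2.7.1.126 Beta-adrenergic receptor kinase</td></tr>
       <tr><td>Q12797</td><td>Aspartyl/asparaginyl beta-hydroxylase</td><td>MAQRKNAKSS GNSSSSGSGS …</td><td>TPR TPR_REGION TPR</td><td>1.14.11.16 Peptide-aspartate beta-dioxygenase</td></tr>
       </table>
       D 2: <table>
       <tr><th>Accession Number AN</th><th>Gene</th><th>AA Sequence</th><th>Length</th><th>MIPS Funcat</th><th>Pfam Domains</th></tr>
       <tr><td>P32589</td><td>SSE1</td><td>STPFGLDLGN NNSVLAVARN …</td><td>692</td><td>16.01 protein binding</td><td>HSP70</td></tr>
       <tr><td>P07278</td><td>BCY1</td><td>VSSLPKESQA ELQLFQNEIN …</td><td>415</td><td>16.19.01 cyclic nucleotide binding (cAMP, cGMP, etc.)</td><td>RIIa</td></tr>
       </table>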
    6. 6. Capabilities of INDUS <ul><li>INDUS provides support for: </li></ul><ul><li>Specification and update of schemas and ontologies </li></ul><ul><li>Specification of mappings between ontologies </li></ul><ul><li>Registration of new data sources </li></ul><ul><li>Specification of user views </li></ul><ul><li>Specification and execution of queries across distributed, semantically heterogeneous data sources </li></ul>
    7. 7. INDUS Tools <ul><li>Ontology Editor for specifying or modifying ontologies </li></ul><ul><li>Schema Editor for specifying or modifying data source schemas </li></ul><ul><li>Mapping Editor for specifying mappings between ontologies and between schemas </li></ul><ul><li>Data Editor for registering data sources with INDUS </li></ul><ul><li>View Editor for defining user views </li></ul><ul><li>Query Interface for formulating queries and displaying results </li></ul>
    8. 8. INDUS Users: Domain Ontologists <ul><li>A domain ontologist can: </li></ul><ul><li>Specify or update ontologies </li></ul><ul><li>Specify or update schemas </li></ul><ul><li>Specify or update mappings between ontologies </li></ul><ul><li>Specify or update mappings between schemas </li></ul>
    9. 9. INDUS Users: Data Providers <ul><li>A data provider can: </li></ul><ul><li>Associate a predefined schema and ontology with a data source </li></ul><ul><li>Specify data source location, type and access procedures </li></ul><ul><li>Register a data source </li></ul><ul><li>Act as a domain ontologist </li></ul>
    10. 10. INDUS Users: Domain Experts <ul><li>A domain expert can specify an application view, i.e., </li></ul><ul><li>Select data sources of interest in an application domain </li></ul><ul><li>Select an application specific schema </li></ul><ul><li>Select an application specific ontology </li></ul><ul><li>Select relevant mappings </li></ul><ul><li>A domain expert can serve as </li></ul><ul><li>Domain ontologist </li></ul><ul><li>Data provider </li></ul>
    11. 11. INDUS Users: Domain Scientists <ul><li>A domain scientist can </li></ul><ul><li>Select an application view </li></ul><ul><li>Formulate and execute queries </li></ul><ul><li>A domain scientist can act as </li></ul><ul><li>Domain ontologist </li></ul><ul><li>Data provider </li></ul><ul><li>Domain expert </li></ul>
    12. 12. Outline <ul><li>INDUS Information Integration System </li></ul><ul><li>INDUS Tools: Technical Details and Demo </li></ul><ul><li>Summary and Work in Progress </li></ul>
    13. 13. Semantically Heterogeneous Data Data sources need to be made self-describing by specifying the relevant meta data. Example: the D 1 and D 2 protein tables from slide 5 (D 1 with Protein ID, Protein Name, Protein Sequence, Prosite Motifs, EC Number; D 2 with Accession Number AN, Gene, AA Sequence, Length, MIPS Funcat, Pfam Domains).
    14. 14. Meta Data <ul><li>Schema – structure of data </li></ul><ul><li>Specification of the attributes of the data and their types </li></ul><ul><li>Ontology – conceptualization of semantics of data </li></ul><ul><li>Domains of attributes and relationships between values </li></ul>Schema for protein data in D 1: <ul><li>Protein ID: Swissprot ID </li><li>Protein Name: String </li><li>Protein Sequence: AA String </li><li>Prosite Motifs: Motifs </li><li>EC Number: EC Hierarchy </li></ul>
    15. 15. Attribute value hierarchy An attribute value hierarchy (AVH) is a partial-order ontology over the values of an attribute of the data. Example: MIPS Funcat Hierarchy
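As a rough illustration of the attribute value hierarchy described on this slide, the following minimal sketch stores a few MIPS Funcat codes as child-to-parent edges and answers `isa` questions over them. The codes 16.01 and 16.19.01 come from the slides; the intermediate codes 16 and 16.19, and the function names `ancestors` and `isa`, are assumptions made for the example and are not taken from INDUS itself.

```python
from typing import Dict, List, Optional

# Child -> parent edges of a small attribute value hierarchy (AVH).
# 16.01 and 16.19.01 appear in the slides; the intermediate codes 16 and
# 16.19 are ASSUMED from the dotted-prefix structure of the codes.
PARENT: Dict[str, Optional[str]] = {
    "16": None,
    "16.01": "16",
    "16.19": "16",
    "16.19.01": "16.19",
}


def ancestors(value: str) -> List[str]:
    """Return the ancestors of `value`, nearest first."""
    chain: List[str] = []
    parent = PARENT[value]
    while parent is not None:
        chain.append(parent)
        parent = PARENT[parent]
    return chain


def isa(value: str, concept: str) -> bool:
    """True if `value` equals `concept` or lies below it in the hierarchy."""
    return value == concept or concept in ancestors(value)
```

A tree is only the simplest case of a partial order; a hierarchy with multiple parents per value would need a set of parents per node instead of a single one.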
    16. 16. Making data sources self-describing - Ontology-extended data source Ontology-extended data source = Data + Schema + Ontology. Schema: <ul><li>Accession Number: MIPS ID </li><li>Gene: Gene ID </li><li>Length: Positive Integer </li><li>Prosite Motifs: Motifs </li><li>MIPS Funcat: MIPS Hierarchy </li></ul>Data: <ul><li>P32589; SSE1; STPFGLDLGN NNSVLAVARN; 692; 16.01 protein binding; HSP70 </li><li>P07278; BCY1; VSSLPKESQA ELQLFQNEIN; 415; 16.19.01 cyclic nucleotide binding (cAMP, cGMP, etc.); RIIa </li></ul>
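One way to picture an ontology-extended data source is as a record that bundles the data with its schema and the ontologies its attribute types refer to. The sketch below is an assumption-laden illustration, not the INDUS data model: the class name `OntologyExtendedDataSource`, its fields, and the helper `undeclared_attributes` are hypothetical, and the intermediate Funcat codes 16 and 16.19 are assumed.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class OntologyExtendedDataSource:
    """Hypothetical container: data + schema + ontology."""
    name: str
    schema: Dict[str, str]                  # attribute name -> type name
    ontologies: Dict[str, Dict[str, str]]   # type name -> {child: parent}
    rows: List[Dict[str, str]]

    def undeclared_attributes(self) -> List[str]:
        """Attributes used in the rows that the schema does not declare."""
        seen = {attr for row in self.rows for attr in row}
        return sorted(seen - set(self.schema))


d2 = OntologyExtendedDataSource(
    name="D2",
    schema={
        "Accession Number": "MIPS ID",
        "Gene": "Gene ID",
        "Length": "Positive Integer",
        "MIPS Funcat": "MIPS Hierarchy",
    },
    # Intermediate codes 16 and 16.19 are assumed for illustration.
    ontologies={"MIPS Hierarchy": {"16.01": "16", "16.19": "16", "16.19.01": "16.19"}},
    rows=[
        {"Accession Number": "P32589", "Gene": "SSE1", "Length": "692", "MIPS Funcat": "16.01"},
        {"Accession Number": "P07278", "Gene": "BCY1", "Length": "415", "MIPS Funcat": "16.19.01"},
    ],
)
```

Keeping the schema and ontology next to the rows is what makes the source self-describing: a consumer can check the data against the declared attributes without outside knowledge.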
    17. 17. INDUS: Ontology Editor
    18. 18. INDUS: Schema Editor
    19. 19. INDUS: Data Editor
    20. 20. User view A user view is given by: <ul><li>a set of ontology-extended data sources that are of interest to the user </li></ul><ul><li>a user schema and ontology (defining a virtual data source) </li></ul><ul><li>a set of mappings from data source schemas and ontologies to the user schema and ontology </li></ul>Example: data sources of interest: MIPS, Swissprot. User schema: <ul><li>PID: Swissprot ID </li><li>Source: Species String </li><li>Protein: AA String </li><li>Structural Class: SCOP </li><li>GO Function: GO Hierarchy </li></ul>
    21. 21. Mappings <ul><li>The interoperation between the schema and ontology associated with a data source and a user schema and ontology is facilitated by specifying mappings at: </li></ul><ul><ul><li>Schema Level: between attributes in different schemas </li></ul></ul><ul><ul><li>Ontology Level: between values of the attributes described in different ontologies </li></ul></ul><ul><li>The consistency of the set of mappings between data source schemas and ontologies and user schema and ontology can be checked using a reasoner </li></ul>
    22. 22. Mappings at schema level Schemas: <ul><li>D 1: Protein ID: Swissprot ID; Protein Name: String; Protein Sequence: AA String; Prosite Motifs: AA String; EC Number: EC Hierarchy </li><li>D 2: Accession No AN: MIPS ID; Gene: Gene ID; AA Sequence: AA String; Length: Pos Integer; MIPS Funcat: MIPS Hierarchy; Pfam Motifs: Motifs </li><li>D U: PID: Swissprot ID; Protein: AA String; GO Function: GO Hierarchy; Source: Species String </li></ul>
    23. 23. Mappings at schema level Mappings: <ul><li>Protein ID : D 1 ≡ PID : D U </li><li>Accession Number AN : D 2 ≡ PID : D U </li></ul>Schemas: <ul><li>D 1: Protein ID: Swissprot ID; Protein Name: String; Protein Sequence: AA String; Prosite Motifs: AA String; EC Number: EC Hierarchy </li><li>D 2: Accession No AN: MIPS ID; Gene: Gene ID; AA Sequence: AA String; Length: Pos Integer; MIPS Funcat: MIPS Hierarchy; Pfam Motifs: Motifs </li><li>D U: PID: Swissprot ID; Protein: AA String; GO Function: GO Hierarchy; Source: Species String </li></ul>
    24. 24. Mappings at schema level Mappings: <ul><li>Protein ID : D 1 ≡ PID : D U </li><li>Accession Number AN : D 2 ≡ PID : D U </li><li>Protein Sequence : D 1 ≡ AA Composition : D U </li><li>AA Sequence : D 2 ≡ AA Composition : D U </li></ul>Schemas: <ul><li>D 1: Protein ID: Swissprot ID; Protein Name: String; Protein Sequence: AA String; Prosite Motifs: AA String; EC Number: EC Hierarchy </li><li>D 2: Accession No AN: MIPS ID; Gene: Gene ID; AA Sequence: AA String; Length: Pos Integer; MIPS Funcat: MIPS Hierarchy; Pfam Motifs: Motifs </li><li>D U: PID: Swissprot ID; Protein: AA String; GO Function: GO Hierarchy; Source: Species String </li></ul>
    25. 25. Mappings at schema level Mappings: <ul><li>Protein ID : D 1 ≡ PID : D U </li><li>Accession Number AN : D 2 ≡ PID : D U </li><li>Protein Sequence : D 1 ≡ AA Composition : D U </li><li>AA Sequence : D 2 ≡ AA Composition : D U </li><li>EC Number : D 1 ≡ GO Function : D U </li><li>MIPS Funcat : D 2 ≡ GO Function : D U </li></ul>Schemas: <ul><li>D 1: Protein ID: Swissprot ID; Protein Name: String; Protein Sequence: AA String; Prosite Motifs: AA String; EC Number: EC Hierarchy </li><li>D 2: Accession No AN: MIPS ID; Gene: Gene ID; AA Sequence: AA String; Length: Pos Integer; MIPS Funcat: MIPS Hierarchy; Pfam Motifs: Motifs </li><li>D U: PID: Swissprot ID; Protein: AA String; GO Function: GO Hierarchy; Source: Species String </li></ul>
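The schema-level equivalences listed on this slide can be read as a renaming table from source attributes to user-view attributes. The following sketch applies them to individual rows; it is an illustration under assumptions, not the INDUS implementation. The dictionary `SCHEMA_MAPPINGS`, the function `to_user_view`, and the choice to drop unmapped attributes are all hypothetical.

```python
from typing import Dict, Tuple

# (source name, source attribute) -> user-view attribute, following the
# equivalences listed on the slide.
SCHEMA_MAPPINGS: Dict[Tuple[str, str], str] = {
    ("D1", "Protein ID"): "PID",
    ("D2", "Accession Number"): "PID",
    ("D1", "Protein Sequence"): "AA Composition",
    ("D2", "AA Sequence"): "AA Composition",
    ("D1", "EC Number"): "GO Function",
    ("D2", "MIPS Funcat"): "GO Function",
}


def to_user_view(source: str, row: Dict[str, str]) -> Dict[str, str]:
    """Rename mapped attributes; unmapped attributes are dropped (an assumption)."""
    return {
        SCHEMA_MAPPINGS[(source, attr)]: value
        for attr, value in row.items()
        if (source, attr) in SCHEMA_MAPPINGS
    }


d1_row = {
    "Protein ID": "P35626",
    "Protein Name": "Beta-adrenergic receptor kinase 2",
    "EC Number": "2.7.1.126",
}
d2_row = {"Accession Number": "P32589", "Gene": "SSE1", "MIPS Funcat": "16.01"}
```

Note that renaming alone leaves the values in the source vocabulary: after this step the `GO Function` column still holds an EC number or a Funcat code, which is why ontology-level mappings are needed as well.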
    26. 26. Mappings at ontology level D U D U D 1
    27. 27. Mappings at ontology level EC 2.7.1.126 : D 1 ≡ GO 0047696 : D U D U D 1
    28. 28. Mappings at ontology level D U EC 2.7.1 : D 1  GO 0047696 : D U D 1
    29. 29. Mappings at ontology level D 1 EC 2.7.1.126 : D 1  GO 0004672 : D U D U
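To make the role of ontology-level mappings concrete, here is a minimal sketch of how a value from D 1 might be translated into the user ontology and then tested against a user concept. Only the equivalence EC 2.7.1.126 ≡ GO 0047696 is taken from the slides; the edge placing GO 0047696 below GO 0004672 in the user hierarchy, and all function names, are assumptions for illustration.

```python
from typing import Dict, Optional

# Equivalence mapping from the slides: EC 2.7.1.126 : D1 ≡ GO 0047696 : DU.
EC_TO_GO: Dict[str, str] = {"2.7.1.126": "GO 0047696"}

# Child -> parent edges in the user's GO hierarchy.
# This single edge is an ASSUMPTION made for the example.
GO_PARENT: Dict[str, str] = {"GO 0047696": "GO 0004672"}


def translate(ec_number: str) -> Optional[str]:
    """Map a D1 value to the user ontology, or None if no mapping is known."""
    return EC_TO_GO.get(ec_number)


def go_isa(term: Optional[str], concept: str) -> bool:
    """Walk up the user hierarchy from `term` looking for `concept`."""
    while term is not None:
        if term == concept:
            return True
        term = GO_PARENT.get(term)
    return False


def ec_value_isa(ec_number: str, go_concept: str) -> bool:
    """Answer a user-level `isa` question about a source-level value."""
    term = translate(ec_number)
    return term is not None and go_isa(term, go_concept)
```

Values with no mapping are treated as not matching, which is a conservative choice; a real system might instead report them as unknown.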
    30. 30. INDUS: View Editor
    31. 31. INDUS: Mapping Editor
    32. 32. Sample Query <ul><li>Return ALL proteins whose GO function isa nucleotide binding </li></ul><ul><li>Return ALL proteins whose GO function isa kinase activity OR those that are involved in the GO process phosphate metabolism </li></ul>
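The two sample queries can be sketched as filters over rows that have already been translated into the user view. Everything in this example is hypothetical: the protein identifiers P1 to P3, the `GO Process` attribute, the two hierarchy edges, and the function names are invented for illustration and do not describe actual INDUS query syntax.

```python
from typing import Dict, List, Optional

# Hypothetical fragment of a GO function hierarchy (child -> parent).
GO_PARENT: Dict[str, str] = {
    "protein kinase activity": "kinase activity",
    "cyclic nucleotide binding": "nucleotide binding",
}


def isa(term: Optional[str], concept: str) -> bool:
    """True if `term` is `concept` or a descendant of it."""
    while term is not None:
        if term == concept:
            return True
        term = GO_PARENT.get(term)
    return False


# Hypothetical rows in the user view; the identifiers are placeholders.
proteins: List[Dict[str, str]] = [
    {"PID": "P1", "GO Function": "protein kinase activity", "GO Process": "signal transduction"},
    {"PID": "P2", "GO Function": "cyclic nucleotide binding", "GO Process": "phosphate metabolism"},
    {"PID": "P3", "GO Function": "protein binding", "GO Process": "protein folding"},
]


def nucleotide_binders(rows: List[Dict[str, str]]) -> List[str]:
    """Query 1: GO function isa nucleotide binding."""
    return [r["PID"] for r in rows if isa(r["GO Function"], "nucleotide binding")]


def kinases_or_phosphate(rows: List[Dict[str, str]]) -> List[str]:
    """Query 2: GO function isa kinase activity OR GO process is phosphate metabolism."""
    return [
        r["PID"]
        for r in rows
        if isa(r["GO Function"], "kinase activity")
        or r["GO Process"] == "phosphate metabolism"
    ]
```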
    33. 33. Query processing in INDUS Query Formulation
    34. 34. INDUS: Query Editor
    35. 35. INDUS <ul><li>Some features of INDUS </li></ul><ul><li>Clear distinction between structure and semantics of data </li></ul><ul><li>Data integration from a user perspective - User-specifiable ontologies and mappings (no single global ontology) </li></ul><ul><li>Data integration on the fly </li></ul><ul><li>Semantic integrity of queries ensured by means of semantics preserving mappings </li></ul>
    36. 36. Related work <ul><li>Information integration : [Sheth and Larson, 1990; Davidson et al., 2001; Eckman, 2003; Levy, 1998] </li></ul><ul><li>Biological data integration : SRS [Etzold et al., 2003], K2 [Tannen et al., 2003], Kleisli [Chen et al., 2003], IBM’s DiscoveryLink [Haas et al., 2001], TAMBIS [Stevens et al., 2003], BioMediator [Shaker et al., 2004], etc. </li></ul><ul><li>Ontology and mapping editors : Protégé [Noy et al., 2000], Clio [Eckman et al., 2002], DAG-Edit, etc. </li></ul><ul><li>Ontology-extended relational algebra : [Bonatti et al., 2003] </li></ul>
    37. 37. Outline <ul><li>INDUS Information Integration System </li></ul><ul><li>INDUS Tools: Technical Details and Demo </li></ul><ul><li>Summary and Work in Progress </li></ul>
    38. 38. Summary
    39. 39. Work in progress <ul><li>Ontologies and mappings </li></ul><ul><li>Support for more expressive ontologies (beyond hierarchies) [Bao et al., 2005] </li></ul><ul><li>Support for interactive specification of mappings between ontologies, including automated generation of candidate mappings </li></ul><ul><li>Support for modular ontologies and mappings [Bao and Honavar, 2004] </li></ul><ul><li>Scalability: efficient mechanisms for storage, manipulation, retrieval and use of large ontologies and mappings </li></ul><ul><li>More powerful reasoning to ensure the semantic integrity of mappings </li></ul><ul><li>Support for import, export, and sharing of ontologies and mappings (e.g. OBO and OWL) </li></ul>
    40. 40. Work in progress <ul><li>Query Processing </li></ul><ul><li>Query optimization under access, bandwidth and computational constraints </li></ul><ul><li>Implementation of data retrieval procedures (iterators) for widely used bioinformatics data sources </li></ul><ul><li>Support for data caching and data sharing </li></ul>
    41. 41. Work in progress <ul><li>Knowledge Acquisition </li></ul><ul><li>Support for learning classifiers and other predictive models from semantically heterogeneous data [Caragea et al., 2005] </li></ul><ul><li>Support for statistical queries - including queries over partially specified data [Caragea et al., 2004] </li></ul><ul><li>Support for annotating and sharing results of knowledge acquisition </li></ul>
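The idea of answering statistical queries against several sources, rather than shipping the raw data, can be illustrated with simple counts. In this sketch each source reports how often an attribute value co-occurs with a class label, and the per-source counts are summed. The function names and the tiny data sets are hypothetical, and the sketch assumes all values have already been translated into the user ontology.

```python
from collections import Counter
from typing import Dict, Iterable, List

Row = Dict[str, str]


def local_counts(rows: Iterable[Row], attribute: str, label: str) -> Counter:
    """Count (attribute value, class label) pairs within one data source."""
    return Counter((row[attribute], row[label]) for row in rows)


def combined_counts(sources: List[List[Row]], attribute: str, label: str) -> Counter:
    """Sum the per-source counts; only the counts leave each source."""
    total: Counter = Counter()
    for rows in sources:
        total.update(local_counts(rows, attribute, label))
    return total


# Hypothetical rows, already expressed in the user view.
source_a = [
    {"GO Function": "kinase activity", "Class": "enzyme"},
    {"GO Function": "protein binding", "Class": "chaperone"},
]
source_b = [
    {"GO Function": "kinase activity", "Class": "enzyme"},
]

counts = combined_counts([source_a, source_b], "GO Function", "Class")
```

Counts of this kind are enough to train count-based learners such as Naive Bayes, which is one reason a learner that separates information gathering from hypothesis generation fits a query-centric system.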
    42. 42. Work in progress <ul><li>Applications in bioinformatics - data-driven discovery of macromolecular sequence-structure-function relationships </li></ul><ul><ul><li>Prediction of protein function [Andorf et al., 2004] </li></ul></ul><ul><ul><li>Prediction of protein-protein, protein-DNA and protein-RNA interfaces [Yan et al., 2004] </li></ul></ul><ul><ul><li>Analysis, visualization, and interpretation of gene expression data [Caragea et al., 2005] </li></ul></ul><ul><ul><li>Modeling and discovery of gene regulatory networks </li></ul></ul><ul><li>Usability studies </li></ul><ul><li>Design of better user interfaces </li></ul><ul><li>Performance evaluation </li></ul>
    43. 43. Relevant Publications <ul><li>Caragea, D., Pathak, J., Bao, J., Silvescu, A., Andorf, C., Dobbs, D. and Honavar, V. (2005). Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources. In: Proceedings of the 2nd International Workshop on Data Integration in Life Sciences (DILS'05), San Diego, CA. </li></ul><ul><li>Caragea, D., Pathak, J., and Honavar, V. (2004). Learning Classifiers from Semantically Heterogeneous Data. In: Proceedings of the Third International Conference on Ontologies, DataBases and Applications of Semantics for Large Scale Information Systems (ODBASE’04), October 25-29, 2004, Agia Napa, Cyprus. </li></ul><ul><li>Caragea, D., Silvescu, A., and Honavar, V. (2004). A Framework for Learning from Distributed Data Using Sufficient Statistics and its Application to Learning Decision Trees. International Journal of Hybrid Intelligent Systems. Vol. 1, No. 2. Invited Paper. </li></ul>
    44. 44. http://www.cs.iastate.edu/~dcaragea/indus.html
