
XMLPipeDB


Title: XMLPipeDB: A Reusable, Open Source Tool Chain for Building Relational Databases from XML Sources
Author: Kam D. Dahlquist

Published in: Technology

XMLPipeDB

  1. 1. A Reusable, Open Source Tool Chain for Building Relational Databases from XML Sources
       BOSC, Vienna, Austria, July 20, 2007
       Kam D. Dahlquist, Department of Biology
       Jeffrey Nicholas, John David N. Dionisio, Department of Electrical Engineering & Computer Science
       Loyola Marymount University
       http://xmlpipedb.cs.lmu.edu
  2. 2. Outline
       • Motivation
         -- GenMAPP
         -- Project requirements
       • XMLPipeDB Implementation
         -- XSD-to-DB
         -- UniProtDB and GODB
         -- XMLPipeDB Utilities
         -- GenMAPP Builder
       • Wish List for Development
  3. 3. Outline
       • Motivation
         -- GenMAPP
         -- Project requirements
       • XMLPipeDB Implementation
         -- XSD-to-DB
         -- UniProtDB and GODB
         -- XMLPipeDB Utilities
         -- GenMAPP Builder
       • Wish List for Development
  4. 4. How GenMAPP Works (http://www.GenMAPP.org)
       • Graphics tools make MAPPs that store gene IDs and vector coordinates for all graphical objects
       • Separate Expression Dataset files store data and color-coding instructions
       • Gene Databases store IDs, annotation, and hyperlinks to public gene and protein databases
       • Stand-alone program implemented in Visual Basic; accessory files are Microsoft Access databases
  5. 5. GenMAPP Design and Implementation
       • Stand-alone program implemented in Visual Basic
         -- program and source code available at www.GenMAPP.org
       • Back-end is Microsoft Jet Database Engine
       • Three types of databases used in program:
         -- MAPP database stores gene IDs and vector coordinates for all graphical objects
         -- Expression Datasets store gene IDs, DNA microarray data, and color-coding instructions
         -- Gene Database stores gene IDs and annotation from major public databases
       • All databases are specific to a single species; users match appropriate MAPPs, Expression Datasets, and Gene Databases when using the program
  6. 7. MAPPFinder Determines Which GO Terms Are Overrepresented in a GenMAPP Expression Dataset
  7. 8. Maintaining and Updating GenMAPP Gene Databases Has Been a Bottleneck for Development
       • Microarrays use different gene ID systems for annotation; users want as much information as possible.
       • We need to capture and reliably relate gene data from different sources and keep the data updated.
       • Gene Database design is data-driven; it tells GenMAPP what gene ID systems and relationships are present.
       • Current GenMAPP Gene Databases are built from Ensembl as the main data source.
         -- limited to (mostly) animal species
         -- sensitive to changes in flat file formats
  8. 9. XMLPipeDB: A Reusable, Open Source Tool Chain for Building Relational Databases from XML Sources
       Requirements:
       • to create Gene Databases for other species (bacteria/plants) using UniProt as the main data source
       • to be robust to changes in source file formats
       • to use XML sources wherever possible
       • to take advantage of existing open source tools
       • to limit the manual manipulation of the data
       The first task, reported last year at BOSC, was to build a GenMAPP Gene Database for Escherichia coli K12.
  9. 10. GenMAPP Gene Database Schema for Escherichia coli K12
  10. 11. Data Sources Required for a “Minimal” GenMAPP Gene Database
       UniProt
         • UniProt complete proteome sets for many species are made available as XML downloads by the Integr8 resource
       Gene Ontology
         • OBO XML format
       UniProt to GO associations
         • GOA downloads also available at Integr8
  11. 12. XMLPipeDB Use Case Diagram
  12. 13. XMLPipeDB Use Case Diagram
  13. 14. Produces:
       • Java source code
       • SQL DDL file
       • Hibernate mappings
       • Apache Ant build.xml
  14. 15. XMLPipeDB Use Case Diagram
  15. 16. UniProtDB and GODB Required Only Nominal Post-processing
       • Naming: XSD or DTD definitions might use names that are SQL reserved words and thus cannot be used as table or attribute names
         -- In UniProtDB, “end” was renamed to “endPosition”
         -- In GODB, “to” was renamed to “to_”
       • Datatypes: Some XSD datatypes are not easily supported in SQL
         -- In UniProtDB, the definition for citationType was changed from month/year to string
         -- Some definitions were changed from SQL varchar(255) to varchar(unspecified length)
       • Schema diagram automatically generated with third party tool
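The naming step on the slide above can be illustrated with a short Java sketch. It is not the XMLPipeDB code: the class name `ReservedWordRenamer`, the method `safeName`, and the small reserved-word set are assumptions made for illustration, and the fallback rule (append an underscore) is simply modeled on the “to” to “to_” rename mentioned on the slide.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of XMLPipeDB: maps schema names that collide
// with SQL reserved words to safe table or attribute names.
final class ReservedWordRenamer {

    // Illustrative subset of SQL reserved words; a real list is much longer.
    private static final Set<String> RESERVED =
            Set.of("end", "to", "from", "select", "table", "order", "group");

    // Hand-picked replacements, mirroring the UniProtDB rename on the slide.
    private static final Map<String, String> OVERRIDES = Map.of("end", "endPosition");

    // Returns the name unchanged unless it is a reserved word.
    public static String safeName(String name) {
        String key = name.toLowerCase(Locale.ROOT);
        if (!RESERVED.contains(key)) {
            return name;
        }
        String override = OVERRIDES.get(key);
        // Fall back to appending an underscore, as GODB did for "to".
        return override != null ? override : name + "_";
    }

    public static void main(String[] args) {
        for (String name : new String[] {"end", "to", "accession"}) {
            System.out.println(name + " -> " + safeName(name));
        }
    }
}
```

Keeping explicit overrides separate from the generic underscore rule lets a readable name such as “endPosition” be chosen where one exists, while still guaranteeing that no reserved word reaches the generated DDL.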
  16. 17. XMLPipeDB Use Case Diagram
  17. 18. XMLPipeDB Utilities Library is a Suite of Java Classes that Provide Functions Common to Most XMLPipeDB Database Applications (“Rule of Three”)
       • Loading of XML files into Java objects
       • Saving XML-derived Java objects to a relational database
       • Rudimentary query and retrieval of Java objects from the relational database
         -- HQL (Hibernate Query Language), SQL query
         -- object browser that shows results of query
       • Configuring a client application to communicate with a relational database
  18. 19. XMLPipeDB Use Case Diagram
  19. 20. GenMAPP Builder Interacts with PostgreSQL in Three Ways
  20. 21. GenMAPP Builder Uses the XMLPipeDB Utilities Library to Configure the PostgreSQL Database … and import XML
  21. 22. GenMAPP Builder Has Customized Profiles
       • for each Primary Data Source (currently only UniProt)
       • for each species (currently Escherichia coli K12 and Arabidopsis thaliana) based on TaxonID
  22. 23. The User Chooses Which Gene ID Systems and Relations to Export to the Gene Database
  23. 24. GenMAPP Gene Database for Escherichia coli K12 Was the First Milestone for XMLPipeDB
       • Loading the XML files into the PostgreSQL database took approximately 20 minutes
         -- UniProt XML (44 MB)
         -- GO XML (13 MB)
       • Export of the Gene Database took approximately 2 hours
       • Data integrity was checked by hand
         -- all 4329 records from UniProt were successfully exported to the Gene Database
         -- our Gene Database is missing 219 Blattner IDs
         -- the missing IDs were not present in the UniProt XML
              157 RNA genes
                1 origin of replication
               51 protein coding sequences
               10 no feature designation
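The hand check described on this slide amounts to a set comparison between a reference list of IDs and the IDs that reached the exported Gene Database. The Java sketch below shows that comparison under stated assumptions: the class `IdTally`, the method `missing`, and the three sample IDs are illustrative only and are not taken from GenMAPP Builder. The last line of `main` re-adds the four categories of missing IDs listed on the slide.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of GenMAPP Builder: reports which reference
// IDs did not make it into the exported database.
final class IdTally {

    // Returns the IDs present in the reference set but absent from the export, sorted.
    public static Set<String> missing(Set<String> reference, Set<String> exported) {
        Set<String> diff = new TreeSet<>(reference);
        diff.removeAll(exported);
        return diff;
    }

    public static void main(String[] args) {
        // Illustrative sample values only; a real check would read both lists from files.
        Set<String> reference = Set.of("b0001", "b0002", "b0003");
        Set<String> exported = Set.of("b0001", "b0003");
        System.out.println("Missing from export: " + missing(reference, exported));

        // Category counts from the slide: RNA genes, origin of replication,
        // protein coding sequences, no feature designation.
        int total = 157 + 1 + 51 + 10;
        System.out.println("Missing IDs accounted for by category: " + total);
    }
}
```

The four categories on the slide sum to 219, which matches the number of missing Blattner IDs reported.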
  24. 25. The Next Challenge was to Create a Gene Database for the Plant, Arabidopsis thaliana
  25. 26. The Next Challenge was to Create a Gene Database for the Plant, Arabidopsis thaliana
       • Creating an Arabidopsis Gene Database required refactoring the code
       • The Arabidopsis UniProt proteome set has 34,566 proteins
         -- UniProt XML file is 237 MB
         -- order of magnitude larger than Escherichia coli
       • GenMAPP Builder initially failed to import the large XML file
         -- the file had to be broken into smaller individual files programmatically
       • Export of the Gene Database requires 2 GB RAM and takes > 7 hours
       • While we captured all but 51 of the E. coli protein-encoding genes, we are missing 1173 protein-encoding genes in TAIR
         -- mapping between TAIR and UniProt seems poor; TAIR IDs appear in many different fields in XML
       • Have implemented crude data integrity checks that tally IDs in XML for comparison with exported database; more are needed
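Breaking a large XML file into smaller pieces, as this slide describes, can be done with a streaming parser so that the whole document is never held in memory. The sketch below uses the standard Java StAX API and is not the code the XMLPipeDB team wrote. The class `XmlSplitter`, the `<chunk>` wrapper element, and the string-based input and output are assumptions chosen to keep the example self-contained; a real splitter would read from and write to files. It also does not copy namespace declarations from the original root element, so input that declares a default namespace on its root would need that declaration added to the wrapper.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical splitter, not part of XMLPipeDB: streams through an XML document
// and groups its entry elements into smaller well-formed documents.
final class XmlSplitter {

    // Wraps serialized entries in an assumed <chunk> root element.
    private static String wrap(List<String> entries) {
        return "<chunk>" + String.join("", entries) + "</chunk>";
    }

    // Splits elements named entryName into chunks of at most chunkSize entries each.
    public static List<String> split(String xml, String entryName, int chunkSize) {
        if (chunkSize < 1) {
            throw new IllegalArgumentException("chunkSize must be at least 1");
        }
        List<String> chunks = new ArrayList<>();
        List<String> current = new ArrayList<>();
        try {
            XMLEventReader reader =
                    XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
            XMLOutputFactory outputFactory = XMLOutputFactory.newInstance();
            StringWriter buffer = null;
            XMLEventWriter writer = null;
            int depth = 0; // element nesting inside the current entry; 0 means outside any entry

            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (depth == 0) {
                    boolean startsEntry = event.isStartElement()
                            && event.asStartElement().getName().getLocalPart().equals(entryName);
                    if (!startsEntry) {
                        continue; // skip the root element, whitespace, and the document prolog
                    }
                    buffer = new StringWriter();
                    writer = outputFactory.createXMLEventWriter(buffer);
                }
                writer.add(event);
                if (event.isStartElement()) {
                    depth++;
                } else if (event.isEndElement()) {
                    depth--;
                }
                if (depth == 0) { // the entry just closed
                    writer.close();
                    current.add(buffer.toString());
                    if (current.size() == chunkSize) {
                        chunks.add(wrap(current));
                        current = new ArrayList<>();
                    }
                }
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not split XML", e);
        }
        if (!current.isEmpty()) {
            chunks.add(wrap(current)); // final, possibly smaller chunk
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Made-up input for illustration only.
        String xml = "<root><entry id=\"a\">first</entry>"
                + "<entry id=\"b\">second</entry><entry id=\"c\">third</entry></root>";
        List<String> chunks = split(xml, "entry", 2);
        System.out.println(chunks.size() + " chunks");
        chunks.forEach(System.out::println);
    }
}
```

Counting entries rather than bytes keeps every output document well-formed, since a chunk boundary can only fall between complete entries.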
  26. 27. Wish List for XMLPipeDB Development
       • Additional XML data sources
       • Additional data management features for XPDutils
         -- shift loading of GOA file to import process
         -- add/delete species data
       • Automate data integrity checks
       • GUI needs substantial work
  27. 28. XSD-to-DB: Adam Carasso, Jeffrey Nicholas, Scott Spicer
       XMLPipeDBUtils: David Hoffman, Babak Naffas, Jeffrey Nicholas, Ryan Nakamoto
       UniProtDB: Joe Boyle, Joey Barrett
       GODB: Scott Spicer, Roberto Ruiz
       GenMAPP Builder: Joey Barrett, Jeffrey Nicholas, Scott Spicer
       Special Thanks: GenMAPP.org Development Group; Caskey L. Dickson, Wesley T. Citti; NSF CCLI Program (http://recourse.cs.lmu.edu)
       http://xmlpipedb.cs.lmu.edu
       LMU Bioinformatics Group
       Kam D. Dahlquist, http://myweb.lmu.edu/kdahqui [email_address]
       John David N. Dionisio, http://myweb.lmu.edu/dondi [email_address]
