Title: XMLPipeDB: A Reusable, Open Source Tool Chain for Building Relational Databases from XML Sources
Author: Kam Dahlquist

Published in: Technology

  • We are both relatively new to LMU. Dondi’s background is in medical informatics, data visualization, and person-computer interaction. During my postdoc I served as project manager for GenMAPP, and I want to extend its features, especially to other species. I am not a software developer (the last computer science class I took was AP Pascal in high school), but I have a lot of experience interacting with developers. I’m proud of GenMAPP, especially that it is user-friendly for biologists and relatively bug-free (a result of my extensive testing). However, I never would have stood up in this community to talk about it: although we believed strongly that GenMAPP should be free of charge, we were slow to make the source code available (it is now on SourceForge). Only through my collaboration with Dondi have I learned what open source software development truly means (The Cathedral and the Bazaar). This is the perfect forum for talking about our work because, while I am using the fruits of XMLPipeDB for GenMAPP as first imagined, we designed the project so that its components are reusable for other purposes, and the bioinformatics developer community is our target audience.
  • XMLPipeDB

    1. 1. A Reusable, Open Source Tool Chain for Building Relational Databases from XML Sources
        BOSC, Vienna, Austria, July 20, 2007
        Kam D. Dahlquist, Department of Biology
        Jeffrey Nicholas and John David N. Dionisio, Department of Electrical Engineering & Computer Science
        Loyola Marymount University
        http://xmlpipedb.cs.lmu.edu
    2. 2. Outline
        • Motivation
          -- GenMAPP
          -- Project requirements
        • XMLPipeDB Implementation
          -- XSD-to-DB
          -- UniProtDB and GODB
          -- XMLPipeDB Utilities
          -- GenMAPP Builder
        • Wish List for Development
    3. 3. Outline
        • Motivation
          -- GenMAPP
          -- Project requirements
        • XMLPipeDB Implementation
          -- XSD-to-DB
          -- UniProtDB and GODB
          -- XMLPipeDB Utilities
          -- GenMAPP Builder
        • Wish List for Development
    4. 4. How GenMAPP Works (http://www.GenMAPP.org)
        • Graphics tools make MAPPs that store gene IDs and vector coordinates for all graphical objects
        • Separate Expression Dataset files store data and color-coding instructions
        • Gene Databases store IDs, annotation, and hyperlinks to public gene and protein databases
        • Stand-alone program implemented in Visual Basic; accessory files are Microsoft Access databases
    5. 5. GenMAPP Design and Implementation
        • Stand-alone program implemented in Visual Basic
          -- program and source code available at www.GenMAPP.org
        • Back-end is Microsoft Jet Database Engine
        • Three types of databases used in program:
          -- MAPP database stores gene IDs and vector coordinates for all graphical objects
          -- Expression Datasets store gene IDs, DNA microarray data, and color-coding instructions
          -- Gene Database stores gene IDs and annotation from major public databases
        • All databases are specific to a single species; users match appropriate MAPPs, Expression Datasets, and Gene Databases when using the program
    6. 7. MAPPFinder Determines Which GO Terms Are Overrepresented in a GenMAPP Expression Dataset
    7. 8. Maintaining and Updating GenMAPP Gene Databases Has Been a Bottleneck for Development
        • Microarrays use different gene ID systems for annotation; users want as much information as possible.
        • We need to capture and reliably relate gene data from different sources and keep the data updated.
        • Gene Database design is data-driven; it tells GenMAPP what gene ID systems and relationships are present.
        • Current GenMAPP Gene Databases are built from Ensembl as the main data source.
          -- limited to (mostly) animal species
          -- sensitive to changes in flat file formats
    8. 9. XMLPipeDB: A Reusable, Open Source Tool Chain for Building Relational Databases from XML Sources
        Requirements:
        • to create Gene Databases for other species (bacteria/plants) using UniProt as the main data source
        • to be robust to changes in source file formats
        • to use XML sources wherever possible
        • to take advantage of existing open source tools
        • to limit the manual manipulation of the data
        The first task, reported last year at BOSC, was to build a GenMAPP Gene Database for Escherichia coli K12.
    9. 10. GenMAPP Gene Database Schema for Escherichia coli K12
    10. 11. Data Sources Required for a “Minimal” GenMAPP Gene Database
        • UniProt: UniProt complete proteome sets for many species are made available as XML downloads by the Integr8 resource
        • Gene Ontology: OBO XML format
        • UniProt to GO associations: GOA downloads also available at Integr8
    11. 12. XMLPipeDB Use Case Diagram
    12. 13. XMLPipeDB Use Case Diagram
    13. 14. Produces:
        • Java source code
        • SQL DDL file
        • Hibernate mappings
        • Apache Ant build.xml
    14. 15. XMLPipeDB Use Case Diagram
    15. 16. UniProtDB and GODB Required Only Nominal Post-processing
        • Naming: XSD or DTD definitions might use names that are SQL reserved words and thus cannot be used as table or attribute names
          -- In UniProtDB, “end” was renamed to “endPosition”
          -- In GODB, “to” was renamed to “to_”
        • Datatypes: Some XSD datatypes are not easily supported in SQL
          -- In UniProtDB, the definition for citationType was changed from month/year to string
          -- Some definitions were changed from SQL varchar(255) to varchar(unspecified length)
        • Schema diagram automatically generated with third party tool
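The naming fix described on this slide can be illustrated with a short sketch. It is written in Python for brevity and is not XSD-to-DB's actual code (the tool chain itself is Java). The reserved-word set is a small hypothetical sample, the function name `safe_sql_name` is invented for illustration, and the underscore fallback is an assumption modeled on the “to_” rename; only the two explicit renames come from the slide.

```python
# Hypothetical sample of SQL reserved words; a real list is much longer
# and depends on the target database.
SQL_RESERVED = {"end", "to", "select", "table", "order", "group", "user"}

# Renames reported on the slide:
#   UniProtDB: "end" -> "endPosition"
#   GODB:      "to"  -> "to_"
EXPLICIT_RENAMES = {"end": "endPosition", "to": "to_"}


def safe_sql_name(xml_name: str) -> str:
    """Return a name that can be used as a table or column identifier."""
    if xml_name in EXPLICIT_RENAMES:
        return EXPLICIT_RENAMES[xml_name]
    if xml_name.lower() in SQL_RESERVED:
        # Assumed fallback: append an underscore, as was done for "to".
        return xml_name + "_"
    return xml_name
```

Names that are not reserved pass through unchanged, so the mapping only touches the handful of definitions that would otherwise break the generated DDL.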
    16. 17. XMLPipeDB Use Case Diagram
    17. 18. “Rule of Three” XMLPipeDB Utilities Library is a Suite of Java Classes that Provide Functions Common to Most XMLPipeDB Database Applications
        • Loading of XML files into Java objects
        • Saving XML-derived Java objects to a relational database
        • Rudimentary query and retrieval of Java objects from the relational database
          -- HQL (Hibernate Query Language), SQL query
          -- object browser that shows results of query
        • Configuring a client application to communicate with a relational database
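The load, save, and query cycle listed on this slide can be sketched in miniature. This is a Python illustration under stated assumptions, not the XMLPipeDB Utilities API: the real library is Java and works through Hibernate against a relational database, whereas this sketch uses the standard-library `xml.etree.ElementTree` and an in-memory SQLite database as stand-ins. The XML layout, the `Entry` class, and all function names are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Hypothetical, heavily simplified XML; the real UniProt schema is far richer.
SAMPLE_XML = """
<entries>
  <entry><accession>ACC1</accession><name>geneA</name></entry>
  <entry><accession>ACC2</accession><name>geneB</name></entry>
</entries>
"""


@dataclass
class Entry:
    accession: str
    name: str


def load_entries(xml_text):
    """Step 1: load XML into plain objects."""
    root = ET.fromstring(xml_text)
    return [Entry(e.findtext("accession"), e.findtext("name"))
            for e in root.iter("entry")]


def save_entries(conn, entries):
    """Step 2: save the XML-derived objects to a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entry (accession TEXT PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO entry VALUES (?, ?)",
        [(e.accession, e.name) for e in entries])
    conn.commit()


def query_entries(conn, name_prefix):
    """Step 3: rudimentary query that returns objects again."""
    rows = conn.execute(
        "SELECT accession, name FROM entry WHERE name LIKE ? ORDER BY accession",
        (name_prefix + "%",))
    return [Entry(*row) for row in rows]


conn = sqlite3.connect(":memory:")  # stand-in for the real database
save_entries(conn, load_entries(SAMPLE_XML))
```

Calling `query_entries(conn, "gene")` then returns both sample entries as `Entry` objects, mirroring the round trip from XML to objects to tables and back.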
    18. 19. XMLPipeDB Use Case Diagram
    19. 20. GenMAPP Builder Interacts with PostgreSQL in Three Ways
    20. 21. GenMAPP Builder Uses the XMLPipeDB Utilities Library to Configure the PostgreSQL Database … and import XML
    21. 22. GenMAPP Builder Has Customized Profiles
        • for each Primary Data Source (currently only UniProt)
        • for each species (currently Escherichia coli K12 and Arabidopsis thaliana) based on TaxonID
    22. 23. The User Chooses Which Gene ID Systems and Relations to Export to the Gene Database
    23. 24. GenMAPP Gene Database for Escherichia coli K12 Was the First Milestone for XMLPipeDB
        • Loading the XML files into the PostgreSQL database took approximately 20 minutes
          -- UniProt XML (44 MB)
          -- GO XML (13 MB)
        • Export of the Gene Database took approximately 2 hours
        • Data integrity was checked by hand
          -- all 4329 records from UniProt were successfully exported to the Gene Database
          -- our Gene Database is missing 219 Blattner IDs
          -- the missing IDs were not present in the UniProt XML:
               157 RNA genes
                 1 origin of replication
                51 protein coding sequences
                10 no feature designation
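The integrity check on this slide was done by hand. A minimal Python sketch of the same comparison follows; it is an illustration only, not part of GenMAPP Builder, and the function name `missing_ids` is hypothetical. The counts are those reported on the slide.

```python
def missing_ids(reference_ids, exported_ids):
    """Return the reference IDs that never reached the exported database."""
    return sorted(set(reference_ids) - set(exported_ids))


# Breakdown of the missing Blattner IDs as reported on the slide.
missing_by_feature = {
    "RNA genes": 157,
    "origin of replication": 1,
    "protein coding sequences": 51,
    "no feature designation": 10,
}

# The four categories should account for every missing ID (219 in total).
total_missing = sum(missing_by_feature.values())
```

The category counts sum to 219, which matches the number of missing Blattner IDs, so the breakdown on the slide is complete.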
    24. 25. The Next Challenge was to Create a Gene Database for the Plant, Arabidopsis thaliana
    25. 26. The Next Challenge was to Create a Gene Database for the Plant, Arabidopsis thaliana
        • Creating an Arabidopsis Gene Database required refactoring the code
        • The Arabidopsis UniProt proteome set has 34,566 proteins
          -- UniProt XML file is 237 MB
          -- order of magnitude larger than Escherichia coli
        • GenMAPP Builder initially failed to import the large XML file
          -- the file had to be broken into smaller individual files programmatically
        • Export of the Gene Database requires 2 GB RAM and takes > 7 hours
        • While we captured all but 51 of the E. coli protein-encoding genes, we are missing 1173 protein-encoding genes in TAIR
          -- mapping between TAIR and UniProt seems poor; TAIR IDs appear in many different fields in the XML
        • We have implemented crude data integrity checks that tally IDs in the XML for comparison with the exported database; more are needed
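Breaking a large XML file into smaller files, as described on this slide, can be sketched with a streaming parser. This is a Python illustration under stated assumptions, not the code GenMAPP Builder uses: it assumes each record is a direct child of the root element, uses the hypothetical tag names `entry` and `entries`, and ignores XML namespaces, which a real UniProt file would require handling.

```python
import io
import xml.etree.ElementTree as ET


def split_entries(source, entries_per_chunk,
                  entry_tag="entry", root_tag="entries"):
    """Yield small XML documents (as str), each holding at most
    entries_per_chunk records, without keeping the whole input in memory."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    chunk = ET.Element(root_tag)
    for event, elem in context:
        if event == "end" and elem.tag == entry_tag:
            chunk.append(elem)
            if len(chunk) >= entries_per_chunk:
                yield ET.tostring(chunk, encoding="unicode")
                chunk = ET.Element(root_tag)
                root.clear()  # release records that were already written out
    if len(chunk):
        yield ET.tostring(chunk, encoding="unicode")


# Usage with a tiny in-memory sample; a real run would pass a file path
# and write each chunk to its own file.
sample = io.BytesIO(
    b"<entries><entry>a</entry><entry>b</entry><entry>c</entry></entries>")
chunks = list(split_entries(sample, 2))
```

Each chunk is a well-formed document with its own root, so the smaller files can be imported one at a time by the same loader that handles a complete file.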
    26. 27. Wish List for XMLPipeDB Development
        • Additional XML data sources
        • Additional data management features for XPDutils
          -- shift loading of GOA file to import process
          -- add/delete species data
        • Automate data integrity checks
        • GUI needs substantial work
    27. 28. Acknowledgments
        • XSD-to-DB: Adam Carasso, Jeffrey Nicholas, Scott Spicer
        • XMLPipeDBUtils: David Hoffman, Babak Naffas, Jeffrey Nicholas, Ryan Nakamoto
        • UniProtDB: Joe Boyle, Joey Barrett
        • GODB: Scott Spicer, Roberto Ruiz
        • GenMAPP Builder: Joey Barrett, Jeffrey Nicholas, Scott Spicer
        • Special Thanks: GenMAPP.org Development Group; Caskey L. Dickson, Wesley T. Citti; NSF CCLI Program (http://recourse.cs.lmu.edu)
        LMU Bioinformatics Group, http://xmlpipedb.cs.lmu.edu
        Kam D. Dahlquist, http://myweb.lmu.edu/kdahqui, [email_address]
        John David N. Dionisio, http://myweb.lmu.edu/dondi, [email_address]