XMLPipeDB


    Notes on slide 1

      • We are both relatively new to LMU.
      • Dondi’s background is in medical informatics, data visualization, and person-computer interaction.
      • During my postdoc I served as project manager for GenMAPP, and I want to extend the features of GenMAPP, especially for other species.
      • I am not a software developer (the last computer science class I took was AP Pascal in high school), but I have a lot of experience interacting with developers.
      • I’m proud of GenMAPP, especially that it is user-friendly for biologists and relatively bug-free (a result of my extensive testing).
      • However, I never would have stood up in this community to talk about it: although we believed strongly that GenMAPP should be free of charge, we were slow to make the source code available (it is now available on SourceForge).
      • It is only through my collaboration with Dondi that I have been educated as to what Open Source software development truly means (The Cathedral and the Bazaar).
      • This is the perfect forum for talking about our work: while I am using the fruits of XMLPipeDB for GenMAPP as first imagined, we designed this project to have components that are reusable for other purposes, and the bioinformatics developer community is our target audience.


    XMLPipeDB - Presentation Transcript

    1. A Reusable, Open Source Tool Chain for Building Relational Databases from XML Sources
      • BOSC, Vienna, Austria, July 20, 2007
      • Kam D. Dahlquist, Department of Biology
      • Jeffrey Nicholas and John David N. Dionisio, Department of Electrical Engineering & Computer Science
      • Loyola Marymount University
      • http://xmlpipedb.cs.lmu.edu
    2. Outline
      • Motivation
        • -- GenMAPP
        • -- Project requirements
      • XMLPipeDB Implementation
        • -- XSD-to-DB
        • -- UniProtDB and GODB
        • -- XMLPipeDB Utilities
        • -- GenMAPP Builder
      • Wish List for Development
    4. How GenMAPP Works http://www.GenMAPP.org
      • Graphics tools make MAPPs that store gene IDs and vector coordinates for all graphical objects
      • Separate Expression Dataset files store data and color-coding instructions
      • Gene Databases store IDs, annotation, and hyperlinks to public gene and protein databases
      • Stand-alone program implemented in Visual Basic; accessory files are Microsoft Access databases
    5. GenMAPP Design and Implementation
      • Stand-alone program implemented in Visual Basic
        • -- program and source code available at www.GenMAPP.org
      • Back-end is Microsoft Jet Database Engine
      • Three types of databases used in program:
        • -- MAPP database stores gene IDs and vector coordinates for all graphical objects
        • -- Expression Datasets store gene IDs, DNA microarray data, and color-coding instructions
        • -- Gene Database stores gene IDs and annotation from major public databases
      • All databases are specific to a single species; users match appropriate MAPPs, Expression Datasets, and Gene Databases when using the program
    6.  
    7. MAPPFinder Determines Which GO Terms Are Overrepresented in a GenMAPP Expression Dataset
    8. Maintaining and Updating GenMAPP Gene Databases Has Been a Bottleneck for Development
      • Microarrays use different gene ID systems for annotation; users want as much information as possible.
      • We need to capture and reliably relate gene data from different sources and keep the data updated.
      • Gene Database design is data-driven; it tells GenMAPP what gene ID systems and relationships are present.
      • Current GenMAPP Gene Databases are built from Ensembl as the main data source.
        • -- limited to (mostly) animal species
        • -- sensitive to changes in flat file formats
    9. XMLPipeDB: A Reusable, Open Source Tool Chain for Building Relational Databases from XML Sources
      • Requirements:
        • -- to create Gene Databases for other species (bacteria/plants) using UniProt as the main data source
        • -- to be robust to changes in source file formats
        • -- to use XML sources wherever possible
        • -- to take advantage of existing open source tools
        • -- to limit the manual manipulation of the data
      • First task, reported last year at BOSC, was to build a GenMAPP Gene Database for Escherichia coli K12
    10. GenMAPP Gene Database Schema for Escherichia coli K12
    11. Data Sources Required for a “Minimal” GenMAPP Gene Database
      • UniProt
        • -- UniProt complete proteome sets for many species are made available as XML downloads by the Integr8 resource
      • Gene Ontology
        • -- OBO XML format
      • UniProt to GO associations
        • -- GOA downloads also available at Integr8
    12. XMLPipeDB Use Case Diagram
    13. XMLPipeDB Use Case Diagram
    14. Produces:
      • Java source code
      • SQL DDL file
      • Hibernate mappings
      • Apache Ant build.xml
    15. XMLPipeDB Use Case Diagram
    16. UniProtDB and GODB Required Only Nominal Post-processing
      • Naming: XSD or DTD definitions might use names that are SQL reserved words and thus cannot be used as table or attribute names
        • -- In UniProtDB, “end” was renamed to “endPosition”
        • -- In GODB, “to” was renamed to “to_”
      • Datatypes: Some XSD datatypes are not easily supported in SQL
        • -- In UniProtDB, the definition for citationType was changed from month/year to string
        • -- Some definitions were changed from SQL varchar(255) to varchar(unspecified length)
      • Schema diagram automatically generated with third-party tool
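
The naming fix on this slide can be sketched in a few lines. This is only an illustration in Python, not XMLPipeDB's Java code: the reserved-word set is a small hand-picked sample, and `safe_identifier`, `SQL_RESERVED`, and `OVERRIDES` are hypothetical names. Only the two renames ("end" to "endPosition", "to" to "to_") come from the slide; the underscore fallback for other reserved words is an assumption.

```python
# Illustrative sample of SQL reserved words; a real tool would use the
# full list for its target database.
SQL_RESERVED = {"end", "to", "from", "select", "table", "order", "group"}

# The two renames reported on the slide (UniProtDB and GODB).
OVERRIDES = {"end": "endPosition", "to": "to_"}


def safe_identifier(name: str) -> str:
    """Return a name that can be used as a table or column identifier."""
    key = name.lower()
    if key not in SQL_RESERVED:
        return name
    # Prefer a hand-picked replacement; otherwise append an underscore
    # (an assumed fallback, not something the slide describes).
    return OVERRIDES.get(key, name + "_")
```

Applied to every element and attribute name in a schema before the DDL is generated, a check like this keeps the post-processing step small and repeatable.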
    17. XMLPipeDB Use Case Diagram
    18. “Rule of Three” XMLPipeDB Utilities Library Is a Suite of Java Classes That Provide Functions Common to Most XMLPipeDB Database Applications
      • Loading of XML files into Java objects
      • Saving XML-derived Java objects to a relational database
      • Rudimentary query and retrieval of Java objects from the relational database
        • -- HQL (Hibernate Query Language), SQL query
        • -- object browser that shows results of query
      • Configuring a client application to communicate with a relational database
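
The load, save, and query steps listed above can be sketched end to end. This is a minimal Python analogue under stated assumptions, not the library's API: it swaps in `xml.etree.ElementTree` and an in-memory SQLite database for the Java binding, Hibernate, and PostgreSQL used by XMLPipeDB, and the XML sample, the `Entry` class, and the function names are all made up for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Made-up XML; real UniProt XML is namespaced and far richer.
SAMPLE_XML = """
<entries>
  <entry><accession>X0001</accession><name>EXAMPLE_ONE</name></entry>
  <entry><accession>X0002</accession><name>EXAMPLE_TWO</name></entry>
</entries>
"""


@dataclass
class Entry:
    accession: str
    name: str


def load_entries(xml_text):
    """Load XML into plain objects."""
    root = ET.fromstring(xml_text)
    return [Entry(e.findtext("accession"), e.findtext("name"))
            for e in root.iter("entry")]


def save_entries(conn, entries):
    """Save the XML-derived objects to a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entry (accession TEXT PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO entry (accession, name) VALUES (?, ?)",
        [(e.accession, e.name) for e in entries])
    conn.commit()


def find_entry(conn, accession):
    """Query the table and rebuild an object from the matching row."""
    row = conn.execute(
        "SELECT accession, name FROM entry WHERE accession = ?",
        (accession,)).fetchone()
    return Entry(*row) if row else None


conn = sqlite3.connect(":memory:")
save_entries(conn, load_entries(SAMPLE_XML))
print(find_entry(conn, "X0002"))
```

The point of the sketch is the division of labor: each step is a small reusable function, which is the same reason the slide gives for collecting these functions into one shared library.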
    19. XMLPipeDB Use Case Diagram
    20. GenMAPP Builder Interacts with PostgreSQL in Three Ways
    21. GenMAPP Builder Uses the XMLPipeDB Utilities Library to Configure the PostgreSQL Database … and import XML
      • GenMAPP Builder has customized profiles
        • -- for each primary data source (currently only UniProt)
        • -- for each species (currently Escherichia coli K12 and Arabidopsis thaliana), based on TaxonID
    22. The User Chooses Which Gene ID Systems and Relations to Export to the Gene Database
    23. GenMAPP Gene Database for Escherichia coli K12 Was the First Milestone for XMLPipeDB
      • Loading the XML files into the PostgreSQL database took approximately 20 minutes
        • -- UniProt XML (44 MB)
        • -- GO XML (13 MB)
      • Export of the Gene Database took approximately 2 hours
      • Data integrity was checked by hand
        • -- all 4329 records from UniProt were successfully exported to the Gene Database
        • -- our Gene Database is missing 219 Blattner IDs
        • -- the missing IDs were not present in the UniProt XML:
          • 157 RNA genes
          • 1 origin of replication
          • 51 protein coding sequences
          • 10 no feature designation
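
The breakdown above can be checked arithmetically, and the comparison that was done by hand amounts to a set difference. The sketch below is an illustration in Python: the four counts are copied from the slide, while `missing_ids` and the example IDs are hypothetical.

```python
# Counts of missing Blattner IDs by feature type, as listed on the slide.
missing_by_feature = {
    "RNA genes": 157,
    "origin of replication": 1,
    "protein coding sequences": 51,
    "no feature designation": 10,
}

# The categories should add up to the 219 missing IDs reported.
total_missing = sum(missing_by_feature.values())
assert total_missing == 219


def missing_ids(reference_ids, exported_ids):
    """Return the reference IDs that never reached the exported database."""
    return sorted(set(reference_ids) - set(exported_ids))


# Made-up IDs, for illustration only.
print(missing_ids(["b0001", "b0002", "b0003"], ["b0002"]))
```

Grouping the result of `missing_ids` by feature type would reproduce a table like the one on the slide.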
    24. The Next Challenge was to Create a Gene Database for the Plant, Arabidopsis thaliana
    25. The Next Challenge was to Create a Gene Database for the Plant, Arabidopsis thaliana
      • Creating an Arabidopsis Gene Database required refactoring the code
      • The Arabidopsis UniProt proteome set has 34,566 proteins
        • -- UniProt XML file is 237 MB
        • -- order of magnitude larger than Escherichia coli
      • GenMAPP Builder initially failed to import the large XML file
        • -- the file had to be broken into smaller individual files programmatically
      • Export of the Gene Database requires 2 GB RAM and takes > 7 hours
      • While we captured all but 51 of the E. coli protein-encoding genes, we are missing 1173 protein-encoding genes in TAIR
        • -- mapping between TAIR and UniProt seems poor; TAIR IDs appear in many different fields in XML
      • Have implemented crude data integrity checks that tally IDs in XML for comparison with exported database; need more
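
One way to break a large XML file into smaller files programmatically is to stream it and write out fixed-size batches of records. The sketch below is a Python illustration under assumptions, not the approach GenMAPP Builder actually took: `split_xml`, its parameters, and the output file naming are hypothetical, and the tag names assume un-namespaced XML (real UniProt XML declares a namespace, so the record tag would need to be namespace-qualified).

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def split_xml(source, out_dir, record_tag="entry", per_file=1000,
              root_tag="entries"):
    """Stream `source` and write its records into files of `per_file` each."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    batch = []

    def flush():
        # Wrap the current batch in a fresh root element and write it out.
        if not batch:
            return
        root = ET.Element(root_tag)
        root.extend(batch)
        path = out_dir / f"part_{len(written) + 1:04d}.xml"
        ET.ElementTree(root).write(str(path), encoding="utf-8",
                                   xml_declaration=True)
        written.append(path)
        batch.clear()

    context = ET.iterparse(str(source), events=("start", "end"))
    _, doc_root = next(context)  # first event is the start of the document root
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            batch.append(elem)
            if len(batch) >= per_file:
                flush()
                doc_root.clear()  # release records that are already on disk
    flush()  # write any remaining records
    return written
```

Calling `split_xml("big.xml", "parts", per_file=5000)` (file names hypothetical) returns the list of part files. Because the parser never holds more than one batch in memory, the same code handles a 44 MB or a 237 MB source without a larger heap.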
    26. Wish List for XMLPipeDB Development
      • Additional XML data sources
      • Additional data management features for XPDutils
        • -- shift loading of GOA file to import process
        • -- add/delete species data
      • Automate data integrity checks
      • GUI needs substantial work
    27. Acknowledgments
      • XSD-to-DB: Adam Carasso, Jeffrey Nicholas, Scott Spicer
      • XMLPipeDBUtils: David Hoffman, Babak Naffas, Jeffrey Nicholas, Ryan Nakamoto
      • UniProtDB: Joe Boyle, Joey Barrett
      • GODB: Scott Spicer, Roberto Ruiz
      • GenMAPP Builder: Joey Barrett, Jeffrey Nicholas, Scott Spicer
      • Special Thanks
        • -- GenMAPP.org Development Group
        • -- Caskey L. Dickson, Wesley T. Citti
        • -- NSF CCLI Program (http://recourse.cs.lmu.edu)
      • http://xmlpipedb.cs.lmu.edu
      • LMU Bioinformatics Group
        • -- Kam D. Dahlquist, http://myweb.lmu.edu/kdahqui, [email_address]
        • -- John David N. Dionisio, http://myweb.lmu.edu/dondi, [email_address]
