Dahlquist_XMLPipedB_BOSC2009
Usage Rights

CC Attribution License

  • The UniProt and GO XML schemas have each changed twice during GenMAPP Builder development. In all instances, updating GenMAPP Builder to use the new schemas consisted of these fairly mechanical steps:
    1. Re-run XSD-to-DB on the updated schema
    2. Re-run name-clash adjustment utilities on the new file set
    3. Redefine the SQL tables in the relational database
    4. Replace the Java libraries in GenMAPP Builder
    None of these steps involved manual recoding of UniProt or GO code. The changes only affected GenMAPP Builder code if the schema changes affected tables or fields whose data are exported to the GenMAPP Gene Database.
    Underscores, slashes, and periods are not uniformly carried over into the same UniProt XML fields.

Dahlquist_XMLPipedB_BOSC2009 Presentation Transcript

  • A Reusable, Open Source Tool Chain for Building Relational Databases from XML Sources
    • BOSC, Stockholm, Sweden, June 27, 2009
    • Kam D. Dahlquist, Alexandrea Alphonso, Chad Villaflores (Department of Biology)
    • John David N. Dionisio, Derek Smith (Department of Electrical Engineering & Computer Science)
    • Loyola Marymount University
    • http://xmlpipedb.cs.lmu.edu
  • Outline
    • Motivation
      • GenMAPP
      • Project requirements
    • XMLPipeDB Implementation
      • XSD-to-DB
      • UniProtDB and GODB
      • XMLPipeDB Utilities
      • GenMAPP Builder
    • Lessons Learned
      • How robust is our system to changes in XML formats?
      • How well does our system work with other common bioinformatics XML formats?
  • How GenMAPP Works http://www.GenMAPP.org
    • Graphics tools make MAPPs that store gene IDs and vector coordinates for all graphical objects
    • Separate Expression Dataset files store data and color-coding instructions
    • Gene Databases store IDs, annotation, and hyperlinks to public gene and protein databases
    • MAPPFinder performs Gene Ontology over-representation analysis
    • Stand-alone program implemented in Visual Basic; accessory files are Microsoft Access databases
  • Maintaining and Updating GenMAPP Gene Databases has been a Bottleneck for Development
    • Microarrays use different gene ID systems for annotation; users want as much information as possible.
    • We need to capture and reliably relate gene data from different sources and keep the data updated.
    • Gene Database design is data-driven; it tells GenMAPP what gene ID systems and relationships are present.
    • Current GenMAPP Gene Databases are built from Ensembl as the main data source.
      • limited to (mostly) animal species
      • sensitive to changes in flat file formats
  • XMLPipeDB: A Reusable, Open Source Tool Chain for Building Relational Databases from XML Sources
    • Requirements:
      • to create Gene Databases for other species (bacteria/plants) using UniProt as the main data source
      • to be robust to changes in source file formats
      • to use XML sources wherever possible
      • to take advantage of existing open source tools
      • to limit the manual manipulation of the data
    • Data sources required for a minimal GenMAPP Gene Database:
      • UniProt XML (complete proteome sets from Integr8)
      • Gene Ontology OBO-XML
      • GOA gene association files (also from Integr8)
  • XMLPipeDB Use Case Diagram
  • XSD-to-DB is based on Hyperjaxb2
    • Reads an XSD or DTD
    • Automatically generates:
      • SQL schema
      • Java classes
      • Hibernate mappings
      • Apache Ant build.xml file
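    To make the schema-to-SQL idea concrete, here is a minimal Python sketch. It is not XSD-to-DB or Hyperjaxb2 (those are Java tools that also emit Java classes and Hibernate mappings); it only illustrates deriving CREATE TABLE statements from named complex types in an XSD. The type mapping, the `xs:` prefix on type names, the `id` key column, and the function name `xsd_to_create_tables` are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Assumed, deliberately small mapping from XSD built-in types to SQL types.
# It also assumes the schema document uses the "xs:" prefix in type names.
XSD_TO_SQL = {
    "xs:string": "TEXT",
    "xs:int": "INTEGER",
    "xs:integer": "INTEGER",
    "xs:date": "DATE",
    "xs:boolean": "BOOLEAN",
}


def xsd_to_create_tables(xsd_text):
    """Return one CREATE TABLE statement per top-level named complexType."""
    root = ET.fromstring(xsd_text)
    statements = []
    for ctype in root.findall(f"{XS}complexType"):
        # Hypothetical surrogate key; real tools choose their own key scheme.
        columns = ["id SERIAL PRIMARY KEY"]
        for elem in ctype.iter(f"{XS}element"):
            # Unknown or missing types fall back to TEXT in this sketch.
            sql_type = XSD_TO_SQL.get(elem.get("type"), "TEXT")
            columns.append(f"{elem.get('name')} {sql_type}")
        body = ",\n  ".join(columns)
        statements.append(f"CREATE TABLE {ctype.get('name')} (\n  {body}\n);")
    return statements


# Invented example schema, not taken from UniProt or GO.
EXAMPLE_XSD = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="gene">
    <xs:sequence>
      <xs:element name="symbol" type="xs:string"/>
      <xs:element name="length" type="xs:int"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>
"""

ddl = xsd_to_create_tables(EXAMPLE_XSD)
print(ddl[0])
```

    A real schema has nested types, attributes, and repeated elements that need join tables, which is why a generator built on an existing binding tool is preferable to hand-written mapping code.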
  • UniProtDB and GODB Required Only Nominal Post-processing
    • XML element names that collide with SQL reserved words must be renamed
    • XML Schema datatypes must be mapped to datatypes supported in SQL
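    The reserved-word issue can be sketched in a few lines of Python. This is an assumption-laden illustration, not the project's actual name-clash utility: the reserved-word set is a tiny hand-picked subset, and the `_field` suffix and the function name `safe_identifier` are hypothetical.

```python
# Tiny illustrative subset of SQL reserved words; a real list would be taken
# from the documentation of the target database.
SQL_RESERVED = {"select", "from", "table", "order", "group", "end", "references", "to"}


def safe_identifier(xml_name, suffix="_field"):
    """Return a column name that does not clash with a reserved word.

    The suffix is a hypothetical convention chosen for this sketch.
    """
    if xml_name.lower() in SQL_RESERVED:
        return xml_name + suffix
    return xml_name


def rename_all(xml_names):
    """Map every XML name to its SQL-safe counterpart."""
    return {name: safe_identifier(name) for name in xml_names}


renamed = rename_all(["accession", "end", "Order"])
print(renamed)
```

    Names that are already safe pass through unchanged, so the renaming only touches the few fields that would otherwise break the generated schema.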
  • XMLPipeDB Utilities are Reusable
    • XML files are broken down into 25-record chunks for import
    • TallyEngine counts records in XML and relational database
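    The two utilities above can be sketched together in Python. This is a simplified illustration under stated assumptions, not the XMLPipeDB Utilities code: the helper names `iter_chunks` and `tally`, the `<entry>` record tag, and the invented 60-record input are all hypothetical, and the database count is simulated rather than queried.

```python
import xml.etree.ElementTree as ET
from itertools import islice


def iter_chunks(records, size=25):
    """Yield successive lists of at most `size` records."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def tally(xml_text, tag):
    """Count how many elements named `tag` appear in the XML text."""
    root = ET.fromstring(xml_text)
    return sum(1 for _ in root.iter(tag))


# Invented input: 60 empty <entry> records.
xml_text = "<entries>" + "".join(f"<entry id='{i}'/>" for i in range(60)) + "</entries>"
root = ET.fromstring(xml_text)

chunks = list(iter_chunks(root.findall("entry")))
chunk_sizes = [len(chunk) for chunk in chunks]

xml_count = tally(xml_text, "entry")
# Stand-in for a SELECT COUNT(*) against the imported table.
imported_count = sum(chunk_sizes)

print(chunk_sizes, xml_count == imported_count)
```

    Comparing the count in the source XML with the count in the database is a cheap check that no records were dropped during a chunked import.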
  • GenMAPP Builder then produces…
  • GenMAPP Gene Databases
    • Escherichia coli K12
    • Arabidopsis thaliana
    • Vibrio cholerae
    • Plasmodium falciparum
  • Workflow for Interdisciplinary Undergraduate Student Projects
    • Created new species profiles for Vibrio and Plasmodium
    • Re-analyzed published microarray datasets
  • How robust is our system to changes in XML formats?
    • Data-driven GenMAPP Gene Database design allowed our system to pick up RefSeq and NCBI Gene IDs “for free” from cross-references in UniProt XML
    • The UniProt and GO XML schemas have each changed twice during GenMAPP Builder development, and the changes were handled mostly automatically
    • However, XML sources need to keep their own XSDs updated!
    • Each new species does require additional coding to handle the vagaries of its own gene ID system
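    One way to picture the data-driven pickup of cross-references is the Python sketch below. It is not GenMAPP Builder code: it only groups `dbReference` elements from UniProt-style XML by their `type` attribute, so any ID system present in the data shows up without being named in the code. The sample entries, accessions, and IDs are invented, and the function name `collect_cross_references` is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

NS = {"up": "http://uniprot.org/uniprot"}

# Invented sample in the style of UniProt XML; the IDs are not real records.
SAMPLE = """
<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <accession>P00001</accession>
    <dbReference type="RefSeq" id="NP_000001.1"/>
    <dbReference type="GeneID" id="100001"/>
  </entry>
  <entry>
    <accession>P00002</accession>
    <dbReference type="RefSeq" id="NP_000002.1"/>
  </entry>
</uniprot>
"""


def collect_cross_references(xml_text):
    """Map each dbReference type to a list of (accession, id) pairs."""
    root = ET.fromstring(xml_text)
    by_system = defaultdict(list)
    for entry in root.findall("up:entry", NS):
        accession = entry.findtext("up:accession", namespaces=NS)
        # Only direct children of <entry> are inspected in this sketch.
        for ref in entry.findall("up:dbReference", NS):
            by_system[ref.get("type")].append((accession, ref.get("id")))
    return dict(by_system)


xrefs = collect_cross_references(SAMPLE)
print(sorted(xrefs))
```

    Because the set of ID systems comes from the data, a new cross-reference type appears as a new key with no code change. Species-specific ID quirks would still need their own handling.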
  • How Well Do Bioinformatics XML Formats Perform with XMLPipeDB?
    • Table columns:
      • Data Source
      • XSD-to-DB
      • Successful creation of PostgreSQL database with automatically generated schema.sql
      • Successful build of XMLPipeDB Utilities with ant and import of XML data into PostgreSQL
      • Export Gene Database with GenMAPP Builder
    • UniProt XML, GO OBO-XML (post-processing performed)
    • Multiple dependent XSDs or DTDs could not be processed: SBML, MathML, CellML, multiple other NCBI DTDs
    • Error: property with the same name is generated from more than one schema component: MiniML (NCBI GEO), GPML (GenMAPP), BioMart, PDBML, HUP-ML (proteomics data), PubChem, RNAML
    • Syntax error with naming or datatypes; would require post-processing: EMBL Nucleotide, EMBL CDS, mzML (mass spec data), AGML (2D gel data), dbSNP (NCBI)
    • N/A: KEML (KEGG)
  • Acknowledgments
    • Initial Development: Joey Barrett, Joe Boyle, Adam Carasso, David Hoffman, Babak Naffas, Ryan Nakamoto, Jeffrey Nicholas, Roberto Ruiz, Scott Spicer
    • Current Development: Alexandrea Alphonso, Derek Smith, Chad Villaflores
    • … and the rest of the undergraduates from the Fall 2008 Biological Databases class
    • Kam D. Dahlquist [email_address]
    • John David N. Dionisio [email_address]
    • http://xmlpipedb.cs.lmu.edu
    • http://sourceforge.net/projects/xmlpipedb