Chado and interoperability Chris Mungall, BDGP Pinglei Zhou, FlyBase-Harvard
Databases and applications SQL ? How do we get databases and applications speaking to one another?  Application Application Java Perl Chado seq cv rad genetic phylo pub
Databases and applications posgresql driver posgresql driver Application Application SQL Java Perl Generic database interfaces only solve  part  of the problem JDBC DBI They let us  embed  SQL inside application code Chado seq cv rad genetic phylo pub method calls row objects method calls perl arrays
Why this  alone  isn’t enough Interfacing applications to databases is a tricky business… Issue: Language mismatch Issue: Data structure mismatch Issue: Repetitive code Issue: No centralized domain logic
Language mismatch String sql = " SELECT srcfeature_id, fmax, fmin  “+ “ FROM featureloc  "+ " WHERE feature_id ="  + featId; try { Statement s = conn.createStatement(); ResultSet rs = s.executeQuery(sql); rs.next(); sourceFeatureId  = rs.getInt(" srcfeature_id "); fmin  = rs.getInt(" fmin "); fmax  = rs.getInt(" fmax "); } catch (SQLException sqle) { System.err.println(this.getClass() + ": SQLException retrieving feature loc" +  " for feature_id = " + featId); sqle.printStackTrace(System.err); }
Data structure mismatch Different formalisms X relations -  set theoretic - relational  algebra classes and structs -  pointers - programs
Repetitive code Database fetch pattern construct, ask, transform, repeat, stitch Example: fetching gene models fetch genes fetch transcripts fetch exons, polypeptides fetch ancillary data  (props, cvs, pubs, syns, etc) Optimisation is difficult
No centralized domain logic Examples of domain logic: project a feature onto a virtual contig revcomp or translate a sequence search by ontology term delete a gene model Domain logic should be  reusable  by different applications
A solution:Object Oriented APIs biosql adapter chado adapter Application Perl Different perl applications share the same API API Different schemas can be added by writing adapters chado method calls domain objects Perl driver DBI biosql driver DBI Application Perl domain objects
How do OO APIs help? Issue: Language mismatch Seperation of interface from implementation Issue: Data structure mismatch API talks objects adapters hide and deal with conversion Issue: Repetitive code code centralized in both API and adapter Issue: No centralized domain logic object model encapsulates domain logic object model can be used independently of database
How do OO APIs hinder? Writing or generating adapters brittle, difficult to maintain Restrictive canned parameterized queries  v s  query language Application language bound very difficult to use a perl API from java Application bound sometimes generic, but often limited to one application Opaque domain logic
XML can help XML is  application-language neutral   XML can be used to specify: data transactions queries and query constraints XML can be used within both application languages and specialized XML languages XPath XSLT XQuery
XML middleware impl Application Perl Database XML as lingua franca interface chado query params chado xml Perl driver DBI Application query params chado xml Java any Generic  SQL to XML mapper OO APIs can be implemented on top of XML layer
XML middleware impl Application Perl interface Implementation can be any language chado query params chado xml Java driver JDBC Application query params chado xml Java any
Mapping with XORT XORT is a  specification  of how to map XML to the relational model generic: independent of chado and biology XML::XORT  is a perl  implementation  of the XORT specification Other implementations possible DBIx::DBStag  implements XORT xml->db Application language agnostic Easily wrapped for other languages
Highlights Proposal: XML mapping specification for Chado Tools Real Case
XORT Mapping Elements Table Column (except DB-specific value, e.g primary key in Chado schema -- not visible in XML) Attributes few and generic: transaction and reference control Element nesting column within table joined table within table -- joining column is implicit foreign key table within foreign key column Modules No module distinctions in chadoXML Limitations of DTD Cardinality,  NULLness, data type
Transactions and Operations Lookup - Select only Insert - Insert explicitly Delete - Unique identifier with unique key(s) -  One record per operation  Update - Two elements - Unique identifier with unique key(s) - One record per operation Force Combination of lookup, insert and update (if not lookup, then insert, else update)
Referencing Objects By global accession -  Format: dbname:accession[.version] -  Only for dbxref, feature ?, cvterm ? By a pre-defined local id -  Allows reference to objects in same file -  Need not be in DB -  Can be any symbol By lookup using unique key value(s) - Object can be in file or DB Implicitly, using foreign key to identify information in the related link table
Object Reference   By Global Accession <feature> <uniquename>CG3123</uniquename> <type_id>gene</type_id> <feature_relationship> <subject_id> Gadfly:CG3123-RA:1 </subject_id> <type_id>producedby</type_id> <feature_relationship> …… </feature> </feature>
Object Reference   By Local ID <cv id=“ SO ”> <name>Sequence Ontology</name> </cv> <cvterm id=“ exon ”> <cv_id> SO </cv_id> <name>exon</name> </cvterm> <feature> <type_id> exon </type_id>
Object Reference   By key Value (s) <feature> <type_id>      <cvterm>          <cv_id> <cv> <name>Sequence Ontology</name>                 </cv> </cv_id>          <name>exon</name>      </cvterm>   </type_id> … .       
ChadoXML Example <cv id=“ SO ”> <name>Sequence Ontology</name> </cv> <feature  op =“lookup”  id =“CG3312”> <uniquename>CG3312</uniquename> <type_id> <cvterm> <name>gene</name> <cv_id> SO </cv_id> </cvterm> <type_id> <organism_id> <feature_relationship> <subject_id> Gadfly:CG3312-RA </subject_id> <type_id> producedby </type_id> </feature_relationship> </feature>
Schema-Driven Tools DTD Generator: DDL-DTD Validator DB Not connected Syntax verification: legal XML, correct element nesting Some Semantic verfication: NULLness, cardinality, local ID reference DB Connected: reference validation Loader: XML->DB Dumper:DB->XML Driven by XML “dumpspec” XORTDiff: diff two XORT XML files
DumpSpec Driven Dumper Default behavior: given an object class and ID, dump all direct values and link tables, with refs to foreign keys. Non-default behavior specified by XML dumpspecs using same DTD with a few additions: attribute dump= all | cols | select | none attribute test = yes | no element _sql element _appdata Workaround with views, _sql Current use cases: Dump a gene for a gene detail page Dump a scaffold for Apollo
<feature dump=&quot;all&quot;> <uniquename test=&quot;yes&quot;>CG3312</uniquename> <!-- get all mRNAs of this gene --> <feature_relationship dump=&quot;all&quot;> <subject_id test=&quot;yes&quot;> <feature> <type_id><cvterm> <name>mRNA</name> </cvterm> </type_id> </feature> </subject_id> <subject_id> <feature dump=&quot;all&quot;>  <!-- get all exons of those mRNAs -->   <feature_relationship dump=&quot;all&quot;> <subject_id test=&quot;yes&quot;> <feature> <type_id> ><cvterm><name>exon</name> </cvterm> </type_id> </feature> </subject_id> <subject_id> <feature dump=&quot;all“/>  </subject_id> </feature_relationship> </feature> DumpSpec Sample
Use Case 1 Chado <-> Apollo Interaction XML Dumper XML Loader + validator GameBridge Chado GAME XML Chado XML GAME XML Chado XML GameBridge DumpSpec1 DDL XML Schema By_product Apollo
Use Case 2 Gene Page Dataflow Chado XML Dumper Chado XML DumpSpec2 acode FlyBase
To Do Lists External Object reference Dump with auto-generated XML Schema Output human-friendly
Resources Today’s slides XORT package  http://www.gmod.org Protocol draft submit to Current Protocol In Bioinformatics Using  chado  to Store Genome Annotation Data
XORT Key points Application language-neutral reusable from within multiple languages and applications Where does the domain logic live? Unlike objects, XML does not have ‘behaviour’ One solution: ChadoXML Services Another solution: Inside the DBMS
ChadoXML Services XML:: XORT Application XORT interface chado query params chado xml driver DBI ChadoXML services interface Format converters, dumpers chado xml query params chado xml other xml XSLT sequence domain logic C Perl Ontology services Java XQuery Lisp Prolog companalysis logic Java Perl
ChadoXML Services Application ChadoXML services interface Format converters, dumpers chado xml query params chado xml other xml XSLT sequence domain logic C Perl Ontology services Java XQuery Lisp Prolog Can be independent of DB companalysis logic Java Perl
PL/PgSQL Function Impl DB Functions API Application Implementation  inside  database Chado db function calls sql result objects DB Functions API posgresql driver JDBC/ DBI sequence library C
PL/PgSQL Function Impl DB Functions API Existing functions Chado DB Functions API cv module get_all_subject_ids(cvterm_id  int ); get_all_object_ids(cvterm_id  int ); fill_cvtermpath(cv_id  int );
PL/PgSQL Function Impl DB Functions API Existing functions Chado DB Functions API sequence module get_sub_feature_ids(feature_id  int ) featuresplice(fmin int, fmax  int ) get_subsequence(srcfeature_id  int , fmin int, fmax int, strand  int ) next_uniquename()
Putting it together: storing ontologies in chado Chado XML XML::XORT   or DBIx::DBStag oboxml_to_ chadoxml.xsl obo-edit Obo XML cv fill_cvtermpath() owl_to_ oboxml.xsl protege OWL
Benefits Issue: Language mismatch XORT dumpspecs and sql functions a more natural fit for application languages Issue: Data structure mismatch XML maps naturally to objects and structs Issue: Repetitive code XORT dumpspecs centralize db-fetch code XORT loader centralizes db-store code Issue: No centralized domain logic domain logic can be encoded in: PostgreSQL functions and triggers ChadoXML services
Other issues Speed? chained transformations may be slower (-) generic code is often slower (-) single point for optimization(+) Verbosity inevitable with a normalized database  reduced with XORT macros  Portability XORT highly portable (+) PostgreSQL functions must be manually ported to different DBMSs (-)
Current plans XORT wrappers Improving efficiency Documentation Extend PostgreSQL function repertoire More ChadoXML XSLTs ChadoXML adapters CGL Apollo BioPerl - Bio::{Seq,Search,Tree,..}IO::chadoxml
Conclusions ChadoXML a common GMOD format converted to other formats with XSLTs XORT centralises database interoperation logic PostgreSQL functions useful for certain kinds of domain logic Object APIs still required by many applications can be layered on top of XORT if so desired
Thanks to… Richard Bruskiewich Scott Cain Allen Day Karen Eilbeck Dave Emmert William Gelbart Mark Gibson Don Gilbert Aubrey de Grey Nomi Harris Stan Letovsky Suzanna Lewis Aaron Mackey Sima Misra Emmannel Mongin Simon Prochnik Gerald Rubin Susan Russo ShengQiang Shu Chris Smith Frank Smutniak Lincoln Stein Colin Wiel Mark Yandell Peili Zhang Mark Zythovicz

Chado-XML

  • 1.
    Chado and interoperabilityChris Mungall, BDGP Pinglei Zhou, FlyBase-Harvard
  • 2.
    Databases and applicationsSQL ? How do we get databases and applications speaking to one another? Application Application Java Perl Chado seq cv rad genetic phylo pub
  • 3.
    Databases and applicationsposgresql driver posgresql driver Application Application SQL Java Perl Generic database interfaces only solve part of the problem JDBC DBI They let us embed SQL inside application code Chado seq cv rad genetic phylo pub method calls row objects method calls perl arrays
  • 4.
    Why this alone isn’t enough Interfacing applications to databases is a tricky business… Issue: Language mismatch Issue: Data structure mismatch Issue: Repetitive code Issue: No centralized domain logic
  • 5.
    Language mismatch Stringsql = &quot; SELECT srcfeature_id, fmax, fmin “+ “ FROM featureloc &quot;+ &quot; WHERE feature_id =&quot; + featId; try { Statement s = conn.createStatement(); ResultSet rs = s.executeQuery(sql); rs.next(); sourceFeatureId = rs.getInt(&quot; srcfeature_id &quot;); fmin = rs.getInt(&quot; fmin &quot;); fmax = rs.getInt(&quot; fmax &quot;); } catch (SQLException sqle) { System.err.println(this.getClass() + &quot;: SQLException retrieving feature loc&quot; + &quot; for feature_id = &quot; + featId); sqle.printStackTrace(System.err); }
  • 6.
    Data structure mismatchDifferent formalisms X relations - set theoretic - relational algebra classes and structs - pointers - programs
  • 7.
    Repetitive code Databasefetch pattern construct, ask, transform, repeat, stitch Example: fetching gene models fetch genes fetch transcripts fetch exons, polypeptides fetch ancillary data (props, cvs, pubs, syns, etc) Optimisation is difficult
  • 8.
    No centralized domainlogic Examples of domain logic: project a feature onto a virtual contig revcomp or translate a sequence search by ontology term delete a gene model Domain logic should be reusable by different applications
  • 9.
    A solution:Object OrientedAPIs biosql adapter chado adapter Application Perl Different perl applications share the same API API Different schemas can be added by writing adapters chado method calls domain objects Perl driver DBI biosql driver DBI Application Perl domain objects
  • 10.
    How do OOAPIs help? Issue: Language mismatch Seperation of interface from implementation Issue: Data structure mismatch API talks objects adapters hide and deal with conversion Issue: Repetitive code code centralized in both API and adapter Issue: No centralized domain logic object model encapsulates domain logic object model can be used independently of database
  • 11.
    How do OOAPIs hinder? Writing or generating adapters brittle, difficult to maintain Restrictive canned parameterized queries v s query language Application language bound very difficult to use a perl API from java Application bound sometimes generic, but often limited to one application Opaque domain logic
  • 12.
    XML can helpXML is application-language neutral XML can be used to specify: data transactions queries and query constraints XML can be used within both application languages and specialized XML languages XPath XSLT XQuery
  • 13.
    XML middleware implApplication Perl Database XML as lingua franca interface chado query params chado xml Perl driver DBI Application query params chado xml Java any Generic SQL to XML mapper OO APIs can be implemented on top of XML layer
  • 14.
    XML middleware implApplication Perl interface Implementation can be any language chado query params chado xml Java driver JDBC Application query params chado xml Java any
  • 15.
    Mapping with XORTXORT is a specification of how to map XML to the relational model generic: independent of chado and biology XML::XORT is a perl implementation of the XORT specification Other implementations possible DBIx::DBStag implements XORT xml->db Application language agnostic Easily wrapped for other languages
  • 16.
    Highlights Proposal: XMLmapping specification for Chado Tools Real Case
  • 17.
    XORT Mapping ElementsTable Column (except DB-specific value, e.g primary key in Chado schema -- not visible in XML) Attributes few and generic: transaction and reference control Element nesting column within table joined table within table -- joining column is implicit foreign key table within foreign key column Modules No module distinctions in chadoXML Limitations of DTD Cardinality, NULLness, data type
  • 18.
    Transactions and OperationsLookup - Select only Insert - Insert explicitly Delete - Unique identifier with unique key(s) - One record per operation Update - Two elements - Unique identifier with unique key(s) - One record per operation Force Combination of lookup, insert and update (if not lookup, then insert, else update)
  • 19.
    Referencing Objects Byglobal accession - Format: dbname:accession[.version] - Only for dbxref, feature ?, cvterm ? By a pre-defined local id - Allows reference to objects in same file - Need not be in DB - Can be any symbol By lookup using unique key value(s) - Object can be in file or DB Implicitly, using foreign key to identify information in the related link table
  • 20.
    Object Reference By Global Accession <feature> <uniquename>CG3123</uniquename> <type_id>gene</type_id> <feature_relationship> <subject_id> Gadfly:CG3123-RA:1 </subject_id> <type_id>producedby</type_id> <feature_relationship> …… </feature> </feature>
  • 21.
    Object Reference By Local ID <cv id=“ SO ”> <name>Sequence Ontology</name> </cv> <cvterm id=“ exon ”> <cv_id> SO </cv_id> <name>exon</name> </cvterm> <feature> <type_id> exon </type_id>
  • 22.
    Object Reference By key Value (s) <feature> <type_id>     <cvterm>         <cv_id> <cv> <name>Sequence Ontology</name>                </cv> </cv_id>         <name>exon</name>     </cvterm> </type_id> … .       
  • 23.
    ChadoXML Example <cvid=“ SO ”> <name>Sequence Ontology</name> </cv> <feature op =“lookup” id =“CG3312”> <uniquename>CG3312</uniquename> <type_id> <cvterm> <name>gene</name> <cv_id> SO </cv_id> </cvterm> <type_id> <organism_id> <feature_relationship> <subject_id> Gadfly:CG3312-RA </subject_id> <type_id> producedby </type_id> </feature_relationship> </feature>
  • 24.
    Schema-Driven Tools DTDGenerator: DDL-DTD Validator DB Not connected Syntax verification: legal XML, correct element nesting Some Semantic verfication: NULLness, cardinality, local ID reference DB Connected: reference validation Loader: XML->DB Dumper:DB->XML Driven by XML “dumpspec” XORTDiff: diff two XORT XML files
  • 25.
    DumpSpec Driven DumperDefault behavior: given an object class and ID, dump all direct values and link tables, with refs to foreign keys. Non-default behavior specified by XML dumpspecs using same DTD with a few additions: attribute dump= all | cols | select | none attribute test = yes | no element _sql element _appdata Workaround with views, _sql Current use cases: Dump a gene for a gene detail page Dump a scaffold for Apollo
  • 26.
    <feature dump=&quot;all&quot;> <uniquenametest=&quot;yes&quot;>CG3312</uniquename> <!-- get all mRNAs of this gene --> <feature_relationship dump=&quot;all&quot;> <subject_id test=&quot;yes&quot;> <feature> <type_id><cvterm> <name>mRNA</name> </cvterm> </type_id> </feature> </subject_id> <subject_id> <feature dump=&quot;all&quot;> <!-- get all exons of those mRNAs --> <feature_relationship dump=&quot;all&quot;> <subject_id test=&quot;yes&quot;> <feature> <type_id> ><cvterm><name>exon</name> </cvterm> </type_id> </feature> </subject_id> <subject_id> <feature dump=&quot;all“/> </subject_id> </feature_relationship> </feature> DumpSpec Sample
  • 27.
    Use Case 1Chado <-> Apollo Interaction XML Dumper XML Loader + validator GameBridge Chado GAME XML Chado XML GAME XML Chado XML GameBridge DumpSpec1 DDL XML Schema By_product Apollo
  • 28.
    Use Case 2Gene Page Dataflow Chado XML Dumper Chado XML DumpSpec2 acode FlyBase
  • 29.
    To Do ListsExternal Object reference Dump with auto-generated XML Schema Output human-friendly
  • 30.
    Resources Today’s slidesXORT package http://www.gmod.org Protocol draft submit to Current Protocol In Bioinformatics Using chado to Store Genome Annotation Data
  • 31.
    XORT Key pointsApplication language-neutral reusable from within multiple languages and applications Where does the domain logic live? Unlike objects, XML does not have ‘behaviour’ One solution: ChadoXML Services Another solution: Inside the DBMS
  • 32.
    ChadoXML Services XML::XORT Application XORT interface chado query params chado xml driver DBI ChadoXML services interface Format converters, dumpers chado xml query params chado xml other xml XSLT sequence domain logic C Perl Ontology services Java XQuery Lisp Prolog companalysis logic Java Perl
  • 33.
    ChadoXML Services ApplicationChadoXML services interface Format converters, dumpers chado xml query params chado xml other xml XSLT sequence domain logic C Perl Ontology services Java XQuery Lisp Prolog Can be independent of DB companalysis logic Java Perl
  • 34.
    PL/PgSQL Function ImplDB Functions API Application Implementation inside database Chado db function calls sql result objects DB Functions API posgresql driver JDBC/ DBI sequence library C
  • 35.
    PL/PgSQL Function ImplDB Functions API Existing functions Chado DB Functions API cv module get_all_subject_ids(cvterm_id int ); get_all_object_ids(cvterm_id int ); fill_cvtermpath(cv_id int );
  • 36.
    PL/PgSQL Function ImplDB Functions API Existing functions Chado DB Functions API sequence module get_sub_feature_ids(feature_id int ) featuresplice(fmin int, fmax int ) get_subsequence(srcfeature_id int , fmin int, fmax int, strand int ) next_uniquename()
  • 37.
    Putting it together:storing ontologies in chado Chado XML XML::XORT or DBIx::DBStag oboxml_to_ chadoxml.xsl obo-edit Obo XML cv fill_cvtermpath() owl_to_ oboxml.xsl protege OWL
  • 38.
    Benefits Issue: Languagemismatch XORT dumpspecs and sql functions a more natural fit for application languages Issue: Data structure mismatch XML maps naturally to objects and structs Issue: Repetitive code XORT dumpspecs centralize db-fetch code XORT loader centralizes db-store code Issue: No centralized domain logic domain logic can be encoded in: PostgreSQL functions and triggers ChadoXML services
  • 39.
    Other issues Speed?chained transformations may be slower (-) generic code is often slower (-) single point for optimization(+) Verbosity inevitable with a normalized database reduced with XORT macros Portability XORT highly portable (+) PostgreSQL functions must be manually ported to different DBMSs (-)
  • 40.
    Current plans XORTwrappers Improving efficiency Documentation Extend PostgreSQL function repertoire More ChadoXML XSLTs ChadoXML adapters CGL Apollo BioPerl - Bio::{Seq,Search,Tree,..}IO::chadoxml
  • 41.
    Conclusions ChadoXML acommon GMOD format converted to other formats with XSLTs XORT centralises database interoperation logic PostgreSQL functions useful for certain kinds of domain logic Object APIs still required by many applications can be layered on top of XORT if so desired
  • 42.
    Thanks to… RichardBruskiewich Scott Cain Allen Day Karen Eilbeck Dave Emmert William Gelbart Mark Gibson Don Gilbert Aubrey de Grey Nomi Harris Stan Letovsky Suzanna Lewis Aaron Mackey Sima Misra Emmannel Mongin Simon Prochnik Gerald Rubin Susan Russo ShengQiang Shu Chris Smith Frank Smutniak Lincoln Stein Colin Wiel Mark Yandell Peili Zhang Mark Zythovicz