2008 11 13 Hcls Call


Social tagging to enrich biomedical data integration.

Published in: Technology

  1. 1. FlyWeb: the way to go for biological data integration
       • Jun Zhao, Alistair Miles and Graham Klyne
       • Image Bioinformatics Research Group
       • Department of Zoology
       • University of Oxford
  2. 2. FlyWeb Application
       • To answer questions such as "What does this gene do?"
           • Gene expression images
           • Sequence and ESTs (expressed sequence tags) of the gene
           • Publications about the gene
           • ...
       • A first example of the Image Web that our group is developing
       • Investigate the feasibility of using existing Semantic Web tools and technologies for real applications
  3. 3. Gene expression images
       • Reveal gene expression patterns at different developmental stages
       • Important for identifying genes of interest and verifying a picture of probable gene functions
  4. 4. FlyWeb demonstration
       • Run application: [go]
       • Two examples:
           • Single gene query (aos1)
           • Use gene synonyms to enhance gene matching (rbf)
  5. 6. More than one synonym for gene “rbf”
  6. 8. How does it work?
       • Data from 3 independent sources:
           • FlyBase: model organism reference database, gene names and identifiers
           • Berkeley Drosophila Genome Project (BDGP): embryo in situ images
           • FlyTED: testis in situ images
       • All data accessed via SPARQL
       • Pure Ajax user application
       • Essentially, a mashup using a SPARQL API
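The "mashup using a SPARQL API" idea can be sketched in a few lines. This is a minimal illustration, not the FlyWeb code: the endpoint URL and the sample response are invented, and only the `query` request parameter and the shape of the SPARQL JSON results format come from the SPARQL standards.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint URL; the real FlyWeb endpoints are not given in the slides.
ENDPOINT = "http://example.org/sparql/flybase"


def build_request_url(endpoint: str, query: str) -> str:
    """Return a SPARQL protocol GET URL carrying the query in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


def bindings_to_rows(results_json: str) -> list:
    """Flatten a SPARQL JSON results document into plain {variable: value} rows."""
    doc = json.loads(results_json)
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in doc["results"]["bindings"]
    ]


# A hand-written sample response in the SPARQL JSON results format (made-up data).
SAMPLE = json.dumps({
    "head": {"vars": ["gene", "symbol"]},
    "results": {"bindings": [
        {"gene": {"type": "uri", "value": "http://example.org/gene/1"},
         "symbol": {"type": "literal", "value": "aos1"}},
    ]},
})

url = build_request_url(ENDPOINT, "SELECT * WHERE { ?s ?p ?o } LIMIT 1")
rows = bindings_to_rows(SAMPLE)
```

A widget would send a request to `url` and pass rows like these to its renderer; no network call is made in the sketch itself.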
  7. 9. The client side
       • FlyUI:
           • A library of JavaScript widgets as front ends to SPARQL data sources
           • Built on the Yahoo User Interface (YUI) library
       • Widgets are composed in a browser to create the complete application
       • Each widget provides:
           • A Service that implements SPARQL queries
           • A Model encapsulating SPARQL query results
           • A Renderer
       [Figure: the in situ search application, composed of the GeneFinder widget, the FlyTED image widget and the BDGP image widget]
  8. 10. Gene name mapping
       • FlyTED and BDGP use different gene names
           • FlyTED data are derived from spreadsheets with an imperfectly controlled gene name vocabulary
           • BDGP's data are annotated using FlyBase's unique FBgn numbers
       • Use FlyBase for automatic gene mapping
       • Additional input from scientists for disambiguating many-to-many mappings
       • Mappings are stored as a JSON file to assist the “GeneFinder” widget (no use for RDF/OWL reasoning at this stage)
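A rough sketch of how a JSON mapping file could support name lookup follows. The file layout, the identifiers and the function names are all assumptions made for illustration; the slides do not describe the actual structure of the FlyWeb mapping file.

```python
import json

# Hypothetical mapping: identifier -> known names. Identifiers are placeholders,
# not real FBgn numbers, and the layout is assumed, not taken from FlyWeb.
MAPPING_JSON = """
{
  "FBgn-example-1": ["rbf", "Rbf1", "Retinoblastoma-family protein"],
  "FBgn-example-2": ["aos1", "Aos1"]
}
"""


def build_index(mapping_json: str) -> dict:
    """Invert the mapping into lower-cased name -> set of identifiers."""
    index = {}
    for gene_id, names in json.loads(mapping_json).items():
        for name in names:
            index.setdefault(name.lower(), set()).add(gene_id)
    return index


def find_genes(index: dict, user_input: str) -> list:
    """Return every identifier whose names match the input, ignoring case."""
    return sorted(index.get(user_input.strip().lower(), set()))


index = build_index(MAPPING_JSON)
```

Because each name maps to a set of identifiers, a name shared by several genes returns all candidates, which is where manual disambiguation by scientists would come in.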
  9. 11. SPARQL queries
       • Free-text matching
       • Case-insensitive searching
           • Very important for our users
           • Too expensive using SPARQL FILTER
           • Pre-generate lower-case gene names and load them into the FlyBase RDF DB

       Gene lookup query:

           SELECT * WHERE {
             ?gene fbutil:anyName "userInput"^^xs:string ;
                   a chado:Feature ;
                   chado:name ?symbol ;
                   chado:uniquename ?flybaseID .
             OPTIONAL { ?gene chado:dbxref [ chado:accession ?annotationSymbol ] . }
             OPTIONAL { ?gene chado:synonym [ chado:name ?synonym ] . }
             OPTIONAL { ?gene chado:synonym [ a syntype:FullName ; chado:name ?fullName ] . }
           }

       FlyTED image query:

           SELECT DISTINCT * WHERE {
             ?fullImageURL flyted:associatesToGene <geneName> ;
                           flyted:associatesToGene ?gene ;
                           flyted:thumbnail ?thumbnailURL ;
                           rdfs:seeAlso ?flytedURL ;
                           rdfs:label ?caption .
           }
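Pre-generated lower-case names only help if the user's input is normalised the same way before it is placed in the query. The sketch below shows one way that could be done; the function names are hypothetical, the query is a shortened form of the gene lookup on this slide, and prefix declarations are omitted because the slides do not give the namespace URIs.

```python
def escape_literal(text: str) -> str:
    """Escape characters that would break out of a double-quoted SPARQL literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def gene_query(user_input: str) -> str:
    """Build an exact-match lookup against pre-lower-cased names (no FILTER needed)."""
    name = escape_literal(user_input.strip().lower())
    return (
        "SELECT * WHERE {\n"
        f'  ?gene fbutil:anyName "{name}"^^xs:string ;\n'
        "        chado:name ?symbol ;\n"
        "        chado:uniquename ?flybaseID .\n"
        "}"
    )
```

The point of the design is that the match becomes a plain triple pattern, which a store can answer from its indexes, instead of a filter evaluated over every candidate name.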
  10. 12. The RDF data sources
       • FlyBase and BDGP: relational databases
       • FlyTED: an image repository built using EPrints
       • FlyAtlas (forthcoming): tissue-specific Drosophila gene expression levels, as a single spreadsheet
  11. 13. Creating RDF from data sources
       • D2RQ mapping
           • FlyBase and BDGP, native relational databases
           • Conservative mapping, with minimum interpretation
       • OAI2SPARQL
           • Harvesting N3 RDF metadata via the OAI-PMH protocol, which EPrints supports out of the box
           • Further details in the ESWC 2008 paper
       • Custom Python program
           • FlyAtlas
           • Generating N3 from a spreadsheet table
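Generating N3 from a spreadsheet table might look roughly like the following. This is not the FlyWeb program: the namespace, the column names and the sample row are invented for illustration, and string values are assumed to need no escaping.

```python
import csv
import io

# Hypothetical namespace and column names; the real FlyAtlas schema is not shown in the slides.
PREFIX = "@prefix ex: <http://example.org/flyatlas#> ."

# Made-up sample data standing in for a spreadsheet exported as CSV.
SAMPLE_CSV = "gene,tissue,level\nFBgn-example-1,testis,12.5\n"


def rows_to_n3(csv_text: str) -> str:
    """Emit one N3 statement per spreadsheet row, one property per column."""
    lines = [PREFIX, ""]
    reader = csv.DictReader(io.StringIO(csv_text))
    for i, row in enumerate(reader, start=1):
        level = float(row["level"])  # fails loudly on non-numeric cells
        lines.append(
            f'ex:row{i} ex:gene "{row["gene"]}" ; '
            f'ex:tissue "{row["tissue"]}" ; '
            f"ex:level {level} ."
        )
    return "\n".join(lines)


n3 = rows_to_n3(SAMPLE_CSV)
```

Mapping each column directly to a property, without interpreting the values, mirrors the conservative approach taken for the relational sources.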
  12. 14. More about the data sources
       • Bulk download
           • ~8m triples
           • ~1m triples
           • ~30,000 triples
       • SPARQL endpoint
       • Schema
  13. 15. SPARQL server
       • Amazon EC2 (Elastic Compute Cloud):
           • To run SPARQL endpoints
           • To host the demo you've just seen
       • Jena TDB as triple store
           • For better loading performance: ~6K tps for ~9M triples to Amazon Elastic Block Storage (EBS)
           • For better querying performance
       • SPARQLite
           • Home-grown SPARQL protocol implementation
           • More later
       • Apache, Tomcat, mod_jk, etc.
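To put the load figure in perspective, here is a back-of-the-envelope calculation. It assumes "tps" means triples per second and that the rate stays constant over the whole load, which real bulk loads rarely do.

```python
def load_seconds(triples: int, triples_per_second: int) -> float:
    """Estimated load time at a constant rate."""
    return triples / triples_per_second


seconds = load_seconds(9_000_000, 6_000)  # ~9M triples at ~6K triples per second
minutes = seconds / 60
```

Under those assumptions the full dataset loads in about 1,500 seconds, or roughly 25 minutes.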
  14. 16. SPARQLite protocol
       • Also a platform for exploring SPARQL service quality concerns (more later)
       • Motivation
           • Enable streaming
           • Create a database connection pool
       • Designed for Jena TDB/SDB + Postgres
       • Restricted forms of query (SELECT, ASK)
       • Restricted query result format (e.g. only JSON)
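Restricting an endpoint to SELECT and ASK requires deciding which form a query has before running it. The sketch below is a deliberately naive illustration and not how SPARQLite does it: it skips leading `BASE` and `PREFIX` declarations and inspects the next keyword, ignores comments, and a production service would rely on a real SPARQL parser instead.

```python
import re

ALLOWED_FORMS = {"SELECT", "ASK"}

# Matches one leading BASE <iri> or PREFIX name: <iri> declaration.
_PROLOGUE = re.compile(
    r"\s*(?:BASE\s*<[^>]*>|PREFIX\s+[^\s:]*:\s*<[^>]*>)", re.IGNORECASE
)


def query_form(query: str):
    """Return the upper-cased query form keyword, or None if none is found."""
    pos = 0
    while True:
        match = _PROLOGUE.match(query, pos)
        if not match:
            break
        pos = match.end()
    word = re.match(r"\s*([A-Za-z]+)", query[pos:])
    return word.group(1).upper() if word else None


def is_allowed(query: str) -> bool:
    """Accept only the query forms on the allow-list."""
    return query_form(query) in ALLOWED_FORMS
```

For example, `is_allowed("ASK { ?s ?p ?o }")` is true, while a `CONSTRUCT` or `DESCRIBE` query is rejected before it reaches the store.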
  15. 17. Lessons
       • RDF provides a uniform and flexible data model
           • RDF dump is cheaper and quicker
           • Maintaining a separate SPARQL endpoint for each data source makes handling data updates easier than with a data warehouse approach
       • RDF facilitates data re-use and re-purposing
       • SPARQL raises the point of departure for an application
       • Benefits for the future
           • Linking to other data sources
           • Querying genes using the Fly Anatomy ontology
           • Magic of inference
  16. 18. Performance
       • Loading: our datasets total ~10 million triples
           • Jena / RDB / Postgres: OK with <1M triples
           • Jena / SDB / Postgres: better, but load performance problems with larger datasets
           • Jena / TDB: much better load performance (~6K tps), even on a 32-bit system with Amazon EBS storage (but not so good with local EC2 store)
           • Virtuoso performs reasonably well
       • Querying, particularly text matching and case-insensitive search
           • Problems with using the SPARQL regex filter, the only mechanism for case-insensitive search in SPARQL
           • Tried with OpenLink Virtuoso, still ~10 seconds for a case-insensitive search
           • Any suggestions?
  17. 19. Further lessons
       • SPARQL results streaming
           • Resolves out-of-memory errors for large datasets
           • Joseki / SDB / Postgres can be made to stream results, but only over a single JDBC connection, which causes performance problems with concurrent requests
           • Therefore, SPARQLite
       • The openness of SPARQL:
           • SPARQL is an inherently open query language and protocol
           • Open endpoints are vulnerable to simple queries that can overload the service, exposing them to denial-of-service-style attacks (whether intended or not)
           • Futures: API key mechanism? Restricted SPARQL profiles?
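Streaming avoids out-of-memory errors because the server never holds the whole result set at once. A minimal sketch of the idea, assuming rows arrive from some iterator and treating every value as a plain literal for simplicity (this is an illustration, not the SPARQLite implementation):

```python
import json
from typing import Iterable, Iterator


def stream_results(variables: Iterable[str], rows: Iterable[dict]) -> Iterator[str]:
    """Yield a SPARQL JSON results document piece by piece, one binding at a time."""
    yield '{"head": {"vars": ' + json.dumps(list(variables)) + '}, "results": {"bindings": ['
    first = True
    for row in rows:
        # Simplification: every value is emitted as a plain literal.
        binding = {var: {"type": "literal", "value": str(val)} for var, val in row.items()}
        yield ("" if first else ", ") + json.dumps(binding)
        first = False
    yield "]}}"


# Made-up rows; in a server, each chunk would be written to the response as it is produced.
chunks = stream_results(["symbol"], iter([{"symbol": "aos1"}, {"symbol": "rbf"}]))
body = "".join(chunks)
```

Joining the chunks here is only for demonstration; the memory benefit comes from writing each chunk out and discarding it before the next row is fetched.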
  18. 20. Future directions
       • Adding new data sources:
           • FlyAtlas tissue-specific Drosophila gene expression levels
           • More information from FlyBase, e.g. references
       • More applications:
           • Find all the gene expression images of a gene's neighbours
           • Find all the genes related to “blood pressure”
           • ...
       • Linked data (dereferenceable, follow-your-nose)
           • We're thinking about this, but our application does not currently need it
       • How to control and predict quality of service for open SPARQL endpoints
  19. 21. Acknowledgement
       • Alistair Miles, Graham Klyne and David Shotton
       • Dr Helen White-Cooper and her research group
       • BBSRC, for funding the building of the FlyTED database
       • BDGP and FlyBase, for making the data available
       • JISC, for funding the FlyWeb project
       • The Jena team, esp. Andy Seaborne