2008 11 13 Hcls Call


Social tagging to enrich biomedical data integration.

Published in: Technology

    1. 1. FlyWeb: the way to go for biological data integration
        • Jun Zhao, Alistair Miles and Graham Klyne
        • Image Bioinformatics Research Group
        • Department of Zoology
        • University of Oxford
    2. 2. FlyWeb Application
        • To answer questions about "what does this gene do?"
            • Gene expression images
            • Sequence and ESTs (expressed sequence tags) of the gene
            • Publications about the gene
            • ....
        • A first example of the Image Web that our group is developing
        • Investigate the feasibility of existing Semantic Web tools and technologies for real applications
    3. 3. Gene expression images
        • Reveal gene expression patterns at different developmental stages
        • Important for identifying genes of interest and for verifying a picture of probable gene function
    4. 4. FlyWeb demonstration
        • http://openflydata.org/flyui/build/apps/imagemashup2/
            • Run application: [ go ]
        • Two examples:
            • Single gene query (aos1)
            • Use gene synonyms to enhance gene matching (rbf)
    5. 6. More than one synonym for gene “rbf”
    6. 8. How does it work?
        • Data from 3 independent sources:
            • www.flybase.org: model organism reference database, gene names and identifiers
            • www.fruitfly.org (BDGP): embryo in situ images
            • www.fly-ted.org: testis in situ images
        • All data accessed via SPARQL
        • Pure Ajax user application
        • Essentially, a mashup using a SPARQL API
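As a rough illustration of "all data accessed via SPARQL", the sketch below builds a SPARQL-protocol GET URL for one of the project's endpoints. It only constructs the URL and does not contact any server. The endpoint addresses are the ones given on the data sources slide; the function name and the sample ASK query are assumptions made for this example, and the real client is the JavaScript FlyUI library, not Python.

```python
from urllib.parse import urlencode

# Endpoint URLs as listed on the "More about the data sources" slide.
ENDPOINTS = {
    "flybase": "http://openflydata.org/query/flybase",
    "bdgp": "http://openflydata.org/query/bdgp",
    "flyted": "http://openflydata.org/query/flyted",
}


def sparql_get_url(source: str, query: str) -> str:
    """Return a SPARQL-protocol GET URL; the query text travels in the 'query' parameter."""
    return ENDPOINTS[source] + "?" + urlencode({"query": query})


# Hypothetical query, used only to show the encoding.
url = sparql_get_url("flyted", "ASK { ?s ?p ?o }")
print(url)
```

Because each source keeps its own endpoint, a mashup of this kind issues one such request per source and combines the results in the browser.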
    7. 9. The client side
        • FlyUI:
            • A library of JavaScript widgets as front ends to SPARQL data sources
            • Built on the Yahoo User Interface (YUI) library
        • Widgets are composed in a browser to create the complete application
        • Each widget provides:
            • A Service that implements SPARQL queries
            • A Model encapsulating SPARQL query results
            • A Renderer
        [Diagram: the in situ search application, composed of the GeneFinder Widget, the FlyTED Image Widget and the BDGP Image Widget]
    8. 10. Gene name mapping
        • FlyTED and BDGP use different gene names
            • FlyTED data are derived from spreadsheets with an imperfectly controlled gene name vocabulary
            • BDGP's data are annotated using FlyBase's unique FBgn numbers
        • Use FlyBase for automatic gene mapping
        • Additional input from scientists for disambiguating many-to-many mappings
        • Mappings are stored as a JSON file to assist the “GeneFinder” widget (no use for RDF/OWL reasoning at this stage)
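The slide does not show the layout of the JSON mapping file, so the following is only a minimal sketch of how such a lookup could work. The structure (free-text name mapped to a list of FlyBase IDs), the gene names and the FBgn numbers are all invented for illustration.

```python
import json

# Hypothetical mapping content: free-text gene name -> list of FlyBase IDs.
# The names and FBgn numbers below are made up for this example.
MAPPING_JSON = '{"gene-a": ["FBgn0000001"], "gene-b": ["FBgn0000002", "FBgn0000003"]}'


def lookup(mapping: dict, name: str) -> list:
    """Return every FlyBase ID recorded for a name (case-insensitive); empty list if none."""
    return mapping.get(name.strip().lower(), [])


mapping = json.loads(MAPPING_JSON)
ids = lookup(mapping, " Gene-B ")
print(ids)
```

Returning a list rather than a single ID keeps ambiguous names visible, so the caller can ask the user to choose instead of silently picking one gene.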
    9. 11. SPARQL queries
        • Free-text matching
        • Case-insensitive searching
            • Very important for our users
            • Too expensive using a SPARQL FILTER
            • Pre-generate lower-case gene names and load them into the FlyBase RDF DB

        Gene query (userInput stands for the search term):

            SELECT * WHERE {
              ?gene fbutil:anyName "userInput"^^xs:string ;
                    a chado:Feature ;
                    chado:name ?symbol ;
                    chado:uniquename ?flybaseID .
              OPTIONAL { ?gene chado:dbxref [ chado:accession ?annotationSymbol ] . }
              OPTIONAL { ?gene chado:synonym [ chado:name ?synonym ] . }
              OPTIONAL { ?gene chado:synonym [ a syntype:FullName ; chado:name ?fullName ] . }
            }

        FlyTED image query (geneName stands for the gene name appended to the URI):

            SELECT DISTINCT * WHERE {
              ?fullImageURL flyted:associatesToGene <http://openflydata.org/id/flyted/gene-geneName> ;
                            flyted:associatesToGene ?gene ;
                            flyted:thumbnail ?thumbnailURL ;
                            rdfs:seeAlso ?flytedURL ;
                            rdfs:label ?caption
            }
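The case-insensitive search strategy on this slide can be sketched as two steps: pre-generate lower-case name variants at load time, then lower-case the user's input and match it exactly instead of using a regex FILTER. The function names below are hypothetical, the escaping is deliberately simplified, and the prefix declarations are omitted just as they are on the slide.

```python
def lowercase_names(names):
    """Pre-generate the lower-case forms that would be loaded alongside the original names."""
    return sorted({n.lower() for n in names})


def escape_literal(text: str) -> str:
    """Escape backslashes and double quotes for use inside a SPARQL string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def gene_query(user_input: str) -> str:
    """Build an exact-match query on the lower-cased input, so no regex FILTER is needed."""
    literal = escape_literal(user_input.strip().lower())
    return (
        'SELECT * WHERE { ?gene fbutil:anyName "%s"^^xs:string ; '
        "chado:name ?symbol ; chado:uniquename ?flybaseID . }" % literal
    )


print(lowercase_names(["Rbf", "RBF", "aos1"]))
print(gene_query(" Rbf "))
```

The trade-off is extra triples in the store in exchange for a plain indexed lookup at query time.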
    10. 12. The RDF data sources
        • FlyBase and BDGP: relational databases
        • FlyTED, an image repository built using Eprints
        • FlyAtlas (forthcoming), tissue-specific Drosophila gene expression levels, as a single spreadsheet
    11. 13. Creating RDF from data sources
        • D2RQ mapping
            • FlyBase and BDGP, native relational databases
            • Conservative mapping, with minimum interpretation
        • OAI2SPARQL
            • Harvesting N3 RDF metadata via the OAI-PMH protocol, which Eprints supports natively
            • Further details in the ESWC 2008 paper
        • Custom Python program
            • FlyAtlas
            • Generating N3 from a spreadsheet table
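The custom FlyAtlas converter itself is not shown in the slides. As a minimal sketch of the general idea (one N3 statement group per spreadsheet row), the example below reads a CSV export; the column names, the sample row and the `ex:` namespace are placeholders, not the project's actual schema.

```python
import csv
import io

# Hypothetical spreadsheet export: column names and values are illustrative only.
SHEET = "gene,tissue,level\nFBgn0000001,testis,12.5\n"

PREFIX = "@prefix ex: <http://example.org/flyatlas/> .\n"  # placeholder namespace


def rows_to_n3(text: str) -> str:
    """Emit one blank-node measurement per spreadsheet row in N3 syntax."""
    out = [PREFIX]
    for row in csv.DictReader(io.StringIO(text)):
        out.append(
            '[] ex:gene "%s" ; ex:tissue "%s" ; ex:level %s .\n'
            % (row["gene"], row["tissue"], float(row["level"]))
        )
    return "".join(out)


n3 = rows_to_n3(SHEET)
print(n3)
```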
    12. 14. More about the data sources
        • Bulk download
            • http://openflydata.org/dump/flybase, ~8m triples
            • http://openflydata.org/dump/bdgp, ~1m triples
            • http://openflydata.org/dump/flyted, ~30,000 triples
        • SPARQL endpoints
            • http://openflydata.org/query/flybase
            • http://openflydata.org/query/bdgp
            • http://openflydata.org/query/flyted
        • Schema
            • http://purl.org/net/chado/schema/
            • http://purl.org/net/flybase/synonym-types/
            • http://purl.org/net/bdgp/schema/
    13. 15. SPARQL server
        • Amazon EC2 (Elastic Compute Cloud):
            • To run SPARQL endpoints
            • To host the demo you've just seen
        • Jena TDB as triple store
            • For better loading performance: ~6K tps (triples per second) for ~9M triples to Amazon Elastic Block Store (EBS)
            • For better querying performance
        • SPARQLite
            • Home-grown SPARQL protocol implementation
            • More later
        • Apache, Tomcat, mod_jk, etc.
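To put the loading figures in perspective, a quick back-of-the-envelope calculation follows. It assumes "tps" means triples per second and treats the approximate figures from the slide as exact, so the result is only an order-of-magnitude estimate.

```python
TRIPLES = 9_000_000   # "~9M triples" from the slide
RATE_TPS = 6_000      # "~6K tps", read here as triples per second

load_seconds = TRIPLES / RATE_TPS
load_minutes = load_seconds / 60
print(f"{load_seconds:.0f} s, about {load_minutes:.0f} minutes")  # 1500 s, about 25 minutes
```

At that rate a full reload of all three datasets would fit comfortably within an hour.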
    14. 16. SPARQLite protocol
        • http://sparqlite.googlecode.com
            • Also, a platform for exploring SPARQL service quality concerns, more later
        • Motivation
            • Enable streaming
            • Create a database connection pool
        • Designed for Jena TDB/SDB + Postgres
        • Restricted forms of query (SELECT, ASK)
        • Restricted query result format (e.g. only JSON)
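One way to picture the "restricted forms of query" idea is a gate that inspects the query form before anything reaches the store. The sketch below is not SPARQLite's code: it is a deliberately naive, regex-based check that skips leading BASE and PREFIX declarations and does not handle comments. A production service would rely on a real SPARQL parser instead.

```python
import re

ALLOWED_FORMS = {"SELECT", "ASK"}

# Skip leading BASE/PREFIX declarations, then capture the first keyword.
_PROLOGUE = re.compile(
    r"\s*(?:(?:BASE\s*<[^>]*>|PREFIX\s+[^\s:]*:\s*<[^>]*>)\s*)*(\w+)",
    re.IGNORECASE,
)


def query_form(query: str):
    """Return the upper-cased query form keyword, or None if none is found."""
    match = _PROLOGUE.match(query)
    return match.group(1).upper() if match else None


def is_allowed(query: str) -> bool:
    """Accept only the query forms named on the slide (SELECT and ASK)."""
    return query_form(query) in ALLOWED_FORMS


print(is_allowed("PREFIX ex: <http://example.org/> SELECT * WHERE { ?s ?p ?o }"))
print(is_allowed("CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o }"))
```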
    15. 17. Lessons
        • RDF provides a uniform and flexible data model
            • RDF dump is cheaper and quicker
            • Maintaining a separate SPARQL endpoint for each data source makes data updates easier to handle than a data warehouse approach
        • RDF facilitates data re-use and re-purposing
        • SPARQL raises the point of departure for an application
        • Benefits for the future
            • Linking to other data sources
            • Querying genes using the Fly Anatomy ontology
            • Magic of inference
    16. 18. Performance
        • Loading: our datasets total ~10 million triples
            • Jena / RDB / Postgres: OK with <1M triples
            • Jena / SDB / Postgres: better, but load performance problems with larger datasets
            • Jena / TDB: much better load performance (~6K tps), even on a 32-bit system with Amazon EBS storage (but not so good with the local EC2 store)
            • Virtuoso performs reasonably well
        • Querying, particularly text matching and case-insensitive search
            • Problems with using the SPARQL regex filter, the only mechanism for case-insensitive search in SPARQL
            • Tried with OpenLink Virtuoso, still ~10 seconds for a case-insensitive search
            • Any suggestions?
    17. 19. Further lessons
        • SPARQL results streaming
            • Resolves out-of-memory errors for large datasets
            • Joseki / SDB / Postgres can be made to stream results, but only over a single JDBC connection, which causes performance problems with concurrent requests
            • Therefore, SPARQLite
        • The openness of SPARQL:
            • SPARQL is an inherently open query language and protocol
            • Open endpoints are vulnerable to simple queries that can overload the service, exposing them to denial-of-service-style attacks (whether intended or not)
            • Futures: API key mechanism? Restricted SPARQL profiles?
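The memory benefit of streaming can be illustrated with a generator that writes a SPARQL JSON results document piece by piece, so the full result set never has to be held at once. This is a conceptual sketch only, not how SPARQLite or Joseki is implemented; the function name and the sample bindings are made up.

```python
import json


def stream_results(variables, rows):
    """Yield a SPARQL JSON results document in chunks; rows can be any iterator."""
    yield '{"head": {"vars": %s}, "results": {"bindings": [' % json.dumps(list(variables))
    first = True
    for row in rows:
        if not first:
            yield ", "
        first = False
        yield json.dumps(row)  # one binding at a time
    yield "]}}"


# Hypothetical bindings produced lazily, standing in for a database cursor.
rows = ({"gene": {"type": "literal", "value": "g%d" % i}} for i in range(3))
doc = "".join(stream_results(["gene"], rows))
print(doc)
```

In a server, each yielded chunk would be written to the HTTP response as soon as it is produced; the chunks are joined here only so the output can be inspected.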
    18. 20. Future directions
        • Adding new data sources:
            • FlyAtlas tissue-specific Drosophila gene expression levels
            • More information from FlyBase, e.g. references
        • More applications:
            • Find all the gene expression images of a gene's neighbours
            • Find all the genes related to “blood pressure”
            • ...
        • Linked data (dereferenceable, follow-your-nose)
            • We're thinking about this, but our application does not currently need it
        • How to control and predict quality of service for open SPARQL endpoints
    19. 21. Acknowledgement
        • Alistair Miles, Graham Klyne and David Shotton
        • Dr Helen White-Cooper and her research group
        • BBSRC, for funding the building of the FlyTED database
        • BDGP and FlyBase, for making the data available
        • JISC, for funding the FlyWeb project
        • The Jena team, esp. Andy Seaborne