2010 03 Lodoxf Openflydata

3,741 views

Published on

0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
3,741
On SlideShare
0
From Embeds
0
Number of Embeds
4
Actions
Shares
0
Downloads
0
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide
  • Primary_spermatocyte Cup-like_pattern_of_distal_end_of_elongating_spermatids Gene Name: CG30044 GG Gene Name: CG30044 Slide Name: CG30044-mip40 Strain: wt Apical Signals: off Expression Location: Primary_spermatocyte Cup-like_pattern_of_distal_end_of_elongating_spermatids
  • Note that the thumbnail images are retrieved from the original web sites
  • FlyUI: a library of Javascript widgets as front ends to SPARQL data sources Built on Yahoo User Interface (YUI) library Widgets are composed in a browser to create the complete application Each widget provides: A Service that implements SPARQL queries A Model encapsulating SPARQL query results A Renderer
  • Initially hoped to use D2R server's SPARQL query rewriting, but some queries would kill the server, so went for SPARQLite alternative Different techniques for generating RDF applied to different kinds of data source Resulting RDF is loaded into the Jena TDB triple store.
  • This is mostly based on available off-the-shelf software Choice of triple store is influenced significantly by speed of loading ~10 million triples Also performed some experiments with OpenLink Virtuoso – performance looks pretty good Amazon EC2/EBS has worked well for us
  • 2010 03 Lodoxf Openflydata

    1. 1. OpenFlyData: the way to go for biological data integration Dr Jun Zhao Image Bioinformatics Research Group Department of Zoology University of Oxford
    2. 2. http://www.fly-ted.org.uk/509/
    3. 3. Use cases of gene expression data <ul><li>Experimental design </li></ul><ul><ul><li>Choosing which handful of genes/mutations to study in detail </li></ul></ul><ul><li>Data validation </li></ul><ul><ul><li>Mitigating problems of cross contamination </li></ul></ul>
    4. 4. OpenFlyData Application <ul><li>To answer questions about </li></ul><ul><ul><li>&quot;what does this gene do?” </li></ul></ul><ul><ul><li>“ which genes are of interests?” </li></ul></ul><ul><ul><li>.... </li></ul></ul><ul><li>Investigate the feasibility of existing Semantic Web tools and technologies for real applications </li></ul><ul><li>Create a set of reusable data sources and data query services </li></ul><ul><li>User-led and test-driven </li></ul>
    5. 5. Barriers for accessing these data <ul><li>Data are scattered at different web sites </li></ul><ul><li>Searches have to be repeated, different search interfaces, different use of terminology </li></ul><ul><li>Limited (if any) programmatic access to data … hard work to answer questions that span data sources </li></ul>
    6. 6. OpenFlyData.org demonstration <ul><li>Three gene express cross-database search applications </li></ul><ul><ul><li>Search by gene, gene expression mashup: [ go ] </li></ul></ul><ul><ul><li>Search gene expression by gene batch [ go ] </li></ul></ul><ul><ul><li>Search gene expression by tissue expression profile [ go ] </li></ul></ul>
    7. 7. The data sources <ul><li>Flybase </li></ul><ul><ul><li>The centric Drosophila genetic database </li></ul></ul><ul><li>BDGP (Berkley Drosophila Genome Project) </li></ul><ul><ul><li>An image repository of Drosophila gene expression in the development of embryo </li></ul></ul><ul><li>FlyTED </li></ul><ul><ul><li>An image repository of Drosophila gene expression in the development of testis </li></ul></ul><ul><li>FlyAtlas </li></ul><ul><ul><li>Tissue-specific Drosophila gene expression levels </li></ul></ul><ul><ul><li>Available as a single spreadsheet </li></ul></ul>
    8. 8. System architecture SPARQL endpoint Web browser FlyUI application FlyUI widget HTTP Client side SPARQL server (SPARQLite, Tomcat, Apache)‏ RDF cache (Jena TDB) ‏ FlyBase BDGP FlyTED FlyAtlas Server side
    9. 9. Creating RDF from data sources <ul><li>D2RQ mapping </li></ul><ul><ul><li>FlyBase and BDGP, native relational databases </li></ul></ul><ul><ul><li>Conservative mapping, with minimum interpretation </li></ul></ul><ul><li>OAI2SPARQL </li></ul><ul><ul><li>Harvesting N3 RDF metadata via the OAI-PMH protocol, built-in support by Eprints </li></ul></ul><ul><ul><li>Further from ESWC2008 paper </li></ul></ul><ul><li>Custom Python program </li></ul><ul><ul><li>FlyAtlas </li></ul></ul><ul><ul><li>Generating N3 from spreadsheet table </li></ul></ul>
    10. 10. The heterogeneous Drosophila gene names DATA SOURCE POSSIBLE GENE IDENTIFIERS EXAMPLES FlyBase symbol schuy full name schumacher-levy annotation symbol CG17736 Unique FlyBase id FBgn0036925 Curated synonyms CG17736, schuy, etc BDGP FlyBase id FBgn0036925 Annotation symbol CG17736 FlyAtlas Affy microarray probe id 16166608_a_at FlyTED Uncontrolled gene name schuy, CG17736/schuy
    11. 11. Gene name mapping <ul><li>Use FlyBase for automatic gene mapping </li></ul><ul><ul><li>67 out 833 genes remain unmapped or ambiguously mapped </li></ul></ul><ul><li>Additional inputs from scientists for disambiguating many-to-many mappings </li></ul><ul><li>Mappings are stored are RDF for “GeneFinder” widget </li></ul><ul><li>Evaluation of gene name mapping </li></ul><ul><ul><li>Test the SPARQL queries to the FlyTED store </li></ul></ul><ul><ul><li>Test the FlyTED gene expression search service </li></ul></ul>
    12. 12. SPARQL queries PREFIX chado: <http://purl.org/net/chado/schema> PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> PREFIX xs: <http://www.w3.org/2001/XML_Schema#> SELECT ?flybaseID WHERE { ?feature rdf:type chado:Feature ; chado:name “schuy”^^xs:string ; chado:uniquename ?flybaseID . } SELECT ?feature.uniquename AS flybaseID FROM feature WHERE feature.name = “schuy” SPARQL SQL
    13. 13. SPARQL protocol GET /query/flybase?query=[URL encoded query] HTTP/1.1 Host: openflydata.org Accept: application/sparql-results+json POST /query/flybase HTTP/1.1 Host: openflydata.org Accept: application/sparql-results+json Content-Type: application/x-www-form-urlencoded Content-Length: 456 query=[URL encoded query] HTTP GET HTTP POST
    14. 14. SPARQL server <ul><li>Amazon EC2 (Elastic Compute Cloud): </li></ul><ul><ul><li>To run SPARQL endpoints </li></ul></ul><ul><ul><li>To host the demo you've just seen </li></ul></ul><ul><li>Jena TDB as triple store </li></ul><ul><ul><li>For better loading performance </li></ul></ul><ul><ul><li>For better querying performance </li></ul></ul><ul><li>SPARQLite </li></ul><ul><ul><li>home-grown SPARQL protocol implementation </li></ul></ul><ul><ul><li>More later </li></ul></ul><ul><li>Apache, Tomcat, mod_jk, etc. </li></ul>
    15. 15. SPARQLite <ul><li>http://sparqlite.googlecode.com </li></ul><ul><ul><li>A platform for exploring SPARQL service quality concerns, more later </li></ul></ul><ul><ul><li>Restricted forms of query (SELECT, ASK and DESCRIBE) </li></ul></ul><ul><ul><li>Restricted query result format (e.g. only JSON) </li></ul></ul><ul><ul><li>Disallow queries containing variable predicates. </li></ul></ul><ul><ul><li>Disallow queries containing a “filter” clause and/or using unbound variables. </li></ul></ul><ul><ul><li>Limit the maximum number of requests sent from any one source in any one second to 5 </li></ul></ul><ul><li>Designed for Jena TDB/SDB + Postgres </li></ul>
    16. 16. Benefits of SW technologies <ul><li>RDF provides a uniform and flexible data model </li></ul><ul><ul><li>RDF dump is cheaper and quicker </li></ul></ul><ul><ul><li>Maintaining a separate SPARQL endpoint for each data source makes it easier than a data warehouse approach for handling data updates </li></ul></ul><ul><li>RDF facilitates data re-use and re-purposing </li></ul><ul><li>SPARQL raises the point of departure for an application </li></ul><ul><ul><li>Expressive, open-ended query protocol </li></ul></ul><ul><ul><li>Support for unanticipated queries </li></ul></ul><ul><li>Benefits for the future </li></ul><ul><ul><li>Linking to other data sources </li></ul></ul><ul><ul><li>Querying genes using the Fly Anatomy ontology </li></ul></ul><ul><ul><li>Magic of inference </li></ul></ul>
    17. 17. Costs & Risks <ul><li>Mapping data to RDF requires expertise and experience </li></ul><ul><li>Expressive query protocol is a double-edged sword </li></ul><ul><li>Performance is good for some queries, not for others… </li></ul>
    18. 18. Performance <ul><li>Loading: Our datasets ~175 million triples </li></ul><ul><ul><ul><li>Jena/TDB gives much better load performance (~15-30k tps), on a 64-bit system with Amazon EBS storage (~3hrs) </li></ul></ul></ul><ul><li>Querying: </li></ul><ul><ul><li>Good enough for real time user interaction, e.g., <1s for single gene search, 1-4s for multigene search (unions) </li></ul></ul><ul><ul><li>No significant slowdown when scale from 10m to 175m triples </li></ul></ul><ul><li>Text matching and case insensitive search </li></ul><ul><ul><li>Problems with using SPARQL regex filter, the only mechanism for case-insensitive search in SPARQL </li></ul></ul><ul><ul><li>Pre-generated lower-case gene names and loaded into the FlyBase RDF DB </li></ul></ul><ul><ul><li>Tried with OpenLink Virtuoso, still ~10 seconds for a case-insensitive search </li></ul></ul>
    19. 19. Future directions <ul><li>Adding new data sources: </li></ul><ul><ul><li>Gene expression data from EBI ArrayExpress </li></ul></ul><ul><li>More applications: </li></ul><ul><ul><li>Find out all the gene expression images of its neighbours </li></ul></ul><ul><ul><li>Find out all the genes related to “blood pressure” </li></ul></ul><ul><ul><li>... </li></ul></ul><ul><li>Linked data (dereferencable, follow-your nose)‏ </li></ul><ul><ul><li>We're thinking about this, but our application does not currently need it </li></ul></ul><ul><li>How to control and predict quality of open SPARQL endpoints </li></ul>
    20. 20. Acknowledgements <ul><li>Alistair Miles, Graham Klyne and David Shotton </li></ul><ul><li>Dr Helen White-Cooper and her research group </li></ul><ul><li>BBSRC for funding building the FlyTED database </li></ul><ul><li>BDGP, FlyAtlas and FlyBase for making the data available </li></ul><ul><li>JISC, for funding the FlyWeb project </li></ul><ul><li>EPSRC, for funding the trip to DILS </li></ul><ul><li>The Jena team, esp. Andy Seaborne </li></ul>

    ×