Genome and Proteome data integration in RDF
Published in: Health & Medicine
Transcript

  • 1. Semantic Web Applications and Tools for Life Sciences, November 2008. Genome and Proteome data integration in RDF. Nadia Anwar, Ela Hunt, Walter Kolch and Andy Pitt. [Figure labels: Genome, Transcripts, Proteins, Metabolites; Data Discovery]
  • 2. Outline
      • Data integration in bioinformatics
      • Semantic data integration
      • Francisella
      • Integrating genome annotations with experimental proteomics data in RDF
      • Further work
  • 3. Data Integration is not a solved problem
  • 4. Information discovery is not integrated. [Figure: separate pipelines, each with its own LIMS. Labels: High TP Sequencing, Microarray experiments, Proteomics experiments, Computational analysis; Genomics, Proteomics, Metabolomics, Synthetic, Systems Biology, Networks/Pathways; Sequence Predictions, ORF Prediction, Genome Comparisons, Gene Expression, Transcript Profile, Transcript Abundance, Peptide Profiles, Peptide Abundance, Protein Identification, Protein Interactions, PT-Modifications; Genome, Regulatory Networks, Metabolic Pathways, Translational Medicine]
  • 5. Semantic data integration across omes data silos. [Figure labels: Data; Genes, Transcripts, Peptides, Metabolites; Genotype; Information; Data Discovery]
  • 6. Proof of concept: Francisella tularensis. [Figure labels: ulceroglandular tularaemia, respiratory tularaemia, oculoglandular tularaemia]
  • 7. Bioterrorism
      • Francisella tularensis is a very successful intracellular pathogen that causes severe disease (respiratory tularaemia is the most acute form of the disease)
      • Low infectious dose (10-50 bacteria, compared to anthrax, which requires 8,000-15,000 spores)
      • Weaponisation fears
  • 8. Data sources: Genome
  • 9. RDF. Source: http://img.jgi.doe.gov/cgi-bin/pub/main.cgi?section=TaxonDetail&page=taxonDetail&taxon_oid=639633024#export
      [RDF graph for IMG:gene_oid=639752258, labels as extracted]
      • RDF:type, RDF:description
      • RDFS:comment: TPR
      • IMG_S:locus_tag: FTN_0209
      • IMG_S:genomic_location_start: 229107
      • IMG_S:genomic_location_end: 229976
      • IMG_S:genomic_location_strand: +
  • 10. Data sources: Genome annotations. Francisella SUPERFAMILY data, http://supfam.cs.bris.ac.uk/
      [RDF graph, labels as extracted]
      • http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&id=118496616
      • RDF#type, RDF:description, http://purl.uniprot.org/core/Protein_Family
      • SUPERFAMILY:cgi-bin/model.cgi?model=0040419
      • SUPERFAMILY:Assignment_Region: 155-367
      • SUPERFAMILY:Score: 5.1e-39
      • SUPERFAMILY:SCOP_ID: SUPERFAMILY:cgi-bin/scop.cgi?sunid=52540
      • SUPERFAMILY:SCOP_Fold: P-loop containing nucleoside triphosphate hydrolases
      • SUPERFAMILY:Family_ID: 81269
      • SUPERFAMILY:Evalue: 7.33e-06
      • SUPERFAMILY:Family_Description: Extended AAA-ATPase domain
      • SUPERFAMILY:Similar_Structure: 1l8q A:77-289
  • 11. Data sources: Genome annotations - KEGG
      [RDF graph, labels as extracted]
      • http://www.genome.jp/dbget-bin/www_bget?ftn:FTN_0298
      • http://www.genome.jp/dbget-bin/www_bget?pathway+ftn00010
      • http://img.jgi.doe.gov/schema#gene
      • http://img.jgi.doe.gov/schema#gene_name: glpX
      • rdfs:comment: fructose
      • rdfs:seeAlso: http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-e+[EC:3.1.3.11]
      • rdfs:seeAlso: http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-e+[SP:A0Q4N9_FRATN]
      • http://www.genome.jp/dbget-bin/www_bfind?F.tularensis_U112
    Genome annotations - NCBI protein
      [RDF graph, labels as extracted]
      • http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&id=118496616
      • RDF:type, RDF:description, http://purl.uniprot.org/Annotation/
      • RDF:idsymbol: YP_897666.1
      • RDFS:#seeAlso: http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?[refseqp-SeqVersion:YP_897666.1]+-e
      • chromosomal
      • http://www.ncbi.nlm.nih.gov/sites/gquery?term=Francisella+tularensis+novicida
  • 12. Data sources: Genome annotations - GO
      [RDF graph, labels as extracted]
      • http://www.genome.jp/dbget-bin/www_bget?ftn:FTN_0277
      • RDF:type, RDF:description
      • mgla:GO_Annotation#ID: http://amigo.geneontology.org/cgi-bin/amigo/go.cgi?view=details&query=0006749
      • mgla:GO_Annotation#Term: glutathione
      • mgla:GO_Annotation#Ontology: biological_process
      • mgla:GO_Annotation#Level: 7
      • http://www.compbio.dundee.ac.uk/Software/GOtcha/iscore: 0.879989490261963
      • http://www.compbio.dundee.ac.uk/Software/GOtcha/cscore: 5.7273821328517
    Poson annotations - COGs
      [RDF graph, labels as extracted]
      • https://tools.nwrce.org/cgi-bin/fnu112/poson.cgi?poson=PSN082435
      • mgla:cogNumber: http://www.ncbi.nlm.nih.gov/sites/entrez?db=cdd&cmd=search&term=COG0508
      • mgla:cogDomain: AceF
      • mgla:cogDescription: Pyruvate/2-oxoglutarate dihydrolipoamide
      • mgla:cogCategory
  • 13. Data sources - experiments: Transcriptomics
  • 14. Data sources - experiments: Proteomics
  • 15. Proteomics: WT vs MglA mutant
  • 16. Francisella tularensis novicida U112. [Figure: wild type and MglA mutant, each split into whole cell, soluble and membrane fractions (each labelled (3) and (4)); every fraction is processed with Sequest and DRAGON. Outputs: identification, relative abundance. P val < 0.01, two-sided t-test]
  • 17. RDF - Excel conversion. [Figure: spreadsheet columns mapped to subject, predicate, object. Graph labels: analysis; Identified Peptide; Peptide sequence; mgla:poson; mgla:experiment; abundance; Pval, Pval-1; Genome; identifier chain PSN rdfs:seeAlso PSNV2 rdfs:seeAlso PSNV3 rdfs:seeAlso FTN rdfs:seeAlso DDBID; GO, SP, EC]
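The spreadsheet-to-RDF step can be sketched as follows. This is a minimal illustration, not the authors' conversion script: the `http://example.org/fnu112/` namespace, the column headers and the identifier values are all hypothetical, and only the `rdfs:seeAlso` chain order (PSN, PSNV2, PSNV3, FTN, DDBID) is taken from the slide.

```python
import csv
import io

RDFS_SEE_ALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso"
BASE = "http://example.org/fnu112/"  # hypothetical namespace, not from the slides
ID_COLUMNS = ["PSN", "PSNV2", "PSNV3", "FTN", "DDBID"]  # chain order shown on the slide


def row_to_triples(row):
    """Link each identifier in a spreadsheet row to the next one with rdfs:seeAlso."""
    ids = [row[col] for col in ID_COLUMNS if row.get(col)]
    return [
        f"<{BASE}{left}> <{RDFS_SEE_ALSO}> <{BASE}{right}> ."
        for left, right in zip(ids, ids[1:])
    ]


# A one-row stand-in for the exported spreadsheet (values are made up).
sheet = io.StringIO(
    "PSN,PSNV2,PSNV3,FTN,DDBID\n"
    "PSN_x1,PSNV2_x1,PSNV3_x1,FTN_x1,DDB_x1\n"
)
triples = [t for row in csv.DictReader(sheet) for t in row_to_triples(row)]
for line in triples:
    print(line)
```

Each row yields one N-Triples line per adjacent identifier pair, so a new experiment only needs to reference one identifier in the chain to become reachable from the others.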
  • 18. Data integration: Reconciled identifiers (source: identifier)
      • WashU-B: PSN.V1
      • COGs: COGID
      • WashU-B: PSN.V2
      • NCBI: PROTEINID
      • WashU-B: PSN.V3
      • IMG: GENEID
      • WashU-P: DDB
      • Fn ORF ID: FTN
      • Refseq: ACNo
      • Gene Ontology: GOID
      • ENZYME: E.C.No
      • Uniprot: ACNo
  • 19. Data integration: Adding new experiments. [Figure: Experiments 1 to 4 and public domain data attach to the identifier chain PSN rdfs:seeAlso PSNV2 rdfs:seeAlso PSNV3 rdfs:seeAlso FTN rdfs:seeAlso DDBID, with links to GO, AC No. and EC]
  • 20. Data integration: Sesame
      NadiaAnwar:~ nadia$ openrdf-sesame-2.1/bin/console.sh
      Connected to default data directory
      Commands end with . at the end of a line
      Type help. for help
      > connect http://127.0.0.1:8080/openrdf-sesame/.
      Disconnecting from default data directory
      Connected to http://127.0.0.1:8080/openrdf-sesame/
      > show r.
      +----------
      |SYSTEM ("System configuration repository")
      |ftnRepoNative ("Francisella Test")
      |FrancisellaNative ("FrancisellaTestStore")
      |FrancisellaReified ("Native store with RDF Schema inferencing")
      |FrancisellaReified_index2 ("Native store with RDF Schema inferencing")
      |Francisella ("Native store with RDF Schema inferencing")
      +----------
      > open FrancisellaReified_index2.
      Opened repository FrancisellaReified_index2
  • 21. Sesame: Data load (ftnRepoNative) - native (spoc,posc)
      Data File                               time (s)    triples
      francisella_locus_tag.nt                    8.93      1,767
      interact-prot.nt                           88.51     20,682
      interact-prot-peptides.nt                            248,647
      mgla search db.fasta.blastp4 ypURL.n3       9.7       1,719
      NC_008601.nt                               43.14     12,781
      Ft_novicidaU112go.nt                      359.14      2,548
      francisella.rdf2.nt                        43.41     10,434
      francisellaSUPERFAMILY.nt                  57.88     16,110
      francisellaPROTEIN.fasta.nt                13.63      5,160
      Soluble.nt                                588.87    336,761
      WholeCell.nt                              469.02    112,625
      Membranes.nt                             1003.19    298,771
  • 22. Data integration: MglA data (ftnRepoNative). [Figure labels: analysis; Identified Peptide; Peptide sequence; mgla:poson; abundance; Experiment; identifier chain PSN rdfs:seeAlso PSNV2 rdfs:seeAlso PSNV3 rdfs:seeAlso FTN rdfs:seeAlso DDBID; GO, SP, EC]
      SELECT psn, ftn, ec
      FROM {ftn} rdfs:seeAlso {ec},
           {psn} rdfs:seeAlso {ftn},
           {analysis} mgla:poson {psn}
      WHERE ec LIKE "*[EC:*"
      USING NAMESPACE
           mgla = <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla/>
  • 23. Data integration: MglA data (ftnRepoNative). [Figure labels: analysis rdf:about; Identified Peptide; Peptide sequence; mgla:poson; mgla:sequence; mgla:experiment; abundance; identifier chain PSN rdfs:seeAlso PSNV2 rdfs:seeAlso PSNV3 rdfs:seeAlso FTN rdfs:seeAlso DDBID; GO, SP, EC]
      SELECT abundance, psn, ec, ftn
      FROM {ftn} rdfs:seeAlso {ec},
           {psn} rdfs:seeAlso {ftn},
           {analysis} mgla:poson {psn},
           {analysis} mgla:experiment {abundance}
      WHERE ec LIKE "*[EC:*"
      USING NAMESPACE
           mgla = <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla/>
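To make the join in the query above concrete, here is a small in-memory sketch of the same four path expressions evaluated over a handful of triples. It is not Sesame's evaluation strategy, only an illustration of what the query matches; the triples, the shortened prefixed names and the `example.org` URLs are hypothetical.

```python
SEE_ALSO = "rdfs:seeAlso"

# Hypothetical triples shaped like the graph on the slide.
triples = {
    ("analysis1", "mgla:poson", "PSN_x1"),
    ("analysis1", "mgla:experiment", "2594"),
    ("PSN_x1", SEE_ALSO, "FTN_x1"),
    ("FTN_x1", SEE_ALSO, "http://example.org/wgetz?-e+[EC:3.1.3.11]"),
    ("FTN_x1", SEE_ALSO, "http://example.org/go/0006749"),
}


def objects(subject, predicate):
    """Return all objects of (subject, predicate, ?) in a stable order."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


def abundance_with_ec():
    """Emulate: {analysis} mgla:poson {psn}, {analysis} mgla:experiment {abundance},
    {psn} rdfs:seeAlso {ftn}, {ftn} rdfs:seeAlso {ec} WHERE ec LIKE "*[EC:*"."""
    rows = []
    for analysis, predicate, psn in triples:
        if predicate != "mgla:poson":
            continue
        for abundance in objects(analysis, "mgla:experiment"):
            for ftn in objects(psn, SEE_ALSO):
                for ec in objects(ftn, SEE_ALSO):
                    if "[EC:" in ec:  # plays the role of the LIKE filter
                        rows.append((abundance, psn, ec, ftn))
    return sorted(rows)


for row in abundance_with_ec():
    print(row)
```

The GO link on `FTN_x1` is dropped by the filter, which is the job of `WHERE ec LIKE "*[EC:*"` in the SeRQL version.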
  • 24. Really easy, but...
      • Simple Excel to RDF conversion does not enable all queries
      • Not a simple conversion: data needs to be "modelled"
      [Figure: the flat model (analysis rdf:about; Identified Peptide; mgla:poson; mgla:sequence; mgla:experiment; abundance; PSN) contrasted with the modelled form { Peptide Sequence identifiedIn Experiment Replicate } hasAbundance abundance]
  • 25. Data integration: Reified statements. [Figure: the analysis data node has rdf:type rdf:Statement, with rdf:subject (Identified Peptide / Peptide sequence), rdf:predicate mgla:InExperimentReplicate and rdf:object (Experiment Replicate); mgla:PeptideAbundance attaches the abundance to the statement. Other labels: analysis rdf:type; mgla:poson; identifier chain PSN rdfs:seeAlso PSNV2 rdfs:seeAlso PSNV3 rdfs:seeAlso FTN rdfs:seeAlso DDBID; GO, SP, EC]
  • 26. Sesame: Reified data load - native-RDFS (spoc,posc,posc)
      Data File                           time (s)   time (mins)    triples
      FnU112Version3.nt                     383.44        6.3        58,474
      PosonMappings.nt                       84.56        1.4        13,760
      francisella_locus_tag.nt               16.73        0.3         1,767
      ConstructHasGeneID.nt                  23.00        0.4         1,719
      interact-prot.nt                      124.95        2.1        20,682
      interact-prot-pepteides.nt           1127.97       18.7       248,647
      interact-protSeeAlsoisbURL.nt          10.67        0.2         1,528
      goAnnotation_URLID.nt                  74.14        1.2        20,501
      NC_008601.nt                           75.84        1.3        12,781
      Membranes_CogNumberURL.nt               8.60        0.1         2,548
      Ft_novicida_U112_go.nt                561.38        9.3         2,548
      francisella.rdf2.nt                    46.19        0.8        10,602
      francisellaSUPERFAMILY.nt              66.67        1.1        16,110
      francisellaPROTEIN.fasta.nt            15.27        0.3         5,160
      SolubleReifeid_3.rdf                 1392.98       23.2       580,873
      WholeCellReified_3.rdf                941.16       15.6       184,221
      Membranes_3.rdf                      1026.66       17.111     416,086
      fnU112_draftRDFschemaV4.nt         215010.98    3,583.5           501
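The load figures are easier to compare as throughput. The sketch below derives triples per second for four rows copied from the table above; the choice of rows is arbitrary and the variable names are illustrative.

```python
# (file, load time in seconds, triples), copied from the reified load table.
loads = [
    ("FnU112Version3.nt", 383.44, 58474),
    ("SolubleReifeid_3.rdf", 1392.98, 580873),
    ("Membranes_3.rdf", 1026.66, 416086),
    ("fnU112_draftRDFschemaV4.nt", 215010.98, 501),
]

# Throughput in triples per second for each file.
rates = {name: triples / seconds for name, seconds, triples in loads}

for name, rate in sorted(rates.items(), key=lambda item: item[1]):
    print(f"{name:30s} {rate:10.3f} triples/s")

slowest = min(rates, key=rates.get)
```

On these numbers the schema file is by far the slowest per triple, even though it is the smallest file; the slides do not say why.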
  • 27. Queries: which posons have the most highly abundant peptides
      select ftn, psn, exp, abundance
      from {psn} rdfs:seeAlso {psnv2},
           {psnv2} rdfs:seeAlso {psnv3},
           {psnv3} rdfs:seeAlso {ftn},
           {analysis} fnu112:poson {psn},
           {analysis} rdf:type {rdf:Statement},
           {analysis} rdf:object {exp},
           {analysis} mgla:PeptideAbundance {abundance}
      where xsd:integer(abundance) > 100000
        and ftn LIKE "*FTN*"
      using namespace
           mgla = <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla/>,
           fnu112 = <http://www.francisella.org/novicida/fnu112/schema/fnu112/experiments/mgla#>
  • 28. Queries: which posons have the most highly abundant peptides
  • 29. Queries: which experiments have the most highly abundant peptides
  • 30. Reified statements
      • Reified mgla data are much bigger (4 more statements per abundance)
      • The really interesting queries return a Java out-of-memory error (-Xms1024M -Xmx1536M)
      • Haven't yet tested the shortcut path expression:
          { {reifSubj} reifPred {reifObj} } pred {obj}
          { {seq} identifiedIn {ExpRep} } hasAbundance {abd}
      [Figure: { Peptide Sequence identifiedIn Experiment Replicate } hasAbundance abundance]
      Example reified statement:
      <#WholeCell_Lvl7_02.12> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement>.
      <#WholeCell_Lvl7_02.12> <http://www.w3.org/1999/02/22-rdf-syntax-ns#subject> <http:/www.francisella.org/novicida/schema/fnu112/experiments/mgla/WholeCell_Lvl7_02.1>.
      <#WholeCell_Lvl7_02.12> <http://www.w3.org/1999/02/22-rdf-syntax-ns#predicate> <http:/www.francisella.org/novicida/schema/fnu112/experiments/mgla/InExperimentReplicate>.
      <#WholeCell_Lvl7_02.12> <http://www.w3.org/1999/02/22-rdf-syntax-ns#object> <http:/www.francisella.org/novicida/schema/fnu112/experiments/mgla/wildtype/01_wc_01>.
      <#WholeCell_Lvl7_02.12> <http:/www.francisella.org/novicida/schema/fnu112/experiments/mgla/PeptideAbundance> "2594".
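The five-statement pattern on this slide can be generated mechanically. The sketch below is an assumption about how such lines could be produced, not the authors' tooling: `reify_abundance` is a hypothetical helper, and it uses the `mgla` namespace as declared in the queries (with `http://`), which differs from the single-slash form printed in the example triples.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Namespace as declared in the SeRQL queries on earlier slides.
MGLA = "http://www.francisella.org/novicida/schema/fnu112/experiments/mgla/"


def reify_abundance(statement_id, peptide, replicate, abundance):
    """Build the reified statement 'peptide InExperimentReplicate replicate'
    and attach the abundance to the statement node itself."""
    node = f"<#{statement_id}>"
    return [
        f"{node} <{RDF}type> <{RDF}Statement> .",
        f"{node} <{RDF}subject> <{MGLA}{peptide}> .",
        f"{node} <{RDF}predicate> <{MGLA}InExperimentReplicate> .",
        f"{node} <{RDF}object> <{MGLA}{replicate}> .",
        f'{node} <{MGLA}PeptideAbundance> "{abundance}" .',
    ]


lines = reify_abundance(
    "WholeCell_Lvl7_02.12", "WholeCell_Lvl7_02.1", "wildtype/01_wc_01", 2594
)
for line in lines:
    print(line)
```

Four of the five lines exist only to describe the statement, which is the overhead the first bullet refers to.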
  • 31. Comparison of integrated experimental data: distinct and overlapping posons identified within each biological fraction (> 20000). [Venn diagram: mem, sol; values 171, 146, 185]
      Three queries are shown: mem MINUS sol, sol MINUS mem, and sol INTERSECT mem. Each combines the same two sub-queries, which differ only in the LIKE pattern ("*mem*" or "*sol*"):
      select distinct psn from
           {x} fns:poson {psn},
           {x} fn:InExperimentReplicate {experiment},
           {analysis} rdf:subject {x},
           {analysis} rdf:object {exp},
           {analysis} fn:PeptideAbundance {abundance}
      where xsd:integer(abundance) > 20000
        and experiment LIKE "*mem*"
      MINUS
      select distinct psn from
           {x} fns:poson {psn},
           {x} fn:InExperimentReplicate {experiment},
           {analysis} rdf:subject {x},
           {analysis} rdf:object {exp},
           {analysis} fn:PeptideAbundance {abundance}
      where xsd:integer(abundance) > 20000
        and experiment LIKE "*sol*"
      using namespace
  • 32. Comparison of integrated experimental data: distinct and overlapping posons identified within each biological fraction (< 5000). [Venn diagram: mem, sol; values 219, 125, 245]
      Three queries are shown: mem MINUS sol, sol MINUS mem, and sol INTERSECT mem. Each combines the same two sub-queries, which differ only in the LIKE pattern ("*mem*" or "*sol*"):
      select distinct psn from
           {x} fns:poson {psn},
           {x} fn:InExperimentReplicate {experiment},
           {analysis} rdf:subject {x},
           {analysis} rdf:object {exp},
           {analysis} fn:PeptideAbundance {abundance}
      where xsd:integer(abundance) < 5000
        and experiment LIKE "*mem*"
      MINUS
      select distinct psn from
           {x} fns:poson {psn},
           {x} fn:InExperimentReplicate {experiment},
           {analysis} rdf:subject {x},
           {analysis} rdf:object {exp},
           {analysis} fn:PeptideAbundance {abundance}
      where xsd:integer(abundance) < 5000
        and experiment LIKE "*sol*"
      using namespace
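The MINUS and INTERSECT comparisons on the two slides above correspond to ordinary set difference and intersection. A minimal sketch with Python sets, using made-up observations (the poson names, replicate names and abundances are hypothetical; only the thresholds 20000 and 5000 and the "mem"/"sol" substrings come from the slides):

```python
import operator

# (poson, experiment replicate, peptide abundance); all values are illustrative.
observations = [
    ("PSN_a", "mem_01", 25000),
    ("PSN_b", "mem_02", 30000),
    ("PSN_b", "sol_01", 40000),
    ("PSN_c", "sol_02", 21000),
    ("PSN_d", "mem_01", 1200),
]


def posons(fraction, compare, threshold):
    """Distinct posons seen in a fraction whose abundance passes the comparison."""
    return {
        psn
        for psn, experiment, abundance in observations
        if fraction in experiment and compare(abundance, threshold)
    }


mem = posons("mem", operator.gt, 20000)
sol = posons("sol", operator.gt, 20000)

mem_minus_sol = mem - sol   # SeRQL: mem query MINUS sol query
sol_minus_mem = sol - mem   # SeRQL: sol query MINUS mem query
overlap = sol & mem         # SeRQL: sol query INTERSECT mem query

print(sorted(mem_minus_sol), sorted(overlap), sorted(sol_minus_mem))
```

Passing `operator.lt` with 5000 gives the low-abundance variant of the same comparison.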
  • 33. Further work
      • Queries are slow in the native repository; database repositories are probably faster.
      • Adding a transcriptomic experiment: WT vs mglA mutant, GEO AC GSE5468
      • RDF-S inferencing?
  • 34. Acknowledgements
      • Funding: BBSRC - Radical Solutions for Researching the Proteome
      • University of Glasgow, Glasgow
          • Prof. Walter Kolch
          • Dr Andy Pitt
      • University of Strathclyde, Glasgow
          • Dr Ela Hunt (Scientific Advisor)
      • University of Washington, Seattle
          • Prof. Dave Goodlett (Scientific Advisor)
          • Dr Mitch Brittnacher, Mathew Radey, Laurence Rohmer
          • Dr Tina Guina (MglA experiment)
  • 35. Abundance thresholds...
      • SeRQL aggregate functions would be nice to have
      • Queries to find low and high abundance values:
          • WHERE abundance BETWEEN MEDIAN(abundance) AND MAX(abundance)
          • WHERE abundance BETWEEN MIN(abundance) AND MEDIAN(abundance)
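Until such aggregates exist in the query language, the two BETWEEN filters can be applied client-side on the values a query returns. A minimal sketch, assuming the abundances have already been fetched into a list; the value 2594 appears on an earlier slide, the others are made up.

```python
from statistics import median

# Abundances as they might come back from a query (mostly illustrative values).
abundances = [2594, 120, 48000, 9100, 150000]

mid = median(abundances)
lowest, highest = min(abundances), max(abundances)

# WHERE abundance BETWEEN MEDIAN(abundance) AND MAX(abundance)
high = [a for a in abundances if mid <= a <= highest]
# WHERE abundance BETWEEN MIN(abundance) AND MEDIAN(abundance)
low = [a for a in abundances if lowest <= a <= mid]

print("median:", mid)
print("high:", sorted(high))
print("low:", sorted(low))
```

Because BETWEEN is inclusive at both ends, the median value itself lands in both groups.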
