Genome and Proteome data integration in RDF

Semantic Web Applications and Tools for Life Sciences
November 2008

Nadia Anwar, Ela Hunt, Walter Kolch and Andy Pitt
[Figure: Genome, Transcripts, Proteins, Metabolites]

Data Discovery

Outline
• Data Integration in Bioinformatics.

• Semantic data integration

• Francisella

• Integrating genome annotations with experimental proteomics data in RDF

• Further work

Data Integration is not a solved problem

Information discovery is not Integrated

[Figure: information discovery spread across separate silos, each with its own LIMS. Labels: high-throughput sequencing, microarray experiments, proteomics experiments, computational analysis, systems biology, synthetic networks/pathways; genomics (sequence, ORF prediction, genome comparisons), gene expression (transcript profile, transcript abundance), proteomics (peptide profiles, peptide abundance, protein identification, protein interactions, PT-modifications), predictions, metabolomics; genome, regulatory networks, metabolic pathways; translational medicine.]

Semantic Data Integration across omes data silos

Data Genes Transcripts Peptides Metabolites Genotype Information

Data Discovery

Proof of concept
Francisella tularensis

ulceroglandular tularaemia
respiratory tularaemia
oculoglandular tularaemia

Bioterrorism
• Francisella tularensis is a very successful intracellular pathogen that causes severe disease (respiratory tularaemia is the most acute form)
• Low infectious dose (10-50 bacteria, compared with anthrax, which requires 8,000-15,000 spores)
• Weaponisation fears

RDF
http://img.jgi.doe.gov/cgi-bin/pub/main.cgi?section=TaxonDetail&page=taxonDetail&taxon_oid=639633024#export

[Figure: RDF graph for the gene IMG:gene_oid=639752258. Properties: RDF:type, RDFS:comment, IMG_S:locus_tag, IMG_S:genomic_location_start, IMG_S:genomic_location_end, IMG_S:genomic_location_strand. Values shown: RDF:description, TPR, FTN_0209, 229107, 229976, +]

Data sources
Genome annotations

http://supfam.cs.bris.ac.uk/

RDF#type RDF:description

http://purl.uniprot.org/core/Protein_Family
SUPERFAMILY:cgi-bin/model.cgi?model=0040419

SUPERFAMILY:Assignment_Region 155-367

SUPERFAMILY:Score 5.1e-39

SUPERFAMILY:SCOP_ID SUPERFAMILY:cgi-bin/scop.cgi?sunid=52540
http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&id=118496616 SUPERFAMILY:SCOP_Fold
P-loop containing nucleoside triphosphate hydrolases
SUPERFAMILY:Family_ID

SUPERFAMILY:Evalue 81269

7.33e-06
SUPERFAMILY:Family_Description
Extended AAA-ATPase domain
SUPERFAMILY:Similar_Structure
1l8q A:77-289

Francisella SuperFamily Data

Data sources
Genome annotations - KEGG

http://www.genome.jp/dbget-bin/www_bget?pathway+ftn00010

http://img.jgi.doe.gov/schema#gene

http://www.genome.jp/dbget-bin/www_bget?ftn:FTN_0298

http://img.jgi.doe.gov/schema#gene_name rdfs:comment rdfs:seeAlso rdfs:seeAlso

glpX fructose http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-e+[EC:3.1.3.11] http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-e+[SP:A0Q4N9_FRATN]

http://www.genome.jp/dbget-bin/www_bfind?F.tularensis_U112

Genome annotations - NCBI protein
http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&id=118496616

RDF:type RDF:idsymbol RDFS:#seeAlso http://purl.uniprot.org/Annotation/

RDF:description YP_897666.1 http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?[refseqp-SeqVersion:YP_897666.1]+-e chromosomal

http://www.ncbi.nlm.nih.gov/sites/gquery?term=Francisella+tularensis+novicida

Data sources
Genome annotations - GO
RDF:type RDF:description

mgla:GO_Annotation#ID http://amigo.geneontology.org/cgi-bin/amigo/go.cgi?view=details&query=0006749

mgla:GO_Annotation#Term glutathione

http://www.genome.jp/dbget-bin/www_bget?ftn:FTN_0277
mgla:GO_Annotation#Ontology biological_process
mgla:GO_Annotation#Level
7
http://www.compbio.dundee.ac.uk/Software/GOtcha/iscore
0.879989490261963
http://www.compbio.dundee.ac.uk/Software/GOtcha/cscore
5.7273821328517

Poson annotations - Cogs

http://www.ncbi.nlm.nih.gov/sites/entrez?db=cdd&cmd=search&term=COG0508
mgla:cogNumber
mgla:cogDomain AceF
https://tools.nwrce.org/cgi-bin/fnu112/poson.cgi?poson=PSN082435 mgla:cogDescription
mgla:cogCategory Pyruvate/2-oxoglutarate

dihydrolipoamide

Data sources - experiments
Transcriptomics

Data sources - experiments
Proteomics

Francisella tularensis novicida U112

WildType MglA mutant

Whole Cell Soluble Membrane Whole Cell Soluble Membrane
(3) (3) (3) (3) (3) (3)

(4) (4) (4) (4) (4) (4)

Sequest DRAGON Sequest DRAGON Sequest DRAGON
Sequest DRAGON Sequest DRAGON Sequest DRAGON

Identification Relative Abundance

P val <0.01

Two-sided t-test

RDF - excel conversion
Pval
Genome
Pval-1
analysis

Identified Peptide mgla:poson

abundance mgla:experiment PSN rdfs:seeAlso
PSNV2 rdfs:seeAlso
PSNV3 rdfs:seeAlso
FTN

rdfs:seeAlso

DDBID
Peptide
sequence
predicate GO SP EC
subject

object
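
As a minimal sketch of the spreadsheet-to-RDF step, the Python below turns one exported row into N-Triples. Only the `mgla` namespace and the `mgla:poson`, `mgla:sequence` and `mgla:experiment` predicates come from the slides. The column names, the CSV export, the `poson/` URI pattern and the sample row are assumptions made for illustration.

```python
import csv
import io

MGLA = "http://www.francisella.org/novicida/schema/fnu112/experiments/mgla/"


def literal(value):
    """Format a plain N-Triples literal, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def row_to_triples(row):
    """Map one spreadsheet row to N-Triples lines (hypothetical column names)."""
    subject = f"<{MGLA}{row['analysis_id']}>"
    poson = f"<{MGLA}poson/{row['psn']}>"  # assumed URI pattern, not from the slides
    return [
        f"{subject} <{MGLA}poson> {poson} .",
        f"{subject} <{MGLA}sequence> {literal(row['sequence'])} .",
        f"{subject} <{MGLA}experiment> {literal(row['abundance'])} .",
    ]


# Stand-in for a sheet exported from Excel as CSV. The identifiers are borrowed
# from the slides, but their pairing and the peptide sequence are placeholders.
SAMPLE = (
    "analysis_id,psn,sequence,abundance\n"
    "WholeCell_Lvl7_02.1,PSN082435,EXAMPLEPEPTIDE,2594\n"
)

triples = [
    line
    for row in csv.DictReader(io.StringIO(SAMPLE))
    for line in row_to_triples(row)
]
print("\n".join(triples))
```

Each row becomes a small star of triples around the analysis node, which is the subject-predicate-object layout drawn on the slide.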

Data integration
Reconciled Identifiers

(WashU-B) PSN.V1

(COGs) COGID (WashU-B) PSN.V2

(NCBI) PROTEINID (WashU-B) PSN.V3 (IMG) GENEID (WashU-P) DDB

(Fn ORF ID) FTN (Refseq) ACNo

(Gene Ontology) GOID (ENZYME) E.C.No (Uniprot) ACNo
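
Reconciliation amounts to following `rdfs:seeAlso` links from one identifier scheme to the next. The sketch below walks such links over an in-memory list of triples. The identifiers are placeholders that only follow the naming scheme on the slide, and the exact link layout beyond PSN, PSNV2, PSNV3 and FTN is an assumption.

```python
from collections import deque

SEE_ALSO = "rdfs:seeAlso"


def reachable_ids(triples, start):
    """Breadth-first walk along rdfs:seeAlso links from one identifier."""
    links = {}
    for subject, predicate, obj in triples:
        if predicate == SEE_ALSO:
            links.setdefault(subject, []).append(obj)
    seen, queue = [], deque([start])
    while queue:
        node = queue.popleft()
        for nxt in links.get(node, []):
            if nxt != start and nxt not in seen:
                seen.append(nxt)
                queue.append(nxt)
    return seen


# Placeholder identifiers; only the scheme names follow the slides.
EXAMPLE = [
    ("PSN:example", SEE_ALSO, "PSNV2:example"),
    ("PSNV2:example", SEE_ALSO, "PSNV3:example"),
    ("PSNV3:example", SEE_ALSO, "FTN:example"),
    ("FTN:example", SEE_ALSO, "EC:example"),
    ("FTN:example", SEE_ALSO, "GO:example"),
]

ids = reachable_ids(EXAMPLE, "PSN:example")
print(ids)
```

In the repository itself the same walk is expressed as chained `rdfs:seeAlso` path expressions in a query.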

Data Integration
Adding new experiments

Experiment Public
2
Experiment domain data
1

PSN rdfs:seeAlso
PSNV2 rdfs:seeAlso
PSNV3 rdfs:seeAlso
FTN

rdfs:seeAlso
Experiment
3 DDBID

Experiment
4
GO AC No. EC

Data integration
Sesame
NadiaAnwar:~ nadia$ openrdf-sesame-2.1/bin/console.sh
Connected to default data directory

Commands end with '.' at the end of a line
Type 'help.' for help
> connect http://127.0.0.1:8080/openrdf-sesame/.
Disconnecting from default data directory
Connected to http://127.0.0.1:8080/openrdf-sesame/
> show r.
+----------
|SYSTEM ("System configuration repository")
|ftnRepoNative ("Francisella Test")
|FrancisellaNative ("FrancisellaTestStore")
|FrancisellaReified ("Native store with RDF Schema inferencing")
|FrancisellaReified_index2 ("Native store with RDF Schema inferencing")
|Francisella ("Native store with RDF Schema inferencing")
+----------
> open FrancisellaReified_index2.
Opened repository 'FrancisellaReified_index2'

Sesame
Data load (ftnRepoNative) - native (spoc,posc)

Data File time (s) triples
francisella_locus_tag.nt 8.93 1,767
interact-prot.nt 88.51 20,682
interact-prot-peptides.nt 248,647
mgla search db.fasta.blastp4 ypURL.n3 9.7 1,719
NC_008601.nt 43.14 12,781
Ft_novicidaU112go.nt 359.14 2,548
francisella.rdf2.nt 43.41 10,434
francisellaSUPERFAMILY.nt 57.88 16,110
francisellaPROTEIN.fasta.nt 13.63 5,160
Soluble.nt 588.87 336,761
WholeCell.nt 469.02 112,625
Membranes.nt 1003.19 298,771

Data Integration
Mgla data (ftnRepoNative)

analysis


abundance PSN rdfs:seeAlso
PSNV2 rdfs:seeAlso
PSNV3 rdfs:seeAlso
FTN

rdfs:seeAlso
Experiment
DDBID
Peptide
sequence
GO SP EC

SELECT psn, ftn, ec FROM
{ftn} rdfs:seeAlso {ec},
{psn} rdfs:seeAlso {ftn},
{analysis} mgla:poson {psn}
WHERE ec LIKE "*[EC:*"
USING NAMESPACE
mgla = <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla/>

Data Integration
Mgla data (ftnRepoNative)

analysis
rdf:about

mgla:sequence
mgla:experiment
abundance PSN rdfs:seeAlso
PSNV2 rdfs:seeAlso
PSNV3 rdfs:seeAlso
FTN

Peptide
sequence rdfs:seeAlso

DDBID
GO SP EC

SELECT abundance, psn, ec, ftn FROM
{ftn} rdfs:seeAlso {ec},
{psn} rdfs:seeAlso {ftn},
{analysis} mgla:poson {psn},
{analysis} mgla:experiment {abundance}
WHERE ec LIKE "*[EC:*"
USING NAMESPACE
mgla = <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla/>
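
A query like this can also be sent to the repository over HTTP rather than typed into the console. The sketch below only builds the request and sends nothing. It assumes the Sesame 2 HTTP protocol, with the query passed in `query`, the language in `queryLn=serql`, and results requested as SPARQL JSON. These details should be checked against the server's documentation. The server URL and repository name are taken from the console session.

```python
from urllib.parse import urlencode
from urllib.request import Request

SERVER = "http://127.0.0.1:8080/openrdf-sesame"  # as in the console session
REPOSITORY = "ftnRepoNative"

SERQL = """SELECT abundance, psn, ec, ftn FROM
{ftn} rdfs:seeAlso {ec},
{psn} rdfs:seeAlso {ftn},
{analysis} mgla:poson {psn},
{analysis} mgla:experiment {abundance}
WHERE ec LIKE "*[EC:*"
USING NAMESPACE
mgla = <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla/>"""


def build_query_request(server, repository, query):
    """Build (but do not send) a GET request for a SeRQL query.

    The parameter names follow the Sesame 2 HTTP protocol as assumed above.
    """
    params = urlencode({"query": query, "queryLn": "serql"})
    url = f"{server}/repositories/{repository}?{params}"
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_query_request(SERVER, REPOSITORY, SERQL)
print(request.full_url[:70])
# Against a live server: urllib.request.urlopen(request).read()
```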

Really easy, But....
• A simple Excel-to-RDF conversion does not enable all queries

• Not a simple conversion: the data need to be “modelled”

[Figure: the flat model (analysis rdf:about, mgla:sequence, mgla:experiment, abundance, PSN) compared with the modelled statement]
{ Peptide Sequence identifiedIn Experiment Replicate } hasAbundance abundance

Data Integration
Reified statements
[Figure: the analysis node (rdf:type Identified Peptide) links to its Peptide sequence and, via mgla:poson, to PSN, which chains through rdfs:seeAlso to PSNV2, PSNV3 and FTN, with further rdfs:seeAlso links to DDBID, GO, SP and EC. Each analysis data node is an rdf:Statement whose rdf:subject is the analysis, whose rdf:predicate is mgla:InExperimentReplicate and whose rdf:object is the Experiment Replicate; it carries mgla:PeptideAbundance with the abundance value.]

Sesame
Reified Data load - native-RDFS (spoc,posc,posc)
Data File time (s) time(mins) triples
FnU112Version3.nt 383.44 6.3 58,474
PosonMappings.nt 84.56 1.4 13,760
francisella_locus_tag.nt 16.73 0.3 1,767
ConstructHasGeneID.nt 23.00 0.4 1,719
interact-prot.nt 124.95 2.1 20,682
interact-prot-pepteides.nt 1127.97 18.7 248,647
interact-protSeeAlsoisbURL.nt 10.67 0.2 1,528
goAnnotation_URLID.nt 74.14 1.2 20,501
NC_008601.nt 75.84 1.3 12,781
Membranes_CogNumberURL.nt 8.60 0.1 2,548
Ft_novicida_U112_go.nt 561.38 9.3 2,548
francisella.rdf2.nt 46.19 0.8 10,602
francisellaSUPERFAMILY.nt 66.67 1.1 16,110
francisellaPROTEIN.fasta.nt 15.27 0.3 5,160
SolubleReifeid_3.rdf 1392.98 23.2 580,873
WholeCellReified_3.rdf 941.16 15.6 184,221
Membranes_3.rdf 1026.66 17.1 416,086
fnU112_draftRDFschemaV4.nt 215010.98 3,583.5 501
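
To put the two load tables side by side, the sketch below computes load rates and triple growth for the three experiment files. The numbers are copied from the tables. The two loads used different store configurations (native versus native with RDF Schema inferencing), so the comparison is indicative only.

```python
# name: (seconds, triples), copied from the two load tables
PLAIN = {
    "Soluble": (588.87, 336_761),
    "WholeCell": (469.02, 112_625),
    "Membranes": (1003.19, 298_771),
}
REIFIED = {
    "Soluble": (1392.98, 580_873),
    "WholeCell": (941.16, 184_221),
    "Membranes": (1026.66, 416_086),
}


def rate(seconds, triples):
    """Load throughput in triples per second."""
    return triples / seconds


for name in PLAIN:
    plain_s, plain_t = PLAIN[name]
    reif_s, reif_t = REIFIED[name]
    print(
        f"{name}: {rate(plain_s, plain_t):.0f} -> {rate(reif_s, reif_t):.0f} triples/s, "
        f"{reif_t / plain_t:.2f}x as many triples"
    )
```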

Queries
which posons have the most highly abundant peptides
select ftn , psn, exp, abundance from
{psn} rdfs:seeAlso {psnv2},
{psnv2} rdfs:seeAlso {psnv3},
{psnv3} rdfs:seeAlso {ftn},
{analysis} fnu112:poson {psn},
{analysis} rdf:type {rdf:Statement},
{analysis} rdf:object {exp},
{analysis} mgla:PeptideAbundance {abundance}
where xsd:integer(abundance) > 100000
and ftn LIKE "*FTN*"
using namespace
mgla=<http://www.francisella.org/novicida/schema/fnu112/experiments/mgla/>,
fnu112=<http://www.francisella.org/novicida/fnu112/schema/fnu112/experiments/mgla#>

Queries
which posons have the most highly abundant peptides

Queries
which experiments have the most highly abundant peptides

Reified statements
• Reified mgla data are much bigger (4 more statements/abundance)

• The really interesting queries return a Java out-of-memory error (-Xms1024M -Xmx1536M)

• Haven’t yet tested the shortcut path expression
{ {reifSubj} reifPred {reifObj} } pred {obj}
{ {seq} identifiedIn {ExpRep} } hasAbundance {abd}

[Figure: { Peptide Sequence identifiedIn Experiment Replicate } hasAbundance abundance]

<#WholeCell_Lvl7_02.12> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement>.
<#WholeCell_Lvl7_02.12> <http://www.w3.org/1999/02/22-rdf-syntax-ns#subject> <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla/WholeCell_Lvl7_02.1>.
<#WholeCell_Lvl7_02.12> <http://www.w3.org/1999/02/22-rdf-syntax-ns#predicate> <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla/InExperimentReplicate>.
<#WholeCell_Lvl7_02.12> <http://www.w3.org/1999/02/22-rdf-syntax-ns#object> <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla/wildtype/01_wc_01>.
<#WholeCell_Lvl7_02.12> <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla/PeptideAbundance> "2594".
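
The five statements above follow a fixed pattern, so they can be generated per measurement. The sketch below is one way to do that. It assumes the `mgla` namespace is the one declared in the queries. The function name and its arguments are hypothetical.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
MGLA = "http://www.francisella.org/novicida/schema/fnu112/experiments/mgla/"


def reify_abundance(statement_id, analysis, replicate, abundance):
    """Return the five lines that reify 'analysis InExperimentReplicate replicate'
    and attach the abundance value to the reified statement itself."""
    stmt = f"<#{statement_id}>"
    return [
        f"{stmt} <{RDF}type> <{RDF}Statement> .",
        f"{stmt} <{RDF}subject> <{MGLA}{analysis}> .",
        f"{stmt} <{RDF}predicate> <{MGLA}InExperimentReplicate> .",
        f"{stmt} <{RDF}object> <{MGLA}{replicate}> .",
        f'{stmt} <{MGLA}PeptideAbundance> "{abundance}" .',
    ]


lines = reify_abundance(
    "WholeCell_Lvl7_02.12", "WholeCell_Lvl7_02.1", "wildtype/01_wc_01", 2594
)
print("\n".join(lines))
```

Four of the five lines describe the statement and only the last carries the measurement, which is where the "4 more statements/abundance" overhead comes from.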

Comparison of integrated experimental data
Distinct and overlapping posons identified within each biological fraction (>20000)

[Venn diagram of the mem and sol fractions; counts shown: 171, 146, 185]

mem MINUS sol

select distinct psn from
{x} fns:poson {psn},
{x} fn:InExperimentReplicate {experiment},
{analysis} rdf:subject {x},
{analysis} rdf:object {exp},
{analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) > 20000
and experiment LIKE "*mem*"
MINUS
select distinct psn from
{x} fns:poson {psn},
{x} fn:InExperimentReplicate {experiment},
{analysis} rdf:subject {x},
{analysis} rdf:object {exp},
{analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) > 20000
and experiment LIKE "*sol*"
using namespace

sol MINUS mem: the same query with the "*mem*" and "*sol*" patterns swapped.
INTERSECT: the same two selects combined with INTERSECT in place of MINUS.
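
The MINUS and INTERSECT comparisons map directly onto set difference and set intersection. The sketch below does the same comparison client-side over (poson, experiment replicate, abundance) rows. The rows and poson names are made up, and the replicate names only imitate the `wildtype/01_wc_01` pattern.

```python
def posons_above(rows, fraction, threshold):
    """Posons with a peptide above the threshold in the given fraction."""
    return {
        psn
        for psn, experiment, abundance in rows
        if fraction in experiment and abundance > threshold
    }


# Made-up rows for illustration only.
ROWS = [
    ("PSN_A", "wildtype/01_mem_01", 25000),
    ("PSN_A", "wildtype/01_sol_01", 30000),
    ("PSN_B", "wildtype/01_mem_02", 41000),
    ("PSN_C", "wildtype/01_sol_02", 22000),
    ("PSN_D", "wildtype/01_mem_01", 1200),
]

mem = posons_above(ROWS, "mem", 20000)
sol = posons_above(ROWS, "sol", 20000)

mem_only = mem - sol   # mem MINUS sol
both = mem & sol       # mem INTERSECT sol
sol_only = sol - mem   # sol MINUS mem
print(sorted(mem_only), sorted(both), sorted(sol_only))
```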

Comparison of integrated experimental data
Distinct and overlapping posons identified within each biological fraction (<5000)

[Venn diagram of the mem and sol fractions; counts shown: 219, 125, 245]

mem MINUS sol (visible part of the query)

{analysis} rdf:object {exp},
where xsd:integer(abundance) < 5000
and experiment LIKE "*mem*"
MINUS
{analysis} rdf:object {exp},
where xsd:integer(abundance) < 5000
and experiment LIKE "*sol*"
using namespace

sol MINUS mem swaps the "*mem*" and "*sol*" patterns; the overlap uses INTERSECT in place of MINUS.

Further work
• Queries are slow in the native repository; database-backed repositories are probably faster.
• Adding a transcriptomic experiment:
WT vs mglA mutant
GEO accession GSE5468
• RDF-S inferencing?

Acknowledgements
• Funding: BBSRC - Radical Solutions for Researching the Proteome
• University of Glasgow, Glasgow
• Prof. Walter Kolch
• Dr Andy Pitt
• University of Strathclyde, Glasgow
• Dr Ela Hunt (Scientific Advisor)
• University of Washington, Seattle
• Prof. Dave Goodlett (Scientific Advisor)
• Dr Mitch Brittnacher, Mathew Radey, Laurence Rohmer
• Dr Tina Guina (MglA experiment)

Abundance thresholds....
• SeRQL aggregate functions would be nice to have

• Queries to find low and high abundance values:

• WHERE abundance BETWEEN MEDIAN(abundance) AND MAX(abundance)

• WHERE abundance BETWEEN MIN(abundance) AND MEDIAN(abundance)
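
Until such aggregates are available in the query language, the thresholds can be computed client-side from the returned abundance values. The sketch below splits values at the median with inclusive bounds, mirroring BETWEEN. The sample values are made up apart from 2594, which appears in the reified example.

```python
from statistics import median


def split_by_median(abundances):
    """Return (median, low, high), where low covers MIN..MEDIAN and
    high covers MEDIAN..MAX, both inclusive as with BETWEEN."""
    values = sorted(abundances)
    mid = median(values)
    low = [v for v in values if v <= mid]
    high = [v for v in values if v >= mid]
    return mid, low, high


mid, low, high = split_by_median([2594, 120, 48000, 15000, 730])
print(mid, low, high)
```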
