RDF what and why
Jerven Bolleman
Developer
Swiss-Prot Group
Marmota marmota marmota
Examples
• Parameter lists are between ()
VALUES (?annotation) {
(core:Disease_Annotation)
(core:Disu...
Examples
• UNDEF means no value at all
– not bound
VALUES (?annotation ?begin) {
(core:Dise...
VALUES
• After declaring a set of values you can use them in your query.
SELECT ?comm...
SERVICE: Using other SPARQL endpoints
• SERVICE <URL of other endpoint>
– Runs a sub query on t...
“Life is better with friends who understand you.”
SERVICE
SERVICE
• Useful
– Quick experimenting with combining multiple data sources
– Quick for queries...
SERVICE
• Slowly improving
• Theoretically unfixable
• Practically could be much better
• 10...
Let’s make some triples
Construction
• CONSTRUCT
– New triples
• downloads RDF
– Does not update store
Constructing an owl:sameAs between two URIs
INSERT
• Adds data
– like construct
DELETE
• Removes data
– Matching triples are removed from the data
– Triples can be ...
DELETE
DELETE INSERT
• Single atomic operation
• Transactions store API option
Atomic operation
© 2013 SIB
I’m exhausted now
Of Course Biology is complicated
#baseURI: http://purl.uniprot.org/unirule/UR000107224

#Rule UR000107224 Created by:bridg...
Questions
RDF: what and why plus a SPARQL tutorial
RDF what is it and why should we use it in the life sciences. Followed by an introduction to SPARQL in detail.

Published in: Science


  1. 1. RDF what and why Jerven Bolleman Developer Swiss-Prot Group
  2. 2. Introduction
      • RDF
      • It’s a technology
      • Cost and affordability are key concerns
  3. 3. ]<--.+++++++++++.++++++++.---------.>++++++++[<---------->-] <++.>+++++[<+++++++++++++>-]<.+++++++++++++.----------.>++++ +++[<---------->-]<++.>++++++++[<++++++++++>-]<.>+++[<-----> -]<.>+++[<++++++>-]<..>+++++++++[<--------->-]<--.>+++++++[< ++++++++++>-]<+++.+++++++++++.>++++++++[<----------->-]<++++ .>+++++[<+++++++++++++>-]<.>+++[<++++++>-]<-.---.++++++.---- ---.----------.>++++++++[<----------->-]<+.---.[-]<<<->[-]>[ -]<<[>+>+<<-]>>[<<+>>-]>>>[-]<<<+++++++++<[>>>+<<[>+>[-]<<-] >[<+>-]>[<<++++++++++>>>+<-]<<-<-]+++++++++>[<->-]>>+>[<[-]< <+>>>-]>[-]+<<[>+>-<<-]<<<[>>+>+<<<-]>>>[<<<+>>>-]<>>[<+>-]< <-[>[-]<[-]]>>+<[>[-]<-]<++++++++[<++++++<++++++>>-]>>>[>+>+ <<-]>>[<<+>>-]<[<<<<<.>>>>>-]<<<<<<.>>[-]>[-]++++[<++++++++> -]<.>++++[<++++++++>-]<++.>+++++[<+++++++++>-]<.><+++++..--- -----.-------.>>[>>+>+<<<-]>>>[<<<+>>>-]<[<<<<++++++++++++++ .>>>>-]<<<<[-]>++++[<++++++++>-]<.>+++++++++[<+++++++++>-]<- -.---------.>+++++++[<---------->-]<.>++++++[<+++++++++++>-] <.+++..+++++++++++++.>++++++++[<---------->-]<--.>+++++++++[ <+++++++++>-]<--.-.>++++++++[<---------->-]<++.>++++++++[<++ ++++++++>-]<++++.------------.---.>+++++++[<---------->-]<+. >++++++++[<+++++++++++>-]<-.>++[<----------->-]<.+++++++++++ ..>+++++++++[<---------->-]<-----.---.+++.---.[-]<<<] @
  4. 4. What is RDF? What? Why? SPARQL? Examples
  5. 5. RDF: Resource Description Framework
      • Resource
        – Generalization of “Web resource”
        – A thing that can be identified (but not necessarily retrieved) on the Web
      • Description
        – A resource is described with statements that specify the properties and property values of the resource
      • Statement (aka Triple)
        – subject: identifies the resource
        – predicate: identifies a property of the resource
        – object: identifies the value of that property
  6. 6. Everything can be described with (loads of) triples... [Diagram, a triple: Subject (resource) → Property (resource) → Object (resource or literal value)]
  7. 7. Related triples form a graph...
  8. 8. An RDF graph can be serialized in several ways
      • RDF/XML: the W3C’s official format
        – XML is well established: good for application developers
        – very verbose, not very “readable”
        – e.g. uniprot.org/uniprot/P00750.rdf
      • N-Triples
        – good for loading into triple stores
        – e.g. uniprot.org/uniprot/P00750.nt
      • Turtle ⟵ most examples will use this
        – good for reading by humans
        – e.g. uniprot.org/uniprot/P00750.ttl
      • JSON-LD
        – easy for JavaScript/websites
      • ....
      • Conversion 100% lossless
  9. 9. A simple example [Diagram, a triple: RDF What and why → presented by → “Jerven Bolleman” (literal value)]
  10. 10. RDF identifies resources with URIs [Diagram, a triple: UniProt.rdf What and why → presented by → expasy.org/people/Jerven_Tjalling.Bolleman.htm (URI)]
  11. 11. Multiple URIs may identify the same thing [Diagram, a triple: expasy.org/people/Jerven_Tjalling.Bolleman.htm → owl:sameAs → ch.linkedin.com/in/jervenbolleman]
  12. 12. The life sciences have an identity problem...
      • www.genenames.org/data/hgnc_data.php?hgnc_id=9993
        – RGS11: regulator of G-protein signaling 11
      • http://www.uniprot.org/taxonomy/9993
        – European alpine marmot
      • ...
      What is “9993”?
  13. 13. Hello, I a 9993. I like flower?
  14. 14. The solution is URIs
      • In RDF statements:
        – subjects and predicates must be URIs
        – objects may be URIs or literal values
      • Advantages:
        – No risk of “name clashes” when integrating data from different sources
        – Different people can make statements about the same resource: distributed annotation at a global scale!
  15. 15. Example: From tab-delimited to semantic. An example: tab-delimited, converted to RDF in Turtle format
  16. 16. Example: From tab-delimited to semantic
      Interactions.txt (each row will become a triple):
      P32234  Q9VGZ4
      P32234  P25724
      P32234  Q9V3H7
      P42643  Q00403
      P42643  P23312
      P42643  P31928
      P41932  Q9NAE1
      P41932  Q9TYY1
      P41932  Q10666
      P41932  Q21921
  17. 17. Example step 1: Use URIs for subjects and objects
      Interactions.txt:
      purl.uniprot.org/uniprot/P32234  ...prot/Q9VGZ4
      purl.uniprot.org/uniprot/P32234  ...prot/P25724
      purl.uniprot.org/uniprot/P32234  ...prot/Q9V3H7
      ...prot/P42643  ...prot/Q00403
      ...prot/P42643  ...prot/P23312
      ...prot/P42643  ...prot/P31928
      ...prot/P41932  ...prot/Q9NAE1
      ...prot/P41932  ...prot/Q9TYY1
      ...prot/P41932  ...prot/Q10666
      ...prot/P41932  ...prot/Q21921
  18. 18. Example step 2: Use shorthand syntax
      Interactions.txt:
      @prefix prot: <http://purl.uniprot.org/uniprot/> .
      prot:P32234  prot:Q9VGZ4 .
      prot:P32234  prot:P25724 .
      prot:P32234  prot:Q9V3H7 .
      prot:P42643  prot:Q00403 .
      prot:P42643  prot:P23312 .
      prot:P42643  prot:P31928 .
      prot:P41932  prot:Q9NAE1 .
      prot:P41932  prot:Q9TYY1 .
      prot:P41932  prot:Q10666 .
      prot:P41932  prot:Q21921 .
  19. 19. Example step 3: Make statements
      Interactions.txt:
      @prefix prot: <http://purl.uniprot.org/uniprot/> .
      prot:P32234 interacts_with prot:Q9VGZ4 .
      prot:P32234 interacts_with prot:P25724 .
      prot:P32234 interacts_with prot:Q9V3H7 .
      prot:P42643 interacts_with prot:Q00403 .
      prot:P42643 interacts_with prot:P23312 .
      prot:P42643 interacts_with prot:P31928 .
      prot:P41932 interacts_with prot:Q9NAE1 .
      prot:P41932 interacts_with prot:Q9TYY1 .
      prot:P41932 interacts_with prot:Q10666 .
      prot:P41932 interacts_with prot:Q21921 .
  20. 20. Example step 4: Use URIs for properties
      Interactions.ttl:
      @prefix prot: <http://purl.uniprot.org/uniprot/> .
      @prefix core: <http://purl.uniprot.org/core/> .
      prot:P32234 core:interacts_with prot:Q9VGZ4 .
      prot:P32234 core:interacts_with prot:P25724 .
      prot:P32234 core:interacts_with prot:Q9V3H7 .
      prot:P42643 core:interacts_with prot:Q00403 .
      prot:P42643 core:interacts_with prot:P23312 .
      prot:P42643 core:interacts_with prot:P31928 .
      prot:P41932 core:interacts_with prot:Q9NAE1 .
      prot:P41932 core:interacts_with prot:Q9TYY1 .
      prot:P41932 core:interacts_with prot:Q10666 .
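The four conversion steps can be sketched in a few lines of Python. This is only an illustration, not the tooling used for the slides: it assumes each input row holds two accessions separated by a tab, with the subject first, and it reuses the slides' illustrative core:interacts_with property. The function name to_turtle is made up for this sketch.

```python
# Sketch: convert tab-delimited protein pairs into Turtle statements.
# Assumption: each row is "subject<TAB>object"; core:interacts_with is the
# illustrative property used on the slides.

PREFIXES = {
    "prot": "http://purl.uniprot.org/uniprot/",
    "core": "http://purl.uniprot.org/core/",
}


def to_turtle(tab_delimited: str) -> str:
    """Return prefix declarations followed by one Turtle statement per row."""
    lines = [f"@prefix {name}: <{uri}> ." for name, uri in PREFIXES.items()]
    for row in tab_delimited.strip().splitlines():
        subject, obj = row.split("\t")
        lines.append(f"prot:{subject} core:interacts_with prot:{obj} .")
    return "\n".join(lines)


# First three pairs from Interactions.txt.
sample = "P32234\tQ9VGZ4\nP32234\tP25724\nP32234\tQ9V3H7"
turtle = to_turtle(sample)
print(turtle)
```

Because the prefixes expand to full URIs, the output can be merged with any other RDF that uses the same UniProt identifiers.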
  21. 21. RDF What? Quick recap
      • RDF describes data with statements (aka triples)
        – statement = subject + predicate + object
        – related statements form a directed graph
      • RDF uses URIs to identify things:
        – subjects and predicates must be URIs
        – objects may be URIs or literal values
      • Multiple serialisation formats that are 99.999999% automatically convertible
  22. 22. Why RDF? Isn’t there a simpler solution? What? Why? SPARQL? Examples
  23. 23. A very simple example: FASTA
      • Why does everyone in the sequence world use FASTA?
  24. 24. A very simple example: FASTA
      • Why does everyone in the sequence world use FASTA?
        – The smallest common denominator
        – You can put in the header what you like and I can choose to ignore it
      • BUT: You only get a sequence...
      >Who|cares_about:this?
      THISISWHATWEWANT
  25. 25. A simple example: GFF
      • Some people want to exchange more than sequences, and invented GFF:
      SEQ1 EMBL atg 103 105 . + 0
      SEQ1 EMBL exon 103 172 . + 0
      • BUT: ...
  26. 26. A simple example: GFF
      • Some people want to exchange more than sequences, and invented GFF:
      SEQ1 EMBL atg 103 105 . + 0
      SEQ1 EMBL exon 103 172 . + 0
      • BUT: What do the columns mean?
        – Originally, an exchange format for sequence feature descriptions, later also used for other annotations
        – 3 versions known (to me ;)
        – Not extendable without prior agreement of all users
  27. 27. A proper solution: XML
      • There is a world beyond sequences and bioinformatics!
      • XML is an IT-industry standard
        – Datatypes
        – Multi namespaces
        – Schemas
      • BUT:
        – Hierarchical data model
        – Schemas close extension
  28. 28. XML represents data as a tree
      • XML datatypes
        – Multi namespace
        – XML Schema closes extensions
      • Tree format
      [Tree diagram with nodes: entry, active, 196, Proton acceptor, EC, 2.7.11.-]
  29. 29. No XML standard for other relationships (prizes: a case study)
      • XML datatypes
        – Multi namespace
        – XML Schema closes extensions
      • Tree format
      [Tree diagram with nodes: entry, active, 196, Proton acceptor, EC, 2.7.11.-]
  30. 30. Our data is a graph! [Graph diagram with nodes: entry, active, 196, Proton acceptor, EC, 2.7.11.-]
  31. 31. RDF advantages
      • W3C standard
      • Can be serialized as XML or JSON
        – i.e. most benefits of XML or JSON
      • Generic graph structure
      • URIs as a standard way to identify resources and their properties
        – data integration without name clashes
        – distributed annotation
        – normalization
      • Extensible!
  32. 32. RDF is extensible
      • Anyone can say Anything about Anything
        – You can say something about my data
      • RDF extensions remain compatible
      • RDF encourages data and schema reuse
      Interactions.ttl:
      @prefix prot: <http://purl.uniprot.org/uniprot/> .
      @prefix intact: <http://fake.ebi.ac.uk/intact/example> .
      prot:P32234 intact:interacts_with prot:Q9VGZ4 .
      prot:P32234 intact:interacts_with prot:P25724 .
  33. 33. RDF data model is simple
      • Everything can be said with triples
      • Generic triple stores
        – low maintenance data integration
      • SPARQL
        – SQL for RDF
        – XPath for RDF
        – Regular expressions for RDF
  34. 34. Comparison
                          Flat file   XML   RDF
      Standard            NO          YES   YES
      Scalable            NO          YES   YES +
      Extendable          NO          NO    YES
      Generic data model  NO          NO    YES
  35. 35. Modeling data using RDF
  36. 36. Most common failure in RDF world: Philosophy over pragmatism
      1. Be honest about your data
         • what you have: not what you want
      2. Change the concept, change the IRI
         • One concept can be referred to by multiple IRIs
      3. Better to “todo” than to “debate”
  37. 37. Model real data, not the “real world”
      • Describe records that relate to real world things
      • Acknowledge that they are records
      • Model measurements before “facts”
  38. 38. Example: mouse in a lab 1.5g <weight>
  39. 39. Example: mouse in a lab 1.5g <weight> 20g <weight>
  40. 40. TIME it made you a liar
  41. 41. Example: mouse in a lab [Diagram: the mouse has two <measurement> nodes; _:1 has <weight> 1.5g and <age> 1week; _:2 has <weight> 20g and <age> 3week]
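A small sketch of why the measurement nodes help: once each weight hangs off its own node together with an age, both weights can be true at once and can be told apart. The identifier <mouse> is hypothetical (the slide shows a picture), the pairing of weights with ages is read off the diagram, and the triples are plain Python tuples rather than a real RDF library.

```python
from typing import Optional

# Sketch: one blank node per weighing. "<mouse>" is a hypothetical identifier;
# the other terms come from the slide.
TRIPLES = [
    ("<mouse>", "<measurement>", "_:1"),
    ("_:1", "<weight>", "1.5g"),
    ("_:1", "<age>", "1week"),
    ("<mouse>", "<measurement>", "_:2"),
    ("_:2", "<weight>", "20g"),
    ("_:2", "<age>", "3week"),
]


def weight_at(age: str) -> Optional[str]:
    """Return the weight recorded by the measurement taken at the given age."""
    for node, prop, value in TRIPLES:
        if prop == "<age>" and value == age:
            for subject, inner_prop, inner_value in TRIPLES:
                if subject == node and inner_prop == "<weight>":
                    return inner_value
    return None


print(weight_at("1week"), weight_at("3week"))
```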
  42. 42. Describing models using OWL
  43. 43. OWL: Web Ontology Language
      • Will be presented in detail during the week
      • Logical meaning added to RDF statements
        – That tools use
      • Classifies existing data or infers new data
      • Very powerful and useful
  44. 44. DANGER: It is pure logic (first order)
  45. 45. Classification by restricting set membership
      <human> a owl:Class ;
        rdfs:subClassOf [ a owl:Restriction ; owl:onProperty <legs> ; owl:cardinality 2 ] ;
        rdfs:subClassOf [ a owl:Restriction ; owl:onProperty <brains> ; owl:cardinality 1 ] ;
        rdfs:subClassOf [ a owl:Restriction ; owl:onProperty <referenceGenome> ; owl:allValuesFrom <HGCHR_genome> ] ;
        …
  46. 46. Classification by restricting set membership (same class definition as on the previous slide)
      Lose a leg → no longer human
  47. 47. Validating RDF Data
  48. 48. W3C working group in progress
      • Data Shapes
      • You don’t want to know how the sausage is made…
      • Vendors looking forward to implementing it
      • Currently not that bad, could be better
      • First Working Draft
  49. 49. SPARQL What? Why? SPARQL? Examples
  50. 50. Why provide a public SPARQL endpoint
      • A 10-man wet laboratory cannot afford:
        – to host their own database in house, holding all or even a bit of all life science data
        – not to have access to, and use, existing life science information
  51. 51. The right kind of optimisation: not CPU time... but brain time
  52. 52. Why provide a public SPARQL endpoint
      • Classical SQL can be provided on the web
        – Is not practical
        – No federation
        – Poor standards conformance
      • Local SQL is expensive
      • Local JSON is no better
      • Nor is local XML
  53. 53. Data Integration Traditional [Diagram: Pathway.txt, Pathway parser, Pathway schema; UniProt.txt, UniProt parser, UniProt schema; own lab data; data warehouse; SQL queries. Cost markers: $ $ $ $ $ $]
  54. 54. Data Integration RDF/SPARQL [Diagram: Pathway.rdf, UniProt.rdf, own lab data; triple store; SPARQL queries. Cost markers: $ $?]
  55. 55. Why provide a public SPARQL endpoint
      • Document centric REST is not enough
        – Swiss-Prot available as REST
        – (over e-mail !!) since 1986
        – expasy.ch since 1993
        – www.uniprot.org since 2002
      • Most users use a GUI, not a CLI
      • Developers build GUIs on a CLI
  56. 56. © 2015 SIB
  58. 58. help@uniprot.org
  59. 59. [Chart: queries per month in 2015 (2015-01 to 2015-08), axis ticks at 100, 10'000 and 1'000'000, by query type (ask, select, construct, describe); peak: 4 million per month]
  60. 60. Real users
      Mix between hard analytics and super specific
      Estimate somewhere between 300 and 1000 real humans per month
      We know they are real because they take holidays ;)
  61. 61. Using the Semantic Web for faster (Bio-) Research
  62. 62. Exercises with SPARQL tutorial.sparql.uniprot.org
  63. 63. Why learn SPARQL
      • Standardised formal query language
        – implementation independent
      • SPARQL ➔ SQL (via R2RML)
      • SPARQL ➔ webservice (via SADI)
      • SPARQL ➔ LDAP (e.g. SquirrelRDF)
      • SPARQL ➔ RDF (triplestore e.g. OWLIM-se)
      • SPARQL ➔ HADOOP/HIVE (e.g. SHARD)
      • SPARQL ➔ Linked Data Fragments
        – How you query is independent of how you store!
  64. 64. Apparently it helps kill vampires !!!
  65. 65. It’s SPARQLy mammal time !!
  66. 66. Let’s look at a single taxon record www.uniprot.org/taxonomy/9993
  68. 68. @base <http://purl.uniprot.org/taxonomy/> .
      @prefix foaf: <http://xmlns.com/foaf/0.1/> .
      @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
      @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
      @prefix skos: <http://www.w3.org/2004/02/skos/core#> .
      @prefix up: <http://purl.uniprot.org/core/> .
      <9993> rdf:type up:Taxon ;
        up:rank up:Species ;
        up:reviewed true ;
        up:mnemonic "MARMR" ;
        up:scientificName "Marmota marmota" ;
        up:commonName "Alpine marmot" ;
        up:otherName "European marmot" ;
        rdfs:seeAlso <http://animaldiversity.ummz.umich.edu/site/accounts/information/Marmota_marmota.html> ,
          <http://www.alphagalileo.org/Organisations/ViewItem.aspx?OrganisationId=2043&ItemId=70106&CultureCode=en> ,
          <http://www.arkive.org/alpine-marmot/marmota-marmota/info.html> ,
  69. 69. Turtle is the RDF serialization aligned with SPARQL
      • Shorthand to avoid typing so much
        – . ‘dot’ is end of statement
        – ; ‘semicolon’ is repeat subject
        – , ‘comma’ is repeat subject and predicate
      • prefix
        – before ‘:’ is abbreviation of URI
  70. 70. Why don’t these queries work elsewhere?
      • PREFIX
        – On the web you often have to add these
        – But some can be preconfigured
      PREFIX : <http://purl.uniprot.org/core/>
      SELECT ?x
      FROM <http://purl.uniprot.org/taxonomy/>
      WHERE { ?x a :Taxon }
  71. 71. a = rdf:type = <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
  72. 72. <9993> rdf:type up:Taxon ;
        up:rank up:Species ;
        up:reviewed true ;
        up:mnemonic "MARMR" ;
        up:scientificName "Marmota marmota" ;
        up:commonName "Alpine marmot" ;
        up:otherName "European marmot" ;
        rdfs:subClassOf <9992> ;
        skos:narrowerTransitive <9994> ;
      Callouts: rdfs:subClassOf; taxon:9994; “is a more specific classification than”
  73. 73. (same record as on the previous slide)
      rank => “The level, for nomenclatural purposes, of a taxon in a taxonomic hierarchy”
  74. 74. Let’s learn SPARQL
      • Queries over RDF data.
        – Four basic types
      • SELECT
        – Returns “tab delimited” results
      • CONSTRUCT
        – Makes new triples
      • DESCRIBE
        – Returns all triples mentioning a resource
      • ASK
        – Returns true or false
  75. 75. SPARQL: queries triple pattern
      taxon:9606 rdf:type core:Taxon .
  76. 76. SPARQL: queries triple pattern
      ?anyTaxon rdf:type core:Taxon .
  77. 77. SPARQL: queries triple pattern
      SELECT ?anyTaxon
      WHERE {
        ?anyTaxon rdf:type core:Taxon .
      }
  78. 78. SPARQL: queries triple pattern
      taxon:9606 rdf:type core:Taxon .
      taxon:9606 core:reviewed true .
  79. 79. SPARQL: queries triple pattern
      ?anyTaxon rdf:type core:Taxon .
      ?anyTaxon core:reviewed true .
  80. 80. SPARQL: queries triple pattern
      SELECT ?anyTaxon
      WHERE {
        ?anyTaxon rdf:type core:Taxon .
        ?anyTaxon core:reviewed true .
      }
  81. 81. SPARQL: queries triple pattern
      SELECT ?anyTaxon
      WHERE {
        ?anyTaxon rdf:type core:Taxon .
        ?anyTaxin core:reviewed true .
      }
  82. 82. SPARQL: queries triple pattern
      SELECT ?anyTaxon
      WHERE {
        ?anyTaxon rdf:type core:Taxon .
        $anyTaxon core:reviewed true .
      }
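The last three slides differ only in the variable name of the second pattern. A toy matcher in Python makes the consequence visible. It is a sketch of basic graph pattern matching over tuples, not how a real triple store evaluates queries, and the data (including taxon:unreviewedExample) is made up for illustration. A repeated variable forces a join, a misspelled one (?anyTaxin) is simply a second variable and yields a cross product, and $anyTaxon names the same variable as ?anyTaxon.

```python
# Toy basic-graph-pattern matcher over (subject, predicate, object) tuples.


def is_var(term):
    """SPARQL variables start with '?' or '$'; both prefixes name the same variable."""
    return term.startswith("?") or term.startswith("$")


def match(patterns, triples):
    """Return every variable binding (a dict) that satisfies all patterns."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for binding in solutions:
            for triple in triples:
                candidate = dict(binding)
                ok = True
                for p_term, t_term in zip(pattern, triple):
                    if is_var(p_term):
                        # Bind the variable, or check it against its earlier value.
                        if candidate.setdefault(p_term[1:], t_term) != t_term:
                            ok = False
                            break
                    elif p_term != t_term:
                        ok = False
                        break
                if ok:
                    next_solutions.append(candidate)
        solutions = next_solutions
    return solutions


# Hypothetical data: two reviewed taxa and one unreviewed placeholder.
DATA = [
    ("taxon:9606", "rdf:type", "core:Taxon"),
    ("taxon:9606", "core:reviewed", "true"),
    ("taxon:9993", "rdf:type", "core:Taxon"),
    ("taxon:9993", "core:reviewed", "true"),
    ("taxon:unreviewedExample", "rdf:type", "core:Taxon"),
]

TYPE = ("?anyTaxon", "rdf:type", "core:Taxon")
joined = match([TYPE, ("?anyTaxon", "core:reviewed", "true")], DATA)
typo = match([TYPE, ("?anyTaxin", "core:reviewed", "true")], DATA)
dollar = match([TYPE, ("$anyTaxon", "core:reviewed", "true")], DATA)
print(len(joined), len(typo), len(dollar))
```

With the typo, every taxon is paired with every reviewed taxon, so the unreviewed one shows up in the ?anyTaxon column as well.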
  83. 83. tutorial.sparql.uniprot.org
  84. 84. 1: Select all taxa from NCBI/UniProt taxonomy
      • Taxonomy at www|sparql.uniprot.org
      • Matches NCBI
      • Time sync
      • Adds more names
      • Adds images
  86. 86. Let’s learn SPARQL: shorthand a = rdf:type
  87. 87. 2: AND join (default)
  88. 88. 3: Shortcuts
  89. 89. Remember ‘;’ shortcut
  90. 90. 4: Two variables one output column
  91. 91. 5: Optional
      • When values may be missing
        – yet interesting when they are there
      • Use as sub query
      • bound values from outside stay bound inside
        – ?x ?y ?z . OPTIONAL { ?x ?b ?c }
      • ?x same variable = same thing
  92. 92. 5: OPTIONAL commonName
  93. 93. 6: UNION
      • Allows you to combine query patterns as an OR operation.
      • Joins are still from outer to inner.
  94. 94. UNION
  95. 95. Negation
      • When you do not want a certain category of matches.
      SELECT ?pet
      WHERE {
        ?pet a pets:Friendly .
      }
  96. 96. Oooops
  97. 97. 7: Not exists (Negation 1)
  98. 98. 8: Minus (Negation 2)
  99. 99. MINUS{} or FILTER (NOT EXISTS{})
      • What’s the difference?
        – MINUS subtracts results
        – NOT EXISTS tests if the sub pattern is possible at all.
      • Normally the faster option.
  100. 100. 9: MINUS all data
  101. 101. 10: FILTER (NOT EXISTS{}) no results
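The exercise queries themselves are not part of this transcript, but the two slide titles match a well-known corner case, sketched here in Python over lists of solution dicts. It assumes the inner pattern shares no variable with the outer one (for example { ?x ?y ?z }). MINUS only removes a solution when a compatible inner solution shares at least one variable with it, so nothing is removed; NOT EXISTS asks whether the inner pattern can match at all, so everything is removed. Function and variable names are made up for this sketch, and filter_not_exists is a simplification of the real substitution semantics.

```python
# Sketch of MINUS versus FILTER NOT EXISTS on solution sets (lists of dicts).


def compatible(a, b):
    """Two solutions are compatible if they agree on every shared variable."""
    return all(a[k] == b[k] for k in a.keys() & b.keys())


def minus(left, right):
    """MINUS: drop a row only if a compatible right row shares a variable with it."""
    return [row for row in left
            if not any((row.keys() & r.keys()) and compatible(row, r) for r in right)]


def filter_not_exists(left, right):
    """Simplified NOT EXISTS: drop a row if any right row is compatible with it."""
    return [row for row in left if not any(compatible(row, r) for r in right)]


outer = [{"s": "taxon:9606"}, {"s": "taxon:9993"}]
inner_disjoint = [{"x": "taxon:9993", "y": "up:rank", "z": "up:Species"}]
inner_shared = [{"s": "taxon:9993"}]

print(minus(outer, inner_disjoint))              # all data kept
print(filter_not_exists(outer, inner_disjoint))  # no results
print(minus(outer, inner_shared))                # taxon:9993 removed
```

When the inner pattern does share a variable with the outer one, both forms remove the same rows in this sketch.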
  102. 102. 11: Negation option 3 (SPARQL 1.0)
      SELECT ?subject ?rank
      WHERE {
        ?subject up:rank ?rank .
        OPTIONAL { ?subject up:rank up:Genus . ?subject up:rank ?genus . }
        FILTER (!BOUND(?genus))
      }
  103. 103. FILTERS
      • You just saw it twice
        – Once in the !BOUND
        – Once in the NOT EXISTS
      • FILTER restricts a result set by possibly removing values
        – FILTERs do not add a value to the result
      • Inside the same graph pattern, order independent.
  104. 104. 12: Filter
  105. 105. 13: Filter on not in
  106. 106. Using implicit AND between lines
  108. 108. 15: FILTER IN
  109. 109. 16: FILTER using OR
  110. 110. FILTER on numbers
      • <
        – FILTER (1 < 2) (17)
      • >
        – FILTER (2 > 1) (18)
      • =
        – FILTER (1 = 1) (19)
      • !=
        – FILTER (1 != 2) (20)
  111. 111. Filters
      • ?x = ?y does casting (value conversions) (21)
        – "1.0"^^xsd:float = "1"^^xsd:int is true
      • sameTerm(?x, ?y) does not (22)
        – sameTerm("1.0"^^xsd:float, "1"^^xsd:int) is false
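The difference between = and sameTerm can be mimicked in Python by modelling a literal as a (lexical form, datatype) pair. This is an analogy for numeric literals only, not SPARQL's actual operator mapping, and the function names are made up for this sketch.

```python
from decimal import Decimal

# A typed literal is modelled as (lexical form, datatype).
one_float = ("1.0", "xsd:float")
one_int = ("1", "xsd:int")


def value_equal(a, b):
    """Like '=': compare the numeric values, ignoring the datatype."""
    return Decimal(a[0]) == Decimal(b[0])


def same_term(a, b):
    """Like sameTerm: lexical form and datatype must both be identical."""
    return a == b


print(value_equal(one_float, one_int))  # equal as values
print(same_term(one_float, one_int))    # different as terms
```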
  112. FUNCTIONS for use in filters and in binds • Functions – STRLEN – SUBSTR – UCASE – LCASE – STRSTARTS – STRENDS – CONTAINS – STRBEFORE – STRAFTER – ENCODE_FOR_URI – CONCAT – langMatches – REGEX – REPLACE – IRI – STR
  113. 24: SUBSTR == substring
  114. 24: STRLEN == String Length
  115. 25: CONTAINS is case sensitive: is it in there?
  116. 26: REGEX, just like java|python regex
  117. BIND • Builds new Values – Closes the basic graph pattern (22) • Always declare before use. SELECT ?p WHERE { { ?taxon a :Taxon . } BIND (?taxon AS ?p) }
  118. BIND existing variable to a new one
  119. 27: CONCAT
  120. BIND can assign any output
  121. Aggregate functions • Used on the SELECT line • Limited in number – count – sum – avg – min – max – group_concat – sample
  122. © 2013 SIB 30: count
  123. 31: SAMPLE gives an arbitrary (not necessarily random) result back
  124. Follow the path
  125. 32: Path queries
  126. 33: Finding a grand parent using normal joins
  127. 34: Finding a grandParent using a path query
  128. 35: | is OR for predicate
  129. 36: Same result with UNION
  130. 37: Finding any ancestor
  131. 38: Can use the variable in a normal join afterwards
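The two path shapes mentioned here, a fixed two-step path (written `p/p` in SPARQL 1.1) and a one-or-more path (`p+`), can be sketched in plain Python over a child-to-parent map. The taxon names below are invented sample data standing in for triples of the form `?child p ?parent`, and each node is assumed to have at most one parent.

```python
# Hypothetical child -> parent edges.
parent_of = {
    "marmota_marmota_marmota": "marmota_marmota",
    "marmota_marmota": "marmota",
    "marmota": "sciuridae",
}

def grandparent(node):
    """Fixed-length path p/p: exactly two hops, or None if there is no such path."""
    parent = parent_of.get(node)
    return parent_of.get(parent)

def ancestors(node):
    """Path p+: follow the edge one or more times, guarding against cycles."""
    found = []
    seen = set()
    current = parent_of.get(node)
    while current is not None and current not in seen:
        found.append(current)
        seen.add(current)
        current = parent_of.get(current)
    return found

print(grandparent("marmota_marmota_marmota"))
print(ancestors("marmota_marmota_marmota"))
```

The list returned by `ancestors` plays the role of the bound variable that can be joined against further patterns afterwards.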
  132. GROUP BY
  133. GROUP BY • Needed for aggregate values • After closing the where clause – ... WHERE {?x ?y ?z} GROUP BY ?x
  134. 39: GROUP BY
  135. HAVING • I have carrot!
  136. HAVING • FILTER for aggregates • After the GROUP BY clause – ... GROUP BY ?x HAVING (count(?y) > 2) – ... GROUP BY ?x HAVING (min(?y) = 2) – etc...
  137. 40: HAVING
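GROUP BY followed by HAVING amounts to grouping solutions, computing one aggregate per group, and then filtering on that aggregate. A minimal sketch in plain Python, assuming invented rows for `?x` and `?y`, models `... GROUP BY ?x HAVING (count(?y) > 2)`:

```python
from collections import defaultdict

# Hypothetical solutions of WHERE {?x ?p ?y}, reduced to the two variables of interest.
rows = [
    {"x": "a", "y": 1}, {"x": "a", "y": 2}, {"x": "a", "y": 3},
    {"x": "b", "y": 2},
]

# GROUP BY ?x: collect the ?y values of each group.
groups = defaultdict(list)
for row in rows:
    groups[row["x"]].append(row["y"])

# Aggregate: count(?y) per group.
counts = {x: len(ys) for x, ys in groups.items()}

# HAVING (count(?y) > 2): filter on the aggregate, after grouping.
having = {x: n for x, n in counts.items() if n > 2}
print(counts, having)
```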
  138. LIMITS & OFFSET
  139. 41: LIMIT and OFFSET • OFFSET skips the first x results • LIMIT returns no more than x results
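LIMIT and OFFSET together behave like a slice over the result sequence. The sketch below uses an invented, already ordered result list and a hypothetical `page` helper; in a real query the pages are only stable when the results are ordered.

```python
results = ["r1", "r2", "r3", "r4", "r5"]  # hypothetical ordered results

def page(rows, limit, offset=0):
    """Skip the first `offset` rows, then return at most `limit` rows."""
    return rows[offset:offset + limit]

second_page = page(results, limit=2, offset=2)
last_page = page(results, limit=10, offset=4)
print(second_page, last_page)
```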
  140. ORDER
  141. ORDER
  143. VALUES • Super BIND • Provide inline data
  144. Marmota marmota marmota
  145. Examples • Parameter lists are between () VALUES (?annotation) { (core:Disease_Annotation) (core:Disulfide_Bond_Annotation) }
  146. Examples • UNDEF means no value at all: not bound VALUES (?annotation ?begin) { (core:Disease_Annotation UNDEF) (core:Disulfide_Bond_Annotation 2) }
  147. VALUES • After declaring a set of values you can use them in your query. SELECT ?comment WHERE { VALUES (?annotation ?begin) { (core:Disease_Annotation UNDEF) (core:Disulfide_Bond_Annotation 2) } ?annotation rdfs:comment ?comment . }
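A VALUES block can be thought of as a small inline table that is joined with the rest of the query, where UNDEF leaves a variable unbound so that it matches anything. The sketch below models this in plain Python; the `pattern` solutions with their `begin` positions are invented, and only the VALUES rows come from the slides.

```python
def compatible(a, b):
    """Two solutions are compatible if every shared variable has the same value."""
    return all(a[k] == b[k] for k in a.keys() & b.keys())

def join(left, right):
    """Inner join of two lists of solutions."""
    return [{**l, **r} for l in left for r in right if compatible(l, r)]

# VALUES (?annotation ?begin) { (core:Disease_Annotation UNDEF) (core:Disulfide_Bond_Annotation 2) }
values = [
    {"annotation": "core:Disease_Annotation"},  # UNDEF: ?begin is left unbound
    {"annotation": "core:Disulfide_Bond_Annotation", "begin": 2},
]

# Hypothetical solutions of a pattern that binds ?annotation and ?begin.
pattern = [
    {"annotation": "core:Disease_Annotation", "begin": 10},
    {"annotation": "core:Disulfide_Bond_Annotation", "begin": 2},
    {"annotation": "core:Disulfide_Bond_Annotation", "begin": 7},
]

result = join(values, pattern)
print(result)
```

The disease annotation matches whatever its `begin` value is, while the disulfide bond annotation only matches when `begin` is 2.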
  148. SERVICE: Using other SPARQL endpoints • SERVICE <URL of other endpoint> – Runs a sub query on the other endpoint and merges it back into your query.
  149. “Life is better with friends who understand you.”
  150. SERVICE
  151. SERVICE • Useful – Quick experimenting with combining multiple data sources – Quick for queries where not too much data is sent to the remote endpoint • Slow – When you ask for too much data – Remote endpoint not resourced for your questions
  152. SERVICE • Slowly improving • Theoretically unfixable • Practically could be much better • A 1000x speed-up is a small step away
  153. Let's make some triples
  154. Construction • CONSTRUCT – New triples • downloads RDF – Does not update store
  155. Constructing an owl:sameAs between two URIs
  156. INSERT • Adds data – like construct
  157. DELETE • Removes data – Triples matching are removed from the data – Triples can be bound using where clause
  158. DELETE
  159. DELETE INSERT • Single atomic operation • Transactions store API option
  160. Atomic operation
  161. I’m exhausted now
  162. Of Course Biology is complicated
    #baseURI: http://purl.uniprot.org/unirule/UR000107224
    #Rule UR000107224 Created by:bridge on:2009-02-12 Modified by:rantunes on:2015-06-09
    PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX uniprot:<http://purl.uniprot.org/uniprot/>
    PREFIX sequence:<http://purl.uniprot.org/sequences/>
    PREFIX unirule:<http://purl.uniprot.org/unirules/>
    PREFIX taxon:<http://purl.uniprot.org/taxonomy/>
    PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
    PREFIX hamap-sparql:<http://example.org/hamap_sparql/>
    PREFIX up:<http://purl.uniprot.org/core/>
    PREFIX faldo:<http://biohackathon.org/resource/faldo#>
    PREFIX method:<http://example.org/method/>
    PREFIX keyword:<http://purl.uniprot.org/keywords/>
    PREFIX owl:<http://www.w3.org/2002/07/owl#>
    PREFIX proteome:<http://purl.uniprot.org/proteomes/>
    PREFIX hamap:<http://purl.uniprot.org/hamap/>
    PREFIX annotation:<http://purl.uniprot.org/annotation/>
    PREFIX xsd:<http://www.w3.org/2001/XMLSchema#>
    CONSTRUCT {
      ?this up:annotation ?annotation0, ?annotation1, ?annotation2, ?annotation3, ?annotation5;
        up:classifiedWith <http://purl.obolibrary.org/obo/19805>, <http://purl.obolibrary.org/obo/334>, <http://purl.obolibrary.org/obo/34354>, <http://purl.obolibrary.org/obo/43420>, <http://purl.obolibrary.org/obo/6569>, <http://purl.obolibrary.org/obo/8198>, keyword:223, keyword:560, keyword:662 .
      ?annotation0 a up:Function_Annotation;
        rdfs:comment "Catalyzes the oxidative ring opening of 3-hydroxyanthranilate to 2-amino-3-carboxymuconate semialdehyde, which
  163. Questions
