[Figure: Linking Open Data cloud diagram, showing interlinked datasets such as DBpedia, Bio2RDF, RKB Explorer, StatusNet, Freebase, YAGO and GeoNames]
Functional Manipulation
of Large Data Graphs
David Hyland-Wood
david.wood@ephox.com
@prototypo
1 June 2016
[Diagram: statements as labelled edges in a graph]
Something -- a relationship --> Something else
UQ -- is a --> University
UQ -- label --> The University of Queensland
UQ -- affiliation --> Group of 8
We’ve Seen This Before
08 Oct 2007
The RDF Data Model
Standard serialisation formats:
• Turtle
• TriG
• N-Triples
• N-Quads
  (the four formats above form the Turtle family of RDF formats)
• JSON-LD
• RDFa
• RDF/XML
Possibly lossy alternatives:
• CSV
• ODATA
• etc
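As a rough illustration of what these formats serialise, the sketch below models a triple in plain Scala and renders it as one line of N-Triples. The type names (`Iri`, `Literal`, `Triple`, `NTriples`) are made up for this example, and blank nodes and datatyped literals are left out to keep it short.

```scala
// Minimal sketch of the RDF data model; all names here are illustrative only.
sealed trait Node
final case class Iri(value: String) extends Node
final case class Literal(value: String, lang: Option[String] = None) extends Node

// A triple: subject and predicate are IRIs, the object is an IRI or a literal.
final case class Triple(subject: Iri, predicate: Iri, obj: Node)

object NTriples {
  // Escape the characters that cannot appear raw inside a quoted literal.
  private def escape(s: String): String =
    s.flatMap {
      case '\\' => "\\\\"
      case '"'  => "\\\""
      case '\n' => "\\n"
      case '\r' => "\\r"
      case c    => c.toString
    }

  def renderNode(node: Node): String = node match {
    case Iri(v)                 => "<" + v + ">"
    case Literal(v, Some(lang)) => "\"" + escape(v) + "\"@" + lang
    case Literal(v, None)       => "\"" + escape(v) + "\""
  }

  // One triple per line, terminated by " ."
  def renderTriple(t: Triple): String =
    renderNode(t.subject) + " " + renderNode(t.predicate) + " " + renderNode(t.obj) + " ."
}
```

Rendering a triple with the subject `http://dbpedia.org/resource/University_of_Queensland`, the `rdfs:label` predicate and an English literal gives a single self-contained line, which is what makes N-Triples convenient for line-oriented bulk processing.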
https://en.wikipedia.org/wiki/University_of_Queensland
# HTML
$ curl http://dbpedia.org/page/University_of_Queensland
# RDF in XML (Yuck!)
$ curl http://dbpedia.org/data/University_of_Queensland
# Many formats, e.g. sane RDF, ODATA, Microdata, JSON…
$ curl http://dbpedia.org/data/University_of_Queensland.n3 > University_of_Queensland.n3
[Diagram: properties of UQ]
UQ -- label --> The University of Queensland
UQ -- affiliation --> Group of 8
UQ -- number of undergraduate students --> 34228
UQ -- number of students --> 48771
# G8 universities ordered by the number of students
# at each university.
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?name ?students ?undergrads
WHERE {
  ?s dbo:affiliation <http://dbpedia.org/resource/Group_of_Eight_(Australian_universities)> .
  ?s rdfs:label ?name .
  OPTIONAL { ?s dbo:numberOfStudents ?students }
  OPTIONAL { ?s dbo:numberOfUndergraduateStudents ?undergrads }
  FILTER ( lang(?name) = "en" )
} ORDER BY DESC(?students)
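The shape of this query (match, optional values, filter, sort descending) maps closely onto ordinary collection operations. The sketch below is only an analogy in plain Scala, not how a SPARQL engine evaluates the query: the `University` record and the `go8ByStudents` function are invented for illustration, and each row is assumed to carry a single label with its language tag.

```scala
object Go8Query {
  // One row per university; Option stands in for the OPTIONAL clauses.
  final case class University(
      name: String,
      lang: String,
      affiliations: Set[String],
      students: Option[Int],
      undergrads: Option[Int])

  val Go8 = "http://dbpedia.org/resource/Group_of_Eight_(Australian_universities)"

  // Keep English-labelled rows affiliated with the Go8, largest student count first.
  // Rows without a student count sort last, since None orders below any Some.
  def go8ByStudents(rows: Seq[University]): Seq[(String, Option[Int], Option[Int])] =
    rows
      .filter(u => u.affiliations.contains(Go8) && u.lang == "en")
      .sortBy(u => u.students)(Ordering[Option[Int]].reverse)
      .map(u => (u.name, u.students, u.undergrads))
}
```

The result is a sequence of `(name, students, undergrads)` tuples in which missing counts stay visible as `None` rather than dropping the row, which mirrors what OPTIONAL does in the query.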
OpenStreetMap
Wikimedia Commons
DBpedia
US EPA RCRA
US EPA FRS
ABT Associates
[Diagram: graph of the Group of 8 universities]
Nodes and their labels:
UQ: The University of Queensland
ANU: Australian National University
UMelbourne: University of Melbourne
Monash: Monash University
UAdelaide: University of Adelaide
USydney: University of Sydney
UNSW: University of NSW
Go8: Group of 8
Edges: seven memberOf edges, one from each university to Go8, and five edges labelled affiliation, drawn at Monash, UMelbourne, UNSW, USydney and UAdelaide.
[Diagram: the same graph without the Go8 node and its memberOf edges. UQ (The University of Queensland) and ANU (Australian National University) keep their labels, and the five affiliation edges at Monash, UMelbourne, UNSW, USydney and UAdelaide remain.]
Graphs in Scala
import org.apache.spark.graphx._

// vertexRDD and edgeRDD are assumed to have been built earlier (not shown here).
val graph: Graph[String, String] =
  Graph(vertexRDD, edgeRDD)

// Create a subgraph based on the vertices connected
// by an "affiliation" property.
val affiliationRelatedSubgraph =
  graph.subgraph(t => t.attr ==
    "http://dbpedia.org/ontology/affiliation")

// Find connected components of affiliationRelatedSubgraph.
val ccGraph =
  affiliationRelatedSubgraph.connectedComponents()
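To make the two steps concrete without a Spark cluster, here is a small plain-Scala sketch of the same idea: filter the edges by attribute, then label every vertex with the smallest vertex id in its component (the labelling convention GraphX's `connectedComponents` also uses). The `Edge` class and both functions are illustrative stand-ins, not the GraphX API, and the simple fixed-point loop is only meant for small graphs.

```scala
object ConnectedComponents {
  type VertexId = Long
  final case class Edge(src: VertexId, dst: VertexId, attr: String)

  // Keep only the edges whose attribute satisfies the predicate.
  def subgraph(edges: Seq[Edge])(keep: Edge => Boolean): Seq[Edge] =
    edges.filter(keep)

  // Label every vertex with the smallest vertex id in its component,
  // treating edges as undirected. Every edge endpoint must be in `vertices`.
  def components(vertices: Set[VertexId], edges: Seq[Edge]): Map[VertexId, VertexId] = {
    @annotation.tailrec
    def loop(labels: Map[VertexId, VertexId]): Map[VertexId, VertexId] = {
      // One pass: push the lower of the two labels across each edge.
      val next = edges.foldLeft(labels) { (acc, e) =>
        val low = math.min(acc(e.src), acc(e.dst))
        acc.updated(e.src, low).updated(e.dst, low)
      }
      // Stop once a full pass changes nothing.
      if (next == labels) labels else loop(next)
    }
    loop(vertices.map(v => v -> v).toMap)
  }
}
```

Called with only the affiliation edges, vertices that were connected solely through a shared hub (such as a memberOf target) fall into separate components, which is the effect the subgraph step is after.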
Graphs in Scala
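import scala.collection.mutable
import scala.collection.mutable.ListBuffer

// Create a hashmap of componentLists (declaration assumed; not shown on the original slide).
val componentLists = mutable.HashMap[VertexId, ListBuffer[VertexId]]()
affiliationRelatedSubgraph.vertices
  .leftJoin(ccGraph.vertices) {
    case (id, u, comp) => comp.get
  }
  .collect() // bring the pairs to the driver so the map is updated locally
  .foreach { case (id, startingNode) =>
    if (!componentLists.contains(startingNode)) {
      componentLists(startingNode) = new ListBuffer[VertexId]
    }
    componentLists(startingNode) += id
  }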
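The same grouping can also be written without a mutable map. The sketch below uses plain Scala collections and an invented `ComponentLists.group` helper; it takes (vertex, component) pairs such as those produced by the join above and returns one sorted member list per component. On an RDD, the analogous step would be to swap each pair and call `groupByKey`.

```scala
object ComponentLists {
  type VertexId = Long

  // Group (vertex, component) pairs into component -> sorted list of members,
  // without mutating any shared state.
  def group(pairs: Seq[(VertexId, VertexId)]): Map[VertexId, List[VertexId]] =
    pairs
      .groupBy { case (_, component) => component }
      .map { case (component, members) =>
        component -> members.map(_._1).sorted.toList
      }
}
```

Because the result is an immutable `Map`, it can be passed to the reporting step or shared between threads without the ordering and visibility concerns that come with a `ListBuffer` filled inside `foreach`.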
Graphs in Scala
// Output a report on the connected components.
// labelMap is assumed to map each VertexId to its label (built elsewhere).
println("------ connected components in related triples ------\n")
for ((component, componentList) <- componentLists) {
  if (componentList.size > 1) {
    for (c <- componentList) {
      println(labelMap(c))
    }
    println("--------------------------")
  }
}
------ connected components in related triples ------
Australian National University
University of Sydney
University of Adelaide
University of New South Wales
--------------------------
The University of Queensland
University of Melbourne
Monash University
--------------------------
Resources
• Slides: http://w3id.org/people/prototypo/talks/UQ-DKE-20160601/slides
• Code: http://w3id.org/people/prototypo/talks/UQ-DKE-20160601/code
• Callimachus: http://callimachusproject.org
• Apache Spark: http://spark.apache.org
• GraphX Programming Guide: http://spark.apache.org/docs/latest/graphx-programming-guide.html
Attributions
• Linking Open Data cloud diagram by
Richard Cyganiak and Anja Jentzsch, used
under a CC license: http://lod-cloud.net/
This work is Copyright © 2015 David Hyland-Wood
It is licensed under the Creative Commons Attribution 3.0 Unported License

Full details at: http://creativecommons.org/licenses/by/3.0/
You are free:
to Share — to copy, distribute and transmit the work
to Remix — to adapt the work
Under the following conditions:
Attribution. You must attribute the work in the manner specified by the
author or licensor (but not in any way that suggests that they endorse
you or your use of the work).
Share Alike. If you alter, transform, or build upon this work, you may
distribute the resulting work only under the same or similar license to this
one.
