Data Integration in a Big Data Context
Alasdair J G Gray
A.J.G.Gray@hw.ac.uk
alasdairjggray.co.uk
@gray_alasdair
Data Linkage and Querying
2 September 2015 UBDC Seminar
Linking it all together!
Big Data
Volume, Velocity, Variety
Image: http://i.kinja-img.com/gawker-media/image/upload/lvzm0afp8kik5dctxiya.jpg
Purpose: Extracting Value
Image: http://senderocorp.com/images/uploads/bigdata_v9.png
Volume, Velocity, Variety, Veracity
Value
Visualization, Analytics
Big Data Technology
Big Data: Volume
More data than you can process
• Scalable processing
• Relative term
• WSN query processing
Big Data: Variety
Many sources of data
• Heterogeneous
• Formats
• Models
• Reconcile meaning
Big Data: Velocity
Data constantly generated
• Real-time processing
• Contextualise
RDF: An Integration Dream
http://www.w3.org/TR/rdf11-primer/
Image: https://www.flickr.com/photos/mobilestreetlife/4179063482
“RDF and OWL do not solve the interoperability problem, they just lay it bare on the table!”
Frank van Harmelen
Solent Use Case
• Busy shipping channel
• Two major ports
• Complex tidal and wave patterns
Estuarine Flooding
• Financial implications
• Damage
• Loss of business
• Personal factors
• Emotional impact
Flood prediction
• Locations
• Severity
Requires correlating
• Sea-state data
• Weather forecasts
• Details of sea defences
Response Planning
• Evacuation routes
• Personnel deployment
• …
Requires more data
• Traffic reports
• Shipping
• …
Image: http://www.metro.co.uk/
Flood Detection
“Detect overtopping events in the Solent region”
sea-level > sea-defence
• Sea-level: sensors
• Defence heights: databases
Data sources (diagram): flood defences data (database); real-time sensor data (wave, wind, tide); meteorological forecasts
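The overtopping condition (sea-level > sea-defence) can be pictured as a join between live sensor readings and stored defence heights. The sketch below is only an illustration under stated assumptions: the location names, the heights and the `detect_overtopping` helper are hypothetical and not taken from the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeaLevelReading:
    """One sensor observation (hypothetical structure)."""
    location: str
    level_m: float   # observed sea level in metres
    timestamp: int   # seconds since an arbitrary epoch


# Stored data: defence height per location, as a database table might hold it.
# All values are invented for illustration.
DEFENCE_HEIGHT_M = {"site-a": 4.2, "site-b": 3.8}


def detect_overtopping(readings, defence_heights):
    """Return the readings whose sea level exceeds the defence height there."""
    events = []
    for reading in readings:
        height = defence_heights.get(reading.location)
        # Locations without a defence record are skipped.
        if height is not None and reading.level_m > height:
            events.append(reading)
    return events


readings = [
    SeaLevelReading("site-a", 4.5, 100),
    SeaLevelReading("site-b", 3.1, 100),
    SeaLevelReading("site-c", 9.9, 100),  # no defence record, so ignored
]
events = detect_overtopping(readings, DEFENCE_HEIGHT_M)
print([event.location for event in events])
```

In a real deployment the readings would arrive as a stream and the heights would come from a database query; the comparison itself stays the same.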
Response Planning
“Provide contextual information”
• Web feeds
• Other sources: maps, models
• Real-time merging of datasets
Abstract Problem
Diagram: an integrator over stored data (behind a stored data service) and two sensor networks (each behind a streaming data service)
Types of Heterogeneity
Diagram labels: data source, data stream, query capabilities, data access, data semantics
Querying Approach
• Use ontologies as common model
Requires:
• Representation of RDF stream
• Expressing continuous queries over an RDF stream
• Establishing mappings between ontology models and data source schemas
• Accessing data sources through queries over ontology model
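One way to picture the mapping requirement: a declarative mapping states how each column of a source row becomes a property in the ontology. The sketch below is a rough illustration only. The row layout, the `MAPPING` table and `row_to_triples` are hypothetical; only the `cd:` and `ssg4e:` terms are borrowed from the RDF stream and query slides of this deck.

```python
# Hypothetical source schema: one row per sensor measurement.
row = {"id": 1, "speed": 34.5, "ts": 1000}

# Hypothetical mapping from source columns to ontology terms.
MAPPING = {
    "class": "cd:Observation",
    "subject_template": "ssg4e:Obs{id}",
    "properties": {
        "speed": "cd:observationResult",
        "ts": "cd:observationResultTime",
    },
}


def row_to_triples(row, mapping):
    """Translate one source row into (subject, predicate, object) triples."""
    subject = mapping["subject_template"].format(**row)
    triples = [(subject, "rdf:type", mapping["class"])]
    for column, predicate in mapping["properties"].items():
        triples.append((subject, predicate, row[column]))
    return triples


triples = row_to_triples(row, MAPPING)
for triple in triples:
    print(triple)
```

Keeping the mapping as data, separate from the translation function, means a new source can be added by writing another mapping rather than new code.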
RDF Stream
• Named graph
• Continuously updating
• Triples annotated with timestamp
STREAM http://www.semsorgrid4env.eu/ccometeo.srdf
...
( <ssg4e:Obs1, rdf:type, cd:Observation>, ti ),
( <ssg4e:Obs1, cd:observationResult, "34.5">, ti ),
( <ssg4e:Obs2, rdf:type, cd:Observation>, ti+1 ),
( <ssg4e:Obs2, cd:observationResult, "20.3">, ti+1 ),
...
Ontology fragment (diagram): cd:Observation, cd:observationResult, xsd:double
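The idea of triples annotated with a timestamp can be sketched in a few lines. This is an illustration under assumptions, not the SPARQLStream implementation: the `TimestampedTriple` type and the `observation_stream` generator are hypothetical, and integer timestamps stand in for ti, ti+1.

```python
from typing import Iterator, NamedTuple


class TimestampedTriple(NamedTuple):
    """An RDF-style triple plus the time at which it entered the stream."""
    subject: str
    predicate: str
    obj: str
    timestamp: int


def observation_stream(values, start=0) -> Iterator[TimestampedTriple]:
    """Yield two triples per observation, both carrying the same timestamp."""
    for i, value in enumerate(values, start=1):
        t = start + i - 1
        obs = f"ssg4e:Obs{i}"
        yield TimestampedTriple(obs, "rdf:type", "cd:Observation", t)
        yield TimestampedTriple(obs, "cd:observationResult", str(value), t)


stream = list(observation_stream([34.5, 20.3]))
for triple in stream:
    print(triple)
```

A generator fits the "continuously updating" property: consumers pull triples as they arrive instead of waiting for a complete graph.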
SPARQLStream
PREFIX cd: <http://www.semsorgrid4env.eu/ontologies/CoastalDefences.owl#>
PREFIX sb: <http://www.w3.org/2009/SSN-XG/Ontologies/SensorBasis.owl#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
RSTREAM
SELECT ?windspeed ?windts
FROM STREAM <http://www.semsorgrid4env.eu/ccometeo.srdf>
[ NOW - 1 MINUTE TO NOW STEP 5 MINUTES ]
WHERE {
  ?WindObs a cd:Observation;
    cd:observationResult ?windspeed;
    cd:observationResultTime ?windts;
    cd:observedProperty ?windProperty;
    cd:featureOfInterest ?windFeature.
  ?windFeature a cd:Feature;
    cd:locatedInRegion cd:SolentCCO.
  ?windProperty a cd:WindSpeed.
}
Ontology fragment (diagram): cd:Observation, cd:observationResult (xsd:double), cd:observedProperty (cd:Property), cd:featureOfInterest (cd:Feature), cd:locatedInRegion (cd:Region)
“Every 5 minutes, give me the wind speed observations over the last minute in the Solent region”
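The window clause (a one-minute range re-evaluated every five minutes) can be simulated over a list of timestamped readings. This is a sketch under assumptions rather than the SPARQLStream engine's behaviour: the `sliding_windows` helper, the readings and the half-open boundary (exclusive start, inclusive end) are all choices made for illustration.

```python
def sliding_windows(observations, width_s, step_s, end_s):
    """Evaluate a time window of `width_s` seconds every `step_s` seconds.

    `observations` is a list of (timestamp_s, value) pairs. At each
    evaluation time `now`, the window holds the values whose timestamp t
    satisfies now - width_s < t <= now (an assumed boundary convention).
    """
    results = []
    now = step_s
    while now <= end_s:
        window = [v for (t, v) in observations if now - width_s < t <= now]
        results.append((now, window))
        now += step_s
    return results


# Invented wind speed readings as (seconds, value) pairs.
obs = [(30, 10.0), (250, 12.5), (290, 13.0), (310, 11.0), (590, 9.5)]

# Width 60 s and step 300 s mirror a one-minute window with a five-minute step.
windows = sliding_windows(obs, 60, 300, 600)
for now, values in windows:
    print(now, values)
```

Because the step is longer than the window, some readings (here those at 30 s and 310 s) never appear in any result, which is the expected behaviour of such a query.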
Initial Display
Sensor Data
Sea-state Forecast Model
Drug Discovery Use Case
“Let me compare MW, logP and PSA for launched inhibitors of human & mouse oxidoreductases”
• Chemical Properties (ChemSpider)
• Launched drugs (Drugbank)
• Human => Mouse (Homologene)
• Protein Families (Enzyme)
• Bioactivity Data (ChEMBL)
• … other info (Uniprot/Entrez etc.)
Open PHACTS Discovery Platform
Drug Discovery Platform
Apps
Domain API
Interactive responses
Production quality integration platform
Method Calls
Standard Web Technologies
API Hits
Chart: number of hits per month (millions, axis 0 to 60), Jan 2013 to June 2015. Annotations: public launch of 1.2 API, 1.3 API, 1.4 API, 1.5 API
Open PHACTS Data
Multiple Identities
P12047
X31045
GB:29384
Are these the same thing?
Andy Law's Third Law
“The number of unique identifiers assigned to an individual is never less than the number of Institutions involved in the study”
http://bioinformatics.roslin.ac.uk/lawslaws/
Gleevec®: Imatinib Mesylate
Records from Drugbank, ChemSpider and PubChem
Labels: Imatinib, Mesylate, Imatinib Mesylate
YLMAHDNUQAMNNX-UHFFFAOYSA-N
Are these records the same?
It depends upon your task!
Structure Lens
I need to perform an analysis, give me details of the active compound in Gleevec.
Links shown: skos:exactMatch (InChI)
Slider: Strict (Analysing) to Relaxed (Browsing)
Name Lens
Which targets are known to interact with Gleevec?
Links shown: skos:exactMatch (InChI), skos:closeMatch (Drug Name)
Slider: Strict (Analysing) to Relaxed (Browsing)
What is a Scientific Lens?
A lens defines a conceptual view over the data
• Specifies operational equivalence conditions
Consists of:
• Identifier (URI)
• Title (dct:title)
• Description (dct:description)
• Documentation link (dcat:landingPage)
• Creator (pav:createdBy)
• Timestamp (pav:createdOn)
• Equivalence rules (bdb:linksetJustification)
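A lens can be read as a filter over the links between records: each lens lists the link justifications it accepts, and only those links count as equivalences. The sketch below illustrates that reading and is not the Open PHACTS implementation. The record identifiers, the `LINKS` table, the `Lens` class and `equivalents` are hypothetical; the predicates and justifications (InChI, Drug Name) are those named on the lens slides.

```python
from dataclasses import dataclass

# Hypothetical links between records: (source, target, predicate, justification).
LINKS = [
    ("chemspider:A", "pubchem:A", "skos:exactMatch", "InChI"),
    ("chemspider:A", "drugbank:B", "skos:closeMatch", "Drug Name"),
]


@dataclass(frozen=True)
class Lens:
    title: str
    justifications: frozenset  # link justifications this lens accepts


STRUCTURE_LENS = Lens("Structure Lens", frozenset({"InChI"}))
NAME_LENS = Lens("Name Lens", frozenset({"InChI", "Drug Name"}))


def equivalents(record, lens, links=LINKS):
    """Return the records treated as equivalent to `record` under `lens` (one hop)."""
    found = set()
    for source, target, _predicate, justification in links:
        if justification not in lens.justifications:
            continue
        if source == record:
            found.add(target)
        elif target == record:
            found.add(source)
    return found


strict = equivalents("chemspider:A", STRUCTURE_LENS)
relaxed = equivalents("chemspider:A", NAME_LENS)
print(sorted(strict))
print(sorted(relaxed))
```

The same query therefore returns a narrower or wider set of records depending on the lens chosen, which matches the strict (analysing) versus relaxed (browsing) distinction.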
Administrative Data Research Network
Administrative Data Service
ADRC-Scotland
• Co-located with Farr Institute, Scottish Government and NHS.
• Universities of Aberdeen, Dundee, Edinburgh, Glasgow, Heriot-Watt, St Andrews and Stirling.
• Expertise in administrative data and public engagement, linkage, law and relevant computer science techniques.
• Provide research support, facilities, training
Research Focus
http://www.gov.scot/Resource/0044/00442276-39
• Schools, colleges and universities
• The criminal and justice system
• Social work services
• Social welfare
• Housing system
• Transport system
• Health system
• Historical administrative data
Data Matching
Messy data
Probabilistic matches
Schema matching
Example records (diagram):
John Grant, Fisherman; Fiona Sinclair; Ian Grant, Smithy, Born: 1861
Stuart Adam, Wheelwright; Morag Scott; Flora Adam, Seamstress, Born: 1866
Married: 1884
John Grant, Farmer; Fiona Grant; Iain Grant, Born: 1860
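A probabilistic match can be pictured as a score that combines several weak signals, here name similarity and closeness of birth year. This is a minimal sketch, not the method used by the project: the scoring function, the equal weights and the year tolerance are arbitrary assumptions, and only the names and years come from the example records. It uses `difflib.SequenceMatcher` from the Python standard library.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Similarity in [0, 1] between two names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(rec_a, rec_b, max_year_gap=2):
    """Combine name similarity with agreement on birth year.

    The equal weighting and the year tolerance are arbitrary choices
    made for illustration.
    """
    name = name_similarity(rec_a["name"], rec_b["name"])
    gap = abs(rec_a["born"] - rec_b["born"])
    year = max(0.0, 1.0 - gap / (max_year_gap + 1))
    return 0.5 * name + 0.5 * year


a = {"name": "Ian Grant", "born": 1861}
b = {"name": "Iain Grant", "born": 1860}
c = {"name": "Flora Adam", "born": 1866}

print(round(match_score(a, b), 2))
print(round(match_score(a, c), 2))
```

The pair with a spelling variant and a one-year gap scores well above the unrelated pair; where to place the threshold between match and non-match is a separate, task-dependent decision.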
Summary
• RDF eases data integration
• Working on RDF stream extensions
• Data is complex and messy
• Requires flexibility in linking
• Equivalence depends upon context
• Lenses provide support for operational equivalence
www.alasdairjggray.co.uk
A.J.G.Gray@hw.ac.uk
@gray_alasdair

 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 

Data Integration in a Big Data Context

  • 1. Data Integration in a Big Data Context Alasdair J G Gray A.J.G.Gray@hw.ac.uk alasdairjggray.co.uk @gray_alasdair
  • 2. Data Linkage and Querying. Linking it all together! 2 September 2015 UBDC Seminar
  • 3. Big Data. Volume, Velocity, Variety. Image: http://i.kinja-img.com/gawker-media/image/upload/lvzm0afp8kik5dctxiya.jpg
  • 4. Purpose: Extracting Value. Volume, Velocity, Variety, Veracity, Value. Visualization, Analytics. Big Data Technology. Image: http://senderocorp.com/images/uploads/bigdata_v9.png
  • 5. Big Data. Volume, Velocity, Variety.
  • 6. Big Data: Volume. More data than you can process: scalable processing; relative term; WSN query processing.
  • 7. Big Data: Variety. Many sources of data: heterogeneous; formats; models; reconcile meaning.
  • 8. Big Data: Velocity. Data constantly generated: real-time processing; contextualise.
  • 9. RDF: An Integration Dream. http://www.w3.org/TR/rdf11-primer/
  • 10. “RDF and OWL do not solve the interoperability problem, they just lay it bare on the table!” Frank van Harmelen. Image: https://www.flickr.com/photos/mobilestreetlife/4179063482
  • 11. Solent Use Case: busy shipping channel; two major ports; complex tidal and wave patterns.
  • 12. Estuarine Flooding: financial implications; damage; loss of business; personal factors; emotional impact. Flood prediction: locations; severity. Requires correlating: sea-state data; weather forecasts; details of sea defences. Response planning: evacuation routes; personnel deployment; … Requires more data: traffic reports; shipping; … Image: http://www.metro.co.uk/
  • 13. Flood Detection. “Detect overtopping events in the Solent region”: sea-level > sea-defence. Sea-level: sensors. Defence heights: databases. Diagram labels: Flood defences data (database); Real-time sensor data (Wave, Wind, Tide).
  • 14. Response Planning. “Provide contextual information”: Web feeds; other sources: maps, models; real-time merging of datasets. Diagram labels: Meteorological forecasts; Other sources: Maps, models, …
  • 15. Abstract Problem. Diagram labels: Stored data; Sensor Network (twice); Integrator; Stored data service; Streaming data service (twice).
  • 16. Types of Heterogeneity: data source; data stream; query capabilities; data access; data semantics. Diagram labels: Stored data; Sensor Network (twice); Integrator; Stored data service; Streaming data service (twice).
  • 17. Querying Approach. Use ontologies as common model. Requires: representation of RDF stream; expressing continuous queries over an RDF stream; establishing mappings between ontology models and data source schemas; accessing data sources through queries over ontology model.
  • 18. RDF Stream: named graph; continuously updating; triples annotated with timestamp.
        STREAM http://www.semsorgrid4env.eu/ccometeo.srdf
        ...
        ( <ssg4e:Obs1, rdf:type, cd:Observation>, ti ),
        ( <ssg4e:Obs1, cd:observationResult, "34.5">, ti ),
        ( <ssg4e:Obs2, rdf:type, cd:Observation>, ti+1 ),
        ( <ssg4e:Obs2, cd:observationResult, "20.3">, ti+1 ),
        ...
      Diagram labels: cd:Observation, cd:observationResult, xsd:double.
  • 19. SPARQLStream. “Every 5 minutes give me the wind speed observations over the last minute in the Solent Region”
        PREFIX cd: <http://www.semsorgrid4env.eu/ontologies/CoastalDefences.owl#>
        PREFIX sb: <http://www.w3.org/2009/SSN-XG/Ontologies/SensorBasis.owl#>
        PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
        RSTREAM SELECT ?windspeed ?windts
        FROM STREAM <http://www.semsorgrid4env.eu/ccometeo.srdf> [ NOW - 1 MINUTE TO NOW STEP 5 MINUTES ]
        WHERE {
          ?WindObs a cd:Observation;
                   cd:observationResult ?windspeed;
                   cd:observationResultTime ?windts;
                   cd:observedProperty ?windProperty;
                   cd:featureOfInterest ?windFeature.
          ?windFeature a cd:Feature;
                       cd:locatedInRegion cd:SolentCCO.
          ?windProperty a cd:WindSpeed.
        }
      Diagram labels: cd:Observation, xsd:double, cd:observationResult, cd:Feature, cd:featureOfInterest, cd:Property, cd:observedProperty, cd:Region, cd:locatedInRegion.
  • 20. Initial Display
  • 21. Sensor Data
  • 22. Sea-state Forecast Model
  • 23. Drug Discovery Use Case. “Let me compare MW, logP and PSA for launched inhibitors of human & mouse oxidoreductases”: Chemical Properties (Chemspider); Launched drugs (Drugbank); Human => Mouse (Homologene); Protein Families (Enzyme); Bioactivity Data (ChEMBL); … other info (Uniprot/Entrez etc.)
  • 24. Open PHACTS Discovery Platform. Diagram labels: Drug Discovery Platform; Apps; Domain API; Interactive responses; Production quality integration platform; Method Calls; Standard Web Technologies.
  • 25. API Hits. Chart: number of hits (millions, axis 0 to 60) per month, Jan 2013 to June 2015. Annotations: Public launch of 1.2 API; 1.3 API; 1.4 API; 1.5 API.
  • 26. Open PHACTS Data
  • 27. Multiple Identities. P12047, X31045, GB:29384: are these the same thing? Andy Law's Third Law: “The number of unique identifiers assigned to an individual is never less than the number of Institutions involved in the study” http://bioinformatics.roslin.ac.uk/lawslaws/
  • 28. Gleevec®: Imatinib Mesylate. Databases: Drugbank, ChemSpider, PubChem. Record labels: Imatinib Mesylate; Imatinib Mesylate; YLMAHDNUQAMNNX-UHFFFAOYSA-N.
  • 29. Gleevec®: Imatinib Mesylate. Databases: Drugbank, ChemSpider, PubChem. Record labels: Imatinib Mesylate; Imatinib Mesylate; YLMAHDNUQAMNNX-UHFFFAOYSA-N. Are these records the same? It depends upon your task!
  • 30. Structure Lens. “I need to perform an analysis, give me details of the active compound in Gleevec.” Diagram labels: skos:exactMatch (InChI); Strict, Relaxed; Analysing, Browsing.
  • 31. Name Lens. “Which targets are known to interact with Gleevec?” Diagram labels: skos:closeMatch (Drug Name), twice; skos:exactMatch (InChI); Strict, Relaxed; Analysing, Browsing.
  • 32. What is a Scientific Lens? A lens defines a conceptual view over the data and specifies operational equivalence conditions. Consists of: Identifier (URI); Title (dct:title); Description (dct:description); Documentation link (dcat:landingPage); Creator (pav:createdBy); Timestamp (pav:createdOn); Equivalence rules (bdb:linksetJustification).
  • 33. Administrative Data Research Network. Administrative Data Service.
  • 34. ADRC-Scotland. Co-located with Farr Institute, Scottish Government and NHS. Universities of Aberdeen, Dundee, Edinburgh, Glasgow, Heriot-Watt, St Andrews and Stirling. Expertise in administrative data and public engagement, linkage, law and relevant computer science techniques. Provide research support, facilities, training.
  • 35. Research Focus: schools, colleges and universities; the criminal and justice system; social work services; social welfare; housing system; transport system; health system; historical administrative data. http://www.gov.scot/Resource/0044/00442276-39
  • 36. Data Matching: messy data; probabilistic matches; schema matching. Example record labels: John Grant, Fisherman; Fiona Sinclair; Ian Grant, Smithy, Born: 1861; Stuart Adam, Wheelwright; Morag Scott; Flora Adam, Seamstress, Born: 1866; Married: 1884; John Grant, Farmer; Fiona Grant; Iain Grant, Born: 1860.
  • 37. Summary: RDF eases data integration; working on RDF stream extensions; data is complex and messy; requires flexibility in linking; equivalence depends upon context; lenses provide support for operational equivalence. www.alasdairjggray.co.uk A.J.G.Gray@hw.ac.uk @gray_alasdair
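
The timestamped triples of slide 18 and the time window of slide 19 can be pictured with a small Python sketch. This is only an illustration under stated assumptions, not the SPARQLStream implementation: the names `TimestampedTriple`, `window` and `observation_results` are hypothetical, timestamps are assumed to be plain seconds, and the window is assumed to include its upper bound and exclude its lower bound.

```python
from typing import NamedTuple


class TimestampedTriple(NamedTuple):
    """One stream element: an RDF-style triple plus its timestamp (hypothetical type)."""
    subject: str
    predicate: str
    obj: str
    timestamp: float  # assumed to be seconds; the slides do not fix a unit


def window(stream, now, width):
    """Return the triples whose timestamp lies in (now - width, now]."""
    return [t for t in stream if now - width < t.timestamp <= now]


def observation_results(triples):
    """Map each observation to its cd:observationResult value, parsed as a float."""
    return {
        t.subject: float(t.obj)
        for t in triples
        if t.predicate == "cd:observationResult"
    }


# The two observations from slide 18, with illustrative (assumed) timestamps.
stream = [
    TimestampedTriple("ssg4e:Obs1", "rdf:type", "cd:Observation", 0.0),
    TimestampedTriple("ssg4e:Obs1", "cd:observationResult", "34.5", 0.0),
    TimestampedTriple("ssg4e:Obs2", "rdf:type", "cd:Observation", 90.0),
    TimestampedTriple("ssg4e:Obs2", "cd:observationResult", "20.3", 90.0),
]

# Observations seen during the last minute, evaluated at t = 120 s.
recent = observation_results(window(stream, now=120.0, width=60.0))
```

Re-evaluating `window` at a fixed step (every 5 minutes on slide 19) would give the continuous-query behaviour. Note that each observation needs two triples here, one for its type and one for its result.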

Editor's Notes

  1. About me: working in data integration for over a decade. Special focus on practical real-world problems and streaming data. Lots of domains: Astronomy, Biology, Chemistry, Physics, Environmental science, Pharmacology, Social science, Health Informatics.
  2. Data from heterogeneous sources: discover relevant sources; different temporal modalities; different data models and representations. Interlink data: common representation, align data models/schemas, identify common entities. Query decomposition across distributed sources. Efficient in-network processing: save energy, increase network lifetime. Enable new insights through novel user interfaces.
  3. Veracity is a cross-cutting issue
  4. Deriving value from the data. My research focuses on the top part: bringing data together.
  5. WSNs are typically resource constrained: 48k memory, limited energy.
  6. Integrated with background knowledge
  7. Identify things with URIs. Reuse URIs. Explicit meaning to relationships. Links between datasets. Infer hidden meaning.
  8. They give us a common syntax. Rest of the talk focuses on my work to address these challenges.
  9. Strait of water separating the Isle of Wight from the English mainland. Two high tides -> increased opportunities for getting ships in and out -> better for business. Complex tidal pattern. Non-standard models.
  10. Environmental decision support systems. Flood emergency response: real-time data mash-ups, real-time data linkage.
  11. Overtopping: a wave or tide exceeds the height of the sea defence; simplified as a threshold in the graph. Sensor data provides current sea-state conditions. The National Flood and Coastal Defences Database (NFCDD) provides the height of sea walls, etc. Lots of forms of heterogeneity in the system.
  12. Contextual data. Weather feed provides predicted wind speed and direction: contextual streaming data. Maps -> contextual visual data. Report data in a form understandable to the user: ontology.
  13. User requires correlation of data from a variety of sources. Sources are wrapped by a service. Integrator includes DQP. Queries, or data requests, are sent to data services.
  14. Data source: stored or streaming. Data stream: acquire or receive; control of data rate. Query capabilities: query evaluator, language. Data access: pull or push. Semantic heterogeneity, e.g. temperature: air, sea, …
  15. Previous streaming extensions to SPARQL have problems
  16. Requires two triples to represent information. The problem is being addressed in the W3C RDF Stream Processing Group.
  17. 1 of 83 business driver questions. It took a team of 5 experienced researchers 6 hours to manually gather the answer. At the start of the project it couldn’t be answered by a computer system; 6 months in, it took 30s with the prototype; now it is subsecond.
  18. A platform for integrated pharmacology data. Relied upon by pharma companies. Public domain, commercial, and private data sources. Provides a domain-specific API, making it easy to build multiple drug discovery applications: examples developed in the project.
  19. Actively being used for different purposes. Public launch April 2013. Averaging 20 million hits a month from the start of 2015; 38 million in the last 30 days. Heavy usage from pharma, academia, and biotech. 500+ registered users.
  20. Over 3 billion triples. Hosted on beefy hardware; data in memory (aim).
  21. A concept appears in multiple datasets, each with its own identifier. This talk is about supporting the multiple identities that exist. Rather than define a single approach, we want to support the use of multiple identifiers.
  22. Example drug: Gleevec, a cancer drug for leukemia. Lookup in three popular public chemical databases -> different results. Chemistry is complicated, often simplified for convenience. Data is messy!
  23. Are these records the same? It depends on what you are doing with the data! Each captures a subtly different view of the world. Chemistry is complicated, often simplified for convenience. Data is messy!
  24. Interested in physicochemical properties of Gleevec
  25. Interested in biomedical and pharmacological properties. sameAs != sameAs: it depends on your point of view. Links relate individual data instances: source, target, predicate, reason. Links are grouped into Linksets, which have a VoID header providing provenance and justification for the link.
  26. A lens enables certain relationships and disables others. It alters the links between the data.
  27. The default lens matches structures: you only get data back associated with the structure you entered. If you really want all information about Ibuprofen, you need a different lens.
  28. ESRC funded network. Coordinating Administrative Data Service (ADS), led by University of Essex. Four Administrative Data Research Centres (ADRCs), one in each UK country: England, led by University of Southampton; Northern Ireland, led by Queen's University Belfast; Scotland, led by University of Edinburgh; Wales, led by Swansea University.
  29. Social science example from ADRC Scotland. Looking to apply lenses to support different interactions.
  30. Bird habitat monitoring, coastal monitoring, glacier movement, farms, volcanoes… Cost-effective monitoring, high spatial/temporal resolution. What is the underlying technology/software?
  31. Trade-off of capabilities vs QoS vs lifetime. Every system performed its own bespoke evaluations; how do you compare?
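
Notes 25 and 26 describe links as (source, target, predicate, reason) and a lens as something that enables some links and disables others. A minimal Python sketch of that idea follows. It is an illustration only, not the Open PHACTS implementation: the identifiers in `LINKS` are hypothetical placeholders, the lens is modelled simply as a set of accepted reasons, and following enabled links transitively is an assumption made here.

```python
from typing import NamedTuple


class Link(NamedTuple):
    """A link between two data instances, with the reason it was asserted."""
    source: str
    target: str
    predicate: str
    reason: str


# Hypothetical identifiers; the reasons mirror the labels on slides 30 and 31.
LINKS = [
    Link("chemspider:A", "drugbank:A", "skos:exactMatch", "InChI"),
    Link("chemspider:A", "pubchem:B", "skos:closeMatch", "Drug Name"),
    Link("drugbank:A", "pubchem:B", "skos:closeMatch", "Drug Name"),
]

# A lens is modelled here as the set of link reasons it enables (assumption).
STRUCTURE_LENS = frozenset({"InChI"})
NAME_LENS = frozenset({"InChI", "Drug Name"})


def equivalents(start, links, lens):
    """Return every identifier reachable from start via links the lens enables."""
    seen = {start}
    frontier = [start]
    while frontier:
        current = frontier.pop()
        for link in links:
            if link.reason not in lens:
                continue  # this lens disables the link
            if link.source == current:
                neighbour = link.target
            elif link.target == current:
                neighbour = link.source
            else:
                continue
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append(neighbour)
    return seen


strict = equivalents("chemspider:A", LINKS, STRUCTURE_LENS)
relaxed = equivalents("chemspider:A", LINKS, NAME_LENS)
```

Under the stricter lens only the structure-based match is followed, while the relaxed lens also treats name-based matches as equivalent, so the same query returns a larger set of records.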