Data is being generated all around us – from our smart phones tracking our movement through a city to the city itself sensing various properties and reacting to various conditions. However, to maximise the potential from all this data, it needs to be combined and coerced into models that enable analysis and interpretation. In this talk I will give an overview of the techniques that I have developed for data integration: integrating streams of sensor data with background contextual data and supporting multiple interpretations of linking data together. At the end of the talk I will overview the work I will be conducting in the Administrative Data Research Centre for Scotland.
1. Data Integration in a Big Data Context
Alasdair J G Gray
A.J.G.Gray@hw.ac.uk
alasdairjggray.co.uk
@gray_alasdair
2. Data Linkage and Querying
2 September 2015 UBDC Seminar
Linking it all together!
3. Big Data
Volume
Velocity
Variety
http://i.kinja-img.com/gawker-media/image/upload/lvzm0afp8kik5dctxiya.jpg
4. Purpose: Extracting Value
http://senderocorp.com/images/uploads/bigdata_v9.png
Volume Velocity Variety Veracity
Value
Visualization Analytics
Big Data Technology
6. Big Data: Volume
More data than you can process
Scalable processing
Relative term
WSN query processing
7. Big Data: Variety
Many sources of data
Heterogeneous
Formats
Models
Reconcile meaning
8. Big Data: Velocity
Data constantly generated
Real-time processing
Contextualise
9. RDF: An Integration Dream
http://www.w3.org/TR/rdf11-primer/
10.
https://www.flickr.com/photos/mobilestreetlife/4179063482
“RDF and OWL do not solve the interoperability problem, they just lay it bare on the table!”
Frank van Harmelen
11. Solent Use Case
Busy shipping channel
Two major ports
Complex tidal and wave patterns
12. Estuarine Flooding
Financial implications
Damage
Loss of business
Personal factors
Emotional impact
Flood prediction
Locations
Severity
Requires correlating
Sea-state data
Weather forecasts
Details of sea defences
Response Planning
Evacuation routes
Personnel deployment
…
Requires more data
Traffic reports
Shipping
…
Image: http://www.metro.co.uk/
17. Querying Approach
Use ontologies as common model
Requires:
Representation of RDF stream
Expressing continuous queries over an RDF stream
Establishing mappings between ontology models and data source schemas
Accessing data sources through queries over ontology model
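As a rough illustration of the mapping requirement, the sketch below turns one row from a source schema into timestamped triples that use the cd: terms from the wind speed query in this talk. The row layout (`ts`, `windspeed`), the `Triple` type, the `row_to_triples` helper and the blank-node identifier are assumptions made for this example; this is not the mapping language used in the project.

```python
from typing import List, NamedTuple

CD = "http://www.semsorgrid4env.eu/ontologies/CoastalDefences.owl#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: object
    timestamp: int  # hypothetical annotation: when the source tuple was produced


def row_to_triples(row: dict, obs_id: str) -> List[Triple]:
    """Map one source row (assumed keys: 'ts', 'windspeed') to ontology-level triples."""
    ts = row["ts"]
    return [
        Triple(obs_id, RDF_TYPE, CD + "Observation", ts),
        Triple(obs_id, CD + "observationResult", float(row["windspeed"]), ts),
        Triple(obs_id, CD + "observationResultTime", ts, ts),
    ]


# Made-up source row; values are for illustration only.
triples = row_to_triples({"ts": 1441188000, "windspeed": "12.5"}, "_:obs1")
for t in triples:
    print(t)
```

A query over the ontology model can then be answered from such triples without the user needing to know the source schema.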
19. SPARQLStream
PREFIX cd: <http://www.semsorgrid4env.eu/ontologies/CoastalDefences.owl#>
PREFIX sb: <http://www.w3.org/2009/SSN-XG/Ontologies/SensorBasis.owl#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
RSTREAM
SELECT ?windspeed ?windts
FROM STREAM <http://www.semsorgrid4env.eu/ccometeo.srdf>
  [ NOW - 1 MINUTE TO NOW STEP 5 MINUTES ]
WHERE {
  ?WindObs a cd:Observation;
    cd:observationResult ?windspeed;
    cd:observationResultTime ?windts;
    cd:observedProperty ?windProperty;
    cd:featureOfInterest ?windFeature.
  ?windFeature a cd:Feature;
    cd:locatedInRegion cd:SolentCCO.
  ?windProperty a cd:WindSpeed.
}
cd:Observation
xsd:double
cd:observationResult
cd:Feature
cd:featureOfInterest
cd:Property
cd:observedProperty
cd:Region
cd:locatedInRegion
“Every 5 minutes give me the wind speed observations over the last minute in the Solent Region”
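The window in the query (one minute wide, re-evaluated every five minutes) can be sketched in Python as follows. This is only an illustration of the time-window idea over an in-memory list, not how a SPARQLStream engine evaluates it; the `windowed` function, the sample readings, and the choice of which window boundary is inclusive are all assumptions.

```python
from typing import Iterable, Iterator, List, Tuple

Observation = Tuple[int, float]  # (timestamp in seconds, wind speed)


def windowed(observations: Iterable[Observation], start: int, end: int,
             width: int = 60, step: int = 300
             ) -> Iterator[Tuple[int, List[Observation]]]:
    """Yield (evaluation_time, window_contents) every `step` seconds.

    Each window covers the interval (now - width, now]; which boundary is
    inclusive is an assumption of this sketch.
    """
    obs = sorted(observations)
    for now in range(start, end + 1, step):
        yield now, [(ts, v) for ts, v in obs if now - width < ts <= now]


# Made-up readings for illustration.
readings = [(10, 4.0), (250, 6.5), (290, 7.0), (300, 7.5), (580, 9.0), (601, 3.0)]
results = list(windowed(readings, start=300, end=600))
for now, contents in results:
    print(now, contents)
```

Because the step is longer than the window width, readings that arrive in the first four minutes of each cycle never appear in any result.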
23. Drug Discovery Use Case
“Let me compare MW, logP
and PSA for launched
inhibitors of human &
mouse oxidoreductases”
Chemical Properties (Chemspider)
Launched drugs (Drugbank)
Human => Mouse (Homologene)
Protein Families (Enzyme)
Bioactivity Data (ChEMBL)
… other info (Uniprot/Entrez etc.)
24. Open PHACTS Discovery Platform
Drug Discovery Platform
Apps
Domain API
Interactive responses
Production quality integration platform
Method Calls
Standard Web Technologies
25. API Hits
[Chart: number of API hits per month in millions (y-axis 0 to 60), January 2013 to June 2015. Annotations: public launch of 1.2 API, 1.3 API, 1.4 API, 1.5 API.]
27. Multiple Identities
P12047
X31045
GB:29384
Andy Law's Third Law
“The number of unique identifiers assigned to an individual is never less than the number of Institutions involved in the study”
http://bioinformatics.roslin.ac.uk/lawslaws/
Are these the same thing?
29. Gleevec®: Imatinib Mesylate
Drugbank ChemSpider PubChem
Imatinib
Mesylate
Imatinib Mesylate
YLMAHDNUQAMNNX-UHFFFAOYSA-N
Are these records the same?
It depends upon your task!
30.
skos:exactMatch (InChI)
Strict Relaxed
Analysing Browsing
Structure Lens
I need to perform an analysis, give me details of the active compound in Gleevec.
31.
skos:closeMatch (Drug Name)
skos:closeMatch (Drug Name)
skos:exactMatch (InChI)
Strict Relaxed
Analysing Browsing
Name Lens
Which targets are known to interact with Gleevec?
32. What is a Scientific Lens?
A lens defines a conceptual view over the data
Specifies operational equivalence conditions
Consists of:
Identifier (URI)
Title (dct:title)
Description (dct:description)
Documentation link (dcat:landingPage)
Creator (pav:createdBy)
Timestamp (pav:createdOn)
Equivalence rules (bdb:linksetJustification)
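One way to picture a lens is as a filter over links: each link carries a justification, and the lens lists the justifications it accepts. The sketch below is a simplified illustration of that idea and not the Open PHACTS implementation. The `Link` and `Lens` classes, the `allowed_justifications` field, the `ex:` URIs and the three example links are all hypothetical. The two lenses mirror the Structure Lens (InChI only) and Name Lens (InChI plus Drug Name) slides.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Link:
    source: str
    target: str
    predicate: str      # e.g. skos:exactMatch or skos:closeMatch
    justification: str  # why the link was asserted, e.g. "InChI" or "Drug Name"


@dataclass(frozen=True)
class Lens:
    uri: str
    title: str
    description: str
    allowed_justifications: FrozenSet[str]  # stands in for the equivalence rules

    def apply(self, links: List[Link]) -> List[Link]:
        """Keep only the links whose justification this lens enables."""
        return [link for link in links
                if link.justification in self.allowed_justifications]


# Hypothetical records A, B and C in three different datasets.
links = [
    Link("ex:recordA", "ex:recordB", "skos:exactMatch", "InChI"),
    Link("ex:recordA", "ex:recordC", "skos:closeMatch", "Drug Name"),
    Link("ex:recordB", "ex:recordC", "skos:closeMatch", "Drug Name"),
]

structure_lens = Lens("ex:structureLens", "Structure Lens",
                      "Strict: match on chemical structure only",
                      frozenset({"InChI"}))
name_lens = Lens("ex:nameLens", "Name Lens",
                 "Relaxed: also match on drug name",
                 frozenset({"InChI", "Drug Name"}))

print(len(structure_lens.apply(links)))
print(len(name_lens.apply(links)))
```

Under the strict lens only the structure-based link survives, while the relaxed lens also treats the name-based links as equivalences.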
34. ADRC-Scotland
Co-located with Farr Institute, Scottish Government and NHS.
Universities of Aberdeen, Dundee, Edinburgh, Glasgow, Heriot-Watt, St Andrews and Stirling.
Expertise in administrative data and public engagement, linkage, law and relevant computer science techniques.
Provide research support, facilities, training
36. Data Matching
Messy data
Probabilistic matches
Schema matching
John Grant
Fisherman
Fiona Sinclair
Ian Grant
Smithy
Born: 1861
Stuart Adam
Wheelwright
Morag Scott
Flora Adam
Seamstress
Born: 1866
Married: 1884
John Grant
Farmer
Fiona Grant
Iain Grant
Born: 1860
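The records above show why exact comparison fails: "Ian Grant, born 1861" and "Iain Grant, born 1860" may well be the same person. The sketch below scores candidate pairs with a simple weighted similarity using Python's standard `difflib`. It is not a full probabilistic linkage model, and the `Person` class, the weights (0.7 and 0.3), the two-year tolerance and the neutral 0.5 for a missing birth year are arbitrary assumptions chosen for illustration.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional


@dataclass(frozen=True)
class Person:
    name: str
    born: Optional[int] = None


def match_score(a: Person, b: Person, year_tolerance: int = 2) -> float:
    """Return a score in [0, 1]; higher means the records are more alike."""
    name_sim = SequenceMatcher(None, a.name.lower(), b.name.lower()).ratio()
    if a.born is None or b.born is None:
        year_sim = 0.5  # unknown year neither supports nor contradicts a match
    else:
        year_sim = max(0.0, 1.0 - abs(a.born - b.born) / year_tolerance)
    return 0.7 * name_sim + 0.3 * year_sim


ian = Person("Ian Grant", 1861)
iain = Person("Iain Grant", 1860)
john = Person("John Grant")

print(round(match_score(ian, iain), 3))
print(round(match_score(ian, john), 3))
```

A threshold on the score would then decide which pairs count as the same person, and different thresholds suit different tasks.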
37. Summary
RDF eases data integration
Working on RDF stream extensions
Data is complex and messy
Requires flexibility in linking
Equivalence depends upon context
Lenses provide support for operational equivalence
www.alasdairjggray.co.uk
A.J.G.Gray@hw.ac.uk
@gray_alasdair
Editor's Notes
About Me
Working in data integration for over a decade
Special focus on
Practical real-world problems
Streaming data
Lots of domains: Astronomy, Biology, Chemistry, Physics, Environmental science, Pharmacology, Social science, Health Informatics
Data from heterogeneous sources: discover relevant sources; different temporal modalities; different data models and representations
Interlink data: common representation, align data models/schemas, identify common entities
Query decomposition across distributed sources
Efficient in-network processing: Save energy, increase network lifetime
Enable new insights through novel user interfaces
Veracity is a cross-cutting issue
Deriving value from the data
My research focuses on the top part: bringing data together
WSN typically resource constrained
48k memory
limited energy
Integrated with background knowledge
Identify things with URIs
Reuse URIs
Explicit meaning to relationships
Links between datasets
Infer hidden meaning
They give us a common syntax
Rest of the talk focuses on my work to address these challenges
Strait of water separating Isle of Wight from English mainland
Two high tides -> increased opportunities for getting ships in and out -> better for business
Complex tidal pattern
Non-standard models
Environmental decision support systems
Flood emergency response:
real-time data mash-ups
real-time data linkage
Overtopping: a wave or tide exceeds the height of the sea defence: simplified as threshold in graph
Sensor data provides current sea-state conditions
National Flood and Coastal Defences Database (NFCDD) provides height of sea walls, etc
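The overtopping check described in these notes amounts to joining each streamed sea-state reading with the stored height of the defence at the same location. A minimal sketch follows; the `SeaState` type, the `overtopping_events` function, the site identifiers and all the numbers are made up for illustration and do not come from the NFCDD or from real sensor feeds.

```python
from typing import Dict, Iterable, Iterator, NamedTuple


class SeaState(NamedTuple):
    location: str         # hypothetical site identifier
    timestamp: int        # seconds since an arbitrary start
    water_level_m: float  # observed wave or tide height in metres


def overtopping_events(stream: Iterable[SeaState],
                       defence_height_m: Dict[str, float]) -> Iterator[SeaState]:
    """Yield readings whose water level exceeds the stored defence height."""
    for obs in stream:
        height = defence_height_m.get(obs.location)
        if height is not None and obs.water_level_m > height:
            yield obs


# Made-up background data (standing in for stored sea wall heights) and readings.
defences = {"site-a": 4.2, "site-b": 3.5}
readings = [
    SeaState("site-a", 0, 3.9),
    SeaState("site-b", 0, 3.6),
    SeaState("site-a", 60, 4.3),
    SeaState("site-c", 60, 9.9),  # no defence record, so it cannot be assessed
]
events = list(overtopping_events(readings, defences))
for e in events:
    print(e)
```

The stored data changes rarely and the stream changes constantly, which is why the two are handled differently in the integration.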
Lots of forms of heterogeneity in the system
Contextual Data
Weather feed provides predicted wind speed and direction,
contextual streaming data
Maps -> contextual visual data
Report data in a form understandable to the user, ontology
User requires correlation of data from variety of sources
Sources wrapped by a service
Integrator includes DQP
Queries, or data requests sent to data services
Data source: stored or streaming
Data stream: acquire or receive. Control of data rate
Query capabilities:
Query evaluator
Language
Data access: pull or push
Semantic Heterogeneity, e.g. temperature: air, sea, …
Previous streaming extensions to SPARQL have problems
Requires two triples to represent information
Problem being addressed in W3C RDF Stream Processing Group
1 of 83 business driver questions
Took a team of 5 experienced researchers 6 hours to manually gather the answer
At the start of the project the question could not be answered by a computer system
Six months in, the prototype answered it in 30 seconds
It is now answered in under a second
A platform for integrated pharmacology data
Relied upon by pharma companies
Public domain, commercial, and private data sources
Provides domain specific API
Making it easy to build multiple drug discovery applications: examples developed in the project
Actively being used for different purposes
Public launch April 2013
Averaging 20 million hits a month from the start of 2015
38 million in the last 30 days
Heavy usage from pharma, academia, and biotech
500+ registered users
Over 3 billion triples
Hosted on beefy hardware; data in memory (aim)
Concept appears in multiple datasets, each with its own identifier
This talk is about supporting the multiple identities that exist
Rather than define a single approach, we want to support the use of multiple identifiers
Example drug: Gleevec, a cancer drug for leukemia
Looked up in three popular public chemical databases: different results
Chemistry is complicated, often simplified for convenience
Data is messy!
Are these records the same? It depends on what you are doing with the data!
Each captures a subtly different view of the world
Interested in physicochemical properties of Gleevec
Interested in biomedical and pharmacological properties
sameAs != sameAs depends on your point of view
Links relate individual data instances: source, target, predicate, reason.
Links are grouped into Linksets which have VoID header providing provenance and justification for the link.
Lens enables certain relationships and disables others
Alters links between the data
Default lens matches structures
Only returns data associated with the structure that was entered
Really want all information about Ibuprofen
Need a different lens
ESRC funded network
Coordinating Administrative Data Service (ADS) – led by University of Essex
Four Administrative Data Research Centres (ADRCs), one in each UK country
England – led by University of Southampton
Northern Ireland – led by Queen's University Belfast
Scotland – led by University of Edinburgh
Wales – led by Swansea University
Social science example from ADRC Scotland
Looking to apply lenses to support different interactions
Bird habitat monitoring, Coastal monitoring, Glacier movement, Farms, Volcanoes…
Cost effective monitoring, high spatial/temporal resolution
What is the underlying technology/software?
Trade-off of capabilities vs QoS vs Lifetime
Every system performed their own bespoke evaluations, how do you compare?