Presentation on the Data Cube vocabulary to support Linked Data publication of statistics and measurement data sets. Given at SemTech 2011, San Francisco.
Creating a Modern Data Architecture for Digital Transformation – MongoDB
By managing Data in Motion, Data at Rest, and Data in Use differently, modern Information Management Solutions are enabling a whole range of architecture and design patterns that allow enterprises to fully harness the value in data flowing through their systems. In this session we explored some of the patterns (e.g. operational data lakes, CQRS, microservices and containerisation) that enable CIOs, CDOs and senior architects to tame the data challenge, and start to use data as a cross-enterprise asset.
Linking Services and Linked Data: Keynote for AIMSA 2012 – John Domingue
An overview of the approach, principles and technologies supporting how services and Linked Data can be combined to support the creation of applications.
iPRES 2011: The Costs and Economics of Preservation – neilgrindley
To introduce and describe some of the work that has been done to help institutions and research groups understand both the costs and the economics of preservation
To describe ongoing phases of JISC-funded work that are attempting to further advance understanding and implement approaches in this area
To give some indication of where collective international effort may be of universal benefit.
Adoption of Cloud Computing in Scientific Research – Yehia El-khatib
Some might say the scientific research community is somewhat behind the curve of adopting the cloud. In this talk, I present a few examples of adopting the cloud from the wider research community. I also highlight some of the aspects by which cloud computing could affect scientific research in the near future and the associated challenges.
On Data Quality Assurance and Conflation Entanglement in Crowdsourcing for En... – Greenapps&web
Volunteered geographical information (VGI), whether in the context of citizen science, active crowdsourcing, or even passive crowdsourcing, has proven useful in various societal domains such as natural hazards, health status, disease epidemics and biological monitoring. Nonetheless, the variable or unknown quality arising from the crowdsourcing setting is still an obstacle to fully integrating these data sources into environmental studies and, potentially, policy making. The data curation process, within which quality assurance (QA) is needed, is often driven by the direct usability of the collected data within a data conflation or data fusion (DCDF) process that combines the crowdsourced data into one view, potentially using other data sources as well. Using two examples, namely land cover validation and inundation extent estimation, this paper discusses the close links between QA and DCDF in order to determine whether disentangling them can benefit our understanding of the data curation process and its methodology with respect to crowdsourced data. Far from rejecting the usability quality criterion, the paper advocates decoupling the QA process from the DCDF step as much as possible, while still integrating them within an approach analogous to a Bayesian paradigm.
ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume II-3/W5, 2015
Strata 2016 - Architecting for Change: LinkedIn's new data ecosystem – Shirshanka Das
Shirshanka Das and Yael Garten describe how LinkedIn redesigned its data analytics ecosystem in the face of a significant product rewrite, covering the infrastructure changes that enable LinkedIn to roll out future product innovations with minimal downstream impact. Shirshanka and Yael explore the motivations and the building blocks for this reimagined data analytics ecosystem, the technical details of LinkedIn’s new client-side tracking infrastructure, its unified reporting platform, and its data virtualization layer on top of Hadoop and share lessons learned from data producers and consumers that are participating in this governance model. Along the way, they offer some anecdotal evidence during the rollout that validated some of their decisions and are also shaping the future roadmap of these efforts.
Architecting for change: LinkedIn's new data ecosystem – Yael Garten
2016 Strata + Hadoop World NYC conference talk.
http://conferences.oreilly.com/strata/hadoop-big-data-ny/public/schedule/detail/52182
Abstract:
Last year, LinkedIn embarked on an ambitious mission to completely revamp the mobile experience for its members. This would mean a completely new mobile application, reimagined user experiences, and new interaction concepts. As the team evaluated the impact of this big rewrite on the data analytics ecosystem, they observed a few problems.
Over the past few years, LinkedIn has become extremely good at incrementally changing the site one mini-feature at a time, often in conjunction with hundreds of other incremental changes. LinkedIn’s experimentation platform ensures that it is always monitoring a wide gamut of impacted metrics with every change before rolling fully forward. However, when it comes to rolling out a big change like this, different challenges crop up. You have to roll out the entire application all at once; the new experience means that you have no baseline on new metrics; and existing metrics may see double-digit changes just because of the new experience or because the metric’s logic is no longer accurate; the challenge is figuring out which is which.
Shirshanka Das and Yael Garten describe how LinkedIn redesigned its data analytics ecosystem in the face of a significant product rewrite, covering the infrastructure changes that enable LinkedIn to roll out future product innovations with minimal downstream impact. Shirshanka and Yael explore the motivations and the building blocks for this reimagined data analytics ecosystem, the technical details of LinkedIn’s new client-side tracking infrastructure, its unified reporting platform, and its data virtualization layer on top of Hadoop and share lessons learned from data producers and consumers that are participating in this governance model. Along the way, they offer some anecdotal evidence during the rollout that validated some of their decisions and are also shaping the future roadmap of these efforts.
The NIH Data Commons - BD2K All Hands Meeting 2015 – Vivien Bonazzi
Presentation given at the BD2K All Hands meeting in Bethesda, MD, USA in November 2015
https://datascience.nih.gov/bd2k/events/NOV2015-AllHands
Video cast of this presentation:
http://videocast.nih.gov/summary.asp?Live=17480&bhcp=1
The talk starts at 2 hrs 40 min (it's about 55 minutes long) and includes video!
Document describing the Commons: https://datascience.nih.gov/commons
COBWEB A quality assurance workflow authoring tool for citizen science and cr... – COBWEB Project
Presented by Didier Leibovici, Julian Rosser, Mike Jackson (Nottingham Geospatial Institute, University of Nottingham) at the COBWEB Summit, a side event of the Open Geospatial Consortium's (OGC) 99th Technical & Planning Committee (TC/PC) Meeting held at University College Dublin, 2016.
Models Done Better... - UDG2018 - Intertek and DHI – Stephen Flood
Use of integrator systems (operational data and model management platforms) to enhance model performance and value.
Presented at the CIWEM Urban Drainage Group Annual Conference 2018
Richard Dannatt - Intertek
Steve Flood - DHI
Big Data Paris - A Modern Enterprise Architecture – MongoDB
Since the 1980s, the volume of data produced, and the risk attached to that data, have literally exploded. 90% of the data in existence today was created in the last two years, and 80% of it is unstructured. With more users and a need for permanent availability, the risks are much higher.
Which database parameters should a decision-maker take into account in order to deploy innovative applications?
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024 – Tobias Schneck
As AI pushes into IT, I asked myself, as an "infrastructure container Kubernetes guy", how this fancy AI technology gets managed from an infrastructure operations point of view. Is it possible to apply our beloved cloud-native principles to it as well? What benefits could the two technologies bring to each other?
Let me take these questions and walk you through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premises strategy we may need to apply AI to our own infrastructure and make it work from an enterprise perspective. I give an overview of infrastructure requirements and technologies, and of what could benefit or limit your AI use cases in an enterprise environment. An interactive demo will give you some insights into the approaches I already have working in practice.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf – 91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -... – DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
UiPath Test Automation using UiPath Test Suite series, part 4 – DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimizing testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova... – Ramesh Iyer
In today's fast-changing business world, companies that fail to adapt and embrace new ideas often struggle to keep up with the competition. Fostering a culture of innovation, however, takes real work: it takes vision, leadership, and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at every stage.
The Art of the Pitch: WordPress Relationships and Sales – Laura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips and strategies for successful relationship building that leads to closing the deal.
Key Trends Shaping the Future of Infrastructure.pdf – Cheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open source: exploring how these areas are likely to mature and develop over the short and long term, and how organisations can position themselves to adapt and thrive.
Neuro-symbolic is not enough, we need neuro-*semantic* – Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply doing machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These gains will only be realized when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
GraphRAG is All You need? LLM & Knowledge Graph – Guy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Transcript: Selling digital books in 2024: Insights from industry leaders - T... – BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Elevating Tactical DDD Patterns Through Object Calisthenics – Dorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
State of ICS and IoT Cyber Threat Landscape Report 2024 preview – Prayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Epistemic Interaction - tuning interfaces to provide information for AI support – Alan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
3. Linked Data journey ...
explore
what is linked data?
what use is it for us?
4. Linked Data journey ...
self-describing: carries semantics with it; annotate and explain; data in context
integration: comparable; slice and dice; web API
5. Linked Data journey ...
what’s involved?
7. Linked Data journey ... explore → pilot → routine?
Great pilot, but ...
can we reduce the time and cost?
how do we handle changes and updates?
how can we make the published data easier to use?
How do we make Linked Data “business as usual”?
8. Example case study: Environment Agency
monitoring of bathing water quality
static pilot: historic annual assessments
live pilot: weekly assessments
operational system: additional data feeds; live update; integrated API; data explorer
9. From pilot to practice
reduce modelling costs: patterns; reuse (dive 1)
handling change and update: patterns
publication process: automation; conversion; publication
embed in the business process: use internally as well as externally; publish once, use many; data platform
10. Reduce costs - modelling
1. Don’t do it
map source data into isomorphic RDF, synthesize URIs
loses some of the value proposition
2. Reuse existing ontologies intact or mix-and-match
best solution when available
W3C GLD work on vocabularies – people, organizations, datasets ...
3. Reusable vocabulary patterns
example: Data cube plus reference URI sets
adaptable to broad range of data – environmental, statistical, financial ...
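Option 1 above ("don't do it": map source rows into isomorphic RDF, synthesizing URIs) can be sketched in a few lines. The base URI and the column names below are hypothetical illustrations, not taken from the deck:

```python
# Sketch of option 1: map a tabular row isomorphically into
# (subject, predicate, object) triples, synthesizing the subject URI
# from the row's key column. BASE and the columns are hypothetical.
BASE = "http://example.org/id/bathing-water/"

def row_to_triples(row: dict, key: str = "id") -> list:
    """Turn one CSV-like row into simple string triples."""
    subject = BASE + str(row[key])
    return [(subject, BASE + "def/" + col, str(val))
            for col, val in row.items() if col != key]

triples = row_to_triples({"id": "ukk1202-36000",
                          "name": "Clevedon Beach",
                          "classification": "Higher"})
```

The result is structurally faithful to the source table, which is exactly why, as the slide notes, it loses some of the value proposition: no shared semantics are added.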
11. Reusable patterns: Data cube
Much public sector data has regularities
set of measures: observations, forecasts, budgets, assessments, statistics ...
[slide figure: sample measure values such as “>0.1”, “34”, “27”, “good”, “excellent”, “poor”, “125”]
12. Reusable patterns: Data cube
Much public sector data has regularities
sets of measures: observations, forecasts, budgets, assessments, estimates ...
organized along some dimensions: region, agency, time, category, cost centre ...
[slide figure: a “spend” measure laid out along time, cost centre and objective code dimensions, with values 12, 15, 25 / 8, 9, 11 / 120, 130, 180]
13. Reusable patterns: Data cube
Much public sector data has regularities
sets of measures: observations, forecasts, budgets, assessments, estimates ...
organized along some dimensions: region, agency, time, category, cost centre ...
interpreted according to attributes: units, multipliers, status
[slide figure: the same spend cube with attribute-qualified values – provisional: $12k, $15k, $25k and $8k, $9k, $11k; final: $120k, $130k, $180k]
15. Data cube pattern
Pattern, not a fixed ontology
customize by selecting measures, dimensions and attributes
originated in publishing of statistics
applied to environment measurements, weather forecasts, budgets and spend, quality assessments, regional demographics ...
Supports reuse
widely reusable URI sets – geography, time periods, agencies, units
organization-wide sets
modelling often only requires small increments on top of core pattern and reusable components
opens door for reusable visualization tools
standardization through W3C GLD
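As a minimal sketch of the pattern (not the actual W3C vocabulary), an observation can be modelled as one measure value, located by dimension values and qualified by attributes; all field names and values here are illustrative:

```python
# Minimal sketch of the Data Cube pattern: an observation = one
# measure value + the dimensions that locate it + the attributes
# that say how to interpret it. Names/values are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class Observation:
    measure: str        # e.g. "spend"
    value: float
    dimensions: tuple   # e.g. (("time", "2011"), ("costCentre", "cc1"))
    attributes: tuple = ()  # e.g. (("unit", "GBP"), ("status", "provisional"))

obs = Observation(
    measure="spend", value=12_000.0,
    dimensions=(("time", "2011"), ("costCentre", "cc1")),
    attributes=(("unit", "GBP"), ("status", "provisional")),
)
```

Customizing the pattern then amounts to choosing which measures, dimensions and attributes a given cube uses, as the slide describes.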
16. Application to case study
Data Cubes for water quality measurement
in-season weekly assessments
end-of-season annual assessments
dimensions:
time intervals – UK reference time service
location – reference URI set for bathing waters and sample points
cubes can reuse these dimensions; just need to define specific measures
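The reuse argument on this slide can be sketched as follows. The two dimension URI patterns follow the example URIs used elsewhere in the deck; the cube and measure names are hypothetical:

```python
# Sketch of dimension reuse: two cubes share the same reference
# dimension URI sets and differ only in their measures.
# Cube/measure names are hypothetical illustrations.
SHARED_DIMENSIONS = {
    "sampleYear": "http://reference.data.gov.uk/id/year/{year}",
    "bathingWater": "http://environment.data.gov.uk/id/bathing-water/{code}",
}

def make_cube(name: str, measures: list) -> dict:
    """A cube definition = shared reference dimensions + its own measures."""
    return {"name": name, "dimensions": SHARED_DIMENSIONS, "measures": measures}

weekly = make_cube("in-season weekly assessment", ["classification"])
annual = make_cube("end-of-season annual assessment", ["classification"])
```

Because both cubes point at the same dimension definitions, modelling a new cube is a small increment: declare its measures and reuse everything else.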
17. From pilot to practice
reduce modelling costs: patterns; reuse
handling change and update: patterns (dive 2)
publication process: automation; conversion; publication
embed in the business process: use internally as well as externally; publish once, use many; data platform
18. Handling change
critical challenge
most initial pilots choose a snapshot dataset – and go stale, fast
understanding the nature of data updates and how to handle them is critical to successfully scaling to business as usual
types of change:
new data related to a different time period
corrections to data
entities change: properties; identity
19. Modelling change
1. Individual data items relate to a new time period
Pattern: n-ary relation
observation resource relates value to time period and other context
use Data Cube dimensions for this
[slide figure: the bathing water http://environment.data.gov.uk/id/bathing-water/ukk1202-36000 (“Clevedon Beach”) linked to three observations – bwq:sampleYear http://reference.data.gov.uk/id/year/2009 with bwq:classification Higher; year 2010 with classification Minimum; year 2011 with classification Higher]
History or latest?
latest is non-monotonic but helpful for many practical uses
materialize (SPARQL Update), implement in query, implement in API
choice whether to keep history as well
water quality vs. weather forecasts
21. Modelling change
3. Mutation
Infrequent change of properties, essential identity remains
e.g. renaming a school, adding another building
routine accesses see property value, not function of time
patterns
in place update
named graphs
current graph + graphs for each previous state + meta-graph
explicit versioning with open periods
22. Modelling change
3. Mutation
explicit versioning with open periods
[diagram: an endurant bathing-water resource with dct:hasVersion links to two
versions: “Clevedon Beach”, whose dct:valid interval time:intervalStarts 2003
and time:intervalFinishes 2011, and “Clevedon Sands”, whose dct:valid interval
time:intervalStarts 2011 and remains open]
find right version by query on validity interval
simplify use through
non-monotonic “latest value” link
API to implement query filters automatically
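A hedged Turtle sketch of open-period versioning (version URIs are hypothetical; the dct and OWL-Time properties are those named on the slide):

```turtle
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix time: <http://www.w3.org/2006/time#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# the endurant keeps a stable identity across renamings
<http://example.org/id/bathing-water/ukk1202-36000>
    dct:hasVersion <http://example.org/id/bathing-water/ukk1202-36000/v1> ,
                   <http://example.org/id/bathing-water/ukk1202-36000/v2> .

<http://example.org/id/bathing-water/ukk1202-36000/v1>
    rdfs:label "Clevedon Beach" ;
    dct:valid [ time:intervalStarts   <http://reference.data.gov.uk/id/year/2003> ;
                time:intervalFinishes <http://reference.data.gov.uk/id/year/2011> ] .

<http://example.org/id/bathing-water/ukk1202-36000/v2>
    rdfs:label "Clevedon Sands" ;
    # open period: no intervalFinishes yet, so this is the current version
    dct:valid [ time:intervalStarts <http://reference.data.gov.uk/id/year/2011> ] .
```

Queries then find the right version by filtering on the validity interval, or follow the non-monotonic “latest value” link maintained separately.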
23. Application to case study
weekly and annual samples
use Data Cube pattern (n-ary relation)
withdrawn samples
replacement pattern (no explicit change event)
Data Cube slice for “latest valid assessment”
generated by a SPARQL Update query
API gives easy access to the latest valid values
linked data following or raw SPARQL query allows drilling into changes
changes to bathing water profile
versioning pattern
bathing water entity points to latest profile (SPARQL Update again)
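Materializing a “latest” view with SPARQL Update might look roughly like this (a sketch only: bwq:latestAssessment is a hypothetical property, and the string comparison relies on the year URIs sorting chronologically):

```sparql
PREFIX bwq: <http://environment.data.gov.uk/def/bathing-water-quality/>

# drop any stale "latest" link, then point each bathing water at its
# most recent assessment
DELETE { ?bw bwq:latestAssessment ?old }
INSERT { ?bw bwq:latestAssessment ?obs }
WHERE {
  { SELECT ?bw (MAX(STR(?y)) AS ?latest)
    WHERE { ?o bwq:bathingWater ?bw ; bwq:sampleYear ?y . }
    GROUP BY ?bw }
  ?obs bwq:bathingWater ?bw ; bwq:sampleYear ?year .
  FILTER ( STR(?year) = ?latest )
  OPTIONAL { ?bw bwq:latestAssessment ?old }
}
```

Re-running the script after each load leaves the link correct, which is what makes the non-monotonic “latest” view cheap to maintain.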
24. From pilot to practice
reduce modelling costs
patterns
reuse
handling change and update
patterns
publication process
automation
conversion (dive 3)
publication
embed in the business process
use internally as well as externally
publish once, use many
data platform
25. Automation
Transform and publish data feed increments
transformation engine service
reusable mappings, low cost to adapt to new feeds
linking to reference data
publication service that supports non-monotonic changes
[diagram: data increments (csv) feed a transform service driven by reusable
xform specs, with reconciliation against reference data; transformed output
goes to a publication service feeding replicated servers]
26. Transformation service
declarative specification of transform
single service support range of transformations
easy to adapt transformation to new feeds and modelling
changes
R2RML – RDB to RDF Mapping Language
specify mapping from database tables to RDF triples
W3C candidate recommendation
D2RML
R2RML extension to treat CSV feed as a database table
28. Using patterns
raw mappings are verbose, which increases reuse costs
extend to support modelling patterns
Data Cube
specify mapping to observation with measures and dimensions
engine generates Data Set and Data Structure Definition
automatically
29. D2RML cube map example
:dataCubeMap a dr:DataCubeMap ;
    rr:logicalTable "dataSource" ;
    # instances will automatically link to the base Data Set
    dr:datasetIRI "http://example.org/datacube1"^^xsd:anyURI ;
    dr:dsdIRI "http://example.org/myDsd"^^xsd:anyURI ;
    dr:observationMap [
        rr:subjectMap [
            rr:termType rr:IRI ;
            rr:template "http://example.org/observation/{PLACE}/{DATE}" ] ;
        # implies an entry in the auto-generated Data Structure Definition
        rr:componentMap [
            dr:componentType qb:measure ;
            rr:predicate aq:concentration ;
            # defines how the measure value is to be represented
            rr:objectMap [ rr:column "NO2" ; rr:datatype xsd:decimal ; ]
        ] ;
    ] ;
    ...
30. But what about linking?
connect observations to reference data
a core value of linked data
R2RML has Term Maps to create values
constants and templates
extend to allow maps based on other data sources
Lookup map
lookup resource in a store, fetch predicate
Reconcile
specify lookup in a remote service
use Google Refine reconciliation API
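A lookup map might look roughly like this — note that the dr:lookup* properties and the dr namespace below are purely illustrative names, not actual D2RML syntax:

```turtle
@prefix rr:   <http://www.w3.org/ns/r2rml#> .
@prefix dr:   <http://example.org/def/d2rml#> .   # hypothetical namespace
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/def/> .

# illustrative only: a lookup term map that swaps a raw CSV string
# for the matching reference-data URI
[] rr:predicateObjectMap [
    rr:predicate ex:samplingPoint ;
    rr:objectMap [
        dr:lookupStore    <http://reference.example.org/sparql> ; # store holding the reference set
        dr:lookupProperty rdfs:label ;   # match the CSV value against this property
        rr:column         "SITE_NAME"    # CSV column supplying the lookup key
    ]
] .
```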
31. Automation
Transform and publish data feed increments
transformation engine service
reusable mappings, low cost to adapt to new feeds
linking to reference data
publication service that supports non-monotonic changes
32. Publication service
goals
cope with non-monotonic effects of change representation
so replication is robust and cheap (=> make it idempotent)
solution
SPARQL Update
publish transformed increment as a simple DATA INSERT
then run SPARQL Update script for non-monotonic links
dct:isReplacedBy links
latest value slices
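The two steps can be sketched as a SPARQL Update request (resource and property names are hypothetical, apart from dct:isReplacedBy):

```sparql
PREFIX bwq: <http://environment.data.gov.uk/def/bathing-water-quality/>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

# step 1: publish the transformed increment verbatim -- monotonic and
# idempotent, so a failed replication can simply be re-run
INSERT DATA {
  <http://example.org/data/sample/36000-2011-06-01b> a bwq:Sample ;
      bwq:samplePoint <http://example.org/id/sample-point/36000> ;
      bwq:sampleDate  "2011-06-01"^^xsd:date ;
      bwq:issued      "2011-06-03T09:00:00Z"^^xsd:dateTime .
} ;

# step 2: recompute the non-monotonic links, e.g. a resubmitted sample
# for the same point and date supersedes the earlier one
INSERT { ?old dct:isReplacedBy ?new }
WHERE {
  ?old a bwq:Sample ; bwq:samplePoint ?pt ; bwq:sampleDate ?d ; bwq:issued ?t1 .
  ?new a bwq:Sample ; bwq:samplePoint ?pt ; bwq:sampleDate ?d ; bwq:issued ?t2 .
  FILTER ( ?t2 > ?t1 )
}
```

Inserting triples that already exist is a no-op, so step 2 is also idempotent and safe to replay across the replicated servers.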
34. Automation
Transform and publish data feed increments
transformation engine service
reusable mappings, low cost to adapt to new feeds
linking to reference data
publication service that supports non-monotonic changes
35. Application to case study
Update server
transforms based on scripts (earlier scripting utility)
linking to reference data
distributed publication via
SPARQL Update
extensible range of data sets
annual assessments
in-season assessments
bathing water profile
features (e.g. pollution sources)
reference data
36. From pilot to practice
reduce modelling costs
patterns
reuse
handling change and update
patterns
publication process
automation
conversion
publication
embed in the business process (dive 4)
use internally as well as externally
publish once, use many
data platform
37. Embed in business process
embedding is critical to ensure data kept up to date
in turn needs usage
=> lower barrier to use
[diagram: two feedback loops – data not used → data goes stale → investment
hard to justify; external use + internal use → invest → rich, up-to-date data]
38. Lowering barrier to use
simple REST APIs
use Linked Data API specification
rich query without learning SPARQL
easy consumption as JSON, XML
gets developers used to data and data model
[diagram: the LD API layered over the publication service, alongside the
transform service]
39. Application to case study
embedded in process for weekly/daily updates
infrastructure to automate conversion and publishing
API plus extensive developer documentation
third party and in-house applications built over API
publish once, use many
information products as applications over a data platform,
usable externally as well as internally
40. The next stage
grow range of data publications and uses
range of reference data and sets brings new challenges
discover reference terms and models to reuse
discover datasets to use for application
discover models and links between sets
needs a coordination or registry service
story for another day ...
41. Conclusions
illustrated how public sector users of linked data are moving
from static pilots to operational systems
keys are:
reduce modelling costs through patterns and reuse
design for continuous update
automation of publication using declarative mappings and
SPARQL Update
lower barrier to use through API design and documentation
embed in organization’s process so the data is used and useful
Acknowledgements
Only possible thanks to many smart colleagues: Stuart
Williams, Andy Seaborne, Ian Dickinson, Brian McBride,
Chris Dollin
plus Alex Coley and team from the Environment Agency