Jack Verhoosel | Semantics in Dairy Farming: towards a Common Dairy Ontology
1. SEMANTICS FOR BIG DATA
APPLICATIONS IN SMART DAIRY FARMING
Jack Verhoosel, Jacco Spek
Presentation at the Semantics 2016 conference
14 September 2016
2. [Speaker note: mention the numbers for the Netherlands]
Collaboration project
3 Cooperatives
7 SMEs
5 Research institutes
7 Real farmers
Timeline:
SDF1: 2011 – 2014
Northern part of the Netherlands
Website (in Dutch):
http://www.smartdairyfarming.nl/nl/
Goal of SDF:
to support dairy farmers in the care of individual animals.
with the specific goal of a longer productive stay at the farm due to
improvement of individual health.
Challenge SDF2:
more farmers: from 7 to 60 (and prepare for 2500)
more sensor suppliers and more data consumers
incorporate semantics and big data analysis
Numbers for the Dutch situation:
• 15000+ farmers
• in total more than 1.5 million milk cows
• 20 to 200+ data fields per cow
• many different stakeholders in the
chain
SDF 1.0 (2011 – 2014)
SDF 2.0 (2015 – 2017)
3.
Starting point:
Cow centric thinking
Starting point:
Farmer in control
“De boer aan het roer” (“The farmer at the helm”)
Real time analysis models
(at different organisations)
Sensors from
different suppliers:
Lely, Delaval,
Agis, Gallagher,…
Other data sources,
CRV, FC, AgriFirm,
Weather, Satellite
InfoBroker: Open platform
for sharing (sensor) data
producers and consumers
Cow specifics
Work instructions (SOPs)
This project is made possible by:
Data sharing in the dairy chain
Think big, start small
12 GB of sensor data per year for 7 farms => 310 GB of triples
From 7 to 50 farms, of 15,000 in NL
4.
InfoBroker concept
InfoBroker functionalities:
Open interfaces for data exchange (API)
Authentication
who are you (are you allowed to log in)
Permissions
which data may be used by whom
to be set by the farmers
Naming service
location where the data can be found
– static data
– cow-centric sensor data
Integration
combining info from different sources
Pay-per-use
fixed costs (connections)
variable costs (used data)
So:
no central data store for (sensor) data,
but a broker instead,
which reduces/prevents duplication
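A minimal sketch of the broker idea listed above, assuming a hypothetical in-memory registry. The class, method, and URL names are illustrative only and are not the actual InfoBroker API. The point it shows: the broker holds permissions and data locations, never the sensor data itself.

```python
from dataclasses import dataclass, field


@dataclass
class InfoBrokerSketch:
    """Hypothetical broker: keeps permissions and data locations, no sensor data."""
    # (farm, data category) -> URL where the data lives (naming service)
    locations: dict = field(default_factory=dict)
    # (farm, data category) -> consumers the farmer has authorised (permissions)
    permissions: dict = field(default_factory=dict)

    def register_source(self, farm, category, url):
        self.locations[(farm, category)] = url

    def grant(self, farm, category, consumer):
        # Permissions are set by the farmer who owns the data.
        self.permissions.setdefault((farm, category), set()).add(consumer)

    def resolve(self, consumer, farm, category):
        """Return the data location if the consumer is permitted, else raise."""
        if consumer not in self.permissions.get((farm, category), set()):
            raise PermissionError(f"{consumer} may not read {category} of {farm}")
        return self.locations[(farm, category)]


broker = InfoBrokerSketch()
broker.register_source("farm-1", "weight", "https://sensors.example/farm-1/weight")
broker.grant("farm-1", "weight", "model-A")
print(broker.resolve("model-A", "farm-1", "weight"))
```

A consumer without a grant (for example a hypothetical "model-B") gets a PermissionError instead of a location.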
[Diagram labels: cow-centric sensor data; static data (e.g. feed, date of birth); InfoBroker; cow-centric data; Model (x3); Dashboard; cow-specific work instructions (SOPs); x 15,000+]
5. InfoBroker – Facts & Figures
Farm             1    2    3    4    5    6    7
# cows/calves  459  186  315  239  706  202  351
Sensor data categories (number of the 7 farms marked "x"):
Behaviour        2
Temperature      2
Activity         6
Milk production  5
Food intake      3
Weight           7
Water intake     2
Milk intake      2
Date: February 2015
NB1: these are “sensor data categories” at a farm
NB2: not all animals are monitored for SDF (e.g. at farms 3 and 4 only calves)
6. InfoBroker – Facts & Figures
[Charts: number of cows vs. time; number of sensor fields vs. time]
7. WHY LINKED DATA AND SEMANTICS?
1. To make the various data sets accessible in an
automatically linkable manner for easier integration
2. To align the semantics of the datasets in isolation
as well as in combination using ontologies
3. To enable a rich set of questions to be queried on
the datasets for better analysis
8. BIG DATA ANALYSIS QUESTIONS
AgriFirm:
“How much feed did a group of cows at a dairy farm take in during a certain period, per type of feed, and how strong is the correlation with milk yield?”
CRV:
“What was the average weight per day over the last lactation period of a cow, and what was the weight increase/decrease over that period?”
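As an illustration of the correlation part of the AgriFirm question, a minimal sketch that computes the Pearson correlation between daily feed intake and milk yield. The numbers are hypothetical, not SDF data, and the slides do not say which correlation measure was used.

```python
from math import sqrt

# Hypothetical daily totals for one group of cows (not SDF data).
feed_kg = [21.0, 22.5, 20.0, 24.0, 23.5]
milk_kg = [28.0, 30.0, 27.0, 32.5, 31.0]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


r = pearson(feed_kg, milk_kg)
print(f"correlation between feed intake and milk yield: {r:.2f}")
```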
9. SIMPLE EXAMPLE OF LINKED DATA
Subject | predicate | Object
Cow | is a (type) | Animal
Cow | grazes on | Parcel
Parcel | is a (type) | Grassland
Parcel | has surface | 40 ha
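The example on this slide can be written down as plain (subject, predicate, object) triples. A minimal sketch, in which the helper name `match` is an assumption made for illustration, shows how a pattern with wildcards retrieves statements and how one statement links to the next.

```python
# The four statements from the slide as (subject, predicate, object) triples.
triples = [
    ("Cow", "is a", "Animal"),
    ("Cow", "grazes on", "Parcel"),
    ("Parcel", "is a", "Grassland"),
    ("Parcel", "has surface", "40 ha"),
]


def match(pattern, data):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in data
        if all(p is None or p == v for p, v in zip(pattern, t))
    ]


# Everything known about Parcel:
print(match(("Parcel", None, None), triples))

# Following a link: on what kind of thing does the cow graze?
grazed = [o for _, _, o in match(("Cow", "grazes on", None), triples)]
kinds = [o for g in grazed for _, _, o in match((g, "is a", None), triples)]
print(kinds)
```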
10. LINKED DATA ROADMAP*
The four design principles of Linked
Data (by Tim Berners-Lee):
1. Use Uniform Resource Identifiers
(URIs) as names for things.
2. Use HTTP URIs so that people can
look up those names.
3. When someone looks up a URI,
provide useful information, using the
standards (RDF*, SPARQL).
4. Include links to other URIs so that
they can discover more things.
LODRefine
*Based on PLDN LD roadmap
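A minimal sketch of principles 1, 2 and 4: things get HTTP URIs as names, and statements link one URI to another. The namespace `http://example.org/dairy/` and the local names are hypothetical. Only the `rdf:type` URI is a real, standard identifier. The output uses the N-Triples line format.

```python
BASE = "http://example.org/dairy/"  # hypothetical namespace, for illustration only
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def uri(local_name):
    """Mint an HTTP URI for a thing in the hypothetical namespace."""
    return f"<{BASE}{local_name}>"


def ntriple(subject, predicate, obj):
    """Serialise one statement as an N-Triples line."""
    return f"{subject} {predicate} {obj} ."


lines = [
    ntriple(uri("cow/1"), f"<{RDF_TYPE}>", uri("Cow")),
    ntriple(uri("cow/1"), uri("grazesOn"), uri("parcel/7")),
    ntriple(uri("parcel/7"), f"<{RDF_TYPE}>", uri("Grassland")),
]
print("\n".join(lines))
```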
11. ONTOLOGY-BASED SDF
[Diagram labels, top to bottom: Visualization and analysis apps; Common Dairy Ontology; MS-ontology (Measurements Triples); ST-ontology (Static Triples); mapping; cow-specific data per farm per sensor equipment (Delaval, Lely, Nedap, Agis, … sensor data); static cow-specific data per farm (e.g. date of birth)]
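A minimal sketch of the "mapping" layer in this architecture: supplier-specific field names are renamed to one shared vocabulary so that a single query can cover all suppliers. The supplier field names, the `cdo:` property names, and the weight values are assumptions; the real export formats and the real ontology terms are not shown in the slides.

```python
# Hypothetical supplier field names and hypothetical "cdo:" property names.
MAPPINGS = {
    "lely": {"animal_id": "cdo:cow", "weight_kg": "cdo:weight", "ts": "cdo:measuredAt"},
    "delaval": {"CowNo": "cdo:cow", "BodyWeight": "cdo:weight", "Time": "cdo:measuredAt"},
}


def to_common(supplier, record):
    """Rename supplier-specific fields to the shared vocabulary."""
    mapping = MAPPINGS[supplier]
    return {mapping[key]: value for key, value in record.items() if key in mapping}


# Hypothetical records for the same cow from two suppliers.
lely_row = {"animal_id": "NL 715820911", "weight_kg": 612.0, "ts": "2014-05-01"}
delaval_row = {"CowNo": "NL 715820911", "BodyWeight": 615.5, "Time": "2014-05-02"}

common = [to_common("lely", lely_row), to_common("delaval", delaval_row)]
# Both records now share the same keys, so one query can cover both suppliers.
print(common)
```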
12. COMMON DAIRY ONTOLOGY
“What was the average weight per day and the weight increase/decrease over the last lactation period of a cow in a group?”
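A minimal sketch of the computation behind this question, on hypothetical measurements: average the readings per day within the lactation period, then take the difference between the last and the first daily average. How the project defines the lactation period and the weight change is not stated in the slides, so both are assumptions here.

```python
from collections import defaultdict
from datetime import date

# Hypothetical weight measurements (cow, day, kg); several readings per day occur.
measurements = [
    ("cow-1", date(2014, 3, 1), 600.0),
    ("cow-1", date(2014, 3, 1), 604.0),
    ("cow-1", date(2014, 3, 2), 598.0),
    ("cow-1", date(2014, 3, 3), 590.0),
    ("cow-1", date(2014, 3, 3), 594.0),
]
# Hypothetical lactation period of the cow.
period_start, period_end = date(2014, 3, 1), date(2014, 3, 3)

per_day = defaultdict(list)
for cow, day, kg in measurements:
    if cow == "cow-1" and period_start <= day <= period_end:
        per_day[day].append(kg)

# Average weight per day, in date order.
daily_avg = {day: sum(v) / len(v) for day, v in sorted(per_day.items())}
days = list(daily_avg)
# Increase (positive) or decrease (negative) over the period.
change = daily_avg[days[-1]] - daily_avg[days[0]]

print(daily_avg)
print(f"weight change over the period: {change:+.1f} kg")
```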
14. PLASIDO: OUR BIG, LINKED DATA PLATFORM
Powerful server: 128GB memory, 5TB storage
Triplestores:
Marmotta Triplestore with Relational TripleDB
Jena Fuseki Triplestore with Native GraphDB
Virtuoso Relational ClassicalDB with SPARQL-2-SQL interface
All SDF data of 2014 and 2015 retrieved from InfoBroker
Converted into triples using LODRefine and RDF generator
From 12 GB to 310 GB, an increase by a factor of about 25
Stored in Marmotta and Fuseki for comparison
Application development:
Angular-Javascript JSON converter
Google Visualization
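A minimal sketch of the CSV-to-triples conversion step, assuming a hypothetical CSV layout and a simple "one triple per cell" rule; the actual LODRefine and RDF generator configuration is not shown in the slides. Each cell becomes a statement that repeats its subject and predicate, which is one plausible reason for the size increase reported above.

```python
import csv
import io

# Hypothetical CSV export with one row per measurement moment.
CSV_TEXT = """cow,day,weight_kg,milk_kg,activity
NL1,2014-05-01,612.0,28.5,140
NL1,2014-05-02,613.5,29.0,155
"""


def row_to_triples(row, row_id):
    """Turn one CSV row into one triple per column."""
    subject = f"measurement/{row_id}"
    return [(subject, column, value) for column, value in row.items()]


triples = []
for i, row in enumerate(csv.DictReader(io.StringIO(CSV_TEXT))):
    triples.extend(row_to_triples(row, i))

print(len(triples), "triples from 2 CSV rows")

# Size increase reported on the slide: 12 GB of CSV became about 310 GB of triples.
print(f"factor: {310 / 12:.1f}")
```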
19. PLASIDO PERFORMANCE TESTS
1. All 2014 SDF triple data from Infobroker into Apache
Marmotta triplestore
12GB of CSV data turned into +/- 310 GB of RDF triples
Marmotta uses a classical relational database to store triples
Simple queries with only one parameter can be answered easily
More complex queries led to unacceptable response times
The main reason is inefficient access to the underlying RDB
2. Next step: switch all data to Apache Jena Fuseki triplestore
Still +/- 310 GB of RDF triples
Fuseki uses a modern graph database to store triples
Simple queries with only one parameter can be answered easily
More complex analysis queries lead to long, but still acceptable, response
times of up to 15 minutes
So, an acceptable performance.
See next slide for some numbers.
20. PLASIDO PERFORMANCE
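Query | Input | Graph size | Search parameters | Response time
Select an overview with the number of cows of a farmer | Stokman | 111,604,625 | 1 | 0.04s
Select an overview with the number of cows of a farmer | Bakker | 167,894,559 | 1 | 0.03s
Select an overview with the number of cows of a farmer | Antonides | 79,739,365 | 1 | 0.37s
Select the list of cows with number and parity | Stokman | 28,704 | 3 | 0.934s
Select the list of cows with number and parity | Bakker | 9,400 | 3 | 15.110s
Select the list of cows with number and parity | Antonides | 45,816 | 3 | 27.006s
Select feed per type per day over all cows of a farmer | Stokman | 66,551,765 | 3 | 913.003s
Select feed per type per day over all cows of a farmer | Bakker | 38,034,692 | 3 | 350.917s
Select feed per type per day over all cows of a farmer | Antonides | 45,637,592 | 3 | 380.470s
Select average weight over all cows per day per parity | Antonides | 45,637,592 | 3 | 348.704s
Select static info for a cow | NL 715820911 | 45,816 | 2 | 0.094s
Select weight per day in lactation period | NL 715820911 | 45,683,408 | 5 | 5.129s
Select weight and milk yield per day in lactation period | NL 715820911 | 45,683,408 | 7 | 13.714s
Select milk yield per day in lactation period | NL 715820911 | 45,683,408 | 3 | 4.142s
With set-up 2, using the Apache Jena Fuseki triplestore with graph DB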
21. PLASIDO PERFORMANCE TESTS
Conclusion of previous 2 tests: working with large triplesets is
acceptable
However, is it really necessary to put all data into triples?
Why not use only the CDO as semantic interface and leave all
data in classical table database format?
3. Next step: put all data into tables and use the RDB of Virtuoso
Only +/- 10 GB of raw data
Uses a classical relational database to store raw data in table form
SPARQL interface based on CDO
Mapping between SPARQL and SQL to translate queries towards RDB
Simple queries with only one parameter can be easily answered
More complex analysis queries lead to errors, because the SPARQL-2-SQL
mapping generates SQL queries and results that are too large to be
handled by Virtuoso
So, an unacceptable performance for this Virtuoso-based solution.
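A minimal sketch of why SQL generated from SPARQL can grow quickly. It assumes a single generic triple table in SQLite, which is not the schema or mapping that Marmotta or Virtuoso actually use (neither is shown in the slides). In this toy translation, every extra triple pattern in the query adds one more join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("m1", "cow", "NL1"), ("m1", "day", "2014-05-01"), ("m1", "weight", "612.0"),
        ("m2", "cow", "NL1"), ("m2", "day", "2014-05-02"), ("m2", "weight", "613.5"),
    ],
)


def patterns_to_sql(predicates):
    """Build SQL for patterns '?m <pred> ?value' that share the subject ?m.

    Each pattern becomes one more copy of the triples table in the join.
    The predicates are trusted constants here, so plain string formatting is used.
    """
    select = ", ".join(f"t{i}.o AS {p}" for i, p in enumerate(predicates))
    joins = "triples t0" + "".join(
        f" JOIN triples t{i} ON t{i}.s = t0.s" for i in range(1, len(predicates))
    )
    where = " AND ".join(f"t{i}.p = '{p}'" for i, p in enumerate(predicates))
    return f"SELECT {select} FROM {joins} WHERE {where} ORDER BY t0.s"


sql = patterns_to_sql(["cow", "day", "weight"])
print(sql)
rows = conn.execute(sql).fetchall()
print(rows)
```

Three patterns already need two self-joins; an analysis query with many patterns, filters, and aggregates grows accordingly.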
22. CONCLUSION AND NEXT STEPS
CDO ontology application in SDF architecture
Further enhancement of CDO to cover the dairy domain extensively
Architectural study to enable the use of CDO with InfoBroker
CDO as semantic interface for InfoBroker
Apply AI deep learning algorithms to perform data analytics
More extensive performance studies
Other linked data platforms to deal with big data: D2RQ
Other possibilities for improvements: Linked Data Fragments
Dealing with heterogeneous sensor data that is measured in different ways
Enabling analysis based on incomplete sensor data
RDF stream processing for dealing with streaming sensor data
The relationship between open and linked data
What is LD: connecting data with semantics so that data sources can be combined
LOD roadmap
Step1:
Select: BOMOD
Step2:
Datasets are similar to raw material: they first have to be refined before they become useful. Data cleaning (also referred to as cleansing or scrubbing) describes the process of: fixing errors, transforming and homogenizing formats, aligning inconsistencies in data and metadata, removing duplicate and redundant information, adding lacking information, and making sure the information is up-to-date. One concrete example is the deletion of white spaces and empty cells in a dataset and the identification of missing data.
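A minimal sketch of the concrete cleaning example in this step, on a hypothetical raw export: strip white space, drop rows that contain only empty cells, and report which cells are missing.

```python
import csv
import io

# Hypothetical raw export with stray whitespace, an empty row and a missing value.
RAW = """cow , day, weight_kg
 NL1 ,2014-05-01 , 612.0
,,
NL2, 2014-05-01,
"""

rows = []
for raw_row in csv.reader(io.StringIO(RAW)):
    cells = [cell.strip() for cell in raw_row]   # delete surrounding white space
    if any(cells):                               # drop rows with only empty cells
        rows.append(cells)

header, records = rows[0], rows[1:]
# Identify missing data: (record index, column name) for each empty cell.
missing = [
    (i, header[j])
    for i, record in enumerate(records)
    for j, cell in enumerate(record)
    if cell == ""
]
print(records)
print(missing)
```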
Step3:
Make a conceptual model of the data by defining concepts and their relationships and properties. You can use the logical data model obtained when preparing the data as input for this step.
Investigate how others are already describing similar or related data in vocabularies.
Formalize the model and your vocabulary, preferably in the Web Ontology Language OWL
Step4:
Step 7:
About the dataset: follow the CKAN rules. Description, website/explanation, LOD stars, use of vocabularies, time period.
Step 8:
Which issues/challenges?
How to turn data into LD? LOD roadmap
How to make LD available? Infrastructure: adapters, InfoBroker