An e-Infrastructure is a distributed network of service nodes, residing on multiple sites and managed by one or more organizations. e-Infrastructures allow scientists residing at distant places to collaborate. They offer a multiplicity of facilities as-a-service, supporting data sharing and usage at different levels of abstraction, e.g. data transfer, data harmonization and data processing workflows. e-Infrastructures are gaining an important place in the field of biodiversity conservation. Their computational capabilities help scientists to reuse models, obtain results in a shorter time and share these results with colleagues. They are also used to access several heterogeneous biodiversity catalogues.
In this course, the D4Science e-Infrastructure will be used to conduct experiments in the field of biodiversity conservation. D4Science hosts models and contributions by several international organizations involved in the biodiversity conservation field. The course will give students an overview of the models, practices and methods that large international organizations like FAO and UNESCO apply by means of D4Science. At the same time, the course will introduce students to the basic concepts underlying e-Infrastructures, Virtual Research Environments, data sharing and the reproducibility of experiments.
2. Module 4 - Outline
1. Data processing requirements by communities of practice
2. The D4Science Statistical Manager
3. Ecological modelling
3. D4Science
D4Science is both a Data and a Computational e-Infrastructure
• Used by several Projects: i-Marine, EUBrazil OpenBio, ENVRI;
• Implements the notion of e-Infrastructure as-a-Service: it offers on-demand access to data management services and computational facilities;
• Hosts several VREs for Fisheries Managers, Biologists, Statisticians…and Students.
4. D4Science - Resources
• A large set of connected biodiversity and taxonomic datasets
• A network to distribute and access geospatial data
• A distributed storage system to store datasets and documents
• A social network to share opinions and useful news
• Algorithms for biology-related experiments
7. Some interests of communities of practice in computational statistics:
1. Repetition and validation of experiments
2. Exploitation of algorithms in several contexts
3. Hiding the complexity of the calculations
4. Facilitating the management and the publication of the algorithms
Issues
8. …practically speaking, they search for:
1. Modular and pluggable solutions
2. Access by means of standard protocols
3. Hiding the complexity of parallel processing
4. Hiding the complexity of software management and provisioning
5. Active contribution with new algorithms and use cases
Issues
10. The Statistical Manager is a set of web services that aim to:
• Help scientists in computational statistics experiments
• Supply ready-to-use, state-of-the-art algorithms as-a-Service
• Perform calculations with Map-Reduce in a way that is transparent to the users
• Share input, results, parameters and comments with colleagues by means of Virtual Research Environments in the D4Science e-Infrastructure
Statistical Manager – Users’ View
[Figure: the Statistical Manager and the D4Science Computational Facilities; labelled interactions: Sharing, Setup and execution]
12. The Statistical Manager allows developers to:
• Develop distributed computations in an easy way (Statistical Manager Framework)
• Parallelize R scripts, possibly without changing the code
• Automatically produce a user interface to perform experiments
• Reuse models and best practices developed by the community
• Connect external computational facilities via the OGC WPS standard
Statistical Manager – Developers’ View
24. Niche Modelling
Scope:
• characterize the environmental conditions that are suitable for the species to subsist;
• identify where suitable environment is distributed in geographical space;
• estimate the actual and potential geographic distributions of a species.
Actual distribution: areas that are truly occupied by the species
Fundamental niche: the full range of abiotic conditions within which the species is viable
Potential distribution: areas with abiotic conditions that fall within the fundamental niche
25. Niche Modelling and Absence and Presence Points
Approaches:
Mechanistic models: incorporate physiological limits in a species' tolerance to environmental conditions;
Correlative models: automatically estimate the environmental conditions that are suitable for a species by relying on examples.
Presence points: occurrence records, i.e. places where the species has been observed in its habitat.
Absence points: locations where the environment is considered unsuitable for the species.
In many cases, absence points must be simulated (pseudo-absence points), because reliable data are rare.
26. Examples: Potential Distributions of the Coelacanth
[Figure panels: Presence-only: MaxEnt; Presence-only: GARP; Expert (semi-mechanistic): AquaMaps; Presence/absence: Artificial Neural Networks]
Comparison between several approaches estimating the potential distribution of the Coelacanth.
The best approach depends on the quality of the data.
Thus, data cleaning operations are very important!
27. C-squares (concise spatial query and representation system):
• A system of geocodes that provides a basis for simple spatial indexing of geographic features
• Devised by Tony Rees of CSIRO Marine and Atmospheric Research
• A compact encoding of Latitude and Longitude and Resolution
Example:
C-square code: 3414:227:3
Resolution: 0.5°
N,S,W,E limits: -42.5,-43.0,147.0,147.5
A useful converter: http://www.marine.csiro.au/marq/csq_builder.init
C-square codes
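The example above can be reproduced with a small decoder. The following Python sketch assumes the usual c-squares layout: a four-digit 10° group (global quadrant 1 = NE, 3 = SE, 5 = SW, 7 = NW, followed by the tens of latitude and the tens of longitude), then three-digit groups that refine the cell by a factor of ten, and an optional final single digit that halves it. The function name `decode_csquare` is hypothetical, and the sketch does not validate malformed codes.

```python
def decode_csquare(code):
    """Return (resolution, north, south, west, east) in degrees for a c-square code."""
    groups = code.split(":")
    head = groups[0]
    quadrant = int(head[0])                      # 1 = NE, 3 = SE, 5 = SW, 7 = NW
    lat_sign = 1 if quadrant in (1, 7) else -1
    lon_sign = 1 if quadrant in (1, 3) else -1
    abs_lat = int(head[1]) * 10.0                # tens of latitude
    abs_lon = int(head[2:4]) * 10.0              # tens of longitude
    size = 10.0
    for group in groups[1:]:
        if len(group) == 3:
            # quadrant digit, next latitude digit, next longitude digit
            size /= 10.0
            abs_lat += int(group[1]) * size
            abs_lon += int(group[2]) * size
        elif len(group) == 1:
            # final digit: 1-4 selects one quarter of the current cell
            size /= 2.0
            quarter = int(group) - 1
            abs_lat += (quarter // 2) * size
            abs_lon += (quarter % 2) * size
        else:
            raise ValueError("unexpected group: " + group)
    lat_a, lat_b = lat_sign * abs_lat, lat_sign * (abs_lat + size)
    lon_a, lon_b = lon_sign * abs_lon, lon_sign * (abs_lon + size)
    return size, max(lat_a, lat_b), min(lat_a, lat_b), min(lon_a, lon_b), max(lon_a, lon_b)


print(decode_csquare("3414:227:3"))  # (0.5, -42.5, -43.0, 147.0, 147.5)
```

The result is ordered as resolution followed by the N, S, W, E limits, matching the example on the slide.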
28. Contains information on:
a) cell codes
b) statistical cell properties (center, limits, and area);
c) membership in relevant areas (FAO areas, EEZs or LMEs);
d) physical attributes (depth, salinity or temperature);
e) biological properties (e.g. primary production).
Data gathered from:
Sea Around Us Project
CSIRO
Kansas Geological Survey
Compiled by:
Kristin Kaschner & Jonathan Ready
HCAF (Half-degree Cells Authority File)
29. Contains information used for describing the environmental
tolerance and preference of a species:
• distribution using FAO areas and bounding box
• range of values per environmental parameter (min., preferred min., preferred max., max.)
HSPEN (Half-degree Species Environmental Envelope)
32. Contains the assignment of a species to a half-degree cell and the corresponding probability of occurrence of the species in that cell.
The assignment probability is the product of the probabilities computed for each environmental parameter (SST, salinity, primary production, sea ice concentration, distance to land).
HSPEC (Half-degree Species Assignment)
33. AquaMaps
Gadus morhua
A Presence-only species model that relies on expert knowledge about the species habitat
• AquaMaps Suitable: estimates the Potential Distribution
• AquaMaps Native: estimates the Actual Distribution
• Maps have 0.5 degrees resolution;
• Expert knowledge is used in modelling the habitat parameters;
• AquaMaps adopts mechanistic assumptions combined with an automatic estimation of parameter values.
34. • “good cells” - within bounding box or known FAO areas
• a minimum of 10 “good cells” is needed for extracting parameters
Bounding box or FAO area limits serve as independent verification of the validity of occurrence records.
AquaMaps – Good Cells
Taken from: http://www.aquamaps.org/main/presentations/Part%20II%20-%20AquaMaps%20behind%20the%20scene.pdf
35. Global grid of 259,200 half-degree cells
Good cells are used to derive the range of environmental parameters within the species’ native range.
AquaMaps – Extracting Environmental Parameters
Taken from: http://www.aquamaps.org/main/presentations/AquaMaps_General0908.pdf
36. • Depth ranges: typically from literature; depth estimate based on habitat description
• Min = 25th percentile - 1.5 * interquartile range, or the absolute minimum in the extracted data (whichever is lesser)
• Max = 75th percentile + 1.5 * interquartile range, or the absolute maximum in the extracted data (whichever is greater)
• PrefMin = 10th percentile of observed variation in an environmental parameter
• PrefMax = 90th percentile of observed variation in an environmental parameter
• Surface values for species with min depth ≤ 200m
• Bottom values for species with min depth > 200m
The environmental envelopes describe the tolerances of a species with respect to each environmental parameter.
AquaMaps – Environmental Envelopes
Taken from: http://www.aquamaps.org/main/presentations/AquaMaps_General0908.pdf
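The percentile rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: Min is taken as the lesser of its two candidates and Max as the greater, so that the envelope always contains the observed data; percentiles use linear interpolation between sorted values, which may differ from the method AquaMaps actually applies; and the names `percentile` and `envelope` are hypothetical.

```python
def percentile(sorted_values, p):
    """Percentile with linear interpolation (an assumed method)."""
    rank = (len(sorted_values) - 1) * p / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = rank - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


def envelope(values):
    """Return (min, pref_min, pref_max, max) for one environmental parameter."""
    data = sorted(values)
    q1 = percentile(data, 25)
    q3 = percentile(data, 75)
    iqr = q3 - q1
    env_min = min(q1 - 1.5 * iqr, data[0])    # assumed: whichever is lesser
    env_max = max(q3 + 1.5 * iqr, data[-1])   # whichever is greater
    pref_min = percentile(data, 10)
    pref_max = percentile(data, 90)
    return env_min, pref_min, pref_max, env_max


# Illustrative values only: sea surface temperatures extracted from "good cells".
print(envelope([4.1, 5.0, 5.6, 6.2, 6.8, 7.5, 8.3, 9.0, 9.4, 10.2]))
```

The input would be the values of one parameter (surface or bottom, depending on the species' minimum depth) read from the good cells of a species.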
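37. [Figure: relative probability of occurrence plotted against a predictor; PMax is marked on the probability axis, and Min, Preferred min, Preferred max and Max are marked on the predictor axis]
$P_c = P_{\mathrm{bathymetry},c} \times P_{\mathrm{SST},c} \times P_{\mathrm{salinity},c} \times P_{\mathrm{chl\,a},c} \times P_{\mathrm{IceDist},c} \times P_{\mathrm{LandDist},c}$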
Probabilities of species occurrence are generated by matching the species' environmental envelope against local environmental conditions to determine the relative suitability of a given area.
Probability of Occurrence
AquaMaps – Environmental Envelopes
Taken from: http://www.aquamaps.org/main/presentations/AquaMaps_General0908.pdf
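A minimal Python sketch of this matching step follows. It assumes a trapezoidal response for each parameter: zero outside [Min, Max], PMax between the preferred bounds, and a linear ramp in between. The linear ramp is an assumption suggested by the axis labels of the figure, not something the slides state, and the parameter names and envelope values are purely illustrative.

```python
def envelope_probability(value, env_min, pref_min, pref_max, env_max, p_max=1.0):
    """Relative probability of occurrence for one parameter (assumed trapezoid)."""
    if value < env_min or value > env_max:
        return 0.0
    if value < pref_min:
        return p_max * (value - env_min) / (pref_min - env_min)
    if value > pref_max:
        return p_max * (env_max - value) / (env_max - pref_max)
    return p_max


def cell_probability(cell_values, envelopes):
    """Multiply the per-parameter probabilities of one half-degree cell."""
    probability = 1.0
    for name, bounds in envelopes.items():
        probability *= envelope_probability(cell_values[name], *bounds)
    return probability


# Hypothetical envelope: (min, preferred min, preferred max, max) per parameter.
envelopes = {"sst": (0.0, 10.0, 20.0, 30.0), "salinity": (30.0, 32.0, 36.0, 38.0)}
print(cell_probability({"sst": 5.0, "salinity": 34.0}, envelopes))  # 0.5
```

Because the per-parameter values are multiplied, a single parameter outside its envelope is enough to bring the probability of the whole cell to zero.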
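38. The probability is calculated for each 0.5° cell in the oceans.
A color is associated with each probability value.
$P_c = P_{\mathrm{bathymetry},c} \times P_{\mathrm{SST},c} \times P_{\mathrm{salinity},c} \times P_{\mathrm{chl\,a},c} \times P_{\mathrm{IceDist},c} \times P_{\mathrm{LandDist},c}$
AquaMaps – Probability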
41. Artificial Neural Network
[Figure: a network that takes presence/absence point examples as input and outputs a probability (1/0)]
• Learns from positive (presence) and negative (absence) examples (training mode);
• Adapts the network weights to produce the correct outputs on the examples;
• Produces probability values for new input (test mode).
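The training and test modes can be illustrated with the smallest possible network: a single sigmoid neuron trained by gradient descent. This is a deliberately simplified sketch, not the multi-layer network used for the experiments; the function names, the learning rate, the number of epochs and the toy data (one environmental feature scaled to [0, 1]) are all assumptions.

```python
import math


def predict(weights, bias, features):
    """Test mode: probability of presence for one point."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def train_neuron(examples, epochs=2000, learning_rate=0.5):
    """Training mode: adapt weights to reproduce the presence (1) / absence (0) labels."""
    n_features = len(examples[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for features, label in examples:
            error = predict(weights, bias, features) - label
            for k, x in enumerate(features):
                grad_w[k] += error * x
            grad_b += error
        # Move the weights against the mean gradient of the log-loss.
        weights = [w - learning_rate * g / len(examples) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(examples)
    return weights, bias


# Toy data: absence at low values of the feature, presence at high values.
examples = [([0.0], 0), ([0.1], 0), ([0.2], 0), ([0.8], 1), ([0.9], 1), ([1.0], 1)]
weights, bias = train_neuron(examples)
print(predict(weights, bias, [0.95]), predict(weights, bias, [0.05]))
```

After training, a new point close to the presence examples receives a probability above 0.5 and a point close to the absence examples receives one below 0.5.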
45. • HCAF scenarios can be simulated by means of interpolation.
• Interpolation produces half-degree values between a start and an end date.
• Once new HCAFs are available, we can produce an HSPEC for each HCAF.
Simulation of HCAF Scenarios
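As a rough sketch of the idea, the following Python function interpolates every parameter of every cell linearly between two HCAF snapshots. Linear interpolation, the dictionary layout, the function name `interpolate_hcaf` and the sample values are all assumptions made for illustration; the slides do not say which interpolation scheme is used.

```python
def interpolate_hcaf(start_cells, end_cells, start_year, end_year, target_year):
    """Build an intermediate HCAF scenario by linear interpolation (assumed scheme)."""
    fraction = (target_year - start_year) / (end_year - start_year)
    scenario = {}
    for cell_code, start_values in start_cells.items():
        end_values = end_cells[cell_code]
        scenario[cell_code] = {
            name: start + fraction * (end_values[name] - start)
            for name, start in start_values.items()
        }
    return scenario


# Made-up values for a single cell, keyed by its c-square code.
hcaf_start = {"3414:227:3": {"sst": 10.0, "ice_concentration": 0.0}}
hcaf_end = {"3414:227:3": {"sst": 14.0, "ice_concentration": 0.0}}
print(interpolate_hcaf(hcaf_start, hcaf_end, 2010, 2050, 2030))
```

Each interpolated HCAF can then be fed to the species model to produce the corresponding HSPEC.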
46. Climate Change Effects on Species
Estimated impact of climate change over 20 years on 11549 species.
[Figure: Bioclimate HSpec, overall occupancy in time]
49. • Group points by spatial distance or density
• Detect outliers
Occurrence Points Clustering
50. DBSCAN acts on the density of the points
Parameters:
• Epsilon = 10
• Min Points = 2
[Figure: clustered occurrence points, with the outliers marked]
Density Clustering
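To make the role of the two parameters concrete, here is a compact pure-Python version of the textbook DBSCAN procedure. It is a sketch, not the implementation running in the Statistical Manager: it assumes plain Euclidean distance on the coordinates (the unit of Epsilon is not stated on the slide), counts a point as its own neighbour, and uses made-up sample points.

```python
import math

NOISE = -1


def dbscan(points, epsilon, min_points):
    """Return one cluster label per point; outliers are labelled NOISE."""
    labels = [None] * len(points)

    def neighbours(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= epsilon]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_points:
            labels[i] = NOISE          # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_points:
                queue.extend(reachable)  # core point: keep growing the cluster
    return labels


points = [(0, 0), (1, 1), (2, 0), (50, 50), (51, 51), (200, 200)]
print(dbscan(points, epsilon=10, min_points=2))  # [0, 0, 0, 1, 1, -1]
```

The isolated point has no other point within Epsilon, so it is reported as an outlier.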
51. XMeans: K = [20,30], Min Points = 2, MaxIter = 1000. No outliers detected!
KMeans: K = 24, Min Points = 2, MaxIter = 1000, MaxOptSteps = 1000. No outliers detected!
Distance Clustering
54. Similarity between habitats
Habitat Representativeness Score:
• Measures the degree to which sampled habitats are representative of a certain area of study;
• Has been used for assessing the minimum number of surveys on a study area that are needed to cover a good heterogeneity of species habitat variables.
Can be used to:
• Measure the similarity between the environmental features of two areas;
• Assess the quality of models and environmental features.
[Figure: example with HRS = 10.6]
Habitat Representativeness Score
55. Habitat Representativeness Score
Absence and presence points (A+P): HRS 10.58
Presence points only (P): HRS 10.61
The HRS is too high: all the maps may be unreliable and need expert validation.
HRS is in [0;2] for each feature.
The overall HRS is the sum of the HRSs of the environmental features.
56. Habitat Representativeness Score for each Feature
Feature                        Presence, Absence   Presence only
mean depth in t.c.             1.90                1.92
max depth in t.c.              0.87                0.86
min depth in t.c.              0.04                0.04
mean annual sea surface temp   1.19                1.13
mean annual sea bottom temp    1.59                1.56
mean salinity in t.c.          1.23                1.29
mean bottom salinity in t.c.   0.44                0.34
mean primary production        0.61                0.64
annual ice concentration       0.71                0.78
distance from land             0.46                0.49
ocean area in t.c.             1.54                1.55
Overall HRS                    10.58               10.61
In both cases, the most representative feature is the minimum depth in a cell of 0.5 degrees.
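The aggregation rule can be checked with a few lines of Python, using the per-feature scores reported for the presence and absence case. The sketch only covers the final summation step, not the computation of the per-feature scores themselves, and the function name `summarize_hrs` is hypothetical. It treats the lowest score as the most representative feature, as the slide does.

```python
def summarize_hrs(feature_hrs):
    """Return (overall HRS, most representative feature) from per-feature scores."""
    for name, score in feature_hrs.items():
        if not 0.0 <= score <= 2.0:
            raise ValueError("HRS out of [0;2] for feature: " + name)
    overall = sum(feature_hrs.values())
    most_representative = min(feature_hrs, key=feature_hrs.get)
    return overall, most_representative


presence_absence = {
    "mean depth": 1.90, "max depth": 0.87, "min depth": 0.04,
    "mean annual surface temp": 1.19, "mean annual bottom temp": 1.59,
    "mean salinity": 1.23, "mean bottom salinity": 0.44,
    "mean primary production": 0.61, "annual ice concentration": 0.71,
    "distance from land": 0.46, "ocean area": 1.54,
}
overall, feature = summarize_hrs(presence_absence)
print(round(overall, 2), feature)  # 10.58 min depth
```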
59. BiOnym
A workflow approach to taxon name matching.
Accounts for:
• Variations in the spelling and interpretation of taxonomic names
• Combination of data from different sources
• Harmonization and reconciliation of taxa names
Workflow:
1. Raw input string, e.g. Gadus morua Lineus 1758
2. Preprocessing and parsing
3. Taxon Matcher 1, Taxon Matcher 2, …, Taxon Matcher n
4. Postprocessing
5. Correct transcriptions, e.g. Gadus morhua (Linnaeus, 1758)
Reference sources: ASFIS, FISHBASE, WoRMS, others in DwC-A
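The chain-of-matchers idea can be sketched as follows. This is not BiOnym's code: the parsing rule (first two tokens are the scientific name, the rest is the authorship), the two matchers, the similarity cutoff and the tiny reference list are all assumptions chosen to keep the example short. The fuzzy matcher relies on `difflib` from the Python standard library.

```python
import difflib

# Hypothetical reference list; a real run would load names from ASFIS, FishBase or WoRMS.
REFERENCE_NAMES = ["Gadus morhua", "Latimeria chalumnae", "Architeuthis dux"]


def parse(raw):
    """Preprocessing and parsing: split the name from the authorship (naive rule)."""
    tokens = raw.split()
    return " ".join(tokens[:2]), " ".join(tokens[2:])


def exact_matcher(name, reference):
    return [r for r in reference if r.lower() == name.lower()]


def fuzzy_matcher(name, reference, cutoff=0.8):
    by_lower = {r.lower(): r for r in reference}
    close = difflib.get_close_matches(name.lower(), list(by_lower), n=3, cutoff=cutoff)
    return [by_lower[c] for c in close]


def match_taxon(raw, reference, matchers=(exact_matcher, fuzzy_matcher)):
    """Try each matcher in turn and return the candidates of the first one that succeeds."""
    name, _authorship = parse(raw)
    for matcher in matchers:
        candidates = matcher(name, reference)
        if candidates:
            return candidates
    return []


print(match_taxon("Gadus morua Lineus 1758", REFERENCE_NAMES))  # ['Gadus morhua']
```

Ordering the matchers from strict to tolerant means that a misspelled name is only passed to the fuzzy matcher when no exact match exists.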
60. Matcher Example - GSAy
Step    Rate   Notes
GSAy    950    Complete match
GSAY    940    Parentheses issue
GSrAy   930    Gender agreement issues
GSrAY   920    Gender agreement and parentheses issues
GSA     910    Year issues
GSrA    900
GSY     890
GSrY    880
SAy     870
SAY     860
SrAy    850
SrAY    840
GAy     830
GAY     820
…