Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)

BlueBRIDGE receives funding from the European Union’s Horizon 2020
research and innovation programme under grant agreement No. 675680 www.bluebridge-vres.eu
Using e-Infrastructures for Biodiversity Conservation
Gianpaolo Coro
CNR, Italy
gianpaolo.coro@isti.cnr.it
(on behalf of the InfraScience group of ISTI-CNR, Pisa, Italy)

Context
Progress in Information Technology has changed
the paradigms of Science
• The large and fast increase in the volume and complexity of data requires new approaches to collect, curate and analyse the data
• This requires new tools to guarantee the exchange and longevity of the data and the re-application of the experiments

Big Data
• Large volume
• High generation velocity
• Large variety
• Untrustworthy (veracity)
• High complexity (variability)
• Value
Big Data: a dataset with large volume, variety and generation velocity, containing complex and untrustworthy information, that requires nonconventional methods to extract, manage and process information within a reasonable time.

New Science Paradigms
• Open Science: make scientific research, data and dissemination accessible to all levels of an inquiring society, amateur or professional.
Keywords: Open Access, Open research, Open Notebook Science
• E-Science: computationally intensive science is carried out in highly distributed network environments that use large data sets and require distributed computing and collaborative tools.
Keywords: Provenance of the scientific process, Scientific workflows
• Science 2.0: process and publish large data sets using a collaborative approach. Share from raw data to experimental results and processes. Support collaborative experiments and Reproducibility-Repeatability-Reusability (R-R-R) of Science.
Keywords: collaborative and repeatable Science

Requirements for IT systems
• Support collaborative research and experimentation
• Implement Reproducibility-Repeatability-Reusability of
Science
• Allow sharing data, processes and findings
• Grant free access to the produced scientific knowledge
• Tackle Big Data challenges
• Sustainability: low operational and maintenance costs
• Manage heterogeneous access policies for data and processes
• Meet the requirements of industrial processes

e-Infrastructures
e-Infrastructures enable researchers at different locations across the world
to collaborate in the context of their home institutions or in national or multinational
scientific initiatives.
• People can work together having shared access to unique or distributed scientific
facilities (including data, instruments, computing and communications).
Examples:
Belief, http://www.beliefproject.org/
OpenAire, http://www.openaire.eu/
i-Marine, http://www.i-marine.eu/
EU-Brazil OpenBio, http://www.eubrazilopenbio.eu/

Virtual Research Environments
• Define sub-communities
• Allow temporary dedicated assignment of computational, storage, and data resources
• Manage policies
• Support data and information sharing
[Diagram: the e-Infrastructure integrates external e-Infrastructures (via WPS) into a Unified Resource Space and enables the VREs]

Virtual Research Environments
Innovative, web-based, community-oriented, comprehensive, flexible, and
secure working environments.
• Communities are provided with applications to interact with the VRE services
• Client services are provided both with APIs (Java, R) and simple HTTP-REST interfaces

VREs Example
The D4Science e-Infrastructure
D4Science supports scientists in several domains
1. More than 25 000 taxonomic studies per month (www.i-marine.eu)
2. More than 60 000 species distribution maps produced and hosted (www.d4science.eu)
3. Used to build a pan-European geothermal energy map (www.egip.d4science.org)
4. Processing and management of heterogeneous environmental and Earth system data (www.envriplus.eu)
5. Enhances communication and exchange in Linguistic Studies, Humanities, Cultural Heritage, History and Archaeology (www.parthenos-project.eu)

BlueBRIDGE VREs
Stock Assessment: assess the health status of fisheries stocks.
http://www.bluebridge-vres.eu/services/stock-assessment
[Figure: CMSY model]
Marine Protected Areas: reduce the adverse impact of human activities (e.g. fishing, aquaculture, tourism) on ecosystems, and ensure these activities are properly embedded in policy frameworks.
http://www.bluebridge-vres.eu/services/protected-area-impact-maps

Education VREs
Lecture-style: the emphasis on course topics varies with the audience
Interactive: after each topic is explained, students do experiments
Experimental: students reproduce the experiment shown by the teacher and possibly repeat it on their own data
Social: students communicate via messaging or the VRE discussion panel
• 1 course/year in Pisa
• 1 course/year in Paris
• 12 courses in Copenhagen (International Council for the Exploration of the Sea)
• 38 courses all over the world
1000+ attendees
www.bluebridge-vres.eu

BlueBRIDGE VREs: Social Networking
Social networking is key to sharing information in an e-Infrastructure.
BlueBRIDGE offers a continuously updated list of events and news produced by users and applications.
[Screenshot labels: User-shared News, Application-shared News, Share News]

BlueBRIDGE VREs: The Workspace, an online file storage system
A free-to-use, folder-based file system allows managing and sharing information objects.
Information objects can be
• files, datasets, workflows, experiments, etc.
• organized into folders
• shared
• disseminated via public URLs

BlueBRIDGE Facilities: Overview
• Storage: databases, cloud storage, geospatial data
• Data management: metadata generation and management, harmonisation, sharing
• Processing: cloud computing, elastic resources assignment, multi-platform (R, Java, Fortran)

Innovation Through Integration
Vision: integration, sharing, and remote hosting help to inform people and to support decision making

Cloud Computing Platform
• Experiments on Big Data
• Sharing of inputs and results
• Saves the provenance of experiments
• Supports R-R-R of experiments
[Diagram: the Cloud Computing Platform is accessed via WPS and REST (labelled NEW); input/output, parameters and provenance are stored in the Workspace]
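As an illustration of how a process exposed through WPS could be invoked, the sketch below builds an OGC WPS 1.0.0 Execute request in key-value (GET) form. The endpoint, the process identifier and the input names are hypothetical placeholders, not the actual BlueBRIDGE ones, and authentication is left out.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical endpoint, used only for illustration.
WPS_ENDPOINT = "https://example.org/wps/WebProcessingService"


def build_wps_execute_url(endpoint, process_id, inputs):
    """Build an OGC WPS 1.0.0 Execute request in key-value (GET) form."""
    # WPS 1.0.0 encodes inputs as "name=value" pairs separated by ';'.
    data_inputs = ";".join(f"{name}={value}" for name, value in inputs.items())
    query = urlencode({
        "service": "WPS",
        "version": "1.0.0",
        "request": "Execute",
        "Identifier": process_id,
        "DataInputs": data_inputs,
    })
    return f"{endpoint}?{query}"


url = build_wps_execute_url(
    WPS_ENDPOINT,
    "org.example.MAPS_COMPARISON",  # hypothetical process identifier
    {"Layer_1": "layerA", "Layer_2": "layerB", "Resolution": 0.5},
)
print(url)
# Decoding the query string recovers the inputs as they were passed.
print(parse_qs(urlsplit(url).query)["DataInputs"][0])
```

Because the request is a plain URL, the same call can be issued from R, Java, a browser or a GIS client, which is what makes a standard interface convenient for repeating an experiment.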

Prov-O
(https://www.w3.org/TR/prov-o/)
“Provenance is information about
entities, activities, and people
involved in producing a piece of data
or thing, which can be used to form
assessments about its quality,
reliability or trustworthiness.”
The PROV Ontology (PROV-O)
expresses the PROV Data Model
using the OWL2 Web Ontology
Language (OWL2).
It provides a set of classes,
properties, and restrictions that can
be used to represent and interchange
provenance information generated in
different systems and under different
contexts.
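A minimal sketch of what a provenance record for one experiment might look like when written with PROV-O terms. The `prov:` namespace and the terms used (`prov:Entity`, `prov:Activity`, `prov:Agent`, `prov:used`, `prov:wasGeneratedBy`, `prov:wasAssociatedWith`, `prov:wasAttributedTo`) come from the W3C specification; the `ex:` namespace and all resource names are hypothetical, and this is not necessarily the layout the platform produces.

```python
PROV_NS = "http://www.w3.org/ns/prov#"  # official PROV-O namespace
EX_NS = "http://example.org/"           # hypothetical namespace for this example


def provenance_turtle(experiment, user, inputs, output):
    """Return a Turtle document describing one experiment with PROV-O terms."""
    if not inputs:
        raise ValueError("at least one input is required")
    # Every input is an entity used by the activity.
    used = " ,\n        ".join(f"ex:{name}" for name in inputs)
    lines = [
        f"@prefix prov: <{PROV_NS}> .",
        f"@prefix ex: <{EX_NS}> .",
        "",
        f"ex:{user} a prov:Agent .",
        f"ex:{experiment} a prov:Activity ;",
        f"    prov:wasAssociatedWith ex:{user} ;",
        f"    prov:used {used} .",
    ]
    lines += [f"ex:{name} a prov:Entity ." for name in inputs]
    lines += [
        f"ex:{output} a prov:Entity ;",
        f"    prov:wasGeneratedBy ex:{experiment} ;",
        f"    prov:wasAttributedTo ex:{user} .",
    ]
    return "\n".join(lines)


doc = provenance_turtle(
    experiment="cmsy_run_1",                  # hypothetical names
    user="scientist_a",
    inputs=["catch_series", "run_parameters"],
    output="msy_estimate",
)
print(doc)
```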

BlueBRIDGE Computational
Capabilities
Project resources:
• 28 Virtual Machines (VMs) with 418 CPU cores, 636 GB of RAM and 4 TB of ephemeral storage
• 100 VMs with 200 CPU cores, 800 GB of RAM and 2 TB of ephemeral storage
• Storage: 350 TB
Processes:
• ~ 225 algorithms hosted in all the VREs
• ~ 20 contributing institutes
• ~ 30,000 requests per month
• ~ 2000 scientists/students in 44 countries using VREs
• Programming languages: R, Java, Python, Fortran, Linux-compiled
External providers (European Grid Infrastructure):
• 6 VMs: 8 virtual CPU cores, 16 GB of RAM and 100 GB of storage
• 5 VMs: 4 virtual CPU cores, 8 GB of RAM and 80 GB of disk

Integrating new processes
Integration: putting a script or a process that works offline onto the Cloud computing platform.
[Diagram: R script → SAI importing tool → (automatic) → computing platform with Web interface and Web service]
Coro G., Panichi G., Pagano P. A Web application to publish R scripts as-a-Service on a Cloud computing platform.
In: Bollettino di Geofisica Teorica e Applicata, vol. 52 article n. 51. Istituto Nazionale di Oceanografia e di Geofisica
Sperimentale, 2016.
https://wiki.gcube-system.org/gcube/How-to_Implement_Algorithms_for_DataMiner
https://wiki.gcube-system.org/gcube/Statistical_Algorithms_Importer

Statistical Algorithms Importer (SAI)
System features:
1. RStudio-like interface
2. Simple definition of script input and output
3. Global variables
4. Associate data type to the I/O
5. Request packages
6. Automatic software production
7. Automatic deployment

Advantages
• The process is available as-a-Service
• Invoked via communication standards
• Higher computational capabilities
• Automatic creation of a Web interface
• Provenance management
• Storage of results on a high-availability system
• Collaboration and sharing
• Re-usability, Reproducibility, Repeatability, also from other software (e.g. QGIS)

Collaborative experiments
[Diagram: inputs, outputs and results are exchanged between shared online folders in the Workspace (WS) and the computational system, which is accessed either in the e-Infrastructure or through third party software]

Scientific Workflow with Code Privacy Guarantee
• The script provider updates the script on his/her private Workspace
• The service downloads the script on-the-fly
• A user executes an experiment on his/her data
• The output, the input and the parameters can be shared with another user
• This user can execute the experiment again and share the computation with the other user
[Diagram: the interactions are numbered 1 to 10]

Limitations and requirements
Issues:
• Code is often designed for one precise data set
• Prototype scripts often have code that is not separable from the I/O
In the context of e-Infrastructures and Science 2.0:
• Modularity is necessary for integration
• Scripts should be re-organised so that they can be re-used on other data without changing the code
[Diagram: required (Input, Script and Output as separate blocks) vs provided (a single Script block)]
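A small sketch of the kind of re-organisation meant here: the computation is kept free of I/O, and the data source and column name become parameters rather than constants buried in the code. The function names and the `length_cm` column are hypothetical, chosen only for the example.

```python
import csv
import io


def mean_length(lengths):
    """Pure computation: no file names, no printing, no plotting."""
    values = list(lengths)
    if not values:
        raise ValueError("at least one length is required")
    return sum(values) / len(values)


def read_lengths(stream, column):
    """I/O layer: read one numeric column from any CSV-like text stream."""
    return [float(row[column]) for row in csv.DictReader(stream)]


def run(stream, column="length_cm"):
    """Entry point: input and column name are parameters, not constants."""
    return mean_length(read_lengths(stream, column))


# The same code runs on a file, an upload or, as here, an in-memory sample.
sample = io.StringIO("species,length_cm\nGadus morhua,80\nGadus morhua,100\n")
print(run(sample))
```

With this layout a platform only has to map its declared inputs onto the arguments of `run`; the code itself does not change when the data set does.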

Towards Science 2.0
[Diagram: a self-consistent computational object in the Workspace (WS) supports Repeatability, Reusability and Reproducibility through Provenance (Prov-O) and the use of standards]

Geospatial data processing
• Maps comparison
• Data extraction (NetCDF file)
• Signal processing
• Periodicity detection
• Maps generation

Maps Comparison
Compares:
• Species distribution maps
• Environmental layers
• SAR images
Coro, G., Pagano, P., & Ellenbroek, A. (2014). Comparing heterogeneous distribution maps for marine species. GIScience & Remote Sensing, 51(5), 593-611.

Clustering and Outliers Detection
• Input: presence points (example species: Cetorhinus maximus)
• Density-based clustering and outliers detection: DBSCAN
• Distance-based clustering: K-Means, X-Means
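To make the density-based option concrete, here is a compact pure-Python sketch of DBSCAN, which labels points in dense regions with a cluster id and isolated points as outliers. It is a textbook version for illustration, not the platform's implementation; the coordinates are made up, and distances are computed in the plane (a simplification for geographic data).

```python
from math import dist

NOISE = -1


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE (-1)."""
    labels = [None] * len(points)

    def neighbours(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = neighbours(j)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # j is a core point: keep expanding
    return labels


# Hypothetical presence points: two dense groups and one isolated record.
presence = [(0, 0), (0, 1), (1, 0), (1, 1),
            (10, 10), (10, 11), (11, 10), (11, 11),
            (50, 50)]
print(dbscan(presence, eps=1.5, min_pts=3))
```

In this example the isolated record at (50, 50) receives the label -1, which is how an outlier presence point would be flagged for review.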

Ecological Niche Modelling
• Example species: Atlantic cod, Coelacanth, Giant squid
• Models: AquaMaps, Neural Networks, Maximum Entropy
Coro, G., Magliozzi, C., Ellenbroek, A., & Pagano, P. (2015). Improving data quality to build a robust
distribution model for Architeuthis dux. Ecological Modelling, 305, 29-39.

Estimating Similarity Between Habitats
Habitat Representativeness Score:
1. Measures the similarity between the environmental features of two areas
2. Assesses the quality of models and environmental features
[Figure: Habitat Representativeness Score for Latimeria chalumnae, HRS = 10.5]
Coro, G., Pagano, P., & Ellenbroek, A. (2013). Combining simulated expert knowledge with Neural Networks
to produce Ecological Niche Models for Latimeria chalumnae. Ecological modelling, 268, 55-63.

Occurrence Points Processing
Sources:
• Occurrence Data from GBIF (www.gbif.org)
• Occurrence Data from OBIS (www.iobis.org)
Record fields (for both data sets A and B): x,y; Event Date; Modif Date; Author; Species Scientific Name
Operations, based on records similarity:
• Intersection (∩)
• Difference (-)
• Union (∪)
• Duplicates Deletion (DD)
Candela, L., Castelli, D., Coro, G., Lelii, L., Mangiacrapa, F., Marioli, V., & Pagano, P. (2015). An infrastructure-oriented approach for supporting biodiversity research. Ecological Informatics, 26, 162-172.
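The four operations on occurrence records can be sketched with a similarity key, as below. The rule used here (same species, same event date, coordinates equal after rounding) is an assumption made for the example and is not taken from the cited paper; the records are invented.

```python
from collections import namedtuple

Occurrence = namedtuple("Occurrence", "x y event_date species author")


def key(rec, decimals=2):
    """Assumed similarity rule: same species, same event date and same
    coordinates after rounding."""
    return (rec.species.lower(), rec.event_date,
            round(rec.x, decimals), round(rec.y, decimals))


def delete_duplicates(records):
    seen = {}
    for rec in records:
        seen.setdefault(key(rec), rec)  # keep the first record per key
    return list(seen.values())


def intersection(a, b):
    keys_b = {key(r) for r in b}
    return [r for r in delete_duplicates(a) if key(r) in keys_b]


def difference(a, b):
    keys_b = {key(r) for r in b}
    return [r for r in delete_duplicates(a) if key(r) not in keys_b]


def union(a, b):
    return delete_duplicates(list(a) + list(b))


# Invented records standing in for data sets A and B.
set_a = [Occurrence(12.341, 43.10, "2010-05-01", "Gadus morhua", "A"),
         Occurrence(5.0, 60.0, "2011-06-01", "Gadus morhua", "A")]
set_b = [Occurrence(12.339, 43.10, "2010-05-01", "gadus morhua", "B"),
         Occurrence(-3.0, 55.0, "2012-01-01", "Gadus morhua", "B")]

print(len(intersection(set_a, set_b)),
      len(difference(set_a, set_b)),
      len(union(set_a, set_b)))
```

Changing `key` (for instance, widening the coordinate tolerance or ignoring the author) changes what counts as the same observation without touching the four operations.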

Absence Locations Estimation
• Intersect survey data focussing on a target species
• Maximise the separation between locations with and without occurrences
• Spatially aggregate
• Estimate absence locations
Coro, G., Magliozzi, C., Berghe, E. V., Bailly, N., Ellenbroek, A., & Pagano, P. (2016). Estimating absence locations of marine species from data of scientific surveys in OBIS. Ecological Modelling, 323, 61-76.

Detecting Trends in Species Abundance
• Fill some knowledge gaps on marine species
• Account for sampling biases
• Define trends for common species
[Figure panels: plankton regime shift; herring recovery after the fishing ban]
Appeltans W., Pissierssens P., Coro G., Italiano A., Pagano P., Ellenbroek A., Webb T. Trendylyzer: a long-term trend analysis on biogeographic data. In: Bollettino di Geofisica Teorica e Applicata:
an International Journal of Earth Sciences, vol. 54 (Suppl.) pp. 203 - 205. Supplement: IMDIS 2013 - International Conference on Marine Data and Information Systems, 23-25 September, Lucca
(Italy). OGS - Istituto Nazionale di Oceanografia e di Geofisica Sperimentale, 2013.

Estimating Climate Change Effects on Species Distributions
• AquaMaps actual (native) distribution
• Today vs 2050 (~11 500 maps)
• Discover classes of changes by means of cluster analysis
Coro, G., Magliozzi, C., Ellenbroek, A., Kaschner, K., & Pagano, P. (2015). Automatic classification of climate change effects on marine species distributions in 2050
using the AquaMaps model. Environmental and Ecological Statistics, 1-26.

Cluster Analysis to Detect Common Species
Columns (each value is the cluster average of the named variable):
(1) average_number_of_species_occurrences_per_dataset
(2) number_of_datasets_containing_at_least_one_observation_for_the_
(3) number_of_6_minute_cells_containing_at_least_one_observation_fo
(4) number_of_months_containing_at_least_one_occurrence_record_for_
(5) no_months_with_a_least_10_occurrences
(6) nInd/nOcc

Cluster      (1)     (2)     (3)     (4)     (5)     (6)
Cluster 0    100     100     100     100     100     100
Cluster 1    14.46   78.57   41.05   88.90   79.65   11.14
Cluster 2    2.43    63.04   12.90   66.16   31.16   5.64
Cluster 3    0.16    53.57   1.62    27.12   1.36    0.41

Normalization with respect to the maximum value for each column
Common: frequent, widespread, high individual density
Moderate Commonness: moderately frequent, moderately widespread, medium individual density
Moderate-Low Commonness: poorly widespread, low-moderately frequent, low individual density
Low Commonness: quite localized, not frequent, usually low individual density
• The term “common species” refers intuitively to a species that is abundant in a certain area, widespread and at low risk of extinction.
• Consequently, “rare species” are less abundant and possibly threatened.
• Automatically detecting common and rare species, and how their status changes through time, is an important step in understanding the consequences of environmental change for ecosystem functioning.
Coro, G., Webb, T. J., Appeltans, W., Bailly, N., Cattrijsse, A., & Pagano, P. (2015). Classifying degrees of species commonness: North Sea fish as a case study. Ecological Modelling, 312, 272-280.
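The normalisation applied to the cluster table (each column rescaled so that its maximum becomes 100) can be sketched as follows. The raw values are invented for the example and are not the study's data.

```python
def normalise_by_column_max(rows):
    """Rescale every column so that its largest value becomes 100."""
    maxima = [max(column) for column in zip(*rows)]
    if any(m <= 0 for m in maxima):
        raise ValueError("every column needs a positive maximum")
    return [[100.0 * value / m for value, m in zip(row, maxima)]
            for row in rows]


# Invented cluster averages: one row per cluster, one column per variable.
raw = [[200, 28],
       [50, 14],
       [10, 7]]
for row in normalise_by_column_max(raw):
    print(row)
```

After this step the columns are comparable even though the underlying variables have different units, which is why the most common cluster shows 100 in every column.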

Invasive species
• Seven data mining techniques to estimate the spread of the puffer fish in the Mediterranean Sea;
• The approach is applicable also to other species;
• Produced impact maps on FAO areas, EEZs and GSAs.
Under publication

Search in Large Taxonomic Names Repositories
A flexible workflow approach to taxon name matching
Accounts for:
• Variations in the spelling and interpretation of taxonomic names
• Combination of data from different sources
• Harmonization and reconciliation of taxa names
Example:
• Raw input string: Gadus morua Lineus 1758
• Correct transcription: Gadus morhua (Linnaeus, 1758)
Workflow: Preprocessing and Parsing → Taxon name Matcher 1, Matcher 2, ..., Matcher n → PostProcessing
Reference sources: ASFIS, FISHBASE, WoRMS, OBIS
Berghe, E. V., Coro, G., Bailly, N., Fiorellato, F., Aldemita, C., Ellenbroek, A., & Pagano, P. (2015). Retrieving taxa names
from large biodiversity data collections using a flexible matching workflow. Ecological Informatics, 28, 29-41.
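A toy version of one matcher in such a workflow, assuming a simple parse (the first two words are taken as genus and species) and fuzzy string matching from Python's standard `difflib`. The reference list is a tiny illustrative stand-in for sources such as ASFIS, FishBase, WoRMS or OBIS, and the similarity measure is not the one used in the cited paper.

```python
import re
from difflib import get_close_matches

# Tiny illustrative reference list; real sources hold far more names.
REFERENCE = ["Gadus morhua", "Gadus macrocephalus",
             "Clupea harengus", "Cetorhinus maximus"]


def parse_binomial(raw):
    """Preprocessing and parsing: keep the first two alphabetic tokens
    as genus and species (a simplifying assumption)."""
    tokens = re.findall(r"[A-Za-z]+", raw)
    if len(tokens) < 2:
        raise ValueError("expected at least a genus and a species name")
    return f"{tokens[0].capitalize()} {tokens[1].lower()}"


def match_taxon(raw, reference=REFERENCE, cutoff=0.8):
    """Return the closest reference name, or None if nothing is close enough."""
    candidate = parse_binomial(raw)
    matches = get_close_matches(candidate, reference, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(match_taxon("Gadus morua Lineus 1758"))
```

Several matchers with different rules (exact, phonetic, edit distance) could be chained in the same way, each one receiving only the names the previous one failed to resolve.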

Vessels data analysis
Most exploited locations detection
Routes interpolation
Fishing activity estimation
Coro, G., Fortunati, L., & Pagano, P. (2013, June). Deriving fishing monthly effort and caught species from vessel
trajectories. In OCEANS-Bergen, 2013 MTS/IEEE (pp. 1-5).

Forecasting Fishery Statistics
Frequency and time series
structure detection (with SSA)
was used to forecast effort, catch
and locations of purse seine
fishing in the Indian Ocean.
Coro, G., Large, S., Magliozzi, C., & Pagano, P. (2016).
Analysing and forecasting fisheries time series: purse
seine in Indian Ocean as a case study. ICES Journal of
Marine Science: Journal du Conseil, fsw131.

Stock assessment
Length-Weight Relations: estimates Length-Weight relation parameters for marine species, using Bayesian methods. Developed by R. Froese, T. Thorson and R. B. Reyes
SGVM interpolation: interpolation of vessel trajectories. Developed by the Study Group on VMS, involving ICES
FAO MSY: stock assessment for FAO catch data. Developed by the Resource Use and Conservation Division of the FAO Fisheries and Aquaculture Department (ref. Y. Ye - FAO)
ICCAT VPA: stock assessment method for International Commission for the Conservation of Atlantic Tunas (ICCAT) data. Developed by Ifremer and IRD (ref. S. Bonhommeau, J. Bard)
CMSY: estimates Maximum Sustainable Yield from catch statistics. Prime choice for ICES as main stock assessment tool. Developed by R. Froese, G. Coro, N. Demirel, K. Kleisner and H. Winker
[Figure: Atlantic herring]
BlueBRIDGE reduced time-to-market: for state-of-the-art models that estimate Maximum Sustainable Yield, computational time was reduced by 95% on average
Froese, R., Demirel, N., Coro, G., Kleisner, K. M., & Winker, H. (2016).
Estimating fisheries reference points from catch and resilience. Fish and
Fisheries.
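For readers unfamiliar with length-weight relations, the usual form is W = a·L^b, which becomes linear after taking logarithms: log W = log a + b·log L. The sketch below estimates a and b by ordinary least squares on the log-log scale. This is a deliberately simpler technique than the Bayesian method named above, and the data are synthetic.

```python
from math import exp, log


def fit_length_weight(lengths, weights):
    """Least-squares fit of W = a * L**b on the log-log scale."""
    xs = [log(length) for length in lengths]
    ys = [log(weight) for weight in weights]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("at least two distinct lengths are required")
    b = sxy / sxx                   # slope of the log-log line
    a = exp(mean_y - b * mean_x)    # intercept, back-transformed
    return a, b


# Synthetic data generated from a = 0.01 and b = 3 (no noise).
lengths = [10.0, 20.0, 40.0, 80.0]
weights = [0.01 * length ** 3 for length in lengths]
a, b = fit_length_weight(lengths, weights)
print(round(a, 4), round(b, 4))
```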

Links
Web Portals
• bluebridge.d4science.org
• services.d4science.org
Web sites
• www.bluebridge-vres.eu
• www.gcube-system.org
• www.d4science.org
• www.i-marine.eu
