Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)

  1. Using e-Infrastructures for Biodiversity Conservation. Gianpaolo Coro, CNR, Italy (on behalf of the InfraScience group of ISTI-CNR, Pisa, Italy). BlueBRIDGE receives funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 675680.
  2. Context: Progress in Information Technology has changed the paradigms of Science • The large and fast increase in the volume and complexity of data requires new approaches to collecting, curating and analysing the data • This requires new tools to guarantee the exchange and longevity of the data and the reapplication of the experiments
  3. Big Data • Large volume • High generation velocity • Large variety • Untrustworthy (veracity) • High complexity (variability) • Value. Big Data: a dataset with large volume, variety and generation velocity, containing complex and untrustworthy information, that requires nonconventional methods to extract, manage and process information within a reasonable time.
  4. New Science Paradigms • Open Science: make scientific research, data and dissemination accessible to all levels of an inquiring society, amateur or professional. Keywords: Open Access, Open research, Open Notebook Science • E-Science: computationally intensive science is carried out in highly distributed network environments that use large data sets and require distributed computing and collaborative tools. Keywords: Provenance of the scientific process, Scientific workflows • Science 2.0: process and publish large data sets using a collaborative approach. Share everything from raw data to experimental results and processes. Support collaborative experiments and Reproducibility-Repeatability-Reusability (R-R-R) of Science. Keywords: collaborative and repeatable Science
  5. Requirements for IT systems • Support collaborative research and experimentation • Implement Reproducibility-Repeatability-Reusability of Science • Allow sharing data, processes and findings • Grant free access to the produced scientific knowledge • Tackle Big Data challenges • Sustainability: low operational costs, low maintenance prices • Manage heterogeneous data/processes access policies • Meet industrial processes requirements
  6. e-Infrastructures: e-Infrastructures enable researchers at different locations across the world to collaborate in the context of their home institutions or in national or multinational scientific initiatives. • People can work together with shared access to unique or distributed scientific facilities (including data, instruments, computing and communications). Examples: Belief, OpenAire, i-Marine, EU-Brazil OpenBio.
  7. Virtual Research Environments • Define sub-communities • Allow temporary dedicated assignment of computational, storage, and data resources • Manage policies • Support data and information sharing [Figure labels: e-Infrastructure; Unified Resource Space; VREs; WPS; External e-Infrastructures]
  8. Virtual Research Environments Innovative, web-based, community-oriented, comprehensive, flexible, and secure working environments. • Communities are provided with applications to interact with the VRE services • Client services are provided both with APIs (Java, R) and simple HTTP-REST interfaces
  9. VREs Example: The D4Science e-Infrastructure. D4Science supports scientists in several domains: 1. More than 25 000 taxonomic studies per month 2. More than 60 000 species distribution maps produced and hosted 3. Used to build a pan-European geothermal energy map 4. Processing and management of heterogeneous environmental and Earth system data 5. Enhances communication and exchange in Linguistic Studies, Humanities, Cultural Heritage, History and Archaeology
  10. BlueBRIDGE VREs • Stock Assessment: assess the health status of fisheries stocks (CMSY model). • Marine Protected Areas: reduce the adverse impact of human activities (e.g. fishing, aquaculture, tourism) on ecosystems, and ensure these activities are properly embedded in policy frameworks (impact maps).
  11. Education VREs • Lecture-style: the emphasis on course topics varies with the audience • Interactive: after each explained topic, students do experiments • Experimental: students reproduce the experiment shown by the teacher and possibly repeat it on their own data • Social: students communicate via messaging or the VRE discussion panel. Courses: • 1 course/year in Pisa • 1 course/year in Paris • 12 courses in Copenhagen (International Council for the Exploration of the Sea) • 38 courses all over the world, with more than 1000 attendees
  12. BlueBRIDGE VREs: Social Networking. Social networking is key to sharing information in an e-Infrastructure. BlueBRIDGE offers a continuously updated list of events and news produced by users and applications (user-shared news and application-shared news).
  13. BlueBRIDGE VREs: The Workspace, an online file storage system. A free-to-use, folder-based file system allows managing and sharing information objects. Information objects can be • files, datasets, workflows, experiments, etc. • organized into folders • shared • disseminated via public URLs
  14. BlueBRIDGE Facilities: Overview • Storage: databases, cloud storage, geospatial data • Data management: metadata generation and management, harmonisation, sharing • Processing: cloud computing, elastic resources assignment, multi-platform (R, Java, Fortran)
  15. Innovation Through Integration. Vision: integration, sharing, and remote hosting help to inform people and support decision making
  16. Data Processing
  17. Cloud Computing Platform • Experiments on Big Data • Sharing inputs and results • Saves the provenance of experiments • Supports R-R-R of experiments [Figure labels: Cloud Computing Platform; WPS; REST; Workspace, holding Input/Output, Parameters and Provenance]
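
Invoking a hosted process through WPS can be sketched as follows. This is a minimal sketch, assuming the key-value-pair encoding of a WPS 1.0.0 Execute request; the endpoint URL, the process identifier and the input names are hypothetical placeholders and do not come from the slides.

```python
from urllib.parse import urlencode


def build_wps_execute_url(endpoint, process_id, inputs):
    """Build a WPS 1.0.0 Execute request URL using key-value-pair encoding."""
    # In KVP encoding the inputs are packed as name=value pairs joined by ';'.
    data_inputs = ";".join(f"{name}={value}" for name, value in inputs.items())
    query = urlencode({
        "service": "WPS",
        "version": "1.0.0",
        "request": "Execute",
        "identifier": process_id,
        "DataInputs": data_inputs,  # urlencode percent-escapes '=' and ';'
    })
    return f"{endpoint}?{query}"


url = build_wps_execute_url(
    "https://example.org/wps/WebProcessingService",  # hypothetical endpoint
    "org.example.MapsComparison",                     # hypothetical process id
    {"layer1": "sst_today", "layer2": "sst_2050"},    # hypothetical inputs
)
print(url)
```

Because the request is a plain URL, the same experiment can be repeated from a browser, a script or third-party software, provided the caller is authorised on the real service.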
  18. PROV-O: “Provenance is information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability or trustworthiness.” The PROV Ontology (PROV-O) expresses the PROV Data Model using the OWL2 Web Ontology Language (OWL2). It provides a set of classes, properties, and restrictions that can be used to represent and interchange provenance information generated in different systems and under different contexts.
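
A provenance record of this kind can be illustrated with a few PROV-O statements. The sketch below writes them as Turtle using only string formatting; the `prov:` classes and properties are standard PROV-O terms, while the `ex:` namespace and every resource name in it are hypothetical and do not reflect how the platform actually stores provenance.

```python
def to_turtle(triples, prefixes):
    """Serialise (subject, predicate, object) triples as a small Turtle document."""
    lines = [f"@prefix {name}: <{iri}> ." for name, iri in prefixes.items()]
    lines.append("")
    lines += [f"{s} {p} {o} ." for s, p, o in triples]
    return "\n".join(lines)


PREFIXES = {
    "prov": "http://www.w3.org/ns/prov#",
    "ex": "http://example.org/",  # hypothetical namespace for this sketch
}

# One experiment: an activity uses an input entity and generates an output entity.
TRIPLES = [
    ("ex:occurrence_data", "a", "prov:Entity"),
    ("ex:species_map", "a", "prov:Entity"),
    ("ex:niche_model_run", "a", "prov:Activity"),
    ("ex:scientist", "a", "prov:Agent"),
    ("ex:niche_model_run", "prov:used", "ex:occurrence_data"),
    ("ex:species_map", "prov:wasGeneratedBy", "ex:niche_model_run"),
    ("ex:niche_model_run", "prov:wasAssociatedWith", "ex:scientist"),
    ("ex:species_map", "prov:wasAttributedTo", "ex:scientist"),
]

document = to_turtle(TRIPLES, PREFIXES)
print(document)
```

Recording inputs, activity and agent together is what lets another user assess the result and repeat the experiment.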
  19. BlueBRIDGE Computational Capabilities. Project resources: • 28 Virtual Machines (VMs) with 418 CPU cores, 636 GB of RAM and 4 TB of ephemeral storage • 100 VMs with 200 CPU cores, 800 GB of RAM and 2 TB of ephemeral storage • Storage: 350 TB. Processes: • ~225 algorithms hosted in all the VREs • ~20 contributing institutes • ~30,000 requests per month • ~2000 scientists/students in 44 countries using VREs • Programming languages: R, Java, Python, Fortran, Linux-compiled. External providers (European Grid Infrastructure): • 6 VMs: 8 virtual CPU cores, 16 GB of RAM and 100 GB of storage • 2 VMs: 16 virtual CPU cores, 32 GB of RAM and 100 GB of storage • 24 VMs: 2 virtual CPU cores, 8 GB of RAM and 50 GB of storage • 5 VMs: 4 virtual CPU cores, 8 GB of RAM and 80 GB of disk
  20. Integrating new processes. Integration: putting a script or a process that works offline into the Cloud computing platform. [Figure labels: R script; SAI importing tool; computing platform; automatic Web interface and Web service] Coro G., Panichi G., Pagano P. A Web application to publish R scripts as-a-Service on a Cloud computing platform. In: Bollettino di Geofisica Teorica e Applicata, vol. 52 article n. 51. Istituto Nazionale di Oceanografia e di Geofisica Sperimentale, 2016.
  21. Algorithms Importer (SAI) System features: 1. RStudio-like interface 2. Simple definition of script input and output 3. Global variables 4. Associate data type to the I/O 5. Request packages 6. Automatic software production 7. Automatic deployment
  22. SAI Work Flow
  23. Advantages • The process is available as-a-Service • Invoked via communication standards • Higher computational capabilities • Automatic creation of a Web interface • Provenance management • Storage of results on a high-availability system • Collaboration and sharing • Re-usability, Reproducibility, Repeatability, also from other software (e.g. QGIS)
  24. Collaborative experiments [Figure labels: WS shared online folders; inputs, outputs, results; computational system; access in the e-Infrastructure or through third-party software]
  25. Scientific Workflow with Code Privacy Guarantee • The script provider updates the script in his/her private Workspace • The service downloads the script on-the-fly • A user executes an experiment on his/her data • The output, the input and the parameters can be shared with another user • This user can execute the experiment again and share the computation with the other user
  26. Limitations and requirements [Figure: required (script separate from its input and output) vs provided (a single script)] Issues: • Code is often designed for one precise data set • Often, prototype scripts have code that is not separable from the I/O. In the context of e-Infrastructures and Science 2.0: • Modularity is necessary for integration • Scripts should be reorganised so that they can be reused on other data without changing the code
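
Separating the computation from the I/O can be sketched as follows. The slides discuss R scripts; this sketch uses Python purely for illustration, and the function names, column names and sample data are all hypothetical.

```python
import csv
import io


def mean_of_column(rows, value_column):
    """Processing step: no file paths and no hard-coded column names."""
    values = [float(row[value_column]) for row in rows]
    return sum(values) / len(values)


def run(input_stream, value_column):
    """I/O wrapper: reads any CSV stream and delegates to the processing step."""
    rows = list(csv.DictReader(input_stream))
    return mean_of_column(rows, value_column)


# The same code runs on two different data sets; only the parameters change.
survey_a = io.StringIO("species,abundance\ncod,10\nherring,30\n")
survey_b = io.StringIO("taxon,count\nsquid,4\nshark,6\n")
print(run(survey_a, "abundance"))  # 20.0
print(run(survey_b, "count"))      # 5.0
```

Once inputs and parameters are explicit arguments, a platform can expose them as typed fields of a Web interface or service without touching the processing code.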
  27. Towards Science 2.0 [Figure labels: WS; self-consistent computational object; Repeatability; Provenance (Prov-O); Reusability; Use of standards; Reproducibility]
  28. Examples
  29. Geospatial data processing • Maps comparison • NetCDF file data extraction • Signal processing • Periodicity detection • Maps generation
  30. Maps Comparison. Compares: • Species distribution maps • Environmental layers • SAR images. Coro, G., Pagano, P., & Ellenbroek, A. (2014). Comparing heterogeneous distribution maps for marine species. GIScience & Remote Sensing, 51(5), 593-611.
  31. Clustering and Outlier Detection on presence points • Density-based clustering and outlier detection: DBSCAN • Distance-based clustering: K-Means, X-Means [Figure: Cetorhinus maximus]
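
Density-based clustering with outlier detection can be illustrated with a compact DBSCAN written in plain Python. This is a sketch of the textbook algorithm, not the implementation hosted on the platform; the coordinates are invented and the `eps` and `min_pts` values are arbitrary.

```python
import math


def region_query(points, i, eps):
    """Indices of all points within distance eps of point i (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for outliers."""
    unvisited, noise = None, -1
    labels = [unvisited] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not unvisited:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = noise  # may later be re-labelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not unvisited:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:  # j is a core point: keep expanding
                queue.extend(j_neighbours)
    return labels


# Invented presence points: two dense groups and one isolated record.
presence = [(0, 0), (0, 1), (1, 0), (1, 1),
            (10, 10), (10, 11), (11, 10), (11, 11),
            (50, 50)]
labels = dbscan(presence, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The isolated record receives the label -1, which is how a density-based method flags presence points that may deserve a quality check.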
  32. Ecological Niche Modelling • Species: Atlantic cod, Coelacanth, Giant squid • Models: AquaMaps, Neural Networks, Maximum Entropy. Coro, G., Magliozzi, C., Ellenbroek, A., & Pagano, P. (2015). Improving data quality to build a robust distribution model for Architeuthis dux. Ecological Modelling, 305, 29-39.
  33. Estimating Similarity Between Habitats. Habitat Representativeness Score (HRS): 1. Measures the similarity between the environmental features of two areas 2. Assesses the quality of models and environmental features [Figure: HRS = 10.5, Latimeria chalumnae] Coro, G., Pagano, P., & Ellenbroek, A. (2013). Combining simulated expert knowledge with Neural Networks to produce Ecological Niche Models for Latimeria chalumnae. Ecological Modelling, 268, 55-63.
  34. Occurrence Points Processing • Sources: occurrence data from GBIF and occurrence data from OBIS • Record fields for both sets A and B: x,y; Event Date; Modif Date; Author; Species Scientific Name • Operations based on records similarity: ∩ Intersection, − Difference, ∪ Union, DD Duplicates Deletion. Candela, L., Castelli, D., Coro, G., Lelii, L., Mangiacrapa, F., Marioli, V., & Pagano, P. (2015). An infrastructure-oriented approach for supporting biodiversity research. Ecological Informatics, 26, 162-172.
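
The set operations on occurrence records can be sketched as follows. The similarity rule used here (coordinates rounded to two decimals, identical event date, case-insensitive scientific name) is an assumption made for illustration; the slides do not state how record similarity is computed, and the sample records are invented.

```python
def record_key(record, decimals=2):
    """Similarity key (an assumption): rounded position, date, normalised name."""
    return (
        round(record["x"], decimals),
        round(record["y"], decimals),
        record["event_date"],
        record["scientific_name"].strip().lower(),
    )


def delete_duplicates(records):
    """Keep the first record seen for every similarity key."""
    seen = {}
    for record in records:
        seen.setdefault(record_key(record), record)
    return list(seen.values())


def union(a, b):
    return delete_duplicates(a + b)


def intersection(a, b):
    keys_b = {record_key(r) for r in b}
    return [r for r in delete_duplicates(a) if record_key(r) in keys_b]


def difference(a, b):
    keys_b = {record_key(r) for r in b}
    return [r for r in delete_duplicates(a) if record_key(r) not in keys_b]


def rec(x, y, event_date, name):
    return {"x": x, "y": y, "event_date": event_date, "scientific_name": name}


# Invented records standing in for two occurrence sets A and B.
set_a = [rec(12.501, 43.7, "2010-05-01", "Gadus morhua"),
         rec(3.0, 55.0, "2011-06-10", "Clupea harengus")]
set_b = [rec(12.502, 43.7, "2010-05-01", "gadus morhua "),
         rec(-5.0, 48.0, "2012-01-15", "Cetorhinus maximus")]

print(len(union(set_a, set_b)), len(intersection(set_a, set_b)),
      len(difference(set_a, set_b)))  # 3 1 1
```

A stricter or looser key changes which records count as the same observation, so the tolerance is best exposed as a parameter.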
  35. Absence Locations Estimation • Intersect survey data focussing on a target species • Maximise the separation between locations with and without occurrences • Spatially aggregate • Estimate absence locations. Coro, G., Magliozzi, C., Berghe, E. V., Bailly, N., Ellenbroek, A., & Pagano, P. (2016). Estimating absence locations of marine species from data of scientific surveys in OBIS. Ecological Modelling, 323, 61-76.
  36. Detecting Trends in Species Abundance • Fill some knowledge gaps on marine species • Account for sampling biases • Define trends for common species. Examples: plankton regime shift; herring recovery after the fishing ban. Appeltans W., Pissierssens P., Coro G., Italiano A., Pagano P., Ellenbroek A., Webb T. Trendylyzer: a long-term trend analysis on biogeographic data. In: Bollettino di Geofisica Teorica e Applicata: an International Journal of Earth Sciences, vol. 54 (Suppl.) pp. 203 - 205. Supplement: IMDIS 2013 - International Conference on Marine Data and Information Systems, 23-25 September, Lucca (Italy). OGS - Istituto Nazionale di Oceanografia e di Geofisica Sperimentale, 2013.
  37. Estimating Climate Change Effects on Species Distributions • AquaMaps actual (native) distribution • Today vs 2050 (~11 500 maps) • Discover classes of changes by means of cluster analysis. Coro, G., Magliozzi, C., Ellenbroek, A., Kaschner, K., & Pagano, P. (2015). Automatic classification of climate change effects on marine species distributions in 2050 using the AquaMaps model. Environmental and Ecological Statistics, 1-26.
  38. Cluster Analysis to Detect Common Species
      Cluster averages, normalized with respect to the maximum value for each column (some column names are truncated in the source):

      | Average of | Cluster 0 | Cluster 1 | Cluster 2 | Cluster 3 |
      |---|---|---|---|---|
      | average_number_of_species_occurrences_per_dataset | 100 | 14.46 | 2.43 | 0.16 |
      | number_of_datasets_containing_at_least_one_observation_for_the_ | 100 | 78.57 | 63.04 | 53.57 |
      | number_of_6_minute_cells_containing_at_least_one_observation_fo | 100 | 41.05 | 12.90 | 1.62 |
      | number_of_months_containing_at_least_one_occurrence_record_for_ | 100 | 88.90 | 66.16 | 27.12 |
      | no_months_with_a_least_10_occurrences | 100 | 79.65 | 31.16 | 1.36 |
      | nInd/nOcc | 100 | 11.14 | 5.64 | 0.41 |

      Degrees of commonness: • Common: frequent, widespread, high individual density • Moderate Commonness: moderately frequent, moderately widespread, medium individual density • Moderate-Low Commonness: poorly widespread, low-moderately frequent, low individual density • Low Commonness: quite localized, not frequent, usually low individual density
      • The term “common species” refers intuitively to a species that is abundant in a certain area, widespread and at low risk of extinction. • Consequently, “rare species” are less abundant and possibly threatened. • Automatically detecting common and rare species, and how their status changes through time, is an important step in understanding the consequences of environmental change for ecosystem functioning.
      Coro, G., Webb, T. J., Appeltans, W., Bailly, N., Cattrijsse, A., & Pagano, P. (2015). Classifying degrees of species commonness: North Sea fish as a case study. Ecological Modelling, 312, 272-280.
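
Normalization with respect to the maximum value of each column, as applied to the cluster averages, can be sketched as follows. The raw numbers are invented for illustration and are not the values behind the slide; the sketch also assumes every column has a positive maximum.

```python
def normalise_by_column_max(rows):
    """Rescale every column so that its largest value becomes 100."""
    columns = list(zip(*rows))
    maxima = [max(column) for column in columns]
    return [
        [round(100 * value / maximum, 2) for value, maximum in zip(row, maxima)]
        for row in rows
    ]


# Invented raw cluster averages: one row per cluster, one column per feature.
raw = [
    [200.0, 40.0],
    [50.0, 30.0],
    [10.0, 20.0],
]
normalised = normalise_by_column_max(raw)
print(normalised)  # [[100.0, 100.0], [25.0, 75.0], [5.0, 50.0]]
```

After this step the cluster with the largest average scores 100 in a column, which makes features measured in different units directly comparable.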
  39. Invasive species • Seven data mining techniques were used to estimate the spread of the puffer fish in the Mediterranean Sea • The approach is also applicable to other species • Produced impact maps on FAO Areas, EEZs and GSAs. Under publication
  40. Search in Large Taxonomic Names Repositories. A flexible workflow approach to taxon name matching. Accounts for: • Variations in the spelling and interpretation of taxonomic names • Combination of data from different sources • Harmonization and reconciliation of taxa names. Example: raw input string “Gadus morua Lineus 1758”, correct transcription “Gadus morhua (Linnaeus, 1758)”. Workflow: preprocessing and parsing; taxon name matchers 1 to n; post-processing. Reference sources: ASFIS, FISHBASE, WoRMS, OBIS. Berghe, E. V., Coro, G., Bailly, N., Fiorellato, F., Aldemita, C., Ellenbroek, A., & Pagano, P. (2015). Retrieving taxa names from large biodiversity data collections using a flexible matching workflow. Ecological Informatics, 28, 29-41.
  41. Vessels data analysis • Detection of the most exploited locations • Routes interpolation • Fishing activity estimation. Coro, G., Fortunati, L., & Pagano, P. (2013, June). Deriving fishing monthly effort and caught species from vessel trajectories. In OCEANS-Bergen, 2013 MTS/IEEE (pp. 1-5).
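
The parse-then-match idea can be illustrated with a single fuzzy matcher from the Python standard library. This sketch swaps in `difflib` string similarity as a stand-in for the workflow's matchers, which the slides do not describe in detail; the short reference list and the `cutoff` value are assumptions.

```python
import difflib
import re

# Tiny stand-in for a reference source such as a taxonomic names list.
REFERENCE_NAMES = [
    "Gadus morhua",
    "Gadus macrocephalus",
    "Clupea harengus",
    "Architeuthis dux",
]


def parse_name(raw):
    """Preprocessing: keep genus and species, drop authorship and year."""
    tokens = re.findall(r"[A-Za-z]+", raw)
    return " ".join(tokens[:2]).capitalize()


def match_taxon(raw, references, cutoff=0.8):
    """Return the closest reference name, or None if nothing is similar enough."""
    candidate = parse_name(raw)
    matches = difflib.get_close_matches(candidate, references, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(match_taxon("Gadus morua Lineus 1758", REFERENCE_NAMES))  # Gadus morhua
```

A workflow with several matchers would try different similarity measures in sequence and reconcile their answers in a post-processing step.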
  42. Forecasting Fishery Statistics Frequency and time series structure detection (with SSA) was used to forecast effort, catch and locations of purse seine fishing in the Indian Ocean. Coro, G., Large, S., Magliozzi, C., & Pagano, P. (2016). Analysing and forecasting fisheries time series: purse seine in Indian Ocean as a case study. ICES Journal of Marine Science: Journal du Conseil, fsw131.
  43. Stock assessment • Length-Weight Relations: estimates length-weight relation parameters for marine species, using Bayesian methods. Developed by R. Froese, T. Thorson and R. B. Reyes • SGVM interpolation: interpolation of vessel trajectories. Developed by the Study Group on VMS, involving ICES • FAO MSY: stock assessment for FAO catch data. Developed by the Resource Use and Conservation Division of the FAO Fisheries and Aquaculture Department (ref. Y. Ye - FAO) • ICCAT VPA: stock assessment method for International Commission for the Conservation of Atlantic Tunas (ICCAT) data. Developed by Ifremer and IRD (ref. S. Bonhommeau, J. Bard) • CMSY: estimates Maximum Sustainable Yield from catch statistics. Prime choice for ICES as main stock assessment tool. Developed by R. Froese, G. Coro, N. Demirel, K. Kleisner and H. Winker [Figure: Atlantic herring] BlueBRIDGE reduced time-to-market: for state-of-the-art models that estimate Maximum Sustainable Yield, computational time was reduced by 95% on average. Froese, R., Demirel, N., Coro, G., Kleisner, K. M., & Winker, H. (2016). Estimating fisheries reference points from catch and resilience. Fish and Fisheries.
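
The quantity that catch-based methods estimate can be illustrated with the Schaefer surplus production model, where biomass follows B[t+1] = B[t] + r·B[t]·(1 − B[t]/k) − C[t] and the Maximum Sustainable Yield is r·k/4. This is only a sketch of that textbook equation: it does not reproduce CMSY or any other model listed above, and the values of `r`, `k` and the catches are invented.

```python
def msy(r, k):
    """Maximum Sustainable Yield of the Schaefer model."""
    return r * k / 4


def project_biomass(catches, r, k, start_fraction=1.0):
    """Project biomass year by year under the Schaefer model.

    r: intrinsic growth rate, k: carrying capacity,
    start_fraction: initial biomass as a fraction of k.
    """
    biomass = [start_fraction * k]
    for catch in catches:
        current = biomass[-1]
        surplus = r * current * (1 - current / k)
        biomass.append(max(current + surplus - catch, 0.0))  # no negative stock
    return biomass


r, k = 0.5, 1000.0  # invented parameters
print(msy(r, k))  # 125.0

# Catching exactly MSY from a stock at k/2 leaves the biomass unchanged.
print(project_biomass([125.0, 125.0, 125.0], r, k, start_fraction=0.5))
# [500.0, 500.0, 500.0, 500.0]

# Catching more than MSY every year drives the stock down.
print(project_biomass([200.0, 200.0, 200.0], r, k, start_fraction=0.5))
# [500.0, 425.0, 347.1875, 260.51...]
```

Estimating `r` and `k` from a catch time series is the hard part, which is where the computational capacity of the platform matters.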