AH-XLDBEurope-position-09 jun2011


Text (personal views position statement) to accompany presentation on what research infrastructures really need for data, XLDB-Europe, 8-10th June 2011, Edinburgh


What does research infrastructure really need for data? A personal view based on LifeWatch and ENVRI

Alex Hardisty, Cardiff University
XLDB-Europe, Edinburgh, 8-10th June 2011

LifeWatch: an ESFRI Research Infrastructure; an e-Infrastructure for Biodiversity and Ecosystem Science

What is LifeWatch?

Biodiversity science is the study of the diversity of life on our planet – plants, animals, microorganisms and viruses – and the environments (ecosystems) they live in. LifeWatch (www.lifewatch.eu) will be an open access infrastructure, accessed through a single portal (portal.lifewatch.eu) for users from the scientific community, as well as policy makers and representatives of the private sector. It will allow scientists to explore, describe and understand patterns in biodiversity, and the processes that maintain biodiversity, in space and time at the gene, species, ecosystem and landscape levels; and to understand what causes and affects species diversity.

The innovative design of LifeWatch offers integrated access to large-scale data resources, advanced algorithms and computational capability through a service-oriented architecture to support creation of new knowledge. Key elements of the infrastructure will include: distributed observatories/sensors, interoperable datasets, processing and analytical tools, and both computational capability and capacity. Data mining, data analysis and modelling allow users to study patterns and mechanisms across different levels of biodiversity. The LifeWatch infrastructure provides scientific research teams with new collaborative environments by creating ‘e-Laboratories’ or composing ‘e-Services’. They may share their data and analytical and modelling algorithms with others, while controlling access. LifeWatch enables “distributed large scale” and collaborative research on complex and multidisciplinary problems.

In planning for the past 3 years, LifeWatch is presently transitioning to its construction phase.
Early Virtual Labs are likely to support scientific studies of biodiversity in marine wetlands and the fragility of ecosystems towards alien and invasive species. The Biodiversity Virtual e-Laboratory (BioVeL) project (www.biovel.eu) contributes to the construction by causing islands of compatible infrastructure to be created / emerge at key centres across Europe.

The challenges of scale and heterogeneity

LifeWatch is supported by many good data providers from within the scientific communities (networks of excellence) for terrestrial ecology, marine ecology and the natural history collections with all their biological specimens. There are currently about 1800 terrestrial monitoring sites and 200 marine research sites across Europe. Hundreds of millions of specimens in natural history collections all over Europe are gradually being indexed and digitised.

Biodiversity data is extremely diverse and heterogeneous. Biodiversity science spans many more familiar disciplines: biology, botany, zoology, ecology, genetics, soil science, biogeography, climate science, chemistry - to name but a few. Each of these established scientific communities already has its own way of doing things, its own data resources and its own tools. Not only that, but they have their own different vocabularies and conceptual underpinnings. Interoperability is a problem demanding a determined ontological and thesaurus solution like that used in the medical domain: the Unified Medical Language System (UMLS) (www.nlm.nih.gov/research/umls).

The interconnections between different biodiversity ideas/concepts, data sources, and the outputs from data processing, manipulation and modelling are intricate. As well as the traditional sources mentioned above, genomic data (for example, sequence data, DNA barcodes and phylogenies) are becoming increasingly important sources. Biodiversity science also demands environmental data (climate, soil, ocean temperature, etc.), as well as economic and census data for particular types of studies.

Apart from the well known and often large sources - GBIF, EBI, environmental data, census data - there are numerous small datasets in the hands of individual researchers. If computerised at all, these small datasets are often held in spreadsheets with no identifiable common structure. There are probably thousands of them, and multiple tools for processing them too. The biodiversity science community is highly fragmented and all these kinds of small, personal, group and departmental datasets need to get published and become discoverable and usable.

LifeWatch aims to support upwards of 25,000 users, primarily from the academic and research community, and the policymaking community, but also supporting the student education sector and the general public (citizen science).

The LifeWatch strategy of “Thinking globally, acting locally” addresses these challenges of heterogeneity and scale.
“Thinking globally, acting locally” devises and promotes the pan-European top-down strategies that foster collaboration and interoperability, and at the local level assists and encourages ‘islands’ of compliant infrastructure to emerge and fuse.

ENVRI: Common Operations of the ESFRI Environmental Research Infrastructures

What is ENVRI?

ENVRI is a soon-to-be-funded EC FP7 project that brings together many of the main ESFRI research infrastructures from the environmental sciences domain. The ENVRI project will contribute to the construction of these research infrastructures by sharing experiences and technologies and by solving crucial common technology issues and challenges together. Through cooperation in this project the ESFRI ENV infrastructures, together with ICT partners, are seeking to increase the interoperability of their data and facilities to increase the use and effectiveness of their infrastructures. The central goal of the ENVRI project is to implement harmonised solutions and draw up guidelines for the common needs of the environmental ESFRI projects, with a special focus on issues such as architectures, metadata frameworks, data discovery in scattered repositories, visualization and data curation.

ENVRI recognises scientific data services as part of a horizontal set of foundational services that include communications, distributed computing, and storage. It recognises that data providers, as well as data users, are users of data services and that there are common requirements irrespective of domain-specific communities. Community-specific services sit on top of data services and interact with them.

The key to improved interoperability is finding common solutions to common problems that can be adopted by each research infrastructure as it progresses through its construction phase. Fundamental common solutions include:
a) A Common Reference Model providing multiple compatible ‘views’ of the research infrastructure for different purposes. An ENVRI Common Reference Model is likely to be based on the ISO/IEC 10746 series of Standards for Open Distributed Processing, presenting 5 viewpoints: i) Science business / enterprise view; ii) Information view; iii) Computational / services view; iv) Engineering view and v) Technology view.

b) “Standards, Standards, Standards” are required for, at least:
  • Data capture from distributed sensors
  • Metadata definition
  • Management of high volume data
  • Execution of workflows
  • Visualization of data
  • Provenance and annotation
  • Interoperability between assets

c) Based on a generic metadata model (the Information viewpoint of the Common Reference Model), tools to allow data discovery and access in a federation of distributed digital repositories and interoperating infrastructures;

d) RDF and OWL frameworks to describe relations between (virtualized) e-Infrastructure components, and to link semantic descriptions of data with the semantic descriptions of the infrastructure, allowing the creation of a data-centric network.

Riding the Wave: How Europe can gain from the rising tide of scientific data

The recently published report of the High Level Expert Group on Scientific Data – “Riding the Wave: How Europe can gain from the rising tide of scientific data” – is an important contribution towards addressing the question of what research infrastructures really need for data. Neelie Kroes, the Vice-President of the European Commission responsible for the Digital Agenda has asked: “every citizen and every organisation involved in scientific research to take note of this report and to use it as a reference point when discussing the priorities of EU research investments.”

The report may be found here: http://cordis.europa.eu/fp7/ict/e-infrastructure/docs/hlg-sdi-report.pdf
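
As a rough illustration of item (d) in the list of common solutions above, the sketch below shows how descriptions of data and descriptions of infrastructure components might be linked so that one query can cross both. It is not an ENVRI or LifeWatch design: plain Python tuples stand in for RDF triples, and every prefix, resource name and predicate (`infra:hostedBy`, `data:topic`, and so on) is a hypothetical placeholder invented for the example.

```python
# Illustrative sketch only: plain tuples stand in for RDF triples.
# Every prefix, resource name and predicate below is hypothetical.
from typing import Iterator, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

TRIPLES: Set[Triple] = {
    # Semantic description of the infrastructure
    ("infra:RepositoryA", "rdf:type", "infra:Repository"),
    ("infra:ComputeB", "rdf:type", "infra:ComputeService"),
    # Semantic description of the data
    ("data:WetlandSurvey", "rdf:type", "data:Dataset"),
    ("data:WetlandSurvey", "data:topic", "topic:MarineWetlands"),
    ("data:BarcodeSet", "rdf:type", "data:Dataset"),
    ("data:BarcodeSet", "data:topic", "topic:DNABarcodes"),
    # Links that join the two descriptions
    ("data:WetlandSurvey", "infra:hostedBy", "infra:RepositoryA"),
}


def match(
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield triples that fit the pattern; None acts as a wildcard."""
    for s, p, o in sorted(TRIPLES):
        if subject is not None and s != subject:
            continue
        if predicate is not None and p != predicate:
            continue
        if obj is not None and o != obj:
            continue
        yield (s, p, o)


def hosts_for_topic(topic: str) -> List[str]:
    """Find infrastructure components hosting datasets on a given topic."""
    hosts: List[str] = []
    for dataset, _, _ in match(predicate="data:topic", obj=topic):
        for _, _, host in match(subject=dataset, predicate="infra:hostedBy"):
            hosts.append(host)
    return hosts


if __name__ == "__main__":
    print(hosts_for_topic("topic:MarineWetlands"))
```

The point of the sketch is only that a dataset with no `infra:hostedBy` link (here `data:BarcodeSet`) is described but not reachable through the infrastructure; a real deployment would presumably use an RDF store and an OWL ontology rather than an in-memory set.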