Digital Libraries,
     Digital Repositories,
    Data Infrastructures:
  Foundations of the New Science



            Yannis Ioannidis
MaDgIK Lab - Dept. of Informatics & Telecommunications
             University of Athens
Projects at Work
Outline
 Science Paradigms
 New Scholarly Communication & Open Access
 Digital Libraries & Repositories
  – DRIVER → OpenAIRE
 Computational & Data Challenges
 eInfrastructures
  – D4Science (I & II)
  – GRDI2020
 Conclusions
Science Paradigms

    1st - Thousand years ago:
     science was empirical
      describing natural phenomena
       w/ some models, generalizations


    2nd - Last few hundred years:
     theoretical branch
      using models, generalizations

      $\left(\frac{\dot{a}}{a}\right)^2 = \frac{4\pi G\rho}{3} - K\frac{c^2}{a^2}$




Really Early Times

 One scientist
 One location
 One discipline
 One phenomenon

 One pencil (… carver …)
 One paper (… stone …)
 Street announcements, e.g., Eureka! (Εύρηκα!)
Science Paradigms

    3rd - Last few decades:
     a computational branch
      simulating complex phenomena




Recent Times

 One small group of scientists
 One location
 One discipline
 One phenomenon

 One file system
 One local disk with custom files
 Publications at refereed forums
Science Paradigms

    4th - Today:
     data exploration (eScience)
      unify theory, experiment, and simulation




Current Times
 Many/large teams of scientists
 Many locations
 Many disciplines
 Many phenomena

 Many data management systems
 Many data forms
 Web uploads for publications, data, processes, …
Current Times
 Web uploads for publications, data, processes, …

 New order in scholarly communication
 Open access
 Creator, author, publisher, curator, preserver
 roles mixed up
 Digital libraries & repositories at centre stage
eInfrastructure Layers




            Communities
                   Users
            Functionality
        Data / Info / Pubs
              Processing
                Network
Scholarly Communication
Imperatives
1. Comprehensive, global access to any type of
    scientific information
2. Minimum time and resource effort to access
    and use this information
3. Easy search/navigation, handling, manipulation,
    and re-dissemination of information
4. Maximum visibility to and communication with
    the research community, research impact
5. Long-term access and preservation of research
    results
Open Access
“Our mission of disseminating knowledge is only half
complete if the information is not made widely and
readily available to society. New possibilities of
knowledge dissemination not only through the classical
form but also and increasingly through the open access
paradigm via the Internet have to be supported.”


  Berlin Declaration on Open Access to Knowledge
              in the Sciences and Humanities, 2003
Repository Landscape:
      Past-Present-Future

National, Regional, and Thematic DRs

Trans-National DRs (DRIVER)

Pan-European & Inter-Thematic DRs
(OpenAIRE)

Universal DRs
DRIVER High-Level Objectives

 Develop an environment for integrating existing
 national, regional, or thematic repositories
 Create a production-quality European DR
 infrastructure
 Prepare the future expansion and upgrade of the DR
 infrastructure across Europe
 Identify and promote the use of a relevant set of
 standards
 Raise awareness among user communities
D-NET eInfrastructure Software

 Service-Oriented Architecture
    Web Services, dynamic service registration, ...
 Distributed environment
    Services executed on a network of machines
 D-NET components (Lego approach)
  Enabling services: infrastructure middleware
  Data Management services: aggregation systems
  End-user Functionality services: search,
   community support, portals
DRIVER production infrastructure
           D-Net’s release v1.1

 [Figure: layered architecture diagram. Labels: End users, Administrators, Light User Interfaces, Advanced User Interfaces, Functionality Layer (PO), Data Layer (PO), Enabling Layer (RO), EU Open Access Repositories]
DRIVER hard/soft-ware
Resources
DRIVER EU Repository Map
Repository Landscape




                          DRIVER activity
           254 repositories – 31 countries
                          220+ harvested
                         1.2M documents

 European repositories: approx. 500
  World repositories: approx. 1100
Story – Tales from
        Repository managers

 Initially I just used the Validation tool to see if
our repository is more or less on track and was
    reassured when the results looked good,
      which gave me confidence to register.



                                               - Louw Venter,
      Boloka Research Repository of the North-West University
                                                 South Africa
COAR
 Confederation of Open Access Repositories
 Permanent organisational backbone for
 European (and world) repository infrastructure
  – Geographic and thematic extension
  – Diffusion of DRIVER technology
  – Connect established communities of practice
  – Promote Open Access
  – Fill repositories with Open Access
    publications
Mature Federations
Extended affiliations

New partners    National aggregations
D-Net’s current uptake

 DRIVER European Information Space
  –   www.driver-community.eu
 OpenAIRE EC pilot
  –   www.openaire.eu
 European Film Gateway and other EC projects
  –   www.europeanfilmgateway.eu
 Experimental deployment of new infrastructure
 instances
  –   China, India, Portugal, Belgium, Spain, Slovenia
OpenAIRE High-Level Objectives

 Implement European policy on Open Access
 “Every publication resulting from European funding
 under FP7 or from the ERC should be stored in a
 repository and be openly available”
 Promote above policy to researchers
 Pilot project for full-scale implementation in the
 future
OpenAIRE - factsheet
Open Access Infrastructure for Research in Europe
  Programme: FP7 – Research Infrastructures
  Starting date: December 1, 2009
  Duration: 36 months
  Budget: 4.1 Million
  38 partners covering all European member states
  Reachable at www.openaire.eu
Partners
 University of Athens (coordinator)
 University of Goettingen Library (scientific coordinator)
 CNR-ISTI (technical coordinator)
 University of Bielefeld
 Spanish National Research Council (CSIC)
 CERN
 SURF
 ICM – University of Warsaw
 University of Minho
 University of Gent Library
 eIFL
 Technical University Denmark

 Scientific Communities
  Health (Life Sciences)
   – EMBL-EBI
  Environment
   – World Data Center for Climate
   – Consultative Group on International Agricultural Research (CGIAR)
  Information & Communication Science
   – Cognitive Interaction Technology (CITEC)
  Socio-economic Sciences and Humanities
   – Data Archiving and Networked Services (DANS)

 Liaison Offices
Liaison Offices
 Region 1 North (DTU)
  – Denmark (Danish Technical University)
  – Finland (University of Helsinki)
  – Sweden (National Library of Sweden)
 Region 2 South (UMINHO)
  – Cyprus (University of Cyprus)
  – Greece (National Documentation Center)
  – Italy (CASPAR)
  – Malta (Malta Council for Science & Technology)
  – Portugal (University of Minho)
  – Spain (Spanish Foundation for Science & Technology)
 Region 3 East (eIFL)
  – Czech Republic (Technical University of Ostrava)
  – Bulgaria (Bulgarian Academy of Sciences)
  – Hungary (HUNOR)
  – Estonia (University of Tartu)
  – Lithuania (Kaunas Technical University)
  – Latvia (University of Latvia)
  – Romania (Kosson)
  – Poland (ICM – University of Warsaw)
  – Slovakia (University Library of Bratislava)
  – Slovenia (University of Ljubljana)
 Region 4 West (UGENT)
  – Austria (University of Wien)
  – Belgium (University of Gent)
  – France (Couperin)
  – Germany (University of Konstanz)
  – Ireland (Trinity College)
  – Netherlands (Utrecht University)
  – UK (SHERPA)
European Helpdesk
 National Open Access Liaison Offices (27 countries)
 Provide OA “toolkits” for
 –   Researchers
 –   Institutions
 Set up 24/7 portal for deposit and search of OA publications
 Liaise with
 –   Other European OA initiatives
 –   Publishers
 –   CRIS systems
Supporting repository
eInfrastructure
 OpenAIRE portal built on D-NET
 Access to scientific publications
  – Search, browse
  – Visualization tools
 Deposition of articles
  – Set up repository for “orphan” (better, “homeless”)
     researchers (CERN’s INVENIO)
  – Harvest OA publications from existing repositories
 Provide monitoring tools for
  – Document/depositing statistics
  – Usage statistics from repository infrastructure
 Interoperation with other infrastructures
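
The "harvest OA publications from existing repositories" step can be sketched as follows. This is a minimal illustration, assuming the repositories expose OAI-PMH with Dublin Core metadata; the slides do not name the protocol, and the endpoint URL, the sample response, and the function names are all hypothetical. No network call is made: the sketch only builds a request URL and parses one page of a response.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespaces used by OAI-PMH 2.0 responses and Dublin Core elements.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL for one page of a harvest."""
    if resumption_token:
        # A resumption token replaces the other arguments on follow-up pages.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)

def parse_list_records(xml_text):
    """Return ([(identifier, title), ...], next_token_or_None) for one page."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        ident = rec.findtext("oai:header/oai:identifier", default="", namespaces=NS)
        title = rec.findtext(".//dc:title", default="", namespaces=NS)
        records.append((ident, title))
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS)
    return records, (token or None)

# Hypothetical single-record response, used here instead of a live repository.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A sample open-access article</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

url = list_records_url("https://repository.example.org/oai")
records, next_token = parse_list_records(SAMPLE)
```

A real harvester would fetch `url`, parse the page, and repeat with `next_token` until no token is returned.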
OpenAIRE in a nutshell
                     D-NET platform
DRIVER-2-OpenAIRE Take Away

 Changing the culture in research publications
 Open accessibility to research results
 Metrics of research output vs. funding
 Technology + info + people infrastructures
Current Times
 Many/large teams of scientists
 Many locations
 Many disciplines
 Many phenomena

 Many data management systems
 Many data forms
 Web uploads for publications, data, processes, …
Data in 4th Science Paradigm
      Captured by instruments or generated by simulators

      Processed by software

      Stored in computer as Information/Knowledge

      Analyzed while in scientist’s database / files
      using data management and statistics




Overall Data Flow
 Data acquisition, reduction, analysis, visualization, storage

 [Figure: data flow diagram. Labels: Data Acquisition System, raw data, Metadata, Supercomputers, Local users, High Speed Network, Remote storage, Remote users, Remote users w/ local computing and storage]
PAN-STARRS
 PS1
 –  detect ‘killer asteroids’,
    starting in November 2008
  – Hawaii + JHU + Harvard +
    Edinburgh + Max Planck Society
 Data Volume
  – >1 Petabyte/year raw data
  – Over 5B celestial objects
    plus 250B detections in DB
  – 100TB database
  – PS4: 4 identical telescopes in 2012, generating 4PB/yr
Cosmological Simulations
Cosmological simulations have 10^9 particles and
  produce over 30TB of data (Millennium)
   Build up dark matter halos
   Track merging history of halos
   Use it to assign star formation history
   Combination with spectral synthesis
   Realistic distribution of galaxy types


   Hard to analyze the data afterwards → need DB
   Optimize comparison to real data
Immersive Turbulence
 Unique turbulence database
  – Consecutive snapshots of a
    1,024^3 simulation of turbulence:
    now 30 Terabytes
  – Soon 6K^3 and 300 Terabytes
  – Hilbert-curve spatial index
    and massive mining
  – Treat it as an experiment, observe
    the database!
  – Throw test particles in from your laptop,
    immerse yourself into the simulation,
    like in the movie Twister
 New paradigm for analyzing
 HPC simulations!
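
The idea behind a Hilbert-curve spatial index is to map grid cells to positions along a space-filling curve so that cells that are neighbours along the curve are also neighbours in space. The sketch below uses the standard 2D construction for illustration only; the turbulence database is three-dimensional, and nothing here reflects its actual implementation. The function name is hypothetical.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along a 2D Hilbert curve on an n-by-n grid.

    n must be a power of two; x and y are integers in [0, n).
    """
    d = 0
    s = n // 2
    while s > 0:
        # Which quadrant of the current sub-square does the cell fall in?
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip so the sub-curve in that quadrant has standard orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

# Order all cells of an 8x8 grid along the curve, e.g. to decide storage order.
n = 8
cells = sorted(((x, y) for x in range(n) for y in range(n)),
               key=lambda c: hilbert_index(n, *c))
```

Storing cells in this order keeps most spatially close cells close together on disk, which is what makes range queries over a region cheap.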
LHC and other HEP data
 Very complex data model
 Will generate 1GB/s, 10 PB/y
 Data: raw → calibrated → skimmed → high-level objects → physics analyses → results
 Duplicated for in-silico experiments to interpret data
 Dependence on grey literature: calibration constants, algorithms ... → oral tradition!

 [Figure: height comparison. CD stack with 1 year LHC data (~20 km), Balloon (30 km), Concorde (15 km), Mt. Blanc (4.8 km)]
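
The two LHC figures, 1GB/s and 10 PB/y, can be related with a short calculation. This is a back-of-the-envelope sketch, assuming decimal units (1 GB = 10^9 bytes, 1 PB = 10^15 bytes) and assuming both figures describe the same data stream; neither assumption is stated on the slide.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600   # 31,536,000 s
GB = 10**9                           # decimal gigabyte (assumption)
PB = 10**15                          # decimal petabyte (assumption)

rate_bytes_per_s = 1 * GB            # "1GB/s" from the slide
yearly_volume = 10 * PB              # "10 PB/y" from the slide

# Volume that would accumulate if data flowed at 1 GB/s all year round.
continuous_pb = rate_bytes_per_s * SECONDS_PER_YEAR / PB

# Seconds of data taking needed to accumulate 10 PB at 1 GB/s,
# and the fraction of a calendar year that represents.
live_seconds = yearly_volume / rate_bytes_per_s
duty_cycle = live_seconds / SECONDS_PER_YEAR

print(f"continuous: {continuous_pb:.1f} PB/y")
print(f"live time: {live_seconds:.0e} s ({duty_cycle:.0%} of a year)")
```

Under these assumptions, continuous streaming would give about 31.5 PB per year, so 10 PB/y corresponds to roughly 10^7 seconds of data taking, about a third of the year.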
Other Reference Applications
 SDSS: 10TB total, 3TB in DB, soon 10TB, 6 years old
 SkyQuery: fast spatial joins on largest astronomy catalogs / replicate
 multi-TB datasets 20x for performance (1Bx1B in 3 mins)
 OncoSpace: 350TB of radiation oncology images today, 1PB in two years,
 to be analyzed on the fly
 BaBar: Grows 1TB/day
              2/3 simulation Information
              1/3 observational Information
 VLBA (NRAO): generates 1GB/s today
 NCBI: “only ½ TB” but 2X each year
      very rich dataset
 Pixar: 100 TB/Movie
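
The NCBI entry ("only ½ TB" but 2X each year) shows why growth rate matters more than current size. The sketch below estimates how long constant yearly doubling would take to reach the other volumes on this slide. It is illustrative arithmetic only: it assumes the doubling continues unchanged, and the function name is hypothetical.

```python
import math

def years_to_reach(start_tb, target_tb, growth_per_year=2.0):
    """Years of constant-factor growth needed to go from start_tb to target_tb."""
    return math.log(target_tb / start_tb) / math.log(growth_per_year)

ncbi_start_tb = 0.5                                    # "only ½ TB"
years_to_sdss = years_to_reach(ncbi_start_tb, 10)      # SDSS total: 10TB
years_to_1pb = years_to_reach(ncbi_start_tb, 1000)     # 1PB, taking 1PB = 1000TB

print(f"to 10 TB: {years_to_sdss:.1f} years")
print(f"to 1 PB:  {years_to_1pb:.1f} years")
```

With yearly doubling, half a terabyte passes 10TB in a little over four years and approaches a petabyte in about eleven.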
D4Science:
Environmental Monitoring
 European Space Agency

 Global environmental issues: marine environment,
 forest ecosystem, air quality

 Sensor data analysis, integration and correlation of
 data sources; reasoning, information/knowledge mgmt

 Large amount of information ( 1TB), added-value
 applications and services

 Seamless workflow definition &
    on-demand data processing
D4Science: Fishery Resources Mgmt


 Fishery@FAO and WorldFish Center

 Researchers spread worldwide, from many
 disciplines (biologists, climatologists, GIS
 experts, socio-economists, fishery managers,
 etc.)

 Continuous assessment for sustainable
 development & use of the ecosystem of world’s
 fisheries and aquaculture, e.g., species, aquatic
 resources, hydrological changes

 Extreme data diversity
Conclusions
 Digital Libraries & Repositories: The new way
 for scholarly communication (final product)
 Data Infrastructures: The new libraries for all
 scientific documentation (intermediate and
 final products)
 Huge technological and organizational
 challenges
             LONG way to go
              FUN way to go
Thank you!
