GAIA Tech1 Data Repositories Meeting




Ingrid Bàrcena, HPC and Storage services manager
Ricard de la Vega, Portals and Repositories manager



GAIA Tech1 meeting
Madrid May 24 2011
Outline



1. What is CESCA?
2. CESCA services
      HPC and Storage
      Network
      University e-Administration
      Portals and Repositories
3. Digital Repositories
      Overview
      Two examples: DSpace and web archiving
      Long term preservation
4. CESCA and GAIA
      What is done
      What could be done
Centre de Supercomputació de Catalunya


  Public Consortium created in 1991
  ICTS since 2000

  Patrons:
   • Generalitat de Catalunya
   • Fundació Catalana per a la Recerca i la Innovació
   • Universitat de Barcelona
   • Universitat Autònoma de Barcelona
   • Universitat Politècnica de Catalunya
   • Universitat Pompeu Fabra
   • Universitat de Girona
   • Universitat Rovira i Virgili
   • Universitat de Lleida
   • Universitat Oberta de Catalunya
   • Universitat Ramon Llull
   • Consell Superior d’Investigacions Científiques
Our Services
HPC and Storage


   HPC Service
    • 19.48 Tflop/s peak performance
    • 50 research projects (203 users)
    • Main areas:
        – Materials Science (31%)
        – Life Science (32%)
        – Environmental Science (28%)
        – Astronomy and Astrophysics (5%)
    • + 3.5 HC used during 2010
    • + 50 scientific applications available

   Storage Service
    • Disk Library: NetApp FAS3170, 150 TB
        – 21 TB FC drives
        – 126 TB SATA drives
    • Tape Library: ADIC i2000, 156 TB
        – 6 LTO-4 drives
        – 300 slots
        – NetBackup 6.5

   Drug Design Service
    • 6 Pharma Labs
    • 10 Academic research groups
    • 2 Software Packages
Network services




+80 connected institutions          21 institutions in Catalonia   24 ISP and operators
                                    40 countries
2 core nodes at 10 Gbps
Flexible bandwidth
Services: IPv6, multimedia, Remote
Access Service, Voice over IP,
Eduroam, Security...


                                                                   Services: Multicast, IPv6, NTP
                                                                   Server, F root server (A and J,
                                                                   .com and .net coming soon)...
University e-Administration Projects


e-Register
 • URV: production 02-01-11
 • UdL: production 03-14-11
 • Sadiel: 32,692 €

e-Vote
 • Bid price: 405,000 €
 • Awarded (03-18-10): Scytl, 345,000 €
 • Production: 02-01-11

e-Archive
 • Transfer agreement: 12-7-10
 • Inst. ATLAS: 17,800 €
 • Integr. Doc. Mgt: Award: IECI 51,920 € (02-12-11)
 • Production: 06-01-11

SCD (e-Identity and e-Signature)
 • Available: EC-UR and EC-URV; ER-CESCA, -URV, -UPC, -UdL, -UPF
 • In development: ER-UdG, ER-UB, ER-UAB and ER-UVic

GPI
 • Improvements (02-03-11):
     – Inteum Sentinel and Technology Publisher
     – Office 2007; separate VMs per university; e-mail sending
 • Licence renewal. UB and UPC
 • Investment: 1,046.97 €

Diagram: F5 BIG-IP load balancers, …, data layer

Cluster: 15 BL460c G6 (2 x Intel Xeon E5530 QC); 480 GB; 4.3 TB;
         XenServer Citrix; 2 F5 BIG-IP 1600 load balancers; 110,487 €
Portals and Repositories




www.tdx.cat
 • Since 2001
 • 18 universities
 • 10,577 doctoral theses

www.recercat.cat
 • Since 2005
 • 22 institutions
 • 24,564 research papers, eprints…

www.raco.cat
 • Since 2006
 • 328 journals
 • 129,235 articles

www.mdx.cat
 • Since 2009
 • 10 universities
 • 1,814 learning objects

www.padicat.cat
 • Since 2006
 • 39,587 websites crawled
 • 118,039 versions crawled
 • 249M files in 7.5 TB

www.recercat.cat
 • Since 2010
 • 22 institutions
 • 24,564 research papers, eprints…

http://recyt.fecyt.es
 • Since 2006
 • Turnkey development
 • Evolutionary maintenance

http://recyt.fecyt.es (restricted IP address)
 • Pilot 2009-10
 • 420 websites crawled
 • 790 versions crawled

                                                                                                31-03-11
Digital Repositories



  A repository captures, stores, indexes, preserves and distributes
  digital content.

  Data + Metadata
   •   Dublin Core (DC)
   •   METS, MODS, MARC21…
   •   VO?
   •   Astronomical?

  Main issues
   • Access (search / browse)
   • Preservation
   • Interoperability
        – Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)
            (based on Dublin Core metadata)
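
  As an illustration of OAI-PMH interoperability, the sketch below builds a ListRecords request and reads Dublin Core titles from a response. It is a minimal example, not the harvester used by any repository mentioned here: the base URL `http://example.org/oai`, the record identifier and the sample response are hypothetical, and a real harvester would also need to handle resumption tokens and errors.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespaces defined by OAI-PMH 2.0 and by Dublin Core.
OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def parse_titles(xml_text):
    """Return (identifier, first dc:title) pairs from a ListRecords response."""
    root = ET.fromstring(xml_text)
    results = []
    for record in root.iter("{%s}record" % OAI_NS):
        ident = record.findtext("{%s}header/{%s}identifier" % (OAI_NS, OAI_NS))
        title = record.findtext(".//{%s}title" % DC_NS)
        results.append((ident, title))
    return results


# Hypothetical, hand-written response used instead of a network call.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1234</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A sample thesis</dc:title>
          <dc:creator>Doe, Jane</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(list_records_url("http://example.org/oai"))
    print(parse_titles(SAMPLE_RESPONSE))
```

  Because every OAI-PMH repository must expose `oai_dc`, the same two functions can in principle be pointed at any compliant endpoint, which is the interoperability argument made above.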
Repositories taxonomy




Towards a European e-Infrastructure for e-Science Digital Repositories. 7th e-Concentration Meeting, Brussels, 12-14th October, 2009
Repositories Hardware



   • High availability
   • Load balancing
   • Easy scalability
   • 24x7 monitoring

  Architecture diagram, top to bottom:
   • Balancers
   • Services
   • Data: Storage Area Network (disk and tape)
Repositories Software



  For general purpose
   • DSpace, EPrints, Fedora, Islandora…
   • Implemented in


  For journal management
   • Open Journal Systems (OJS)
   • Implemented in



  For web archive preservation
   • Heritrix, NutchWAX, WERA, Wayback, Webcurator…
   • Implemented in
Example on general purpose repository (DSpace)




  For digital objects, like PDF, images, videos, data…
  Indexes metadata and PDF content for searching
Example on web archive (PADICAT)



   PADICAT consists of collecting, processing and providing
   permanent access to the entire cultural, scientific and
   general output of Catalonia in digital format. It is the
   Catalan web sites archive.
                   PANDORA      UK ARCHIVE   IA           VEFSAFN      BNF          Kulturarw3   Netarchive
Scope              Australia    UK           World        Iceland      France       Sweden       Denmark
Begin              1996         2004         1996         2004         2002         1996         2005
Open access                                               since 2009
Search by URL
Search by keyword
Directory
No. of websites    26,630       8,308        -            -            -            -            > 1.1 million
No. of crawls      60,276       32,618       150 billion  -            -            -            4.5 billion
Space              4.63 TB      7.59 TB      -            -            180 TB       -            155 TB
Data as of         16-12-2010   12-01-2011   13-12-2011   13-01-2011   13-01-2011   26-11-2010   08-2010



                                Since 2006
                                - 39,587 websites crawled - Open Access
                                - 118,039 versions crawled - Search by URL and keyword
                                - 249M files in 7.5 TB     - Catalogue and thematic directory
                                                                             www.padicat.cat
Web archive software architecture

1.    Harvest
2.    Index and search
3.    Catalogue and browse




Diagram components:
   • WEB CURATOR TOOL and HERITRIX (harvest) → ARC files
   • CATALOG DATABASE (crawl metadata)
   • HADOOP + NUTCHWAX → index for keyword searching → WERA
   • ARCINDEXER → index for URL searching → WAYBACK
PADICAT’s indexes



  Until now (< 100,000 website versions crawled)
   • For search by URL (like Internet Archive)
       – Index with ArcIndexer (~100 GB) + visualize with Wayback √
   • For search by keyword
       – Index with Hadoop+NutchWAX + visualize with WERA √


  Now (120,000 website versions crawled)
   • Performance problems for keyword indexing
   • Two solutions under evaluation:
       – Index with a new version of NutchWAX + visualize with TNH (the new
         hotness, from IA)
       – Index with JB (James Brown, from IA) + visualize with TNH
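
  To make the idea of search by URL concrete, here is a minimal sketch of a URL index: for each URL it keeps the captures sorted by date and returns the capture closest to a requested date. This is not how ArcIndexer or Wayback are implemented; the class `UrlIndex`, its fields (ARC file name, byte offset) and the sample captures are assumptions made for illustration.

```python
from bisect import bisect_left, insort
from collections import defaultdict
from datetime import datetime

TIMESTAMP_FORMAT = "%Y%m%d%H%M%S"  # 14-digit capture timestamp


class UrlIndex:
    """Toy URL index: url -> captures sorted by capture time."""

    def __init__(self):
        self._captures = defaultdict(list)

    def add(self, url, timestamp, arc_file, offset):
        """Register one capture of `url` stored in `arc_file` at `offset`."""
        when = datetime.strptime(timestamp, TIMESTAMP_FORMAT)
        insort(self._captures[url], (when, arc_file, offset))

    def closest(self, url, timestamp):
        """Return the capture of `url` nearest in time, or None if unknown."""
        entries = self._captures.get(url)
        if not entries:
            return None
        when = datetime.strptime(timestamp, TIMESTAMP_FORMAT)
        pos = bisect_left(entries, (when,))
        # Only the neighbours around the insertion point can be the closest.
        candidates = entries[max(pos - 1, 0):pos + 1]
        return min(candidates, key=lambda entry: abs(entry[0] - when))


if __name__ == "__main__":
    index = UrlIndex()
    index.add("http://www.cesca.cat/", "20060901120000", "crawl-a.arc", 0)
    index.add("http://www.cesca.cat/", "20100315080000", "crawl-b.arc", 512)
    print(index.closest("http://www.cesca.cat/", "20091201000000"))
```

  A lookup touches a single sorted list, which is one reason a URL index stays cheap as the archive grows, whereas a keyword index has to cover the text of every archived page.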
Long term preservation



  The e-infrastructure must ensure long-term access to the
  data, without failure.

  To succeed, the following must be taken into account:
   •   Replication (more than one copy)
   •   Media refresh
   •   Format migration
   •   Data integrity (checksums)
   •   Contingency and recovery plan
   •   Preservation plan
   •   ...
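
  The data integrity item in the list above can be sketched as a checksum manifest: record a digest for every file once, then re-compute the digests periodically and report any file that is missing or has changed. This is an illustrative sketch only; the choice of SHA-256 and the function names are assumptions, not a description of the tools used at CESCA.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file under `root` (relative path) to its checksum."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify(root, manifest):
    """Return the sorted relative paths that are missing or altered."""
    root = Path(root)
    damaged = []
    for relative, expected in manifest.items():
        target = root / relative
        if not target.is_file() or sha256_of(target) != expected:
            damaged.append(relative)
    return sorted(damaged)
```

  In practice the manifest would be stored separately from the data it describes, so that a single storage failure cannot damage both.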
An example of long term preservation



The “preservation history” of TDX (doctoral theses)…

  2001 – 80 GB, 8,000 access hits
   • SW: ETDdb (+ MySQL, Glimpse…) from Virginia Tech
   • HW: HP V2500 with 16 processors, 4 GB memory, 227 GB disk
   • HW: StorageTek TimberWolf 9740 with 2.7 TB of 9840 tapes




                                       Born in a supercomputer!
An example of long term preservation



The “preservation history” of TDX (doctoral theses)…

  Hardware migrations
   • 2003 (cpu + disk)
       – HP rp5430 with 2 processors, 704 GB memory
       – HP EVA V.2 with 2.8 TB disk
   • 2006 (cpu + tape)
       – High availability HP cluster with 32 ProLiant DL360 nodes
       – ADIC Scalar i2000 (from 9840 tapes to LTO3 tapes)
   • 2009 (disk)
       – NetApp FAS3170 with 60 TB disk


  Software migrations
   • 2010 – DSpace (+ PostgreSQL, Java, Solr, …) from MIT & HP Labs
An example of long term preservation



The “preservation history” of TDX (doctoral theses)…

  Replication
   •   On disk: online version (1)
   •   One backup in the tape library (2)
   •   Another backup in a fireproof cabinet (3)
   •   Another backup at a remote centre 50 km away (4)
   •   A dark copy in the MetaArchive Cooperative
        – Private LOCKSS (Lots of Copies Keep Stuff Safe) Network
        – 10 more copies around the world (14)


  Data Integrity
   • Checksums on DSpace (online version)
   • Checksums on LOCKSS (dark copies)
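
  Replication and checksums work together: when several copies exist, comparing their checksums shows which copy has diverged and needs repair. The sketch below uses a simple majority vote over hypothetical copy names and digests. LOCKSS has its own audit protocol, which is not reproduced here; this only illustrates the general idea.

```python
from collections import Counter


def find_divergent_copies(checksums_by_copy):
    """Return (majority checksum, sorted names of copies that disagree).

    `checksums_by_copy` maps a copy name to the checksum computed on it.
    Raises ValueError when no checksum is held by a strict majority.
    """
    if not checksums_by_copy:
        raise ValueError("no copies to compare")
    counts = Counter(checksums_by_copy.values())
    consensus, votes = counts.most_common(1)[0]
    if votes <= len(checksums_by_copy) / 2:
        raise ValueError("no majority among copies")
    divergent = sorted(
        name for name, checksum in checksums_by_copy.items()
        if checksum != consensus
    )
    return consensus, divergent


if __name__ == "__main__":
    # Hypothetical digests for four copies of one file.
    copies = {
        "online": "abc",
        "tape": "abc",
        "cabinet": "abc",
        "remote": "xyz",
    }
    print(find_divergent_copies(copies))
```

  With only two copies a disagreement cannot be resolved by voting, which is one argument for keeping more than two.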
An example of long term preservation



The “preservation history” of TDX (doctoral theses)…

  2011 – 300 GB, more than 3.5 million access hits
   •   SW: DSpace (+ PostgreSQL, Java, Solr, …) from MIT & HP Labs
   •   HW: High availability HP cluster with 32 ProLiant DL360 nodes
   •   HW: NetApp FAS3170 with 60 TB disk
   •   HW: ADIC Scalar i2000
   •   SW: LOCKSS (+ Conspectus...)
   •   HW: HP DL380 (LOCKSS cache)


  xxxx – …


                                                            www.tdx.cat
GAIA at CESCA: what is done


   Timeline 2000-2011 (chart not reproduced), one row per activity:
    • Data processing (IDT/IDU)
    • Database (GDASS/COG)
    • Storage
    • Backup
GAIA and CESCA: what could be done



Already covered:
 • Data processing
 • Database
 • Large data transfer
 • Storage and backup

Data Repository:
 • Powerful searches and interoperability
 • Preservation: dark copy, …
Thank you!

Questions?



ibarcena@cesca.cat
rdelavega@cesca.cat

More Related Content

Similar to GAIA Tech1 Data Repositories Meeting

Design phase kick-off event and Ceremony
Design phase kick-off event and CeremonyDesign phase kick-off event and Ceremony
Design phase kick-off event and CeremonyArchiver
 
Biodiversity Information Networks: dataflows for interdisciplinary science
Biodiversity Information Networks: dataflows for interdisciplinary scienceBiodiversity Information Networks: dataflows for interdisciplinary science
Biodiversity Information Networks: dataflows for interdisciplinary scienceBruno Danis
 
Biodiversity Information Networks: Dataflows for interdisciplinary sciences
Biodiversity Information Networks: Dataflows for interdisciplinary sciencesBiodiversity Information Networks: Dataflows for interdisciplinary sciences
Biodiversity Information Networks: Dataflows for interdisciplinary sciencesGBIF_NPT
 
Datos enlazados BNE and MARiMbA
Datos enlazados BNE and MARiMbADatos enlazados BNE and MARiMbA
Datos enlazados BNE and MARiMbADaniel Vila Suero
 
Pathways for EOSC-hub and MaX collaboration
Pathways for EOSC-hub and MaX collaborationPathways for EOSC-hub and MaX collaboration
Pathways for EOSC-hub and MaX collaborationEOSC-hub project
 
Open Source Visualization of Scientific Data
Open Source Visualization of Scientific DataOpen Source Visualization of Scientific Data
Open Source Visualization of Scientific DataMarcus Hanwell
 
Tutorial on Hybrid Data Infrastructures: D4Science as a case study
Tutorial on Hybrid Data Infrastructures: D4Science as a case studyTutorial on Hybrid Data Infrastructures: D4Science as a case study
Tutorial on Hybrid Data Infrastructures: D4Science as a case studyBlue BRIDGE
 
A Service Perspective: Unlocking metadata to enhance discoverability and conn...
A Service Perspective: Unlocking metadata to enhance discoverability and conn...A Service Perspective: Unlocking metadata to enhance discoverability and conn...
A Service Perspective: Unlocking metadata to enhance discoverability and conn...EDINA, University of Edinburgh
 
Danis biosystematics2011
Danis biosystematics2011Danis biosystematics2011
Danis biosystematics2011Bruno Danis
 
Structural Biology in the Clouds: A Success Story of 10 years
Structural Biology in the Clouds: A Success Story of 10 yearsStructural Biology in the Clouds: A Success Story of 10 years
Structural Biology in the Clouds: A Success Story of 10 yearsAlexandreBonvin2
 
OpenStack Toronto Q3 MeetUp - September 28th 2017
OpenStack Toronto Q3 MeetUp - September 28th 2017OpenStack Toronto Q3 MeetUp - September 28th 2017
OpenStack Toronto Q3 MeetUp - September 28th 2017Stacy Véronneau
 
ARCLib project presentation from Pasig 2016
ARCLib project presentation from Pasig 2016ARCLib project presentation from Pasig 2016
ARCLib project presentation from Pasig 2016dp-blog-cz
 
Prototype Phase Kick-off Event and Ceremony
Prototype Phase Kick-off Event and CeremonyPrototype Phase Kick-off Event and Ceremony
Prototype Phase Kick-off Event and CeremonyArchiver
 

Similar to GAIA Tech1 Data Repositories Meeting (20)

Jorge gomes
Jorge gomesJorge gomes
Jorge gomes
 
Jorge gomes
Jorge gomesJorge gomes
Jorge gomes
 
Design phase kick-off event and Ceremony
Design phase kick-off event and CeremonyDesign phase kick-off event and Ceremony
Design phase kick-off event and Ceremony
 
Biodiversity Information Networks: dataflows for interdisciplinary science
Biodiversity Information Networks: dataflows for interdisciplinary scienceBiodiversity Information Networks: dataflows for interdisciplinary science
Biodiversity Information Networks: dataflows for interdisciplinary science
 
Biodiversity Information Networks: Dataflows for interdisciplinary sciences
Biodiversity Information Networks: Dataflows for interdisciplinary sciencesBiodiversity Information Networks: Dataflows for interdisciplinary sciences
Biodiversity Information Networks: Dataflows for interdisciplinary sciences
 
Datos enlazados BNE and MARiMbA
Datos enlazados BNE and MARiMbADatos enlazados BNE and MARiMbA
Datos enlazados BNE and MARiMbA
 
Pathways for EOSC-hub and MaX collaboration
Pathways for EOSC-hub and MaX collaborationPathways for EOSC-hub and MaX collaboration
Pathways for EOSC-hub and MaX collaboration
 
Open Source Visualization of Scientific Data
Open Source Visualization of Scientific DataOpen Source Visualization of Scientific Data
Open Source Visualization of Scientific Data
 
Tutorial on Hybrid Data Infrastructures: D4Science as a case study
Tutorial on Hybrid Data Infrastructures: D4Science as a case studyTutorial on Hybrid Data Infrastructures: D4Science as a case study
Tutorial on Hybrid Data Infrastructures: D4Science as a case study
 
The Ontario library research cloud
The Ontario library research cloudThe Ontario library research cloud
The Ontario library research cloud
 
A Service Perspective: Unlocking metadata to enhance discoverability and conn...
A Service Perspective: Unlocking metadata to enhance discoverability and conn...A Service Perspective: Unlocking metadata to enhance discoverability and conn...
A Service Perspective: Unlocking metadata to enhance discoverability and conn...
 
Danis biosystematics2011
Danis biosystematics2011Danis biosystematics2011
Danis biosystematics2011
 
Using a dumb identifier to do smart things
Using a dumb identifier to do smart thingsUsing a dumb identifier to do smart things
Using a dumb identifier to do smart things
 
Structural Biology in the Clouds: A Success Story of 10 years
Structural Biology in the Clouds: A Success Story of 10 yearsStructural Biology in the Clouds: A Success Story of 10 years
Structural Biology in the Clouds: A Success Story of 10 years
 
OpenStack Toronto Q3 MeetUp - September 28th 2017
OpenStack Toronto Q3 MeetUp - September 28th 2017OpenStack Toronto Q3 MeetUp - September 28th 2017
OpenStack Toronto Q3 MeetUp - September 28th 2017
 
ARCLib project presentation from Pasig 2016
ARCLib project presentation from Pasig 2016ARCLib project presentation from Pasig 2016
ARCLib project presentation from Pasig 2016
 
OpenPOWER System Marconi100
OpenPOWER System Marconi100OpenPOWER System Marconi100
OpenPOWER System Marconi100
 
Open Science Area Activities
Open Science Area ActivitiesOpen Science Area Activities
Open Science Area Activities
 
Prototype Phase Kick-off Event and Ceremony
Prototype Phase Kick-off Event and CeremonyPrototype Phase Kick-off Event and Ceremony
Prototype Phase Kick-off Event and Ceremony
 
ICOS Services and Products
ICOS Services and Products ICOS Services and Products
ICOS Services and Products
 

More from Ricard de la Vega

The Research Portal of Catalonia: Growing more (information) & more (services)
The Research Portal of Catalonia: Growing more (information) & more (services)The Research Portal of Catalonia: Growing more (information) & more (services)
The Research Portal of Catalonia: Growing more (information) & more (services)Ricard de la Vega
 
Servicios de datos para todo el ciclode investigación
Servicios de datos para todo el ciclode investigaciónServicios de datos para todo el ciclode investigación
Servicios de datos para todo el ciclode investigaciónRicard de la Vega
 
Padicat: O archivo da web da Catalunha
Padicat: O archivo da web da CatalunhaPadicat: O archivo da web da Catalunha
Padicat: O archivo da web da CatalunhaRicard de la Vega
 
La conservació digital d'obres cinematpgràfiques: un projecte del CSUC pel Ce...
La conservació digital d'obres cinematpgràfiques: un projecte del CSUC pel Ce...La conservació digital d'obres cinematpgràfiques: un projecte del CSUC pel Ce...
La conservació digital d'obres cinematpgràfiques: un projecte del CSUC pel Ce...Ricard de la Vega
 
Proyectos cooperativos de ciencia abierta en Catalunya
Proyectos cooperativos de ciencia abierta en CatalunyaProyectos cooperativos de ciencia abierta en Catalunya
Proyectos cooperativos de ciencia abierta en CatalunyaRicard de la Vega
 
Requisitos funcionales para la creación de repositorios consorciados de datos...
Requisitos funcionales para la creación de repositorios consorciados de datos...Requisitos funcionales para la creación de repositorios consorciados de datos...
Requisitos funcionales para la creación de repositorios consorciados de datos...Ricard de la Vega
 
Technical Challenges and Approaches to Build an Open Ecosystem of Heterogeneo...
Technical Challenges and Approaches to Build an Open Ecosystem of Heterogeneo...Technical Challenges and Approaches to Build an Open Ecosystem of Heterogeneo...
Technical Challenges and Approaches to Build an Open Ecosystem of Heterogeneo...Ricard de la Vega
 
Quatre tuits sobre metodologies àgils
Quatre tuits sobre metodologies àgilsQuatre tuits sobre metodologies àgils
Quatre tuits sobre metodologies àgilsRicard de la Vega
 
Preservaçao digital de tese e dissertaçoes
Preservaçao digital de tese e dissertaçoesPreservaçao digital de tese e dissertaçoes
Preservaçao digital de tese e dissertaçoesRicard de la Vega
 
Analysis of requirements and benchmarking of CRIS for the Universities of Cat...
Analysis of requirements and benchmarking of CRIS for the Universities of Cat...Analysis of requirements and benchmarking of CRIS for the Universities of Cat...
Analysis of requirements and benchmarking of CRIS for the Universities of Cat...Ricard de la Vega
 
Research Papers Recommender based on Digital Repositories Metadata
Research Papers Recommender based on Digital Repositories MetadataResearch Papers Recommender based on Digital Repositories Metadata
Research Papers Recommender based on Digital Repositories MetadataRicard de la Vega
 
Recomendador de artículos científicos basado en metadatos de repositorios dig...
Recomendador de artículos científicos basado en metadatos de repositorios dig...Recomendador de artículos científicos basado en metadatos de repositorios dig...
Recomendador de artículos científicos basado en metadatos de repositorios dig...Ricard de la Vega
 
Preservaçao digital distribuída de um repositório de teses de doutorado (TDX)
Preservaçao digital distribuída de um repositório de teses de doutorado (TDX)Preservaçao digital distribuída de um repositório de teses de doutorado (TDX)
Preservaçao digital distribuída de um repositório de teses de doutorado (TDX)Ricard de la Vega
 
De què parlem quan parlem de serveis al núvol?
De què parlem quan parlem de serveis al núvol?De què parlem quan parlem de serveis al núvol?
De què parlem quan parlem de serveis al núvol?Ricard de la Vega
 
El Portal de la Investigación de Catalunya, una suma de información de los CR...
El Portal de la Investigación de Catalunya, una suma de información de los CR...El Portal de la Investigación de Catalunya, una suma de información de los CR...
El Portal de la Investigación de Catalunya, una suma de información de los CR...Ricard de la Vega
 
The Catalan Research portal: collecting information from Catalan universities...
The Catalan Research portal: collecting information from Catalan universities...The Catalan Research portal: collecting information from Catalan universities...
The Catalan Research portal: collecting information from Catalan universities...Ricard de la Vega
 
Let's do data research work: the creation of a portal with research informati...
Let's do data research work: the creation of a portal with research informati...Let's do data research work: the creation of a portal with research informati...
Let's do data research work: the creation of a portal with research informati...Ricard de la Vega
 

More from Ricard de la Vega (20)

The Research Portal of Catalonia: Growing more (information) & more (services)
The Research Portal of Catalonia: Growing more (information) & more (services)The Research Portal of Catalonia: Growing more (information) & more (services)
The Research Portal of Catalonia: Growing more (information) & more (services)
 
Servicios de datos para todo el ciclode investigación
Servicios de datos para todo el ciclode investigaciónServicios de datos para todo el ciclode investigación
Servicios de datos para todo el ciclode investigación
 
Visualització de dades
Visualització de dadesVisualització de dades
Visualització de dades
 
Visualització de dades
Visualització de dadesVisualització de dades
Visualització de dades
 
Padicat: O archivo da web da Catalunha
Padicat: O archivo da web da CatalunhaPadicat: O archivo da web da Catalunha
Padicat: O archivo da web da Catalunha
 
La conservació digital d'obres cinematpgràfiques: un projecte del CSUC pel Ce...
La conservació digital d'obres cinematpgràfiques: un projecte del CSUC pel Ce...La conservació digital d'obres cinematpgràfiques: un projecte del CSUC pel Ce...
La conservació digital d'obres cinematpgràfiques: un projecte del CSUC pel Ce...
 
Proyectos cooperativos de ciencia abierta en Catalunya
Proyectos cooperativos de ciencia abierta en CatalunyaProyectos cooperativos de ciencia abierta en Catalunya
Proyectos cooperativos de ciencia abierta en Catalunya
 
Requisitos funcionales para la creación de repositorios consorciados de datos...
Requisitos funcionales para la creación de repositorios consorciados de datos...Requisitos funcionales para la creación de repositorios consorciados de datos...
Requisitos funcionales para la creación de repositorios consorciados de datos...
 
Technical Challenges and Approaches to Build an Open Ecosystem of Heterogeneo...
Technical Challenges and Approaches to Build an Open Ecosystem of Heterogeneo...Technical Challenges and Approaches to Build an Open Ecosystem of Heterogeneo...
Technical Challenges and Approaches to Build an Open Ecosystem of Heterogeneo...
 
Quatre tuits sobre metodologies àgils
Quatre tuits sobre metodologies àgilsQuatre tuits sobre metodologies àgils
Quatre tuits sobre metodologies àgils
 
Preservaçao digital de tese e dissertaçoes
Preservaçao digital de tese e dissertaçoesPreservaçao digital de tese e dissertaçoes
Preservaçao digital de tese e dissertaçoes
 
Informàtic
InformàticInformàtic
Informàtic
 
Analysis of requirements and benchmarking of CRIS for the Universities of Cat...
Analysis of requirements and benchmarking of CRIS for the Universities of Cat...Analysis of requirements and benchmarking of CRIS for the Universities of Cat...
Analysis of requirements and benchmarking of CRIS for the Universities of Cat...
 
Research Papers Recommender based on Digital Repositories Metadata
Research Papers Recommender based on Digital Repositories MetadataResearch Papers Recommender based on Digital Repositories Metadata
Research Papers Recommender based on Digital Repositories Metadata
 
Recomendador de artículos científicos basado en metadatos de repositorios dig...
Recomendador de artículos científicos basado en metadatos de repositorios dig...Recomendador de artículos científicos basado en metadatos de repositorios dig...
Recomendador de artículos científicos basado en metadatos de repositorios dig...
 
Preservaçao digital distribuída de um repositório de teses de doutorado (TDX)
Preservaçao digital distribuída de um repositório de teses de doutorado (TDX)Preservaçao digital distribuída de um repositório de teses de doutorado (TDX)
Preservaçao digital distribuída de um repositório de teses de doutorado (TDX)
 
De què parlem quan parlem de serveis al núvol?
De què parlem quan parlem de serveis al núvol?De què parlem quan parlem de serveis al núvol?
De què parlem quan parlem de serveis al núvol?
 
El Portal de la Investigación de Catalunya, una suma de información de los CR...
El Portal de la Investigación de Catalunya, una suma de información de los CR...El Portal de la Investigación de Catalunya, una suma de información de los CR...
El Portal de la Investigación de Catalunya, una suma de información de los CR...
 
The Catalan Research portal: collecting information from Catalan universities...
The Catalan Research portal: collecting information from Catalan universities...The Catalan Research portal: collecting information from Catalan universities...
The Catalan Research portal: collecting information from Catalan universities...
 
Let's do data research work: the creation of a portal with research informati...
Let's do data research work: the creation of a portal with research informati...Let's do data research work: the creation of a portal with research informati...
Let's do data research work: the creation of a portal with research informati...
 

GAIA Tech1 Data Repositories Meeting

  • 1. GAIA Tech1 Data Repositories Meeting Ingrid Bàrcena, HPC and Storage services manager Ricard de la Vega, Portals and Repositories manager GAIA Tech1 meeting Madrid May 24 2011
  • 2. Outline 1. What is CESCA? 2. CESCA services: HPC and Storage, Network, University e-Administration, Portals and Repositories 3. Digital Repositories: Overview, Two examples (DSpace and web archiving), Long term preservation 4. CESCA and GAIA: What is done, What could be done
  • 3. Centre de Supercomputació de Catalunya Public Consortium created in 1991; ICTS since 2000. Patrons: • Generalitat de Catalunya • Fundació Catalana per a la Recerca i la Innovació • Universitat de Barcelona • Universitat Autònoma de Barcelona • Universitat Politècnica de Catalunya • Universitat Pompeu Fabra • Universitat de Girona • Universitat Rovira i Virgili • Universitat de Lleida • Universitat Oberta de Catalunya • Universitat Ramon Llull • Consell Superior d’Investigacions Científiques
  • 5. HPC and Storage HPC Service: 19,48 Tflop/s peak performance; 50 research projects (203 users); Main areas: • Materials Science (31%) • Life Science (32%) • Environmental Science (28%) • Astronomy and Astrophysics (5%); + 3.5 HC used during 2010; + 50 scientific applications available. Storage Service: Disk Library NetApp FAS3170, 150 TB (21 TB FC drives, 126 TB SATA drives); Tape Library ADIC i2000, 156 TB, 6 LTO-4 drives, 300 slots, NetBackup 6.5. Drug Design Service: 6 Pharma Labs; 10 Academic research groups; 2 Software Packages
  • 6. Network services +80 connected institutions 21 institutions in Catalonia 24 ISP and operators 40 countries 2 core nodes at 10 Gbps Flexible bandwidth Services: IPv6, multimedia, Remote Access Service, Voice over IP, Eduroam, Security... Services: Multicast, IPv6, NTP Server, F root server (A and J, .com and .net coming soon)...
  • 7. University e-Administration Projects e-Register e-Vote e-Archive • Bid price: 405.000 € • Transfer agreement: 12-7-10 • URV: production 02-01-11 • Awarded (03-18-10): • Inst. ATLAS: 17.800 € Scytl, 345.000 € • Integr. Doc. Mgt: • UdL: production 03-14-11 Award: IECI 51.920 € (02-12-11) • Production: 02-01-11 • Production: 06-01-11 • Sadiel: 32.692 € F5 BIG-IP load balancers SCD (e-Identity and e-Signature) GPI • Available: Improvements (02-03-11) … • Inteum Sentinel and Technology Publisher EC-UR and EC-URV ER-CESCA, -URV, -UPC, -UdL, -UPF • Office 2007; VM separation per university; e-mail sending • In development: ER-UdG, Licence renewal. UB and UPC ER-UB, ER-UAB and ER-UVic Investment: 1.046,97 € Data layer Cluster: 15 BL460c G6 (2 x Intel Xeon E5530 QC); 480 GB; 4,3 TB; XenServer Citrix; 2 load balancer F5 BIG-IP 1600; 110.487 €
  • 8. Portals and Repositories www.tdx.cat: since 2001, 18 universities, 10,577 doctoral theses. www.recercat.cat: since 2005, 22 institutions, 24,564 research papers, eprints… www.raco.cat: since 2006, 328 journals, 129,235 articles. www.mdx.cat: since 2009, 10 universities, 1,814 learning objects. Since 2006 Pilot 2009-10 Since 2010 Since 2006 39,587 websites crawled 420 websites crawled 22 institutions Turnkey development 118,039 versions crawled Evolutionary maintenance 790 versions crawled 24,564 research 249M files in 7.5 TB papers, eprints… http://recyt.fecyt.es http://recyt.fecyt.es www.padicat.cat www.recercat.cat (restricted IP address) 31-03-11
  • 9. Outline 1. What is CESCA? 2. CESCA services: HPC and Storage, Network, University e-Administration, Portals and Repositories 3. Digital Repositories: Overview, Two examples (DSpace and web archiving), Long term preservation 4. CESCA and GAIA: What is done, What could be done
  • 10. Digital Repositories A repository captures, stores, indexes, preserves and distributes digital content. Data + Metadata • Dublin Core (DC) • METS, MODS, MARC21… • VO? • Astronomical? Main issues • Access (search / browse) • Preservation • Interoperability – Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) (based on Dublin Core metadata)
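A minimal sketch of what an OAI-PMH harvest with Dublin Core metadata can look like. The endpoint URL, the record identifier and the sample response are made up for illustration; only the `ListRecords` verb, the `oai_dc` metadata prefix and the XML namespaces come from the OAI-PMH and Dublin Core specifications. The response is parsed from an embedded string so the sketch runs without network access.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical endpoint, used only to show how a request URL is formed.
BASE_URL = "https://repository.example.org/oai/request"

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Made-up response fragment containing a single record.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repository.example.org:1234</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample doctoral thesis</dc:title>
          <dc:creator>Doe, Jane</dc:creator>
          <dc:date>2011</dc:date>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build a ListRecords request asking for Dublin Core metadata."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query


def parse_records(xml_text):
    """Extract identifier, titles, creators and dates from each record."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.findall(".//oai:record", NS):
        records.append({
            "identifier": record.findtext("oai:header/oai:identifier", namespaces=NS),
            "titles": [e.text for e in record.findall(".//dc:title", NS)],
            "creators": [e.text for e in record.findall(".//dc:creator", NS)],
            "dates": [e.text for e in record.findall(".//dc:date", NS)],
        })
    return records


if __name__ == "__main__":
    print(list_records_url(BASE_URL))
    for rec in parse_records(SAMPLE_RESPONSE):
        print(rec)
```

A real harvester would also follow the `resumptionToken` that OAI-PMH uses to page through large result sets; that step is left out here.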
  • 11. Repositories taxonomy Towards a European e-Infrastructure for e-Science Digital Repositories. 7th e-Concentration Meeting, Brussels, 12-14th October, 2009
  • 12. Repositories Hardware Requirements: High availability; Load balancing; Easy scalability; 24x7 monitoring. Layers: Balancers; Services …; Storage Area Network; Data … (Disk, Tape)
  • 13. Repositories Software For general purpose • DSpace, EPrints, Fedora, Islandora… • Implemented in For journal management • Open Journal Systems (OJS) • Implemented in For web archives preservation • Heritrix, NutchWAX, WERA, Wayback, Webcurator… • Implemented in
  • 14. Example on general purpose repository (DSpace) For digital objects, like PDF, images, videos, data… Index metadata and PDF for searching
  • 15. Example on web archive (PADICAT) PADICAT consists of collecting, processing and providing permanent access to the entire cultural, scientific and general output of Catalonia in digital format. It is the Catalan web sites archive.
      Archive      Scope      Begin  N. websites    N. crawls    Space    Date
      PANDORA      Australia  1996   26.630         60.276       4,63 TB  16-12-2010
      UK ARCHIVE   UK         2004   8.308          32.618       7,59 TB  12-01-2011
      IA           World      1996   -              150 billion  -        13-12-2011
      VEFSAFN      Iceland    2004   -              -            -        13-01-2011
      BNF          France     2002   -              -            180 TB   13-01-2011
      Kulturarw3   Sweden     1996   -              -            -        26-11-2010
      Netarchive   Denmark    2005   > 1,1 million  4,5 billion  155 TB   08-2010
      Also compared: Open access since 2009, Search by URL, Search by keyword, Directory
      PADICAT (www.padicat.cat), since 2006: 39,587 websites crawled; 118,039 versions crawled; 249M files in 7.5 TB; Open Access; Search by URL and keyword; Catalogue and thematic directory
  • 16. Web archive software architecture 1. Harvest: HERITRIX and WEB CURATOR TOOL, producing ARC files 2. Index and search: HADOOP + NUTCHWAX index for keyword searching (WERA); ARCINDEXER index for URL searching (WAYBACK) 3. Catalogue and browse: CATALOG DATABASE (Crawl Metadata)
  • 17. PADICAT’s indexes Until now (< 100.000 website versions crawled) • For search by URL (like Internet Archive) – Index with ArcIndexer (~100 GB) + visualize with Wayback √ • For search by keyword – Index with Hadoop+NutchWAX + visualize with WERA √ Now (120.000 website versions crawled) • Performance problems for keyword indexing • Two solutions under evaluation: – Index with a new version of NutchWAX + visualize with TNH (The New Hotness, from IA) – Index with JB (James Brown, from IA) + visualize with TNH
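To illustrate the difference between the two kinds of index, here is a toy in-memory sketch. It is not how ArcIndexer, NutchWAX or Wayback work internally; the class name, the timestamp format and the sample captures are all hypothetical. The URL index maps an address to its capture dates, while the keyword index is an inverted index from words to captures, which has to cover every word of every captured page.

```python
from collections import defaultdict


class ToyArchiveIndex:
    """Hypothetical in-memory stand-in for the two kinds of web archive index."""

    def __init__(self):
        self.by_url = defaultdict(list)     # URL -> sorted capture timestamps
        self.by_keyword = defaultdict(set)  # word -> {(url, timestamp), ...}

    def add_capture(self, url, timestamp, text):
        """Register one crawled version of a page in both indexes."""
        self.by_url[url].append(timestamp)
        self.by_url[url].sort()
        for word in set(text.lower().split()):
            self.by_keyword[word].add((url, timestamp))

    def captures_of(self, url):
        """Search by URL: list every archived version of one address."""
        return list(self.by_url.get(url, []))

    def search(self, keyword):
        """Search by keyword: list every capture whose text contains the word."""
        return sorted(self.by_keyword.get(keyword.lower(), set()))


if __name__ == "__main__":
    index = ToyArchiveIndex()
    index.add_capture("http://example.cat/", "20100601", "Catalan culture portal")
    index.add_capture("http://example.cat/", "20090101", "Catalan science portal")
    index.add_capture("http://other.cat/", "20100301", "Science news")
    print(index.captures_of("http://example.cat/"))
    print(index.search("science"))
```

The keyword index grows with the amount of text crawled rather than with the number of URLs, which is one plausible reason keyword indexing is the part that runs into performance problems first.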
  • 18. Long term preservation The e-infrastructure must ensure long-term data access, without failure. To succeed, the following must be taken into account: • Replication (more than one copy) • Media refresh • Format migration • Data integrity (checksums) • Contingency and recovery plan • Preservation plan • ...
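The data integrity item can be sketched as a checksum manifest that is built once and re-verified later, for example after a media refresh. This is a minimal sketch under assumptions: SHA-256 is chosen here for illustration (the slides do not name an algorithm), and the function names are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Checksum a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file under root (by relative path) to its checksum."""
    root = Path(root)
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify(root, manifest):
    """Return the relative paths that are missing or whose checksum changed."""
    root = Path(root)
    problems = []
    for rel, expected in manifest.items():
        target = root / rel
        if not target.is_file() or sha256_of(target) != expected:
            problems.append(rel)
    return problems


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "thesis.pdf").write_bytes(b"original content")
        manifest = build_manifest(tmp)
        print(verify(tmp, manifest))  # nothing has changed yet
        (Path(tmp) / "thesis.pdf").write_bytes(b"silently damaged")
        print(verify(tmp, manifest))  # the damaged file is reported
```

Storing the manifest separately from the data it describes matters: a manifest kept on the same failing medium protects nothing.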
  • 19. An example of long term preservation The “preservation history” of TDX (doctoral theses)… 2001 – 80 GB, 8.000 access hits • SW: ETDdb (+ MySQL, Glimpse…) from Virginia Tech • HW: HP V2500 with 16 processors, 4 GB memory, 227 GB disk • HW: StorageTek TimberWolf 9740 with 2,7 TB of 9840 tapes Born in a supercomputer!
  • 20. An example of long term preservation The “preservation history” of TDX (doctoral theses)… Hardware migrations • 2003 (CPU + disk) – HP rp5430 with 2 processors, 704 GB memory – HP EVA V.2 with 2,8 TB disk • 2006 (CPU + tape) – High availability HP cluster with 32 ProLiant DL360 nodes – ADIC Scalar i2000 (from 9840 tapes to LTO3 tapes) • 2009 (disk) – NetApp FAS3170 with 60 TB disk Software migrations • 2010 – DSpace (+ PostgreSQL, Java, Solr, …) from MIT & HP Labs
  • 21. An example of long term preservation The “preservation history” of TDX (doctoral theses)… Replication • On disk - Online version (1) • One backup on the tape library (2) • Another backup in a fireproof cabinet (3) • Another backup at a remote centre 50 km away (4) • A dark copy on the MetaArchive Cooperative – Private LOCKSS (Lots of Copies Keep Stuff Safe) Network – 10 more copies around the world (14) Data Integrity • Checksums on DSpace (online version) • Checksums on LOCKSS (dark copies)
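Replication and checksums work together: with several copies, a damaged one can be spotted because its checksum disagrees with the others. The sketch below shows that idea as a simple majority comparison. It is an illustration only and not the polling protocol that LOCKSS actually uses; the copy names and the function are hypothetical.

```python
import hashlib
from collections import Counter


def checksum(data: bytes) -> str:
    """SHA-256 digest of one copy (algorithm chosen for illustration)."""
    return hashlib.sha256(data).hexdigest()


def find_damaged_copies(copies):
    """Compare the checksums of all copies and report those outside the majority.

    copies: dict mapping a copy name to its content as bytes.
    Returns (majority_digest, sorted names of disagreeing copies).
    Raises ValueError when no digest is shared by more than half of the copies.
    """
    digests = {name: checksum(data) for name, data in copies.items()}
    majority_digest, votes = Counter(digests.values()).most_common(1)[0]
    if votes <= len(copies) / 2:
        raise ValueError("no majority: cannot tell which copies are intact")
    damaged = sorted(name for name, d in digests.items() if d != majority_digest)
    return majority_digest, damaged


if __name__ == "__main__":
    copies = {
        "online": b"thesis bytes",
        "tape": b"thesis bytes",
        "cabinet": b"thesis byt3s",  # simulated bit rot
        "remote": b"thesis bytes",
    }
    _, damaged = find_damaged_copies(copies)
    print(damaged)
```

Once a damaged copy is identified, it can be repaired from any copy that matches the majority digest, which is why keeping more than two copies is useful.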
  • 22. An example of long term preservation The “preservation history” of TDX (doctoral theses)… 2011 – 300 GB, more than 3,5 million access hits • SW: DSpace (+ PostgreSQL, Java, Solr, …) from MIT & HP Labs • HW: High availability HP cluster with 32 ProLiant DL360 nodes • HW: NetApp FAS3170 with 60 TB disk • HW: ADIC Scalar i2000 • SW: LOCKSS (+ Conspectus...) • HW: HP DL380 (LOCKSS cache) xxxx – … www.tdx.cat
  • 23. Outline 1. What is CESCA? 2. CESCA services: HPC and Storage, Network, University e-Administration, Portals and Repositories 3. Digital Repositories: Overview, Two examples (DSpace and web archiving), Long term preservation 4. CESCA and GAIA: What is done, What could be done
  • 24. GAIA at CESCA: what is done 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 Data Processing IDT/IDU Database GDASS/COG Storage Backup
  • 25. GAIA and CESCA: what could be done Data processing; Database; Storage and Backup; Data Repository: Large data transfer, Powerful searches and interoperability, Preservation: Dark copy, …