1. GAIA Tech1 Data Repositories Meeting
Ingrid Bàrcena, HPC and Storage services manager
Ricard de la Vega, Portals and Repositories manager
GAIA Tech1 meeting
Madrid, May 24, 2011
2. Outline
1. What is CESCA?
2. CESCA services
HPC and Storage
Network
University e-Administration
Portals and Repositories
3. Digital Repositories
Overview
Two examples: DSpace and web archiving
Long term preservation
4. CESCA and GAIA
What is done
What could be done
3. Centre de Supercomputació de Catalunya
Public consortium created in 1991
ICTS since 2000
Patrons:
• Generalitat de Catalunya
• Fundació Catalana per a la Recerca i la Innovació
• Universitat de Barcelona
• Universitat Autònoma de Barcelona
• Universitat Politècnica de Catalunya
• Universitat Pompeu Fabra
• Universitat de Girona
• Universitat Rovira i Virgili
• Universitat de Lleida
• Universitat Oberta de Catalunya
• Universitat Ramon Llull
• Consell Superior d’Investigacions Científiques
5. HPC and Storage
HPC Service
19.48 Tflop/s peak performance
50 research projects (203 users)
Main areas:
• Materials Science (31%)
• Life Science (32%)
• Environmental Science (28%)
• Astronomy and Astrophysics (5%)
+ 3.5 HC used during 2010
+ 50 scientific applications available
Storage Service
Disk Library: NetApp FAS3170, 150 TB
• 21 TB FC drives
• 126 TB SATA drives
Tape Library: ADIC i2000, 156 TB
• 6 LTO-4 drives
• 300 slots
• NetBackup 6.5
Drug Design Service
6 Pharma Labs
10 Academic research groups
2 Software Packages
6. Network services
+80 connected institutions
21 institutions in Catalonia
40 countries
24 ISP and operators
2 core nodes at 10 Gbps
Flexible bandwidth
Services: IPv6, multimedia, Remote Access Service, Voice over IP, Eduroam, Security...
Services: Multicast, IPv6, NTP server, F root server (A and J, .com and .net coming soon)...
8. Portals and Repositories
Since 2001: 18 universities; 10,577 doctoral theses; www.tdx.cat
Since 2005: 22 institutions; 24,564 research papers, eprints…; www.recercat.cat
Since 2006: 328 journals; 129,235 articles; www.raco.cat
Since 2009: 10 universities; 1,814 learning objects; www.mdx.cat
Since 2006: 39,587 websites crawled; 118,039 versions crawled; 249M files in 7.5 TB; www.padicat.cat
Pilot 2009-10: 420 websites crawled; 790 versions crawled; http://recyt.fecyt.es
Since 2010: 22 institutions; 24,564 research papers, eprints…; www.recercat.cat (restricted IP address)
Since 2006: turnkey development; evolutionary maintenance; http://recyt.fecyt.es
Data as of 31-03-11
9. Outline
1. What is CESCA?
2. CESCA services
HPC and Storage
Network
University e-Administration
Portals and Repositories
3. Digital Repositories
Overview
Two examples: DSpace and web archiving
Long term preservation
4. CESCA and GAIA
What is done
What could be done
10. Digital Repositories
A repository captures, stores, indexes, preserves and distributes
digital content.
Data + Metadata
• Dublin Core (DC)
• METS, MODS, MARC21…
• VO?
• Astronomical?
Main issues
• Access (search / browse)
• Preservation
• Interoperability
– Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)
(based on Dublin Core metadata)
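The interoperability point above can be illustrated with a minimal Python sketch of an OAI-PMH harvest: building a ListRecords request and reading Dublin Core fields from the response. The endpoint URL, the record identifier and the sample record are hypothetical and only serve the illustration; the verb, the oai_dc metadata prefix and the XML namespaces are those of OAI-PMH 2.0 and Dublin Core.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespaces used by OAI-PMH 2.0 responses carrying unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hand-written sample response (hypothetical repository and record).
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repository.example.org:1234</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A hypothetical doctoral thesis</dc:title>
          <dc:creator>Doe, Jane</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
""".strip()


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Return one dict per record: identifier plus Dublin Core titles and creators."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": rec.findtext(
                "oai:header/oai:identifier", default="", namespaces=NS),
            "titles": [e.text for e in rec.iterfind(".//dc:title", NS)],
            "creators": [e.text for e in rec.iterfind(".//dc:creator", NS)],
        })
    return records


if __name__ == "__main__":
    # The base URL is an assumption; a real harvester would fetch this URL.
    print(list_records_url("https://repository.example.org/oai"))
    print(parse_records(SAMPLE_RESPONSE))
```

A real harvester would also follow the resumptionToken that OAI-PMH uses to page through large result sets; that step is left out of this sketch.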
11. Repositories taxonomy
Towards a European e-Infrastructure for e-Science Digital Repositories. 7th e-Concentration Meeting, Brussels, 12-14th October, 2009
12. Repositories Hardware
High availability
Load balancing
Easy scalability
24x7 monitoring
[Diagram: balancers in front of the service nodes; data on a storage area network, on disk and tape]
13. Repositories Software
For general purpose
• DSpace, EPrints, Fedora, Islandora…
• Implemented in
For journal management
• Open Journal Systems (OJS)
• Implemented in
For web archives preservation
• Heritrix, NutchWAX, WERA, Wayback, Webcurator…
• Implemented in
14. Example on general purpose repository (DSpace)
For digital objects such as PDFs, images, videos, data…
Metadata and PDF text are indexed for searching
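As a rough illustration of what indexing metadata and text for searching means, here is a minimal inverted-index sketch in Python. It is not how DSpace does it internally (DSpace delegates searching to a dedicated search engine); the item ids, field names and texts below are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical items: metadata fields plus text extracted from the PDF.
ITEMS = {
    "thesis-1": {
        "title": "Stellar populations in the Milky Way",
        "fulltext": "We analyse astrometric data for one million stars.",
    },
    "thesis-2": {
        "title": "Drug design methods",
        "fulltext": "Docking of ligands against protein targets.",
    },
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"\w+", text.lower())


def build_index(items):
    """Map each word to the set of item ids whose metadata or text contains it."""
    index = defaultdict(set)
    for item_id, fields in items.items():
        for value in fields.values():
            for word in tokenize(value):
                index[word].add(item_id)
    return index


def search(index, query):
    """Return the ids of the items that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


if __name__ == "__main__":
    idx = build_index(ITEMS)
    print(search(idx, "astrometric data"))
    print(search(idx, "design"))
```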
15. Example on web archive (PADICAT)
PADICAT collects, processes and provides permanent access to
the entire cultural, scientific and general output of
Catalonia in digital format. It is the archive of Catalan
websites.
                        PANDORA     UK ARCHIVE  IA           VEFSAFN     BNF         Kulturarw3  Netarchive
Scope                   Australia   UK          World        Iceland     France      Sweden      Denmark
Begin                   1996        2004        1996         2004        2002        1996        2005
Open access since 2009
Search by URL
Search by keyword
Directory
N. websites             26,630      8,308       -            -           -           -           > 1.1 million
N. crawls               60,276      32,618      150 billion  -           -           -           4.5 billion
Space                   4.63 TB     7.59 TB     -            -           180 TB      -           155 TB
Data as of              16-12-2010  12-01-2011  13-12-2011   13-01-2011  13-01-2011  26-11-2010  08-2010
Since 2006
- 39,587 websites crawled
- 118,039 versions crawled
- 249M files in 7.5 TB
- Open access
- Search by URL and keyword
- Catalogue and thematic directory
www.padicat.cat
16. Web archive software architecture
1. Harvest
HERITRIX + WEB CURATOR TOOL, producing ARC files
2. Index and search
ARCINDEXER: index for URL searching (WAYBACK)
HADOOP + NUTCHWAX: index for keyword searching (WERA)
3. Catalogue and browse
CATALOG DATABASE (crawl metadata)
17. PADICAT’s indexes
Until now (< 100,000 website versions crawled)
• For search by URL (like Internet Archive)
– Index with ArcIndexer (~100 GB) + visualize with Wayback √
• For search by keyword
– Index with Hadoop + NutchWAX + visualize with WERA √
Now (120,000 website versions crawled)
• Performance problems for keyword indexing
• Two solutions under evaluation:
– Index with a new version of NutchWAX + visualize with TNH (The New Hotness, from IA)
– Index with JB (James Brown, from IA) + visualize with TNH
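To make the idea of a URL index more concrete, the following Python sketch keeps a sorted list of captures and looks up every version of a URL by binary search. It is a simplified illustration and not the actual ArcIndexer or Wayback format; the class name, the URLs and the "file:offset" locations are hypothetical. Timestamps are written as 14-digit YYYYMMDDhhmmss strings, which sort chronologically.

```python
import bisect


class UrlIndex:
    """Sorted list of (url, timestamp, location) captures, searched with bisect."""

    def __init__(self):
        self._entries = []

    def add(self, url, timestamp, location):
        """Insert one capture, keeping the list sorted by url, then timestamp."""
        bisect.insort(self._entries, (url, timestamp, location))

    def versions(self, url):
        """Return every (timestamp, location) captured for url, oldest first."""
        # (url,) sorts before any (url, timestamp, location) with the same url.
        start = bisect.bisect_left(self._entries, (url,))
        found = []
        for entry_url, timestamp, location in self._entries[start:]:
            if entry_url != url:
                break
            found.append((timestamp, location))
        return found

    def latest_before(self, url, timestamp):
        """Return the newest capture taken at or before timestamp, or None."""
        candidates = [v for v in self.versions(url) if v[0] <= timestamp]
        return candidates[-1] if candidates else None


if __name__ == "__main__":
    index = UrlIndex()
    # Hypothetical captures: the location is an ARC file name and byte offset.
    index.add("http://www.example.cat/", "20100301120000", "crawl-0002.arc.gz:2048")
    index.add("http://www.example.cat/", "20060915080000", "crawl-0001.arc.gz:1024")
    index.add("http://www.example.org/", "20090101000000", "crawl-0001.arc.gz:4096")
    print(index.versions("http://www.example.cat/"))
    print(index.latest_before("http://www.example.cat/", "20091231235959"))
```

Keyword search needs a different structure (an inverted index over the page text), which is much larger and is where the performance problems mentioned above appear.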
18. Long term preservation
The e-infrastructure must ensure long-term access to the data,
without failure.
To succeed, the following must be taken into account:
• Replication (more than one copy)
• Media refresh
• Format migration
• Data integrity (checksums)
• Contingency and recovery plan
• Preservation plan
• ...
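The data integrity item in the list above can be sketched as a checksum manifest: record a digest for every file once, then recompute and compare later. This is a minimal Python sketch under stated assumptions; the choice of SHA-256, the function names and the report layout are illustrative and do not describe any particular repository's implementation.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Record a checksum for every file under root, keyed by relative path."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify(root, manifest):
    """Compare the files currently under root against a stored manifest."""
    current = build_manifest(root)
    return {
        "missing": sorted(set(manifest) - set(current)),
        "changed": sorted(
            name for name in manifest
            if name in current and current[name] != manifest[name]
        ),
        "unexpected": sorted(set(current) - set(manifest)),
    }
```

In practice the manifest would be stored apart from the data it describes and the check run periodically, so that silent corruption is noticed while a good copy still exists.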
19. An example of long term preservation
The “preservation history” of TDX (doctoral theses)…
2001 – 80 GB, 8,000 access hits
• SW: ETDdb (+ MySQL, Glimpse…) from Virginia Tech
• HW: HP V2500 with 16 processors, 4 GB memory, 227 GB disk
• HW: StorageTek TimberWolf 9740 with 2.7 TB of 9840 tapes
Born in a supercomputer!
20. An example of long term preservation
The “preservation history” of TDX (doctoral theses)…
Hardware migrations
• 2003 (cpu + disk)
– HP rp5430 with 2 processors, 704 GB memory
– HP EVA V.2 with 2.8 TB disk
• 2006 (cpu + tape)
– High availability HP cluster with 32 ProLiant DL360 nodes
– ADIC Scalar i2000 (from 9840 tapes to LTO3 tapes)
• 2009 (disk)
– NetApp FAS3170 with 60 TB disk
Software migrations
• 2010 – DSpace (+ PostgreSQL, Java, Solr, …) from MIT & HP Labs
21. An example of long term preservation
The “preservation history” of TDX (doctoral theses)…
Replication
• On disk: online version (1)
• One backup in the tape library (2)
• Another backup in a fireproof cabinet (3)
• Another backup at a remote centre 50 km away (4)
• A dark copy in the MetaArchive Cooperative
– Private LOCKSS (Lots of Copies Keep Stuff Safe) network
– 10 more copies around the world (14)
Data Integrity
• Checksums on DSpace (online version)
• Checksums on LOCKSS (dark copies)
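Replication and checksums work together: when several copies exist, their digests can be compared to find a damaged one. The Python sketch below uses a simple majority vote. It is a deliberately simplified illustration and not the LOCKSS polling protocol; the function name, replica names and digest values are hypothetical.

```python
from collections import Counter


def audit_replicas(digests):
    """Compare the checksum each replica reports for the same object.

    digests maps a replica name to its reported checksum. Returns
    (agreed_digest, damaged), where damaged lists the replicas that disagree
    with the majority. If no strict majority exists, returns (None, all names).
    """
    if not digests:
        return None, []
    counts = Counter(digests.values())
    digest, votes = counts.most_common(1)[0]
    if votes * 2 <= len(digests):
        # No value is held by more than half of the replicas.
        return None, sorted(digests)
    damaged = sorted(name for name, value in digests.items() if value != digest)
    return digest, damaged


if __name__ == "__main__":
    # Hypothetical digests for one thesis file held in four places.
    reported = {
        "online disk": "9f2c",
        "tape library": "9f2c",
        "fireproof cabinet": "9f2c",
        "remote centre": "04ab",
    }
    print(audit_replicas(reported))
```

A copy flagged as damaged would then be repaired from one of the copies that agree, which is why more than one independent replica is needed.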
22. An example of long term preservation
The “preservation history” of TDX (doctoral theses)…
2011 – 300 GB, more than 3.5 million access hits
• SW: DSpace (+ PostgreSQL, Java, Solr, …) from MIT & HP Labs
• HW: High availability HP cluster with 32 ProLiant DL360 nodes
• HW: NetApp FAS3170 with 60 TB disk
• HW: ADIC Scalar i2000
• SW: LOCKSS (+ Conspectus...)
• HW: HP DL380 (LOCKSS cache)
xxxx – …
www.tdx.cat
23. Outline
1. What is CESCA?
2. CESCA services
HPC and Storage
Network
University e-Administration
Portals and Repositories
3. Digital Repositories
Overview
Two examples: DSpace and web archiving
Long term preservation
4. CESCA and GAIA
What is done
What could be done
24. GAIA at CESCA: what is done
[Timeline, 2000 to 2011. Rows: Data processing, IDT/IDU, Database, GDASS/COG, Storage, Backup]
25. GAIA and CESCA: what could be done
Data processing
• Database
• Large data transfer
Data Repository
• Powerful searches and interoperability
• Storage and preservation: backup, dark copy, …