GAIA Tech1 Data Repositories Meeting
Transcript

  • 1. GAIA Tech1 Data Repositories Meeting
      Ingrid Bàrcena, HPC and Storage services manager
      Ricard de la Vega, Portals and Repositories manager
      GAIA Tech1 meeting, Madrid, May 24, 2011
  • 2. Outline
      1. What is CESCA?
      2. CESCA services: HPC and Storage; Network; University e-Administration; Portals and Repositories
      3. Digital Repositories: Overview; Two examples (DSpace and web archiving); Long-term preservation
      4. CESCA and GAIA: What is done; What could be done
  • 3. Centre de Supercomputació de Catalunya
      Public consortium created in 1991; ICTS since 2000
      Patrons:
      • Generalitat de Catalunya
      • Fundació Catalana per a la Recerca i la Innovació
      • Universitat de Barcelona
      • Universitat Autònoma de Barcelona
      • Universitat Politècnica de Catalunya
      • Universitat Pompeu Fabra
      • Universitat de Girona
      • Universitat Rovira i Virgili
      • Universitat de Lleida
      • Universitat Oberta de Catalunya
      • Universitat Ramon Llull
      • Consell Superior d’Investigacions Científiques
  • 4. Our Services
  • 5. HPC and Storage
      HPC Service
      • 19.48 Tflop/s peak performance
      • 50 research projects (203 users)
      • Main areas: Materials Science (31%), Life Science (32%), Environmental Science (28%), Astronomy and Astrophysics (5%)
      • + 3.5 HC used during 2010
      • + 50 scientific applications available
      Storage Service
      • Disk library: NetApp FAS3170, 150 TB (21 TB FC drives, 126 TB SATA drives)
      • Tape library: ADIC i2000, 156 TB, 6 LTO-4 drives, 300 slots, NetBackup 6.5
      Drug Design Service
      • 6 pharma labs
      • 10 academic research groups
      • 2 software packages
  • 6. Network services
      • + 80 connected institutions
      • 21 institutions in Catalonia
      • 24 ISPs and operators
      • 40 countries
      • 2 core nodes at 10 Gbps
      • Flexible bandwidth
      • Services: IPv6, multimedia, RemotAccess Service, Voice over IP, Eduroam, Security...
      • Services: Multicast, IPv6, NTP server, F root server (A and J, .com and .net coming soon)...
  • 7. University e-Administration Projects
      e-Register
      • URV: production 02-01-11
      • UdL: production 03-14-11
      • Sadiel: 32,692 €
      e-Vote
      • Bid price: 405,000 €
      • Awarded (03-18-10): Scytl, 345,000 €
      • Production: 02-01-11
      e-Archive
      • Transfer agreement: 12-7-10
      • Inst. ATLAS: 17,800 €
      • Integr. Doc. Mgt: award to IECI, 51,920 € (02-12-11)
      • Production: 06-01-11
      F5 BIG-IP load balancers
      SCD (e-Identity and e-Signature)
      • Available: EC-UR and EC-URV; ER-CESCA, -URV, -UPC, -UdL, -UPF
      • In development: ER-UdG, ER-UB, ER-UAB and ER-UVic
      GPI
      • Improvements (02-03-11) …
      • Inteum, Sentinel and Technology Publisher
      • Office 2007; separate VMs per university; e-mail sending
      • Licence renewal: UB and UPC
      • Investment: 1,046.97 €
      Data layer
      • Cluster: 15 BL460c G6 (2 x Intel Xeon E5530 QC); 480 GB; 4.3 TB; Citrix XenServer; 2 F5 BIG-IP 1600 load balancers; 110,487 €
  • 8. Portals and Repositories
      • www.tdx.cat: since 2001; 18 universities; 10,577 doctoral theses
      • www.recercat.cat: since 2005; 22 institutions; 24,564 research papers, eprints…
      • www.raco.cat: since 2006; 328 journals; 129,235 articles
      • www.mdx.cat: since 2009; 10 universities; 1,814 learning objects
      • www.padicat.cat: since 2006; 39,587 websites crawled; 118,039 versions crawled; 249M files in 7.5 TB
      • Pilot 2009-10: 420 websites crawled; 790 versions crawled
      • Since 2010: 22 institutions; 24,564 research papers, eprints…
      • Since 2006: turnkey development; evolutionary maintenance
      • Other URLs shown: http://recyt.fecyt.es, http://recyt.fecyt.es, www.recercat.cat (restricted IP address)
      • 31-03-11
  • 9. Outline
      1. What is CESCA?
      2. CESCA services: HPC and Storage; Network; University e-Administration; Portals and Repositories
      3. Digital Repositories: Overview; Two examples (DSpace and web archiving); Long-term preservation
      4. CESCA and GAIA: What is done; What could be done
  • 10. Digital Repositories
      A repository captures, stores, indexes, preserves and distributes digital content.
      Data + Metadata
      • Dublin Core (DC)
      • METS, MODS, MARC21…
      • VO?
      • Astronomical?
      Main issues
      • Access (search / browse)
      • Preservation
      • Interoperability
          – Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), based on Dublin Core metadata
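As a rough illustration of the interoperability point on slide 10, the sketch below builds an OAI-PMH `ListRecords` request that asks a repository for its records in Dublin Core (`oai_dc`). The base URL and the set name are hypothetical placeholders, not endpoints taken from the slides; only the `verb`, `metadataPrefix` and `set` parameter names come from the OAI-PMH protocol.

```python
from typing import Optional
from urllib.parse import urlencode


def build_list_records_url(base_url: str,
                           metadata_prefix: str = "oai_dc",
                           set_spec: Optional[str] = None) -> str:
    """Return an OAI-PMH ListRecords request URL.

    base_url is a hypothetical repository endpoint supplied by the caller.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        # Optional selective harvesting of one set (collection).
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    # Example endpoint is an assumption for illustration only.
    print(build_list_records_url("https://repository.example.org/oai"))
```

A harvester would fetch this URL, parse the XML response, and follow the resumption token until the repository reports no more records; that loop is left out here to keep the sketch short.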
  • 11. Repositories taxonomy
      Source: Towards a European e-Infrastructure for e-Science Digital Repositories. 7th e-Concentration Meeting, Brussels, 12-14 October 2009
  • 12. Repositories Hardware
      Requirements
      • High availability
      • Load balancing
      • Easy scalability
      • 24x7 monitoring
      Layers shown in the diagram
      • Balancers
      • Services …
      • Storage Area Network
      • Data …
      • Disk
      • Tape
  • 13. Repositories Software
      For general purpose
      • DSpace, EPrints, Fedora, Islandora…
      • Implemented in: [logos not preserved in the transcript]
      For journal management
      • Open Journal Systems (OJS)
      • Implemented in: [logos not preserved in the transcript]
      For web archive preservation
      • Heritrix, NutchWAX, WERA, Wayback, Webcurator…
      • Implemented in: [logos not preserved in the transcript]
  • 14. Example of a general-purpose repository (DSpace)
      • For digital objects such as PDF, images, videos, data…
      • Indexes metadata and PDF content for searching
  • 15. Example of a web archive (PADICAT)
      PADICAT consists of collecting, processing and providing permanent access to the entire cultural, scientific and general output of Catalonia in digital format. It is the archive of Catalan websites.

                     PANDORA     UK ARCHIVE  IA           VEFSAFN     BNF         Kulturarw3  Netarchive
      Scope          Australia   UK          World        Iceland     France      Sweden      Denmark
      Begin          1996        2004        1996         2004        2002        1996        2005
      N. websites    26,630      8,308       -            -           -           -           > 1.1 million
      N. crawls      60,276      32,618      150 billion  -           -           -           4.5 billion
      Space          4.63 TB     7.59 TB     -            -           180 TB      -           155 TB
      Data date      16-12-2010  12-01-2011  13-12-2011   13-01-2011  13-01-2011  26-11-2010  08-2010

      The table also had rows for Open access (one entry reads "since 2009"), Search by URL, Search by keyword and Directory; their cell values are not preserved in the transcript.

      PADICAT (www.padicat.cat)
      • Since 2006
      • 39,587 websites crawled
      • 118,039 versions crawled
      • 249M files in 7.5 TB
      • Open access
      • Search by URL and keyword
      • Catalogue and thematic directory
  • 16. Web archive software architecture
      1. Harvest
      2. Index and search
      3. Catalogue and browse
      Components shown in the diagram
      • Heritrix and Web Curator Tool
      • Catalog database (crawl metadata)
      • ARC files
      • Hadoop + NutchWAX: index for keyword searching
      • ArcIndexer: index for URL searching
      • WERA and Wayback
  • 17. PADICAT’s indexes
      Until now (< 100,000 website versions crawled)
      • Search by URL (like the Internet Archive): index with ArcIndexer (~100 GB) and visualize with Wayback √
      • Search by keyword: index with Hadoop + NutchWAX and visualize with WERA √
      Now (120,000 website versions crawled)
      • Performance problems with keyword indexing
      • Two solutions under evaluation:
          – Index with a new version of NutchWAX and visualize with TNH (The New Hotness, from IA)
          – Index with JB (James Brown, from IA) and visualize with TNH
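To make the difference between the two kinds of index on slide 17 concrete, here is a toy in-memory sketch: one mapping from URL to capture timestamps (search by URL) and one inverted index from word to captures (search by keyword). It is only an illustration under simplifying assumptions; the class and method names are hypothetical, and it does not reproduce the on-disk formats or behaviour of ArcIndexer, NutchWAX, Wayback or WERA.

```python
from collections import defaultdict


class ToyArchiveIndex:
    """Toy index over crawled page versions (illustrative only)."""

    def __init__(self):
        # URL index: url -> list of capture timestamps.
        self.by_url = defaultdict(list)
        # Keyword index: lowercase word -> set of (url, timestamp) captures.
        self.by_keyword = defaultdict(set)

    def add_capture(self, url, timestamp, text):
        """Register one crawled version of a page."""
        self.by_url[url].append(timestamp)
        for word in text.lower().split():
            self.by_keyword[word].add((url, timestamp))

    def versions(self, url):
        """Search by URL: all capture timestamps of a page, oldest first."""
        return sorted(self.by_url.get(url, []))

    def search(self, word):
        """Search by keyword: all captures whose text contains the word."""
        return sorted(self.by_keyword.get(word.lower(), set()))


if __name__ == "__main__":
    index = ToyArchiveIndex()
    index.add_capture("http://example.cat/", "20090301", "Festa major programme")
    index.add_capture("http://example.cat/", "20100301", "Festa major photos")
    print(index.versions("http://example.cat/"))
    print(index.search("photos"))
```

The keyword index grows with the amount of text in every version, while the URL index grows only with the number of captures, which is one plausible reason keyword indexing is the part that runs into performance problems first.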
  • 18. Long-term preservation
      The e-infrastructure must ensure long-term access to the data, without failure. To succeed, it must take into account:
      • Replication (more than one copy)
      • Media refresh
      • Format migration
      • Data integrity (checksums)
      • Contingency and recovery plan
      • Preservation plan
      • ...
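The data integrity item on slide 18 can be sketched as follows: compute a checksum when an object is ingested, store it, and recompute it later to detect silent corruption. SHA-256 and the function names are assumptions chosen for this example; the slides do not say which checksum algorithm is used.

```python
import hashlib
import io


def sha256_of_stream(stream, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a binary stream, read in chunks."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def is_intact(stream, expected_digest):
    """True if the stream still matches the digest recorded at ingest time."""
    return sha256_of_stream(stream) == expected_digest


if __name__ == "__main__":
    # In practice the stream would be open(path, "rb"); BytesIO keeps the demo self-contained.
    recorded = sha256_of_stream(io.BytesIO(b"thesis contents"))
    print(is_intact(io.BytesIO(b"thesis contents"), recorded))
    print(is_intact(io.BytesIO(b"thesis c0ntents"), recorded))
```

Reading in chunks keeps memory use constant, which matters when the same check is run over multi-gigabyte files.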
  • 19. An example of long-term preservation
      The “preservation history” of TDX (doctoral theses)…
      2001: 80 GB, 8,000 access hits
      • SW: ETDdb (+ MySQL, Glimpse…) from Virginia Tech
      • HW: HP V2500 with 16 processors, 4 GB memory, 227 GB disk
      • HW: StorageTek TimberWolf 9740 with 2.7 TB of 9840 tapes
      Born in a supercomputer!
  • 20. An example of long-term preservation
      The “preservation history” of TDX (doctoral theses)…
      Hardware migrations
      • 2003 (CPU + disk)
          – HP rp5430 with 2 processors, 704 GB memory
          – HP EVA V.2 with 2.8 TB disk
      • 2006 (CPU + tape)
          – High-availability HP cluster with 32 ProLiant DL360 nodes
          – ADIC Scalar i2000 (from 9840 tapes to LTO3 tapes)
      • 2009 (disk)
          – NetApp FAS3170 with 60 TB disk
      Software migrations
      • 2010
          – DSpace (+ PostgreSQL, Java, Solr, …) from MIT & HP Labs
  • 21. An example of long-term preservation
      The “preservation history” of TDX (doctoral theses)…
      Replication
      • On disk: online version (1)
      • One backup in the tape library (2)
      • Another backup in a fireproof cabinet (3)
      • Another backup at a remote centre 50 km away (4)
      • A dark copy in the MetaArchive Cooperative
          – Private LOCKSS (Lots of Copies Keep Stuff Safe) network
          – 10 more copies around the world (14)
      Data integrity
      • Checksums in DSpace (online version)
      • Checksums in LOCKSS (dark copies)
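Slide 21 combines replication (4 local and remote copies plus 10 dark copies, 14 in total) with checksums. The sketch below shows one simple way the two ideas fit together: compare the checksums of all copies and flag any copy that disagrees with the majority. This is a deliberately simplified illustration and not the actual LOCKSS polling protocol; the function name and the copy labels are hypothetical, and it assumes that most copies are intact.

```python
import hashlib
from collections import Counter
from typing import Dict, List


def find_damaged_copies(copies: Dict[str, bytes]) -> List[str]:
    """Return the names of copies whose checksum differs from the majority.

    Assumes a majority of the copies are intact (simplified majority vote).
    """
    digests = {name: hashlib.sha256(data).hexdigest()
               for name, data in copies.items()}
    # The most common digest is taken as the reference value.
    majority_digest, _count = Counter(digests.values()).most_common(1)[0]
    return sorted(name for name, digest in digests.items()
                  if digest != majority_digest)


if __name__ == "__main__":
    copies = {
        "online": b"thesis contents",
        "tape": b"thesis c0ntents",      # simulated bit rot
        "cabinet": b"thesis contents",
        "remote": b"thesis contents",
    }
    print(find_damaged_copies(copies))
```

A damaged copy found this way can then be repaired from any copy that matches the majority checksum, which is the practical benefit of keeping many independent replicas.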
  • 22. An example of long-term preservation
      The “preservation history” of TDX (doctoral theses)…
      2011: 300 GB, more than 3.5 million access hits
      • SW: DSpace (+ PostgreSQL, Java, Solr, …) from MIT & HP Labs
      • HW: High-availability HP cluster with 32 ProLiant DL360 nodes
      • HW: NetApp FAS3170 with 60 TB disk
      • HW: ADIC Scalar i2000
      • SW: LOCKSS (+ Conspectus...)
      • HW: HP DL380 (LOCKSS cache)
      xxxx: …
      www.tdx.cat
  • 23. Outline
      1. What is CESCA?
      2. CESCA services: HPC and Storage; Network; University e-Administration; Portals and Repositories
      3. Digital Repositories: Overview; Two examples (DSpace and web archiving); Long-term preservation
      4. CESCA and GAIA: What is done; What could be done
  • 24. GAIA at CESCA: what is done
      Timeline chart covering 2000 to 2011 (bars not preserved in the transcript), with the labels:
      • Data Processing
      • IDT/IDU
      • Database
      • GDASS/COG
      • Storage
      • Backup
  • 25. GAIA and CESCA: what could be done
      • Data processing
      • Database
      • Large data transfer
      • Storage and backup
      • Data repository: powerful searches and interoperability
      • Preservation: dark copy, …
  • 26. Thank you! Questions?
      ibarcena@cesca.cat
      rdelavega@cesca.cat