Trusted Datagrids: Library of Congress Projects with UCSD


  • This presentation discusses a project between the Library of Congress and SDSC. The focus of the project has been on “trust building” between the organizations. ***[THEME OF THE PROJECT: Exploring large-scale data transfer and storage, and developing trust in the ability of an outside party to reliably store content.] One of the collections we’ve been working with from the LC is the Prokudin-Gorskii photograph collection from Prints and Photographs. I’m going to be using some of the characters from this collection to help me tell my story.
  • As they moved into the realm of large-scale data, the LC found themselves wandering into unmarked territory. It can be a real wilderness out there! [Careful not to portray us as lost and clueless. You don’t want to be accidentally insulting.] [Again, the theme is: Exploring large-scale data transfer and storage, and developing trust in the ability of an outside party to reliably store content.]
  • Of course, this isn’t just a problem for the LC. There are a number of organizations facing the same problems – and wandering in the same wilderness. How can they decide what direction to take without an accurate map?
  • This is Pinchus the boatman, and he’s one of the most famous people from the P-G collection. And of course, there are plenty of people like Pinchus - people who claim they can help. They even claim to have the tools needed to move forward. But of course the question remains – how can you trust someone you can’t control? Because the problem is that these promises are often just like old photographs – they look clear and well-defined at a distance…
  • But, as you get closer and closer, the picture gets fuzzier and fuzzier.
  • Self-explanatory.
  • At SDSC we’re focused on bringing forward the right combinations of tools and experts – cyberinfrastructure. [Points to make: “ We began our project together with a set of scenarios created by the LC, describing, as best as LC could imagine, tests and tasks for SDSC to prove that LC’s content (hosted at SDSC) was safe and secure.” “ The first step for SDSC was to help LC create a solid cyberinfrastructure.”]
  • For example.
  • [Elaborate just a little more on what it took to set this up: your remote access to LC, Andy’s role in tweaking the internal network, the network monitoring, authentication at your end.] Another famous face from the P-G collection is the Emir of Bukhara. Today the Emir is going to stand in for the LC’s data. One of the services we provided was high-speed replication across the country. We did this by checksumming the data first, then shipping it across the country using Internet2 and verifying it upon arrival. We then reversed the process, shipping the data back and verifying it again. All in all we achieved good transfer speeds, and we hope to improve this service as we continue to work with LC.
  • At the same time we created a safe haven for the data, both within SDSC and across the country.
  • And of course we provided all of the enhanced services you would expect to keep your data safe. [You’ll elaborate, right? People will want to know specific details, probably culminating in the server report emails.]
  • We also created a demonstration for a new tool that the LC could use to access and manipulate their data. […if they chose to. “Ultimately LC decided they wanted to not manipulate their data but only to have SDSC store it safely.”]
  • [Will you talk about the SAM-QFS outage? It proved SDSC’s reliability and highlighted “accessibility” as a separate issue.] [“Duplication of structure is complicated.” Are you going to talk about the Prokudin-Gorskii symbolic links issue here?]
  • Another LC collection we were working with was a 6TB web archive collection. We were tasked with working with the collection in new and innovative ways. We modified the original software to create a FAST version which still looked default to users. (In fact Internet Archive liked our changes so much that they rolled some of them back into the application.)
  • This was not acceptable, especially since we were working under a deadline to complete the work.
  • In here is where I need to talk about how we did this. Can I get some more information for this slide? How many physical machines was this on? Was anything special done on the software side to get this to happen? Was there any custom configuration that needed to happen?
  • Finally, we’re nearing the end of the project and are working with the LC to come up with documentation based on our work that could be used by anyone in NDIIPP. This should provide a starting place for people looking to do the same kinds of tasks.
  • Overview. Providers: NCAR, UMd, SDSC, UCSD Libraries. Partners (in this case they are clients): ICPSR, CDL.
  • [Are you going to be more specific? You address specifics of our projects (transfer of data over Internet 2, description of storage and system-monitoring environment). In what specific way does Chronopolis compare?]
  • Self-explanatory. [And please summarize exactly how the LC validated trust in SDSC]
  • Self-explanatory. [In what way? In a review sense?]
  • Self-explanatory. [I missed the point. Could you summarize how Cyberinfrastructure applies specifically to the elements of our project? This could be as simple as reviewing the Internet2 network setup, the SDSC storage system, system reporting, and…you never mentioned geographic data replication. If you are going to end on “trust” on the next slide, make sure you make a relevant and specific summary statement here about how SDSC demonstrated trustability.]
  • Self-explanatory.

    1. 1. Trusted Datagrids: Library of Congress Projects with UCSD. Ardys Kozbial – UCSD Libraries; David Minor – SDSC
    2. 2. Building Trust in a 3rd Party Repository: A Pilot Project. David Minor, San Diego Supercomputer Center
    3. 6. How can the LC trust someone they can’t control?
    4. 8. Moving forward in the right direction requires more than fuzzy promises
    5. 9. … it takes a combination of experts and tools. Cyberinfrastructure
    6. 10. Cyberinfrastructure is the collection of ... Resources + Glue Computers, data storage, networks, scientific instruments, experts, etc. Integrating software, systems, and organizations
    7. 11. “Effective cyberinfrastructure for the humanities and social sciences will allow scholars to focus their intellectual and scholarly energies on the issues that engage them, and to be effective users of new media and new technologies, rather than having to invent them.” - ACLS Commission on Cyberinfrastructure for the Humanities & Social Sciences
    8. 12. “The mission of the San Diego Supercomputer Center (SDSC) is to empower communities in data-oriented research, education, and practice through the innovation and provision of Cyberinfrastructure”
    9. 13. SDSC ...
       • Is one of the original NSF supercomputer centers
       • Supports high performance computing systems
       • Supports data applications for science, engineering, social sciences, cultural heritage institutions
       • Has LARGE data capabilities
         • 3+ PB Disk Storage
         • 25+ PB Tape Storage
    10. 14. UCSD Libraries
       • 3.5+ million volumes
       • Digital Access Management System (in development)
         • 250,000+ objects
         • 15+ TB
       • Shared collections with UC
         • California Digital Library
           • Digital Preservation Repository
           • eScholarship repository
    11. 15. Partnerships and Collaborations
       • LC Pilot Project – Building Trust in a 3rd Party Repository
         • Using test image collections/web crawls, ingest content to SDSC repository
         • Allow access for content audit
         • Track usage of content over time
         • Deliver content back to LC at end of project
       • Library of Congress NDIIPP Chronopolis Program
         • Build production-capable Chronopolis grid (50 TB x 3)
         • Further define transmission packaging for archival communities
         • Investigate best network transfer models for I2 and TeraGrid networks
       • California Digital Library (CDL) Mass Transit Program
         • Enable UC System Libraries to transfer high-speed mass digitization collections across CENIC/I2
         • Develop transmission packaging for CDL content
       • UCSD Libraries’ Digital Asset Management System
         • RDF system with data managed in SRB at SDSC
    12. 16. SDSC DPI Group
       • Digital Preservation Initiatives Group
         • Charged with developing and supporting digital preservation services within the Production Systems Division of SDSC
         • Cross-organizational group: SDSC personnel / UCSD Libraries personnel
           • Libraries
           • Archives
           • Technology
           • Information Science
    13. 17. Cyberinfrastructure Trust
    14. 18. For Example:
    15. 19. We worked together to set up high-speed data replication services: checksums on each end, highly reliable Internet2, achieving 200 Mb/s = 2 TB/day
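The replication workflow on this slide (checksum the data, ship it over Internet2, re-checksum on arrival, then repeat in reverse) can be sketched in a few lines. This is an illustrative Python sketch, not the scripts the project actually ran; the function names and the choice of SHA-256 are assumptions.

```python
import hashlib
from pathlib import Path

def checksum(path: Path) -> str:
    """Stream the file through SHA-256 so multi-GB objects never load fully into memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MB chunks
            h.update(chunk)
    return h.hexdigest()

def verify_round_trip(original: Path, returned: Path) -> bool:
    """True only if the copy that came back is bit-identical to what was sent."""
    return checksum(original) == checksum(returned)
```

As a sanity check on the slide's figure: 200 Mb/s is 25 MB/s, and 25 MB/s over 86,400 seconds moves about 2.16 TB, which matches the reported 2 TB/day.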
    16. 20. Network setup involved …
       • LC and SDSC staff working together
       • Configurations on networks and computers
       • Resolving different security environments
       • Network monitoring
    17. 21. Lessons Learned
       • Networking is hard!
       • Can’t forget it once it’s set up
       • It’s not magic - there’s always a reason
       • It highlights the collaborative nature of the work
    18. 22. Trust Elements
       • Has a long-term solution been found?
       • Have multi-institutional issues been solved?
       • Does new infrastructure improve the process?
       • Is the solution useful for other organizations?
    19. 24. SDSC created a robust storage environment for this data: multiple replications … at SDSC … and at geographically diverse locations
    20. 25. (a process with several characteristics)
       • Needed to replicate structure exactly
       • This had to be done for 5+ replications
       • Complex environment had to be transparent
       • Data had to be available for manipulation
    21. 26. The Storage Resource Broker provided replication services ...
    22. 27. ... and extensive monitoring, logging and reporting functions (which led to many conversations)
    23. 28. Logging and monitoring procedures
       • Scripts which compared the files within the system with a master list – checked changes on either side … fairly straightforward
       • But …
         • What is the master list and who maintains it?
         • Who decides what is a legitimate change?
         • Do you want a dark archive or an active remote data center?
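The master-list comparison on this slide can be illustrated with a short sketch. The manifest format, the use of MD5, and the function names are assumptions for illustration, not the project's actual scripts; note that the code can only classify differences, while the slide's open questions (who owns the master list, what counts as a legitimate change) still require a human decision.

```python
import hashlib
from pathlib import Path

def audit_against_master(master: dict, root: Path) -> dict:
    """Compare files on disk against a master manifest mapping relative path -> checksum.

    Classifies every difference as missing, modified, or unlisted; deciding
    which changes are *legitimate* remains a policy question, not a scripting one.
    """
    report = {"missing": [], "modified": [], "unlisted": []}
    seen = set()
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        rel = path.relative_to(root).as_posix()
        seen.add(rel)
        digest = hashlib.md5(path.read_bytes()).hexdigest()
        if rel not in master:
            report["unlisted"].append(rel)
        elif master[rel] != digest:
            report["modified"].append(rel)
    report["missing"] = sorted(set(master) - seen)
    report["modified"].sort()
    report["unlisted"].sort()
    return report
```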
    24. 29. We tested a new Front-End
    25. 30. … and explored an important issue: “Reliability” versus “Accessibility”
    26. 31. Lessons Learned
       • Always keep expectations aligned
       • Don’t confuse accessibility and reliability
       • Duplication of structure is complicated
       • Communication highlights communication
    27. 32. Trust Elements
       • Can remote data be accessed?
       • Can remote data be retrieved and re-used?
       • Can remote data be verified?
       • Can ownership be clearly defined?
    28. 33. SDSC and LC explored a new approach to working with web archives
       • 50,000 ARC files
       • 6 Terabytes of data
       • Short processing time
       • Parallel indexing and display system
       • Looked “default” to the user
    29. 34. Using default tools, our initial indexing rate was 1000 files per day … more than 6 weeks of constant computing to index the entire collection. This was over our time budget.
    30. 35. We ran 18 parallel indexing instances – reduced processing to a week. We modified the Wayback source code to create a new access infrastructure.
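The arithmetic behind this speed-up can be made concrete. A minimal sketch, assuming a simple round-robin split of the ARC file list across indexing instances (the actual partitioning scheme is not described on the slide):

```python
def shard(files: list, workers: int) -> list:
    """Round-robin the ARC file list so each indexing instance gets a near-equal share."""
    return [files[i::workers] for i in range(workers)]

def days_to_index(total_files: int, files_per_day: int, workers: int = 1) -> float:
    """Rough wall-clock estimate: the slowest (largest) shard bounds the total time."""
    largest_shard = -(-total_files // workers)  # ceiling division
    return largest_shard / files_per_day
```

At 1,000 files per day, the 50,000-file collection needs 50 days of serial computing (the slide's "more than 6 weeks"); split across 18 instances, the largest shard holds 2,778 files, under 3 days of pure compute, consistent with the week of wall-clock time reported once setup and stragglers are included.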
    31. 36. Lessons Learned
       • Sometimes you need to start over
       • Default setup isn’t always easiest
       • Time is a wonderful motivator
       • Experts are often interested in your work
    32. 37. Trust Elements
       • Can a new organization bring new expertise?
       • Are the final results the same?
       • Can the results be reached in a better way?
       • Can a new organization work with your partners?
    33. 38. Next steps …. Chronopolis!
    34. 39. Chronopolis: A Partnership
       • Chronopolis is being developed by a national consortium led by SDSC and the UCSD Libraries.
       • Initial Chronopolis provider sites include:
         • SDSC and UCSD Libraries at UC San Diego
         • University of Maryland
         • National Center for Atmospheric Research (NCAR) in Boulder, CO
    35. 40. Institutions and Roles - UCSD
       • SDSC
         • Storage and networking services
         • SRB support
         • Transmission Packaging Modules
       • UCSD Libraries
         • Metadata services (PREMIS)
         • DIPs (Dissemination Information Packages)
         • Other advanced data services as needed
    36. 41. Institutions and Roles - NCAR
       • National Center for Atmospheric Research
         • Archives: Complete copy of all data
         • Storage and network support
         • Network testing
    37. 42. Institutions and Roles - UMIACS
       • University of Maryland – Institute for Advanced Computer Studies
         • Archives: Complete copy of all data
         • Advanced data services
           • PAWN: Producer–Archive Workflow Network in Support of Digital Preservation
           • ACE: Auditing Control Environment to Ensure the Long Term Integrity of Digital Archives
         • Other advanced data services as needed
    38. 43. SDSC Chronopolis Program
    39. 44. Chronopolis Vocabulary
       • Partners – UCSD Libraries, National Center for Atmospheric Research, University of Maryland Institute for Advanced Computer Studies – all provide grid-enabled storage nodes for Chronopolis services.
       • Clients – ICPSR, CDL – contribute content to the Chronopolis preservation network.
       • SRB – Storage Resource Broker – datagrid software.
       • iRODS – integrated Rule Oriented Data System – datagrid software.
       • ACE – Audit Control Environment – part of the ADAPT project at UMD.
       • PAWN – Producer Archive Workflow Network – part of the ADAPT project at UMD.
       • INCA – user-level grid monitoring – executes periodic, automated, user-level testing of Grid software and services – grid middleware.
       • BagIt – transfer specification developed by CDL and the Library of Congress.
       • GridFTP – parallel transfer technology – moves large collections within a grid wide-area network.
    40. 45. Chronopolis: Inside
       • Linked by main staging grid where data is verified for integrity, and quarantined for security purposes.
       • Collections are independently pulled into each system.
       • Manifest layer provides added security for database management and data integrity validation.
       • Benefits
         • 3 independently managed copies of the collection
         • High availability
         • High reliability
       [Diagram: clients (CDL, ICPSR) push to the SDSC staging grid; SDSC, NCAR, and UMD each pull an independent copy (copies 1-3), with MCAT databases, manifest management, multiple hash verifications, grid brick disks, and HPSS tape at the SDSC core center archive]
    41. 46. SDSC Leveraged Infrastructure
       • Serves both HPC & digital preservation
       • Archive
         • 25 PB capacity
         • Both HPSS & SAM-QFS
       • Online disk
         • ~3 PB total
         • HPC parallel file systems
         • Collections
         • Databases
       • Access tools
       Adapted from Richard Moore (SDSC)
    42. 47. Chronopolis Demonstration Project
       • Demonstration Project 2006-2007
         • Demonstration collections ingested within Chronopolis
           • National Virtual Observatory (NVO): 3 TB Hyperatlas images (partial collection)
           • Library of Congress PG Image Collection: 600 GB Prokudin-Gorskii image collection
           • Interuniversity Consortium for Political and Social Research (ICPSR): 2 TB web-accessible data
           • NCAR Observational Data: 3 TB observational re-analysis data
    43. 48. NDIIPP Chronopolis Project
       • Creating a 3-node federated data grid at SDSC, NCAR and UMD – up to 50 TB data from CDL and ICPSR
       • Installing and testing a suite of monitoring tools using ACE, PAWN, INCA
       • Creating appropriate Transmission Information Packages
       • Generating PREMIS definitions for data
       • Writing Best Practices documents for clients and partners
    44. 49. Chronopolis Grid Framework [Diagram: CDL and ICPSR servers feed the SDSC network (Sun 6140 62 TB, Sun SAM-QFS, tape silos); SRB D-Brokers and SRB MCATs at SDSC, NCAR (12-25 TB Chronopolis data), and UMD (12 TB, Apple Xsan) replicate the collections across the three networks] Adapted from Bryan Banister (SDSC)
    45. 50. NDIIPP Chronopolis Clients - CDL
       • California Digital Library
         • A part of UCOP, supports the University of California libraries
         • Providing up to 25 TB of data: Web-At-Risk project
           • Five years of political and governmental websites
           • ARC files created from web crawls
           • Using BagIt transfer structure
    46. 51. Diagram of CDL Data Transfer [Diagram: a CDL virtual machine at UCB runs parallel wget transfers (files 1-10, 11-20, … file n) with a BagIt manifest, over the UC Berkeley network to the SDSC network and a possible SRB/BagIt module; the Chron staging area then replicates to the Chron repositories at UMIACS and NCAR] Adapted from Bryan Banister (SDSC)
    47. 52. NDIIPP Chronopolis Clients - ICPSR
       • Inter-University Consortium for Political and Social Research, University of Michigan
         • Providing ~12 TB of data: wide variety of types
         • Already working with SDSC using SRB
    48. 53. Diagram of ICPSR Transfer [Diagram: the ICPSR SRB repository at UMich (EMC SAN, files 1 … n) uses parallel Sput/Srsync transfers over the ICPSR network to the SDSC network and the Chron SRB MCAT; the Chron staging area then replicates to the Chron repositories at UMIACS and NCAR] Adapted from Bryan Banister (SDSC)
    49. 54. Ongoing and Future Initiatives
       • Migration of Chronopolis from SRB to iRODS
       • Develop interoperability with community-based archival systems/standards
       • TRAC compliance for SDSC Production Preservation Services/Chronopolis Consortium
    50. 55. Looking for Partnerships
       • Repositories interested in moving large digital collections among heterogeneous repository systems
       • Fedora, DSpace or E-Prints sites interested in managed datagrid storage
       • Institutions interested in personnel swaps to conduct TRAC audit assessment compliance
       • Community needs for mass-scale data transmission and storage
    51. 56. Chronopolis Credits
       • SDSC: Fran Berman, Richard Moore, David Minor, Chris Jordan, Jim D’Aoust, Robert McDonald, Don Sutton, Bryan Banister, Phong Dinh, Jay Dombrowski, Emilio Valente
       • UCSD Libraries: Brian Schottlaender, Luc Declerck, Ardys Kozbial, Brad Westbrook, Arwen Hutt
       • NCAR: Don Middleton, Michael Burek, Linda McGinley
       • UMIACS: Joseph JaJa, Mike Smorul, Mike McGann
       • Library of Congress: Martha Anderson, Lisa Hoppis
       • CACI: Mike Ivey
    53. 61. Chronopolis is ...
       • a geographically distributed preservation environment that supports long-term management and stewardship of digital collections
       • implemented by developing and deploying a distributed data grid, and by supporting its human, policy, and technological infrastructure
       • technology forecasting and migration in support of long-term life-cycle management of the dedicated preservation environment
    54. 62. Chronopolis focuses on ...
       • Assessment of the needs of potential user communities and development of appropriate service models
       • Development of Memoranda of Understanding (MOUs), Service Level Agreements (SLAs), etc. to formalize trust relationships and manage expectations
       • Assessment and prototyping of best practices for bit preservation, authentication, metadata, etc.
       • Development of cost and risk models for long-term preservation
       • Development of appropriate success metrics to evaluate usefulness, reliability, and usability of infrastructure
    55. 63. The people of Chronopolis are ... UCSD Libraries
    56. 64. Organizations need ways to validate trust in 3rd parties In conclusion …
    57. 66. SDSC and the Library of Congress explored one way to do this … by working with Cyberinfrastructure … and demonstrating trust.
    58. 67. With a trusted relationship, many journeys become possible