RDSI Dash Tinman

Data Sharing (DaSh) Programme Tinman – 31st October 2011
SECTION A – CONTEXT AND PURPOSE

The RDSI project was established through the SuperScience investment from the Education Infrastructure Fund (EIF) in the 2009 Federal Budget, and is managed through the Department of Innovation, Industry, Science and Research (DIISR). The detailed objectives, expected outcomes and the process to achieve them are described in the RDSI Project Plan, available from the RDSI website1. Quoting:

The expected benefits of RDSI are to:
• improve the availability of quality research data for sharing and re-use and, as a result, expand the scale and scope of problems that Australian researchers may seek to address;
• improve research efficiency; and
• reduce institutional data storage costs and enable more extensive collaboration.

The infrastructure may also assist institutions to:
• sustain a quality of research in the digital age that includes the reproducibility of results;
• meet the storage requirements of key research activities undertaken at that institution; and
• comply with the research data provisions of Universities Australia's Australian Code for the Responsible Conduct of Research.

The RDSI project is delivered through four key programmes which are jointly coordinated and depend on each other, but are delivered through different and complementary approaches.

• The Node Development (NoDe) programme will establish a small number of physical sites around Australia to provide baseline storage and access services to the research sector.
• The Data Sharing (DaSh) programme will develop the technical architecture for inter-node and node-user data movement, access management and sharing functionality for the sector.
• The Research Data Services (ReDS) programme will support the development of larger collections of value, their infrastructure requirements at nodes and their association with collaboration and analysis facilities.
• The Vendor Panel (VePa) programme provides the public research sector with a set of preferred commercial suppliers for the delivery of storage infrastructure and services, leveraging the economies of scale of both the sector and the RDSI investment.

The intention of the RDSI project is to foster the development of an enduring and sustainable infrastructure, on a cost-effective basis, well beyond the lifetime of the project itself.

This document addresses the DaSh programme, providing a broad outline of the programme itself, its requirements, expectations and deliverables. As a "tinman" model it is not intended to be a final position, but to provide suggestions on points for further discussion and to encourage feedback. It follows the earlier strawman workshop and will form the basis for the final model of the DaSh programme. There will be sector-wide consultation on this tinman model.

1 http://rdsi.uq.edu.au/
SECTION B – SUMMARY OF THE DaSh PROGRAMME

1. Goals of the DaSh programme

The DaSh Programme will build capability to support the sharing and re-use of research data and, as a result, is aimed at expanding the scale and scope of problems that Australian researchers may seek to address. In order to identify what high-performance data sharing and data movement services are needed by the sector, consultations with relevant research sector stakeholders, combined with an evaluation of existing services, will be undertaken during implementation of the project.

2. DaSh programme themes

The DaSh Programme will consist of ten themes, as follows:

• DaShNet – the network connecting nodes to users and to each other
• Federated Authorisation and Service Registration – upgrading the AAF for authorisation
• ReDS Application Processing – automation of application workflows for the ReDS programme
• RDSI DaShBoard – a system to automatically collect and publish Node and Collection metrics
• RDSI Data Fabric – providing common access to collections and working storage for researchers
• RDSI File Systems – establishing file system(s) across nodes with a consistent namespace
• RDSI Data Mover – providing fast data movement between, into and out of nodes
• RDSI StoreGate – a gateway to external public storage
• RDSI DaShLab – an environment to support testing of implementations of, and changes to, RDSI elements
• RDSI Portal – AAF-integrated access to RDSI elements/services with appropriate entitlement

3. Consultation and governance within the DaSh programme

The DaSh Programme will establish a Technical Advisory Committee (TAC) to provide advice to the project on elements of the technical architecture. The TAC will consist of staff from the RDSI project, including the Project Director, Project Manager and DaSh Technical Architect, together with a representative from each confirmed Node. A Technical Reference Group (TRG) will also be established to provide early comments on DaSh designs and proposals.
Membership of the TRG will be open.

4. Development principles for the DaSh programme

Where possible, the project will seek to acquire or re-use software before considering development. If development is required, the RDSI project team will call for expressions of interest from Nodes to undertake the development. Where practical, there will also be calls for expressions of interest from Nodes to host RDSI services, if hosting is required.
SECTION C – DISCUSSION OF THEMES IN THE DaSh PROGRAMME

This section discusses the proposed themes in the DaSh programme, together with implementation considerations.

DaShNet – The network connecting nodes to users and to each other

Proposition

Interconnecting RDSI nodes with the highest available bandwidth will improve the availability of services across RDSI by supporting replication between nodes. Replication will allow the delivery of higher-availability services than could be provided by individual data centres, which are often at Uptime Institute Tier 2 status. Resilience through widely distributed replication is more cost-effective than upgrading data centres. These interconnections will also support rapid data movement between nodes. As the use of RDSI nodes increases, there is a potential for congestion at the access point for each node. This can be alleviated by providing high-bandwidth dedicated access for each node.

Discussion

The goals of DaShNet will be to establish:

(i) a set of interconnections between primary nodes using the fastest available wavelengths across the AARNet backbone;
(ii) a network access connection for each primary node to support dedicated access to the node, also using the fastest available wavelengths across the AARNet backbone; and
(iii) appropriate network connections to each additional node.

It is anticipated that the majority of funding for this theme will be for network equipment and wavelength implementation costs. The initial expectation is that there will be a single source of network equipment for this theme.

Implementation Considerations

DaShNet will be a project proposed by RDSI to the National Research Network (NRN) project, which will look to use the upgraded AARNet backbone. There will, therefore, be early discussion between RDSI, NRN and AARNet.
Federated Authorisation and Service Registration – Upgrading the AAF for authorisation

Proposition

RDSI will require that users be able to use the Australian Access Federation (AAF) for authentication unless there is an agreed exception. However, the mechanisms for granting authorisation to use a resource, such as a collection, would have to be implemented independently by resource managers unless there is a federated approach to authorisation. Implementation of an "Entitlements Service" to support such an approach will benefit users and managers of RDSI infrastructure and collections by eliminating duplication and providing a consistent approach to authorisation. The AAF is the logical home for such an entitlements service.
Discussion

An entitlements service would be a logical extension of the AAF's existing authorisation service and would have wide benefits in providing a consistent approach for a number of eResearch projects, including RDSI and NeCTAR. An entitlements service could either be developed specifically as a direct enhancement of the AAF, or one of a small number of commercially available entitlement systems could be licensed and integrated with the AAF. This will be a crucial service for other developments in RDSI and in other projects; early delivery of at least the interface specifications will therefore be essential. A directory holding authorisation, registration and other service information will be required to support RDSI services, and the design of the directory will depend on the choice of solution for the entitlements service. Part of such a directory might be implemented in an RDSI portal.

Implementation Considerations

Early discussion will be undertaken between RDSI, the AAF and NeCTAR to determine the most appropriate design.

ReDS Application Processing – Automation of application workflows for the ReDS programme

Proposition

There will be a range of applications for allocation of space under the ReDS programme, varying in complexity and size. There will be advantages in both timeliness and workload if the process is automated to the greatest possible extent. This automation will also provide additional benefits by providing a point for automatically capturing and storing the parameters of a collection for use by the RDSI measurement and monitoring processes.

Discussion

The design of this application will be determined by developments in the ReDS programme, which will establish agreed levels of delegation and automation, and by the requirements of the RDSI DaShBoard, which will determine the data to be captured for monitoring and measurement.
The application may involve either bespoke development or licensing a commercial product for modification and integration. It will need to integrate with the RDSI portal and will influence the definition of metrics about collections to be used by the RDSI DaShBoard.

Implementation Considerations

The RDSI project team will develop a specification for this application, taking into account the requirements of other DaSh themes. After an initial market survey of available commercial offerings, a specification for ReDS Application Processing will be developed, and expressions of interest will then be sought from confirmed RDSI nodes to develop, integrate and host the application as appropriate. It is anticipated that ReDS Application Processing would be integrated with the RDSI Portal and that potential users would access it through the portal.
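One benefit of automating the application workflow, noted above, is that a collection's parameters can be captured in a structured form at submission time and handed straight to the measurement and monitoring processes. The following Python sketch illustrates the idea only; the field names are hypothetical, not drawn from any ReDS specification:

```python
from dataclasses import dataclass, asdict

@dataclass
class CollectionApplication:
    """One ReDS storage-allocation application, captured at submission time.

    All field names here are illustrative; the real parameters would be
    defined by the ReDS programme and the DaShBoard metric definitions.
    """
    collection_name: str
    applicant: str
    requested_tb: float   # requested allocation, in terabytes
    node: str             # preferred hosting node (hypothetical identifier)

    def monitoring_record(self) -> dict:
        """Flatten the application into a record a dashboard could ingest."""
        return asdict(self)

app = CollectionApplication("Coastal Imagery", "j.bloggs", 250.0, "node-qld")
record = app.monitoring_record()
```

The point of the sketch is the single capture point: the same submitted record serves both the allocation decision and later monitoring, with no manual re-entry.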
RDSI DaShBoard – A system to automatically collect and publish Node and Collection metrics

Proposition

The project plan describes the process of establishing trust in the collection of RDSI nodes by openly and transparently publishing performance against agreed service levels and metrics for both nodes and collections. The RDSI DaShBoard will automate the process of collecting and publishing this data to support the production of timely information with low levels of manual intervention.

Discussion

Metrics for the monitoring of nodes will be jointly developed with the Node Development programme, and metrics for monitoring collections will be jointly developed with the ReDS programme. A common protocol for the transmission of monitoring data to the RDSI DaShBoard must also be developed. It is anticipated that the DaShBoard would be part of the RDSI Portal and would collect and display information from all RDSI Nodes automatically and on a regular basis. This information would relate to both nodes and collections. The DaShBoard could be developed specifically for RDSI, or commercial software could be licensed and integrated with the RDSI Portal.

Implementation Considerations

After an initial market survey of available commercial offerings, a specification for the RDSI DaShBoard will be developed, and expressions of interest will then be sought from confirmed RDSI nodes to develop, integrate and host the application as appropriate.

RDSI Data Fabric – Providing common access to collections and working storage for researchers

Proposition

As described in the RDSI Project Plan, one of the objectives for the DaSh programme is to provide a consistent interface for researchers to collections; it is understood that this may be only one of a number of interfaces, depending on the nature and uses of the collection.
At the same time, it is helpful for researchers to also have access to some easily accessible storage, through the same collaborative interface, to support their access to, use of, and development of collections. This storage must support easy collaboration between researchers. The RDSI Data Fabric will be the means of achieving these objectives.

Discussion

The ARCS Data Fabric successfully provides a consistent interface to collaborative storage using iRODS middleware, and iRODS has significant functionality to support a consistent interface to distributed collections at RDSI nodes and elsewhere. The ARCS Data Fabric implements its own arrangements for entitlements and uses other ARCS functionality to establish service registration through an LDAP directory, which also supports the additional identity credentials needed for WebDAV access to the Data Fabric. Whilst there is a goal of migrating functionality from the ARCS Data Fabric to the RDSI Data Fabric, this does not necessarily imply that the technology solution will be the same, and there may well be benefits in ensuring that the RDSI Data Fabric integrates with and uses the entitlements service described earlier. Furthermore, a number of commercial solutions have emerged over the last year, some having tight coupling with an entitlements service.

After an initial investigation, and potentially testing, of existing open source solutions and commercially available products, the RDSI project team will work with iVEC, the existing development and support group for the ARCS Data Fabric, to develop an appropriate specification, which will then be the subject of consultation with the sector. In the event that a commercially available product is chosen, joint work will be undertaken with the Vendor Panel (VePa) programme in relation to establishing a panel for procurement.

Implementation Considerations

After the development of an initial specification, the RDSI Data Fabric would be established by a call for expressions of interest from confirmed RDSI nodes: one node to undertake any development or integration, and three nodes to host it. The developing node might also be one of the three hosting nodes.

RDSI File Systems – Establishing File System(s) across nodes with a consistent namespace

Proposition

For some applications, a distributed file system providing a consistent namespace within and between nodes may be required to provide increased levels of durability. In addition, a file system with enhanced levels of security may be necessary if nodes are to host data collections with higher levels of confidentiality.

Discussion

The RDSI File Systems theme will work with the RDSI Research Data Managers, the ReDS Programme Manager and other stakeholders to develop appropriate use cases, whilst the DaSh Technical Architect will identify and, where feasible, test different options. These may include open source file systems or commercially available file systems that could be licensed by the sector. In the latter case, the VePa programme will be leveraged to establish an appropriate panel of vendors.
The use cases and technical options will be discussed with confirmed nodes and other interested stakeholders before a requirements specification is developed.

Implementation Considerations

Once a requirements specification has been developed, the RDSI project team will discuss implementation options with confirmed nodes.

RDSI Data Mover – Providing fast data movement between, into and out of nodes

Proposition

Researcher-accessible tools to efficiently move data between nodes, into nodes and out of nodes will be of benefit to users of RDSI services.
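As a rough illustration of the submit-and-continue model such a tool implies, the Python sketch below (all names hypothetical) accepts a transfer job, returns a ticket immediately, and performs the work in a background thread while the caller remains free to do other work and poll for completion:

```python
import queue
import threading
import time
import uuid

class TransferService:
    """Toy third-party transfer service: submit a job, get a ticket, poll.

    A real data mover would also handle restarts, checksumming, parallel
    streams and completion notification; this only shows the control flow.
    """

    def __init__(self) -> None:
        self._jobs: dict[str, str] = {}       # ticket -> status
        self._queue: queue.Queue = queue.Queue()
        worker = threading.Thread(target=self._run, daemon=True)
        worker.start()

    def submit(self, source: str, destination: str) -> str:
        job_id = str(uuid.uuid4())
        self._jobs[job_id] = "queued"
        self._queue.put((job_id, source, destination))
        return job_id                          # caller is free to continue now

    def status(self, job_id: str) -> str:
        return self._jobs[job_id]

    def _run(self) -> None:
        while True:
            job_id, source, destination = self._queue.get()
            self._jobs[job_id] = "running"
            # ... the actual data movement between nodes would happen here ...
            self._jobs[job_id] = "done"

svc = TransferService()
ticket = svc.submit("node-a:/coll/genome", "node-b:/coll/genome")
while svc.status(ticket) != "done":            # polling stands in for notification
    time.sleep(0.01)
```

The essential property is that `submit` returns at once; for data sets of hundreds of terabytes the transfer itself may run for days, which is exactly why a third-party service rather than a blocking client-side copy is needed.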
Discussion

As the size of data sets scales up to hundreds of terabytes and potentially petabytes, existing tools to ingest data into nodes, extract data from nodes or move it between nodes are severely challenged. In particular, for larger data movements, a third-party transfer service is needed so that a researcher can submit a transfer and then continue to use their own computing resources whilst waiting for notification of completion. For efficiency, the process needs to be user-driven and substantially automated.

Implementation Considerations

An investigation, and potentially testing, of existing open source solutions and the small number of commercially available products will be undertaken and published. After consultation with nodes and other interested parties, a specification for development, or for licensing and integration of a commercially available product, will be developed.

RDSI StoreGate – A gateway to external public storage

Proposition

Researchers will benefit from streamlined access to one or more external public storage clouds, both as a means of storing appropriate research data in the cloud and for accessing relevant services in public storage clouds. One potential use could be for additional copies of data that are stored at RDSI nodes; however, there are a number of use cases. A particular benefit of using external public cloud storage is that it is an "on demand" service which can meet short-term needs with a fast provisioning time (often minutes), little upper limit on capacity and an ability to pay only for the time that the storage is actually needed. Potential users of external public storage will still need to pay for such storage; the proposition is that it will be faster and cheaper to access, and that it may reduce the proliferation of small pools of external storage, each with its own identity credentials.

Discussion

The RDSI project, owing to the nature of its funding, cannot fund public cloud storage.
However, to facilitate the use of public storage for research purposes, it can develop a gateway for connection to a number of external public storage cloud providers. Use of such external storage encounters three principal difficulties: performance, proliferation and cost.

Performance issues arise from accessing external storage over the public internet rather than taking advantage of the dedicated high bandwidth available across the Australian Research and Education Network (AREN). RDSI StoreGate would seek to address this issue by attempting to facilitate peering of a number of external public storage providers with the AREN.

There is anecdotal evidence of significant existing use of external public storage providers for research data. An example would be the use of Dropbox, which stores its data in Amazon's storage service. The proliferation of individual Dropbox accounts, which do not integrate with other services in the sector such as the AAF, forms a barrier to collaboration. RDSI StoreGate will investigate options to improve integration.

The cost of using external public storage often breaks down into three components: a network traffic charge; a cost for moving data into and out of the external storage; and the cost of the storage itself. The first of these could be eliminated by peering a number of storage providers with the AREN as described earlier. The second is a function of location and the content delivery networks used by external storage providers. It may also be improved by peering, but it would greatly benefit from the ability to access Australian-based providers. Both this and the third element of cost (the storage itself) are susceptible to price reduction through demand aggregation. By working with the RDSI Vendor Panel (VePa) programme to create a panel of external storage providers, RDSI StoreGate is intended to reduce costs through the aggregation of demand for external public storage.

Implementation Considerations

Internet2 recently announced its Net+ services, which include some form of aggregated access to the external storage providers Box.net and HP in the United States. The RDSI project team will review available providers in Australia, work closely with AARNet, Internet2 and others in developing a specification for RDSI StoreGate, and work closely with the VePa programme to construct a panel of external public storage providers. Implementation options will be developed after these stages.

RDSI DaShLab – An environment to support testing of implementations of, and changes to, RDSI elements

Proposition

The DaSh Technical Architect, together with the Technical Architects from each of the nodes, will benefit from the ability to test implementations of, and changes to, the RDSI Technical Architecture.

Discussion

The RDSI Nodes, and the network between them, present a unique environment which cannot easily be replicated by any individual node or institution. Successful implementation of infrastructure and applications will depend on an ability to undertake meaningful testing. By establishing a test environment, or testbed, which spans a number of nodes, it will be possible to support the testing of infrastructure and applications in a realistic environment. DaShLab will be the test environment spanning the Nodes.
Implementation Considerations

After the development of an initial specification, DaShLab would be established by a call for expressions of interest from confirmed RDSI nodes, with a target minimum of two nodes and no maximum number. The DaSh programme would fund infrastructure at the nodes to facilitate the development of DaShLab.

RDSI Portal – AAF-integrated access to RDSI elements/services with appropriate entitlement

Proposition

An RDSI Portal will be an effective means of integrating access to all RDSI services, including those described within other DaSh programme themes. It may also be effective as an integration point with other eResearch project services.

Discussion
The design of the RDSI Portal will be strongly dependent on developments within the other RDSI themes with which it must integrate. It must clearly be integrated with the AAF and with the Entitlements Service described earlier. Depending on the design of the Entitlements Service, it may be necessary for the RDSI Portal to hold directory information about service or resource registration. The portal may involve either bespoke development or licensing a commercial product for modification and integration.

Implementation Considerations

The RDSI project team will develop a specification for the RDSI Portal, taking into account the requirements of other DaSh themes. After an initial market survey of available commercial offerings, a specification for the portal will be developed, and expressions of interest will then be sought from confirmed RDSI nodes to develop, integrate and host the application as appropriate.

SECTION D – IN-DEPTH DISCUSSION OF DaSh PROGRAMME ELEMENTS

This section explores underlying components of the DaSh programme in depth. It is presented to underpin, extend and enhance the discussion of the DaSh programme themes, which were described earlier in summary form. The topics discussed in this section are generally applicable to more than one theme; it is not, therefore, intended that there be a one-to-one correspondence between these topics and the themes.

1. Identity, Authentication and Authorization within RDSI

The Australian Higher Education and Research sectors, like those of many other countries, have a SAML v2-based trust federation, called the Australian Access Federation (AAF). This technology allows university staff, students and researchers to access applications using the credentials issued to them by their institutions.
By later proving possession of, and control over, these credentials during an act of authentication at the institution's Identity Provider (IdP), the binding between the end-user and their digital identity is also proven at some level of assurance. At a simpler level, institutions manufacture the digital identities of staff, students and researchers within their institution, based on information within their systems of record, such as HR and SIS systems, using some form of identity and access management process tailored to that institution. The end-user's digital identity is composed of all the relevant attributes that may potentially be used to provide access to a resource.

The AAF provides a mechanism, based on the SAML v2 specification, to assert some components of an end-user's digital identity and transport them securely to a service provider (SP) so that the resource owner can make an informed authorization decision about allowing the end-user to access that resource. No matter what resources an end-user wishes to access, access is always based on the end-user's digital identity. Effectively, an end-user's digital identity is a constant across all the SPs in the federation.

An end-user's digital identity can in some cases be supplemented by other Identity Providers outside the province of the end-user's institution. While this allows for a more expressive attribute economy for authorization, it does create some policy and technical issues. Ideally, an institution's IdP should only assert attributes within its own province and identity process. Asserting attributes outside one's province diminishes the level of assurance of those attributes. Secondly, there is an issue of scale at the institutions themselves. As an example, consider an attribute whose presence in a SAML assertion informs an SP that a group of researchers from several institutions can access a resource. Using only institutional IdPs to achieve this, the attribute must be present in the digital identity of every member of the group. Coordination at this level over a potentially large number of institutions and people is somewhat erratic. However, if this attribute were asserted by a single non-institutional IdP, the scaling problem is minimized.

The AAF is in the process of creating a National Entitlement Service which will allow principal investigators and people of similar ilk to create entitlements linked to end-users, which can be asserted to an SP in addition to the institution's assertion and used by the SP for fine-grained authorization. RDSI will leverage this service in much of its web-browser-based applications. Node operators should also follow RDSI's lead and, where appropriate, use the AAF to authenticate to Node-based service providers.

In fact, it is one of the prime principles of the RDSI project to directly attempt to use AAF federated identity to access both web-browser-based and non-web-browser-based applications. However, the typical SAML v2 authentication profiles used in the AAF do not work well for applications that are not based on the web browser metaphor, which entails the use of HTTP cookies and redirects. Examples of such applications in RDSI's circle of interest are:

• WebDAV
• i-commands for iRODS
• XMPP/Jabber (XMPP is one of the potential protocols for the management and control of cloud resources)
• SSH
• mounting and accessing file systems
• accessing databases
• MyProxy (a service for issuing X.509 certificates for Grid computing)
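The attribute-based authorization model described above, in which an SP combines assertions from an institutional IdP and a non-institutional attribute authority and checks for a required entitlement, can be sketched in a few lines. This is an illustration only: the entitlement URN is invented, and a real SP would consume signed SAML assertions rather than Python dictionaries, but `eduPersonEntitlement` is the standard multi-valued attribute such a scheme would use:

```python
# Hypothetical entitlement value a resource owner might require for access.
REQUIRED = "urn:mace:rdsi.example.org:collection:coral-reef:read"

def merge_assertions(*assertions: dict) -> dict:
    """Merge attribute assertions from several IdPs into one digital identity.

    Attributes are multi-valued, so values from different sources accumulate.
    """
    identity: dict = {}
    for assertion in assertions:
        for attr, values in assertion.items():
            identity.setdefault(attr, set()).update(values)
    return identity

def authorize(identity: dict) -> bool:
    """The SP's decision: does the merged identity carry the entitlement?"""
    return REQUIRED in identity.get("eduPersonEntitlement", set())

# Institutional IdP asserts who Bob is; a separate entitlements IdP asserts
# what this research group may access, avoiding per-institution coordination.
institutional_idp = {"displayName": {"Bob"}, "eduPersonEntitlement": set()}
entitlements_idp = {"eduPersonEntitlement": {REQUIRED}}

allowed = authorize(merge_assertions(institutional_idp, entitlements_idp))
denied = authorize(merge_assertions(institutional_idp))
```

The sketch mirrors the scaling argument in the text: the group entitlement lives in one non-institutional assertion rather than being provisioned into every member's institutional identity.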
Fortunately, there are initiatives already in play to develop the ability to use a federated credential to access these applications and services. For example, the iPlant Collaborative <http://www.iplantcollaborative.org> is using work based on Project Moonshot to provide federated authentication for the iRODS i-commands on the command line. (The Project Moonshot work connects GSS-API with RADIUS and EAP to achieve this.) There is also work to use the SAML v2 Enhanced Client or Proxy (ECP) profile to achieve a similar result. RDSI and the AAF will work together to advance these innovative authentication and authorization initiatives for use within RDSI and its sister projects. Unfortunately, these emerging technologies are still a little rough around the edges and may not be available for production use within RDSI. Until they mature, contemporary solutions to some of these access issues will be needed.

Additionally, the movement of data has been a significant component of Grid computing for some time, and many applications have been developed to provide these services using the Globus Toolkit. These Grid services typically use the Grid Security Infrastructure (GSI) to provide authentication and authorization using X.509 certificates. While the Globus tools are, in some people's eyes, overly complicated, it would be a mistake to ignore an existing production infrastructure that does do the job. For this reason, GSI will be a significant component of authentication and authorization in RDSI Nodes.

2. Identity, Authentication and Authorization within Data Storage Systems

The underlying concept in the previous section, of an end-user's digital identity being constant across all service providers, is a powerful one. But how does one provide the analogous concept of an end-user's digital identity being constant across all data storage systems, within RDSI Nodes, RDSI's sister projects and other programs? One way of achieving this goal is to synchronize all participating data storage systems to use a common identity layer. One such layer could be implemented using an LDAP directory service common across all participating data storage systems. This, in concert with the Pluggable Authentication Modules (PAM) mechanism, which is standard in almost all Unix-like systems, can provide such an identity layer.

We will also concentrate on the Portable Operating System Interface (POSIX) series of standards, as this covers most UNIX systems, as well as Microsoft Windows systems if the Microsoft Windows Services for UNIX (SFU) component is installed. It should be noted that POSIX/Unix semantics are different from the Microsoft Windows/NFS semantics, but with SFU installed the core identity layer should be consistent over both UNIX and Windows.

POSIX systems link a user to a numeric ID, called the UID, and link a collection of users to a group identified by a numeric ID called the GID. In this POSIX representation, the user names and group names are only there as crutches for the "wetware" that use these systems. It is the numeric values of the UID and GID that matter in file system operations.
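The numeric basis of this model is easy to demonstrate with the standard library. The following sketch creates a file, sets its permission bits, and reads back the numeric owner, group and mode from the inode; these are the kinds of values a DSIL-style identity layer would need to keep consistent across systems:

```python
import os
import stat
import tempfile

# Create a scratch file and set a known permission pattern on it.
fd, path = tempfile.mkstemp()
os.close(fd)
os.chmod(path, 0o640)          # rw- r-- ---: owner read/write, group read

info = os.stat(path)
owner_uid = info.st_uid        # numeric UID stored in the inode, not a name
owner_gid = info.st_gid        # numeric GID, likewise
mode = stat.S_IMODE(info.st_mode)

# Access decisions are taken against these bits and numeric IDs; usernames
# are resolved elsewhere (e.g. in an LDAP directory for DSIL).
group_can_read = bool(mode & stat.S_IRGRP)
others_can_read = bool(mode & stat.S_IROTH)

os.unlink(path)
```

If the same numeric UID means different people on two file systems, the permission bits above silently grant access to the wrong person, which is precisely the remapping hazard DSIL is meant to remove.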
Access to files or directories is based on the UID, the GID, the permissions of the file (which are stored in the inode of the file or directory) and the credentials used to prove the identity of the user. An LDAP directory or Active Directory server can store these mappings of users to UIDs and collections of users to GIDs, as well as the credentials used to authenticate the end-user. A range of other useful information can be stored using LDAP schemas such as RFC 2307.

This Data Storage Identity Layer (DSIL) will provide a consistent user/UID and group/GID namespace which can be plugged into both remote and local file systems using the likes of PAM and the DSIL LDAP server, so that remote and local file systems share the same semantics for a particular UID or GID without any remapping. Administrators of the local or remote file systems that use these mappings can provision user accounts as long as they do not degrade the semantics of the mappings. For instance, if the user Bob has a UID/GID of 12345/67890 as defined in the DSIL LDAP directory, any account provisioned for Bob on any participating file system must have a username of Bob, a UID of numeric value 12345 and a default GID of numeric value 67890. Additional attributes related to the provisioning of accounts, such as the home directory, the GECOS field and the preferred shell, are in the domain of the local or remote administrators. As OpenLDAP is likely to be chosen as the DSIL LDAP service, a local administrator may host a local leaf-node LDAP replica using the OpenLDAP translucent proxy and rewrite the non-mandatory attributes on the fly.
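The provisioning rule in the paragraph above can be expressed as a simple consistency check. In this sketch, a plain dictionary stands in for a lookup against the DSIL LDAP directory (the entry for Bob uses the UID/GID values from the example above; everything else about the sketch is hypothetical):

```python
# Stand-in for the DSIL LDAP directory: username -> (uidNumber, gidNumber).
DSIL_DIRECTORY = {"bob": (12345, 67890)}

def provisioning_is_consistent(username: str, uid: int, gid: int) -> bool:
    """Check a locally provisioned account against the central DSIL entry.

    Username, UID and primary GID must match the directory exactly; other
    attributes (home directory, shell, GECOS) are left to local admins.
    """
    entry = DSIL_DIRECTORY.get(username)
    return entry is not None and entry == (uid, gid)

ok = provisioning_is_consistent("bob", 12345, 67890)       # matches DSIL
bad_uid = provisioning_is_consistent("bob", 54321, 67890)  # degrades mapping
```

A participating file system that ran such a check at account-provisioning time could refuse any local account that would degrade the shared UID/GID semantics.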
  • 13. 13 An interface within the RDSI portal will also be provided for those who do not want to use the DSIL service and instead provide their own UID/GID mappings. This interface will allow such users to design their own mapping options, which they will use when mounting a remote file system onto their local file system. In both these cases it is typically the root user that mounts these remote file systems. There must be a certain level of trust in this act amongst all parties. It should be noted that without the consistent namespace provided by DSIL, the sharing of data through a remote file system is made more difficult, as individual UIDs and GIDs may have to be remapped from the remote file system's UIDs/GIDs to the local file system's UIDs/GIDs so as to receive the full benefit of the remote file system. Taking this into account, and the fact that there are many such file systems in the Australian Higher Education and Research sectors, UID/GID remapping is an unscalable and piecemeal solution. Additionally there are issues concerning the confidentiality and integrity of the data exported from or imported to remote file systems. File systems like NFSv4.1 provide GSS-API mechanisms to ensure the confidentiality and integrity of the data without the likes of TLS, GRE or SSH tunnelling; however many file systems do not. More on this topic will be discussed in later sections. The central component of DSIL is a well-replicated LDAP service using various typical LDAP schemas, including the likes of RFCs 2307 and 2377. The directory needs to be populated with identity information from both end-users' IdPs and various Attribute Authorities that may have additional identity information outside the scope of their institution. A prototypical workflow of a person named Bob wishing to register with DSIL is as follows: (i) Bob uses his credentials issued by his institution to access the RDSI portal using the AAF infrastructure for the first time. 
Bob's email address, surname, given name and other appropriate attributes are asserted in the SAML payload to the RDSI DSIL Registration portal. As this is Bob's first visit to the portal, it requires Bob to nominate two unique usernames. The first is an 8-character username based on the original POSIX standard. This will provide a compatibility level over all POSIX-based systems where required. The second username will be a long username of potentially 256 characters. Both of these accounts will be linked. While the DNs of the two LDAP entries differ, the important attributes like the uidNumber, gidNumber, etc will be provisioned uniquely and replicated to both accounts. (It should also be noted that most modern systems use unsigned 32-bit integers to store UIDs and GIDs. This potentially provides the DSIL directory service a maximum of roughly 4 billion accounts and 4 billion groups. To ensure that system accounts don't collide with DSIL, end-user accounts will start with UIDs and GIDs of 1,000,000.) (ii) Bob must also provide a new password for this account. Passwords will be stored in a Kerberos v5 KDC (Key Distribution Centre), which will impose strong passwords and strong hashes (like AES) as defined by RDSI policy. Kerberos v5 pre-authentication will also be enabled to reduce the risk of compromised hashes. This password will also be subject to a password aging regime as per RDSI policy. As the password nears expiry Bob will receive a series of emails prompting him to access the RDSI portal password management system to restart the aging
  • 14. 14 process. Ignoring these emails will trigger the archiving of Bob's account. (iii) Bob at this time (or any other time) can upload and manage his SSH public keys. A patched version of SSH, namely OpenSSH-LPK, provides an easy way of centralizing strong user authentication by using an LDAP server for retrieving public keys instead of ~/.ssh/authorized_keys. This allows the de-provisioning of a user's SSH access at a single point. (iv) Bob has at a previous time accessed the AAF National Entitlement Service, where he has defined and managed a set of Australian and New Zealand Standard Research Classification codes which represent the research disciplines that Bob is interested in. This information, coupled with similar codes related to research data, will allow RDSI to track in a broad sense how researchers use research collections and allow RDSI to tune the use of RDSI and Nodes over the project. As part of the SAML workflow the AAF National Entitlement Service Attribute Authority will be queried to add these ANZSRC codes to the SAML assertion. (v) Bob at this time (or any other time) can request to be a group coordinator. A new unique GID will be provided to Bob and he will be able to invite other users to register with the DSIL services (if needed) and join his group. At this time the group only exists as an entry in the DSIL directory. To provision this group access control, a local or remote administrator must change the GID of the file or directory to the new GID. Bob can also define other group coordinators who will have the same rights as Bob within that group. (vi) Bob can only manage the password of his DSIL account after a successful federated authentication to the RDSI portal password management system. System administrators should be very reluctant to change Bob's password. This will ensure that his DSIL credentials maintain an appropriate level of authentication assurance. 
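Conceptually, the OpenSSH-LPK approach in step (iii) replaces per-host ~/.ssh/authorized_keys files with a single directory lookup. The sketch below simulates that lookup against in-memory entries; the entry format follows the sshPublicKey attribute shown later, but the code is an illustration, not the LPK implementation:

```python
# Illustrative stand-in for the LDAP lookup OpenSSH-LPK performs:
# fetch all sshPublicKey values for a uid and render them in
# authorized_keys format. De-provisioning the user in the directory
# removes SSH access everywhere at once.

DIRECTORY = [
    {"uid": "bobuser", "sshPublicKey": ["ssh-dss AAAAB3...", "ssh-dss AAAAM5..."]},
    {"uid": "eveuser", "sshPublicKey": []},  # de-provisioned: no keys
]

def authorized_keys(uid):
    """Return the authorized_keys lines for `uid`, one key per line."""
    for entry in DIRECTORY:
        if entry["uid"] == uid:
            return "\n".join(entry["sshPublicKey"])
    return ""

bob_keys = authorized_keys("bobuser")
```

The design point is that key material lives in exactly one place, so revocation is a single directory update rather than a sweep across every host.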
Through end-users registering with the DSIL service, RDSI will organically grow a database of identity information that will span both web-browser-based applications and data storage systems. 3. RDSI Portal The RDSI portal is one of the major web applications within the RDSI ecosystem. It will provide several federated services which will be of use to the end users of RDSI and potentially other sister projects. These services are detailed below. DSIL Registration Portal The purpose of the DSIL Registration portal is to extend the digital identity of an end-user into the realm of data storage by adding attributes that describe the end-user in a file system. Once an end-user has authenticated to the portal an LDAP entry is created in the DSIL directory. Such a directory entry might look like:
dn: uid=bobuser,ou=users,dc=dsil,dc=rdsi,dc=edu,dc=au
objectclass: top
objectclass: person
objectclass: organizationalPerson
objectclass: inetOrgPerson
objectclass: posixAccount
objectclass: ldapPublicKey
  • 15. 15
description: Bob User's Account
userPassword: {KERBEROS}bobuser@RDSI.EDU.AU
cn: Bob User
sn: User
givenname: Bob
mail: bob@domain.com
manager: uid=bobsboss,ou=users,dc=dsil,dc=rdsi,dc=edu,dc=au
uid: bobuser
uidNumber: 1234500
gidNumber: 6789000
homeDirectory: /home/bobuser
sshPublicKey: ssh-dss AAAAB3...
sshPublicKey: ssh-dss AAAAM5...
At this stage the entry describes only a potential account. A system administrator of a file system that participates in the use of the DSIL service must provision the account as they see fit, as long as they do not degrade the semantics of the user/UID and group/GID mappings. Account Management Portal End-users will need to manage their own DSIL credentials in line with the RDSI password policy. This portal will ensure that DSIL credentials are sufficiently strong over the period of the password aging process. Previously used passwords will be rejected as new passwords. The DSIL service will initially attempt to achieve a level of authentication assurance similar to the NIST 800-63/Liberty Identity Assurance Framework standard at level 2. Levels of identity assurance are asserted by the end-user's IdP and transported to the DSIL service in the payload of a SAML assertion, where they will be encoded within the eduPersonAssurance attribute. As there will be password aging associated with their DSIL credentials, end-users might lose access to the DSIL service because they have ignored the series of emails warning of their account's impending doom. Moreover there are situations where end-users just fall off the map. End-users should nominate in this portal the email address of a colleague or similarly trusted person, so that if a person cannot respond to a warning email another may respond so as to fend off the account archiving process. Group management Portal Group management is a crucial component of any research activity. 
Once you have proven your identity to a service provider or relying party, some authorisation process kicks in to determine if you have the rights to access it. This is true from the large scale of the Large Hadron Collider to the much smaller scale of a simple file system containing protected data. Authorization in a file system is typically achieved using groups. If a user is in the right group and the permissions of a file or directory allow access to that group, then the user can access the file or directory. As described above, a group coordinator defines the need of a particular group to provide access control to a resource. These groups and their members must be added to the DSIL directory and await the provisioning of this group to a resource (i.e. file or directory) by a local or remote administrator.
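The group-based authorization just described reduces to the POSIX mode-bit check; a simplified sketch (root, ACLs and other subtleties deliberately omitted) makes the owner/group/other ordering explicit:

```python
import stat

def may_read(file_uid, file_gid, mode, user_uid, user_gids):
    """Simplified POSIX read check: owner class first, then group,
    then other. `user_gids` is the user's set of group memberships."""
    if user_uid == file_uid:
        return bool(mode & stat.S_IRUSR)
    if file_gid in user_gids:
        return bool(mode & stat.S_IRGRP)
    return bool(mode & stat.S_IROTH)

# A file owned by uid 12345 / gid 67890 with mode 0640:
assert may_read(12345, 67890, 0o640, 12345, {67890})      # owner reads
assert may_read(12345, 67890, 0o640, 55555, {67890})      # group member reads
assert not may_read(12345, 67890, 0o640, 55555, {11111})  # everyone else denied
```

This is exactly why the DSIL GID must be applied to the file or directory by an administrator before the group in the directory confers any real access.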
  • 16. 16 However there are many additional ways to provide authorization information so as to create a richer palette of mechanisms than just file system groups, especially as the DSIL directory will contain both locally and institutionally sourced attributes. One means to manage all this authorization data within the DSIL directory may be to provide an instance of Internet2's Grouper Groups Management Toolkit v2.0. Such a directory entry for a group might look like:
dn: cn=bobsgroup,ou=groups,dc=dsil,dc=rdsi,dc=edu,dc=au
objectclass: top
objectclass: posixGroup
description: Bobs Group
cn: bobsgroup
gidNumber: 1000000
memberUid: bobuser
memberUid: eveuser
RDSI Attribute Authority A (SAML v2) Attribute Authority is an effective way of having your authorization cake and eating it as well. When an institutional IdP asserts a set of attributes to an SP, it should only assert information that is within the institution's scope. However there may be other sources of authorization information pertaining to the authenticated end-user of the institutional IdP which may provide extra information to an SP. Mashing these sets of authorization data together provides a richer palette of authorization possibilities. An Attribute Authority (AA) provides a secondary source of attributes to be asserted with the payload of the institutional IdP. An AA is somewhat like a lobotomized IdP and is usually backed by an LDAP server; in this case the DSIL directory. The management of the AA attributes can be provided by applications like the Internet2 Grouper Groups Management Toolkit v2.0, as described above, and will allow delegated individuals to access and manage various authorization data. When programs like Project Moonshot reach a certain level of production quality the RDSI AA will be ready to provide direct authentication and authorization using institutional credentials. 
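The "mashing together" of institutional IdP attributes with AA-sourced attributes can be pictured as a merge of two attribute sets describing the same principal. The attribute names and values below are hypothetical, not a real SAML payload:

```python
# Illustrative merge of two SAML attribute sets: the institutional IdP
# asserts what lies within its scope; the RDSI Attribute Authority adds
# DSIL-sourced extras (groups, ANZSRC codes, ...). Names are invented.

idp_attributes = {
    "eduPersonPrincipalName": "bob@example.edu.au",
    "mail": "bob@example.edu.au",
    "eduPersonAssurance": "2",
}
aa_attributes = {
    "dsilGroups": ["bobsgroup"],
    "anzsrcCodes": ["0801", "0806"],
}

def merged_assertion(idp, aa):
    """Combine both sources; the IdP stays authoritative on any clash."""
    combined = dict(aa)
    combined.update(idp)  # IdP values win for overlapping attributes
    return combined

attrs = merged_assertion(idp_attributes, aa_attributes)
```

The precedence choice (IdP wins) is one reasonable policy; a real deployment would define this per attribute.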
ReDs Portal The ReDs portal, a component of the RDSI portal, allows collection owners and data curators to submit their data for merit-based ReDs funding so as to offset the cost of storing their data at the various RDSI nodes. Using the collection owner's or data curator's federated identity, the portal will allow them to upload sufficient information so that the RDSI Resource Allocation Panel can assess the merit of the submission. The information required for the submission is detailed in the ReDs program. Collection owners and data curators will be able to track the progression of their ReDs bid through the portal. All formal communications between ReDs bidders and the RDSI Resource Allocation Panel will also be tracked. Monitoring and Analytics As in any business it is important to maintain a constant vigil over the metrics that describe the health of the business so as to maximize its profits. In a similar way the RDSI project also needs to keep a close eye on the metrics that describe its health. The RDSI ecosystem consists of many entities such as
  • 17. 17 potential and successful Node bidders, collection owners and data custodians, potential and successful ReDs bidders and of course the end-user researchers as well. All these entities need to have sufficient information so as to make their component of RDSI a success and therefore the whole project a success. RDSI will ensure that these metrics are monitored and provided as openly as possible to all. My Node Portal As stated above, the RDSI ReDs program provides funding for the storage of significant data collections. Once a collection owner or data curator has been successful in their ReDs bid they have to store the collection in one of the RDSI Nodes. The choice of which node is of course up to the successful bidder. A conscientious bidder would need to take into account many factors concerning the way a particular node functions as a business, or how their collections would suit a node that specializes around a set of disciplines. Such information is typically elusive. To aid successful ReDs bidders in making an informed choice RDSI will ensure that sufficient information is available to them. RDSI Nodes must supply up-to-date detailed information and metrics concerning their operations. This information will be displayed on the My Node portal, a component of the RDSI portal. All RDSI nodes will be required to regularly collect various selections of information concerning all the facets of a node's operations. This data will be transferred to the My Node portal and displayed in an intuitive manner. 4. RDSI Analytics It is of considerable importance for RDSI itself to monitor how researchers of various disciplines interact with data sets produced by various disciplines. While collecting this information at the level of the individual researcher would be overbearing and raise researcher privacy issues, at a discipline level it can provide information that will enhance the success of the RDSI project. 
Relating the Australian and New Zealand Standard Research Classification (ANZSRC) codes of researchers to the same codes associated with the data sets as metadata should provide de-identified data that will help the RDSI project to measure its success. Knowledge Management The sharing of knowledge is an important process in research. Without this sharing the efficiency of research endeavours would be much curtailed and researchers would spend significant time re-inventing the wheel. The RDSI portal will provide a wiki so as to allow researchers to share the data-related tricks of their trades. The wiki will also allow researchers to document how they create, use and store their data. This may well produce productive synergies between various researchers and even between disciplines themselves. It is also important for RDSI and Node operators to have a good understanding of the data practices of researchers and disciplines so as to meet their needs.
  • 18. 18 5. DaShNet Moving data from an RDSI Node to a researcher, or ingesting a researcher's new data collection into an RDSI Node, will be one of the "meat and potatoes" daily operations within the RDSI project. However these daily operations are fraught with consequences, especially if the volume of data to be transferred is large. If there is insufficient network bandwidth and/or high network latency between the researcher and the data they are trying to access, the efficiency of the research process will deteriorate. Researchers usually have many activities "on the go" and they will typically move on to another activity while waiting for a long data transfer to finish. Getting back to the original activity may take some time, or in some cases never happen. As an example, consider this: most Australian universities have at minimum a 1Gbps connection and at maximum a 10Gbps connection. Transferring 1TB of data will take either slightly over 2 hours at 1Gbps or 13 minutes at 10Gbps. In this scenario a researcher will probably just go out for a cup of coffee rather than move on to another activity. However if 100TB of data were transferred it would take slightly over 222 hours at 1Gbps or 22 hours at 10Gbps, and the researcher would definitely move on to a new activity. The solution for this issue is twofold. Firstly, the network bandwidth between the researcher and an RDSI Node must be maximized, considering the network topology both inside the researcher's institution and the AREN (Australian Research and Education Network) backbone. Similarly the network latency must be minimized. In a coordinated move the AREN is currently moving its backbone bandwidth to 100Gbps and the NRN (National Research Network) is providing 40Gbps network links from the AREN backbone to an RDSI Node's border router. 
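The back-of-envelope figures above follow directly from the size in bits divided by the link speed, assuming decimal terabytes and a fully utilized link (which real transfers rarely achieve):

```python
def transfer_hours(terabytes, gbps):
    """Idealized transfer time: decimal TB over a fully utilized link."""
    bits = terabytes * 1e12 * 8
    return bits / (gbps * 1e9) / 3600

# The figures quoted in the text:
print(round(transfer_hours(1, 1), 1))     # 1 TB at 1 Gbps: ~2.2 hours
print(round(transfer_hours(1, 10) * 60))  # 1 TB at 10 Gbps: ~13 minutes
print(round(transfer_hours(100, 1)))      # 100 TB at 1 Gbps: ~222 hours
print(round(transfer_hours(100, 10)))     # 100 TB at 10 Gbps: ~22 hours
print(round(transfer_hours(100, 40), 1))  # 100 TB at 40 Gbps: ~5.6 hours
```

The 40Gbps row anticipates the NRN links mentioned above; protocol overheads and contention will push real figures higher.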
Reconsidering the previous 100TB data transfer at a bandwidth of 40Gbps, and assuming that an institution will eventually upgrade its border routers to at least 40Gbps, it would take approximately 5.5 hours to transfer the 100TB of data rather than the 22 hours at 10Gbps. Secondly, highly efficient data movement protocols must be employed. This topic will be discussed in a later section. 6. National File System One of the initiatives of the Australian Research and Collaboration Services (ARCS) was the ARCS Data Fabric, which provided 25GB of free storage to all researchers. Unfortunately ARCS funding finished on 1st July 2011, leaving this service in financial doubt. RDSI, however, will step forward to continue this service. The RDSI project will provide a National File System giving researchers in the Australian Higher Education and Research sector 25GB of free storage. The deployment of this file system will be in much the same image as the ARCS Data Fabric, so as to provide a similar interface to previous and current users. It will run the iRODS v3 software using the OS authentication feature. This will allow the DSIL LDAP directory to provide the same username/UID and group/GID semantics within iRODS as without. 7. Data as a Service The RDSI project is a prime example of DaaS, Data as a Service. As defined by Wikipedia: DaaS is based on the concept that the product, data in this case, can be provided on
  • 19. 19 demand to the user regardless of geographic or organizational separation of provider and consumer. Data as a Service brings the notion that data quality can happen in a centralized place, cleansing and enriching data and offering it to different systems, applications or users, irrespective of where they were in the organization or on the network. As such, Data as a Service solutions provide the following advantages: • Agility – Customers can move quickly due to the simplicity of the data access and the fact that they don't need extensive knowledge of the underlying data. If customers require a slightly different data structure or have location-specific requirements, the implementation is easy because the changes are minimal. • Cost-effectiveness – Providers can build the base with the data experts and outsource the presentation layer, which makes for very cost-effective user interfaces and makes change requests at the presentation layer much more feasible. • Data quality – Access to the data is controlled through the data services, which tends to improve data quality because there is a single point for updates. Once those services are tested thoroughly, they only need to be regression tested if they remain unchanged for the next deployment. In RDSI's case the data itself is generated by researchers doing the normal things that researchers do, i.e. compiling discipline-based data sets and publishing their findings. Such data sets, as prescribed by the rigours of the RDSI ReDs program, will be uploaded to the central repositories within the collection of RDSI Nodes. Easy discovery of and access to the data contained within the RDSI Nodes is an imperative. Data Discovery and Metadata As the RDSI Nodes will be brimming with useful data sets and collections, it will be very important for a researcher to be able to easily find a particular data set. 
However without sufficient metadata describing a data set it will be next to impossible for a researcher to discover the existence of the data, let alone where it is located. Without accurate and sufficient metadata the RDSI infrastructure is pointless. In Medieval times parish priests were entrusted with the care of souls. These priests were titled curates. In present times data custodians are entrusted with the care of metadata. Data collections and data sets must have data custodians too, so that they can be curated, cared for and kept discoverable throughout their life cycle. It is assumed that ANDS (Australian National Data Service) will provide its expertise with respect to curation matters. For more details on this subject please read the ANDS Guide The Data Curation Continuum. Data Movement One of the prime purposes of an RDSI Node is to be able to move data to or from the Node to where it can be consumed by a researcher so as to produce some form of new scientific result. This data
  • 20. 20 movement can be achieved in an extraordinarily large number of ways. However the data movement mechanism that is chosen is usually the prescribed data movement protocol of a particular discipline or the preferred data movement mechanism of the researcher or his/her research group. In a sector as robust as the Australian Higher Education and Research sector this still leaves a potentially large number of data movement mechanisms in use. It would be economically infeasible for every RDSI Node to provide an interface for every data movement mechanism used in the sector. At some stage a Node must choose what interfaces it will support. So how can RDSI help Nodes in the choice of data movement mechanisms? An obvious answer is that RDSI, through the DaSh Technical Architecture, will compel all Nodes to implement a certain set of data movement mechanisms. These mechanisms will be decided through a community input process. Nodes can of course implement other data movement mechanisms as well, and this choice will obviously be one of the many differentiators of the Node from other Nodes; either attracting or repelling successful ReDs bidders. Of these compelled interfaces a number will be considered commodity types. That is, to end-users these interfaces will be well known and in common use. For the system administrators of Nodes these interfaces will also be well known, and their installation and support should be a known quantity. Potential examples are GridFTP, NFS v4, CIFS, webDAV and iRODS. A number of these compelled data movement mechanisms may also be of a specialist type, where the burden of installation and support is higher than the commodity type and end-users may not have been commonly exposed to them. The list of compelled data movement mechanisms will provide an initial level playing field for all Nodes. 
It will also provide a level playing field for all end-users of RDSI Node repositories. In the next sections we will discuss some of the data movement interfaces that may have a part to play within the RDSI project. As a gross simplification these interfaces will be categorized as: • File Transfers (in which the provision of the service is mostly stateless). • File Systems (in which the provision of the service is mostly stateful). • Data Middleware (in which there are other applications between the data and the end-user). Which interfaces will be compelled will be teased out using community advice. An initial list may look like this:
File Transfers: GridFTP; Rsync over SSH; Amazon S3; HTTP
File Systems: NFS v4.1 (pNFS) / Clustered NFS; pCIFS / CIFS; webDAV
Data Middleware: iRODS; Globus Online; Reliable File Transfer (RFT); Storage Resource Manager (SRM)
  • 21. 21 As in all movement of data from one place to another, there is always a risk that the confidentiality and/or the integrity of the data may be compromised in transit. Some data movement mechanisms provide a layer of encryption to minimize these risks. Others use digital signatures or checksums to detect that the data has been tampered with. However there are other data movement mechanisms that do not provide any security of the data in transit. The use of such insecure data movement mechanisms within the RDSI project will only be tolerated when they are tunnelled through a layer that supplies confidentiality and integrity of the data. Such layers are provided by protocols like TLS, GRE or SSH tunnelling. In these cases it is the responsibility of the Node to provide this end-to-end layer from an RDSI Node to the end-user however they wish. File Transfers The original File Transfer Protocol (FTP) specification was published as RFC 114 in 1971, even before TCP and IP existed. Since then file transfers have been the heavy lifters in the data movement area. Simply put, file transfers move a complete file or a piece of a file from one place to another, ideally as fast as possible. Examples of file transfer mechanisms of interest to RDSI are: • GridFTP is a protocol for network transfers using grid frameworks. GridFTP is part of the Globus toolkit and was designed for efficient and secure transfer of large amounts of data. GridFTP uses extensions to the FTP protocol to add enhancements such as parallel transfers and automatic restart of transfer after interruption. o ARCS GridFTP service • Rsync • HTTP • Amazon S3 is an online storage web service offered by Amazon Web Services. Amazon S3 provides storage through web services interfaces (REST, SOAP, and BitTorrent). 
  • Tsunami UDP is a fast user-space file transfer protocol that uses TCP for control and UDP for data transfer over very high speed, long distance networks (≥ 1 Gbps and even 10 GE), designed to provide more throughput than is possible with TCP over the same networks. • Aspera’s fasp™ transport technology is an emerging standard for the high-speed movement of large files or large collections of files over wide area networks. • Bitspeed Velocity is a software application that accelerates file transfers. It maximizes existing WAN bandwidth to up to 100% utilization. 8. SRM based File Transfers In the simplest situation a file transfer mechanism assumes there is only one protocol supported at both ends of the transfer. However in real life either end of the transfer may support multiple file transfer mechanisms and there may not be an exact overlap of these mechanisms. In such cases the separate end points must negotiate a common mechanism before a file transfer can be initiated. Storage Resource Management (SRM) is a Grid middleware application that helps provide this negotiation layer as well as other useful features, such as coordinating storage allocation, dynamic space reservation and automatic garbage collection that prevents the clogging of storage systems.
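The negotiation SRM performs can be reduced to a simple idea: intersect the mechanisms each endpoint supports and pick the most preferred common one. The sketch below is a toy model of that idea, not the SRM protocol itself, and the protocol names and preference ordering are invented:

```python
def negotiate(client_protocols, server_protocols, preference):
    """Pick the highest-preference protocol both ends support.
    Returns None when the endpoints share no mechanism."""
    common = set(client_protocols) & set(server_protocols)
    for proto in preference:
        if proto in common:
            return proto
    return None

PREFERENCE = ["gridftp", "https", "ftp"]  # illustrative ordering only

chosen = negotiate(["gridftp", "https"], ["https", "ftp"], PREFERENCE)
# "https" is the only mechanism both endpoints support here
```

Real SRM implementations negotiate transfer URLs and add space reservation on top of this basic matching.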
  • 22. 22 File Systems A distributed file system or network file system is any file system that allows access to files from multiple hosts via a computer network. This makes it possible for multiple users on multiple machines to share files and storage resources. The client nodes do not have direct access to the underlying block storage but interact over the network using a protocol. This makes it possible to restrict access to the file system depending on access lists or capabilities on both the servers and the clients, depending on how the protocol is designed. Ideally these file systems should be able to move data as fast as possible so as to maximize researcher productivity. Examples of network file systems of interest to RDSI are: • NFS v4.1/pNFS. NFSv4.1 adds the Parallel NFS (pNFS) capability, which enables data access parallelism. The NFSv4.1 protocol defines a method of separating the file system metadata from the location of the file data; it goes beyond the simple name/data separation by striping the data amongst a set of data servers. Whether an implementation of NFS v4.1/pNFS provides sufficient aspects of the standard to provide strong authentication, data integrity and data privacy is the concern of a Node operator, given the RDSI stance on the importance of this matter. • NFS v4. The NFS v4 protocol specification (RFC 3010) provides both strong authentication using GSSAPI as well as strong integrity and privacy using LIPKEY and SPKM-3. Whether an implementation of NFS v4 provides sufficient aspects of the standard to provide strong authentication, data integrity and data privacy is the concern of a Node operator, given the RDSI stance on the importance of this matter. • SMB/CIFS/pCIFS. The Common Internet File System (CIFS), also known as Server Message Block (SMB), is a network protocol whose most common use is sharing files on a Local Area Network. 
While CIFS can use strong authentication protocols like Kerberos, it has little native support in the areas of data integrity or privacy. To combat this deficiency one can tunnel CIFS/SMB file systems over protocols like SSH, TLS or GRE. CTDB is a clustered implementation of the TDB database used by Samba and other projects to store temporary data, and is the core component that provides pCIFS ("parallel CIFS") with Samba3/4. • webDAV (RFC 4918) is a set of methods based on the Hypertext Transfer Protocol that facilitates collaboration between users in editing and managing documents and files stored on web servers. The WebDAV protocol makes the Web a readable and writable medium. It provides a framework for users to create, change and move documents on a server. The most important features of the WebDAV protocol include: • Locking ("overwrite prevention") • Properties (creation, removal, and querying of information about author, modified date et cetera) • Namespace management (ability to copy and move Web pages within a server's namespace) • Collections (creation, removal, and listing of resources) The webDAV specification does not natively support data integrity or privacy; however
  • 23. 23 typically webDAV is tunnelled through TLS to provide these services. Data Middleware • iRODS. The Integrated Rule-Oriented Data System is open source software that helps people manage large collections of digital data distributed across multiple sites running diverse infrastructure. • OPeNDAP. An acronym for "Open-source Project for a Network Data Access Protocol", OPeNDAP is a data transport architecture and protocol widely used by earth scientists. The protocol is based on HTTP and the current specification is the OPeNDAP 2.0 draft. OPeNDAP includes standards for encapsulating structured data, annotating the data with attributes and adding semantics that describe the data. The protocol is maintained by OPeNDAP.org, a publicly funded non-profit organization that also provides free reference implementations of OPeNDAP servers and clients. • Globus Online. Globus Online is a fast, reliable file transfer service that makes it easy for any user to move any data anywhere. Recommended by HPC centres and user communities of all kinds, Globus Online automates the time-consuming and error-prone activity of managing file transfers, so users can stay focused on what’s most important: their research. • Globus Reliable File Transfer (RFT) Service. RFT is a Web Services Resource Framework (WSRF) compliant web service that provides “job scheduler"-like functionality for data movement. You simply provide a list of source and destination URLs (including directories or file globs) and the service writes your job description into a database and then moves the files on your behalf. Once the service has taken your job request, interactions with it are similar to any job scheduler. • Globus Replica Location Service (RLS). The RLS service is one component of data management services for Grid environments. RLS is a tool that provides the ability to keep track of one or more copies, or replicas, of files in a Grid environment. 
This tool, which is included in the Globus Toolkit, is especially helpful for users or applications that need to find where existing files are located in the Grid.
• Globus Data Replication Service (DRS). The function of the DRS is to ensure that a specified set of files exists on a storage site. The DRS begins by querying RLS to discover where the desired files exist in the Grid. After the files are located, the DRS creates a transfer request that is executed by RFT. After the transfers are completed, DRS registers the new replicas with RLS.
• WAN Data Cache. Researchers are naturally distributed across cities and the country. In most cases researchers are located at universities where they are close to plentiful network bandwidth. Access speeds to data within the RDSI Nodes will thus be sufficient, thanks to the AREN, NRN and the DaShNet initiative. However, there will always be researchers who are not so well endowed with network bandwidth. These "spatially disenfranchised" researchers still need to perform their science and access data within the RDSI Nodes. WAN Data Caches can give these researchers significantly more effective access and bandwidth to data within an RDSI Node than they currently have, although this will always be less than that enjoyed by "spatially enfranchised" researchers.
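The WAN Data Cache idea above can be sketched as a simple read-through cache: on a miss the object is fetched from the (slow) remote Node and kept locally; subsequent reads are served from the local copy. This is a minimal illustration only — the `fetch_remote` function and the cache layout below are hypothetical stand-ins, not part of any RDSI or Node interface.

```python
import hashlib
import os
import tempfile

# Hypothetical stand-in for a slow fetch from a remote RDSI Node;
# in reality this would be an HTTP/GridFTP transfer over the WAN.
def fetch_remote(object_name: str) -> bytes:
    return ("payload for " + object_name).encode()

class WanDataCache:
    """A minimal read-through cache: remote fetch on miss, local file on hit."""

    def __init__(self, cache_dir: str):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, object_name: str) -> str:
        # Hash the object name so arbitrary names map to safe local filenames.
        digest = hashlib.sha256(object_name.encode()).hexdigest()
        return os.path.join(self.cache_dir, digest)

    def get(self, object_name: str) -> bytes:
        path = self._path(object_name)
        if os.path.exists(path):          # cache hit: no WAN traffic
            with open(path, "rb") as f:
                return f.read()
        data = fetch_remote(object_name)  # cache miss: one WAN transfer
        with open(path, "wb") as f:
            f.write(data)
        return data

cache = WanDataCache(tempfile.mkdtemp())
first = cache.get("genome/chr1.fa")   # fetched over the "WAN"
second = cache.get("genome/chr1.fa")  # served from the local cache
print(first == second)                # -> True
```

A production cache would additionally need eviction, coherence with the authoritative copy at the Node, and integrity checks on the cached data, but the hit/miss flow is the essence of the idea.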
Structured Data
The labels "structured data" and "unstructured data" are often used ambiguously by different interest groups, and often used lazily to cover multiple distinct aspects of the issue. In reality, there are at least three orthogonal aspects to structure2:
• The structure of the data itself.
• The structure of the container that hosts the data.
• The structure of the access method used to access the data.
These three dimensions are largely independent, and one does not need to imply another. For example, it is entirely feasible and reasonable to store unstructured data in a structured database container and access it by unstructured search mechanisms. In many cases researchers will have their data described and constrained by some of the aspects detailed above. To support these activities, Nodes would require more infrastructure than "just plain storage". Node operators should see this as an opportunity: a collection of Nodes might collaborate to provide, for example, a massive distributed query engine based on the concepts of NoSQL and Map/Reduce. Such a service could be quite enticing to a significant portion of the Australian Higher Education and Research sectors.

9. Data Integrity
Data integrity within the RDSI project is of the utmost importance. If an RDSI Node cannot provide data to an end-user in the same state in which it was ingested, then researchers may not be able to trust the data from that Node. Moreover, the stain of a loss of data integrity at one Node may affect the trustworthiness of other Nodes in the eyes of data custodians and end users as well. While it is impossible to reduce the risk of data integrity loss to zero, it is possible to manage this risk. Node operators bear the brunt of this risk and must ensure that both proactive and reactive measures are taken.
As a proactive measure, storage systems should be able to detect events such as bit rot and silent corruption and attempt to heal them without human input. Such events should also be reported within the My Node portal even when the system has successfully healed the loss of data integrity. As a secondary proactive measure, a service similar to fsprobe, the CERN probabilistic data integrity checker, should perform regular checks of file systems by writing various combinations of bit patterns and then reading them back. This can be used to identify file system, operating system and hardware problems. As a reactive measure, when a storage system does detect a fault, the cause must be investigated promptly and mitigation strategies designed and put in place. RDSI must be informed of these faults and mitigation strategies. Node operators should share this information with each other so that a body of knowledge of these storage anomalies can help minimize future anomalies across all RDSI infrastructure.

2 Duncan Pauly, founder and chief technology officer of CopperEye
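An fsprobe-style check of the kind described above can be sketched as follows: write known bit patterns to a scratch file, read them back, and flag any mismatch. This is a simplified illustration, not the actual CERN fsprobe tool; the particular patterns are arbitrary choices, and a real probe would also bypass the page cache and run continuously.

```python
import os
import tempfile

# Patterns chosen to exercise different byte values (fsprobe uses various
# combinations of bit patterns); these particular values are illustrative.
PATTERNS = [b"\x00", b"\xff", b"\xaa", b"\x55", b"\xde\xad\xbe\xef"]
BLOCK_SIZE = 4096

def probe(path: str, blocks: int = 16) -> list:
    """Write each pattern to `blocks` blocks, read them back, and return a
    list of (pattern, block_index) pairs that failed to verify."""
    failures = []
    for pattern in PATTERNS:
        # Repeat the pattern to fill exactly one block.
        block = (pattern * (BLOCK_SIZE // len(pattern) + 1))[:BLOCK_SIZE]
        with open(path, "wb") as f:
            for _ in range(blocks):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())  # force the data out to the device
        with open(path, "rb") as f:
            for i in range(blocks):
                if f.read(BLOCK_SIZE) != block:
                    failures.append((pattern, i))
    return failures

scratch = os.path.join(tempfile.mkdtemp(), "probe.dat")
print(probe(scratch))  # an empty list means no corruption was detected
```

Run regularly against each file system a Node exposes, any non-empty result would be exactly the kind of anomaly that should be investigated, reported to RDSI and shared with other Node operators.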
10. Manifestation of Trust within the RDSI Program
It is obvious that all RDSI infrastructure must manifest a significant level of trustworthiness so that researchers, data custodians and other users will feel secure in its use. In an infrastructure such as a PKI or a SAML-based federation, the root of trust is usually located at a single point: the trust root is either a root CA (in the case of a PKI) or a self-signed certificate used to digitally sign an aggregation of SAML metadata (in the case of a SAML federation). For both a PKI and a SAML federation there are also open, well-published practice statements that allow end-users and relying parties to understand the risks of using the PKI or SAML federation.

In the RDSI infrastructure there are a number of trust centres that manifest this aggregated trust. Some are manifested by RDSI itself as a governance and policies layer; some are manifested by the RDSI Nodes and their work practices; some are manifested in the appropriately sanctioned use of the DSIL LDAP directory. There are also trust manifestation centres that at first glance have little real connection to either RDSI or the Nodes. For example, when a remote RDSI file system is mounted on a local file system, it is the work practices of the local system administrators that generate the trustworthiness of that act. For this reason the manifestation of trust for all aspects of the RDSI project is somewhat more complicated than the simple case of a PKI or SAML federation. This increases the risk to end-users and relying parties, as they may not be able to fully understand the risks of using the RDSI infrastructure. The RDSI governance layer must manage the perception of these risks well, so as to maintain the level of trustworthiness needed for researchers, data custodians and other users to feel secure in using the infrastructure.
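The single-trust-root model described for SAML federations can be illustrated with a toy example: every relying party pins one verification key, and accepts a metadata aggregate only if its signature checks out against that key. HMAC with a shared secret stands in here for the real X.509 key pair and XML digital signature; nothing below reflects an actual RDSI or federation interface.

```python
import hashlib
import hmac

# Toy stand-in for the federation's signing key; a real federation would
# use an X.509 key pair and XML digital signatures, not a shared secret.
FEDERATION_ROOT_KEY = b"example-federation-root-key"

def sign_metadata(aggregate: bytes, key: bytes) -> str:
    """The federation operator signs the aggregated metadata once."""
    return hmac.new(key, aggregate, hashlib.sha256).hexdigest()

def verify_metadata(aggregate: bytes, signature: str, pinned_key: bytes) -> bool:
    """Every relying party verifies against the same pinned trust root."""
    expected = hmac.new(pinned_key, aggregate, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)

metadata = b"<EntitiesDescriptor>...aggregated SAML metadata...</EntitiesDescriptor>"
sig = sign_metadata(metadata, FEDERATION_ROOT_KEY)

print(verify_metadata(metadata, sig, FEDERATION_ROOT_KEY))                # True
print(verify_metadata(metadata + b"tampered", sig, FEDERATION_ROOT_KEY))  # False
```

The point of the sketch is the asymmetry RDSI lacks: here trust reduces to one pinned key plus a published practice statement, whereas in RDSI trust is spread across governance, Node work practices and local system administration, with no single point to pin.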