This talk was given at a workshop entitled "Cybersecurity Engagement in a Research Environment" at the Rady School of Management at UCSD. The workshop was organized by Michael Corn, the UCSD CISO. The talk tries to provoke discussion around the cybersecurity features and requirements of international science collaborations and, more generally, of federated cyberinfrastructure systems.
3. Jensen Huang keynote @ SC19
The Largest Cloud Simulation in History
50k NVIDIA GPUs in the Cloud
350 Petaflops for 2 hours
Distributed across US, Europe & Asia
On the Saturday morning before SC19, we bought all the GPU capacity that was for sale on
Amazon Web Services, Microsoft Azure, and Google Cloud Platform worldwide
4. Science with 51,000 GPUs
achieved as peak performance
[Figure: GPUs in use vs. time in minutes. Each color is a different cloud region in the US, EU, or Asia; 28 regions in use in total.]
Peaked at 51,500 GPUs
~380 Petaflops of fp32
I can purchase a 300-Petaflop (fp32) hour in the cloud for $15k today
and nobody asks me any questions about cybersecurity.
• Nothing about my nationality or visa or …
• Nothing about two-factor authentication or my software
• Everything is wide open on the internet
5. Should cybersecurity requirements imposed on open academic research executed on on-prem resources be adjusted to the realities of executing the same research on cloud resources?
9. Cybersecurity enabling Science
• Humanity has built extraordinary instruments by
pooling human and financial resources globally.
• To derive science from the data and simulations
for those instruments requires globally
integrated Cyberinfrastructure.
• Cybersecurity is enabling this science.
Policy framework
Operational security
Infrastructure software
[Figure: Disk space use per site by CMS]
11. XENON1T Storage & Processing Challenge
• Experiment in Gran Sasso, Italy
• Tape Archive in Sweden
• Disk storage in 7 locations across Holland, Italy,
Israel, France, USA
Petabyte of data divided into 20k datasets
• Compute sites on EGI, OSG, and NSF HPC
allocation
OSG took on the integration challenge via "embedded" technical support.
14. OSG Compute Federation
OSG federates ~200 clusters worldwide.
Owners determine policy of use.
Many allow opportunistic use of spare capacity.
> 2 Billion CPU core hours per year
15. Federation Principle
• Any provider can bring their resources to the
table.
• Truth in advertising:
Resource providers accurately specify (some)
details about the resource.
• Any consumer can decide which of the
available resources they are willing to use.
OSG matches consumers to providers globally
following policies expressed locally.
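The federation principle above can be sketched in a few lines. This is a hypothetical, simplified data model for illustration only; the production OSG pool expresses and matches these policies with HTCondor ClassAd matchmaking.

```python
# Sketch of policy-based matchmaking between resource providers and
# consumers (hypothetical data model; OSG uses HTCondor ClassAds).

def match(providers, consumer):
    """Return the names of providers acceptable to both sides:
    the provider's locally expressed policy must admit the consumer,
    and the consumer's requirements must accept the provider."""
    acceptable = []
    for p in providers:
        if not p["policy"](consumer):    # provider's local policy
            continue
        if not consumer["requires"](p):  # consumer decides what to use
            continue
        acceptable.append(p["name"])
    return acceptable

# Truth in advertising: each provider publishes (some) attributes.
providers = [
    {"name": "SiteA", "attrs": {"gpus": True, "opportunistic": True},
     "policy": lambda c: True},              # open to any consumer
    {"name": "SiteB", "attrs": {"gpus": False, "opportunistic": True},
     "policy": lambda c: c["vo"] == "cms"},  # admits only the CMS VO
]

consumer = {"vo": "xenon", "requires": lambda p: p["attrs"]["gpus"]}
print(match(providers, consumer))  # ['SiteA']
```

The key design point is that neither side's policy is centralized: providers and consumers each evaluate the other locally, and the federation merely performs the global match.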
16. “NETFLIX” for Open Science
• NETFLIX operates a CDN, providing streaming access to
searchable curated data from anywhere, at any time, to any
subscriber.
• For open science, the CDN needs to (in addition) be federated.
Anybody can share their data from their locally owned data origin into the
CDN.
Data Access is mediated via caches in the network and at endpoints to
minimize requirements on origins to maximally stimulate sharing.
Performance of data access is determined by location and performance of
the closest cache rather than the data’s origin.
Locally defined and managed groups of users share data securely with
each other globally. Data access is global.
Locally defined policies are enforced globally by the CDN
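The cache-placement idea above can be sketched as follows. The latency numbers and cache names here are hypothetical; real deployments pick caches via GeoIP or network topology rather than a static table.

```python
# Sketch of cache selection in a federated CDN: a client reads through
# the nearest working cache rather than the data's origin (hypothetical
# latency table; production systems use GeoIP / network topology).

def pick_cache(caches, client_region):
    """Choose the lowest-latency cache for this client's region."""
    return min(caches, key=lambda c: c["latency_ms"][client_region])

caches = [
    {"name": "cache-ucsd",    "latency_ms": {"us-west": 5,   "eu": 150}},
    {"name": "cache-chicago", "latency_ms": {"us-west": 45,  "eu": 110}},
    {"name": "cache-cern",    "latency_ms": {"us-west": 160, "eu": 8}},
]

print(pick_cache(caches, "us-west")["name"])  # cache-ucsd
print(pick_cache(caches, "eu")["name"])       # cache-cern
```

This is what makes access performance depend on the closest cache rather than the origin: the origin only has to serve each dataset into the network once.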
17. The OSG Data Federation
[Figure: map of the current StashCache infrastructure (US), with sites including GaTech]
We operate a production “prototype” of such a CDN
19. Authz: Person vs Capability
• Operations teams are a mix of "permanent"
staff and transients.
E.g. CERN pays for "Operators" funded via
"authorship fees".
• Delegating a person’s identity to a computing
activity in order to authenticate the activity at a
remote server makes little sense.
• Delegating a capability to a computing activity
in order to authenticate it at a remote server
makes a lot of sense.
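The capability model can be sketched as below. The token format and scope strings here are hypothetical and unsigned, purely for illustration; production systems use cryptographically signed tokens with scope claims (e.g. SciTokens-style JWTs).

```python
# Sketch of capability-based authorization: the computing activity
# carries a token naming what it may do, not who launched it
# (hypothetical unsigned token; real systems use signed JWTs).

def authorize(token, action, path):
    """A remote server checks the capability, never the person."""
    if token.get("expired"):
        return False
    for scope in token["scopes"]:
        op, _, prefix = scope.partition(":")
        # A scope like "read:/xenon/raw" grants that action on the prefix.
        if op == action and path.startswith(prefix):
            return True
    return False

token = {"scopes": ["read:/xenon/raw", "write:/xenon/processed"],
         "expired": False}

print(authorize(token, "read",  "/xenon/raw/run0042"))   # True
print(authorize(token, "write", "/xenon/raw/run0042"))   # False
```

Because the token says "may read /xenon/raw" rather than "is Jane Doe", it survives staff turnover in the operations team and can be delegated to jobs without delegating anyone's identity.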
20. Division of Responsibility
• To maximize the capacity provided we need
to minimize the effort required to provide it.
• The services required for the CDN and/or
compute federations are specialized and
non-trivial.
Steep learning curve to achieve low-cost
operations.
Service Operations is most (cost) effective
when separated from hardware operations
21. Network Cache Ops Model
• OSG supports the researchers
using the Data Federation
• OSG deploys & operates the
caching middleware.
• PRP, TNRP, I2, Regionals, …
responsible for network
performance.
• Hardware owners operate
hardware, OS install, and join
K8S for container orchestration.
Science Applications
Data Federation Services
Network Performance
Hardware & OS
A layered approach to distributed DevOps Responsibility
22. Cybersecurity Issues (I)
• Hardware owners only provide hardware
Deploy OS and Kubernetes.
• Service Operators (I)
A team that operates the K8S cluster.
• Service Operators (II)
A team that deploys and operates the CDN service as
containers inside (and across generally multiple) K8S
clusters.
• Software Operations
A team that provides the container images
How do you design a security model that supports this structure?
23. Cybersecurity Issues (II)
• Container Security Model
• Security Model that allows hardware owners to give service
responsibility to service operators.
Diverse requirements
Some institutions will want to operate their own K8S simply because of the
level of control that implies.
Others won’t because of the level of effort it requires.
How do DOE and other National Labs fit into this?
How can a service provider in the US operate a service on hardware in
EU and Asia? Or vice versa.
What about India, Pakistan, China, Iran, … pick your favorite country ….
How to deal with institutions that require US citizenship even for sudo
access?
The set of issues and diversity of constraints seems endless
And now think back to the beginning: All of this is trivial in the cloud!!!
24. Summary & Conclusions
• Humanity has built extraordinary instruments by
pooling human and financial resources globally.
• To derive science from the data and simulations
for those instruments requires globally
integrated Cyberinfrastructure.
• Cybersecurity is enabling this science.
Policy framework
Operational security
Infrastructure software
Contact us at: help@opensciencegrid.org
Or me personally at: fkw@ucsd.edu
25. Acknowledgements
• This work was partially supported by the
NSF grants OAC-1941481, MPS-1148698,
OAC-1841530, OAC-1904444, and OAC-1826967