Eli Dart, Network Engineer ESnet Science Engagement Lawrence Berkeley National Laboratory Cosmology CrossConnects Workshop Berkeley, CA February 11, 2015
Network Engineering for High Speed Data Sharing – Globus
These slides were presented by ESnet's Eli Dart at the AGU Fall Meeting 2018 in a session titled "Scalable Data Management Practices in Earth Sciences" convened by Ian Foster, Globus co-founder and director of Argonne's data science and learning division.
Grid optical network service architecture for data intensive applications – Tal Lavian, Ph.D.
Integrated SW System Provides the "Glue"
A dynamic optical network as a fundamental Grid service for data-intensive Grid applications: scheduled, managed, and coordinated to support collaborative operations.
From Super-computer to Super-network
In the past, computer processors were the fastest part and peripherals were the bottlenecks.
In the future, optical networks will be the fastest part; computers, processors, storage, visualization, and instrumentation will be the slower "peripherals".
eScience cyberinfrastructure focuses on computation, storage, data, analysis, and workflow.
The network is vital for better eScience.
From Jisc's campus network engineering for data-intensive science workshop on 19 October 2016.
https://www.jisc.ac.uk/events/campus-network-engineering-for-data-intensive-science-workshop-19-oct-2016
Data warehouses store integrated and consistent data in a subject-oriented repository dedicated especially to supporting business intelligence processes. However, keeping these repositories updated usually involves complex and time-consuming processes, commonly referred to as Extract-Transform-Load (ETL) tasks. These data-intensive tasks normally execute in a limited time window, and their computational requirements tend to grow over time as more data is handled. We therefore believe that a grid environment is well suited to serve as the backbone of the technical infrastructure, with the clear financial advantage of using the already-acquired desktop computers normally present in the organization. This article proposes a different approach to distributing ETL processes in a grid environment, taking into account not only the processing performance of the grid's nodes but also the available bandwidth, in order to estimate grid availability in the near future and thereby optimize workflow distribution.
Lambda Data Grid: An Agile Optical Platform for Grid Computing and Data-inten... – Tal Lavian, Ph.D.
Lambda Data Grid
An Agile Optical Platform for Grid Computing and Data-intensive Applications
Focus on the BIRN Mouse application.
Great vision – LambdaGrid is one step towards this concept.
LambdaGrid – a novel service architecture:
Lambda as a scheduled service
Lambda as a prime resource, like storage and computation
Changes our current systems assumptions
Potentially opens new horizons
Dr. Frank Würthwein of the University of California at San Diego presented this at the International Super Computing Conference on Big Data, 2013, US. Until recently, the large CERN experiments, ATLAS and CMS, owned and controlled the computing infrastructure they operated on in the US, and accessed data only when it was locally available on the hardware they operated. However, Würthwein explains, with data-taking rates set to increase dramatically by the end of LS1 in 2015, the current operational model is no longer viable to satisfy peak processing needs. Instead, he argues, large-scale processing centers need to be created dynamically to cope with spikes in demand. To this end, Würthwein and colleagues carried out a successful proof-of-concept study, in which the Gordon Supercomputer at the San Diego Supercomputer Center was dynamically and seamlessly integrated into the CMS production system to process a 125-terabyte data set.
Efficient node bootstrapping for decentralised shared-nothing Key-Value Stores – Han Li
These slides were presented at ACM/IFIP/USENIX Middleware 2013 for the paper "Efficient node bootstrapping for decentralised shared-nothing Key-Value Stores". The paper's abstract is shown below.
Abstract. Distributed key-value stores (KVSs) have become an important component for data management in cloud applications. Since resources can be provisioned on demand in the cloud, there is a need for efficient node bootstrapping and decommissioning, i.e. incorporating or eliminating provisioned resources as members of the KVS. This requires that data be handed over and load be shifted across the nodes quickly. However, the data partitioning schemes in current shared-nothing KVSs are not efficient for quick bootstrapping. In this paper, we design a middleware layer that provides a decentralised auto-sharding scheme with two-phase bootstrapping. We experimentally demonstrate that our scheme reduces bootstrap time and improves load balancing, thereby increasing the scalability of the KVS.
Challenges and Issues of Next Cloud Computing Platforms – Frederic Desprez
Cloud computing has now crossed the frontiers of research to reach industry. It is used every day, whether to exchange emails or make reservations on web sites. However, much research remains to be done to improve the performance and functionality of these platforms of tomorrow. In this talk, I will give an overview of some of the theoretical and applied research done at INRIA, particularly around Cloud distribution, energy monitoring and management, massive data processing and exchange, and resource management.
Slides from my talk on R&D innovation projects around the Janet network for the HEAnet / Juniper Innovation Day, September 2015. I talk about some recent Janet R&D initiatives such as our Reach scheme for connecting industry to the network, our end to end performance initiative, and our Safe Share project for secure access to sensitive data by researchers - e.g. medical records. There is also a recap of some of our recent activity around equipment sharing, our shared data centre, connectivity and deals with major cloud providers.
Archiving data from Durham to RAL using the File Transfer Service (FTS) – Jisc
From Jisc's campus network engineering for data-intensive science workshop on 19 October 2016.
https://www.jisc.ac.uk/events/campus-network-engineering-for-data-intensive-science-workshop-19-oct-2016
Network-aware Data Management for High Throughput Flows – Akamai, Cambridge, ... – balmanme
As current technology enables faster storage devices and larger interconnect bandwidth, there is a substantial need for novel system design and middleware architecture to address increasing latency, scalability, and throughput requirements. In this talk, I will outline network-aware data management and present solutions based on my past experience in large-scale data migration between remote repositories.
I will first describe my experience in the initial evaluation of a 100Gbps network as part of the Advanced Networking Initiative project. We needed intense fine-tuning in the network, storage, and application layers to take advantage of the higher network capacity. I will introduce a special data movement prototype, successfully tested in one of the first 100Gbps demonstrations, in which applications map memory blocks for remote data, in contrast to send/receive semantics.
Within this scope, I will introduce a flexible network reservation algorithm for on-demand bandwidth-guaranteed virtual circuit services. Flexible reservations find the best path in a time-dependent dynamic network topology to support predictable application performance. I will then present a data-scheduling model with advance provisioning, in which data movement operations are defined with earliest start and latest completion times.
I will conclude my talk with a very brief overview of my other related projects on performance engineering, hyper-converged virtual storage, and optimization in control and data path for virtualized environments.
Sept 28, 2015
Akamai, Cambridge, MA
Paul Messina from Argonne presented this deck at the HPC User Forum in Santa Fe.
"The Exascale Computing Project (ECP) was established with the goals of maximizing the benefits of high-performance computing (HPC) for the United States and accelerating the development of a capable exascale computing ecosystem. Exascale refers to computing systems at least 50 times faster than the nation's most powerful supercomputers in use today. The ECP is a collaborative effort of two U.S. Department of Energy organizations – the Office of Science (DOE-SC) and the National Nuclear Security Administration (NNSA)."
Watch the video: http://insidehpc.com/2017/04/update-exascale-computing-project-ecp/
Learn more: https://exascaleproject.org/
and
http://hpcuserforum.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Michael Sullivan, M.D. Associate Director, Health Sciences, Internet2
AAMC 2013 Information Technology in Academic Medicine Conference Vancouver CA June 5-7, 2013
This was a 30 min talk intended as one of the opening/overview presentations before a full-day deep dive into ScienceDMZ design patterns and architectures.
Direct downloads are not enabled. Contact me directly (chris@bioteam.net) if you for some odd reason want a copy of this slide deck!
Opening Keynote Lecture
15th Annual ON*VECTOR International Photonics Workshop
Calit2’s Qualcomm Institute
University of California, San Diego
February 29, 2016
Enhancing Performance with Globus and the Science DMZ – Globus
ESnet has led the way in helping national facilities—and many other institutions in the research community—configure Science DMZs and troubleshoot network issues to maximize data transfer performance. In this talk we will present a summary of approaches and tips for getting the most out of your network infrastructure using Globus Connect Server.
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit... – Ilkay Altintas, Ph.D.
Scientific workflows are used by many scientific communities to capture, automate and standardize computational and data practices in science. Workflow-based automation is often achieved through a craft that combines people, process, computational and Big Data platforms, application-specific purpose and programmability, leading to provenance-aware archival and publication of the results. This talk summarizes the varying and changing requirements for distributed workflows influenced by Big Data and heterogeneous computing architectures, and presents a methodology for workflow-driven science based on these maturing requirements.
Network Management and Flow Analysis in Today's Dense IT Environments – SolarWinds
For more information on NetFlow Traffic Analyzer visit: http://www.solarwinds.com/products/network-traffic-analyzer/info.aspx
For more information on IP SLA visit: http://www.solarwinds.com/products/ip-sla-monitoring/info.aspx
Watch this webcast: http://www.solarwinds.com/resources/webcasts/network-management-and-flow-analysis-in-today-dense-it-environments.html
In the 1990s, when the Internet and enterprise network build-out occurred, you had to manage individual intersite WAN connections and single-purpose networking equipment. Network management meant managing devices and their basic functions. Today, multifunctional and virtual devices are common, and new device types and services go to market every day.
During this webcast, we discuss how network management and flow analysis have evolved to keep pace with today's complicated and dense IT environments. Specifically, we discuss managing data centers, changes in WAN technologies and WAN management, and best practices for flow analysis, including where to deploy flow exporters.
Supporting Research through "Desktop as a Service" models of e-infrastructure... – David Wallom
Keynote presentation given 13/9/16 at the ESA Earth Observation Open Science workshop 2016.
"The rise in cloud computing as an e-infrastructure model is one that has the power to democratise access to computational and data resources throughout the research communities. We have seen the difference that Infrastructure as a Service (IaaS) has made for different communities and are now only beginning to understand what different models further up the stack can make. It is also becoming clear that with the increase in research data volumes, the number of sources and the possibility of utilising data from different regulatory regimes that a different model of how analysis is performed on the data is possible. Utilising a "Desktop as a Service" model, with community focused applications installed on a common and well understood virtual system image that is directly connected to community relevant data allows the researcher to no longer have to consider moving data but only the final analysed results. This massively simplifies both the user model and the data and resource owner model. We will consider the specific example of the Environmental Ecomics Synthesis Cloud and how it could easily be generalised to other areas."
1. Common Design Elements for Data Movement
Eli Dart, Network Engineer
ESnet Science Engagement
Lawrence Berkeley National Laboratory
Cosmology CrossConnects Workshop
Berkeley, CA
February 11, 2015
3. Context
• Data-intensive science continues to need high-performance data movement between geographically distant locations
  – Observation (or instrument) to analysis
  – Distribution of data products to users
  – Aggregation of data sets for analysis
  – Replication to archival storage
• Move computation to data? Of course! Except when you can't…
  – A liquid market in fungible computing allocations does not exist
  – Users get an allocation of time on a specific compute resource – if the data isn't there already, it needs to be put there
  – If data can't be stored long-term where it's generated, it must be moved
  – Other reasons too – the point is we have to be able to move Big Data
• Given the need for data movement, how can we reliably do it well?
4. The Task of Large Scale Data Movement
• Several different ways to look at a data movement task
• People perspective:
  – I am a member of a collaboration
  – Our collaboration has accounts with compute allocations and data storage allocations at a set of sites
  – I need to move data between those sites
• Organization/facility perspective:
  – ANL, NCSA, NERSC, ORNL and SDSC are all used by the collaboration
  – All these sites must have data transfer tools in common
  – I must learn what tools and capabilities each site has, and apply those tools to my task
• Note that the integration burden is on the scientist!
5. Service Primitives
• There is another way to look at data movement
• All large-scale data movement tasks are composed of a set of primitives
  – Those primitives are common to most such workflows
  – If major sites can agree on a set of primitives, all large-scale data workflows will benefit
• What are the common primitives?
  – Storage systems (filesystems, tape archives, etc.)
  – Data transfer applications (Globus, others)
  – Workflow tools, if automation is used
  – Networks
    • Local networks
    • Wide area networks
• What if these worked well together in the general case?
• Compose them into common design patterns
9. Design Pattern – The Science DMZ Model
• Design patterns are reusable solutions to design problems that recur in the real world
  – High performance data movement is a good fit for this
  – Science DMZ model
• Science DMZ incorporates several things
  – Network enclave at or near site perimeter
  – Sane security controls
    • Good fit for high-performance applications
    • Specific to Science DMZ services
  – Performance test and measurement
  – Dedicated systems for data transfer (Data Transfer Nodes)
    • High performance hosts
    • Good tools
• Details at http://fasterdata.es.net/science-dmz/
10. Context: Science DMZ Adoption
• DOE National Laboratories
  – Both large and small sites
  – HPC centers, LHC sites, experimental facilities
• NSF CC-NIE and CC*IIE programs leverage Science DMZ
  – $40M and counting (third round awards coming soon, estimate additional $18M to $20M)
  – Significant investments across the US university complex, ~130 awards
  – Big shoutout to Kevin Thompson and the NSF – these programs are critically important
• National Institutes of Health
  – 100G network infrastructure refresh
• US Department of Agriculture
  – Agricultural Research Service is building a new science network based on the Science DMZ model
  – https://www.ro.gov/index?s=opportunity&mode=form&tab=core&id=a7f291f4216b5a24c1177a5684e1809b
• Other US agencies looking at Science DMZ model
  – NASA
  – NOAA
• Australian Research Data Storage Infrastructure (RDSI)
  – Science DMZs at major sites, connected by a high speed network
  – https://www.rdsi.edu.au/dashnet
  – https://www.rdsi.edu.au/dashnet-deployment-rdsi-nodes-begins
• Other countries
  – Brazil
  – New Zealand
  – More
11. Context: Community Capabilities
• Many Science DMZs directly support science applications
  – LHC (Run 2 is coming soon)
  – Experiment operation (Fusion, Light Sources, etc.)
  – Data transfer into/out of HPC facilities
• Many Science DMZs are SDN-ready
  – Openflow-capable gear
  – SDN research ongoing
• High-performance components
  – High-speed WAN connectivity
  – perfSONAR deployments
  – DTN deployments
• Metcalfe's Law of Network Utility
  – Value proportional to the square of the number of DMZs? n log(n)?
  – Cyberinfrastructure value increases as we all upgrade
12. Strategic Impacts
• What does this mean?
  – We are in the midst of a significant cyberinfrastructure upgrade
  – Enterprise networks need not be unduly perturbed :)
• Significantly enhanced capabilities compared to 3 years ago
  – Terabyte-scale data movement is much easier
  – Petabyte-scale data movement possible outside the LHC experiments
    • 3.1Gbps = 1PB/month
    • (Try doing that through your enterprise firewall!)
  – Widely-deployed tools are much better (e.g. Globus)
• Raised expectations for network infrastructures
  – Scientists should be able to do better than residential broadband
• Many more sites can now achieve good performance
• Incumbent on science networks to meet the challenge
  – Remember the TCP loss characteristics
  – Use perfSONAR
  – Science experiments assume this stuff works – we can now meet their needs
13. High Performance Data Transfer – Requirements
• There is a set of things required for reliable high-performance data transfer
  – Long-haul networks
    • Well-provisioned
    • High-performance
  – Local networks
    • Well-provisioned
    • High-performance
    • Sane security
  – Local data systems
    • Dedicated to data transfer (else too much complexity)
    • High-performance access to storage
  – Good data transfer tools
    • Interoperable
    • High-performance
  – Ease of use
    • Usable by people
    • Usable by workflows
    • Interoperable across sites (remove integration burden)
14. Long-Haul Network Status
• 100 Gigabit per second networks deployed globally
  – USA/DOE National Laboratories – ESnet
  – USA/.edu – Internet2
  – Europe – GEANT
  – Many state and regional networks have or are deploying 100Gbps cores
• What does this mean in terms of capability?
  – 1TB/hour requires less than 2.5Gbps (2.5% of 100Gbps network)
  – 1PB/week requires less than 15Gbps (15% of 100Gbps network)
  – http://fasterdata.es.net/home/requirements-and-expectations
  – The long-haul capacity problem is now solved, to first order
    • Some networks are still in the middle of upgrades
    • However, steady progress is being made
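These capacity figures are easy to sanity-check. The Python sketch below is an illustration (decimal units, 8 bits per byte, protocol overhead ignored, which is why the slide rounds upward); it reproduces the 2.5 Gbps and 15 Gbps bounds above, as well as the 3.1 Gbps = 1 PB/month figure from the Strategic Impacts slide:

# Back-of-the-envelope check of the capacity figures on this slide.
# A sketch: decimal units (1 TB = 1e12 bytes), overhead ignored.

def gbps_needed(bytes_total, seconds):
    """Average rate in Gbit/s needed to move bytes_total in the given time."""
    return bytes_total * 8 / seconds / 1e9

print(f"1 TB/hour -> {gbps_needed(1e12, 3600):.2f} Gbps")        # ~2.22, under 2.5
print(f"1 PB/week -> {gbps_needed(1e15, 7 * 86400):.2f} Gbps")   # ~13.23, under 15
# The Strategic Impacts figure: 3.1 Gbps sustained for 30 days moves ~1 PB.
print(f"3.1 Gbps for 30 days -> {3.1e9 / 8 * 30 * 86400 / 1e15:.2f} PB")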
15. Local Network Status
• Many ESnet sites now have 100G connections to ESnet
  – 2x100G: BNL, CERN, FNAL
  – 1x100G: ANL, LANL, LBNL, NERSC, ORNL, SLAC
• Capacity provisioning is much easier in a LAN environment
• Security requires attention (see Science DMZ)
• Major DOE computing facilities have a lot of capacity deployed to their data systems
  – ANL: 60Gbps
  – NERSC: 80Gbps
  – ORNL: 20Gbps
• Big win if sites use Science DMZ model
16. Progress So Far
• There is a set of things required for reliable high-performance data transfer
  – Long-haul networks
    • Well-provisioned
    • High-performance
  – Local networks
    • Well-provisioned
    • High-performance
    • Sane security
  – Local data systems
    • Dedicated to data transfer (else too much complexity)
    • High-performance access to storage
  – Good data transfer tools
    • Interoperable
    • High-performance
  – Ease of use
    • Usable by people
    • Usable by workflows
    • Interoperable across sites (remove integration burden)
17. Local Data Systems
• Science DMZ model calls these Data Transfer Nodes
  – Dedicated to high-performance data transfer tasks
  – Short, clean path to outside world
• At HPC facilities, they mount the global filesystem
  – Transfer data to the DTN
  – Data available on HPC resource
• High-performance data transfer tools
  – Globus Transfer
  – Command-line globus-url-copy
  – BBCP
• These are deployed now at many HPC facilities
  – ANL, NERSC, ORNL
  – NCSA, SDSC
18. Data Transfer Tools
• Interoperability is really important
  – Remember, scientists should not have to do the integration
  – HPC facilities should agree on a common toolset
  – Today, that common toolset has a few members
    • Globus Transfer
    • SSH/SCP/Rsync (yes, I know – ick!)
    • Many niche tools
• Globus appears to be the most full-featured
  – GUI, data integrity checks, fault recovery
  – Fire and forget
  – API for workflows
• Globus is also widely deployed
  – ANL, NERSC, ORNL
  – NCSA, SDSC (all of XSEDE)
  – Many other locations
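The "API for workflows" bullet can be made concrete with the Globus Python SDK (pip install globus-sdk). The sketch below is illustrative only, not a procedure from this deck: the client ID, endpoint UUIDs, and paths are placeholders, and the details of the auth flow vary by deployment.

# Minimal "fire and forget" transfer via the Globus Python SDK.
# All IDs and paths below are hypothetical placeholders.
import globus_sdk

CLIENT_ID = "YOUR-NATIVE-APP-CLIENT-ID"   # registered at developers.globus.org
SRC = "SOURCE-ENDPOINT-UUID"              # e.g. an HPC facility's DTN endpoint
DST = "DESTINATION-ENDPOINT-UUID"

# Interactive native-app login; real workflows would cache and refresh tokens.
auth = globus_sdk.NativeAppAuthClient(CLIENT_ID)
auth.oauth2_start_flow()
print("Log in at:", auth.oauth2_get_authorize_url())
tokens = auth.oauth2_exchange_code_for_tokens(input("Auth code: "))
access_token = tokens.by_resource_server["transfer.api.globus.org"]["access_token"]

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(access_token))

# Build the transfer; Globus handles integrity checks and fault recovery.
tdata = globus_sdk.TransferData(source_endpoint=SRC, destination_endpoint=DST,
                                label="simulation output")
tdata.add_item("/project/sim/run42/", "/archive/sim/run42/", recursive=True)

task = tc.submit_transfer(tdata)          # returns immediately: fire and forget
print("Submitted task:", task["task_id"])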
19. More Progress
• There is a set of things required for reliable high-performance data transfer
  – Long-haul networks
    • Well-provisioned
    • High-performance
  – Local networks
    • Well-provisioned
    • High-performance
    • Sane security
  – Local data systems
    • Dedicated to data transfer (else too much complexity)
    • High-performance access to storage
  – Good data transfer tools
    • Interoperable
    • High-performance
  – Ease of use
    • Usable by people
    • Usable by workflows
    • Interoperable across sites (remove integration burden)
20. Mission Scope and Science Support
• Resource providers each have their own mission
  – ESnet: high-performance networking for science
  – ANL, NERSC, ORNL: HPC for DOE science users
  – NCSA, SDSC, et. al.: HPC for NSF users
  – Globus: full-featured, high-performance data transfer tools
• No responsibility for individual science projects
  – Resource provider staff usually not part of science projects
  – Science projects have to do their own integration (see beginning of talk)
• However, resource providers are typically responsive to user requests
  – If you have a problem, it's their job to fix it
  – I propose we use this to get something done
21. Hypothetical: HPC Data Transfer Capability
• This community has significant data transfer needs
  – I have worked with some of you in the past
  – Simulations, sky surveys, etc.
  – Expectation over time that needs will increase
• Improve data movement capability
  – ANL, NERSC, ORNL
  – NCSA, SDSC
  – This is an arbitrary list, based on my incomplete understanding
  – Should there be others?
• Goal: per-Globus-job performance of 1PB/week
  – I don't mean we have to transfer 1PB every week
  – But, if we need to, we should be able to
  – Remember, this only takes 15% of a 100G network path
22. What Would Be Required?
• We would need several things:
  – Specific workflow (move dataset D of size S from A to Z, frequency F)
  – A commitment by resource providers to see it through
    • ESnet (+ other networks if needed)
    • Computing facilities
    • Globus
• Is it 100% plug-and-play? No.
  – There are almost certainly some wrinkles
  – However, most of the hard part is done
    • Networks
    • Data transfer nodes
    • Tools
• Let's work together and make this happen!
23. Questions For You
• Would an effort like this be useful? (I think so)
• Does this community need this capability? (I think so)
• Are there obvious gaps? (probably, e.g. performance to tape)
• Which sites would be involved?
• Am I crazy? (I think not)
24. Thanks!
Eli Dart
Energy Sciences Network (ESnet)
Lawrence Berkeley National Laboratory
http://fasterdata.es.net/
http://my.es.net/
http://www.es.net/
26. Support For Science Traffic
• The Science DMZ is typically deployed to support science traffic
  – Typically large data transfers over long distances
  – In most cases, the data transfer applications use TCP
• The behavior of TCP is a legacy from the congestion collapse of the Internet in the 1980s
  – Loss is interpreted as congestion
  – TCP backs off to avoid congestion → performance degrades
  – Performance hit related to the square of the packet loss rate
• Addressing this problem is a dominant engineering consideration for science networks
  – Lots of design effort
  – Lots of engineering time
  – Lots of troubleshooting effort
27. A small amount of packet loss makes a huge difference in TCP performance
[Chart: TCP throughput vs. path length (Local (LAN), Metro Area, Regional, Continental, International) for Measured (TCP Reno), Measured (HTCP), Theoretical (TCP Reno), and Measured (no loss).]
With loss, high performance beyond metro distances is essentially impossible.
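The shape of that chart follows from the well-known Mathis et al. TCP model, throughput ≤ MSS / (RTT × √loss). The sketch below reproduces the qualitative falloff; the 1500-byte packets, loss rate, and RTTs are illustrative assumptions, not values stated in the deck.

# Qualitative reproduction of the falloff using the Mathis et al. model:
# achievable TCP throughput <= MSS / (RTT * sqrt(loss)).
# The MSS, loss rate, and RTTs here are illustrative assumptions.
from math import sqrt

MSS_BITS = 1460 * 8     # payload per packet for a typical 1500-byte MTU
LOSS = 0.000046         # a "small" packet loss rate (0.0046%)

paths = {               # representative round-trip times in seconds
    "Local (LAN)":   0.001,
    "Metro Area":    0.005,
    "Regional":      0.020,
    "Continental":   0.070,
    "International": 0.150,
}

for name, rtt in paths.items():
    gbps = MSS_BITS / (rtt * sqrt(LOSS)) / 1e9
    print(f"{name:14s} RTT {rtt * 1000:5.1f} ms -> ~{gbps:5.2f} Gbps ceiling")

# At fixed loss, the ceiling falls as 1/RTT; equivalently, the loss a path
# can tolerate while sustaining a given rate shrinks with the square of the
# RTT. Loss that is harmless on a LAN is fatal on international paths.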