Grid'5000: Running a Large Instrument for Parallel and Distributed Computing ... - Frederic Desprez
The increasing complexity of available infrastructures (hierarchical, parallel, distributed, etc.) with specific features (caches, hyper-threading, dual core, etc.) makes it extremely difficult to build analytical models that allow for a satisfying prediction. Hence, it raises the question of how to validate algorithms and software systems if a realistic analytic study is not possible. As for many other sciences, one answer is experimental validation. However, such experiments rely on the availability of an instrument able to validate every level of the software stack and offering different hardware and software facilities for compute, storage, and network resources.
Almost ten years after its beginnings, the Grid'5000 testbed has become one of the most complete testbeds for designing or evaluating large-scale distributed systems. Initially dedicated to the study of large HPC facilities, Grid'5000 has evolved in order to address wider concerns related to Desktop Computing, the Internet of Services and more recently the Cloud Computing paradigm. We now target new processor features such as hyperthreading, turbo boost, and power management, as well as large applications managing big data. In this keynote we will address both the issue of experiments in HPC and computer science and the design and usage of the Grid'5000 platform for various kinds of applications.
How HPC and large-scale data analytics are transforming experimental science - inside-BigData.com
In this deck from DataTech19, Debbie Bard from NERSC presents: Supercomputing and the scientist: How HPC and large-scale data analytics are transforming experimental science.
"Debbie Bard leads the Data Science Engagement Group at NERSC. NERSC is the mission supercomputing center for the USA Department of Energy, and supports over 7000 scientists and 700 projects with supercomputing needs. A native of the UK, her career spans research in particle physics, cosmology and computing on both sides of the Atlantic. She obtained her PhD at Edinburgh University, and has worked at Imperial College London as well as the Stanford Linear Accelerator Center (SLAC) in the USA, before joining the Data Department at NERSC, where she focuses on data-intensive computing and research, including supercomputing for experimental science and machine learning at scale."
Watch the video: https://wp.me/p3RLHQ-kLV
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
My talk at the Winter School on Big Data in Tarragona, Spain.
Abstract: We have made much progress over the past decade toward harnessing the collective power of IT resources distributed across the globe. In high-energy physics, astronomy, and climate, thousands work daily within virtual computing systems with global scope. But we now face a far greater challenge: Exploding data volumes and powerful simulation tools mean that many more--ultimately most?--researchers will soon require capabilities not so different from those used by such big-science teams. How are we to meet these needs? Must every lab be filled with computers and every researcher become an IT specialist? Perhaps the solution is rather to move research IT out of the lab entirely: to leverage the “cloud” (whether private or public) to achieve economies of scale and reduce cognitive load. I explore the past, current, and potential future of large-scale outsourcing and automation for science, and suggest opportunities and challenges for today’s researchers.
High Performance Data Analytics with Java on Large Multicore HPC Clusters - Saliya Ekanayake
Within the last few years, there have been significant contributions to Java-based big data frameworks and libraries such as Apache Hadoop, Spark, and Storm. While these systems are rich in interoperability and features, developing high performance big data analytic applications is challenging. Also, the study of performance characteristics and high performance optimizations is lacking in the literature for these applications. By contrast, these features are well documented in the High Performance Computing (HPC) domain and some of the techniques have potential performance benefits in the big data domain as well. This paper presents the implementation of a high performance big data analytics library - SPIDAL Java - with a comprehensive discussion on five performance challenges, solutions, and speedup results. SPIDAL Java captures a class of global machine learning applications with significant computation and communication that can serve as a yardstick in studying performance bottlenecks with Java big data analytics. The five challenges presented here are the cost of intra-node messaging, inefficient cache utilization, performance costs with threads, overhead of garbage collection, and the costs of heap allocated objects. SPIDAL Java presents its solutions to these and demonstrates significant performance gains and scalability when running on up to 3072 cores in one of the latest Intel Haswell-based multicore clusters.
Accelerating Discovery via Science Services - Ian Foster
[A talk presented at Oak Ridge National Laboratory on October 15, 2015]
We have made much progress over the past decade toward harnessing the collective power of IT resources distributed across the globe. In big-science projects in high-energy physics, astronomy, and climate, thousands work daily within virtual computing systems with global scope. But we now face a far greater challenge: Exploding data volumes and powerful simulation tools mean that many more--ultimately most?--researchers will soon require capabilities not so different from those used by such big-science teams. How are we to meet these needs? Must every lab be filled with computers and every researcher become an IT specialist? Perhaps the solution is rather to move research IT out of the lab entirely: to develop suites of science services to which researchers can dispatch mundane but time-consuming tasks, and thus to achieve economies of scale and reduce cognitive load. I explore the past, current, and potential future of large-scale outsourcing and automation for science, and suggest opportunities and challenges for today’s researchers. I use examples from Globus and other projects to demonstrate what can be achieved.
Materials Data Facility: Streamlined and automated data sharing, discovery, ... - Ian Foster
Reviews recent results from the Materials Data Facility. Thanks in particular to Ben Blaiszik, Jonathon Goff, and Logan Ward, and the Globus data search team. Some features shown here are still in beta. We are grateful to NIST for their support.
The Matsu Project - Open Source Software for Processing Satellite Imagery Data - Robert Grossman
The Matsu Project is an Open Cloud Consortium project that is developing open source software for processing satellite imagery data using Hadoop, OpenStack and R.
In 2001, as early high-speed networks were deployed, George Gilder observed that “when the network is as fast as the computer's internal links, the machine disintegrates across the net into a set of special purpose appliances.” Two decades later, our networks are 1,000 times faster, our appliances are increasingly specialized, and our computer systems are indeed disintegrating. As hardware acceleration overcomes speed-of-light delays, time and space merge into a computing continuum. Familiar questions like “where should I compute,” “for what workloads should I design computers,” and “where should I place my computers” seem to allow for a myriad of new answers that are exhilarating but also daunting. Are there concepts that can help guide us as we design applications and computer systems in a world that is untethered from familiar landmarks like center, cloud, edge? I propose some ideas and report on experiments in coding the continuum.
Characterizing a High Throughput Computing Workload: The Compact Muon Solenoi... - Rafael Ferreira da Silva
Presentation held at ICCS 2015 Conference - Reykjavik, Iceland
High throughput computing (HTC) has aided the scientific community in the analysis of vast amounts of data and computational jobs in distributed environments. To manage these large workloads, several systems have been developed to efficiently allocate and provide access to distributed resources. Many of these systems rely on estimates of job characteristics (e.g., job runtime) to characterize the workload behavior, and these estimates are hard to obtain in practice. In this work, we perform an exploratory analysis of the CMS experiment workload using the statistical recursive partitioning method and conditional inference trees to identify patterns that characterize particular behaviors of the workload. We then propose an estimation process to predict job characteristics based on the collected data. Experimental results show that our process estimates job runtime with 75% accuracy on average, and produces nearly optimal predictions for disk and memory consumption.
More information: www.rafaelsilva.com
New learning technologies seem likely to transform much of science, as they are already doing for many areas of industry and society. We can expect these technologies to be used, for example, to obtain new insights from massive scientific data and to automate research processes. However, success in such endeavors will require new learning systems: scientific computing platforms, methods, and software that enable the large-scale application of learning technologies. These systems will need to enable learning from extremely large quantities of data; the management of large and complex data, models, and workflows; and the delivery of learning capabilities to many thousands of scientists. In this talk, I review these challenges and opportunities and describe systems that my colleagues and I are developing to enable the application of learning throughout the research process, from data acquisition to analysis.
A science-gateway for workflow executions: online and non-clairvoyant self-h... - Rafael Ferreira da Silva
PhD Thesis presented on November 29th 2013 at INSA-Lyon
Abstract - Science gateways, such as the Virtual Imaging Platform (VIP), enable transparent access to distributed computing and storage resources for scientific computations. However, their large scale and the number of middleware systems involved lead to many errors and faults. In practice, science gateways are often backed by substantial support staff who monitor running experiments by performing simple yet crucial actions such as rescheduling tasks, restarting services, killing misbehaving runs or replicating data files to reliable storage facilities. Fair quality of service (QoS) can then be delivered, yet with significant human intervention. Automating such operations is challenging for two reasons. First, the problem is online by nature because no reliable user activity prediction can be assumed, and new workloads may arrive at any time. Therefore, the considered metrics, decisions and actions have to remain simple and to yield results while the application is still executing. Second, it is non-clairvoyant due to the lack of information about applications and resources in production conditions. Computing resources are usually dynamically provisioned from heterogeneous clusters, clouds or desktop grids without any reliable estimate of their availability and characteristics. Models of application execution times are hardly available either, in particular on heterogeneous computing resources. In this thesis, we propose a general healing process for autonomous detection and handling of operational incidents in workflow executions. Instances are modeled as Fuzzy Finite State Machines (FuSM) where state degrees of membership are determined by an external healing process. Degrees of membership are computed from metrics assuming that incidents have outlier performance, e.g. a site or a particular invocation behaves differently than the others. Based on incident degrees, the healing process identifies incident levels using thresholds determined from the platform history. A specific set of actions is then selected from association rules among incident levels.
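The degree-to-level-to-action chain described in this abstract can be illustrated with a minimal Python sketch. It is a simplified stand-in and not the thesis's method: the degree formula (distance above the peer median), the threshold values, and the action names are all hypothetical, and the metric is assumed to be non-negative (for example, a task's elapsed time compared with its peers).

```python
from bisect import bisect_right
from statistics import median


def incident_degree(value, peer_values):
    """Hypothetical incident degree in [0, 1].

    Returns 0.0 when `value` is at or below the peer median, and approaches
    1.0 as `value` grows far above it. Assumes non-negative metrics.
    """
    m = median(peer_values)
    if value <= m or value <= 0:
        return 0.0
    return 1.0 - m / value


def incident_level(degree, thresholds):
    """Map a degree to a level index using ascending thresholds.

    In the abstract, thresholds come from the platform history; here they
    are simply passed in by the caller.
    """
    return bisect_right(thresholds, degree)


# Hypothetical level-to-action table standing in for association rules.
ACTIONS = {0: "none", 1: "replicate task", 2: "blacklist site"}


def select_action(value, peer_values, thresholds):
    """Chain the steps: metric -> degree -> level -> action."""
    level = incident_level(incident_degree(value, peer_values), thresholds)
    return ACTIONS[level]
```

For example, with peer runtimes `[10, 12, 11]` and thresholds `[0.3, 0.7]`, a task that has run for 22 time units gets degree 0.5 and falls into level 1, while one that has run for 110 units gets degree 0.9 and falls into level 2.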
For more information visit http://www.rafaelsilva.com
The Pacific Research Platform (PRP) aims to achieve transparent and rapid data access among collaborating scientists at multiple institutions through an integrated implementation of data-focused networking that extends the university campus Science DMZ model to a regional, national, and, eventually, a global scale.
PRP researchers are routinely achieving high-performance end-to-end networking from their labs to their collaborators’ labs and data centers, traversing multiple, heterogeneous Science DMZs and wide-area networks connecting multiple campus gateways, enabling researchers across the partnership to transfer data over dedicated optical lightpaths at speeds from 10Gb/s to 100Gb/s.
Within this tutorial we present the results of recent research about the cloud enablement of data streaming systems. We illustrate, based on both industrial and academic prototypes, newly emerging use cases and research trends. Specifically, we focus on novel approaches for (1) fault tolerance and (2) scalability in large-scale distributed streaming systems. In general, new fault tolerance mechanisms strive to be more robust and at the same time introduce less overhead. Novel load balancing approaches focus on elastic scaling over hundreds of instances based on the data and query workload. Finally, we present open challenges for the next generation of cloud-based data stream processing engines.
This talk will examine issues of workflow execution, in particular using the Pegasus Workflow Management System, on distributed resources and how these resources can be provisioned ahead of the workflow execution. Pegasus was designed, implemented and supported to provide abstractions that enable scientists to focus on structuring their computations without worrying about the details of the target cyberinfrastructure. To support these workflow abstractions Pegasus provides automation capabilities that seamlessly map workflows onto target resources, sparing scientists the overhead of managing the data flow, job scheduling, fault recovery and adaptation of their applications. In some cases, it is beneficial to provision the resources ahead of the workflow execution, enabling the re-use of resources across workflow tasks. The talk will examine the benefits of resource provisioning for workflow execution.
From Jisc's campus network engineering for data-intensive science workshop on 19 October 2016.
https://www.jisc.ac.uk/events/campus-network-engineering-for-data-intensive-science-workshop-19-oct-2016
Matching Data Intensive Applications and Hardware/Software Architectures - Geoffrey Fox
There is perhaps a broad consensus as to important issues in practical parallel computing as applied to large scale simulations; this is reflected in supercomputer architectures, algorithms, libraries, languages, compilers and best practice for application development. However, the same is not so true for data intensive problems even though commercial clouds presumably devote more resources to data analytics than supercomputers devote to simulations. We try to establish some principles that allow one to compare data intensive architectures and decide which applications fit which machines and which software.
We use a sample of over 50 big data applications to identify characteristics of data intensive applications and propose a big data version of the famous Berkeley dwarfs and NAS parallel benchmarks. We consider hardware from clouds to HPC. Our software analysis builds on the Apache software stack (ABDS) that is well used in modern cloud computing, which we enhance with HPC concepts to derive HPC-ABDS.
We illustrate issues with examples including kernels like clustering, and multi-dimensional scaling; cyberphysical systems; databases; and variants of image processing from beam lines, Facebook and deep-learning.
The Discovery Cloud: Accelerating Science via Outsourcing and Automation - Ian Foster
Director's Colloquium at Los Alamos National Laboratory, September 18, 2014.
We have made much progress over the past decade toward harnessing the collective power of IT resources distributed across the globe. In high-energy physics, astronomy, and climate, thousands work daily within virtual computing systems with global scope. But we now face a far greater challenge: Exploding data volumes and powerful simulation tools mean that many more--ultimately most?--researchers will soon require capabilities not so different from those used by such big-science teams. How are we to meet these needs? Must every lab be filled with computers and every researcher become an IT specialist? Perhaps the solution is rather to move research IT out of the lab entirely: to leverage the “cloud” (whether private or public) to achieve economies of scale and reduce cognitive load. In this talk, I explore the past, current, and potential future of large-scale outsourcing and automation for science.
Presentation by Dr. Frank Wuerthwein from the University of California at San Diego at the International Super Computing Conference on Big Data, 2013, US. Until recently, the large CERN experiments, ATLAS and CMS, owned and controlled the computing infrastructure they operated on in the US, and accessed data only when it was locally available on the hardware they operated. However, Würthwein explains, with data-taking rates set to increase dramatically by the end of LS1 in 2015, the current operational model is no longer viable to satisfy peak processing needs. Instead, he argues, large-scale processing centers need to be created dynamically to cope with spikes in demand. To this end, Würthwein and colleagues carried out a successful proof-of-concept study, in which the Gordon Supercomputer at the San Diego Supercomputer Center was dynamically and seamlessly integrated into the CMS production system to process a 125-terabyte data set.
Parsl: Pervasive Parallel Programming in Python - Daniel S. Katz
A seminar presented at the School of Computer Science at the University of St Andrews, 18 October 2019 (see https://blogs.cs.st-andrews.ac.uk/csblog/2019/09/25/daniel-katz-parsl/)
In the era of big data, even with large infrastructure available, stored data varies in size, format, variety, and volume, and is spread across several platforms such as Hadoop and the cloud. This raises the problem of how an application should process data that varies in size and format. A workflow in which the application data and the available resources vary at run time is called a dynamic workflow. Using large infrastructure and a huge amount of resources to analyse data is time-consuming and wasteful, so it is better to use a scheduling algorithm to analyse the given data set efficiently, and to evaluate which scheduling algorithm is best suited to that data set. We evaluate different data sets to understand which algorithm is most suitable for efficient analysis, and store the data after analysis.
The Matsu Project - Open Source Software for Processing Satellite Imagery DataRobert Grossman
The Matsu Project is an Open Cloud Consortium project that is developing open source software for processing satellite imagery data using Hadoop, OpenStack and R.
In 2001, as early high-speed networks were deployed, George Gilder observed that “when the network is as fast as the computer's internal links, the machine disintegrates across the net into a set of special purpose appliances.” Two decades later, our networks are 1,000 times faster, our appliances are increasingly specialized, and our computer systems are indeed disintegrating. As hardware acceleration overcomes speed-of-light delays, time and space merge into a computing continuum. Familiar questions like “where should I compute,” “for what workloads should I design computers,” and "where should I place my computers” seem to allow for a myriad of new answers that are exhilarating but also daunting. Are there concepts that can help guide us as we design applications and computer systems in a world that is untethered from familiar landmarks like center, cloud, edge? I propose some ideas and report on experiments in coding the continuum.
Characterizing a High Throughput Computing Workload: The Compact Muon Solenoi...Rafael Ferreira da Silva
Presentation held at ICCS 2015 Conference - Reykjavik, Iceland
High throughput computing (HTC) has aided the scientific community in the analysis of vast amounts of data and computational jobs in distributed environments. To manage these large workloads, several systems have been developed to efficiently allocate and provide access to distributed resources. Many of these systems rely on job characteristics estimates (e.g., job runtime) to characterize the workload behavior, which in practice is hard to obtain. In this work, we perform an exploratory analysis of the CMS experiment workload using the statistical recursive partitioning method and conditional inference trees to identify patterns that characterize particular behaviors of the workload. We then propose an estimation process to predict job characteristics based on the collected data. Experimental results show that our process estimates job runtime with 75% of accuracy on average, and produces nearly optimal predictions for disk and memory consumption.
More information: www.rafaelsilva.com
New learning technologies seem likely to transform much of science, as they are already doing for many areas of industry and society. We can expect these technologies to be used, for example, to obtain new insights from massive scientific data and to automate research processes. However, success in such endeavors will require new learning systems: scientific computing platforms, methods, and software that enable the large-scale application of learning technologies. These systems will need to enable learning from extremely large quantities of data; the management of large and complex data, models, and workflows; and the delivery of learning capabilities to many thousands of scientists. In this talk, I review these challenges and opportunities and describe systems that my colleagues and I are developing to enable the application of learning throughout the research process, from data acquisition to analysis.
A science-gateway for workflow executions: online and non-clairvoyant self-h...Rafael Ferreira da Silva
PhD Thesis presented on November 29th 2013 at INSA-Lyon
Abstract - Science gateways, such as the Virtual Imaging Platform (VIP), enable transparent access to distributed computing and storage resources for scientific computations. However, their large scale and the number of middleware systems involved lead to many errors and faults. In practice, science gateways are often backed by substantial support staff who monitors running experiments by performing simple yet crucial actions such as rescheduling tasks, restarting services, killing misbehaving runs or replicating data files to reliable storage facilities. Fair quality of service (QoS) can then be delivered, yet with important human intervention. Automating such operations is challenging for two reasons. First, the problem is online by nature because no reliable user activity prediction can be assumed, and new workloads may arrive at any time. Therefore, the considered metrics, decisions and actions have to remain simple and to yield results while the application is still executing. Second, it is non-clairvoyant due to the lack of information about applications and resources in production conditions. Computing resources are usually dynamically provisioned from heterogeneous clusters, clouds or desktop grids without any reliable estimate of their availability and characteristics. Models of application execution times are hardly available either, in particular on heterogeneous computing resources. In this thesis, we propose a general healing process for autonomous detection and handling of operational incidents in workflow executions. Instances are modeled as Fuzzy Finite State Machines (FuSM) where state degrees of membership are determined by an external healing process. Degrees of membership are computed from metrics assuming that incidents have outlier performance, e.g. a site or a particular invocation behaves differently than the others. 
Based on incident degrees, the healing process identifies incident levels using thresholds determined from the platform history. A specific set of actions is then selected from association rules among incident levels.
For more information visit http://www.rafaelsilva.com
The Pacific Research Platform (PRP) aims to achieve transparent and rapid data access among collaborating scientists at multiple institutions through an integrated implementation of data-focused networking that extends the university campus Science DMZ model to a regional, national, and, eventually, a global scale.
PRP researchers are routinely achieving high-performance end-to-end networking from their labs to their collaborators’ labs and data centers, traversing multiple, heterogeneous Science DMZs and wide-area networks connecting multiple campus gateways, enabling researchers across the partnership to transfer data over dedicated optical lightpaths at speeds from 10Gb/s to 100Gb/s.
Within this tutorial we present the results of recent research about the cloud enablement of data streaming systems. We illustrate, based on both industrial and academic prototypes, newly emerging use cases and research trends. Specifically, we focus on novel approaches for (1) fault tolerance and (2) scalability in large scale distributed streaming systems. In general, new fault tolerance mechanisms strive to be more robust and at the same time introduce less overhead. Novel load balancing approaches focus on elastic scaling over hundreds of instances based on the data and query workload. Finally, we present open challenges for the next generation of cloud-based data stream processing engines.
This talk will examine issues of workflow execution, in particular using the Pegasus Workflow Management System, on distributed resources and how these resources can be provisioned ahead of the workflow execution. Pegasus was designed, implemented and supported to provide abstractions that enable scientists to focus on structuring their computations without worrying about the details of the target cyberinfrastructure. To support these workflow abstractions Pegasus provides automation capabilities that seamlessly map workflows onto target resources, sparing scientists the overhead of managing the data flow, job scheduling, fault recovery and adaptation of their applications. In some cases, it is beneficial to provision the resources ahead of the workflow execution, enabling the re-use of resources across workflow tasks. The talk will examine the benefits of resource provisioning for workflow execution.
From Jisc's campus network engineering for data-intensive science workshop on 19 October 2016.
https://www.jisc.ac.uk/events/campus-network-engineering-for-data-intensive-science-workshop-19-oct-2016
Matching Data Intensive Applications and Hardware/Software ArchitecturesGeoffrey Fox
There is perhaps a broad consensus as to important issues in practical parallel computing as applied to large scale simulations; this is reflected in supercomputer architectures, algorithms, libraries, languages, compilers and best practice for application development. However the same is not so true for data intensive problems even though commercial clouds presumably devote more resources to data analytics than supercomputers devote to simulations. We try to establish some principles that allow one to compare data intensive architectures and decide which applications fit which machines and which software.
We use a sample of over 50 big data applications to identify characteristics of data intensive applications and propose a big data version of the famous Berkeley dwarfs and NAS parallel benchmarks. We consider hardware from clouds to HPC. Our software analysis builds on the Apache software stack (ABDS) that is well used in modern cloud computing, which we enhance with HPC concepts to derive HPC-ABDS.
We illustrate issues with examples including kernels like clustering, and multi-dimensional scaling; cyberphysical systems; databases; and variants of image processing from beam lines, Facebook and deep-learning.
The Discovery Cloud: Accelerating Science via Outsourcing and AutomationIan Foster
Director's Colloquium at Los Alamos National Laboratory, September 18, 2014.
We have made much progress over the past decade toward harnessing the collective power of IT resources distributed across the globe. In high-energy physics, astronomy, and climate, thousands work daily within virtual computing systems with global scope. But we now face a far greater challenge: Exploding data volumes and powerful simulation tools mean that many more--ultimately most?--researchers will soon require capabilities not so different from those used by such big-science teams. How are we to meet these needs? Must every lab be filled with computers and every researcher become an IT specialist? Perhaps the solution is rather to move research IT out of the lab entirely: to leverage the “cloud” (whether private or public) to achieve economies of scale and reduce cognitive load. In this talk, I explore the past, current, and potential future of large-scale outsourcing and automation for science.
Presentation by Dr. Frank Wuerthwein of the University of California at San Diego at the International Super Computing Conference on Big Data, 2013, US. Until recently, the large CERN experiments, ATLAS and CMS, owned and controlled the computing infrastructure they operated on in the US, and accessed data only when it was locally available on the hardware they operated. However, Würthwein explains, with data-taking rates set to increase dramatically by the end of LS1 in 2015, the current operational model is no longer viable to satisfy peak processing needs. Instead, he argues, large-scale processing centers need to be created dynamically to cope with spikes in demand. To this end, Würthwein and colleagues carried out a successful proof-of-concept study, in which the Gordon Supercomputer at the San Diego Supercomputer Center was dynamically and seamlessly integrated into the CMS production system to process a 125-terabyte data set.
Parsl: Pervasive Parallel Programming in PythonDaniel S. Katz
A seminar presented at the School of Computer Science at the University of St Andrews, 18 October 2019 (see https://blogs.cs.st-andrews.ac.uk/csblog/2019/09/25/daniel-katz-parsl/)
In the era of big data, even with large infrastructure available, stored data varies in size, format, variety and volume and is spread across several platforms such as Hadoop and the cloud, which raises the problem of how an application should process data of varying size and format. A workflow in which the application data and the available resources vary at run time is called a dynamic workflow. Using large infrastructure and huge amounts of resources to analyse data is time consuming and wasteful; it is better to use a scheduling algorithm to analyse a given data set so that it executes efficiently, and to evaluate which scheduling algorithm is best suited to that data set. We evaluate different data sets to understand which algorithm is most suitable for analysing the data, executing the data set efficiently and storing the data after analysis.
Presented by Ahmed Abdulhakim Al-Absi - Scaling map reduce applications acro...Absi Ahmed
Scaling map reduce applications across hybrid clouds to meet soft deadlines - By Michael Mattess, Rodrigo N. Calheiros, and Rajkumar Buyya, Proceedings of the 27th IEEE International Conference on Advanced Information Networking and Applications (AINA 2013, IEEE CS Press, USA), Barcelona, Spain, March 25-28, 2013.
A time efficient and accurate retrieval of range aggregate queries using fuzz...IJECEIAES
Massive growth in big data makes it difficult to analyse and retrieve useful information from the available data. Existing approaches cannot guarantee efficient retrieval of data from the database. In existing work, stratified sampling is used to partition the tables in terms of stratification variables. However, the k-means clustering algorithm cannot guarantee efficient retrieval, because choosing centroids in a large volume of data is difficult, and limited knowledge about the stratification variable may lead to less efficient partitioning of the tables. The proposed methodology overcomes this problem by introducing FCM clustering instead of k-means clustering, which can cluster large volumes of data that are similar in nature. The stratification problem is overcome by introducing a post-stratification approach, which leads to efficient selection of the stratification variable. This methodology leads to an efficient retrieval process for user queries, with less time and more accuracy.
An enhanced adaptive scoring job scheduling algorithm with replication strate...eSAT Publishing House
IJRET: International Journal of Research in Engineering and Technology is an international peer-reviewed online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academicians, Field Engineers, Scholars and Students of related fields of Engineering and Technology.
The Impact of Data Replication on Job Scheduling Performance in Hierarchical ...graphhoc
In data-intensive applications, data transfer is a primary cause of job execution delay. Data access time depends on bandwidth. The major bottleneck to supporting fast data access in Grids is the high latency of Wide Area Networks and the Internet. Effective scheduling can reduce the amount of data transferred across the Internet by dispatching a job to where the needed data are present. Another solution is to use a data replication mechanism. The objective of dynamic replica strategies is to reduce file access time, which in turn reduces job runtime. In this paper we develop a job scheduling policy and a dynamic data replication strategy, called HRS (Hierarchical Replication Strategy), to improve data access efficiency. We study our approach and evaluate it through simulation. The results show that our algorithm improves on the current strategies by 12%.
Task scheduling is a key process in large-scale distributed systems such as cloud computing infrastructures, and it can have a significant effect on system performance. The problem is NP-hard, for reasons such as the heterogeneous and dynamic nature of the systems and the dependencies among requests. Here, we propose a bi-objective method called DWSGA to obtain a suitable solution for allocating requests to resources. The purpose of this algorithm is to obtain a response quickly, using goal-oriented operations. First, it builds a good initial population using bidirectional task prioritization. The algorithm then moves deliberately towards the most appropriate solution by optimizing the makespan while maintaining a good distribution of workload across resources, using efficient parameters of the systems in question. The experiments indicate that DWSGA improves the results with respect to these objectives as the number of tasks in the application graph increases. The results are compared with the other algorithms studied.
Grid computing can involve many computational tasks, which require trustworthy computational nodes. Load balancing in grid computing is a technique that optimizes the overall process of assigning computational tasks to processing nodes. Grid computing is a form of distributed computing, but it differs from conventional distributed computing in that it tends to be heterogeneous, more loosely coupled and geographically dispersed. Optimizing this process must maximize overall resource utilization while balancing the load on each processing unit and decreasing the overall time or output. Evolutionary algorithms such as genetic algorithms have been studied for implementing load balancing across grid networks. The problem with these genetic algorithms is that they are quite slow when a large number of tasks needs to be processed. In this paper we present a novel approach based on parallel genetic algorithms for improving the overall performance and optimization of load balancing across grid nodes.
Web Oriented FIM for large scale dataset using Hadoopdbpublications
In large-scale datasets, existing parallel algorithms for mining frequent itemsets balance the load by distributing the enormous data across a collection of computers. However, we identify a significant performance issue in existing mining algorithms [1]. To handle this problem, we introduce a new approach: data partitioning using the MapReduce programming model. In our proposed system, we introduce a technique called the frequent itemset ultrametric tree in place of conventional FP-trees. Experimental results show that eliminating redundant transactions improves performance by reducing computing loads.
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...acijjournal
Apriori is one of the key algorithms to generate frequent itemsets. Analysing frequent itemset is a crucial
step in analysing structured data and in finding association relationship between items. This stands as an
elementary foundation to supervised learning, which encompasses classifier and feature extraction
methods. Applying this algorithm is crucial to understand the behaviour of structured data. Most of the
structured data in scientific domain are voluminous. Processing such kind of data requires state of the art
computing machines. Setting up such an infrastructure is expensive. Hence a distributed environment
such as a clustered setup is employed for tackling such scenarios. Apache Hadoop distribution is one of
the cluster frameworks in distributed environment that helps by distributing voluminous data across a
number of nodes in the framework. This paper focuses on map/reduce design and implementation of
Apriori algorithm for structured data analysis.
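As a rough illustration of what a map/reduce formulation of Apriori can look like, here is a small single-process sketch. It does not use Hadoop and is not the paper's implementation: the function names (`map_phase`, `reduce_phase`, `apriori`) and the sample transactions are hypothetical. The map step emits a count of 1 for each candidate itemset found in a transaction, keeping only candidates whose smaller subsets were frequent in the previous pass (the Apriori pruning rule); the reduce step sums the counts and applies the minimum support.

```python
from collections import Counter
from itertools import combinations

def map_phase(transaction, k, prev_frequent):
    """Emit (itemset, 1) for each k-item candidate contained in one transaction."""
    items = sorted(set(transaction))
    for candidate in combinations(items, k):
        # Apriori pruning: every (k-1)-subset must already be frequent.
        if k == 1 or all(sub in prev_frequent
                         for sub in combinations(candidate, k - 1)):
            yield candidate, 1

def reduce_phase(pairs, min_support):
    """Sum the emitted counts and keep itemsets that reach min_support."""
    counts = Counter()
    for itemset, one in pairs:
        counts[itemset] += one
    return {s: c for s, c in counts.items() if c >= min_support}

def apriori(transactions, min_support):
    """Run one map/reduce pass per itemset size until nothing is frequent."""
    result, prev, k = {}, set(), 1
    while True:
        pairs = [p for t in transactions for p in map_phase(t, k, prev)]
        frequent = reduce_phase(pairs, min_support)
        if not frequent:
            return result
        result.update(frequent)
        prev, k = set(frequent), k + 1

# Hypothetical sample data.
transactions = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"]]
frequent = apriori(transactions, min_support=3)
```

In a real cluster setting, each mapper would process a partition of the transactions and the framework would group the emitted pairs by itemset before the reduce step.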
(R)evolution of the computing continuum - A few challengesFrederic Desprez
Initially proposed to interconnect computers worldwide, the Internet has significantly evolved to become in two decades a key element in almost all our activities. This (r)evolution mainly relies on the progress that has been achieved in computation and communication fields and that has led to the well-known and widely spread Cloud Computing paradigm.
With the emergence of the Internet of Things (IoT), stakeholders expect a new revolution that will push, once again, the limits of the Internet, in particular by favouring the convergence between physical and virtual worlds. This convergence is about to be made possible thanks to the development of minimalist sensors as well as complex industrial physical machines that can be connected to the Internet through edge computing infrastructures.
Among the obstacles to this new generation of Internet services is the development of a convenient and powerful framework that should allow operators, and devops, to manage the life-cycle of both the digital infrastructures and the applications deployed on top of these infrastructures, throughout the cloud to IoT continuum.
In this keynote, Frédéric Desprez and his colleague Adrien Lebre present research issues and provide preliminary answers to identify whether the challenges brought by this new paradigm are an evolution or a revolution for our community.
SILECS/SLICES - Super Infrastructure for Large-Scale Experimental Computer Sc...Frederic Desprez
The aim of the SILECS and SLICES projects is to design and build a large infrastructure for experimental research on various aspects of distributed computing, from small connected objects to the large data centres of tomorrow. This infrastructure will allow end-to-end experimentation with software and applications at all levels of the software layers, from event capture (sensors, actuators) to data processing and storage, to radio transmission management and dynamic deployment of edge computing services, enabling reproducible research on all-point programmable networks, ... SILECS is the French node of a European infrastructure called SLICES.
Super Infrastructure for Large-Scale Experimental Computer Science, (Almost) everything you wanted to know about SILECS/SLICES but didn't dare to ask. Presentation at "journées du GDR RSD", Nantes, Jan. 23, 2020.
SILECS: Super Infrastructure for Large-scale Experimental Computer ScienceFrederic Desprez
SILECS, based on two existing infrastructures (FIT and Grid'5000), aims to provide a large, robust, trustworthy and scalable instrument for research in
distributed computing and networks. Experiments from the Internet of Things, data centers, cloud computing, security services, and the networks
connecting them will be possible, in a reproducible way, on various hardware and software. This instrument will offer a multi-platform experimental
infrastructure (HPC, Cloud, Big Data, Software Defined Storage, IoT, wireless, Software Defined Network / Radio) capable of exploring the
infrastructures that will be deployed tomorrow and of assisting researchers and industry in designing, building and operating multi-scale, robust and
safe computer systems. Diverse digital resources (compute, storage, link, IO devices) are assembled to support a “playground” at scale.
Challenges and Issues of Next Cloud Computing PlatformsFrederic Desprez
Cloud computing has now crossed the frontiers of research to reach industry. It is used every day, whether to exchange emails or make reservations on web sites. However, much research remains to be done to improve the performance and functionality of the platforms of tomorrow. In this talk, I will give an overview of some of the theoretical and applied research done at INRIA, particularly around Cloud distribution, energy monitoring and management, massive data processing and exchange, and resource management.
34. Deployment example with Universities. SeD = Server Daemon, installed on any server running LoadLeveler; rescue SeDs can be defined. MA = Master Agent, coordinates jobs; rescue or multiple Master Agents can be defined. WN = worker node. http://www.decrypthon.fr/
[Diagram labels: sites ORSAY, BORDEAUX, LILLE, JUSSIEU, Lyon, CRIHAN; components SeD LoadLeveler (x4), Project Users, Web Interface, Master Agent DIET Décrypthon, Orsay Decrypthon1, Orsay Decrypthon2, DB2, BD AFM, Cliniques, IBM WII, Data manager Interface]
36. Data management Credits: H. N’Guyen, O. Poch, IGBMC Décrypthon Grid - Grid Resources Dedicated to Neuromuscular Disorders, Bard, N., Bolze, R., Caron, E., Desprez, F., Heymann, M., Friedrich, A., Moulinier, L., Nguyen, N.-H., Poch, O. and Toursel, T., 8th HealthGrid conference, Paris, France, June, 2010.
By comparing two sequences, one of which has a known role, one hopes that, if they are similar, the role of the new sequence in a protein or in a gene can be found. The databases used are simply flat files containing sequences together with their descriptions.
If the database is not "installed" in DAGDA, the user passes it as a parameter; otherwise the shared-identifier system of DAGDA is used (aliases on the data, which avoid having to use a UUID that is impractical to share or exchange). Note: splitting the input file and merging the results have a negligible cost compared with the cost of running the BLAST jobs.
When a job is sent to a node that does not have the data, the node downloads it and places it in a cache, on which LRU is used to select the data to delete when space is needed. Until the data is deleted, it remains available to the other nodes (so it is a replica).
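The cache behaviour described in this note (download on a miss, keep the item as a replica, evict with LRU when space is needed) can be sketched as follows. This is an illustrative sketch only, not DAGDA code: the class name `ReplicaCache`, its methods and the sizes in MB are hypothetical.

```python
from collections import OrderedDict

class ReplicaCache:
    """Size-bounded cache that evicts the least recently used item first."""

    def __init__(self, capacity_mb):
        self.capacity_mb = capacity_mb
        self.used_mb = 0
        self._items = OrderedDict()  # data id -> size in MB, oldest first

    def get(self, data_id):
        """Return True on a hit and mark the item as most recently used."""
        if data_id in self._items:
            self._items.move_to_end(data_id)
            return True
        return False

    def put(self, data_id, size_mb):
        """Store a downloaded item; return the ids evicted to make room."""
        if size_mb > self.capacity_mb:
            raise ValueError("item is larger than the cache")
        if data_id in self._items:
            self.used_mb -= self._items.pop(data_id)
        evicted = []
        while self.used_mb + size_mb > self.capacity_mb:
            old_id, old_size = self._items.popitem(last=False)
            self.used_mb -= old_size
            evicted.append(old_id)
        self._items[data_id] = size_mb
        self.used_mb += size_mb
        return evicted

cache = ReplicaCache(capacity_mb=10)
cache.put("db_a", 4)
cache.put("db_b", 4)
cache.get("db_a")                 # db_a becomes the most recently used
evicted = cache.put("db_c", 4)    # db_b is now the oldest entry
```

Until an entry is evicted, `get` keeps answering for it, which is what lets a cached copy act as a replica for other nodes.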
In three cases, the best result goes to the scheduling policy that runs the job where the data is present, regardless of how the data was replicated (random or least loaded). Note that when nothing is replicated, the best response time is obtained when the job is executed where it was submitted.
5 databases ranging in size from 150 MB to 5 GB. 5 algorithms: blastn, blastp, blastx, tblastn and tblastx (DNA vs. DNA, protein vs. protein, DNA vs. protein, etc.). The longest is tblastx (about 20 times longer than blastn). The peak at the start of the SRA graph: the submitted job sets are small and the replications finish after the last job has been submitted, so the mean job time is high, since all jobs are scheduled on the same machines. The waiting time decreases as the replications complete, and the last jobs submitted benefit fully from the replications. MCT sends each job to the node that will execute it fastest at the time of its submission: it will often copy the databases to the fastest site, even if that means deleting a very large database that will take a long time to retransmit later. It does this anyway for the longest jobs (tblastx). For the fastest jobs (blastn), it favors the nodes that already hold the database, even if they are slow and even when there are very many of them.
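The MCT behaviour described in this note can be illustrated with a small sketch of a minimum-completion-time choice that includes the cost of fetching a missing database. This is an assumption-laden illustration, not the DIET plugin scheduler: the `Node` fields, the cost model (ready time + transfer time + work / speed) and all numbers are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    speed: float                 # work units per second (hypothetical)
    bandwidth: float             # MB per second when fetching a database
    ready_at: float = 0.0        # time at which the node becomes free
    databases: set = field(default_factory=set)

def completion_time(node, work, db, db_size_mb):
    """Estimated finish time of a job on `node`, including any transfer."""
    transfer = 0.0 if db in node.databases else db_size_mb / node.bandwidth
    return node.ready_at + transfer + work / node.speed

def schedule_mct(nodes, work, db, db_size_mb):
    """Assign the job to the node with the smallest estimated finish time."""
    best = min(nodes, key=lambda n: completion_time(n, work, db, db_size_mb))
    best.ready_at = completion_time(best, work, db, db_size_mb)
    best.databases.add(db)       # the fetched copy stays on the node
    return best

fast = Node("fast", speed=10.0, bandwidth=100.0)
slow = Node("slow", speed=1.0, bandwidth=100.0, databases={"nt"})
nodes = [fast, slow]

short_job = schedule_mct(nodes, work=5, db="nt", db_size_mb=1000)
long_job = schedule_mct(nodes, work=100, db="nt", db_size_mb=1000)
```

With these numbers the short job stays on the slow node that already holds the database, while the long job justifies copying the database to the fast node, which matches the tendency described in the note.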
Optorsim
Explicit: the user explicitly decides to replicate the data. Implicit: the calls to services trigger the data replications. Unlike DTM, data are replicated, not moved. Direct access to stored data, plus direct addition of data into DIET. Automatic data management: when data must be installed on a node that no longer has enough space, a data item is deleted using an algorithm chosen in the node's configuration. Transfer optimization: the "best" source for a data item is chosen based on statistics gathered during previous transfers. Storage usage management: one can choose how much memory and how much disk space are reserved for the data managed by DAGDA. Data backup/restoration: the current state of the data distribution can be saved and restored when DIET restarts. (For example, a reservation is coming to an end and the experiment is to be continued later. On restart, the data are put back as they were before the shutdown.)
Unlike DTM, it is the SeD that downloads the data, rather than the client pushing them unilaterally. Only the data descriptions (type, size, etc.) are sent with the requests. If the maximum size of the messages sent by DAGDA has been configured, data that are too large are sent in several parts. This also limits the amount of memory needed for transfers. DTM loads everything into memory before sending the data.
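The idea of sending oversized data in several messages, so that only one message-sized piece is held in memory at a time, can be sketched as below. This is a generic illustration and not the DAGDA transfer code; the function name `iter_messages` and the parameter `max_message_size` are hypothetical.

```python
import io

def iter_messages(stream, max_message_size):
    """Yield successive chunks of at most `max_message_size` bytes.

    Reading from a stream means the whole payload is never loaded at once.
    """
    if max_message_size <= 0:
        raise ValueError("max_message_size must be positive")
    while True:
        chunk = stream.read(max_message_size)
        if not chunk:
            break
        yield chunk

# In-memory stream used only to keep the example self-contained.
payload = io.BytesIO(b"abcdefghij")
messages = list(iter_messages(payload, max_message_size=4))
```

The receiver can append the chunks in order to rebuild the original data.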
The DAGDA "core" handles data identification and lookup as well as the choice of sources and destinations for transfers. The extended elements of DAGDA handle the resource limits set by users and data backup/restoration. The API makes it possible to access or add data directly in the platform and to launch replications.
A request is a set of sequences to BLAST against a given database. A sub-request is a subset of those sequences to BLAST against the same database.
Use of plugin schedulers
Maximum division: if the initial request file contains n sequences, n request files are created, each containing a single sequence. Division into n sub-requests: with n nodes available, n sub-requests of identical size are created. Each node has only one request to process. With Random, MCT and Round-Robin, multiplying the requests causes overhead that is not offset by the scheduling. The best option remains to split the requests into as many parts as there are available nodes. With SRA, the more requests there are, the more reliable the frequencies, and so the more efficient the algorithm. The overhead is offset by the scheduling. Overall, Dynamic-SRA is better, even when splitting the request into n parts, provided there are enough nodes (here 300 SeDs): with 300 files, the frequencies are reasonably accurate. With fewer nodes, and therefore fewer requests, the frequencies become increasingly approximate, and since it is the throughput of the platform that is being optimized, Dynamic-SRA gets progressively worse.
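The two splitting strategies described in this note can be sketched as follows. The function names are hypothetical and the sequences are plain strings here; when the count does not divide evenly, this sketch gives the first parts one extra sequence, which is an assumption since the note only speaks of parts of identical size.

```python
def split_maximal(sequences):
    """Maximum division: one sub-request per sequence."""
    return [[seq] for seq in sequences]

def split_even(sequences, n_nodes):
    """Division into at most n_nodes sub-requests of near-identical size."""
    if n_nodes <= 0 or not sequences:
        return []
    n = min(n_nodes, len(sequences))
    base, extra = divmod(len(sequences), n)
    parts, start = [], 0
    for i in range(n):
        size = base + (1 if i < extra else 0)  # spread the remainder
        parts.append(sequences[start:start + size])
        start += size
    return parts

sequences = ["seq1", "seq2", "seq3", "seq4", "seq5", "seq6", "seq7"]
per_sequence = split_maximal(sequences)
per_node = split_even(sequences, n_nodes=3)
```

Merging the results afterwards only requires concatenating the outputs in the order of the parts.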
The algorithms have different complexities. BLASTN is the fastest (DNA => 4-letter alphabet vs. DNA). BLASTP: protein => 20-letter alphabet vs. proteins. BLASTX: DNA translated into protein vs. proteins (DNA translation + BLASTP). TBLASTX: the longest => DNA translated into protein vs. a DNA database translated into proteins (translation of all the sequences, then BLASTP). Overall, changing the frequencies has little influence on MCT.