The document discusses wait-free data structures and task scheduling algorithms for embedded multi-core systems. It presents the objectives of evaluating existing approaches, implementing real-time compliant wait-free data structures, and defining benchmark scenarios. Specific data structures covered include pools, queues, and stacks. Pools are adapted using a compartment approach. The Kogan-Petrank queue is modified to remove its phase counter. Evaluation focuses on latency and buffering scenarios.
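The compartment approach for pools can be sketched as follows (a hypothetical illustration, not the document's actual implementation; all names are invented): the pool's slots are statically partitioned into per-thread compartments, so each thread allocates and releases only within its own compartment, never contends with others, and every operation completes in a bounded number of steps, which is the essence of wait-freedom.

```python
class CompartmentPool:
    """Pool whose slots are partitioned into per-thread compartments."""

    def __init__(self, num_threads, slots_per_thread):
        # Each compartment is a private free-list of slot indices.
        self.compartments = [
            list(range(t * slots_per_thread, (t + 1) * slots_per_thread))
            for t in range(num_threads)
        ]

    def allocate(self, thread_id):
        """Take a slot from the calling thread's own compartment (O(1))."""
        free = self.compartments[thread_id]
        return free.pop() if free else None  # None: compartment exhausted

    def release(self, thread_id, slot):
        """Return a slot to the owning thread's compartment (O(1))."""
        self.compartments[thread_id].append(slot)
```

Since no compartment is ever touched by two threads, no synchronization is needed at all in this simplified model; a real implementation must still handle cross-thread frees.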
Efficient Dynamic Scheduling Algorithm for Real-Time MultiCore Systems iosrjce
An imprecise computation model is used in a dynamic scheduling algorithm with a heuristic function to schedule task sets. A task is characterized by its ready time, worst-case computation time, deadline, and resource requirements. A task that cannot meet its deadline and resource requirements on time is split into a mandatory part and an optional part. These sub-tasks can execute concurrently on multiple cores, exploiting the parallelism provided by the multi-core system. The mandatory part produces an acceptable result, while the optional part refines it further. To study the effectiveness of the proposed scheduling algorithm, extensive simulation studies were carried out, comparing its performance with the myopic and improved myopic scheduling algorithms. The simulations show that the schedulability of the task-split myopic algorithm is consistently higher than that of the myopic and improved myopic algorithms.
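The split described in this abstract can be illustrated with a small sketch (the split policy and the 60/40 fraction are invented for illustration, not the paper's heuristic): a task that cannot meet its deadline on one core is divided into a mandatory part and an optional part that run concurrently on two cores, and the task is considered schedulable if the longer part still finishes before the deadline.

```python
def can_meet_deadline(ready, wcet, deadline):
    """True if the whole task fits before its deadline on one core."""
    return ready + wcet <= deadline

def split_task(ready, wcet, deadline, mandatory_fraction=0.6):
    """Split a failing task into (mandatory, optional) computation times.
    The fraction is an assumed tuning knob, not the paper's value."""
    if can_meet_deadline(ready, wcet, deadline):
        return wcet, 0.0  # no split needed
    mandatory = wcet * mandatory_fraction
    return mandatory, wcet - mandatory

def schedulable_after_split(ready, wcet, deadline):
    """Both parts run concurrently on separate cores, so the task meets
    its deadline if the longer part does."""
    mandatory, optional = split_task(ready, wcet, deadline)
    return ready + max(mandatory, optional) <= deadline
```

For example, a task with ready time 0, worst-case time 10, and deadline 8 fails on one core but becomes schedulable once its 6-unit mandatory part and 4-unit optional part run in parallel.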
(Paper) Task scheduling algorithm for multicore processor system for minimiz...Naoki Shibata
Shohei Gotoda, Naoki Shibata and Minoru Ito : "Task scheduling algorithm for multicore processor system for minimizing recovery time in case of single node fault," Proceedings of IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2012), pp.260-267, DOI:10.1109/CCGrid.2012.23, May 15, 2012.
In this paper, we propose a task scheduling algorithm for a multicore processor system which reduces the recovery time in case of a single fail-stop failure of a multicore processor. Many recently developed processors have multiple cores on a single die, so one failure of a computing node results in the failure of many processors. In the case of a failure of a multicore processor, all tasks which have been executed on the failed processor have to be recovered at once. The proposed algorithm is based on an existing checkpointing technique, and we assume that state is saved when nodes send results to the next node. If a series of computations that depends on former results is executed on a single die, all parts of that series must be executed again when the processor fails. The proposed scheduling algorithm therefore tries not to concentrate tasks on the processors of a single die. We designed our algorithm as a parallel algorithm that achieves O(n) speedup, where n is the number of processors. We evaluated our method using simulations and experiments with four PCs. Compared with an existing scheduling method, in the simulation the execution time including recovery time in the case of a node failure is reduced by up to 50%, while the overhead in the case of no failure was a few percent in typical scenarios.
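The paper's key observation can be reproduced with a toy model (the numbers and the cost model are illustrative assumptions, not the paper's): checkpoints are taken when a task's result leaves its die, so a dependent chain placed entirely on one die must be re-executed in full when that die fails, whereas spreading the chain across dies bounds the un-checkpointed work lost to any single failure.

```python
def worst_case_recovery(assignment):
    """assignment: {die_id: [task_time, ...]} for segments of a dependent
    chain.  Work is only checkpointed when it leaves a die, so losing one
    die loses all work currently on it; the worst case is the largest
    amount of work resident on any single die."""
    return max(sum(times) for times in assignment.values())

# A chain of four dependent tasks, 5 time units each, on a 2-die system.
concentrated = {0: [5, 5, 5, 5], 1: []}   # whole chain on die 0
spread       = {0: [5, 5], 1: [5, 5]}     # chain alternates between dies
```

Under this model the concentrated placement risks 20 units of recovery work while the spread placement risks only 10, mirroring the paper's motivation for not concentrating tasks on one die.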
(Slides) Task scheduling algorithm for multicore processor system for minimiz...Naoki Shibata
Task Scheduling Algorithm for Multicore Processor Systems with Turbo Boost an...Naoki Shibata
Yosuke Wakisaka, Naoki Shibata, Keiichi Yasumoto, Minoru Ito, and Junji Kitamichi : Task Scheduling Algorithm for Multicore Processor Systems with Turbo Boost and Hyper-Threading, In Proc. of The 2014 International Conference on Parallel and Distributed Processing Techniques and Applications(PDPTA'14), pp. 229-235
In this paper, we propose a task scheduling algorithm for multiprocessor systems with Turbo Boost and Hyper-Threading technologies. The proposed algorithm minimizes the total computation time, taking into account the dynamic changes of processing speed caused by the two technologies, in addition to the network contention among the processors. We constructed a clock speed model with which the changes of processing speed under Turbo Boost and Hyper-Threading can be estimated for various processor usage patterns. We then constructed a new scheduling algorithm that minimizes the total execution time of a task graph, considering network contention and the two technologies. We evaluated the proposed algorithm by simulations and experiments with a multiprocessor system consisting of 4 PCs. In the experiment, the proposed algorithm produced a schedule that reduces the total execution time by 36% compared to conventional methods, which are straightforward extensions of an existing method.
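A clock-speed model of the kind this abstract describes might look as follows (all coefficients are invented for illustration and are not the paper's measured values): Turbo Boost raises the clock when few cores are active, and Hyper-Threading lets two threads share one physical core at reduced per-thread throughput, so a task's estimated duration depends on the whole processor's usage pattern.

```python
BASE_GHZ = 3.0     # nominal clock (assumed)
TURBO_STEP = 0.2   # extra GHz per idle core (assumed)
SMT_FACTOR = 0.6   # per-thread throughput when a core runs two threads (assumed)

def thread_speed(active_cores, total_cores, threads_on_core=1):
    """Estimated effective speed (GHz) of one thread under Turbo Boost
    and Hyper-Threading, for a given processor usage pattern."""
    clock = BASE_GHZ + TURBO_STEP * (total_cores - active_cores)
    return clock * (SMT_FACTOR if threads_on_core == 2 else 1.0)

def task_time(work_gcycles, active_cores, total_cores, threads_on_core=1):
    """Estimated duration of a task given its work in giga-cycles."""
    return work_gcycles / thread_speed(active_cores, total_cores, threads_on_core)
```

A scheduler using such a model can, for example, see that packing one more task onto an otherwise idle chip costs little, while waking a second thread on a busy core slows both threads.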
Optimization of Remote Core Locking Synchronization in Multithreaded Programs...ITIIIndustries
This paper proposes algorithms for optimizing the Remote Core Locking (RCL) synchronization method in multithreaded programs. An algorithm for the initialization of RCL locks and algorithms for thread-affinity optimization are developed. The algorithms take into account the structure of hierarchical computer systems and non-uniform memory access (NUMA) to minimize the execution time of RCL programs. Experimental results on multi-core computer systems presented in the paper show a reduction in the execution time of RCL programs.
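The core idea of RCL can be sketched in a few lines (a simplified model: real RCL uses per-client cache-line-sized request slots and busy-waiting rather than a queue, and pins the server to a dedicated core): instead of each thread acquiring a lock, threads post their critical sections to a dedicated server thread, which executes them serially on its own core.

```python
import threading
import queue

class RCLServer:
    """Toy remote-core-locking server: one thread executes all critical
    sections posted by clients, serializing them without a shared lock."""

    def __init__(self):
        self.requests = queue.Queue()
        self.thread = threading.Thread(target=self._serve, daemon=True)
        self.thread.start()

    def _serve(self):
        while True:
            func, done = self.requests.get()
            if func is None:          # shutdown sentinel
                break
            func()                    # critical section runs on the server
            done.set()

    def execute(self, critical_section):
        """Client side: post the critical section and wait for completion."""
        done = threading.Event()
        self.requests.put((critical_section, done))
        done.wait()

    def shutdown(self):
        self.requests.put((None, None))
        self.thread.join()
```

Because all critical sections run on one core, the shared data they touch tends to stay in that core's cache, which is the locality benefit the NUMA-aware affinity optimizations in this paper build on.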
NETWORK-AWARE DATA PREFETCHING OPTIMIZATION OF COMPUTATIONS IN A HETEROGENEOU...IJCNCJournal
The rapid development of diverse computer architectures and hardware accelerators means that the design of parallel systems faces new problems arising from their heterogeneity. Our implementation of a parallel system called KernelHive allows applications to run efficiently in a heterogeneous environment consisting of multiple collections of nodes with different types of computing devices. The execution engine of the system is open to optimizer implementations focusing on various criteria. In this paper, we propose a new optimizer for KernelHive that utilizes distributed databases and performs data prefetching to optimize the execution time of applications that process large input data. Employing a versatile data management scheme that allows combining various distributed data providers, we propose using NoSQL databases for this purpose. We support our solution with experimental results from real executions of our OpenCL implementation of a regular-expression matching application in various hardware configurations. Additionally, we propose a network-aware scheduling scheme for selecting hardware for the proposed optimizer and present simulations that demonstrate its advantages.
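The prefetching optimization described here amounts to double-buffering (a schematic sketch; the KernelHive API is not shown, and `fetch`/`compute` are stand-ins): while chunk i is being processed, chunk i+1 is fetched from the (possibly remote) data provider in the background, hiding transfer latency behind computation.

```python
import threading

def process_all(chunks, fetch, compute):
    """fetch(chunk_id) -> data; compute(data) -> result.
    Overlaps fetching of the next chunk with computing the current one."""
    results = []
    data = fetch(chunks[0])                      # first chunk fetched eagerly
    for i in range(len(chunks)):
        prefetched = {}
        worker = None
        if i + 1 < len(chunks):
            # Bind the next chunk id now to avoid late-binding surprises.
            worker = threading.Thread(
                target=lambda c=chunks[i + 1]: prefetched.update(d=fetch(c)))
            worker.start()
        results.append(compute(data))            # compute overlaps the fetch
        if worker is not None:
            worker.join()
            data = prefetched["d"]
    return results
```

With a slow `fetch`, the total runtime approaches max(fetch, compute) per chunk rather than their sum, which is the effect the paper's optimizer seeks for large input data.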
These slides explain how to use the WEKA tool for association rule mining, including a brief overview of how to prepare a dataset for WEKA and how to visualize it.
Efficient Resource Management Mechanism with Fault Tolerant Model for Computa...Editor IJCATR
Grid computing provides a framework and deployment environment that enables resource sharing, access, aggregation, and management. It allows the coordinated use of various resources in dynamic, distributed virtual organizations. Grid scheduling is responsible for resource discovery, resource selection, and job assignment over a decentralized heterogeneous system. In the existing system, a primary-backup approach is used for fault tolerance in a single environment. In this approach, each task has a primary copy and a backup copy on two different processors. For dependent tasks, the precedence constraints among tasks must be considered when scheduling backup copies and overloading backups. Two algorithms have been developed to schedule backups of dependent and independent tasks. The proposed work manages resource failures in grid job scheduling. In this method, data sources and resources are integrated from different geographical environments. Fault-tolerant scheduling with the primary-backup approach is used to handle job failures in the grid environment. The impact of communication protocols is also considered: protocols such as the Transmission Control Protocol (TCP) and the User Datagram Protocol (UDP) are used to distribute each task's messages to grid resources.
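The placement rule at the heart of primary-backup fault tolerance can be shown in miniature (a simplified illustration: a real scheduler must also respect precedence constraints and decide when backups may be overloaded): each task gets a primary and a backup copy on two different processors, so no single processor failure can lose a task.

```python
def place_primary_backup(tasks, processors):
    """Round-robin placement guaranteeing primary != backup processor.
    Requires at least two processors."""
    placement = {}
    n = len(processors)
    for i, task in enumerate(tasks):
        primary = processors[i % n]
        backup = processors[(i + 1) % n]   # always a different processor
        placement[task] = (primary, backup)
    return placement

def survives_failure(placement, failed):
    """True if every task keeps at least one live copy after 'failed' dies."""
    return all(p != failed or b != failed for p, b in placement.values())
```

Because the two copies of a task never share a processor, `survives_failure` holds for any single failure; the interesting scheduling problems begin when backups of dependent tasks must also preserve precedence order.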
Data Analytics and Simulation in Parallel with MATLAB*Intel® Software
This talk covers the current parallel capabilities in MATLAB*. Learn about its parallel language and distributed and tall arrays. Interact with GPUs both on the desktop and in the cluster. Combine this information into an interesting algorithmic framework for data analysis and simulation.
[Paper Reading]Orca: A Modular Query Optimizer Architecture for Big DataPingCAP
The performance of analytical query processing in data management systems depends primarily on the capabilities of the system's query optimizer. Increased data volumes and heightened interest in processing complex analytical queries have prompted Pivotal to build a new query optimizer.
In this paper we present the architecture of Orca, the new query optimizer for all Pivotal data management products, including Pivotal Greenplum Database and Pivotal HAWQ. Orca is a comprehensive development uniting state-of-the-art query optimization technology with Pivotal's own original research, resulting in a modular and portable optimizer architecture.
In addition to describing the overall architecture, we highlight several unique features and present performance comparisons against other systems.
Implementation of linear regression and logistic regression on SparkDalei Li
This presentation was developed for a course project at the Technical University of Madrid. The course, Massively Parallel Machine Learning, was supervised by Alberto Mozo and Bruno Ordozgoiti.
Multi-cores have become ubiquitous in both general-purpose computing and the embedded domain. Current technology trends show that the number of on-chip cores is rapidly increasing, while their complexity is decreasing due to power and thermal constraints. The increasing number of simple cores enables parallel applications to benefit from abundant thread-level parallelism (TLP), while sequential fragments suffer from poor exploitation of instruction-level parallelism (ILP). Recent research has proposed adaptive multi-core architectures that can coalesce simple physical cores into more complex virtual cores so as to accelerate sequential code. Such adaptive architectures can seamlessly exploit both ILP and TLP. The goal of this paper is to quantitatively characterize the performance potential of adaptive multi-core architectures. Previous research has primarily focused on purely sequential workloads on adaptive multi-cores. We address a more realistic scenario where parallel and sequential applications co-exist on an adaptive multi-core platform. Scheduling tasks on adaptive architectures poses challenging resource allocation problems for existing schedulers. We construct offline and online schedulers that intelligently reconfigure and allocate the cores to the applications so as to minimize the overall makespan under the constraints of a realistic adaptive multi-core architecture. Experimental results reveal that adaptive multi-core architectures can substantially decrease the makespan compared to both static symmetric and static asymmetric multi-core architectures.
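A back-of-the-envelope model shows why coalescing can shrink the makespan (the speedup numbers are Amdahl-style assumptions, not the paper's measurements): with a sequential application and a parallel application co-scheduled on four simple cores, an adaptive architecture can fuse two cores into a faster virtual core for the sequential code, trading a little parallel throughput for a large sequential gain.

```python
def makespan_static(seq_work, par_work, cores):
    """Static design: one simple core runs the sequential app; the
    remaining cores share the parallel app."""
    return max(seq_work, par_work / (cores - 1))

def makespan_adaptive(seq_work, par_work, cores, fused, fused_speedup):
    """Adaptive design: 'fused' simple cores form one virtual core with
    sequential speedup 'fused_speedup'; the rest run the parallel app."""
    return max(seq_work / fused_speedup, par_work / (cores - fused))
```

With 100 units of sequential work, 120 of parallel work, 4 cores, and an assumed 1.8x speedup from fusing two cores, the static makespan is 100 while the adaptive one is 60; the sequential app stops being the bottleneck.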
As the demand for computing power quickly increases in the automotive domain, car manufacturers and tier-one suppliers are gradually introducing multicore ECUs in their electronic architectures. These multicore ECUs offer new features, such as higher levels of parallelism, which ease compliance with the safety requirements introduced by ISO 26262 and can be taken advantage of in various other automotive use cases. These new features also involve more complexity in the design, development, and verification of software applications. Hence, OEMs and suppliers will require new tools and methodologies for deployment and validation. In this paper, we present the main use cases for multicore ECUs and then focus on one of them. Precisely, we address the problem of scheduling numerous elementary software components (called runnables) on a limited set of identical cores. In the context of an automotive design, we assume the use of the static task partitioning scheme, which provides simplicity and better predictability for ECU designers by comparison with a global scheduling approach. We show how the global scheduling problem can be addressed as two sub-problems: partitioning the set of runnables and building the schedule on each core. We prove that neither sub-problem can be solved optimally due to its algorithmic complexity, and we then present low-complexity heuristics to partition the runnable set and build a schedule on each core, before discussing schedulability verification methods. Finally, we assess the performance of our approach on realistic case studies.
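The two sub-problems named in this abstract can be sketched with generic bin-packing heuristics (the choice of worst-fit decreasing and the utilization bound are illustrative assumptions, not necessarily the paper's heuristics): first partition the runnables onto cores, then check a simple utilization-based schedulability condition on each core.

```python
def partition_worst_fit(runnables, num_cores):
    """runnables: {name: utilization}.  Worst-fit decreasing: place each
    runnable, largest first, on the currently least-loaded core, which
    tends to balance load across cores."""
    cores = [dict(load=0.0, members=[]) for _ in range(num_cores)]
    for name, util in sorted(runnables.items(), key=lambda kv: -kv[1]):
        target = min(cores, key=lambda c: c["load"])
        target["members"].append(name)
        target["load"] += util
    return cores

def schedulable(cores, bound=1.0):
    """Necessary condition only: no core may be over-utilized.  A full
    verification would build the actual per-core schedule."""
    return all(c["load"] <= bound for c in cores)
```

Both steps are fast heuristics precisely because, as the paper proves, solving either sub-problem optimally is computationally intractable.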
Sara Afshar: Scheduling and Resource Sharing in Multiprocessor Real-Time Systemsknowdiff
PhD Candidate,
Department of Computer science
Mälardalen University
Time: Tuesday, Dec. 30, 2014, 11:30 a.m.
Location: Computer Engineering Department, Urmia University
Abstract:
The processor is the brain of a computer system. Usually, one or more programs run on a processor, where each program is typically responsible for performing a particular task or function of the system. The performance of all the tasks together results in the system functionality. In many computer systems, it is not enough that all tasks deliver correct output; it is also crucial that they do so at the proper time. Systems with such timing requirements are known as real-time systems. A scheduler is responsible for scheduling all tasks on the processor, i.e., it dictates which task runs and when, to ensure that all tasks are carried out on time. Typically, such tasks/programs need to use the computer system's hardware and software resources to perform their calculations. Examples of resources shared among programs are I/O devices, buffers, and memories. The technology used to manage shared resources is known as a resource-sharing synchronization protocol.
In recent years, a shift from single-processor platforms to multiprocessor platforms has become inevitable due to the availability of processor chips and requirements for increased performance. Scheduling and resource sharing protocols have been well studied for uniprocessor systems; in the context of multiprocessors, however, such techniques are still not fully mature. The shift towards multi-core technology has revealed the demand for real-time scheduling algorithms, along with synchronization protocols, to support real-time applications on multiprocessors, both with and without dependencies.
In this talk, we first give an introduction to real-time embedded systems. Next, we look at scheduling and resource sharing policies on uniprocessor platforms. Further, we discuss the extension of scheduling and resource sharing policies to multiprocessor platforms and present the recent challenges that have arisen in this context.
Biography:
Sara Afshar is a PhD student at Mälardalen University. She received her B.Sc. degree in Electrical Engineering from Tabriz University, Iran, in 2002 and worked at various engineering companies until 2009. In 2010 she started her M.Sc. in Embedded Systems at Mälardalen University, obtained her Master's degree in 2012, and began her PhD studies in the same year. She is currently working on resource sharing in multiprocessor systems and is part of the Complex Real-Time Embedded Systems group at Mälardalen University.
Optimization of Remote Core Locking Synchronization in Multithreaded Programs...ITIIIndustries
This paper proposes the algorithms for optimization of Remote Core Locking (RCL) synchronization method in multithreaded programs. The algorithm of initialization of RCL-locks and the algorithms for threads affinity optimization are developed. The algorithms consider the structures of hierarchical computer systems and non-uniform memory access (NUMA) to minimize execution time of RCLprograms. The experimental results on multi-core computer systems represented in the paper shows the reduction of RCLprograms execution time.
NETWORK-AWARE DATA PREFETCHING OPTIMIZATION OF COMPUTATIONS IN A HETEROGENEOU...IJCNCJournal
Rapid development of diverse computer architectures and hardware accelerators caused that designing parallel systems faces new problems resulting from their heterogeneity. Our implementation of a parallel
system called KernelHive allows to efficiently run applications in a heterogeneous environment consisting
of multiple collections of nodes with different types of computing devices. The execution engine of the
system is open for optimizer implementations, focusing on various criteria. In this paper, we propose a new
optimizer for KernelHive, that utilizes distributed databases and performs data prefetching to optimize the
execution time of applications, which process large input data. Employing a versatile data management
scheme, which allows combining various distributed data providers, we propose using NoSQL databases
for our purposes. We support our solution with results of experiments with real executions of our OpenCL
implementation of a regular expression matching application in various hardware configurations.
Additionally, we propose a network-aware scheduling scheme for selecting hardware for the proposed
optimizer and present simulations that demonstrate its advantages.
This slide will help to understand how to use WEKA tool for association rule mining. It has a brief overview of how to prepare dataset for using it in WEKA and how to visualize it.
Efficient Resource Management Mechanism with Fault Tolerant Model for Computa...Editor IJCATR
Grid computing provides a framework and deployment environment that enables resource
sharing, accessing, aggregation and management. It allows resource and coordinated use of various
resources in dynamic, distributed virtual organization. The grid scheduling is responsible for resource
discovery, resource selection and job assignment over a decentralized heterogeneous system. In the
existing system, primary-backup approach is used for fault tolerance in a single environment. In this
approach, each task has a primary copy and backup copy on two different processors. For dependent
tasks, precedence constraint among tasks must be considered when scheduling backup copies and
overloading backups. Then, two algorithms have been developed to schedule backups of dependent and
independent tasks. The proposed work is to manage the resource failures in grid job scheduling. In this
method, data source and resource are integrated from different geographical environment. Faulttolerant
scheduling with primary backup approach is used to handle job failures in grid environment.
Impact of communication protocols is considered. Communication protocols such as Transmission
Control Protocol (TCP), User Datagram Protocol (UDP) which are used to distribute the message of
each task to grid resources.
Data Analytics and Simulation in Parallel with MATLAB*Intel® Software
This talk covers the current parallel capabilities in MATLAB*. Learn about its parallel language and distributed and tall arrays. Interact with GPUs both on the desktop and in the cluster. Combine this information into an interesting algorithmic framework for data analysis and simulation.
[Paper Reading]Orca: A Modular Query Optimizer Architecture for Big DataPingCAP
The performance of analytical query processing in data management systems depends primarily on the capabilities of the system's query optimizer. Increased data volumes and heightened interest in processing complex analytical queries have prompted Pivotal to build a new query optimizer.
In this paper we present the architecture of Orca, the new query optimizer for all Pivotal data management products, including Pivotal Greenplum Database and Pivotal HAWQ. Orca is a comprehensive development uniting state-of-the-art query optimization technology with own original research resulting in a modular and portable optimizer architecture.
In addition to describing the overall architecture, we highlight several unique features and present performance comparisons against other systems.
Implementation of linear regression and logistic regression on SparkDalei Li
This presentation was developed for a course project at Technical University of Madrid. The course is massively parallel machine learning supervised by Alberto Mozo and Bruno Ordozgoiti.
Multi-cores have become ubiquitous both in the general-purpose computing and the
embedded domain. The current technology trends show that the number of on-chip cores is
rapidly increasing, while their complexity is decreasing due to power and thermal constraints.
Increasing number of simple cores enable parallel applications benefit from abundant thread-level parallelism (TLP), while sequential fragments suffer from poor exploitation of instruction-level parallelism (ILP). Recent research has proposed adaptive multi-core architectures that are
capable of coalescing simple physical cores to create more complex virtual cores so as to
accelerate sequential code. Such adaptive architectures can seamlessly exploit both ILP and TLP.
The goal of this paper is to quantitatively characterize the performance potential of adaptive
multi-core architectures. Previous research have primarily focused on only sequential
Workload on adaptive multi-cores. We address a more realistic scenario where parallel and
sequential applications co-exist on an adaptive multi-core platform. Scheduling tasks on adaptive
architectures reveal challenging resource allocation problems for the existing schedulers. We
construct offline and online schedulers that intelligently reconfigure and allocate the cores to the
applications so as to minimize the overall makespan under the constraints of a realistic adaptive
multi-core architecture. Experimental results reveal that adaptive multi-core architectures can
substantially decrease the makespan compared to both static symmetric and asymmetric multi-core architectures.
As the demand for computing power is quickly
increasing in the automotive domain, car manufactur-ers and tier-one suppliers are gradually introducing mul-ticore ECUs in their electronic architectures. Additionally, these multicore ECUs offer new features such as higher levels of parallelism which eases the respect of
the safety requirements introduced by the ISO 26262 and can be taken advantage of in various other automotive use-cases. These new features involve also more complexity in the design, development and verification of the software applications. Hence, OEMs and suppliers will require new tools and methodologies for deployment and
validation. In this paper, we present the main use cases
for multicore ECUs and then focus on one of them. Pre-
cisely, we address the problem of scheduling numerous
elementary software components (called runnables) on
a limited set of identical cores. In the context of an au-
tomotive design, we assume the use of the static task
partitioning scheme which provides simplicity and bet-
ter predictability for the ECU designers by comparison
with a global scheduling approach. We show how the
global scheduling problem can be addressed as two sub-
problems: partitioning the set of runnables and building
the schedule on each core. At that point, we prove that
each of the sub-problems cannot be solved optimally due
to their algorithmic complexity. We then present low com-
plexity heuristics to partition and build a schedule of the
runnable set on each core before discussing schedula-
bility verification methods. Finally, we assess the perfor-
mance of our approach on realistic case-studies.
Sara Afshar: Scheduling and Resource Sharing in Multiprocessor Real-Time Systemsknowdiff
PhD Candidate,
Department of Computer science
Mälardalen University
Time: Tuesday, Dec. 30, 2014, 11:30 a.m.
Location: Computer Engineering Department, Urmia University
Abstract:
The processor is the brain of a computer system. Usually, one or more programs run on a processor where each program is typically responsible for performing a particular task or function of the system. The performance of all the tasks together results in the system functionality. In many computer systems, it is not only enough that all tasks deliver correct output, but it is also crucial that these activities are delivered in a proper time. This type of systems that have timing requirements are known as real-time systems. A scheduler is responsible for scheduling all tasks on the processor, i.e., it dictates which task to run and when to run to ensure that all tasks are carried out on time. Typically, such tasks/programs need to use the computer system’s hardware and software resources to perform their calculation. Examples of such type of resources that are shared among programs are I/O devices, buffers and memories. Technology that is used for the management of shared resources is known as resource sharing synchronization protocol.
In recent years, a shift from single-processor platforms to multiprocessor platforms has become inevitable due to the availability of multi-core processor chips and the demand for increased performance. Scheduling and resource-sharing protocols have been well studied for uniprocessor systems. In the context of multiprocessors, however, such techniques are not yet fully mature. The shift towards multi-core technology has revealed the demand for real-time scheduling algorithms, along with synchronization protocols, to support real-time applications on multiprocessors, both with and without dependencies.
In this talk, we first give an introduction to real-time embedded systems. Next, we look at scheduling and resource-sharing policies on uniprocessor platforms. We then discuss the extension of scheduling and resource-sharing policies to multiprocessor platforms and present recent challenges that have arisen in this context.
Biography:
Sara Afshar is a PhD student at Mälardalen University. She received her B.Sc. degree in Electrical Engineering from Tabriz University, Iran, in 2002 and worked at different engineering companies until 2009. In 2010 she started her M.Sc. in Embedded Systems at Mälardalen University; she obtained her Master's degree in 2012 and began her PhD studies at Mälardalen University in the same year. She is currently working on the topic of resource sharing in multiprocessor systems and is part of the Complex Real-Time Embedded Systems group at Mälardalen University.
In this presentation, Dr. Cliff Click describes a totally lock-free hashtable with extremely low-cost and near perfect scaling. Readers pay no more than HashMap readers: just the cost of computing the hash, loading and comparing the key, and returning the value. Writers must use AtomicUpdate instead of a simple assignment but otherwise pay the same as readers. In particular, there is no required order between loads and stores; correctness is assured, no matter how the hardware orders memory operations. A state-based technique demonstrates the correctness of the algorithm. This novel approach is very straightforward and much easier to understand than the usual "happens-before" memory-order-based reasoning.
Real-Time Inverted Search NYC ASLUG Oct 2014Bryan Bende
Building real-time notification systems is often limited to basic filtering and pattern matching against incoming records. Allowing users to query incoming documents using Solr’s full range of capabilities is much more powerful. In our environment we needed a way to allow for tens of thousands of such query subscriptions, meaning we needed to find a way to distribute the query processing in the cloud. By creating in-memory Lucene indices from our Solr configuration, we were able to parallelize our queries across our cluster. To achieve this distribution, we wrapped the processing in a Storm topology to provide a flexible way to scale and manage our infrastructure. This presentation will describe our experiences creating this distributed, real-time inverted search notification framework.
Mike Bartley - Innovations for Testing Parallel Software - EuroSTAR 2012TEST Huddle
EuroSTAR Software Testing Conference 2012 presentation on Innovations for Testing Parallel Software by Mike Bartley.
See more at: http://conference.eurostarsoftwaretesting.com/past-presentations/
Public vs. Private Cloud Performance by Flex StackIQ
This is a presentation given by Hugh Ma and Michael O'Rourke from Flex at the Stacki San Jose Meetup on September 15, 2016. Learn about the differences between public and private cloud performance, their OpenStack-Ansible & FlexBench environment, and how they use Stacki.
CS 301 Computer Architecture Student # 1 EID 09 Kingdom of .docx faithxdunce63732
CS 301 Computer Architecture
Student # 1, ID: 09, Kingdom of Saudi Arabia, Royal Commission at Yanbu, Yanbu University College, Yanbu Al-Sinaiyah
Student # 2, ID: 09, Kingdom of Saudi Arabia, Royal Commission at Yanbu, Yanbu University College, Yanbu Al-Sinaiyah
1. Introduction
High-performance processor design has recently taken two distinct approaches. One approach is to increase the execution rate by increasing the clock frequency of the processor or by reducing the execution latency of the operations. While this approach is important, much of its performance gain comes as a consequence of circuit and layout improvements and is beyond the scope of this research. The other approach is to directly exploit the instruction-level parallelism (ILP) in the program and to issue and execute multiple operations concurrently. This approach requires both compiler and microarchitecture support.
Traditional processor designs that issue and execute at most one operation per cycle are often called scalar designs. Static and dynamic scheduling techniques have been used to achieve better-than-scalar performance by issuing and executing more than one operation per cycle. While Johnson [7] defines a superscalar processor as any design that achieves better-than-scalar performance, popular usage of this term refers exclusively to processors that use dynamic scheduling techniques. For clarity, we use the term instruction-level parallel processors to refer to the general class of processors that execute more than one operation per cycle.
The primary static scheduling technique uses the compiler to determine sets of operations that have their source operands ready and have no dependencies within the set. These operations can then be scheduled within the same instruction, subject only to hardware resource limits. Since each of the operations in an instruction is guaranteed by the compiler to be independent, the hardware is able to issue and execute these operations directly with no dynamic analysis. These multi-operation instructions are very long in comparison with traditional single-operation instructions.
Simulation of Heterogeneous Cloud InfrastructuresCloudLightning
During the last years, except from the traditional CPU based hardware servers, hardware accelerators are widely used in various HPC application areas. More specifically, Graphics Processing Units (GPUs), Many Integrated Cores (MICs) and Field-Programmable Gate Arrays (FPGAs) have shown a great potential in HPC and have been widely mobilised in supercomputing and in HPC-Clouds. This presentation focuses on the development of a cloud simulation framework that supports hardware accelerators. The design and implementation of the framework are also discussed.
This presentation was given by Dr. Konstantinos Giannoutakis (CERTH) at the CloudLightning Conference on 11th April 2017.
Scott Callaghan from the Southern California Earthquake Center presented this deck in a recent Blue Waters Webinar.
"I will present an overview of scientific workflows. I'll discuss what the community means by "workflows" and what elements make up a workflow. We'll talk about common problems that users might be facing, such as automation, job management, data staging, resource provisioning, and provenance tracking, and explain how workflow tools can help address these challenges. I'll present a brief example from my own work with a series of seismic codes showing how using workflow tools can improve scientific applications. I'll finish with an overview of high-level workflow concepts, with an aim to preparing users to get the most out of discussions of specific workflow tools and identify which tools would be best for them."
Watch the video: http://wp.me/p3RLHQ-gtH
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
This chapter discusses various classifications attributed to parallel architectures. It also introduces related parallel programming models and presents how these models map onto parallel architectures. Notions covered include data parallelism, task parallelism, tightly and loosely coupled systems, UMA/NUMA, multicore computing, symmetric multiprocessing, distributed computing, cluster computing, shared memory with and without threads, etc.
Wait-free data structures on embedded multi-core systems
1. Tobias Fuchs
Evaluation of Task Scheduling
Algorithms and Wait-Free Data
Structures for Embedded Multi-Core
Systems
• Talk on the Master's thesis
• Thesis supervisor: Prof. Dr. Dieter Kranzlmüller
• Advisors: Dr. Karl Fürlinger (LMU)
Dr. Tobias Schüle (Siemens CT)
• Date of the talk: 05.11.2014
2. Structure of this talk
1. Introduction
1. Motivation
2. Problem Statement and Objectives
2. Wait-free data structures
1. Foundations
2. Pools
3. Queues
4. Stacks
3. Task Scheduling
1. Work stealing
2. Prioritized work stealing in EMBB
4. Conclusion
Task Scheduling Algorithms and Wait-Free Data Structures for Embedded Multi-Core Systems 2
4. Motivation
Wait-free algorithms
• Strongest possible fault tolerance
• Guarantee progress and an upper bound on execution time
Gains:
+ Progress can be a formal constraint in real-time computing
+ Wait-freedom eliminates the classic concurrency problems: deadlocks, priority inversion, convoying, kill-intolerance
5. Problem statement
State of the art
No suitable wait-free data structures for embedded systems exist:
• They employ mechanisms such as garbage collection
• They are not designed for restricted resources
• No evaluation of latency
Challenges:
- Transforming data structures into wait-free equivalents is non-trivial and usually requires a from-scratch redesign
- Implementations depend on the platform architecture
6. Objectives
1. Review and evaluation of state of the art approaches for
suitability on embedded systems
2. Real-time compliant implementations of wait-free data
structures
3. Definition, implementation and evaluation of suitable
benchmark scenarios for wait-free data structures and
task scheduling algorithms
+ Automated verification derived from semantic definition
8. Progress conditions
Classification of progress
On the Nature of Progress (Herlihy, Shavit 2011)
9. Real-time requirements
Performance priorities in real-time systems:
Guarantees on worst-case runtime behavior
Aim for latency and jitter reduction, even at the expense of throughput
Avoid non-determinism, e.g. in malloc / new (see MISRA)
10. Evaluation methodology
Real-time applications are designed to optimize latency
Related work does not evaluate latency, but only mean or
median throughput
Evaluating worst-case latency is difficult:
• In related work, measurements outside the 97.5% confidence interval are considered outliers and ignored
• These outliers, however, are precisely our data
11. Pools
12. Wait-free data structures:
Pools
Pools
… realize dynamic memory allocation
… while eliminating heap fragmentation
• Fundamental data structure of any concurrent container
• Fixed number of objects in static or automatic memory
• Pools manage concurrent removal and reclamation of
objects
RemoveAny(pool, e): remove and return an arbitrary element e
Add(pool, e): add element e back to the pool
13. Pools:
Related work
Related work
Close to none exists:
• Several lock-free pools, e.g. tree-based
• Wait-free pools: array-based, simple yet inefficient
Why are wait-free pools hard to design?
Common wait-free paradigms require dynamic memory
allocation …
14. Array-based pools
Array-based wait-free pools
• Consists of an array of atomic reservation flags
• Threads traverse the reservation array from the beginning and try to reserve a flag atomically (CAS)
• The index of the successfully toggled flag is the index of the acquired element
• Worst-case complexity: O(n)
15. Compartment pool
Wait-free pool with thread-specific compartments
• An array-based pool with an additional range of elements that can only be acquired by a specific thread
• Threads acquire elements from their private compartment first
16. Wait-free data structures:
Pools - Evaluation
17. Wait-free data structures:
Pools - Evaluation
18. Wait-free data structures:
Pools - Evaluation
19. Queues
20. Queues:
Related work
Related work
Kogan and Petrank presented the first wait-free queue for
multiple enqueuers and dequeuers
Wait-Free Queues With Multiple Enqueuers and Dequeuers (Kogan, Petrank, 2011)
- Implemented in Java
- Relies on garbage collection
- Requires a monotonic counter (phase)
21. Kogan-Petrank queue
Adapting the Kogan-Petrank wait-free queue
Redesign helping scheme to remove phase counter
• In the original publication, a new phase value is greater than the phases of all announced operations (including non-pending ones)
22. Kogan-Petrank queue
Adapting the Kogan-Petrank wait-free queue
Redesign helping scheme to remove phase counter
• Modification: Help all other non-pending operations first
• Possibly helping operations that are newer than the thread's own operation
23. Kogan-Petrank queue
Adapting the Kogan-Petrank wait-free queue
Redesign helping scheme to remove phase counter
• Fairness is maintained: all other threads are guaranteed
to help this thread’s operation before engaging in their own
24. Kogan-Petrank queue
Adapting the Kogan-Petrank wait-free queue
Memory reclamation
The hazard pointer scheme is typically presented as a solution
Hazard pointers: Safe memory reclamation for lock-free objects (Michael, 2004)
25. Kogan-Petrank queue
Adapting the Kogan-Petrank wait-free queue
Introduce hazard pointers
Step 1: Find upper memory bound for hazard pointers
Step 2: Guard queue nodes using hazard pointers
26. Kogan-Petrank queue
Adapting the Kogan-Petrank wait-free queue
Introduce hazard pointers
Step 2: Guard queue nodes using hazard pointers
Culprit: Guarding is not wait-free
pointer p = node.Next;
hp.GuardPointer(p);
// node.Next may have changed between the read and the guard:
while (p != node.Next) {
  // Release and retry -- unbounded number of retries
  hp.ReleaseGuard(p);
  p = node.Next;
  hp.GuardPointer(p);
}
27. Kogan-Petrank queue
Adapting the Kogan-Petrank wait-free queue
Introduce hazard pointers
Step 2: Guard queue nodes using hazard pointers
Culprit: Guarding is not wait-free
Fortunately, retry loops can be avoided in the Kogan-
Petrank queue, but the implementation is not trivial
see implementation at
https://github.com/fuchsto/embb/tree/benchmark/
28. Queues - Evaluation
Queue benchmark scenarios
In addition to scenarios for bag semantics
• Buffer latency: elements are enqueued together with the current timestamp; the difference to the timestamp taken at dequeue is the buffer latency
29. Queues - Evaluation
30. Queues - Evaluation
31. Stacks
32. Stacks:
Related work
Related work
Fatourou presented a wait-free “universal” construction that is applicable to stacks
A highly efficient universal construction (Fatourou, 2011)
33. Elimination stack
Fatourou’s universal construction SIM
A highly efficient universal construction (Fatourou, 2011)
Principle
• Optimized helping scheme
• Threads apply operations to a local copy of the stack
• Every thread tries to replace the global shared object with
its local copy via CAS
• Only applicable for shared objects with small state
34. Elimination stack
Fatourou’s universal construction SIM
A highly efficient universal construction (Fatourou, 2011)
Elimination
• Push and Pop have inverse semantics: Pop(Push(stack, e)) = stack
• A matched concurrent Push/Pop pair can therefore be eliminated: both operations complete immediately, since together they do not alter the object’s state
Significantly improves performance if applicable
35. Elimination stack
Fatourou’s universal construction SIM
A highly efficient universal construction (Fatourou, 2011)
Original version is not suitable for real-time applications:
- ABA problem is prevented using tagged pointers
- Thread-local pools with unbounded capacity
- No deallocation in published algorithm
36. Elimination stack
Fatourou’s universal construction SIM
A highly efficient universal construction (Fatourou, 2011)
Modified version of Fatourou’s stack
- Uses hazard pointers for safe reclamation
- Uses compartment pool with limited capacity
- Employs the elimination scheme from the original
publication
37. Stacks:
Evaluation
38. Stacks:
Evaluation
39. Task scheduling
40. Task Scheduling:
Objectives
Task Scheduling
• Intra-process task scheduling with priority queues
• Low-overhead, fine-grained scheduling of thousands of
small tasks
Priorities:
Focus on low latency and jitter reduction (i.e. predictability),
thus regarding maximum throughput as a secondary
benchmark.
41. Task scheduling:
Work stealing
Work stealing
• One worker thread per SMP core, no task migration
• Tasks are passed as function pointers (&func)
• Load balancing via per-worker task queues
• Many flavors of concrete implementations exist
42. Task scheduling:
Work stealing
Work stealing with task priorities
• Work stealing extended with one queue per priority level
44. Conclusion
Revisiting the objective
• Wait-free implementations of pools, queues and stacks are now available for real-time applications
• Benchmark framework and evaluation tools (R) are published as open source
• Reproducible evaluation of real-time performance
• A verification tool chain is on the way
45. Conclusion
Recommendations
• Wait-free data structures can rival the performance of lock-free implementations
• But they are hard to maintain
• Formal wait-freedom is practically not achievable
Employ wait-free data structures for fault tolerance, not as a guarantee for critical deadlines
46. Thank You
Source code (data structures, benchmarks, R scripts):
https://github.com/fuchsto/embb/tree/benchmark/
Official development source base of embb:
https://github.com/siemens/embb/tree/development/
Wiki to this thesis:
http://wiki.coreglit.ch