This document surveys operating system scheduling on multi-core architectures. It argues that traditional scheduling approaches work poorly on multi-core CPUs because cores contend for shared caches and memory controllers and each core has a smaller cache. Current research focuses on cache fairness, balanced core assignment, handling performance asymmetry between cores, and scheduling for many-core chips. The document reviews scheduling in Linux, Windows, and Solaris, along with ongoing research topics such as cache-fair scheduling, balancing tasks across cores, and scheduling for future many-core processors.
Task Scheduling Algorithm for Multicore Processor Systems with Turbo Boost and Hyper-Threading (Naoki Shibata)
Yosuke Wakisaka, Naoki Shibata, Keiichi Yasumoto, Minoru Ito, and Junji Kitamichi : Task Scheduling Algorithm for Multicore Processor Systems with Turbo Boost and Hyper-Threading, In Proc. of The 2014 International Conference on Parallel and Distributed Processing Techniques and Applications(PDPTA'14), pp. 229-235
In this paper, we propose a task scheduling algorithm for multiprocessor systems with Turbo Boost and Hyper-Threading technologies. The proposed algorithm minimizes the total computation time, taking into account the dynamic changes in processing speed caused by the two technologies as well as network contention among the processors. We constructed a clock speed model with which the changes in processing speed under Turbo Boost and Hyper-Threading can be estimated for various processor usage patterns. We then constructed a new scheduling algorithm that minimizes the total execution time of a task graph, considering network contention and the two technologies. We evaluated the proposed algorithm by simulations and by experiments with a multiprocessor system consisting of 4 PCs. In the experiments, the proposed algorithm produced a schedule that reduces the total execution time by 36% compared to conventional methods, which are straightforward extensions of an existing method.
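The paper's actual clock-speed model is not reproduced in this summary; as a rough illustration of what such a model can look like, here is a sketch in which the base and boost frequencies, the per-core boost step, and the SMT efficiency factor are all invented numbers:

```python
# Illustrative clock-speed model: estimated per-thread speed as a function
# of how many cores are active (Turbo Boost) and whether both hardware
# threads of a core are in use (Hyper-Threading). All constants are made up.

BASE_GHZ = 3.4          # nominal base clock (assumed)
MAX_BOOST_GHZ = 3.9     # single-core boost ceiling (assumed)
BOOST_STEP_GHZ = 0.1    # boost lost per additional active core (assumed)
SMT_EFFICIENCY = 0.6    # per-thread throughput factor with 2 threads per core

def estimated_thread_speed(active_cores: int, threads_on_core: int) -> float:
    """Effective per-thread speed in GHz-equivalents on one core."""
    boost = max(MAX_BOOST_GHZ - BOOST_STEP_GHZ * (active_cores - 1), BASE_GHZ)
    if threads_on_core == 2:   # two hardware threads share the core's pipeline
        return boost * SMT_EFFICIENCY
    return boost

# Example: a thread on a 2-way-loaded core while 4 cores are active
print(estimated_thread_speed(4, 2))   # about 3.6 * 0.6 = 2.16
```

A scheduler built on such a model can compare candidate assignments by summing estimated per-thread speeds rather than assuming a fixed clock.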
(Slides) Task scheduling algorithm for multicore processor system for minimizing recovery time in case of single node fault (Naoki Shibata)
Shohei Gotoda, Naoki Shibata and Minoru Ito : "Task scheduling algorithm for multicore processor system for minimizing recovery time in case of single node fault," Proceedings of IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2012), pp.260-267, DOI:10.1109/CCGrid.2012.23, May 15, 2012.
In this paper, we propose a task scheduling algorithm for a multicore processor system which reduces the recovery time in case of a single fail-stop failure of a multicore processor. Many recently developed processors have multiple cores on a single die, so one failure of a computing node results in the failure of many processors. In the case of a failure of a multicore processor, all tasks which have been executed on the failed multicore processor have to be recovered at once. The proposed algorithm is based on an existing checkpointing technique, and we assume that state is saved when nodes send results to the next node. If a series of computations that depends on earlier results is executed on a single die, all parts of that series must be executed again if the processor fails. The proposed scheduling algorithm therefore avoids concentrating tasks on the processors of a single die. We designed our algorithm as a parallel algorithm that achieves O(n) speedup, where n is the number of processors. We evaluated our method using simulations and experiments with four PCs. Compared with an existing scheduling method, in the simulation the execution time including recovery time in the case of a node failure is reduced by up to 50%, while the overhead in the case of no failure was a few percent in typical scenarios.
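The abstract above describes avoiding the concentration of dependent tasks on one die; a minimal sketch of that die-spreading idea follows (not the authors' algorithm; the chain, die, and core structures are invented):

```python
# Toy placement that avoids putting consecutive tasks of a dependency
# chain on cores of the same die, so one die failure cannot take out
# an entire chain of checkpointed results.

from itertools import cycle

def spread_chain_across_dies(chain, dies):
    """chain: list of task names; dies: list of lists of core ids.
    Rotates over dies so no two consecutive tasks share a die."""
    die_cycle = cycle(range(len(dies)))
    next_core = [0] * len(dies)            # next core index to use per die
    placement = {}
    for task in chain:
        d = next(die_cycle)
        placement[task] = dies[d][next_core[d] % len(dies[d])]
        next_core[d] += 1
    return placement

dies = [["c0", "c1"], ["c2", "c3"]]        # two dual-core dies
print(spread_chain_across_dies(["t1", "t2", "t3", "t4"], dies))
# {'t1': 'c0', 't2': 'c2', 't3': 'c1', 't4': 'c3'}
```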
This document discusses resource management for computer operating systems. It argues that traditional OS architecture is outdated given changes in hardware and software. The authors propose an approach where the OS allocates resources like CPU cores, memory, and bandwidth to processes to optimize responsiveness based on penalty functions that model how run time affects user experience. The goal is to continuously minimize the total penalty by adjusting resource allocations over time as user needs and process requirements change.
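As a hedged sketch of the penalty-function idea (the penalty shape and the greedy rule below are assumptions for illustration, not the authors' method), one can hand out cores one at a time to whichever process gains the largest penalty reduction:

```python
# Greedy core allocation minimizing total penalty. The penalty functions
# are invented convex examples: penalty(cores) = weight / (cores + 1).

def allocate_cores(weights, total_cores):
    """weights: {process: importance}; returns {process: cores}."""
    alloc = {p: 0 for p in weights}
    for _ in range(total_cores):
        # marginal penalty reduction from giving process p one more core
        gain = {p: weights[p] / (alloc[p] + 1) - weights[p] / (alloc[p] + 2)
                for p in weights}
        best = max(gain, key=gain.get)
        alloc[best] += 1
    return alloc

print(allocate_cores({"browser": 8.0, "indexer": 2.0, "backup": 1.0}, 4))
# {'browser': 3, 'indexer': 1, 'backup': 0}
```

Rerunning this loop as weights change over time matches the summary's point about continuously adjusting allocations.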
This document discusses resource management techniques in distributed systems. It covers three main scheduling techniques: task assignment approach, load balancing approach, and load sharing approach. It also outlines desirable features of good global scheduling algorithms such as having no a priori knowledge about processes, being dynamic in nature, having quick decision-making capability, balancing system performance and scheduling overhead, stability, scalability, fault tolerance, and fairness of service. Finally, it discusses policies for load estimation, process transfer, state information exchange, location, priority assignment, and migration limiting that distributed load balancing algorithms employ.
Sara Afshar: Scheduling and Resource Sharing in Multiprocessor Real-Time Systems (knowdiff)
PhD Candidate,
Department of Computer Science
Mälardalen University
Time: Tuesday, Dec. 30, 2014, 11:30 a.m.
Location: Computer Engineering Department, Urmia University
Abstract:
The processor is the brain of a computer system. Usually, one or more programs run on a processor, where each program is typically responsible for a particular task or function of the system; together, all the tasks deliver the system's functionality. In many computer systems it is not enough that all tasks deliver correct output; it is also crucial that the outputs are delivered at the proper time. Systems with such timing requirements are known as real-time systems. A scheduler is responsible for scheduling all tasks on the processor, i.e., it dictates which task to run and when to run it, to ensure that all tasks are carried out on time. Typically, such tasks need to use the computer system's hardware and software resources to perform their calculations. Examples of resources that are shared among programs are I/O devices, buffers, and memories. The mechanism used to manage shared resources is known as a resource-sharing synchronization protocol.
In recent years, a shift from single-processor platforms to multiprocessor platforms has become inevitable due to the availability of multiprocessor chips and demands for increased performance. Scheduling and resource-sharing protocols have been well studied for uniprocessor systems; in the context of multiprocessors, however, such techniques are not yet fully mature. The shift towards multi-core technology has revealed the demand for real-time scheduling algorithms, along with synchronization protocols, to support real-time applications on multiprocessors, both with and without dependencies.
In this talk, we first give an introduction to real-time embedded systems. Next, we look at scheduling and resource-sharing policies on uniprocessor platforms. We then discuss the extension of these policies to multiprocessor platforms and present recent challenges that have arisen in this context.
Biography:
Sara Afshar is a PhD student at Mälardalen University. She received her B.Sc. degree in Electrical Engineering from Tabriz University, Iran, in 2002 and worked at various engineering companies until 2009. In 2010 she started her M.Sc. in Embedded Systems at Mälardalen University; she obtained her Master's degree in 2012 and began her PhD studies the same year. She is currently working on resource sharing in multiprocessor systems and is part of the Complex Real-Time Embedded Systems group at Mälardalen University.
Dynamic load balancing in distributed systems in the presence of delays a re... (Mumbai Academisc)
This document summarizes a research paper on dynamic load balancing in distributed systems. It develops an optimal one-shot load balancing policy to reallocate incoming external loads at each node. This is extended into an autonomous and distributed load balancing policy that adapts to the dynamic environment. The performance of this proposed dynamic policy is evaluated in a two-node system and compared to static policies and existing dynamic policies by considering task completion time and system processing rate with random load arrivals.
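A minimal sketch of a one-shot rebalancing step, under strong simplifying assumptions (equal node speeds and complete state knowledge; the paper additionally models delays and processing rates, which are omitted here):

```python
# One-shot load balancing: equalize queue lengths across nodes in a
# single reallocation. Node speeds and transfer delays are ignored.

def one_shot_balance(loads):
    """loads: {node: queue_length}. Returns [(src, dst, amount)] moves."""
    target = sum(loads.values()) / len(loads)
    donors = [(n, l - target) for n, l in loads.items() if l > target]
    takers = [(n, target - l) for n, l in loads.items() if l < target]
    moves = []
    for src, extra in donors:
        for i, (dst, need) in enumerate(takers):
            if extra <= 0:
                break
            amt = min(extra, need)
            if amt > 0:
                moves.append((src, dst, amt))
                extra -= amt
                takers[i] = (dst, need - amt)
    return moves

print(one_shot_balance({"a": 10, "b": 2, "c": 6}))
# target is 6, so 4 units move from 'a' to 'b': [('a', 'b', 4.0)]
```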
Allocation of processors to processes in Distributed Systems. Strategies or algorithms for processor allocation. Design and Implementation Issues of Strategies.
This is a presentation for Chapter 7 Distributed system management
Book: DISTRIBUTED COMPUTING, Sunita Mahajan & Seema Shah
Prepared by Students of Computer Science, Ain Shams University - Cairo - Egypt
Introduction: What is clock synchronization?
The challenges of clock synchronization.
Basic Concepts: Software and hardware clocks. Basic clock synchronization algorithm
Algorithms: Deep dive into landmark papers
NTP: Internet scale time synchronization
Scheduling in distributed systems - Andrii Vozniuk
My EPFL candidacy exam presentation: http://wiki.epfl.ch/edicpublic/documents/Candidacy%20exam/vozniuk_andrii_candidacy_writeup.pdf
Here I present how schedulers work in three distributed data processing systems and their possible optimizations. I consider Gamma - a parallel database, MapReduce - a data-intensive system and Condor - a compute-intensive system.
This talk is based on the following papers:
1) Batch Scheduling in Parallel Database Systems by Manish Mehta, Valery Soloviev and David J. DeWitt
2) Improving MapReduce performance in heterogeneous environments by Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz and Ion Stoica
Real time Scheduling in Operating System for MSc CS (Thanveen)
This document discusses real-time scheduling and algorithms. It defines real-time systems as systems where correctness depends on both functional and temporal aspects. Key aspects of real-time scheduling include determinism, responsiveness, reliability and meeting deadlines. Common real-time scheduling algorithms discussed include rate monotonic scheduling, earliest deadline first, and dynamic scheduling approaches. The document also covers priority inversion which can occur when higher priority tasks must wait for lower priority tasks in preemptive multitasking systems.
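A minimal earliest-deadline-first sketch (the task set is invented, and preemption is ignored for brevity):

```python
# Non-preemptive Earliest Deadline First over one-shot jobs.
# Each job: (name, release_time, exec_time, absolute_deadline).

import heapq

def edf_schedule(jobs):
    """Returns [(start, end, name)] execution slices."""
    jobs = sorted(jobs, key=lambda j: j[1])        # order by release time
    t, i, ready, timeline = 0, 0, [], []
    while i < len(jobs) or ready:
        while i < len(jobs) and jobs[i][1] <= t:   # admit released jobs
            name, rel, c, d = jobs[i]
            heapq.heappush(ready, (d, name, c))
            i += 1
        if not ready:                              # idle until next release
            t = jobs[i][1]
            continue
        d, name, c = heapq.heappop(ready)          # nearest deadline first
        timeline.append((t, t + c, name))
        t += c
    return timeline

print(edf_schedule([("A", 0, 2, 5), ("B", 1, 1, 3), ("C", 0, 1, 9)]))
# [(0, 2, 'A'), (2, 3, 'B'), (3, 4, 'C')] -- B meets its deadline of 3
```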
Parallelization of Graceful Labeling Using OpenMP (IJSRED)
This document summarizes research on parallelizing the graceful graph labeling problem using OpenMP on multi-core processors. It introduces the concepts of parallelization, multi-core architecture, and OpenMP. An algorithm is designed to parallelize graceful labeling by distributing graph vertices across processor cores. Execution time and speedup are measured for graphs of increasing size, showing improved speedup and reduced time with parallelization. Results show consistent performance gains as graph size increases due to better utilization of the multi-core architecture.
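The paper itself uses OpenMP; as a rough analogue of the same distribute-work-across-cores idea, here is a hedged Python sketch using multiprocessing to test candidate labelings of a small path graph in parallel (the graph choice and candidate set are illustrative, not taken from the paper):

```python
# Distribute graceful-labeling checks for a path graph across cores.
# A labeling of a path with e edges is graceful when the e consecutive
# differences |a - b| are all distinct (hence exactly {1, ..., e}).

from multiprocessing import Pool

def check_candidate(args):
    candidate, n_edges = args
    induced = {abs(a - b) for a, b in zip(candidate, candidate[1:])}
    return len(induced) == n_edges

if __name__ == "__main__":
    from itertools import permutations
    n_edges = 3
    candidates = [(p, n_edges) for p in permutations(range(n_edges + 1))]
    with Pool() as pool:                 # one worker per core by default
        results = pool.map(check_candidate, candidates)
    print(sum(results), "of", len(candidates), "candidates are graceful")
```

Measuring wall-clock time with and without the pool for larger graphs gives the kind of speedup curve the summary describes.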
This document discusses multiprocessor scheduling. It describes generating synthetic tasks, assigning priorities using static and dynamic methods like Rate Monotonic and Earliest Deadline First algorithms, and allocating tasks to processors using static and dynamic allocation strategies. Real-time scheduling is important for applications like patient monitoring, smart environments, and mobile devices. The work aims to find better processor utilization by evaluating different task-processor assignment policies and scheduling tasks dynamically.
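For the static (Rate Monotonic) side mentioned above, a common quick schedulability check is the Liu and Layland utilization bound; a short sketch with an invented task set:

```python
# Rate Monotonic schedulability via the Liu & Layland bound:
# n periodic tasks are schedulable if sum(C_i / T_i) <= n * (2**(1/n) - 1).

def rm_bound_test(tasks):
    """tasks: list of (exec_time, period). Returns (utilization, bound, ok)."""
    n = len(tasks)
    u = sum(c / t for c, t in tasks)
    bound = n * (2 ** (1 / n) - 1)
    return u, bound, u <= bound

u, bound, ok = rm_bound_test([(1, 4), (2, 8), (1, 10)])
print(f"U = {u:.3f}, bound = {bound:.3f}, schedulable by bound: {ok}")
# The test is sufficient, not necessary: failing it does not prove the
# task set unschedulable (an exact response-time analysis would be needed).
```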
The document discusses various techniques for resource management in distributed systems. It describes approaches like task assignment, load balancing, and load sharing. It also covers desirable features of scheduling algorithms and discusses techniques like task assignment in detail with an example. Furthermore, it discusses concepts like load balancing approaches, task assignment, location policies, state information exchange policies, and priority assignment policies.
Bounded ant colony algorithm for task Allocation on a network of homogeneous ... (ijcsit)
This document summarizes a research paper that proposes a bounded ant colony algorithm (BTS-ACO) for task scheduling on a network of homogeneous processors using a primary site. The algorithm uses an initial bound on each processor's load to control task allocation. It investigates scheduling tasks from a sorted list (SLoT) versus a random list (RLoT). Simulation results show that BTS-ACO with a sorted task list achieves better performance than a random list in terms of scheduling time, makespan, and load balancing.
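As a heavily simplified illustration of the sorted-versus-random list idea (this strips out the ant colony and pheromone machinery entirely and keeps only a bounded greedy allocation; the bound and task sizes are invented):

```python
# Greedy allocation with a per-processor load bound, comparing a sorted
# task list (as in SLoT) against the given order (stand-in for RLoT).

def bounded_allocate(tasks, n_procs, bound):
    """Place each task on the least-loaded processor that stays under
    the bound; if none fits, use the least-loaded processor anyway."""
    load = [0.0] * n_procs
    for t in tasks:
        fits = [p for p in range(n_procs) if load[p] + t <= bound]
        p = min(fits or range(n_procs), key=lambda q: load[q])
        load[p] += t
    return max(load)                      # makespan

tasks = [7, 3, 5, 2, 8, 4]
print("as given:", bounded_allocate(tasks, 2, bound=15))                  # 16
print("sorted  :", bounded_allocate(sorted(tasks, reverse=True), 2, bound=15))  # 15
```

Even this toy version shows the sorted list yielding the smaller makespan, consistent with the simulation results summarized above.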
An efficient approach for load balancing using dynamic ab algorithm in cloud ... (bhavikpooja)
This document outlines a proposed approach for efficient load balancing using a dynamic Ant-Bee algorithm in cloud computing. It discusses limitations of existing ant colony and bee colony algorithms for load balancing. The author aims to develop a new AB algorithm approach that combines aspects of ant colony optimization and bee colony algorithms to improve load balancing optimization and overcome issues like slow convergence and tendency to stagnate in ant colony algorithms. The proposed approach would leverage both the dynamic path finding of ants and pheromone updating of bees for more effective load balancing in cloud environments.
Featuring a brief overview of fault-tolerant mechanisms across various Big Data systems such as Google File System (GFS), Amazon Dynamo, Bigtable, Hadoop MapReduce, and Facebook Cassandra, along with a description of an existing fault-tolerant model.
Clock synchronization in distributed systems (Sunita Sahu)
This document discusses several techniques for clock synchronization in distributed systems:
1. Time stamping events and messages with logical clocks to determine partial ordering without a global clock. Logical clocks assign monotonically increasing sequence numbers.
2. Clock synchronization algorithms like NTP that regularly adjust system clocks across the network to synchronize with a time server. NTP uses averaging to account for network delays.
3. Lamport's logical clocks algorithm, which defines "happened before" relations and increments clocks on events so that logical clocks stay consistent across processes; a minimal sketch follows below.
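A minimal sketch of such a logical clock:

```python
# Lamport logical clocks: increment on local events and sends;
# on receive, jump to max(local, received) + 1.

class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):                 # local event or message send
        self.time += 1
        return self.time

    def receive(self, msg_time):    # message delivery
        self.time = max(self.time, msg_time) + 1
        return self.time

p, q = LamportClock(), LamportClock()
t_send = p.tick()           # p sends a message stamped 1
q.tick(); q.tick()          # q has two local events, clock now 2
print(q.receive(t_send))    # 3 = max(2, 1) + 1
```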
This document discusses processes, threads, interprocess communication, and scheduling in operating systems. It begins by defining processes and threads, explaining process creation and termination, and comparing user-space and kernel-based thread implementations. Interprocess communication methods like semaphores, monitors, and message passing are then introduced. The final section covers CPU scheduling algorithms and goals like throughput, turnaround time, and response time optimization.
This covers the basic concepts of real-time task scheduling: tasks, instances, data sharing and their types, along with important terminology used in scheduling algorithms.
This document provides an overview of concepts related to time and clock synchronization in distributed systems. It discusses the need to synchronize clocks across different computers to accurately timestamp events. Physical clocks drift over time so various clock synchronization algorithms like Cristian's algorithm and Berkeley algorithm are presented to synchronize clocks within a known bound. The Network Time Protocol (NTP) used on the internet to synchronize client clocks to UTC sources through a hierarchy of time servers is also summarized. Logical clocks provide an alternative to physical clock synchronization by assigning timestamps to events based on their order of occurrence.
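A minimal sketch of Cristian's-style offset estimation (the server call below is a placeholder rather than a real network request): the client assumes the server's reply was generated halfway through the measured round trip.

```python
# Cristian's algorithm: estimate the true time at the client as
# T_server + round_trip / 2, assuming symmetric network delay.

import time

def server_time():
    """Placeholder for a network call returning the server's clock."""
    return time.time() + 0.25          # pretend the server is 250 ms ahead

def cristian_offset():
    t0 = time.time()
    t_srv = server_time()
    t1 = time.time()
    rtt = t1 - t0
    estimated_now = t_srv + rtt / 2    # server reply assumed at mid-trip
    return estimated_now - t1          # correction to add to the local clock

print(f"estimated clock offset: {cristian_offset():.3f} s")
```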
The document discusses process migration as a way to balance workload across systems. It describes how a process can be transferred between machines and resume where it left off. Key aspects covered include kernel modules, ELF files, advantages of process migration like load balancing and fault tolerance, and potential applications in distributed and multi-user systems.
Load balancing aims to distribute work evenly across multiple computing resources like CPUs, servers, or disk drives. It helps maximize resource utilization, minimize response times, and avoid overloading or crashing systems. There are two main approaches: push migration moves processes from overloaded to underloaded CPUs, while pull migration moves processes from busy to idle CPUs. Load balancing can be static or dynamic. Dynamic load balancing continually monitors and adjusts to keep the workload balanced. Round robin load balancing schedules tasks to resources based on their weighted capacities to better handle differences in processing power. Load balancing principles can also apply to life by balancing responsibilities to manage everything effectively.
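A hedged sketch of the weighted round-robin dispatch described above (the node names and weights are invented; a production balancer would interleave slots more smoothly rather than bursting):

```python
# Weighted round robin: a server with weight w gets w slots per cycle,
# so faster resources receive proportionally more tasks.

from itertools import cycle

def weighted_rr(servers):
    """servers: {name: integer weight}. Yields names in weighted order."""
    ring = [name for name, w in servers.items() for _ in range(w)]
    return cycle(ring)

dispatch = weighted_rr({"fast-node": 3, "slow-node": 1})
print([next(dispatch) for _ in range(8)])
# ['fast-node', 'fast-node', 'fast-node', 'slow-node',
#  'fast-node', 'fast-node', 'fast-node', 'slow-node']
```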
This document discusses process management in distributed systems. It describes how distributed operating systems aim to make the best use of processing resources across an entire system by sharing processors among all processes. Key concepts discussed include processor allocation, process migration, and threads. Process migration involves transferring a running process from one machine to another to achieve goals like load balancing and fault tolerance. The challenges and mechanisms of freezing, transferring, and restarting a migrating process's address space and forwarding messages are also covered.
This document discusses various concepts related to CPU scheduling. It begins with definitions of scheduling and explains that the CPU requires a mechanism to allocate time to different processes in a fair manner. It then covers key scheduling concepts like scheduling levels (high, intermediate, low), types (preemptive vs non-preemptive), objectives, and algorithms like FCFS, SJF, priority scheduling, and round robin. The document provides examples and comparisons of different scheduling techniques.
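To make the FCFS-versus-SJF comparison concrete, here is a small worked sketch using the classic textbook burst times: SJF reorders the same jobs and cuts the average waiting time.

```python
# Average waiting time under FCFS vs SJF for jobs all arriving at t = 0.

def avg_waiting(bursts):
    waits, t = [], 0
    for b in bursts:
        waits.append(t)    # each job waits until all earlier jobs finish
        t += b
    return sum(waits) / len(waits)

bursts = [24, 3, 3]
print("FCFS:", avg_waiting(bursts))            # (0 + 24 + 27) / 3 = 17.0
print("SJF :", avg_waiting(sorted(bursts)))    # (0 + 3 + 6)  / 3 = 3.0
```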
The document discusses operating system concepts including CPU scheduling, process states, and scheduling algorithms. It covers historical perspectives on CPU scheduling and bursts, preemptive vs. nonpreemptive scheduling, and scheduling criteria. Common scheduling algorithms like first-come, first-served (FCFS), shortest-job-first (SJF), priority, and round robin are described. The roles of long-term and short-term schedulers are defined.
The document discusses different scheduling algorithms used by operating systems. It begins by explaining that the scheduler decides which process to activate from the ready queue when multiple processes are runnable. There are long-term, medium-term, and short-term schedulers that control admission of jobs, memory allocation, and CPU sharing respectively. The goal of scheduling is to optimize system performance and resource utilization while providing responsive service. Common algorithms include first-come first-served (FCFS), shortest job first (SJF), priority scheduling, and round-robin. FCFS schedules processes in the order of arrival while SJF selects the shortest job first. Priority scheduling preempts lower priority jobs.
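A minimal round-robin sketch showing the time-slice mechanics described above (the quantum and burst times are invented):

```python
# Round-robin CPU sharing with a fixed quantum.

from collections import deque

def round_robin(bursts, quantum):
    """bursts: {pid: remaining_time}. Returns (pid, completion_time) list."""
    queue = deque(bursts.items())
    finished, t = [], 0
    while queue:
        pid, remaining = queue.popleft()
        run = min(quantum, remaining)
        t += run
        if remaining - run > 0:
            queue.append((pid, remaining - run))   # back of the ready queue
        else:
            finished.append((pid, t))
    return finished

print(round_robin({"P1": 5, "P2": 3, "P3": 1}, quantum=2))
# [('P3', 5), ('P2', 8), ('P1', 9)] -- short jobs finish early
```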
This document contains chapters from the textbook "Discovering Computers 2010: Living in a Digital World" that discuss computer hardware components. It describes the various parts inside a computer system unit including the motherboard, processor, memory, ports, and expansion slots. The processor contains a control unit and arithmetic logic unit. Memory comes in volatile and non-volatile types. Various ports and connectors are explained such as USB, FireWire, and Bluetooth. Buses and bays are also summarized. Input and output devices are introduced along with biometrics.
This document provides an introduction to computer hardware components. It discusses the processor, RAM, motherboard, hard disk, video cards, ports, BIOS, peripherals, cabinet, and troubleshooting. For each topic, it describes the basic functions and key concepts. For example, it explains that the processor is the computer's brain, RAM is volatile memory, and the motherboard connects different devices. It also provides details on commercially available processors, types of RAM, motherboard components, hard disk specifications, and common troubleshooting steps.
This document provides an overview of multi-core processors, including their history, architecture, advantages, disadvantages, applications and future aspects. It discusses how multi-core processors work with multiple independent processor cores on a single chip to improve performance over single-core processors. Some key points covered include the introduction of dual-core chips by IBM, Intel and AMD in the early 2000s; comparisons of single-core, multi-core and other architectures; advantages like improved multi-tasking and security; and challenges for software to fully utilize multi-core capabilities.
The document outlines a technology guide that discusses the major components of computer hardware, including the central processing unit, memory, storage, input/output devices, and trends in hardware technology. It provides learning objectives about identifying hardware components, describing how CPUs and memory work, differentiating storage types, and discussing strategic issues related to hardware design and business needs. General concepts, technologies, and trends in computer hardware are examined.
Operating system: introduction to operating systems (jaydeesa17)
This document introduces operating systems and their history. It defines an operating system as software that manages computer hardware and provides a simpler interface for user programs. Operating systems are discussed from the user and system perspectives. The history of operating systems is covered in generations from vacuum tubes to personal computers. Three main types of operating systems are described: batch, multiprogramming, and multi-user. Batch systems ran jobs in batches while the other two allowed more concurrent usage of hardware through time-sharing and memory sharing.
This document discusses different types of computer systems including batch processing systems, single-user systems, multi-user systems, single-tasking systems, multi-tasking systems, multiprogramming systems, and distributed systems. It provides examples of batch processing systems for payroll and utility billing. It also describes how batch processing, single-user systems, multi-user systems, multi-tasking, and multiprogramming systems work with no direct user interaction and by allocating short time slices to different programs or users.
The operating system controls the computer by providing an interface between the user and hardware to make the computer more convenient to use. It manages processes, memory, files, security, and interprets commands. The operating system allows users to start and stop processes, allocate memory, create and manage files and directories, implement security measures like passwords and firewalls, and interacts with users through either a command line or graphical user interface.
This document provides an introduction to basic computer hardware components, including the processor, RAM, motherboard, hard disk, cards, ports, BIOS, peripherals, and cabinet. It describes the processor as the brain of the computer and lists common types. It defines RAM as volatile random access memory that comes in static and dynamic varieties. It also briefly outlines hard disks, video cards, sound cards, network interface cards, ports, the BIOS, and various peripherals that connect to the computer, concluding with form factors for computer cabinets.
The document discusses different types of operating systems. It defines an operating system as software that allows computer hardware and software to communicate and function. It then describes GUI operating systems as using graphics and icons navigated by a mouse. It also covers multi-user systems that allow multiple users to access a computer simultaneously or at different times, as well as multiprocessing systems that support more than one processor, and multitasking and multithreading systems that run multiple processes concurrently. Finally, it mentions embedded systems designed for devices like PDAs with limited resources.
The document provides an overview of operating systems, including what constitutes an OS (kernel, system programs, application programs), storage device hierarchy, system calls, process creation and states, process scheduling, inter-process communication methods like shared memory and pipes, synchronization techniques like mutexes and semaphores, readers-writers problem, and potential for deadlocks. Key concepts covered include kernel mode vs user mode, process control blocks, context switching, preemption, and requirements for deadlock situations.
This document provides an overview of the basic hardware components of a personal computer, including input devices, the processing unit, storage devices, and output devices. It discusses what each component is and examples such as keyboards, mice, and monitors as input devices; CPUs from Intel and AMD as the processing unit; hard disks, flash drives, and DVDs as storage devices; and monitors, printers, and speakers as output devices. It also provides some specifications and considerations for different components.
This document lists and briefly describes the main hardware components of a computer system. It includes the motherboard, CPU, RAM, keyboard, mouse, monitor, and various storage drives like floppy disk drives, CD-ROM drives, hard disk drives, and DVD drives. The motherboard contains connectors for additional components and controllers to interface with peripheral devices. RAM provides temporary storage while the computer is on. Hard disks provide high-capacity permanent storage. DVD and CD drives can read optical discs for data access or multimedia playback.
The document discusses the architecture and functions of operating systems. It describes operating systems as system software that acts as an interface between hardware and application software. The key functions of operating systems include managing memory, files, devices, and providing common services for application programs. Examples of common operating systems like Windows, UNIX, and VAX/VMS are given.
The document discusses various CPU scheduling algorithms including first come first served, shortest job first, priority, and round robin. It describes the basic concepts of CPU scheduling and criteria for evaluating algorithms. Implementation details are provided for shortest job first, priority, and round robin scheduling in C++.
The document presents an overview of operating systems. It begins with an introduction that defines an operating system as software that controls computer resources and provides an interface for users. The document then discusses the structure of operating systems, including their role in managing resources and acting as an interface between hardware and users/programs. It outlines the main functions of operating systems such as process management, memory management, file management, security, and command interpretation. Finally, it briefly describes some popular operating systems like DOS, Unix, and Windows NT and concludes with the importance of operating systems for running applications and using computers.
The document discusses operating systems and real-time operating systems. It defines an operating system as software that manages computer hardware resources and provides common services for programs. It then describes the main functions of an operating system including managing resources and devices, running applications, and providing a user interface. The document also discusses different types of operating systems including single-user/single-tasking, single-user/multi-tasking, and multi-user/multi-tasking. It defines a real-time operating system as one intended for real-time applications that has advanced scheduling algorithms to ensure deterministic timing behavior.
MULTI-CORE PROCESSORS: CONCEPTS AND IMPLEMENTATIONS (ijcsit)
This research paper compares two multi-core processor machines: the Intel Core i7-4960X (Ivy Bridge E) and the AMD Phenom II X6. It starts by introducing a single-core processor machine to motivate the need for multi-core processors, then explains the multi-core processor machine and the issues that arise in implementing them. It also presents real-life example machines such as the TILEPro64 and the Epiphany-IV 64-core 28nm microprocessor (E64G401). The methodology for comparing the Intel Core i7 and AMD Phenom II processors starts by explaining how processor performance is measured and by listing the most important technical specifications relevant to the comparison. The comparison then uses different metrics such as power, the use of Hyper-Threading technology, the operating frequency, the use of AES encryption and decryption, and the characteristics of the cache memory, such as its size, classification, and memory controller, before reaching a rough conclusion about which of the two has better overall performance.
HISTORY AND FUTURE TRENDS OF MULTICORE COMPUTER ARCHITECTURE (ijcga)
The multicore technology concept is centered on the parallel computing possibility that can boost computer efficiency and speed by integrating two or more CPUs (Central Processing Units) in a single chip. A multicore architecture places multiple processor cores and groups them as one physical processor. The primary goal is to develop a system that can handle and complete more than one task at the same time, thereby getting better system performance in general. This paper describes the history and future trends of multicore computer architecture.
This document discusses computer architecture and trends in computer performance. It covers topics like computer architecture definitions, design goals that influence architecture choices, factors that affect performance like instructions per cycle and clock speed, different types of memory and their speeds, trends in processor architecture over time including increasing transistor counts and multicore processors, and how latency and bandwidth impact network and system performance. It also provides some interesting historical facts about competition between Intel and AMD in the CPU market.
TIME CRITICAL MULTITASKING FOR MULTICORE MICROCONTROLLER USING XMOS® KIT (ijesajournal)
This paper presents research on multicore microcontrollers using parallel and time-critical programming for embedded systems. Due to the high complexity and limitations of such architectures, application development for them is very hard. The experimental results in the paper are based on the xCORE multicore microcontroller from XMOS®. The paper also demonstrates multi-tasking and parallel programming on the same platform. The tasks assigned to multiple cores are executed simultaneously, which saves time and energy. A comparative study of multicore processors and multicore controllers concludes that a microarchitecture-based controller with multiple cores shows better performance in a time-critical multi-tasking environment. The work not only illustrates the functionality of the multicore microcontroller but also presents a technique for programming, profiling, and optimization on such platforms in real-time environments.
COMPARATIVE ANALYSIS OF SINGLE-CORE AND MULTI-CORE SYSTEMS (ijcsit)
The overall performance of computer systems is better investigated and evaluated when their various components are considered: hardware, software, and firmware. This comparative analysis of single-core and multi-core systems was carried out using an Intel Pentium G640T 2.4GHz dual-core system and Intel Pentium IV 2.4GHz and 2.8GHz single-core systems. The approach used benchmarking and stress-testing software to examine each system's CPU and RAM for performance and stability. In all the tests, the dual-core components rated better than the single-core components; the GFLOP results and execution times for various processes rank the G640T 2.4GHz dual-core above the Pentium IV 2.4GHz and 2.8GHz single-core systems respectively.
This document summarizes a paper on deploying CPU load balancing in a Linux cluster. It discusses:
1) Maintaining load balancing in computing clusters is challenging due to unpredictable load variations. The paper addresses this challenge by designing a dynamic load balancing algorithm.
2) The algorithm designates one node as the master server to maintain CPU and IP information for all nodes. Nodes report their status every 30 seconds.
3) A load balancer node selects the least busy node to process new tasks in a non-repetitive way, maintaining an even load distribution. It solves the readers-writers problem using sockets instead of file locking.
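A hedged sketch of the master's selection step (the report format and the tie-breaking rule are assumptions; the paper's version exchanges these reports over sockets, with nodes reporting every 30 seconds):

```python
# Pick the least busy node from the latest status reports, skipping the
# node chosen last time so assignments are spread non-repetitively.

def pick_node(reports, last_choice=None):
    """reports: {ip: cpu_load_percent}. Returns the chosen ip."""
    candidates = {ip: load for ip, load in reports.items() if ip != last_choice}
    if not candidates:                 # only one node is known
        candidates = reports
    return min(candidates, key=candidates.get)

reports = {"10.0.0.1": 62.0, "10.0.0.2": 17.5, "10.0.0.3": 23.1}
first = pick_node(reports)                        # '10.0.0.2'
second = pick_node(reports, last_choice=first)    # '10.0.0.3'
print(first, second)
```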
A multi-core processor contains two or more independent processing units called cores that can execute program instructions simultaneously. This increases overall speed for programs that can be parallelized. Manufacturers integrate multiple cores onto a single chip using chip multiprocessing. Each core functions similarly to a single-core processor, running threads through time-slicing, while the operating system perceives each core as a separate processor and maps threads across cores. Multi-core architecture provides performance gains through parallel computing but software must be optimized for parallelism.
This document provides an overview of advancements in microprocessor technology from 1965 to 2015. It discusses how the focus has shifted from increasing clock speeds to improving efficiency through methods like multi-core designs that enable parallel processing. The document outlines key developments such as the introduction of multi-core chips and Intel's move toward multi-core architectures. It also discusses how software tools are becoming increasingly important to optimize performance and how the Itanium processor family was designed to take advantage of instruction-level parallelism.
Top 10 Supercomputers With Descriptive Information & Analysis (NomanSiddiqui41)
Top 10 Supercomputers Report
What is a Supercomputer?
A supercomputer is a computer with a high level of performance compared to a general-purpose computer. The performance of a supercomputer is commonly measured in floating-point operations per second (FLOPS) rather than million instructions per second (MIPS). Since 2017, there have been supercomputers that can perform over 10^17 FLOPS (a hundred quadrillion FLOPS, i.e., 100 petaFLOPS or 100 PFLOPS, since 100 × 10^15 = 10^17).
Supercomputers play an important role in the field of computational science, and are used for a wide range of computationally intensive tasks in various fields, including quantum mechanics, weather forecasting, climate research, oil and gas exploration, molecular modeling (computing the structures and properties of chemical compounds, biological macromolecules, polymers, and crystals), and physical simulations (such as simulations of the early moments of the universe, airplane and spacecraft aerodynamics, the detonation of nuclear weapons, and nuclear fusion). They have been essential in the field of cryptanalysis.
1. The Fugaku Supercomputer
Introduction:
Fugaku is a petascale supercomputer (petascale on the mainstream benchmark) at the RIKEN Center for Computational Science in Kobe, Japan. Development started in 2014 as the successor to the K computer, and it began full operation in 2021. Fugaku made its debut in 2020 and became the fastest supercomputer in the world in the June 2020 TOP500 list, as well as the first ARM architecture-based computer to achieve this. In June 2020 it achieved 1.42 exaFLOPS on the HPL-AI benchmark, making it the first supercomputer to exceed 1 exaFLOPS on any benchmark. As of November 2021, Fugaku is the fastest supercomputer in the world. It is named after an alternative name for Mount Fuji.
Block Diagram:
Functional Units:
Functional Units, Co-Design and System for the Supercomputer “Fugaku”
1. Performance estimation tool: This tool, taking Fujitsu FX100 (FX100 is the previous Fujitsu supercomputer) execution profile data as an input, enables the performance projection by a given set of architecture parameters. The performance projection is modeled according to the Fujitsu microarchitecture. This tool can also estimate the power consumption based on the architecture model.
2. Fujitsu in-house processor simulator: We used an extended FX100 SPARC instruction-set simulator and compiler, developed by Fujitsu, for preliminary studies in the initial phase, and an Armv8þSVE simulator and compiler afterward.
3. Gem5 simulator for the Post-K processor: The Post-K processor simulator3 based on an opensource system-level processor simulator, Gem5, was developed by RIKEN during the co-design for architecture verification and performance tuning. A fundamental problem is the scale of scientific applications that are expected to be run on Post-K. Even our target applications are thousands of lines of code and are written to use complex algorithms and data structures. Altho
Multicore processor technology advantages and challengeseSAT Journals
Abstract Until recent times, we have worked with processors having a single computing/processing unit (CPU), also called a core. The clock frequency of the processor, which determines the speed of it, cannot be exceeded beyond a certain limit as with the increasing frequency, the power dissipation increases and therefore the amount of heating. So manufacturers came up with a new design of processors, called Multicore processors. A multicore processor has two or more independent computing/processing units (cores) on the same chip. Multiple cores have advantage that they run on lower frequency as compared to the single processing unit, which reduces the power dissipation or temperature. These multiple cores work together to increase the multitasking capability or performance of the system by operating on multiple instructions simultaneously in an efficient manner. This also means that with multithreaded applications, the amount of parallel computing or parallelism is increased. The applications or algorithms must be designed in such a way that their subroutines take full advantage of the multicore technology. Each core or computing unit has its own independent interface with the system bus.. But along with all these advantages, there are certain issues or challenges that must be addressed carefully when we add more cores. In this paper, we discuss about multicore processor technology. In addition to this, we also discuss various challenges faced such as power and temperature (thermal issue), interconnect issue etc. when more cores are added. Key Words: CMP (Chip Multiprocessor), Clock, Core, ILP (Instructions Level parallelism), TLP (Thread level Parallelism).
This document reviews parallel computing and compares different parallel programming models. It discusses CPU and GPU architectures, highlighting that GPUs are designed for massive parallelism while CPUs balance computing power and flexibility. The document evaluates programming models based on supported system architectures, programming interfaces, workload partitioning, task assignment, synchronization methods, and communication models.
This document discusses using OpenCL to accelerate numerical modeling of gravitational wave sources on hardware accelerators like GPUs and the Cell BE. It summarizes the EMRI Teukolsky Code, which models gravitational waves generated by a compact object orbiting a supermassive black hole by solving the Teukolsky equation. The authors parallelized this code using OpenCL to run on GPUs and the Cell BE, achieving performance comparable to using each vendor's native SDK while only writing code once for both architectures.
The document discusses several types of processors including Pentium 4, dual-core, and quad-core processors, explaining their features and advantages. Pentium 4 used the NetBurst architecture but faced challenges scaling to higher speeds. Dual-core and quad-core processors place multiple processor cores on a single chip to improve performance through parallel processing while reducing power needs.
Applying Cloud Techniques to Address Complexity in HPC System Integrationsinside-BigData.com
In this video from the HPC User Forum at Argonne, Arno Kolster from Providentia Worldwide presents: Applying Cloud Techniques to Address Complexity in HPC System Integrations.
"The Oak Ridge Leadership Computing Facility (OLCF) and technology consulting company Providentia Worldwide recently collaborated to develop an intelligence system that combines real-time updates from the IBM AC922 Summit supercomputer with local weather and operational data from its adjacent cooling plant, with the goal of optimizing Summit’s energy efficiency. The OLCF proposed the idea and provided facility data, and Providentia developed a scalable platform to integrate and analyze the data."
Watch the video: https://wp.me/p3RLHQ-kOg
Learn more: http://www.providentiaworldwide.com/
and
http://hpcuserforum.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Multicore processor by Ankit Raj and Akash PrajapatiAnkit Raj
A multi-core processor is a single computing component with two or more independent processing units called cores. This development arose in response to the limitations of increasing clock speeds in single-core processors. By incorporating multiple cores that can execute multiple tasks simultaneously, multi-core processors provide greater performance with less heat and power consumption than single-core processors. Programming for multi-core requires spreading workloads across cores using threads or processes to take advantage of the parallel processing capabilities.
This document is a project report submitted by three students (Amit Kumar, Ankit Singh, and Sushant Bhadkamkar) for their Bachelor of Engineering degree in Computer Science. The report describes their work on a parallel computing cluster called Parallex. Parallex aims to create a high-performance computing system without requiring modifications to operating system kernels. It allows different operating systems and processor architectures to work together in parallel without using existing parallel libraries. The students implemented new distribution algorithms and parallel algorithms for Parallex to make administration and usage simple while maintaining efficiency.
Similar to 4838281 operating-system-scheduling-on-multicore-architectures (20)
1. To make asynchronous serial communication using a microcontroller's USART, the transmitter must configure the baud rate generator and enable transmission by writing data to the transmit register, while the receiver must configure the baud rate generator and enable reception to read incoming data from the receive register.
2. Key steps include setting the SPBRG register and BRGH bit to determine the baud rate, enabling the serial port and transmission/reception, handling 9-bit data if needed, and checking status registers for transmission completion or errors.
3. Asynchronous serial communication allows microcontrollers to transmit data bit by bit over a single line using start and stop bits for synchronization instead of a separate clock line.
The document discusses analog to digital conversion. It explains that analog signals are continuous while digital signals are discrete in both time and amplitude. It describes how analog signals are converted to digital using sample and hold circuits, quantization, and encoding. The conversion process filters the analog signal, takes samples at regular time intervals, rounds samples to the nearest digital value, and encodes samples into binary format. The document also provides examples of analog to digital converters and discusses considerations like resolution, dynamic range, and signal conditioning.
This document discusses timers and how they are used to count time in digital hardware. Timers use counters and prescalars to determine the time interval for an overflow interrupt. A timer overflow can be used to trigger an action every specific time period. The document explains counters, prescalars, and how to calculate the number of counts and time to overflow for a timer. It provides an example of using the 8-bit Timer0 peripheral in a microcontroller to generate an interrupt every 4 seconds using a 32.768 kHz oscillator. Real-time applications of timers are discussed, along with an assignment to generate a 100 kHz square wave using Timer0.
An interrupt is an asynchronous signal that indicates an event needs the processor's immediate attention, preempting the current instruction. Interrupts save processing time by allowing other tasks to execute while waiting for an event, providing faster response. When an interrupt occurs, the processor stores its state, jumps to the interrupt service routine to handle the event, then restores its state and returns to the original program. The document discusses interrupts in microprocessors and various interrupt sources for PIC microcontrollers like external pins, timers, and peripherals. It provides an example of using the RB0 pin interrupt to light an LED when a button is pressed.
This document discusses I/O ports, how to use them, and handling the bouncing problem with switches. It explains that I/O ports allow communication between a microcontroller and the outside world by reading and writing voltage levels on pins. The direction of pins is set by a TRIS register. Switches connected to pins can bounce, so software reads the pin multiple times with a delay to filter out false readings. LEDs are used as simple outputs, requiring current limiting resistors. Sample code is provided to output patterns on one port based on inputs to another, including a function to handle switch bouncing.
Introduction to Embedded Systems and MicrocontrollersIslam Samir
The document provides an introduction to microcontrollers and embedded systems. It discusses prerequisites for the course including digital logic design and C programming. Microcontrollers allow implementing algorithms with minimized cost and power by writing efficient programs. Studying embedded systems is important for electrical engineers in Egypt to develop technical skills and compete globally. The course agenda covers topics such as embedded systems, microcontrollers, architecture, PIC microcontrollers, memory organization, and C programming.
Your One-Stop Shop for Python Success: Top 10 US Python Development Providersakankshawande
Simplify your search for a reliable Python development partner! This list presents the top 10 trusted US providers offering comprehensive Python development services, ensuring your project's success from conception to completion.
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc
How does your privacy program stack up against your peers? What challenges are privacy teams tackling and prioritizing in 2024?
In the fifth annual Global Privacy Benchmarks Survey, we asked over 1,800 global privacy professionals and business executives to share their perspectives on the current state of privacy inside and outside of their organizations. This year’s report focused on emerging areas of importance for privacy and compliance professionals, including considerations and implications of Artificial Intelligence (AI) technologies, building brand trust, and different approaches for achieving higher privacy competence scores.
See how organizational priorities and strategic approaches to data security and privacy are evolving around the globe.
This webinar will review:
- The top 10 privacy insights from the fifth annual Global Privacy Benchmarks Survey
- The top challenges for privacy leaders, practitioners, and organizations in 2024
- Key themes to consider in developing and maintaining your privacy program
Introduction of Cybersecurity with OSS at Code Europe 2024Hiroshi SHIBATA
I develop the Ruby programming language, RubyGems, and Bundler, which are package managers for Ruby. Today, I will introduce how to enhance the security of your application using open-source software (OSS) examples from Ruby and RubyGems.
The first topic is CVE (Common Vulnerabilities and Exposures). I have published CVEs many times. But what exactly is a CVE? I'll provide a basic understanding of CVEs and explain how to detect and handle vulnerabilities in OSS.
Next, let's discuss package managers. Package managers play a critical role in the OSS ecosystem. I'll explain how to manage library dependencies in your application.
I'll share insights into how the Ruby and RubyGems core team works to keep our ecosystem safe. By the end of this talk, you'll have a better understanding of how to safeguard your code.
Fueling AI with Great Data with Airbyte WebinarZilliz
This talk will focus on how to collect data from a variety of sources, leveraging this data for RAG and other GenAI use cases, and finally charting your course to productionalization.
Building Production Ready Search Pipelines with Spark and MilvusZilliz
Spark is the widely used ETL tool for processing, indexing and ingesting data to serving stack for search. Milvus is the production-ready open-source vector database. In this talk we will show how to use Spark to process unstructured data to extract vector representations, and push the vectors to Milvus vector database for search serving.
Best 20 SEO Techniques To Improve Website Visibility In SERPPixlogix Infotech
Boost your website's visibility with proven SEO techniques! Our latest blog dives into essential strategies to enhance your online presence, increase traffic, and rank higher on search engines. From keyword optimization to quality content creation, learn how to make your site stand out in the crowded digital landscape. Discover actionable tips and expert insights to elevate your SEO game.
What is an RPA CoE? Session 1 – CoE VisionDianaGray10
In the first session, we will review the organization's vision and how this has an impact on the COE Structure.
Topics covered:
• The role of a steering committee
• How do the organization’s priorities determine CoE Structure?
Speaker:
Chris Bolin, Senior Intelligent Automation Architect Anika Systems
Skybuffer SAM4U tool for SAP license adoptionTatiana Kojar
Manage and optimize your license adoption and consumption with SAM4U, an SAP free customer software asset management tool.
SAM4U, an SAP complimentary software asset management tool for customers, delivers a detailed and well-structured overview of license inventory and usage with a user-friendly interface. We offer a hosted, cost-effective, and performance-optimized SAM4U setup in the Skybuffer Cloud environment. You retain ownership of the system and data, while we manage the ABAP 7.58 infrastructure, ensuring fixed Total Cost of Ownership (TCO) and exceptional services through the SAP Fiori interface.
In the realm of cybersecurity, offensive security practices act as a critical shield. By simulating real-world attacks in a controlled environment, these techniques expose vulnerabilities before malicious actors can exploit them. This proactive approach allows manufacturers to identify and fix weaknesses, significantly enhancing system security.
This presentation delves into the development of a system designed to mimic Galileo's Open Service signal using software-defined radio (SDR) technology. We'll begin with a foundational overview of both Global Navigation Satellite Systems (GNSS) and the intricacies of digital signal processing.
The presentation culminates in a live demonstration. We'll showcase the manipulation of Galileo's Open Service pilot signal, simulating an attack on various software and hardware systems. This practical demonstration serves to highlight the potential consequences of unaddressed vulnerabilities, emphasizing the importance of offensive security practices in safeguarding critical infrastructure.
Generating privacy-protected synthetic data using Secludy and MilvusZilliz
During this demo, the founders of Secludy will demonstrate how their system utilizes Milvus to store and manipulate embeddings for generating privacy-protected synthetic data. Their approach not only maintains the confidentiality of the original data but also enhances the utility and scalability of LLMs under privacy constraints. Attendees, including machine learning engineers, data scientists, and data managers, will witness first-hand how Secludy's integration with Milvus empowers organizations to harness the power of LLMs securely and efficiently.
How information systems are built or acquired puts information, which is what they should be about, in a secondary place. Our language adapted accordingly, and we no longer talk about information systems but applications. Applications evolved in a way to break data into diverse fragments, tightly coupled with applications and expensive to integrate. The result is technical debt, which is re-paid by taking even bigger "loans", resulting in an ever-increasing technical debt. Software engineering and procurement practices work in sync with market forces to maintain this trend. This talk demonstrates how natural this situation is. The question is: can something be done to reverse the trend?
FREE A4 Cyber Security Awareness Posters-Social Engineering part 3Data Hops
Free A4 downloadable and printable Cyber Security, Social Engineering Safety and security Training Posters . Promote security awareness in the home or workplace. Lock them Out From training providers datahops.com
Digital Marketing Trends in 2024 | Guide for Staying AheadWask
https://www.wask.co/ebooks/digital-marketing-trends-in-2024
Feeling lost in the digital marketing whirlwind of 2024? Technology is changing, consumer habits are evolving, and staying ahead of the curve feels like a never-ending pursuit. This e-book is your compass. Dive into actionable insights to handle the complexities of modern marketing. From hyper-personalization to the power of user-generated content, learn how to build long-term relationships with your audience and unlock the secrets to success in the ever-shifting digital landscape.
"Choosing proper type of scaling", Olena SyrotaFwdays
Imagine an IoT processing system that is already quite mature and production-ready and for which client coverage is growing and scaling and performance aspects are life and death questions. The system has Redis, MongoDB, and stream processing based on ksqldb. In this talk, firstly, we will analyze scaling approaches and then select the proper ones for our system.
Have you ever been confused by the myriad of choices offered by AWS for hosting a website or an API?
Lambda, Elastic Beanstalk, Lightsail, Amplify, S3 (and more!) can each host websites + APIs. But which one should we choose?
Which one is cheapest? Which one is fastest? Which one will scale to meet our needs?
Join me in this session as we dive into each AWS hosting service to determine which one is best for your scenario and explain why!
Seminar “Parallel Computing” Summer term 2008
Seminar paper
Parallel Computing (703525)
Optimisation: Operating System Scheduling on multi-core architectures
Course instructor: T. Fahringer
Thomas Zangerl
Abstract
As multi-core architectures begin to emerge in every area of computing, operating system
scheduling that takes the peculiarities of such architectures into account will become
mandatory. Due to architectural differences to traditional multi-processors, such as shared
caches, memory controllers and smaller cache sizes available per computational unit, it does
not suffice to simply schedule tasks on multi-core processors in the same way as on SMP
systems.
Furthermore, current research motivates architectural changes in CPU design, such as multi-
core processors with asymmetric core-performance and so-called many-core architectures that
integrate up to 100 cores in one package. Such architectures will exhibit a fundamentally
different behaviour with regard to shared resource utilization and performance of non-
parallelizable code compared to current CPUs. It will be the responsibility of the operating
system to spare the programmer as much platform-specific knowledge as possible and
optimize overall performance by employing intelligent and configurable scheduling
mechanisms.
Abstract
1. Introduction
1.1 Why use multi-core processors at all?
1.2 What’s so different about multi-core scheduling?
2. OS process scheduling state of the art
2.1 Linux scheduler
2.1.1 The Completely Fair Scheduler
2.1.2 Scheduling domains
2.2 Windows scheduler
2.3 Solaris scheduler
3. Ongoing research on the topic
3.1 Cache-Fairness
3.2 Balancing core assignment
3.3 Performance asymmetry
3.4 Scheduling on many-core architectures
4. Conclusion
5. References
1. Introduction
1.1 Why use multi-core processors at all?
In the last few years, multi-core CPUs have become a standard component in nearly all sorts of computers: not only servers and high-end workstations but also consumer desktop and laptop PCs and even game consoles nowadays usually come with CPUs that have more than one core.
This development is not surprising; already in 1965, Gordon Moore predicted that the number of transistors that can be cost-effectively built onto integrated circuits was going to double every year ([1]). In 1975, Moore corrected that assumption to a period of two years; nowadays this period is frequently assumed to be 18 months.
Moore's projection has more or less been accurate up to today, and consumers have gotten used to the constant speedup of computer hardware: buyers expect a new computer to show a significant speedup over a model two years older (even though an increase in transistor density does not always lead to an equal increase in computing speed). For chip manufacturers, however, it has become increasingly difficult to keep up with Moore's law. In order to implement the exponential increase of integrated circuits, the transistor structures have to become steadily smaller. On the one hand, the extra transistors were used for the integration of more and more specialized instruction sets on CISC chips. On the other hand, smaller transistor sizes led to higher clock rates of the CPUs because, due to physical factors, the gates in the transistors could switch states faster.
However, since electronic activity always produces heat as an unwanted by-product, the more transistors are packed together in a small CPU die area, the higher the resulting heat dissipation per unit area becomes ([2]). With the higher switching frequencies, the electronic activity happened in ever smaller intervals, and hence more and more heat was dissipated. Cooling the processor components became a crucial factor in design considerations, and it became clear that increasing the clock frequency could no longer serve as the primary source of processor speedup.
Hence, there had to be a paradigm shift in order to still make applications run faster; on the one hand, the amazing abundance of transistors on processor chips was used to increase the cache sizes. This alone, however, would not result in an adequate performance gain, since it only helps memory-intensive applications to a certain degree. In order to effectively counteract the heat problem while making use of the small structures and the high number of transistors on a chip, the notion of multi-core processors for consumer PCs was introduced. Since CMOS technology had met its limits for further increases of the CPU clock frequency and the number of transistors that could be integrated on a single die allowed for it, the idea emerged that multiple processing cores could be placed on a single processor die.
In 2006, Intel released the Core™ microprocessor, a die package with two processor cores
with their own level 1 caches and a shared level 2 cache ([3]).
Also in 2006, AMD, the second major CPU manufacturer for the consumer market, released the Athlon™ X2, a processor with an architecture quite similar to the Core platform, but additionally featuring the concept of also sharing a CPU-integrated memory controller among the cores ([4]).
Both architectures have been improved and sold with a range of consumer desktop and laptop
computers - but also servers and workstations - up to today; therefore the presence of multi-
core processors in a large number of today's PCs can be assumed.
1.2 What’s so different about multi-core scheduling?
One could assume that the scheduling process on such multi-core processors wouldn’t differ
much from conventional scheduling – intuitively the run-queue would just have to be replaced
by n run-queues, where n is the number of cores and processes would simply be scheduled to
the currently shortest run-queue (with some additional process-priority treatment, maybe).
While that might seem reasonable, there are some properties of current multi-core
architectures that speak strongly against such a naïve approach. First, in many multi-core architectures, each core manages its own level 1 cache (Figure 1). By just naïvely rescheduling interrupted processes to a shorter queue which belongs to another core (task migration), parts of the process’s cache working set may be needlessly lost and the overall performance may suffer.
architecture is not a multi-core but a NUMA system where memory access can become very
costly if the process is scheduled on the “wrong” node.
Figure 1: Typical multi-core architecture
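To make the problem with the naïve policy concrete, the following minimal C sketch (invented data and names, not taken from any real scheduler) shows the shortest-run-queue placement described above; note that nothing in it considers where a task last ran, which is precisely what makes it cache-oblivious:

#include <stdio.h>

#define NCORES 4

/* Per-core run-queue lengths (illustrative values). */
static int runqueue_len[NCORES] = { 3, 1, 2, 2 };

/* Naive placement: pick the core with the currently shortest run-queue,
 * ignoring where the task ran before and which cache holds its working set. */
static int pick_core_naive(void)
{
    int best = 0;
    for (int i = 1; i < NCORES; i++)
        if (runqueue_len[i] < runqueue_len[best])
            best = i;
    return best;
}

int main(void)
{
    printf("naive scheduler picks core %d\n", pick_core_naive()); /* core 1 */
    return 0;
}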
A second important point is that the performance of the different cores in a multi-core system may be asymmetric ([5]). This effect can emerge for several reasons:
• Design considerations. Many slow cores can be used for increasing the throughput of parallel computations, while a few faster cores contribute to the efficient processing of costly tasks which cannot be parallelized ([6]).
Even algorithms that are parallelizable contain parts that have to be executed
sequentially, which will benefit from the higher speed of the fastest core. Hence
performance-asymmetry has been shown to be a very efficient approach in multi-core
architectures ([7]).
• Transistor failures. Some parts of the CPU may get damaged over time and become
automatically disabled by the CPU. Since such components may fail in certain cores
independently of the other cores, performance asymmetry may arise in initially symmetric cores over time ([5]).
• Power-saving policies. Different cores may switch to different P- or C-power-states at different times in order to save power. At different P-states, identical cores run at different clock frequencies. If an OS scheduler manages to take this into account for processes not in need of all system resources, the system can remain more energy-efficient over the execution time while giving away little or no performance at all ([8]).
Hence, performance asymmetry, the fact that various CPU components can be shared among
cores, and non-uniform access to computation resources such as memory, mandate the design
of efficient multi-core scheduling mechanisms or scheduling frameworks at the operating
system level.
Multi-core processors have gone mainstream, and while there is a demand that they be used efficiently in terms of performance, the currently fashionable term Green-IT also motivates the energy-efficient use of the CPU cores.
Section 2 will explore how far current operating systems have evolved in support of the new
architectures.
2. OS process scheduling state of the art
2.1 Linux scheduler
2.1.1 The Completely Fair Scheduler
The Linux scheduler in versions prior to 2.6.23 performed its tasks in complexity O(1) by basically just using per-CPU run-queues and priority arrays ([9]). Kernel version 2.6.23, which was released on October 9, 2007, introduced the so-called completely fair scheduler (CFS). The change in scheduler was mainly motivated by the failure of the old scheduler to correctly predict whether applications are interactive (I/O-bound) or CPU-intensive ([10]). Therefore the new scheduler has completely abandoned the notion of different kinds of processes and treats them all equally. A red-black tree is used as the data structure that orders the tasks according to their “right” to use the processor resources for a predefined interval until the next context switch. The process positioned at the leftmost node of the tree is the one most entitled to use the processor resources at that time. The position of a process in the tree only depends on the wait-time of the process in the run-queue (including the time the process is actually waiting for events) and on the process priority ([11]). This concept is fairly simple but works with all kinds of processes, especially interactive ones, since they get a boost simply by being credited for their I/O waiting time. However, the total scheduling complexity increases to O(log n), where n is the number of processes in the run-queue, since at every context switch the process has to be reinserted into the red-black tree.
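As a rough model of that ordering (a C sketch with invented numbers; the actual kernel uses a real red-black tree and more elaborate priority weighting):

#include <stdio.h>

/* Illustrative model (not kernel code) of the CFS ordering idea: each task
 * carries a key derived from its time waited in the run-queue and its
 * priority weight; the task with the smallest key corresponds to the
 * leftmost node of the red-black tree and is picked next. A linear scan
 * stands in for the tree walk here. */
struct task {
    const char *name;
    long long wait_ns;  /* time waited, including time blocked on I/O */
    int weight;         /* priority weight: higher = more entitled */
};

static long long key(const struct task *t)
{
    /* Longer waits and higher weights move a task further left. */
    return -(t->wait_ns * t->weight);
}

static const struct task *pick_next(const struct task *ts, int n)
{
    const struct task *leftmost = &ts[0];
    for (int i = 1; i < n; i++)
        if (key(&ts[i]) < key(leftmost))
            leftmost = &ts[i];
    return leftmost;
}

int main(void)
{
    struct task ts[] = {
        { "interactive", 9000000LL, 1024 }, /* waited long on events */
        { "cpu-hog",     1000000LL, 1024 },
    };
    printf("runs next: %s\n", pick_next(ts, 2)->name); /* interactive */
    return 0;
}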
The scheduling algorithm itself has not been designed in special consideration of multi-core
architectures. When Ingo Molnar, the designer of the scheduler, was asked what the implications on HT/SMP/NUMA architectures would be, he answered that there would inevitably be some effect, and that if it were negative, he would fix it. He admits that the fairness approach can result in increased cache-coldness for processes in some situations ([12]).
However, the red-black trees of CFS are managed per runqueue ([13]), which assists in
cooperation with the Linux load-balancer.
2.1.2 Scheduling domains
Linux load-balancing takes care of different cache models and computing architectures but at
the moment not necessarily of performance asymmetry. The underlying model of the Linux
load balancer is the concept of scheduling domains, which was introduced in Kernel version
2.6.7 due to the unsatisfactory performance of Linux scheduling on SMP and NUMA systems
in prior versions ([14]).
Basically, scheduling domains are hierarchical sets of computation units on which scheduling
is possible; the scheduling domain architecture is constructed based on the actual hardware
resources of a computing element ([9]). Scheduling domains contain lists with scheduling
groups that share common properties.
For example, the way scheduling should be done on two logical processors of an HT system and on two physical processors of an SMP system is different; the HT “cores” share a common cache and memory hierarchy, so task migration is not a problem if one of the logical cores becomes idle. However, in an SMP or multi-core system, in which the cache or parts of the cache are administered by each core separately, migrating tasks with a large working set may become problematic. This applies even more to NUMA machines, where different CPUs may be closer to or more remote from the memory the process is using. Therefore, all these architectures have to be treated differently.
The scheduling domain concept introduces scheduling domains, logical unions of computing resources that share common properties and that it is reasonable to treat equally, as well as CPU groups within these domains. Those groups contain the hardware-addressable computing resources that are part of the domain, among which the balancer can try to even out the domain load.
Scheduling domains are hierarchically nested: there is a top-level domain containing all other domains of the physical system Linux is running on. Depending on the actual architecture, the sub-domains represent NUMA node groups, physical CPU groups, multi-core groups or SMT groups in a respective hierarchical nesting. This structure is built automatically based on the actual topology of the system, and for reasons of efficiency each CPU keeps a copy of every domain it belongs to. For example, a logical SMT processor that is at the same time a core in a physical multi-core processor on a NUMA node with multiple (SMP) processors would administer four sched_domain structures in total, one for each level of parallel computing it is involved in ([15]).
Figure 2: Example hierarchy in the Linux scheduling domains
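The nesting can be pictured with a small C sketch (hypothetical types and field names, not the kernel's struct sched_domain): each level groups the units that share a resource, and a logical CPU walks its domain chain bottom-up, with balancing intervals growing towards the top:

#include <stdio.h>

/* Illustrative model of nested scheduling domains: each level groups CPUs
 * that share some resource; balancing runs more often at the lower levels. */
struct domain {
    const char *level;        /* e.g. "SMT", "MC", "CPU", "NUMA" */
    unsigned long cpu_mask;   /* CPUs contained in this domain */
    unsigned balance_ms;      /* balancing interval (invented values) */
    struct domain *parent;    /* next higher level, NULL at the top */
};

int main(void)
{
    struct domain numa = { "NUMA", 0xFF, 1000, NULL  };
    struct domain phys = { "CPU",  0x0F,  200, &numa };
    struct domain mc   = { "MC",   0x03,   50, &phys };
    struct domain smt  = { "SMT",  0x01,   10, &mc   };

    /* One logical CPU walks its chain bottom-up, like the four
     * sched_domain structures in the example above. */
    for (struct domain *d = &smt; d; d = d->parent)
        printf("%-5s mask=0x%02lx balance every %u ms\n",
               d->level, d->cpu_mask, d->balance_ms);
    return 0;
}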
Load-balancing takes place at the scheduling domain level, between the different groups. Balancing at each domain level respects the constraints set by that level’s properties. For example, load balancing happens very often between logical simultaneous
multithreading cores, but very rarely on the NUMA level, where remote memory access is
costly.
The scheduling domain for multi-core processors was added with Kernel version 2.6.17 ([16])
and especially considers the shared last level cache that multi-core architectures frequently
possess. Hence, on an SMP machine with two multi-core packages, two tasks will be scheduled on different packages if both packages are currently idle, in order to make use of the overall larger cache.
In recent Linux kernels, the multi-core processor scheduling domain also offers support for energy-efficient scheduling, which can be used if e.g. the powersave governor is set in the cpufreq tool. Saving energy can be achieved by changing the P- and C-states of the cores in the different packages. However, P-state transitions are made by adjusting the voltage and the frequency of a core, and since there is only one voltage regulator per socket on the mainboard, the P-state depends on the busiest core. So, as long as any core in a package is busy, the P-state will be relatively low, which corresponds to a high frequency and voltage.
While the P-states remain relatively fixed, the C-states can be manipulated. Adjusting the C-
states means turning off parts of the registers, blocking interrupts to the processor, etc. ([17])
and can be done on each core independently. However, the shared cache features its own C-
state regulator and will always stay in the lowest C-state that any of the cores has.
Therefore, energy-efficiency is often limited to adjusting the C-state of a non-busy core while
leaving other C-states and the packages’ P-state low.
Linux scheduling within the multi-core domain with the powersave-governor turned on will
attempt to schedule multiple tasks on one physical package as long as it is feasible. This way,
other multi-core packages will be allowed to transition into higher P- and C-states. The author
of [9] claims that the performance impact will generally be relatively low and that the performance-loss/power-saving trade-off will be rewarding if the energy-efficient scheduling approach is used.
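The consolidation idea can be sketched as follows (a hedged illustration with invented loads and thresholds, not the actual kernel policy): prefer a package that is already busy but not saturated, so that fully idle packages can stay in deep power states:

#include <stdio.h>

#define NPKG 2

static int pkg_load[NPKG] = { 3, 0 };  /* runnable tasks per package */
static const int pkg_capacity = 4;     /* cores per package (invented) */

/* Powersave-style placement: consolidate work on an already-busy,
 * non-saturated package so that other packages may stay idle. */
static int pick_package_powersave(void)
{
    int fallback = -1;
    for (int p = 0; p < NPKG; p++) {
        if (pkg_load[p] > 0 && pkg_load[p] < pkg_capacity)
            return p;                  /* busy but not full: consolidate */
        if (fallback < 0 && pkg_load[p] < pkg_capacity)
            fallback = p;              /* otherwise any non-full package */
    }
    return fallback;
}

int main(void)
{
    printf("powersave places task on package %d\n",
           pick_package_powersave()); /* package 0, keeping package 1 idle */
    return 0;
}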
2.2 Windows scheduler
In Windows, scheduling is conducted on threads. The scheduler is priority-based, with priorities ranging from 0 to 31. Timeslices are allocated to threads in a round-robin fashion; these timeslices are assigned to the highest-priority threads first, and only if no thread of a given priority is ready to run at a certain time may lower-priority threads receive the timeslice. However, if higher-priority threads become ready to run, the lower-priority threads are preempted.
In addition to the base priority of a thread, Windows dynamically changes the priorities of low-prioritized threads in order to ensure the “felt” responsiveness of the operating system. For example, the thread associated with the foreground window on the Windows desktop receives a priority boost. After such a boost, the thread priority gradually decays back to the base priority ([21]).
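The boost-and-decay behaviour can be modelled roughly like this (an illustrative C sketch; the real Windows rules for boost amounts and decay timing are more involved than shown here):

#include <stdio.h>

/* Illustrative boost-and-decay model: a thread's dynamic priority is
 * raised on an event such as gaining foreground focus, then steps back
 * toward its base priority as it consumes quanta. */
struct win_thread {
    int base;     /* base priority, 0..31 */
    int dynamic;  /* current (possibly boosted) priority */
};

static void boost(struct win_thread *t, int amount)
{
    t->dynamic = t->base + amount;
}

static void quantum_end(struct win_thread *t)
{
    if (t->dynamic > t->base)
        t->dynamic--;  /* decay gradually back to the base priority */
}

int main(void)
{
    struct win_thread t = { 8, 8 };
    boost(&t, 2);                      /* e.g. foreground-window boost */
    for (int q = 0; q < 3; q++) {
        printf("priority %d\n", t.dynamic); /* 10, 9, 8 */
        quantum_end(&t);
    }
    return 0;
}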
Scheduling on SMP-systems is basically the same, except that Windows keeps the notion of a
thread’s processor affinity and an ideal processor for a thread. The ideal processor is, for example, the processor with the highest cache locality for a certain thread. However, if the ideal processor is not idle at the time of lookup, the thread may just run on another processor. In [21] and other sources, however, no explicit information is given on scheduling mechanisms specific to multi-core architectures.
2.3 Solaris scheduler
In Solaris scheduler terminology, processes are called kernel- or user-mode threads, depending on the space in which they run. User threads don’t only exist in user space: whenever a user thread is created, a so-called lightweight process is set up that connects the user thread to a kernel thread. These kernel threads are what the scheduler operates on.
Solaris 10 offers a number of scheduling classes, which may co-exist on one system. The
scheduling classes provide an adaptive model for the specific types of applications which are
intended to run on top of the operating system. ([18]) mentions Solaris 10 scheduling classes
for:
• Standard (timeshare) threads, whose priority may be adapted by the scheduler.
• Interactive applications (the currently active window in the graphical user interface).
• Fair sharing of system resources (instead of priority-based sharing)
• Fixed-priority threads. The priority of threads scheduled with this scheduler does not
vary over the scheduling time.
• System (kernel) threads.
• Real-time threads, with a fixed priority and time share. Threads in this scheduling
class may preempt system threads.
The scheduling classes for timeshare, fixed-priority and fair sharing are not recommended for
simultaneous use on a system, while other combinations of scheduling classes on a set of
CPUs are possible.
The timeshare and interactive schedulers are quite similar to the old Linux scheduler (before CFS) in their attempt to identify I/O-bound processes and provide them with a priority boost. Threads have a fixed time quantum they may use once they get the context, and they receive a new priority based on whether they fully consume their time quantum and on their waiting time for the context. Fair-share scheduling uses a fixed time quantum (share) allocated to processes ([19]) as the basis for scheduling. Different processes (actually collections of processes or, in Solaris 10 terminology, projects) compete for quanta on a computing resource, and their position in that competition depends on how large the value they have been assigned is in relation to the total number of quanta on the computing resource.
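The share arithmetic itself is straightforward; as a sketch (invented numbers, not Solaris code), each project's portion of the available quanta is simply its share value relative to the total:

#include <stdio.h>

/* Hedged sketch of fair-share accounting: each project's entitlement to
 * CPU quanta is proportional to its assigned shares relative to the
 * total shares on the computing resource. */
struct project { const char *name; int shares; };

int main(void)
{
    struct project ps[] = { { "batch", 10 }, { "web", 30 } };
    int total = 0, quanta = 100;  /* quanta available in one period */

    for (int i = 0; i < 2; i++)
        total += ps[i].shares;
    for (int i = 0; i < 2; i++)
        printf("%s gets %d of %d quanta\n",
               ps[i].name, quanta * ps[i].shares / total, quanta); /* 25, 75 */
    return 0;
}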
Solaris explicitly deals with the scenario that parts of the processor’s resources may be shared, as is likely with typical multi-core processors. There is a kernel abstraction called “processor group” (pg_t) that is built according to the actual system topology and represents logical CPUs that share some physical properties (like caches or a common socket). These groupings can be examined by the dispatcher, e.g. in order to maintain logical CPU affinity for the sake of cache-hotness where that is reasonable. Quite similar to the concept of
Linux’s scheduling domains, Solaris 10 tries to simultaneously achieve load balancing on
multiple levels (for example if there are physical CPUs with multiple cores and SMT) ([20]).
3. Ongoing research on the topic
Research on multi-core scheduling deals with a number of different topics, many of which are
orthogonal (e.g. maximizing fairness and throughput). The purpose of this section is to
present an interesting selection of different approaches to multi-core scheduling. Sections 3.1
and 3.2 summarize proposals to improve fairness and load-balancing on current multi-core
architectures while sections 3.3 and 3.4 concentrate on approaches for scheduling on
promising new computing architectures, such as multi-core processors with asymmetric
performance and many-core CPUs.
3.1 Cache-Fairness
Several studies (e.g. [22], [23]) suggest that operating system schedulers insufficiently deal with threads that allocate large parts of the shared level 2 cache and thus slow down threads running on the other core that uses the same cache.
The situation is unsatisfactory for several reasons: First, it can lead to unpredictable execution times and throughput, and second, scheduling priorities may lose their effectiveness because of threads running on cores with aggressive “co-runners” (i.e. threads running on another core in the same package).
Figure 3 shows such a scenario: Thread B uses the larger part of the shared cache and thus may negatively influence the cycles per instruction that thread A achieves during its CPU time share.
L2 cache misses are more costly than L1 cache misses because the latency to main memory is higher than the latency to the next cache level. However, it is mostly the L2 cache that is shared among different cores.
The authors of [22] try to mitigate the above-mentioned effects by introducing a cache-fairness-aware scheduler for the Solaris 10 operating system.
Figure 3: Unfair cache utilization by thread B
In their scheduling algorithm, the threads on a system are grouped into a best-effort class and a cache-fair class. Best-effort threads are penalized for the sake of the performance stability of cache-fair threads, if necessary, but not vice versa. However, care is taken that this does not result in inadequate discrimination of best-effort threads. Fairness is enforced by allocating longer time shares to cache-fair threads that suffer from cache-intensive co-runners, at the expense of these co-runners, if they are best-effort threads. Figure 4 illustrates that process.
Figure 4: Restoring fairness by adjusting timeshares
In order to compute the quantum that the thread is entitled to, the algorithm uses an analytical model to estimate a few reference values that would hold if the thread had run under fair circumstances, namely the fair L2 cache miss rate, the fair CPI rate and the fair number of instructions (during a quantum). All those values are based on the assumption of fair circumstances, so the difference from the actual values is computed, and a quantum extension is calculated which should serve as “compensation”.
Those calculations are done once for new cache-fair class threads: their L2 cache miss rates and other data are measured with different co-runners. Subsequently, the dispatcher periodically selects best-effort threads from which it takes parts of their quanta and assigns them as “compensation” to the cache-fair threads. New cache-fair threads are not explicitly scheduled with different co-runners in the analysis phase; instead, whenever new combinations of cache-fair threads with co-runners occur, analytical input is mined from the CPU’s hardware counters.
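A back-of-the-envelope version of that compensation computation might look as follows (the formula shape is inferred from the description above, not quoted from [22]; all numbers are invented):

#include <stdio.h>

/* Illustrative compensation computation: estimate how many instructions
 * the thread should have completed at its fair CPI, compare to what it
 * actually completed, and extend the quantum accordingly. */
int main(void)
{
    double quantum_ms = 10.0;
    double fair_cpi   = 1.2;  /* model estimate under a fair cache share */
    double actual_cpi = 1.8;  /* measured with an aggressive co-runner */

    /* Instructions completed are proportional to time / CPI, so the
     * extension restoring the fair instruction count is: */
    double extension_ms = quantum_ms * (actual_cpi / fair_cpi - 1.0);
    printf("quantum extension: %.1f ms\n", extension_ms); /* 5.0 ms */
    return 0;
}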
The authors of [22] state that according to their measurements, the penalties on best-effort
threads are low, while the algorithm actually enforces priority better than standard scheduling
and improves fairness. These claims are supported by experimental data gathered using the
SPEC CPU2000 suite on an UltraSPARC T1 processor. The experiments measure the time it
takes a thread to complete a specific benchmark while a cache-intensive second thread is
executed in the same package. The execution times under these circumstances are compared
with the times of threads running with co-runners with low-cache requirements. This
comparison shows differences of up to 37% in execution time between the two scenarios on a system with a standard scheduler, while with the cache-fair scheduler the variability decreases to 7%.
At the same time, however, measurements of the execution times of threads in the best-effort
scheduling class reveal a slowdown of up to 8% for some threads (while some even
experience a speed-up).
3.2 Balancing core assignment
Fedorova et al. ([27]) argue that during operating system scheduling, several aspects have to be considered besides optimal performance. It is shown that scheduling tasks on the cores in an imbalanced way results in jittery performance (i.e. unpredictability of a task’s completion time) and, as a consequence, insufficient priority enforcement. It could be part of the operating system scheduler’s job to ensure that jobs are evenly assigned to cores. In the approach described in [27], this is done by using a self-tuning scheduling algorithm based on a per-core benefit function. This benefit function is based on three input components:
1) The normalized core preference of a thread, which is based on the instructions per cycle that a thread j can achieve on a certain core i ($IPC_{j,i}$), normalized by $\max_k(IPC_{j,k})$ (where k is an arbitrary CPU/core)
2) The cache-affinity, a value which is 1 if the thread j was scheduled on core i within a
tuneable time period and 0 otherwise
3) The average cache investment of a thread on a core which is determined by inspecting
the hardware cache miss counter from time to time
This benefit function can then be used to determine whether it would be feasible to migrate threads from core i to core j. For each core, a benefit value $B_{i,0}$ is computed that represents the case of no thread migrations taking place. For each thread k on a core, an updated benefit value $B_{i,k-}$ is computed for the hypothetical scenario of migrating that thread onto another core. Of course, the benefit will increase, since fewer threads are executed on the core. But the thread that would have been taken away in the hypothetical scenario would have to be migrated to another core, which would influence the benefit value of the target core. Therefore, the updated benefit value of any other system core j to which the thread in question would be migrated, called $B_{j,k+}$, has to be computed as well. The hypothetical migration of thread k from core i to core j becomes reality if $B_{i,k-} + B_{j,k+} > B_{i,0} + B_{j,0} + a \cdot D_{CAB} + b \cdot D_{RTF}$. $D_{CAB}$ represents a system-wide balance constraint, while $D_{RTF}$ ensures per-job response-time fairness (i.e. the slowdown that results for the thread in question from the thread migration does not exceed some maximum value).
These two constants, together with the criteria included in the benefit function itself (most notably cache-affinity), should help to ensure that the self-tuning fulfils the three goals of optimal performance, core-assignment balance and response-time fairness.
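Written out as code, the migration test is a direct transcription of the inequality above (C sketch with invented numbers; the benefit values would come from the per-core benefit function):

#include <stdbool.h>
#include <stdio.h>

/* Migration test from the inequality above: migrate thread k from core i
 * to core j only if the combined updated benefit beats the status quo
 * plus the weighted balance (D_CAB) and fairness (D_RTF) terms. */
static bool should_migrate(double B_i_0, double B_j_0,
                           double B_i_k_minus, double B_j_k_plus,
                           double a, double D_CAB,
                           double b, double D_RTF)
{
    return B_i_k_minus + B_j_k_plus >
           B_i_0 + B_j_0 + a * D_CAB + b * D_RTF;
}

int main(void)
{
    /* All values invented for illustration. */
    bool migrate = should_migrate(2.0, 1.5, 2.6, 1.8, 0.5, 0.4, 0.5, 0.6);
    printf("migrate thread k: %s\n", migrate ? "yes" : "no"); /* yes */
    return 0;
}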
However, the authors have not yet actually implemented the scheduling modification in Linux, and hence the statements on its effectiveness remain somewhat theoretical.
3.3 Performance asymmetry
It has been advocated that building multi-core chips with asymmetric performance of the
different cores can have advantages for the overall processing speed of a CPU. For example, it
can prove beneficial if one fast core can be used to speed up parts that can hardly be parallelized, while multiple slower cores come into play when parallel code parts are executed.
By keeping the cores for parallel execution slower than the core(s) for serial execution, die-
area and cost can be saved ([24]) while power consumption may be reduced.
[25] closely analyzes the impact of performance asymmetry on the average speedup of an application as the number of cores increases. The paper concludes that “[a]symmetric multicore chips
can offer maximum speedups that are much greater than symmetric multicore chips (and
never worse)”. Hence, performance asymmetry at the processor core level seems to be a
promising approach for future multi-core architectures.
[26] suggests that asymmetric multiprocessing CPUs do exceptionally well for moderately parallelized applications while scaling not much worse for highly parallel programs (see Figure 5).
Figure 5: Comparison of speedup with SMP and AMP using highly parallel programs (left), moderately parallel programs (middle) and highly sequential programs (right) (image source: http://www.intel.com/technology/magazine/research/power-efficiency-0206.htm)
Apart from explicit design, performance asymmetry can also occur in initially symmetric multi-core processors through power-saving mechanisms (increasing the C-state) or through transistor failures that lead to the disabling of parts of a core’s components.
Problems arise from the fact that the programmer usually assumes symmetric performance of
the processor cores and designs her programs accordingly. Therefore, the operating system
scheduler should support processor performance asymmetry, which is currently not the case
for most schedulers. However, it would be imaginable to see this as a Linux scheduling
domain in the future.
[5] describes AMPS, a proposal for how a performance-asymmetry-aware scheduler for Linux could look. Basically, the scheduler consists of three components: an asymmetry-specific load balancer, a faster-core-first scheduler, and a migration mechanism specifically for NUMA machines that will not be covered in detail here. The scheduling mechanism tries to achieve better performance, fairness (with respect to the thread priorities) and repeatability of performance results.
In order to conduct load-balancing, the core performance is assessed in a first step. AMPS measures core performance at boot time using benchmarks and sets the performance quantifier for the slowest core to 1 and, for faster cores, to a number higher than 1. The scaled load of a core is then the run-queue length of the core divided by its performance quantifier. If this scaled load lies between a minimum and a maximum threshold, the system is considered load-balanced; otherwise threads will be migrated. Due to the scaling factor, faster cores will receive more workload than slower ones.
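The scaled-load computation can be sketched directly (C illustration; performance quantifiers and thresholds are invented here, whereas AMPS measures the former with boot-time benchmarks):

#include <stdio.h>

#define NCORES 4

/* Performance quantifiers: slowest core = 1, faster cores > 1. */
static double perf[NCORES]  = { 2.0, 1.0, 1.0, 1.0 }; /* one fast core */
static int    rqlen[NCORES] = { 4, 2, 2, 0 };

/* Scaled load = run-queue length divided by the performance quantifier. */
static double scaled_load(int core)
{
    return rqlen[core] / perf[core];
}

int main(void)
{
    double min_t = 1.0, max_t = 2.5;  /* invented balance thresholds */
    for (int c = 0; c < NCORES; c++) {
        double s = scaled_load(c);
        printf("core %d scaled load %.2f%s\n", c, s,
               (s < min_t || s > max_t) ? " -> rebalance" : "");
    }
    return 0;
}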
Besides the load-balancing, cores with lower scaled load are preferred in thread placement. Whenever new threads are scheduled, the new scaled load that would apply if the thread were scheduled to one specific core is computed. Then the thread is scheduled to the core with the least new scaled load; in case of a tie, faster cores are preferred. Additionally, the load-balancer may migrate threads even if the source core of the migration becomes idle by doing so. AMPS only migrates threads if the new scaled load of the destination core does not exceed
the scaled load on the source core. This way, the benefits of available fast cores can be fully exploited without overburdening them.
It can be expected that frequent core migration results in performance loss through cache misses (e.g. in the level 1 cache). However, experimental results in [5] reveal no excessive performance loss from the fact that task migration among cores occurs more often than with standard schedulers.
Figure 6: Speedup of the AMPS scheduler compared to the Linux scheduler on AMP with two fast and six slow cores (picture taken from [5])
Instead, performance on AMP systems measurably improves (see Figure 6; the standard scheduler speed would be 1, and the median speedup is 1.16), while fairness and repeatability are preserved better than with standard Linux (there is less deviation in the execution times of different processes).
3.4 Scheduling on many-core architectures
Current research points to the potential of future CPU architectures that consist of a multitude
of different cores (tens to hundreds) in order to prolong the time the industry can keep up with
Moore’s prediction. Such CPU architectures are going to require novel approaches in thread
scheduling, since the available shared memory per core is potentially smaller, while main
memory access time increases and single-threaded performance decreases. Even more so than
with current CMP chips, schedulers that treat such large-scale CMP architectures just like SMP systems are going to fail with respect to performance.
[28] identifies three major challenges in scheduling on many-core architectures, namely the
comparatively small cache sizes which render memory access costly, the fact that non-
specialized programmers will have to program code for them and the wide range of
application scenarios that have to be considered for such chips. The latter two challenges
result from the projected general-purpose use of future many-core architectures.
In order to deal with these challenges, an OS-independent experimental runtime-environment
called McRT is presented in [28]. The runtime environment was built from scratch and is
independent of operating system kernels – hence the performed operations occur at user-level.
The connection to the underlying operating system is established using so-called host adapters, while programming environments like pthreads or OpenMP can invoke the scheduling environment via client adapters. For programmability, it provides high-level transactions (instead of locks), and the heterogeneity is alleviated by giving the user the choice among different runtime policies (which influence the scheduling behaviour).
The overall target of the scheduler is to maximize resource utilization and provide the user
with flexible scheduler configurability. Basically, the scheduler is configured using three parameters, P, Q and T, which respectively denote the number of (logical) processors, task
queues and threading abstractions. P and Q change the scheduling behaviour from strict
affinity to work stealing. T can be used to specify different behaviour for different threads.
This way, the concept of scheduler domains can be realized.
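As an illustration of that configuration space (hypothetical struct, simplified far beyond the real McRT parameters), the ratio of task queues to processors already spans the spectrum from per-processor affinity to fully shared queues:

#include <stdio.h>

/* Hypothetical sketch of the P/Q idea: P processors sharing Q task
 * queues. Q == P approximates strict per-processor affinity; Q == 1
 * gives one queue from which all processors take work. The real McRT
 * policies (including work stealing and the T parameter) are richer. */
struct mcrt_config {
    int processors;   /* P */
    int task_queues;  /* Q */
};

static const char *policy(const struct mcrt_config *c)
{
    if (c->task_queues == c->processors) return "strict affinity";
    if (c->task_queues == 1)             return "single shared queue";
    return "queues shared by processor groups";
}

int main(void)
{
    struct mcrt_config a = { 32, 32 }, b = { 32, 1 }, c = { 32, 8 };
    printf("P=32 Q=32: %s\n", policy(&a));
    printf("P=32 Q=1:  %s\n", policy(&b));
    printf("P=32 Q=8:  %s\n", policy(&c));
    return 0;
}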
Notably, the scheduling system does not use pre-emption; instead, threads are expected to yield at certain times. This design choice was motivated by the authors’ belief that pre-emption stems from the need to time-share expensive and fast resources, a need which will become obsolete with many-core architectures. The runtime actively supports
constructs such as barriers, which are often needed in parallel programming. Barriers are
designed to avoid busy waiting – for example, a thread yields once it has reached the barrier
but won’t be re-scheduled until all other threads have reached the barrier.
With a pre-emptive scheduling mechanism, the thread would receive the context from time to
time just to check whether other threads have reached the barrier – with the integrated barrier
support based on a co-operative scheduling approach used in McRT, this won’t happen.
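A single-threaded C simulation of that co-operative barrier behaviour (illustrative only, not McRT code) makes the difference visible: blocked threads are simply marked not runnable instead of being rescheduled to poll:

#include <stdio.h>

#define NTHREADS 4

struct barrier {
    int arrived;
    int total;
};

static int runnable[NTHREADS] = { 1, 1, 1, 1 };

/* A thread reaching the barrier yields and is marked not runnable, so
 * the scheduler never wastes quanta polling; the last arrival releases
 * all waiters at once. */
static void barrier_wait(struct barrier *b, int tid)
{
    b->arrived++;
    if (b->arrived < b->total) {
        runnable[tid] = 0;   /* yield; not rescheduled until release */
        printf("thread %d blocked at barrier\n", tid);
    } else {
        for (int i = 0; i < NTHREADS; i++)
            runnable[i] = 1; /* last arrival wakes everyone */
        printf("thread %d releases the barrier\n", tid);
    }
}

int main(void)
{
    struct barrier b = { 0, NTHREADS };
    for (int t = 0; t < NTHREADS; t++)
        barrier_wait(&b, t);
    for (int t = 0; t < NTHREADS; t++)
        printf("thread %d runnable: %d\n", t, runnable[t]);
    return 0;
}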
The client-side adapter, e.g. for OpenMP, promises to directly translate many OpenMP
constructs to the McRT API.
[28] also contains a large section on experimental results from benchmarks of typical desktop
and scientific applications, such as the XviD MPEG-4 encoder, singular value decomposition
(SVD) and self-organizing maps (SOM), a type of neural network. The results were gathered on
a cycle-accurate many-core simulator in which 32 KByte of L1 cache are shared among 4
simultaneously multithreaded cores that form one physical core; 32 such physical cores share
2 MByte of L2 cache and 4 MByte of off-chip L3 cache. The simulator provides a cost-free
MWait instruction that allows a thread to tell the processor that it is blocked and which
resource it is waiting for. Only when the resource becomes available will the CPU execute
the thread again. Hence, threads that are waiting on barriers and locks don't consume system
resources. It is important to keep in mind that no such mechanism exists on current physical
processors when interpreting the experimental speedup results for McRT.
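The closest analogue on today's systems is blocking on a condition variable, which also avoids busy waiting but pays for a kernel transition that the simulated, cost-free MWait does not. A minimal pthreads sketch (ours, for comparison only):

```c
/* Parking a thread until a resource is available, condition-variable
 * style: no CPU is consumed while waiting, but the park/unpark goes
 * through the OS kernel, unlike the simulator's hardware-level MWait. */
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;
static bool resource_ready = false;

/* Waiting side: sleeps until the resource is published. */
void wait_for_resource(void) {
    pthread_mutex_lock(&m);
    while (!resource_ready)
        pthread_cond_wait(&c, &m);
    pthread_mutex_unlock(&m);
}

/* Signalling side: corresponds to "the resource becomes available". */
void publish_resource(void) {
    pthread_mutex_lock(&m);
    resource_ready = true;
    pthread_cond_broadcast(&c);
    pthread_mutex_unlock(&m);
}
```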
Figure 7: Speedup of XviD encoding in the McRT framework, compared to single-core
performance (picture taken from [28])
The experiments reveal that XviD encoding scales very well on McRT (Figure 7; 1080p and
768p denote different video resolutions, and the curve labelled "linear" models the ideal
speedup). However, the encoding process was explicitly tweaked for the many-core scenario:
under the condition that only very little fast memory exists per logical core,
parallelization wasn't conducted at the frame level; instead, single frames were partitioned
into parts that were encoded in parallel using OpenMP, as sketched below.
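The partitioning scheme described above can be illustrated with a small OpenMP sketch (ours, not the actual XviD code); encode_slice() is a trivial placeholder for the real encoder kernel:

```c
/* Intra-frame parallelisation: one frame is shared by all cores and cut
 * into slices, so each core's working set fits the small per-core cache. */
#include <omp.h>
#include <stdio.h>

#define W 1920
#define H 1080
#define SLICES 32   /* more slices than cores -> better load balance */

static unsigned char frame[H][W];
static unsigned long slice_result[SLICES];

/* Placeholder for the real encoder kernel: here it just sums pixels. */
static void encode_slice(int s) {
    int rows = H / SLICES;
    unsigned long sum = 0;
    for (int y = s * rows; y < (s + 1) * rows; y++)
        for (int x = 0; x < W; x++)
            sum += frame[y][x];
    slice_result[s] = sum;
}

int main(void) {
    /* Slices are encoded in parallel; dynamic scheduling lets faster
     * cores pick up extra slices. */
    #pragma omp parallel for schedule(dynamic)
    for (int s = 0; s < SLICES; s++)
        encode_slice(s);
    printf("encoded %d slices\n", SLICES);
    return 0;
}
```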
The scalability of SVD and SOM is quite similar to that of XviD; more figures and
evaluations of different scheduling strategies can be found in [28].
4. Conclusion
The probability is high that processor architectures will undergo extensive changes in order
to keep up with Moore's law in the future. AMPs and many-core CPUs are just two proposals
for innovative new architectures that may help prolong the time horizon within which
Moore's law can stay valid. Operating system schedulers are going to have to adapt to the
changing underlying architectures.
The scheduler domains of Linux and Solaris add some urgently needed flexibility to the OS,
because computing architectures that exhibit different behaviour with regard to memory
access or single-threaded performance can be integrated into the load-balancing hierarchy
quite easily. However, it must be argued that in the future, more will probably have to be
done in the scheduling algorithms themselves; scheduler domains at least provide the
required flexibility at the level of threads.
Sadly, it has to be said that current Windows schedulers won't scale with the number of cores
or with performance asymmetry at any rate. Basically, the Windows scheduler treats multi-core
architectures like SMP systems and hence cannot give proper care to the peculiarities of such
CPUs, like the shared L2 cache (and possibly the varying, or simply bad, single-threaded
performance that is going to be a characteristic of emerging architectures). Ty Carlson,
director of technical strategy at Microsoft, even mentioned at a panel discussion that current
Windows releases (including Vista) were "designed to run on 1, 2, maybe 4 processors"4 but
wouldn't scale beyond that. He seems to be right when he says that future versions of
Windows will have to be fundamentally redesigned.
4 See: http://www.news.com/8301-10784_3-9722524-7.html
Current research shows the road such a redesign could follow. The approach described in
section 3.3 seems to perform quite well for multi-core processors with asymmetric
performance. Its advantage is that the load-balancing algorithm can be (and was)
implemented as a modification to current operating system kernels and hence can be made
available quickly once asymmetric architectures gain widespread adoption. Experimental
evaluation of the scheduling algorithm reveals promising results, also with regard to fairness.
Section 3.4 presents a way in which futuristic schedulers on upcoming many-core
architectures could operate. The McRT runtime environment makes use of interesting
techniques, and the authors of the paper manage to explain intelligibly why pre-emptive
scheduling is going to be obsolete on many-core architectures. However, their implementation
is realized in user space and burdens the programmer/user with a multitude of configuration
options and programming decisions that are required for the framework to achieve
optimal performance.
[29] introduces an easier-to-use thread scheduling mechanism based on the McRT efforts;
experimental assessments that could testify to its performance, although planned, haven't
been conducted yet. It will be interesting to keep an eye on the further development of
scheduling approaches for many-core architectures, since they might gain fundamentally in
importance in the future.
Achieving fairness and repeatability on today's available multi-core architectures are the
major design goals of the scheduling techniques detailed in sections 3.1 and 3.2. The first
approach is justified by a number of experimental results which show that priorities are
actually enforced much better than with conventional schedulers; however, it remains to be seen
whether the complexity of the scheduling approach, and the amount of overhead potentially
introduced by it, justify that improvement. Maybe it would even be advantageous to
implement such mechanisms at the hardware level, if possible.
The algorithm mentioned in section 3.2 hasn't been implemented yet, so it remains an open
question whether such a rather complicated load-balancing algorithm would be feasible in
practice. From the description one can infer that it takes a lot of computation and thread
migration to ensure load balance, and it would be interesting to see the overhead that the
computations and the cache misses imposed by the mechanism place on the system. Without
any experimental data, those figures are hard to assess.
5. References
[1] G. E. Moore: "Cramming more components onto integrated circuits", Electronics,
Volume 38, Number 8, 1965.
[2] "Why Parallel Processing":
http://www.tc.cornell.edu/Services/Education/Topics/Parallel/Concepts/2.+Why+Parallel+Processing.htm
[3] O. Wechsler: "Inside Intel® Core™ Microarchitecture", Intel Technology Whitepaper,
http://download.intel.com/technology/architecture/new_architecture_06.pdf
[4] "Key Architectural Features AMD Athlon™ X2 Dual-Core Processors",
http://www.amd.com/us-en/Processors/ProductInformation/0,,30_118_9485_13041%5E13043,00.html
[5] Li et al.: "Efficient Operating System Scheduling for Performance-Asymmetric
Multi-Core Architectures", In: International Conference on High Performance Computing,
Networking, Storage, and Analysis, 2007.
[6] Balakrishnan et al.: "The Impact of Performance Asymmetry in Emerging Multicore
Architectures", In: Proceedings of the 32nd Annual International Symposium on Computer
Architecture, pages 506–517, June 2005.
[7] M. Annavaram, E. Grochowski, and J. Shen: “Mitigating Amdahl’s law through EPI
throttling”. In Proceedings of the 32nd Annual International Symposium on Computer
Architecture, pages 298–309, June 2005.
[8] V. Pallipadi, S.B. Siddha: “Processor Power Management features and Process
Scheduler: Do we need to tie them together?” In: LinuxConf Europe 2007
[9] S.B. Siddha: “Multi-core and Linux Kernel”, http://oss.intel.com/pdf/mclinux.pdf
[10] http://kernelnewbies.org/Linux_2_6_23
[11] http://lwn.net/Articles/230574/
[12] J. Andrews: “Linux: The Completely Fair Scheduler“, http://kerneltrap.org/node/8059
[13] A. Kumar: “Multiprocessing with the Completely Fair Scheduler”,
http://www.ibm.com/developerworks/linux/library/l-cfs/index.html
[14] Kernel 2.6.7 Changelog:
http://www.kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.7
[15] Scheduling domains:
http://lwn.net/Articles/80911
[16] Kernel 2.6.17 Changelog:
http://www.kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.17
[17] T. Kidd: “C-states, C-states and even more C-states”,
http://softwareblogs.intel.com/2008/03/27/update-c-states-c-states-and-even-more-c-states/
[18] Solaris 10 Process Scheduling:
http://www.princeton.edu/~unix/Solaris/troubleshoot/schedule.html
[19] Solaris manpage FSS(7):
http://docs.sun.com/app/docs/doc/816-5177/fss-7?l=de&a=view&q=FSS
[20] Eric Saxe: “CMT and Solaris Performance”,
http://blogs.sun.com/esaxe/entry/cmt_performance_enhancements_in_solaris
[21] MSDN section on Windows scheduling:
http://msdn.microsoft.com/en-us/library/ms685096%28VS.85%29.aspx
[22] A. Fedorova, M. Seltzer and M. D. Smith: “Cache-Fair Thread Scheduling for Multicore
Processors”, Technical Report TR-17-06, Harvard University, Oct. 2006
[23] S. Kim, D. Chandra and Y. Solihin: “Fair Cache Sharing and Partitioning in a Chip
Multiprocessor Architecture”, In Proceedings of the International Conference on
Parallel Architectures and Compilation Techniques, 2004
[24] D. Menasce and V. Almeida: “Cost-Performance Analysis of Heterogeneity in
Supercomputer Architectures”, In: Proceedings of the 4th International Conference on
Supercomputing, June 1990.
[25] M. D. Hill and M. R. Marty: “Amdahl’s Law in the Multicore Era”, In: IEEE
Computer, 2008
[26] B. Crepps: “Improving Multi-Core Architecture Power Efficiency through EPI
Throttling and Asymmetric Multiprocessing”, Intel Technology Magazine,
http://www.intel.com/technology/magazine/research/power-efficiency-0206.htm
[27] A. Fedorova, D. Vengerov and D. Doucette: “Operating System Scheduling on
Heterogeneous Core Systems”, to appear in Proceedings of the First Workshop on
Operating System Support for Heterogeneous Multicore Architectures, 2007.
[28] B. Saha et al.: “Enabling Scalability and Performance in a Large Scale CMP
Environment”, Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference
on Computer Systems, 2007.
[29] M. Rajagopalan, B. T. Lewis and T. A. Anderson: “Thread Scheduling for Multi-Core
Platforms”, in: Proceedings of the Eleventh Workshop on Hot Topics in Operating
Systems, 2007.