This document provides an overview of techniques for architecting and managing asymmetric multicore processors (AMPs). It begins with definitions and terminology, then motivates AMPs and surveys their classifications. It covers design challenges such as thread scheduling and mapping, and describes the benefits and examples of static and reconfigurable AMPs. Finally, it discusses management techniques: application/thread mapping strategies, use of DVFS, and reconfiguration approaches such as changing the core count, trading resources between cores, and morphing the core architecture.
This document summarizes a seminar on parallel computing. It defines parallel computing as performing multiple calculations simultaneously rather than consecutively. A parallel computer is described as a large collection of processing elements that can communicate and cooperate to solve problems fast. The document then discusses parallel architectures like shared memory, distributed memory, and shared distributed memory. It compares parallel computing to distributed computing and cluster computing. Finally, it discusses challenges in parallel computing like power constraints and programmability and provides examples of parallel applications like GPU processing and remote sensing.
Assisting User’s Transition to Titan’s Accelerated Architecture | inside-BigData.com
Oak Ridge National Lab is home to Titan, the largest GPU-accelerated supercomputer in the world. Its scale alone can be intimidating for users new to leadership computing facilities. Our facility has collected over four years of experience helping users port applications to Titan. This talk explains common paths and tools for porting applications successfully, and exposes common difficulties experienced by new users. Lastly, learn how our free and open training program can assist your organization in this transition.
Chip Multithreading Systems Need a New Operating System Scheduler | Sarwan ali
This document discusses the need for a new operating system scheduler for chip multithreading (CMT) systems. CMT combines chip multiprocessing and hardware multithreading to improve processor utilization. The current schedulers do not scale well to the large number of hardware threads in CMT systems. A new scheduler is proposed that would model resource contention and use this to minimize contention and maximize throughput when assigning threads to processors. Experiments show that resource contention, especially in the processor pipeline, has a significant impact on performance and a CMT-aware scheduler could improve performance by up to 2x.
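The contention-minimizing assignment argued for above can be sketched as a simple heuristic. Everything below is illustrative, not the paper's algorithm: the `pipeline_demand` scores stand in for measured per-thread pipeline pressure, and the greedy balancing rule is one plausible way to spread contention across cores.

```python
# Sketch of a contention-aware thread-to-core assignment, in the spirit of a
# CMT-aware scheduler. The demand scores and the greedy rule are assumptions.

def assign_threads(pipeline_demand, num_cores):
    """Greedily place each thread on the core with the least accumulated
    pipeline demand, approximating 'minimize contention' load balancing."""
    cores = [[] for _ in range(num_cores)]
    load = [0.0] * num_cores
    # Place the most demanding threads first (longest-processing-time rule).
    for tid, demand in sorted(pipeline_demand.items(), key=lambda kv: -kv[1]):
        target = min(range(num_cores), key=lambda c: load[c])
        cores[target].append(tid)
        load[target] += demand
    return cores, load

demand = {"t0": 0.9, "t1": 0.8, "t2": 0.3, "t3": 0.2}
cores, load = assign_threads(demand, 2)
```

With the sample scores, the heavy threads t0 and t1 land on different cores and the light threads fill in behind them, so both cores end up equally loaded.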
Multithreading allows exploiting thread-level parallelism (TLP) to improve processor utilization. There are several categories of multithreading:
- Simultaneous multithreading (SMT) issues instructions from multiple threads in the same cycle on a superscalar out-of-order core, filling issue slots that would otherwise go idle.
- Coarse-grained multithreading switches threads only on long-latency events such as cache misses, hiding that latency behind another thread's work.
- Fine-grained multithreading interleaves threads at instruction (often per-cycle) granularity, typically on in-order cores.
- Multiprocessing physically separates threads onto multiple processor cores.
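The common motivation behind all of these categories is that independent threads can overlap each other's stalls. A minimal sketch of that effect, with `time.sleep` standing in for a long-latency event such as a cache miss or I/O wait:

```python
# Four blocking "loads" run on separate threads, so their waits overlap
# instead of serializing (roughly 0.1 s total rather than 0.4 s).
import threading
import time

def slow_load(results, i):
    time.sleep(0.1)          # stand-in for a long-latency event
    results[i] = i * i       # each thread writes its own slot

results = [None] * 4
threads = [threading.Thread(target=slow_load, args=(results, i))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

This is user-level threading rather than hardware multithreading, but the latency-hiding principle is the same.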
This document discusses CPU scheduling and multithreaded programming. It covers key concepts in CPU scheduling like multiprogramming, CPU-I/O burst cycles, and scheduling criteria. It also discusses dispatcher role, multilevel queue scheduling, and multiple processor scheduling challenges. For multithreaded programming, it defines threads and their benefits. It compares concurrency and parallelism and discusses multithreading models, thread libraries, and threading issues.
Parallel computing involves using multiple processing units simultaneously to solve computational problems. It can save time by solving large problems or providing concurrency. The basic design involves memory storing program instructions and data, and a CPU fetching instructions from memory and sequentially performing them. Flynn's taxonomy classifies computer systems based on their instruction and data streams as SISD, SIMD, MISD, or MIMD. Parallel architectures can also be classified based on their memory arrangement as shared memory or distributed memory systems.
Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013 | Amazon Web Services
Your AMI is one of the core foundations for running applications and services effectively on Amazon EC2. In this session, you'll learn how to optimize your AMI, including how you can measure and diagnose system performance and tune parameters for improved CPU and network performance. We'll cover application-specific examples from Netflix on how optimized AMIs can lead to improved performance.
The document discusses parallel computing platforms and techniques for hiding memory latency. It covers the following key points:
1) Implicit parallelism in microprocessors has increased through pipelining and superscalar execution, but memory latency remains a bottleneck. Caches help reduce effective latency by exploiting data locality.
2) Multithreading and prefetching are approaches to hide memory latency by keeping the processor occupied while waiting for data, but they increase bandwidth demands and hardware costs.
3) Different applications utilize different types of parallelism, like data-level parallelism for throughput or task-level parallelism for aggregate performance. Understanding performance bottlenecks is important for parallelization.
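Point 2's prefetching idea can be sketched in software: while chunk k is being processed, chunk k+1 is already being fetched in the background, hiding the fetch latency. The `fetch`/`process` functions below are invented stand-ins for a slow memory access and a compute step.

```python
# Software-prefetching sketch: overlap the next fetch with current processing.
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(k):
    time.sleep(0.05)          # simulated memory latency
    return list(range(k * 4, k * 4 + 4))

def process(chunk):
    return sum(chunk)

def run(num_chunks):
    totals = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch, 0)               # prefetch first chunk
        for k in range(num_chunks):
            chunk = pending.result()
            if k + 1 < num_chunks:
                pending = pool.submit(fetch, k + 1)   # prefetch next chunk
            totals.append(process(chunk))             # overlaps the fetch
    return totals

totals = run(3)
```

As the summary notes, this trades extra bandwidth (fetches issued earlier and concurrently) for lower effective latency.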
The document discusses strategies for improving application performance on POWER9 processors using IBM XL and open source compilers. It reviews key POWER9 features and outlines common bottlenecks like branches, register spills, and memory issues. It provides guidelines on using compiler options and coding practices to address these bottlenecks, such as unrolling loops, inlining functions, and prefetching data. Tools like perf are also described for analyzing performance bottlenecks.
This document discusses task scheduling on adaptive multi-core architectures. It begins by introducing instruction-level parallelism (ILP) and thread-level parallelism (TLP). It then discusses different multi-core architectures like symmetric, asymmetric, and adaptive multi-cores which can seamlessly exploit both ILP and TLP. It presents the Bahurupi adaptive multi-core architecture and describes an optimal task scheduling algorithm to minimize makespan. Experimental results show that adaptive architectures outperform static symmetric and asymmetric architectures for mixed workloads.
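The makespan-minimization goal above can be illustrated with a classic list-scheduling heuristic on an asymmetric core mix. This is a generic earliest-finish-time sketch, not Bahurupi's optimal algorithm; the task cycle counts and per-core speed factors are invented.

```python
# List scheduling on an asymmetric multi-core: each task goes to the core
# that would finish it earliest, given per-core speed factors.

def schedule(task_cycles, core_speeds):
    """Return per-core task lists and the resulting makespan."""
    finish = [0.0] * len(core_speeds)
    plan = [[] for _ in core_speeds]
    # Consider the largest tasks first (longest-processing-time order).
    for task, cycles in sorted(task_cycles.items(), key=lambda kv: -kv[1]):
        best = min(range(len(core_speeds)),
                   key=lambda c: finish[c] + cycles / core_speeds[c])
        finish[best] += cycles / core_speeds[best]
        plan[best].append(task)
    return plan, max(finish)

# One "big" core (2x speed) and one "little" core.
plan, makespan = schedule({"a": 8, "b": 4, "c": 4, "d": 2}, [2.0, 1.0])
```

Here the big core absorbs the two largest tasks and both cores finish at time 6, which is why heterogeneous-aware placement beats speed-oblivious balancing on mixed workloads.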
This document provides an introduction to high performance computer architecture and multiprocessors. It discusses how initial improvements in computer performance came from innovative manufacturing techniques and exploitation of instruction level parallelism (ILP). More recently, exploiting thread and process level parallelism across multiple processors has become a focus. The key types of multiprocessor architectures discussed are symmetric multiprocessors (SMPs) and distributed memory computers which use message passing. SMPs connect multiple processors to a shared memory using a bus, while distributed memory computers require explicit message passing between separate processor memories.
This document provides an overview of system architecture and processor architectures. It discusses different types of system architecture like system-level building blocks, components of a system, hardware and software implementation, and instruction-level parallelism. It also describes various processor architectures like sequential, pipelined, superscalar, VLIW, SIMD, array, and vector processors. Additionally, it covers memory and addressing in systems-on-chip including memory considerations, virtual memory, and the process of determining physical memory addresses.
A multi-core processor contains two or more independent processing units called cores that can execute program instructions simultaneously. This allows multi-core processors to better perform multiple tasks at once, improve performance, reduce power consumption, and increase reliability compared to single-core processors. Each core on a multi-core processor can perform separate tasks, such as one core handling a movie while another handles a messaging app. The cores communicate through a shared pathway and the operating system distributes processes across the cores.
fundamentals of digital communication Unit 5_microprocessor.pdf | shubhangisonawane6
The document discusses the evolution of microprocessors from single-core to multi-core architectures. It describes how multi-core processors have multiple processing cores on a single chip to improve performance and efficiency. Each core can independently execute threads simultaneously for parallel processing. The document outlines the key components involved in the instruction cycle of a microprocessor, including registers like the program counter and memory address registers. It also discusses how multicore processors benefit applications that can distribute processing across multiple threads.
This document describes SmartBalance, a sensing-driven Linux load balancer for heterogeneous multi-processor systems-on-chips (MPSoCs) that aims to improve energy efficiency. It does this through a predictive approach that balances tasks among cores based on ongoing performance and power measurements, with the goal of jointly addressing workload variability and hardware heterogeneity. The key aspects are on-chip sensing to monitor performance and power, online prediction of these metrics, and a simulated annealing-based allocation algorithm to optimize task distribution across cores in each scheduling epoch.
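The simulated-annealing allocation step can be sketched as below. The cost model (per-core power weights plus an imbalance penalty) is invented for illustration; the real system scores candidate mappings with on-chip sensing and online performance/power predictions.

```python
# Toy simulated annealing over task-to-core mappings, SmartBalance-style.
import math
import random

def anneal_assignment(task_load, core_power, steps=2000, seed=1):
    """Search task->core mappings, scoring each with a toy energy cost."""
    rng = random.Random(seed)
    n, m = len(task_load), len(core_power)

    def cost(a):
        load = [0.0] * m
        for t, c in enumerate(a):
            load[c] += task_load[t]
        # energy-like term plus a penalty on the busiest core
        return sum(core_power[c] * load[c] for c in range(m)) + max(load)

    cur = [rng.randrange(m) for _ in range(n)]
    cur_cost = cost(cur)
    best, best_cost = cur[:], cur_cost
    for step in range(steps):
        temp = max(1.0 - step / steps, 1e-3)        # linear cooling
        cand = cur[:]
        cand[rng.randrange(n)] = rng.randrange(m)   # move one task
        c = cost(cand)
        if c < cur_cost or rng.random() < math.exp((cur_cost - c) / temp):
            cur, cur_cost = cand, c
            if c < best_cost:
                best, best_cost = cand[:], c
    return best, best_cost

mapping, energy = anneal_assignment([4.0, 3.0, 2.0, 1.0], [1.0, 1.5])
```

Annealing suits this problem because the mapping space is discrete and the cost landscape changes every epoch as measurements update, so a cheap stochastic search rerun per scheduling epoch is a reasonable fit.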
The document discusses various topics related to parallel and distributed computing including parallel computing resources and concepts, Flynn's taxonomy of parallel systems, parallel computer memory architectures like shared memory and distributed memory, parallel programming models such as shared memory, message passing and data parallel models, designing parallel programs including partitioning and load balancing, and different parallel computer architectures like vector processors, very long instruction word architecture, and superpipelined architecture.
This presentation discusses array processors, which are parallel computers composed of multiple identical processing elements that can operate simultaneously. The presentation covers the history of array processors, how they work, classifications, architectures, performance and scalability. It explains that array processors are well-suited for tasks involving repetitive arithmetic operations on large datasets, as they can improve performance for such workloads, but may not provide benefits for operations with data dependencies or decisions based on computations.
This document discusses different types of parallel processor architectures:
- SISD, SIMD, MISD, and MIMD refer to single instruction single data, single instruction multiple data, multiple instruction single data, and multiple instruction multiple data respectively.
- Symmetric multiprocessors (SMPs) have multiple similar processors that share memory and I/O. Clusters have groups of interconnected whole computers working together. NUMA systems have processors that access different regions of shared memory at different speeds.
Empirically Derived Abstractions in Uncore Power Modeling for a Server-Class... | Arun Joseph
1) The document describes empirically derived power models for uncore elements like the Power Bus and memory controllers of IBM's POWER8 server processor.
2) Using a small set of activity markers like read, write, retry and snoop events along with microbenchmarks, the models can predict uncore power with up to 6% error.
3) These abstract power models allow more accurate dynamic power management by the chip compared to using a constant worst-case uncore power, potentially enabling a 5% CPU frequency boost.
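An activity-marker power model of this shape is essentially a linear fit: a base power term plus a weight per event counter, calibrated from microbenchmark samples. The sketch below uses two hypothetical markers (reads, writes) and synthetic data; the real models are fit against measured POWER8 uncore power.

```python
# Fit power = base + w_read*reads + w_write*writes by least squares,
# solving the normal equations with pure-Python Gaussian elimination.

def fit_power_model(samples):
    """samples: list of (markers, measured_power); returns weight vector."""
    X = [[1.0] + list(m) for m, _ in samples]      # leading 1 = base power
    y = [p for _, p in samples]
    k = len(X[0])
    # Normal equations A w = b, with A = X^T X and b = X^T y.
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    b = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    for i in range(k):                             # forward elimination
        for j in range(i + 1, k):
            f = A[j][i] / A[i][i]
            A[j] = [a - f * ai for a, ai in zip(A[j], A[i])]
            b[j] -= f * b[i]
    w = [0.0] * k
    for i in reversed(range(k)):                   # back substitution
        w[i] = (b[i] - sum(A[i][j] * w[j] for j in range(i + 1, k))) / A[i][i]
    return w

def predict(w, markers):
    return w[0] + sum(wi * mi for wi, mi in zip(w[1:], markers))

# Synthetic samples generated from power = 5 + 0.02*reads + 0.03*writes.
data = [((r, wr), 5 + 0.02 * r + 0.03 * wr)
        for r, wr in [(100, 10), (50, 80), (200, 40), (10, 10), (150, 5)]]
w = fit_power_model(data)
```

Once fitted, such a model lets the power manager estimate uncore draw from live counter values instead of assuming a constant worst case, which is what enables the frequency headroom mentioned above.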
This document discusses optimizing Linux AMIs for performance at Netflix. It begins by providing background on Netflix and explaining why tuning the AMI is important given Netflix runs tens of thousands of instances globally with varying workloads. It then outlines some of the key tools and techniques used to bake performance optimizations into the base AMI, including kernel tuning to improve efficiency and identify ideal instance types. Specific examples of CFS scheduler, page cache, block layer, memory allocation, and network stack tuning are also covered. The document concludes by discussing future tuning plans and an appendix on profiling tools like perf and SystemTap.
This document discusses advance computer architectures including multi-core computers, multithreading, and GPUs. It provides information on multi-core systems and how they integrate multiple processor cores on a single chip to provide cheap parallel computing. It also discusses limitations of single core architectures and how multithreading enables parallelism through dividing instruction streams into threads. Finally, it covers GPUs and how they are optimized for parallel processing of graphics applications using thousands of simpler cores compared to CPUs.
This document discusses advance computer architectures including multi-core computers, multithreading, and GPUs. It provides information on multi-core systems having multiple processor cores on a single chip that share memory. It discusses how multi-core processors address limitations of single core designs by providing cheaper parallelism while increasing computation power. The document also covers multithreading, different approaches, and how programming must support multi-core through multiple threads or processes. Finally, it introduces GPUs, how they are optimized for graphics applications through parallelism and throughput, and how CUDA enables general purpose programming on GPUs.
Fast switching of threads between cores - Advanced Operating Systems | Ruhaim Izmeth
"Fast switching of threads between cores" is a published research paper on operating systems; this is our attempt to decode the research and present it to the class.
1. The document discusses research activities related to reducing energy consumption by at least 30% through the development of core source technologies for universal operating systems.
2. It describes four papers being presented, including ones on system and device latency modeling, power management frameworks for embedded systems, and automatic selection of power policies for operating systems.
3. It also summarizes four research topics from the National University, including performance evaluation of parallel applications using a power-aware paging method on next-generation memory architectures.
This document provides an introduction to multi-core processors. It discusses that a multi-core processor contains two or more processors on a single integrated circuit. This leads to enhanced performance, reduced power consumption, and more efficient simultaneous processing of multiple tasks. However, developing multithreaded applications for multi-core processors can be difficult, time-consuming, and error-prone. Adding more cores also introduces additional overheads and latencies between communicating and non-communicating cores. There are different types of multi-core architectures including symmetric multiprocessing (SMP) and asymmetric multiprocessing (AMP). Effective use of multi-core processors requires considerations around cache coherency, load balancing, interrupt handling, and concurrency management.
The document provides an overview of microprocessors and microcontrollers. It discusses the basic architecture of microprocessors, including the Von Neumann and Harvard architectures. It compares RISC and CISC instruction sets. Microcontrollers are defined as single-chip computers containing a CPU, memory, and I/O ports. Common PIC microcontrollers are described along with their characteristics such as speed, memory types, and analog/digital capabilities. The document also outlines best practices for selecting a suitable microcontroller for a project, including identifying hardware interfaces, memory needs, programming tools, and cost/power constraints.
The document discusses advancements in computer architecture, including multi-core computers, multithreading, and GPUs. It describes how multi-core processors integrate multiple processor cores on a single chip to provide cheap parallel processing and increase computation power. It also discusses how GPUs are optimized for graphics applications through massively parallel and highly multithreaded designs. Programming models like CUDA allow GPUs to be used for general purpose computing by addressing thread, data, and task parallelism. Overall, the document outlines how multi-core and GPU technologies enable computers to better utilize parallelism for improved performance.
The document discusses advancements in computer architecture, including multi-core computers, multithreading, and GPUs. It describes how multi-core processors integrate multiple processor cores on a single chip to provide cheap parallel processing and increase computation power. It also discusses how multithreading exploits thread-level parallelism and how GPUs are optimized for parallel graphics applications through thousands of simple processor cores focused on throughput over latency. The document provides examples of Intel's multi-core chips and the Polaris chip with 80 cores, and explains how applications can benefit from multi-core and multi-threaded programming.
This document provides an introduction to multi-core processors. It discusses that a multi-core processor contains two or more processors on a single integrated circuit. This leads to enhanced performance, reduced power consumption, and more efficient simultaneous processing of multiple tasks. However, developing multithreaded applications for multi-core processors can be difficult, time-consuming, and error-prone. Adding more cores also introduces additional overheads and latencies between communicating and non-communicating cores. There are different types of multi-core architectures including symmetric multiprocessing (SMP) and asymmetric multiprocessing (AMP). Effective use of multi-core processors requires considerations around cache coherency, load balancing, interrupt handling, and concurrency management.
The document provides an overview of microprocessors and microcontrollers. It discusses the basic architecture of microprocessors, including the Von Neumann and Harvard architectures. It compares RISC and CISC instruction sets. Microcontrollers are defined as single-chip computers containing a CPU, memory, and I/O ports. Common PIC microcontrollers are described along with their characteristics such as speed, memory types, and analog/digital capabilities. The document also outlines best practices for selecting a suitable microcontroller for a project, including identifying hardware interfaces, memory needs, programming tools, and cost/power constraints.
The document discusses advancements in computer architecture, including multi-core computers, multithreading, and GPUs. It describes how multi-core processors integrate multiple processor cores on a single chip to provide cheap parallel processing and increase computation power. It also discusses how GPUs are optimized for graphics applications through massively parallel and highly multithreaded designs. Programming models like CUDA allow GPUs to be used for general purpose computing by addressing thread, data, and task parallelism. Overall, the document outlines how multi-core and GPU technologies enable computers to better utilize parallelism for improved performance.
The document discusses advancements in computer architecture, including multi-core computers, multithreading, and GPUs. It describes how multi-core processors integrate multiple processor cores on a single chip to provide cheap parallel processing and increase computation power. It also discusses how multithreading exploits thread-level parallelism and how GPUs are optimized for parallel graphics applications through thousands of simple processor cores focused on throughput over latency. The document provides examples of Intel's multi-core chips and the Polaris chip with 80 cores, and explains how applications can benefit from multi-core and multi-threaded programming.
Similar to PPT_for_big_LITTLE_style_Asymmetric_Mult.pptx (20)
Rainfall intensity duration frequency curve statistical analysis and modeling...bijceesjournal
Using data from 41 years in Patna’ India’ the study’s goal is to analyze the trends of how often it rains on a weekly, seasonal, and annual basis (1981−2020). First, utilizing the intensity-duration-frequency (IDF) curve and the relationship by statistically analyzing rainfall’ the historical rainfall data set for Patna’ India’ during a 41 year period (1981−2020), was evaluated for its quality. Changes in the hydrologic cycle as a result of increased greenhouse gas emissions are expected to induce variations in the intensity, length, and frequency of precipitation events. One strategy to lessen vulnerability is to quantify probable changes and adapt to them. Techniques such as log-normal, normal, and Gumbel are used (EV-I). Distributions were created with durations of 1, 2, 3, 6, and 24 h and return times of 2, 5, 10, 25, and 100 years. There were also mathematical correlations discovered between rainfall and recurrence interval.
Findings: Based on findings, the Gumbel approach produced the highest intensity values, whereas the other approaches produced values that were close to each other. The data indicates that 461.9 mm of rain fell during the monsoon season’s 301st week. However, it was found that the 29th week had the greatest average rainfall, 92.6 mm. With 952.6 mm on average, the monsoon season saw the highest rainfall. Calculations revealed that the yearly rainfall averaged 1171.1 mm. Using Weibull’s method, the study was subsequently expanded to examine rainfall distribution at different recurrence intervals of 2, 5, 10, and 25 years. Rainfall and recurrence interval mathematical correlations were also developed. Further regression analysis revealed that short wave irrigation, wind direction, wind speed, pressure, relative humidity, and temperature all had a substantial influence on rainfall.
Originality and value: The results of the rainfall IDF curves can provide useful information to policymakers in making appropriate decisions in managing and minimizing floods in the study area.
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024Sinan KOZAK
Sinan from the Delivery Hero mobile infrastructure engineering team shares a deep dive into performance acceleration with Gradle build cache optimizations. Sinan shares their journey into solving complex build-cache problems that affect Gradle builds. By understanding the challenges and solutions found in our journey, we aim to demonstrate the possibilities for faster builds. The case study reveals how overlapping outputs and cache misconfigurations led to significant increases in build times, especially as the project scaled up with numerous modules using Paparazzi tests. The journey from diagnosing to defeating cache issues offers invaluable lessons on maintaining cache integrity without sacrificing functionality.
Use PyCharm for remote debugging of WSL on a Windo cf5c162d672e4e58b4dde5d797...shadow0702a
This document serves as a comprehensive step-by-step guide on how to effectively use PyCharm for remote debugging of the Windows Subsystem for Linux (WSL) on a local Windows machine. It meticulously outlines several critical steps in the process, starting with the crucial task of enabling permissions, followed by the installation and configuration of WSL.
The guide then proceeds to explain how to set up the SSH service within the WSL environment, an integral part of the process. Alongside this, it also provides detailed instructions on how to modify the inbound rules of the Windows firewall to facilitate the process, ensuring that there are no connectivity issues that could potentially hinder the debugging process.
The document further emphasizes on the importance of checking the connection between the Windows and WSL environments, providing instructions on how to ensure that the connection is optimal and ready for remote debugging.
It also offers an in-depth guide on how to configure the WSL interpreter and files within the PyCharm environment. This is essential for ensuring that the debugging process is set up correctly and that the program can be run effectively within the WSL terminal.
Additionally, the document provides guidance on how to set up breakpoints for debugging, a fundamental aspect of the debugging process which allows the developer to stop the execution of their code at certain points and inspect their program at those stages.
Finally, the document concludes by providing a link to a reference blog. This blog offers additional information and guidance on configuring the remote Python interpreter in PyCharm, providing the reader with a well-rounded understanding of the process.
artificial intelligence and data science contents.pptxGauravCar
What is artificial intelligence? Artificial intelligence is the ability of a computer or computer-controlled robot to perform tasks that are commonly associated with the intellectual processes characteristic of humans, such as the ability to reason.
› ...
Artificial intelligence (AI) | Definitio
Applications of artificial Intelligence in Mechanical Engineering.pdfAtif Razi
Historically, mechanical engineering has relied heavily on human expertise and empirical methods to solve complex problems. With the introduction of computer-aided design (CAD) and finite element analysis (FEA), the field took its first steps towards digitization. These tools allowed engineers to simulate and analyze mechanical systems with greater accuracy and efficiency. However, the sheer volume of data generated by modern engineering systems and the increasing complexity of these systems have necessitated more advanced analytical tools, paving the way for AI.
AI offers the capability to process vast amounts of data, identify patterns, and make predictions with a level of speed and accuracy unattainable by traditional methods. This has profound implications for mechanical engineering, enabling more efficient design processes, predictive maintenance strategies, and optimized manufacturing operations. AI-driven tools can learn from historical data, adapt to new information, and continuously improve their performance, making them invaluable in tackling the multifaceted challenges of modern mechanical engineering.
Batteries -Introduction – Types of Batteries – discharging and charging of battery - characteristics of battery –battery rating- various tests on battery- – Primary battery: silver button cell- Secondary battery :Ni-Cd battery-modern battery: lithium ion battery-maintenance of batteries-choices of batteries for electric vehicle applications.
Fuel Cells: Introduction- importance and classification of fuel cells - description, principle, components, applications of fuel cells: H2-O2 fuel cell, alkaline fuel cell, molten carbonate fuel cell and direct methanol fuel cells.
An improved modulation technique suitable for a three level flying capacitor ...IJECEIAES
This research paper introduces an innovative modulation technique for controlling a 3-level flying capacitor multilevel inverter (FCMLI), aiming to streamline the modulation process in contrast to conventional methods. The proposed
simplified modulation technique paves the way for more straightforward and
efficient control of multilevel inverters, enabling their widespread adoption and
integration into modern power electronic systems. Through the amalgamation of
sinusoidal pulse width modulation (SPWM) with a high-frequency square wave
pulse, this controlling technique attains energy equilibrium across the coupling
capacitor. The modulation scheme incorporates a simplified switching pattern
and a decreased count of voltage references, thereby simplifying the control
algorithm.
Advanced control scheme of doubly fed induction generator for wind turbine us...IJECEIAES
This paper describes a speed control device for generating electrical energy on an electricity network based on the doubly fed induction generator (DFIG) used for wind power conversion systems. At first, a double-fed induction generator model was constructed. A control law is formulated to govern the flow of energy between the stator of a DFIG and the energy network using three types of controllers: proportional integral (PI), sliding mode controller (SMC) and second order sliding mode controller (SOSMC). Their different results in terms of power reference tracking, reaction to unexpected speed fluctuations, sensitivity to perturbations, and resilience against machine parameter alterations are compared. MATLAB/Simulink was used to conduct the simulations for the preceding study. Multiple simulations have shown very satisfying results, and the investigations demonstrate the efficacy and power-enhancing capabilities of the suggested control system.
3. Motivation
• Modern processors are diverse in:
– Optimization objectives: perf, energy
– Workloads: multimedia, encryption, network …
– Scale: embedded system to data center
• A single monolithic core cannot fulfill all requirements
• This has led to two broad classes of cores:
– Narrow in-order (InO) cores, e.g., Intel Xeon Phi
– Wide out-of-order (OoO) cores, e.g., Intel Sandy Bridge and IBM POWER7 (8 cores)
4. Motivation
• Next step: use different types of core in same
processor => AMP
• AMPs can:
– Provide better energy efficiency than SMPs and per-core DVFS
– Optimize for thread-level or instruction-level parallelism
– Turn off unused cores to save energy
5. Classification of AMPs
• Static AMP: the configuration of cores is fixed at design time
• Reconfigurable AMP: the microarchitecture can be reconfigured dynamically to provide cores with different resources
6. Examples of Static AMPs
[Figure: an asymmetric multicore built from cores of several sizes (C1–C5) alongside a symmetric multicore built from identical cores]
7. Examples of Static AMPs
[Figure: 9 power-equivalent multi-core configurations (B = big core, m = medium core, s = small core)]
Generally, two core types are sufficient for providing most benefits of heterogeneity (Eyerman et al. ASPLOS'14)
9. Terminology
Different terms for an AMP: asymmetric multicore (AMC), asymmetric multicore systems (ASYMS), asymmetric multiprocessor systems (ASMP), asymmetric chip multiprocessors (ACMP), heterogeneous microarchitectures (HM), heterogeneous multicore processor (HMP), heterogeneous CMP (HCMP), asymmetric cluster CMP (ACCMP), big.LITTLE system
Different terms for the cores of an AMP: big/little (or big/small), fast/slow, complex/simple, aggressive/lightweight, strong/weak cores, application/low-power processor (AP/LP), central/peripheral processor
Different terms for reconfigurable AMPs and/or techniques for architecting them: reconfigurable, configurable, adaptive, scalable, composable, composite, coalition, conjoined, federated, polymorphous, morphable, core morphing, core fusion, flexible, dynamic and united processors
10. Types of Heterogeneity in AMPs
• Koufaty et al. [2010] (basis: nature of asymmetry):
– Performance asymmetry (same ISA; different microarchitecture, cache size, or frequency)
– Functional asymmetry (different ISA and microarchitecture)
• Srinivasan et al. [2011] (basis: nature of asymmetry):
– Virtual asymmetry (same microarchitecture and ISA; different frequency or cache size)
– Physical asymmetry (same ISA; different microarchitecture, e.g., InO vs. OoO, and frequency)
– Hybrid cores (different ISA and microarchitecture)
• Khan and Kundu [2010] (basis: how asymmetry is introduced):
– Extemporaneous heterogeneity (performance of a core altered by DVFS or hardware reconfiguration)
– Deliberate heterogeneity (different microarchitecture, ISA and specialization, e.g., CPU and GPU)
11. Classification based on performance ordering
• Monotonic cores: the Alpha EV6 outperforms the EV5 for all apps => an AMP combining EV6 and EV5 cores is monotonic
• Non-monotonic cores: neither the Alpha nor the x86 core is optimal for all apps => an AMP combining them is non-monotonic
[Figure: configuration of Alpha (EV6, EV5) and x86 cores]
12. Architectural configuration of four ARM processors and their performance on an XML parsing benchmark
• Cortex A15 and A7: same ISA but different microarchitecture
• Cortex A57 and A53: same ISA but different microarchitecture
• All four processors can have 1 to 4 cores per cluster
14. Benefits of AMPs
• AMPs are natural choice for systems with diverse
applications and usage scenarios
• Big core => better performance
• Small core => better energy efficiency
• However, neither core type wins outright on the energy-delay product (EDP) metric:
– Big core => better EDP for compute-intensive apps with high data reuse
– Small core => better EDP for memory-intensive apps with little data reuse and many atomic operations
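The EDP tradeoff above can be made concrete with a toy calculation. EDP is energy times delay, i.e., power × time²; all power and runtime figures below are invented for illustration, not measurements from any real core.

```python
# Hypothetical illustration of the energy-delay product (EDP) tradeoff.
# Power and runtime numbers are made-up examples.

def edp(power_watts, runtime_s):
    """EDP = energy * delay = (power * time) * time."""
    return (power_watts * runtime_s) * runtime_s

# Compute-intensive app with high data reuse: the big core's speedup
# outweighs its extra power.
big_compute = edp(power_watts=4.0, runtime_s=1.0)    # big core, fast
small_compute = edp(power_watts=1.0, runtime_s=2.5)  # small core, 2.5x slower
assert big_compute < small_compute   # big core wins on EDP

# Memory-intensive app: the big core barely speeds it up but still burns
# more power, so the small core wins.
big_mem = edp(power_watts=4.0, runtime_s=1.0)
small_mem = edp(power_watts=1.0, runtime_s=1.2)      # nearly as fast
assert small_mem < big_mem           # small core wins on EDP
```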
15. Challenges of AMPs
• Conventional software is designed for SMPs; many changes are required to support AMPs
• AMP cores should cover a wide and evenly spread range of the performance/complexity design space
• Scheduling complexity in an AMP increases exponentially with the number of core types and applications
16. Challenges of AMPs
• In some AMPs, the ISA, OS and programming model of the different cores are also different => they present even more challenges
• AMPs are not widely available
• Some works use DVFS (or clock throttling) to emulate asymmetric cores; however,
– it over-simplifies the challenges of a real AMP => inaccurate conclusions
– it cannot model non-monotonic cores
17. Thread migration overheads
• In static AMPs, thread migration may take millions of cycles, e.g., in an AMP with Cortex A15 and A7 cores:
– migration latency from A15 to A7: 3.75 ms
– vice versa: 2.10 ms
• Flushing and warming of caches etc. => additional overheads
• Hence, migration can be performed only once every few million instructions
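A quick back-of-the-envelope calculation shows why migrations must be rare. The 3.75 ms latency is from the slide; the 1.5 GHz clock frequency and the 1% overhead budget are assumed values for illustration.

```python
# Amortization of thread migration cost. 3.75 ms A15 -> A7 latency is from
# the slide; clock frequency and overhead budget are assumptions.

clock_hz = 1.5e9               # assumed core frequency
migration_s = 3.75e-3          # A15 -> A7 migration latency (from slide)
migration_cycles = migration_s * clock_hz        # ~5.6 million cycles

overhead_budget = 0.01         # tolerate at most 1% of time in migration
min_interval_cycles = migration_cycles / overhead_budget

print(f"migration cost: {migration_cycles:.3g} cycles")
print(f"run at least {min_interval_cycles:.3g} cycles between migrations")
```

With these assumptions a thread must run for hundreds of millions of cycles between migrations, matching the slide's "once every few million instructions" at minimum.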
18. Challenge of maintaining fairness
• Fairness: important for meeting QoS guarantees
• In an AMP, some threads may be unfairly slowed down => starvation and unpredictable per-task performance
• In a multithreaded app, the performance advantage of the big core may be completely negated if the thread running on it stalls waiting for other threads
[Figure: thread 0 on the big core (C0) stalls at a synchronization barrier waiting for threads on the small cores (C1–C3)]
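The barrier effect can be shown with a toy model: a barrier-synchronized phase finishes only when its slowest thread does, so the big core helps only if it is given to the straggler. All thread times below are arbitrary units.

```python
# Toy model of a barrier-synchronized parallel phase: phase time equals the
# slowest thread's time, so accelerating a non-critical thread buys nothing.

def phase_time(thread_times):
    return max(thread_times)

baseline = phase_time([12, 10, 10, 10])          # thread 0 is the straggler
big_on_straggler = phase_time([6, 10, 10, 10])   # 2x speedup on thread 0
big_on_other = phase_time([12, 5, 10, 10])       # 2x speedup on thread 1

assert baseline == 12
assert big_on_straggler == 10    # the phase actually gets faster
assert big_on_other == 12        # big core's advantage is completely negated
```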
19. Challenges of AMPs
• Some AMP designs use non-standard ISAs or compiler
support => may not find wide adoption
• Unpredictability: an asymmetry-unaware scheduler may map threads to fast or slow cores differently across runs => variable performance
21. App/thread mapping strategies
• The most important challenge in AMPs: finding the
right core for running a thread
• The right choice depends on:
– Optimization target
– Application property
– Core property
• We will discuss some mapping (scheduling)
strategies
22. Estimating performance for scheduling
To make scheduling decisions, a thread's performance on different core types must be known.
• Option 1: estimate the performance of a thread on a core type without actually running the thread on that core type, e.g., using mathematical models
– Hardware-specific and error-prone
• Option 2: actually run threads on each core type to sample performance
– High profiling overhead
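Option 1 can be sketched as a simple analytical model. Everything in the model below, including the 200-cycle miss penalty and the halving of execution CPI on the big core, is invented to show the shape of such estimators; real models are fitted to specific hardware, which is exactly why they are hardware-specific and error-prone.

```python
# Hypothetical Option-1 estimator: predict a thread's CPI on the big core
# from counters sampled while it runs on the small core.

MISS_PENALTY = 200.0   # assumed cycles per last-level-cache miss

def predict_big_core_cpi(cpi_small, llc_misses_per_inst):
    # Split the small-core CPI into memory-stall and execution components.
    mem_cpi = llc_misses_per_inst * MISS_PENALTY
    exec_cpi = max(cpi_small - mem_cpi, 0.0)
    # Assume the wide OoO core halves execution cycles but cannot hide misses.
    return 0.5 * exec_cpi + mem_cpi

compute_bound = predict_big_core_cpi(cpi_small=1.0, llc_misses_per_inst=0.0)
memory_bound = predict_big_core_cpi(cpi_small=5.0, llc_misses_per_inst=0.02)

assert compute_bound == 0.5   # big core helps a lot
assert memory_bound == 4.5    # big core helps little: 4.0 of 5.0 CPI is stalls
```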
23. App/thread mapping strategies
CPI breakdown for representative cases:
(a) CPI dominated by external stalls => suitable for small core
(b) CPI dominated by internal stalls
(c) CPI dominated by execution cycles => suitable for big core
Koufaty et al. EuroSys'10
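A scheduler in the spirit of this CPI-breakdown heuristic might classify threads as below. The 50% thresholds are invented; only the direction of the rules (execution-dominated to big core, external-stall-dominated to small core) comes from the slide.

```python
# Toy core-assignment rule based on a thread's CPI breakdown.
# The 0.5 thresholds are arbitrary illustration values.

def suggest_core(exec_cycles, internal_stalls, external_stalls):
    total = exec_cycles + internal_stalls + external_stalls
    if external_stalls / total > 0.5:
        return "small"   # dominated by off-core (e.g., memory) stalls
    if exec_cycles / total > 0.5:
        return "big"     # dominated by useful execution: wide OoO pays off
    return "either"

assert suggest_core(80, 10, 10) == "big"
assert suggest_core(10, 10, 80) == "small"
assert suggest_core(30, 40, 30) == "either"
```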
24. App/thread mapping strategies
• Load on different threads is imbalanced
– Map the slowest thread to the big core
• Different VMs running on a host have different resource requirements
– The VM with the higher number of 'virtual CPUs' gets the big core
• App with high ILP => map to a wide-issue superscalar core which can issue several instructions every cycle
25. App/thread mapping strategies
Big core:
• Sequential phases
• Compute-intensive apps
• Apps with low miss-rate
• Threads whose benefit from running on the big core is large
• Thread with the largest remaining execution time
• Application code
Small core:
• Highly-parallel phases
• I/O-intensive apps
• Apps with high miss-rate
• Threads whose benefit from running on the big core is small
• Thread with the smallest remaining execution time
• OS kernel code, virtualization helper code and device interrupts
26. App/thread mapping strategies
Big core:
• High-priority apps
• Multimedia-intensive apps
Small core:
• Low-priority apps
• Service daemons and background processes, sensor sampling and buffering tasks
27. Example of fairness-oriented scheduling schemes
• 'Equal-time': run each thread on each core type for an equal amount of time
• 'Equal-progress': aims to get equal work done in all threads
– Idea 1: schedule the thread with the currently largest slowdown on the big core
– Idea 2: whenever the difference in progress of different threads becomes too high, swap them
Van Craeynest et al. PACT'13
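A minimal sketch of the 'equal-progress' idea, assuming a single big core with a fixed 2x speedup and one unit of work per quantum elsewhere (both assumptions are mine, not from the paper): each quantum, the most lagging thread gets the big core.

```python
# Equal-progress scheduling sketch: per quantum, place the thread with the
# least progress on the (single) big core. Speedup and quantum are invented.

def step(progress, big_speedup=2.0, quantum=1.0):
    laggard = min(range(len(progress)), key=lambda i: progress[i])
    return [p + (big_speedup if i == laggard else 1.0) * quantum
            for i, p in enumerate(progress)]

progress = [0.0, 3.0]        # thread 0 starts far behind
for _ in range(3):
    progress = step(progress)

assert progress == [6.0, 6.0]  # progress equalizes over time
```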
28. Use of DVFS along with thread scheduling
• Provides further opportunities to exercise the performance/energy tradeoff
• Estimate throughput/watt of each program phase at different voltage/frequency (V/F) levels on all core types
• Based on this, the best thread-to-core mapping and V/F values are selected
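The selection step can be sketched as a search over operating points. The candidate list and all throughput and power numbers below are invented for illustration.

```python
# Pick the (core type, V/F) operating point that maximizes estimated
# throughput per watt for the current program phase. Numbers are invented.

def best_operating_point(points):
    """points: iterable of (core, freq_ghz, est_instrs_per_s, est_watts)."""
    return max(points, key=lambda p: p[2] / p[3])

candidates = [
    ("big",   2.0, 3.0e9, 4.0),   # fastest, but power-hungry
    ("big",   1.0, 1.8e9, 1.5),
    ("small", 1.5, 1.2e9, 0.6),   # best instructions per joule here
]

core, freq, ips, watts = best_operating_point(candidates)
assert core == "small"
```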
29. Challenges of different thread scheduling policies
Static scheduling:
• Works by collecting data through offline analysis
• Cannot account for different input sets and application phases
• Becomes infeasible with an increasing number of co-running applications
Dynamic scheduling:
• Works by collecting data at runtime
• Incurs thread migration overhead
• Ineffective for short-lived threads, since the profiling phase itself may form a large majority of their lifetime
31. Motivation: Need of fine-grained switching
[Figure: variance of IPC in gcc over 300K instructions]
32. Need of fine-grained switching
Coarse-grained vs. fine-grained heterogeneity
Fallin et al. ICCD’14
33. Reconfigurable AMPs
• Benefits: no thread migration overheads
• Challenges: reconfiguration incurs latency and energy overheads, e.g., I/D-cache flushes and data migration
• Avoiding these overheads may require a complex compiler, a custom ISA, 3D stacking, or changes to the OS and application binary
• Tradeoffs:
– Centralized resources: save area, but present a scalability bottleneck
– High adaptation granularity: allows exploiting different levels of ILP and TLP but precludes specialization for accelerating specific applications
34. Benefits of reconfigurable AMPs
• Allow flexibly scaling up to exploit MLP and ILP in single-threaded apps
• Allow scaling down to exploit TLP in multithreaded apps
• Provide better hardware utilization and resilience to errors, since one hard error may not disable the entire processor
• May achieve better performance and energy proportionality than static AMPs
35. Types of reconfigurable AMPs
1. Those that dynamically fuse or partition the cores and thus change the core count
2. Those which share/trade resources between cores
3. Those which transform the core architecture
In the following slides, we show examples of each of these through figures. See the survey for more details.
36. 1. Changing core-count
[Figure: an 8-core CMP with two independent cores, a 2-core fused group, and a 4-core fused group]
Ipek et al. ISCA'07
37. [Figure: a static AMP with big and little cores vs. a reconfigurable AMP with many little cores, of which a few can be fused into a wide-issue processor]
Salverda et al. HPCA'08
41. 1. Changing core-count
[Figure: different granularities of parallel processing elements, from PIM (processor in memory) arrays, which run more applications effectively and exploit fine-grain parallelism, to wide-issue processors with many ALUs each]
Sankaralingam et al. ISCA'03
42. 1. Changing core-count
A reconfigurable AMP where multiple scalar cores can be united to create a larger superscalar processor
Chiu et al. ICPP'10
43. 2. Trading resources between cores
[Figure: a reconfigurable AMP composed of asymmetric building blocks, some of which may be faulty]
Gupta et al. MICRO'10
44. 2. Trading resources between cores
A 3D reconfigurable AMP: poolable resources (registers, instruction queue, reorder buffer, cache space, load and store queues, etc.) reside in another layer
Homayoun et al. HPCA'11
45. 2. Trading resources between cores
Dynamic core morphing (1/2): baseline configuration for two heterogeneous cores
Rodrigues et al. PACT'11
46. 2. Trading resources between cores
Dynamic core morphing (2/2): morphed configuration for two heterogeneous cores
(RED: connectivity for the strong morphed core; BLACK: connectivity for the weak core)
47. 2. Trading resources between cores
Pipeline-level view of the resource sharing
Rodrigues et al. VLSID'14