The document discusses challenges related to parallel processing and massive parallel architectures. It covers topics like pipeline processors, multiprocessors, processing in memory architectures like Cyclops and picoChip, and cellular architectures. It also discusses code generation issues that arise from massive parallelism and possible solutions using compilers or libraries.
C++ Data-flow Parallelism sounds great! But how practical is it? Let’s see ho... (Jason Hearne-McGuiness)
The document summarizes a presentation on using data-parallelism in C++ to parallelize the SPEC2006 benchmark suite. It discusses the design of a data-flow library called Parallel Pixie Dust (PPD) and how it was used to parallelize some STL algorithms. Analysis of two SPEC2006 benchmarks found that only a small number of STL algorithm usages could be easily parallelized due to the functional decomposition of the code into small blocks and potential side effects. Larger functions and avoidance of side effects may enable more opportunities for parallelization.
A Domain-Specific Embedded Language for Programming Parallel Architectures (Jason Hearne-McGuiness)
This document proposes a domain-specific embedded language (DSEL) for programming parallel architectures. The DSEL aims to enable parallelism while avoiding issues like deadlocks, race conditions, and complex APIs. It presents the grammar and properties of the proposed DSEL, including that it generates schedules that are deadlock-free and race-condition free. Examples demonstrating data flow and data parallelism using the DSEL are also provided.
Affect of parallel computing on multicore processors (csandit)
Our main research aim is to find the limit of Amdahl's Law for multicore processors: the number of cores beyond which adding more no longer improves the efficiency of the overall CMP (Chip Multi-Processor, a.k.a. multicore processor) architecture. This limit is expected to lie either in the architecture of the multicore processor or in the programming. We surveyed the multicore architectures of various chip manufacturers, namely INTEL™, AMD™, IBM™, etc., and the techniques they follow to improve the performance of multicore processors. We conducted cluster experiments to find this limit, and in this paper we propose an alternate multicore processor design based on the results of our cluster experiments.
This document discusses energy-efficient hardware data prefetching. It begins with an introduction to data prefetching and why it is needed due to the growing gap between processor and memory speeds. It then covers different types of prefetching techniques including software-based, hardware-based, sequential, stride, and pointer prefetching. It also discusses the tradeoffs between software and hardware approaches. Finally, it introduces the concept of energy-aware data prefetching to reduce the increased energy consumption from aggressive prefetching techniques.
The document discusses changes to the ApplicationMaster (AM) component architecture in Hadoop to improve container reuse. It proposes separating container and task operations so that containers can be launched independently of tasks and reused for multiple tasks. This would facilitate features like common map output buffers for tasks in the same container and merging map outputs per node or rack. The current state of the AM changes is described, along with requirements for task side changes and a new reuse scheduler. Potential benefits highlighted include simplifying deployment, monitoring large clusters, integrating with other systems, and reducing costs through high availability and training/support services.
The document proposes a new multithreaded execution model and multi-ring architecture to exploit instruction-level parallelism. The model uses multiple instruction threads that are scheduled for execution based on data availability. The instructions from different threads are grouped together for parallel execution. The proposed multi-ring architecture features large resident thread activations, a high-speed buffer to avoid load/store stalls, and a dynamic instruction scheduler that selects instructions from multiple threads each cycle for execution on multiple pipelines. Initial simulation results show the architecture can achieve parallel execution of 7 instructions per cycle.
Parallel Computing: Perspectives for more efficient hydrological modelingGrigoris Anagnostopoulos
A presentation that introduces the basic concepts of parallel computing and gives some details on General Purpose GPU computing using the CUDA architecture.
This document summarizes a presentation about using the Task Parallel Library (TPL) for data flow tasks in .NET. It discusses how TPL can be used to parallelize image processing pipelines by modeling the stages as data flow blocks. The key TPL data flow blocks for sources, targets, buffering, transformations, and joins are explained. Code examples are provided for building a skeletal image processing program using these TPL data flow capabilities.
Optimizing your java applications for multi core hardware (IndicThreads)
Session Presented at 5th IndicThreads.com Conference On Java held on 10-11 December 2010 in Pune, India
WEB: http://J10.IndicThreads.com
------------
Rising power dissipation in microprocessor chips is leading to a trend towards increasing the number of cores on a chip (multi-core processors) rather than increasing clock frequency as the primary basis for increasing system performance. Consequently the number of threads in commodity hardware has also exploded. This leads to complexity in designing and configuring high performance Java applications that make effective use of the new hardware. In this talk we provide a summary of the changes happening in the multi-core world and then discuss some of the JVM features which exploit the multi-core capabilities of the underlying hardware. We also explain techniques to analyze and optimize your application for highly concurrent systems. Key topics include an overview of Java Virtual Machine features & configuration, ways to correctly leverage the java.util.concurrent package to achieve enhanced parallelism for applications in a multi-core environment, operating system issues, virtualization, Java code optimizations, and useful profiling tools and techniques.
Takeaways for the Audience
Attendees will leave with a better understanding of the new multi-core world, of the Java Virtual Machine features which exploit multi-core hardware, and of the techniques they can apply to ensure their Java applications run well in a multi-core environment.
NERSC is the production high-performance computing (HPC) center for the United States Department of Energy (DOE) Office of Science. The center supports over 6,000 users in 600 projects, using a variety of applications in materials science, chemistry, biology, astrophysics, high energy physics, climate science, fusion science, and more.
NERSC deployed the Cori system on over 9,000 Intel® Xeon Phi™ processors. This session describes the optimization strategy for porting codes that target traditional manycore architectures to the processors. We also discuss highlights and lessons learned from the optimization process on 20 applications associated with the NERSC Exascale Science Application Program (NESAP).
The document provides an introduction to parallel computing and parallel programming. It discusses Moore's Law and the need for parallelism to continue increasing performance. It outlines some common parallel architectures like SIMD and MIMD. It also describes different parallel programming models including message passing and shared memory, and different parallel algorithm patterns such as data parallel, task graph, and master-slave models. Finally, it briefly introduces MapReduce as a parallel programming paradigm.
The document discusses MapReduce, including its programming model, internal framework, and improvements. It describes MapReduce as a programming model and framework that allows parallel processing of large datasets across commodity machines. The map function processes input key-value pairs to generate intermediate pairs, and the reduce function combines values for each key. The framework automatically parallelizes jobs and provides fault tolerance.
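To make the map and reduce steps concrete, here is a minimal word-count sketch in C++ (my illustration, not code from the document): map emits a (word, 1) pair for every word, the pairs are grouped by key, and reduce sums the values for each word.

#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Toy word count in the MapReduce style (illustration only): map emits
// (word, 1) pairs, the framework groups them by key, and reduce sums the
// values for each key.
int main() {
    std::vector<std::string> documents = {"the cat sat", "the cat ran"};
    std::vector<std::pair<std::string, int>> intermediate;
    for (const auto& doc : documents) {            // map phase
        std::istringstream words(doc);
        std::string w;
        while (words >> w) intermediate.emplace_back(w, 1);
    }
    std::map<std::string, int> counts;             // group by key + reduce
    for (const auto& [word, one] : intermediate) counts[word] += one;
    for (const auto& [word, n] : counts)
        std::cout << word << ": " << n << '\n';    // cat: 2, ran: 1, ...
}

A real framework distributes the map and reduce phases across machines and restarts failed tasks; the data flow, however, is exactly this.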
Understand and Harness the Capabilities of Intel® Xeon Phi™ Processors (Intel® Software)
The second-generation Intel® Xeon Phi™ processor offers new and enhanced features that provide significant performance gains in modernized code. For this lab, we pair these features with Intel® Software Development Products and methodologies to enable developers to gain insights on application behavior and to find opportunities to optimize parallelism, memory, and vectorization features.
Optimize Single Particle Orbital (SPO) Evaluations Based on B-splines (Intel® Software)
Orbital representations that are based on B-splines are widely used in quantum Monte Carlo (QMC) simulations of solids, where they historically take as much as 50 percent of the total runtime. Random access to a large four-dimensional array makes it challenging to use caches and wide vector units in modern CPUs efficiently. So, we present node-level optimizations of B-spline evaluations on multicore and manycore shared memory processors.
To increase single instruction multiple data (SIMD) efficiency and bandwidth utilization, we first apply a data layout transformation from an array of structures (AoS) to a structure of arrays (SoA). Then, by blocking SoA objects, we optimize cache reuse and obtain sustained throughput for a range of problem sizes. We implement efficient nested threading in B-spline orbital evaluation kernels, paving the way towards strong scaling of QMC simulations, resulting in performance enhancements. Finally, we employ roofline performance analysis to model the impacts of our optimizations.
Migration To Multi Core - Parallel Programming Models (Zvi Avraham)
The document discusses multi-core and many-core processors and parallel programming models. It provides an overview of hardware trends including increasing numbers of cores in CPUs and GPUs. It also covers parallel programming approaches like shared memory, message passing, data parallelism and task parallelism. Specific APIs discussed include Win32 threads, OpenMP, and Intel TBB.
Java performance tuning involves diagnosing and addressing issues like slow application performance and out of memory errors. The document discusses Java performance problems and their solutions, tuning tips, and monitoring tools. Some tips include tuning JVM parameters like heap size, garbage collection settings, and enabling parallel garbage collection for multi-processor systems. Tools mentioned include JConsole, VisualVM, JProfiler, and others for monitoring memory usage, thread activity, and garbage collection.
Java Colombo Meetup on 22nd March 2018
Speaker: Isuru Perera, Technical Lead at WSO2
Flame graphs are a visualization of profiled software, developed by Brendan Gregg, an industry expert in computing performance and cloud computing. Finding out why CPUs are busy is an important task when troubleshooting performance issues, and we often use a sampling profiler to see which code-paths are hot. However, a profiler will dump a lot of data with thousands of lines, and it is not easy to go through all of it. With Flame Graphs, we can identify the most frequent code-paths quickly and accurately. Basically, a Flame Graph simply visualizes the stack-trace output of a sampling profiler.
There are many ways to profile Java applications and Java Flight Recorder (JFR) is a really good tool to profile a Java application with a very low overhead. I will show how we can generate a Flame Graph from a Java Flight Recording using the JFR Flame Graph tool (https://github.com/chrishantha/jfr-flame-graph) I developed.
Since Flame Graphs can visualize any stack profiles, we can also use a Linux system profiler (perf) and create a Java Mixed-Mode Flame Graph, which will show how much CPU time is spent in Java methods, system libraries and the kernel. We can troubleshoot performance issues related to high CPU usage easily with a flame graph showing profile information from both system code paths and Java code paths. I will discuss how we can use the -XX:+PreserveFramePointer option in JDK and the perf system profiler to generate a Java Mixed-mode flame graph.
Exploring hybrid memory for gpu energy efficiency through software hardware c... (Cheng-Hsuan Li)
This paper proposes a software-hardware co-design approach to improve GPU energy efficiency through the use of a hybrid memory system combining DRAM and PCM. The key aspects of the proposed approach include: (1) Using finer-grained data migration at the 256B level between DRAM and PCM, (2) Tracking reference counts and row buffer misses to determine hot data, (3) Batch migrating data to improve row buffer locality, and (4) A compiler that determines initial data placement to reduce migration overhead. Evaluation shows the approach improves energy efficiency by 6-49% compared to pure DRAM or PCM systems with less than 2% performance loss.
Software Profiling: Understanding Java Performance and how to profile in Java (Isuru Perera)
Guest lecture at University of Colombo School of Computing on 27th May 2017
Covers the following topics:
Software Profiling
Measuring Performance
Java Garbage Collection
Sampling vs Instrumentation
Java Profilers, Java Flight Recorder
Java Just-in-Time (JIT) compilation
Flame Graphs
Linux Profiling
This document discusses fundamentals of deep learning with Python. It begins with an introduction to deep learning and neural networks. It then covers setting up the Python deep learning environment, including installing key libraries like TensorFlow, Keras, NumPy and Matplotlib. The document provides an example of a first deep learning project in Python using the Keras API to build and train a neural network on a diabetes dataset. It discusses loading and preprocessing data, defining the model architecture, compiling and fitting the model, evaluating performance and making predictions. Finally, it covers additional topics like regularization, batch normalization, saving models and visualizing neural networks.
This document discusses parallel programming concepts in .NET 4.0 including task parallelism, data parallelism, and coordination data structures. It provides an overview of tasks and task parallelism in .NET 4.0, explaining how to create and start tasks using lambda expressions and delegates. The document also demonstrates continuing tasks by chaining additional actions to run after a task completes.
This document discusses enhancing cache coherent architectures for manycore embedded systems by taking advantage of regular memory access patterns. It proposes adding pattern storage and detection capabilities to cores to reduce coherence traffic. Called CoCCA (Codesigned Cache Coherent Architecture), it modifies the baseline cache coherence protocol to allow speculative fetching of cache lines according to detected patterns, defined during compilation. This could improve scalability over the baseline approach by reducing traffic from repetitive accesses to shared data following predictable patterns.
The document introduces the National Supercomputer Center in Tianjin (NSCC-TJ) and its TH-1A supercomputer system. It describes that NSCC-TJ is sponsored by the Chinese government to provide high performance computing services. It then provides details about the TH-1A system including its hybrid CPU and GPU architecture, proprietary interconnect network, 262TB of memory and 2PB of storage. It also summarizes the system's software stack including the Kylin Linux operating system, compilers, programming environment and visualization system.
The document is the table of contents for a book on C++ Neural Networks and Fuzzy Logic. It lists 17 chapters that cover topics like neural network models, learning and training algorithms, applications of neural networks to pattern recognition, financial forecasting and more. It also includes code examples in C++ to illustrate various neural network architectures.
Content delivery networks (CDNs) distribute much of the Internet content by caching and serving the objects requested by users. A major goal of a CDN is to maximize the hit rates of its caches, thereby enabling faster content downloads to the users. Content caching involves two components: an admission algorithm to decide whether to cache an object and an eviction algorithm to decide which object to evict from the cache when it is full. In this paper, we focus on cache admission and propose an algorithm called RL-Cache that uses model-free reinforcement learning (RL) to decide whether or not to admit a requested object into the CDN's cache. Unlike prior approaches that use a small set of criteria for decision making, RL-Cache weights a large set of features that include the object size, recency, and frequency of access. We develop a publicly available implementation of RL-Cache and perform an evaluation using production traces for the image, video, and web traffic classes from Akamai's CDN. The evaluation shows that RL-Cache improves the hit rate in comparison with the state of the art and imposes only a modest resource overhead on the CDN servers. Further, RL-Cache is robust enough that it can be trained in one location and executed on request traces of the same or different traffic classes in other locations of the same geographic region.
The document provides an outline for a lecture on massively parallel computing. It discusses how modeling and simulation problems require high-performance computing and are driving the development of new computing architectures. It mentions some of the world's most powerful supercomputers like Roadrunner and Tianhe-1A. It also discusses how cloud computing, data processing needs, and gaming are contributing to growth in parallel computing. The document outlines how data from science experiments and the web are exploding in size and driving the need for new parallel and distributed solutions.
This document discusses massively parallel architectures and processing in memory (PIM) as ways to overcome the memory wall problem. It describes several PIM and cellular architectures including Cyclops, Gilgamesh, Shamrock, picoChip and DIMES. DIMES is an FPGA implementation of a simplified cellular architecture that was used by Jason McGuiness to test programming approaches. The talk concludes with an invitation for questions.
- IBM's Tivoli Storage Manager (TSM) provides data protection, backup and recovery for both physical and virtual environments.
- TSM 6.4 includes enhancements like incremental 'forever' VMware backups, application-aware Microsoft backups, and SAP HANA support.
- The presentation discusses IBM's strategy to optimize storage infrastructure through virtualization, data reduction, analytics and automation.
Massively Parallel Processing with Procedural Python - Pivotal HAWQ (InMobi Technology)
The document discusses massively parallel processing using procedural Python. It describes EMC Corporation and its subsidiaries which provide data storage, virtualization, security, and other software solutions. It also discusses Pivotal's open source contributions and the architecture of its HAWQ database which allows Python user-defined functions to perform parallel operations across clusters.
Greenplum is the first open source Massively Parallel Processing (MPP) data warehouse, built with over two million lines of code. MPP allows a program to run across multiple processors that each use their own memory and operating system. Greenplum was released under Apache software and differs functionally and architecturally from other open source data systems through its use of MPP to execute complex SQL analytics over large datasets at high speeds. As an open source system, Greenplum assures customers that their software needs will be met long-term.
Parallel programming is now an unavoidable solution to performance problems. It is not the only one, but it can no longer be ignored. The many cores and CPUs that populate our servers are proof of that.
It can also be used more often than one might think, whether to reduce response times or to increase throughput.
We offer a survey of the current state of affairs. What are the use cases? How easy is it? How can you protect yourself from the complexity? CPU or GPU?
With the help of code examples, everything the modern developer needs to carry in their toolkit.
The document discusses superscalar processors and provides details about the Pentium 4 architecture as an example of a superscalar CISC machine. It covers topics such as instruction issue policies, register renaming, branch prediction, and the 20 stage pipeline of the Pentium 4. The Pentium 4 decodes x86 instructions into micro-ops, allocates registers and resources out of order, and can dispatch up to 6 micro-ops per cycle to execution units.
The document discusses superscalar processors and provides details about the Pentium 4 architecture as an example of a superscalar CISC machine. It covers topics such as instruction issue policies, register renaming, branch prediction, and the 20 stage pipeline of the Pentium 4 which decodes x86 instructions into micro-ops and executes them out-of-order. Dependencies limit instruction level parallelism, requiring techniques like register renaming and out-of-order execution to achieve higher performance in superscalar designs.
This talk covers the Vault 8 team's journey at Capital One where we investigated a wide variety of stream processing solutions to build a next generation real-time decisioning platform to power Capital One's infrastructure.
The result of our analysis showed Apache Storm, Apache Flink, and Apache Apex as prime contenders for our use case with Apache Apex ultimately proving to be the solution of choice based on its present readiness for enterprise deployment and its excellent performance.
This document discusses instruction pipelining in computer processors. It begins by defining pipelining and explaining how it works like an assembly line to increase throughput. It then discusses different types of pipelines and introduces the MIPS instruction pipeline as an example. The document goes on to explain different types of pipeline hazards like structural hazards, control hazards, and data hazards. It provides examples of how to detect and resolve these hazards through techniques like forwarding, stalling, predicting, and delayed branching. Key concepts covered include pipeline registers, control signals, forwarding units, and branch prediction buffers.
The document discusses various topics related to parallel and distributed computing including parallel computing resources and concepts, Flynn's taxonomy of parallel systems, parallel computer memory architectures like shared memory and distributed memory, parallel programming models such as shared memory, message passing and data parallel models, designing parallel programs including partitioning and load balancing, and different parallel computer architectures like vector processors, very long instruction word architecture, and superpipelined architecture.
Capital One's Next Generation Decision in less than 2 ms (Apache Apex)
This document discusses using Apache Apex for real-time decision making within 2 milliseconds. It provides performance benchmarks for Apex, showing average latency of 0.25ms for over 54 million events with 600GB of RAM. It compares Apex favorably to other streaming technologies like Storm and Flink, noting Apex's self-healing capabilities, independence of operators, and ability to meet latency and throughput requirements even during failures. The document recommends Apex for its maturity, fault tolerance, and ability to meet the goals of latency under 16ms, 99.999% availability, and scalability.
This document outlines a general product direction for connected clouds middleware and is intended for informational purposes only. It may not be incorporated into any contracts and does not commit Oracle to deliver any functionality. The document discusses making globally distributed stateful applications appear and operate as a single application across multiple cloud regions, providers and data centers. It also provides an agenda on challenges of multi-site deployments and introduces Oracle Coherence as a solution.
An introduction to cloud computing data center and network issues, given to the Internet Research Lab at NTU, Taiwan. Offers another definition of cloud computing and a comparison of the traditional IT warehouse with the current cloud data center (ppt slides for download). Takes an open-source data center management OS, OpenStack, as an example, and covers the underlying network issues inside a cloud DC.
This document discusses instruction level parallelism and superscalar processors. It begins by defining superscalar processors as being able to initiate and execute common instructions independently and in parallel. It then discusses limitations to instruction level parallelism due to dependencies. Different instruction issue policies are covered, including in-order and out-of-order completion. Methods for addressing dependencies like register renaming are also summarized. Examples of superscalar processor implementations from Pentium 4 and PowerPC 601 are briefly described.
Stream Computing (The Engineer's Perspective) (Ilya Ganelin)
This document discusses stream computing from an engineer's perspective. It begins by contrasting batch and stream processing, noting that stream processing handles data one record at a time with an emphasis on latency over throughput. The document then explores how to achieve scalability, performance, durability and availability in stream processing systems. It notes the tradeoffs between these goals and discusses challenges like handling failures. Specific open-source stream processing systems like Storm, Flink and Apex are then analyzed in terms of how they work, strengths, weaknesses and failure handling. The document concludes by discussing using distributed databases for state management in stream processing applications.
Large-scale projects development (scaling LAMP) (Alexey Rybak)
This 8-hour tutorial was given at various conferences, including the Percona conference (London), DevConf (Moscow) and Highload++ (Moscow).
ABSTRACT
During this tutorial we will cover various topics related to high scalability for the LAMP stack. This workshop is divided into three sections.
The first section covers basic principles of shared nothing architectures and horizontal scaling for the app/cache/database tiers.
Section two of this tutorial is devoted to MySQL sharding techniques, queues and a few performance-related tips and tricks.
In section three we will cover a practical approach to measuring site performance and quality, providing a "lean" support philosophy and connecting business and technology metrics.
In addition we will cover the very useful Pinba real-time statistics server, its features and various use cases. All of the sections are based on real-world examples built at Badoo, one of the biggest dating sites on the Internet.
IBM MQ High Availability and Disaster Recovery (2017 version) (MarkTaylorIBM)
This document discusses high availability and disaster recovery strategies for IBM MQ. It describes technologies like queue manager clusters, multi-instance queue managers, and HA clusters that can be used to provide high availability when failures occur across datacenters and clouds. Multi-instance queue managers provide basic failover of a queue manager between two systems without an HA cluster. HA clusters coordinate failover of resources like the queue manager, shared storage, and IP address across multiple machines for increased reliability. The IBM MQ Appliance also supports high availability between two appliances.
IBM MQ - High Availability and Disaster Recovery (MarkTaylorIBM)
IBM MQ provides capabilities to keep data safe and businesses running in the event of failures. This includes solutions for high availability (HA) and disaster recovery (DR) whether running on-premises or in hybrid cloud environments. HA aims to keep systems running through failures while DR focuses on recovering after an HA failure. Key HA technologies in IBM MQ include queue manager clusters, queue sharing groups, multi-instance queue managers, and HA clusters. These solutions provide redundancy to prevent single points of failure and enable fast failover. DR requires replicating data to separate sites which IBM MQ supports through various backup and replication features.
This document discusses applications that can experience performance issues when virtualized due to expensive address translation costs. It describes how virtual machines require an additional level of memory virtualization that introduces shadow page tables or nested page tables to map guest virtual addresses to machine memory. While hardware-assisted virtualization reduces exit frequencies and overhead compared to software address translation, it also makes the translation lookup more expensive due to deeper page table walks. In rare cases with very poor memory locality and high translation miss rates, the cycle costs of the two-level address translation can significantly degrade application performance when virtualized.
This chapter discusses reduced instruction set computers (RISC). It provides background on major advances in computers that led to RISC designs, such as cache memory, microprocessors, and pipelining. The key features of RISC processors are described, including large general-purpose registers, a limited and simple instruction set, and an emphasis on optimizing the instruction pipeline. The chapter compares RISC to CISC processors and discusses the driving forces behind both approaches. It analyzes the execution characteristics of programs and implications for processor design, such as optimizing register usage and careful pipeline design.
This document discusses the evolution of computer architecture from CISC to RISC designs. It covers major advances like cache memory and microprocessors that enabled RISC. Key RISC features include large register files optimized via register allocation algorithms. Pipelining is optimized in RISC via techniques like delayed branching. While CISC aimed to simplify compilers, RISC focuses on optimizing instruction execution through techniques like register referencing and simplified instruction sets. The document also notes ongoing debates around quantitatively and qualitatively comparing RISC and CISC designs.
This document provides an introduction to instruction-level parallel (ILP) processors. It discusses how ILP processors improve performance by executing multiple instructions in parallel through techniques like pipelining and superscalar execution. It also covers dependencies between instructions like data dependencies, control dependencies, and resource dependencies that limit parallelism. The document discusses approaches for instruction scheduling used by compilers and processors to detect and resolve dependencies to expose more instruction-level parallelism. It notes that while ILP processors can provide significant speedups for scientific programs, dependencies limit speedups for general-purpose programs to around 2-4 times.
The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures
1. The University of Hertfordshire
The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures
Jason McGuiness, Computer Science, University of Hertfordshire, UK
Colin Egan, Computer Science, University of Hertfordshire, UK
Building Futures in Computer Science: empowering people through technology
2. Presentation Structure
• Parallel processing
– Pipeline processors, MII architectures, multiprocessors
• Processing In Memory (PIM)
– Cellular Architectures: Cyclops/DIMES and picoChip
• Code-generation issues arising from massive parallelism
• Possible solutions to this issue:
– Use the compiler or some libraries
• An example implementation of a library, and the issues
• Questions?
– Ask as we go along, but we’ll also leave time for questions at the end of this presentation
3. Parallel Processing
• How can parallel processing be achieved?
– By exploiting:
• Instruction Level Parallelism (ILP)
• Thread Level Parallelism (TLP)
• Multi-processing
• Data Level Parallelism (DLP)
• Simultaneous Multi-Processing (SMP)
• Concurrent processing
• etcetera
4. Pipelining
• Exploits ILP by overlapping instructions in different stages of execution:
– ILP is the number of operations in a computer program that can be performed at the same time (simultaneously)
• Improves overall program execution time by increasing throughput (see the worked example below):
– It does not improve individual instruction execution time
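To make the throughput claim concrete, here is a worked example (my addition, not from the original slides) of idealised pipeline timing: with k stages and n instructions, a hazard-free pipeline needs k + (n - 1) cycles, against n * k cycles without pipelining.

#include <iostream>

// Idealised pipeline timing (assumes no hazards, one instruction issued per
// cycle): the first instruction takes k cycles to fill the pipe, after which
// one instruction completes every cycle.
int main() {
    const long k = 5;     // pipeline depth, as in the 5-stage example
    const long n = 1000;  // number of instructions executed
    const long unpipelined = n * k;      // each instruction runs start-to-finish
    const long pipelined = k + (n - 1);  // fill the pipe once, then overlap
    std::cout << "unpipelined: " << unpipelined << " cycles\n"
              << "pipelined:   " << pipelined << " cycles\n"
              << "speedup:     "
              << static_cast<double>(unpipelined) / pipelined << '\n';
    // Note: the latency of any single instruction is still k cycles; only
    // throughput improves, exactly as the slide states.
}

For large n the speedup approaches the pipeline depth k.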
5. Pipelining
• A simple 5-stage pipeline:
1. Instruction fetch: increment program counter; instruction cache access; branch prediction
2. Instruction decode: register operand set-up
3. Execution in ALU
4. Data cache access
5. Write back
6. Pipelining hazards
• Pipelining introduces hazards which can severely impact on processor performance:
– Data (RAW, WAW and WAR)
– Control (conditional branch instructions)
– Structural (hardware contention)
• To overcome such hazards, complex hardware (dynamic scheduling), complex software (static scheduling) or a combination of both is required
7. Pipelining hazards
Data, Control, Structural
• Data hazards
– Read After Write (RAW dependency)
• The result of an operation is required as an operand of a subsequent operation and that result is not yet available in the register file
add $1,$2,$3   //$1=$2+$3
addi $4,$1,8   //$4=$1+8: data dependency ($1 is unavailable)
sub $5,$2,$1   //$5=$2-$1: data dependency ($1 is unavailable)
ld $6,8($1)    //$6=Mem[$1+8]... $1 may be available
add $7,$6,$8   //load-use dependency... data dependency ($6 is unavailable)
8. Pipelining hazards
Data, Control, Structural
• Data hazards
– Write After Write (WAW data dependency)
sub $1,$2,$3
addi $1,$4,8 //the addi cannot complete ahead of the sub
– Write After Read (WAR data dependency)
sub $2,$1,$3
addi $1,$4,8 //the addi cannot complete ahead of the sub
9. Pipelining hazards
Data, Control, Structural
• Data hazards
– Read After Write (RAW data dependency) is a problem in both single instruction issue architectures and Multiple Instruction Issue (MII) architectures
– Write After Write (WAW data dependency) and Write After Read (WAR data dependency) are a problem in MII architectures, not in single instruction issue architectures
10. Pipelining hazards
Data, Control, Structural
• Data hazards
– Read After Write (RAW data dependencies) are True Data Dependencies
• RAW hazards are avoided:
– in single instruction issue architectures by forwarding (bypassing),
– in MII architectures:
» Superscalar (dynamic): by result broadcast
» VLIW (static): by aggressive instruction (percolation) scheduling
– Write After Write (WAW data dependency) and Write After Read (WAR data dependency) are Artificial Data Dependencies
• WAW/WAR are eliminated by Register Renaming (see the sketch below)
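To illustrate renaming (a minimal sketch of my own, not the slides' design), the program below renames the WAW pair from slide 8: every write to an architectural register is given a fresh physical register, so the two writes to $1 no longer conflict and only true RAW dependencies remain.

#include <array>
#include <cstdio>
#include <map>
#include <string>
#include <vector>

int main() {
    // Minimal register-renaming sketch (illustration only).
    std::map<std::string, int> rename;  // architectural name -> physical id
    int next = 0;
    auto read = [&](const std::string& r) {
        if (!rename.count(r)) rename[r] = next++;  // first use: allocate
        return "p" + std::to_string(rename[r]);
    };
    // {dest, src1, src2}: sub $1,$2,$3 then addi $1,$4,#8 (a WAW pair on $1)
    std::vector<std::array<std::string, 3>> code = {
        {"$1", "$2", "$3"},
        {"$1", "$4", "#8"},
    };
    for (const auto& ins : code) {
        std::string s1 = read(ins[1]);
        std::string s2 = (ins[2][0] == '#') ? ins[2] : read(ins[2]);
        rename[ins[0]] = next++;  // fresh physical destination register
        std::printf("p%d <- %s, %s\n", rename[ins[0]], s1.c_str(), s2.c_str());
    }
    // Prints "p2 <- p0, p1" then "p4 <- p3, #8": the two writes to $1 now
    // target different physical registers, so the WAW hazard is gone.
}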
11. Pipelining hazards
Data, Control, Structural
• Control hazards
– Are caused by (conditional) branch instructions
– The branch is the only instruction which can alter the smooth flow of instruction execution
– The next instruction to be executed is not known until the condition has been evaluated, and if the branch is taken then the address of the next instruction to be executed must also be computed:
• branch resolution
beq $1,$2,label //if $1==$2 then pc=label; else pc=pc+4;
bne $1,$2,label //if $1!=$2 then pc=label; else pc=pc+4;
12. Pipelining hazards
Data, Control, Structural
• Control hazards
– Are a major problem:
• Between 1 in 5 and 1 in 8 instructions in the dynamic stream of a general-purpose program are conditional branch instructions
• Approximately 66% of conditional branches are taken (target path)
• They are a particular problem in superscalar architectures, where 100s of instructions may be in flight
– The impact can be reduced by:
• Branch prediction (dynamic and static; see the sketch after this list)
• Branch removal (scheduling - static)
• Delay region scheduling (static)
• Multithreading (dynamic and static)
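As a concrete illustration of dynamic branch prediction (my sketch, not a design from the slides), the classic 2-bit saturating counter hesitates before changing its mind: a single anomalous outcome, such as a loop exit, does not flip a well-established prediction.

#include <array>
#include <cstdint>
#include <cstdio>

// Classic 2-bit saturating-counter branch predictor (a sketch, not the
// slides' design). Counter states 0,1 predict not-taken; 2,3 predict taken.
class TwoBitPredictor {
    std::array<std::uint8_t, 1024> table{};  // indexed by low bits of the PC
public:
    bool predict(std::uint32_t pc) const { return table[pc % table.size()] >= 2; }
    void update(std::uint32_t pc, bool taken) {
        std::uint8_t& c = table[pc % table.size()];
        if (taken && c < 3) ++c;      // saturate towards "taken"
        if (!taken && c > 0) --c;     // saturate towards "not taken"
    }
};

int main() {
    TwoBitPredictor bp;
    const std::uint32_t pc = 0x400123;     // one loop-closing branch
    int correct = 0, total = 0;
    for (int rep = 0; rep < 100; ++rep)
        for (int i = 0; i < 10; ++i) {     // taken 9 times, then falls through
            const bool taken = (i != 9);
            correct += (bp.predict(pc) == taken);
            ++total;
            bp.update(pc, taken);
        }
    std::printf("accuracy: %d/%d\n", correct, total);  // ~90% after warm-up
}

On this loop-closing branch the predictor mispredicts only the loop exit, giving roughly 90% accuracy.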
13. Pipelining hazards
Data, Control, Structural
• Structural hazards
– More than one instruction attempting to access the same piece of hardware at the same time
– These can be overcome by increasing the amount of hardware:
• For example, use the Harvard Architecture (separate the I-cache from the D-cache)
14. Multiple Instruction Issue
• MII
– A processor that is capable of fetching and issuing
more than one instruction during each processor cycle
– A program is executed in parallel, but the processor
maintains the outward appearance of sequential
execution
– The program binary must therefore be regarded as a
specification of what was done, not how it was done
– Minimise program execution time by:
• reducing instruction latencies
• exploiting additional ILP
15. Multiple Instruction Issue
• Superscalar Processor:
– An MII processor where the number of instructions issued
in each clock cycle is determined dynamically by hardware
at run-time
• Instruction issue may be in-order or out-of-order (Tomasulo or
equivalent).
• VLIW Processor (Very Long Instruction Word):
– An MII processor where the number of operations
(instructions) issued in each clock cycle is fixed and where
the operations are selected statically by the compiler at
compile time
16. Superscalars
• Fetch and Decode multiple instructions in parallel
• Dynamically predict conditional branches
• Initiate instructions for parallel execution based,
whenever possible, on the availability of operands
• Frequently issue/execute instructions out-of-order
• Re-sequence instructions into the original program order
after completion
17. Superscalars
1. Instruction fetch strategies
Multiple instruction fetch
Dynamic branch prediction
2. Detecting True Data Dependencies between registers
3. Initiating or issuing multiple instructions in parallel
4. Resources for parallel execution of instructions:
Multiple pipelined functional units
Efficient memory hierarchies
5. Communication of data values through memory:
Loads and stores (RISC)
Allowing for dynamic, unpredictable behaviour of memory hierarchies
6. Committing the process state in the correct order:
Maintaining the appearance of sequential instruction execution
18. Superscalars
• Since there is very little ILP within individual basic
blocks (i.e. between branches), branch prediction is
essential for high performance
• It is important to maintain an instruction window or
window of execution from which ILP can be
extracted
19. Superscalars
• The problems include:
– Conditional branches
– Data path complexity
– The memory wall (widening gap between CPU and
memory access times)
– Limited available parallelism within a finite sized
instruction window
• Is the superscalar approach at a point of diminishing
returns?
20. VLIWs
• Attempts to reduce hardware complexity by extracting
ILP at compile time instead of run time
• The processor issues a single very long instruction during
each machine cycle
• Each very long instruction consists of a large number of
independent operations packed into a single instruction
word by the compiler
• Each operation requires a small, statically predictable
number of cycles to execute
• Each operation is itself pipelined
• Execute instructions in-order
21. VLIWs
• Problems:
– Code expansion:
• The need to pad out instruction groups with NOPs can
double the code size
– Compatibility:
• Code must be recompiled (scheduled) for every new instance
of a particular VLIW architecture
22. Problems of MII
• To sustain multiple instruction fetch, MII
architectures require a complex memory hierarchy:
– Caches
• L1, L2, stream buffers, non-blocking caches
– Virtual Memory
• TLB
– Caches suffer from:
• Compulsory misses
• Capacity misses
• Collisions (conflict misses)
23. Memory Problems
• Compulsory misses:
– The first time a processor address is requested it will not be
in cache memory and must be fetched from a slower level
of the memory hierarchy:
• Hopefully main (physical) memory
• If not from Virtual Memory
• If not from secondary storage
– This can result in a long delay (latency) due to the large
access time(s)
24. Memory Problems
• Capacity misses:
– There are more cache block requests than the cache can
hold
• Collisions (conflict misses):
– The processor requests different blocks of
instructions/data that map to the same cache location
• For both:
– Blocks therefore have to be replaced
– But a block that has been replaced may be referenced
again, resulting in yet more replacements
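A minimal C++ sketch (my example, not from the talk) of how the
access pattern alone drives these replacements: the same matrix is
summed twice, once sequentially and once with a large stride, so the
second pass keeps evicting and re-fetching blocks once the working
set exceeds the cache:
#include <chrono>
#include <cstdio>
#include <vector>
int main() {
    const std::size_t n = 2048;
    std::vector<int> m(n * n, 1);
    long long sum = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; ++i)        // row-major: sequential accesses
        for (std::size_t j = 0; j < n; ++j)
            sum += m[i * n + j];
    auto t1 = std::chrono::steady_clock::now();
    for (std::size_t j = 0; j < n; ++j)        // column-major: large strides
        for (std::size_t i = 0; i < n; ++i)
            sum += m[i * n + j];
    auto t2 = std::chrono::steady_clock::now();
    std::printf("row-major %lldus, column-major %lldus (sum %lld)\n",
        (long long)std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count(),
        (long long)std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count(),
        sum);
}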
25. Thread Level Parallelism
• A thread can be considered to be a ‘lightweight
process’
– Where a thread consists of a short sequence of code, with
its own:
• registers, data, state and so on
• but shares process space
• TLP is exploited by simultaneously executing
different threads on different processors:
– TLP is therefore exploited by multiprocessors
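As a minimal sketch in today's C++ (std::thread, standardised after
this talk as part of the C++0x/C++11 work discussed later): each
thread has its own registers and stack, but all of them share the
process's address space:
#include <iostream>
#include <thread>
#include <vector>
int shared_total = 0; // visible to every thread in the process
int main() {
    std::vector<std::thread> threads;
    std::vector<int> partial(4, 0);
    for (int t = 0; t < 4; ++t)                 // each thread writes its own slot
        threads.emplace_back([t, &partial] { partial[t] = t * t; });
    for (auto &th : threads) th.join();
    for (int p : partial) shared_total += p;
    std::cout << shared_total << '\n';          // 0+1+4+9 = 14
}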
26. Multiprocessors
• Should be:
– Easily scalable
– Fault tolerant
– Achieve higher performance than a uni-processor
• But …
– How many processors can we connect?
– How do parallel processors share data?
– How are parallel processors co-ordinated?
27. Multiprocessors
• Shared Memory Processors (SMP)
– All processors share a single global memory address space
– Communication is through shared variables in memory
– Synchronisation is via locks (hardware) or semaphores
(software)
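A minimal sketch of the SMP model (using the C++11 library for
brevity): communication is through the shared counter,
synchronisation through the lock:
#include <iostream>
#include <mutex>
#include <thread>
int counter = 0;              // the shared variable
std::mutex counter_lock;      // the lock that serialises updates
void add(int n) {
    for (int i = 0; i < n; ++i) {
        std::lock_guard<std::mutex> guard(counter_lock);
        ++counter;            // without the lock the increments race
    }
}
int main() {
    std::thread a(add, 100000), b(add, 100000);
    a.join(); b.join();
    std::cout << counter << '\n'; // always 200000 with the lock held
}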
28. Multiprocessors
• Uniform Memory Access (UMA)
– All memory accesses take the same time
– Do not scale well
• Non-uniform Memory Access (NUMA)
– Each processor has a private (local) memory
– Global memory access time can vary from processor to
processor
– Present more programming challenges
– Are easier to scale
29. Multiprocessors
• NUMA
– Communication and synchronization are achieved
through message passing:
• Processors could then, for example, communicate over
an interconnection network
• Processors use send and receive primitives
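MPI is not named in the talk, but it is the canonical example of
such send and receive primitives; a minimal sketch:
#include <mpi.h>
#include <cstdio>
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {          // processor 0 sends a value...
        int value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {   // ...processor 1 receives it
        int value = 0;
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::printf("received %d\n", value);
    }
    MPI_Finalize();
}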
30. Multiprocessors
• The difficulty is in writing effective parallel programs:
– Parallel programs are inherently harder to develop
– Programmers need to understand the underlying hardware
– Programs tend not to be portable
– Amdahl’s law: even a very small part of a program that is
inherently sequential can severely limit the attainable speedup
• “It remains to be seen how many important applications
will run on multiprocessors via parallel processing.”
• “The difficulty has been that too few important
application programs have been written to complete tasks
sooner on multiprocessors.”
31. Multiprocessors
• Multiprocessors suffer the same memory problems as
uni-processors and in addition:
– The problem of maintaining memory coherence
between the processors
32. Cache Coherence
• A read operation must return the value of the
latest write operation
• But in multiprocessors each processor will
(probably) have its own private cache memory
– There is no guarantee of data consistency between
private (local) cache memory and shared (global)
memory
33. The memory wall
• Today’s processors are 10 times faster than processors of only
5 years ago
• But today’s processors do not complete tasks in 1/10th of the
time taken by processors from 5 years ago
• CPU speeds double approximately every eighteen months
(Moore's Law), while main memory speeds double only about
every ten years
– This causes a bottle-neck in the memory subsystem, which impacts on
overall system performance
– The “memory wall” or the “Von Neumann bottleneck”
34. The memory wall
• In today’s processors RAM roughly holds 1% of the
data stored on a hard disk drive
• And hard disk drives supply data nearly a thousand
times slower than a processor can use it
• This means that for many data requests, a processor is
idle while it waits for data to be found and
transferred
35. The memory wall
• IBM’s vice president Mark Dean has said:
– "What's needed today is not so much the ability to
process each piece of data a great deal, it's the
ability to swiftly sort through a huge amount of
data."
36. The memory wall
• Mark Dean again:
– "The key idea is that instead of focusing on
processor micro-architecture and structure, as in
the past, we optimize the memory system's latency
and throughput — how fast we can access, search
and move data …"
• Leads to the concept of Processing In Memory
(PIM)
37. Processing in memory
• The idea of PIM is to overcome the bottleneck
between the processor and main memory by
combining a processor and memory on a single chip
• The benefits of a PIM architecture are that it:
– Reduces memory latency
– Increases memory bandwidth
– Simplifies the memory hierarchy
– Provides multi-processor scaling capabilities:
• Cellular architectures
– Avoids the Von Neumann bottleneck
38. Processing in memory
• This means that:
– Much of the expensive memory hierarchy can be
dispensed with
– CPU cores can be replaced with simpler designs
– Less power is used by PIM
– Less silicon space is used by PIM
39. Processing in memory
• But …
– Processor speed is reduced
– The amount of available memory is reduced
• However, PIM is easily scaled:
– Multiple PIM chips connected together forming a
network of PIM cells
– Such scaled architectures are called Cellular
architectures
40. Cellular architectures
• Cellular architectures consist of a high number of
cells (PIM units):
– With tens of thousands up to one million processors
– Each cell (PIM unit) is small, which is what makes
extremely large-scale parallel operation achievable
– To minimise communication time between cells, each cell
is only connected to its neighbours
41. Cellular architectures
• Cellular architectures are fault tolerant:
– With so many cells, it is inevitable that some processors
will fail
– Cellular architectures simply re-route instructions and data
around failed cells
• Cellular architectures are ranked highly as today’s
Supercomputers:
– IBM BlueGene takes the top slots in the Top 500 list
42. Cellular architectures
• Cellular architectures are threaded:
– Each thread unit:
• Is independent of all other thread units
• Serves as a single in-order issue processor
• Shares computationally expensive hardware such as floating-point
units
– There can be a large number of thread units:
• 1,000s if not 100,000s
• Therefore they are massively parallel architectures
43. Cellular architectures
• Cellular architectures are NUMA
– Have irregular memory access:
• Some memory is very close to the thread units and is
extremely fast
• Some is off-chip and slow
• Cellular architectures, therefore, use caches
and have a memory hierarchy
44. Cellular architectures
• In Cellular architectures multiple thread units
perform memory accesses independently
• This means that the memory subsystem of a
Cellular architecture does in fact require some
form of memory access model that permits
memory accesses to be effectively served
45. Cellular architectures
• Uses of Cellular architectures:
– Games machines (simple Cellular architecture)
– Bioinformatics (protein folding)
– Imaging
• Satellite
• Medical
• Etcetera
– Research
– Etcetera
46. Cellular architectures
• Examples:
– BlueGene Project:
• Cyclops (IBM) – next generation from BlueGene/P,
called BlueGene/C
– DIMES – a prototype implementation
– Gilgamesh (NASA)
– Shamrock (Notre Dame)
– picoChip (Bath, UK)
47. IBM Cyclops or BlueGene/C
• Developed by IBM at the Thomas J. Watson Research
Center
• Also called BlueGene/C, by analogy with the earlier
BlueGene/L and BlueGene/P versions
48. Cyclops
• The idea of Cyclops is to provide around one million
processors:
– Where each processor can perform a billion operations per
second
– Which means that Cyclops will be capable of one petaflop
of computations per second (a thousand trillion
calculations per second)
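(Worked through: 10^6 processors × 10^9 operations per second
each = 10^15 operations per second, i.e. one petaflops.)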
49. Cyclops
[Figure: block diagram of the Cyclops hierarchy (Board → Chip →
Processor), flattened in extraction. Labels recoverable from the
original: thread units (TU) with shared FPU and SR blocks; SP units;
a crossbar network; eight on-chip memory banks; off-chip memory at
4 GB/sec; links to other chips via a 3D mesh at 6 × 4 GB/sec;
1 Gbit/sec links; and an IDE HDD at 50 MB/sec.]
50. DIMES
• DIMES:
– Is the first hardware implementation of a Cellular
architecture
– Is a simplified ‘cut-down’ version of Cyclops
– Is a hardware validation tool for Cellular architectures
– Emulates Cellular architectures, in particular Cyclops,
cycle-by-cycle
– Is implemented on at least one FPGA
– Has been evaluated by Jason
51. DIMES
• The DIMES implementation that Jason evaluated:
– Supports a Pthreads-style programming model
– Is a dual processor where each processor has four thread
units
– Has 4K of scratch-pad (local) memory per thread unit
– Has two banks of 64K global shared memory
– Has different memory models:
• Scratch pad memory obeys the program consistency model for all
of the eight thread units
• Global memory obeys the sequential consistency model for all of
the eight thread units
– Is called DIMES/P2
53. DIMES
• Jason’s concerns were:
– How to manage a potentially large number of
threads
– How to exploit parallelism from the input source
code in these threads
– How to manage memory consistency
54. DIMES
• Jason tested his concerns by using an “embarrassingly
parallel program which generated Mandelbrot sets”
• Jason’s approach was to distribute the work-load between
threads and he also implemented a work-stealing
algorithm to balance loads between threads:
– When a thread completed its ‘work-load’, rather than
remain idle that thread would ‘steal-work’ from another
‘busy’ thread
– This meant that he maximised parallelism, improved
thread utilisation and hence reduced overall program
execution time (see the sketch below)
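This is not Jason's DIMES code, but a minimal C++ sketch of the
same idea: each worker drains its own deque of tasks and, when
empty, steals from the front of a busy neighbour's:
#include <atomic>
#include <cstdio>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>
struct Worker {
    std::deque<std::function<void()>> tasks;
    std::mutex lock;
};
// Pop from the back of our own deque.
std::function<void()> take(std::vector<Worker> &ws, std::size_t me) {
    std::lock_guard<std::mutex> g(ws[me].lock);
    if (ws[me].tasks.empty()) return {};
    auto t = std::move(ws[me].tasks.back());
    ws[me].tasks.pop_back();
    return t;
}
// Steal from the front of somebody else's deque.
std::function<void()> steal(std::vector<Worker> &ws, std::size_t me) {
    for (std::size_t v = 0; v < ws.size(); ++v) {
        if (v == me) continue;
        std::lock_guard<std::mutex> g(ws[v].lock);
        if (!ws[v].tasks.empty()) {
            auto t = std::move(ws[v].tasks.front());
            ws[v].tasks.pop_front();
            return t;
        }
    }
    return {};
}
int main() {
    const std::size_t nworkers = 4, ntasks = 32;
    std::vector<Worker> workers(nworkers);
    std::atomic<int> remaining((int)ntasks);
    // Deliberately unbalanced: all the work starts on worker 0.
    for (std::size_t i = 0; i < ntasks; ++i)
        workers[0].tasks.push_back([i, &remaining] {
            std::printf("task %zu done\n", i);
            --remaining;
        });
    std::vector<std::thread> pool;
    for (std::size_t me = 0; me < nworkers; ++me)
        pool.emplace_back([me, &workers, &remaining] {
            while (remaining.load() > 0) {
                auto t = take(workers, me);
                if (!t) t = steal(workers, me);      // idle? steal work
                if (t) t(); else std::this_thread::yield();
            }
        });
    for (auto &th : pool) th.join();
}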
56. picoChip
• picoChip is based in Bath
– “… is dedicated to providing fast, flexible wireless
solutions for next generation telecommunications
systems.”
• picoArrayTM
– Is a tiled architecture
– 308 heterogeneous processors connected together
– The interconnects consist of bus switches joined
by picoBusTM
– Each processor is connected to the picoBusTM
above and below it
58. picoChip and Parallelism
• The parallelism provided by picoChip is
synchronous
– This avoids many of the issues raised by the other
architectures that expose asynchronous parallelism
– But it is at the cost of the flexibility that
asynchronous parallelism provides
59. Issues arising ...
• Such massively parallel hardware creates a
serious problem for the programmer:
– Because it is computer science folklore that writing
parallel programs is very hard
• Writing fast and correct massively parallel
programs is a major concern
60. Abstraction of the Parallelism
• This may be done in various ways:
– For example within the compiler:
• Using trace scheduling, list-based scheduling or other
data-flow based means amongst others
– Using language features:
• HPF, UPC or the additions to C++ in the IBM Visual
Age compiler and Microsoft's additions
– Most commonly using libraries:
• For example: Posix threads, Win32 threads, OpenMP,
boost.thread, home-grown wrappers
61. Parallelism using Libraries
• Using libraries has a major advantage over
implementing parallelism within the language:
– It does not require the design of a new language,
nor learning a new language
– Novel languages are traditionally seen as a burden
and often hinder adoption of new systems due to
the volume of existing source-code
– But libraries are especially prone to misuse and
are traditionally hard to use
62. Issues of Libraries: part I
• The model exposed is very diverse:
– Message passing, e.g. MPI
– Loop-level parallelism (e.g. “forall ...” constructs,
as in OpenMP), which exposes very limited
parallelism in general-purpose code
– Very low-level primitives, e.g. those exposed in
Posix Threads are extremely basic: mutexes,
condition variables, and basic thread operations
• The libraries require experience to use well, or
even correctly
63. Part II: Non-composability of atomic
operations!
• The fact that atomic operations do not compose is a
major concern when using libraries!
• The composition of thread-safe data structures does
not guarantee thread-safety:
– At each combination of the data structures, more locks are
required to ensure correct operation!
– This implies more and more layers of locks, that are
slow...
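A minimal sketch of the problem (my example, not from the talk):
each stack below is individually thread-safe, yet moving an item
between two of them still needs an extra, outer lock:
#include <mutex>
#include <stack>
// A classically "thread-safe" stack: every member takes its own lock.
template<typename T>
class safe_stack {
    std::stack<T> items;
    mutable std::mutex lock;
public:
    void push(T v) { std::lock_guard<std::mutex> g(lock); items.push(std::move(v)); }
    bool try_pop(T &out) {
        std::lock_guard<std::mutex> g(lock);
        if (items.empty()) return false;
        out = std::move(items.top());
        items.pop();
        return true;
    }
};
// Composing two individually thread-safe stacks is NOT thread-safe:
// between the pop and the push another thread can observe the item
// in neither stack. Correctness needs yet another layer of locking.
std::mutex transfer_lock; // the extra lock the slide warns about
void move_one(safe_stack<int> &from, safe_stack<int> &to) {
    std::lock_guard<std::mutex> g(transfer_lock);
    int v;
    if (from.try_pop(v)) to.push(v);
}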
64. Parallelism in the Compiler
• Given the concerns with regards to libraries,
what about parallelising compilers?
• The fact is that auto-parallelising compilers
exist, e.g. the list-based scheduling implemented
in the EARTH-C compiler (circa 1999) has been
proven to be optimal for that architecture
• Data-flow compilers have existed for years
– Why aren't they used?
65. Industrial Parallelising Compilers
• Microsoft is introducing OpenMP-based
constructs into their compiler, e.g. “forall”
• IBM Visual Age has similar functionality
• Java has a thread library
• C++0x: much work has been done regarding
standardisation with respect to multiple threads
66. C++ as an example
• The uses of C++ make it an interesting target
for parallelisation:
– Although imperative, so arguably flawed for
implementing parallelism
– It has great market penetration, therefore there is
much demand for using parallelism
– Commonly used in high-performance, multi-
threaded systems
– General-purpose nature and quality libraries are
increasing the appeal to super-computers
67. Parallelism support in C++
• Libraries exist beyond the usual C libraries:
– boost.thread – exists now, requires standards-
compliant compilers
– C++0x: details of the threading support are
becoming apparent that appear to include:
• Atomic operations (memory consistency), exceptions,
machine-model underpins approach
• Threading models: thread-as-a-class, lock objects
• Thread pools – coming later – probably
• More details on the web, or at the ACCU - in flux
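What that standardisation eventually delivered (as C++11) looks
like this minimal sketch: atomic operations with an explicit
memory-consistency argument per operation:
#include <atomic>
#include <iostream>
#include <thread>
std::atomic<int> hits(0); // lock-free shared counter
int main() {
    auto worker = [] {
        for (int i = 0; i < 1000; ++i)
            hits.fetch_add(1, std::memory_order_relaxed); // atomic update
    };
    std::thread a(worker), b(worker);
    a.join(); b.join();
    std::cout << hits.load() << '\n'; // always 2000, no mutex needed
}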
68. Experience using C++
• Recall DIMES:
– Prototype of massively parallel hardware
– Posix-threads style library implementing threads
– C++ thread-as-a-class wrapper implemented
• Summary of experience:
– Hardly object-orientated: no separation in the design
between the Mandelbrot application, the work-stealing
algorithm and the thread pool
– The software insufficiently separated the details of the
hardware features from the design
69. Further experiences using C++
• From this work and other experiences, I
developed a more interesting thread library:
– Traits abstract underlying OS thread API from
wrapper library
– Therefore has hardware abstractions too
– Provision of higher-level threading models:
• Primarily based on futures and thread pools
– Use of thread pools and futures creates a singly-rooted
tree of threads (see the sketch below):
• Trivially deadlock free – a holy grail!
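Not the library's API, but a sketch of why the tree shape matters,
using std::async: each parent waits only on its own children, so the
wait graph can never contain a cycle:
#include <future>
#include <iostream>
int tree_sum(int depth) {
    if (depth == 0) return 1;                 // leaf of the tree
    auto left  = std::async(std::launch::async, tree_sum, depth - 1);
    auto right = std::async(std::launch::async, tree_sum, depth - 1);
    return left.get() + right.get();          // join children only
}
int main() {
    std::cout << tree_sum(4) << '\n';         // 2^4 leaves: prints 16
}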
70. C++0x threads now? No!
• Included in libjmmcg:
– Relies upon non-standard behaviour and broken optimisers! For
example:
• Problems with global code-motion moving apparently const-objects past
locks
• Exception stack is unique to a thread, not global to program and currently
unspecified
• Implementation of std::list
– DSEL has syntax limitations due to design of C++
– Doesn't use boost ...
– Has example code and test cases
– Isn't complete! (e.g. Posix & sequential specialisations
incomplete, some inefficiencies)
– But get it from libjmmcg.sourceforge.net, under LGPL
71. Trivial example usage
struct res_t { int i; };
struct work_type {
    typedef res_t result_type;
    void process(result_type &) {}
};
pool_type pool(2);
async_work work(creator_t::it(work_type()));
execution_context context(pool<<joinable()<<time_critical()<<work);
pool.erase(context);
context->i;
• The devil is in the omitted details: the typedefs for:
– pool_type, async_work, execution_context, joinable, time_critical
• The library requires that the work to be mutated has the
members shown above (result_type and process) defined
72. Explanation of the example
• The concept is:
– that asynchronous work (async_work) that should be
mutated (process) to the specified output type (result_type)
is transferred into a thread pool (pool_type) of some kind
• This transfer (<<) may, optionally (joinable), return a
future (execution_context)
– Which can be used to communicate (->) the result of the
mutation, executed at kernel priority (time_critical), back
to the caller
• The future also allows exceptions to be propagated
73. More details regarding the example
• The thread pool (pool_type) has many traits:
– Master-slave or work-stealing
– The thread API (Win32, Posix or sequential)
– The thread API costs in very rough terms
– Implies a work schedule that is a fifo baker's ticket schedule,
implementation of GSS(k) is in progress
• The library effectively implements a software
simulation of data-flow
• Wrapping a function call, plus parameters, in a class
converts Kevlin's threading model to this one
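A hypothetical sketch of that wrapping (f and f_work are my names;
result_type and process follow the library's convention from the
earlier example):
int f(int x, int y) { return x + y; }   // the plain function to wrap
struct f_work {                         // hypothetical adapter class
    struct result_type { int i; };
    int x, y;                           // the captured parameters
    f_work(int x_, int y_) : x(x_), y(y_) {}
    void process(result_type &r) { r.i = f(x, y); } // the deferred call
};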
74. Time for controversy....
What faces programmers...
• Large-scale parallelism is here, now:
– Blade frames at work:
• 4 cores x 4 CPUs x 20 frames per rack = 320 thread units, in a
NUMA architecture
• The hardware is expensive!
• But so is the software ...
– It must be programmed, economically
– The programs must be maintained ...
• Or it will be an expensive failure?
75. Talk summary
• In this talk we have looked at parallel processing and
justified the reasons for Processing In Memory (PIM) and
therefore cellular architectures
• We have briefly looked at two example architectures:
– Cyclops and picoChip
• Jason has worked on DIMES, the first implementation of a
(cut-down) version of a cellular architecture
• The issues of programming for these massively parallel
architectures have been described
• We focussed on the future of threading in C++
76. The University of Hertfordshire
The Challenges facing Libraries and
Imperative Languages from
Massively Parallel Architectures
Questions?
Jason McGuiness, Computer Science, University of Hertfordshire, UK
Colin Egan, Computer Science, University of Hertfordshire, UK
Building Futures in Computer Science: empowering people through technology