Here are the key points about the C++11 memory model and ordering:
- The C++ memory model aims to balance performance and correctness for concurrent programs: it permits aggressive optimization while giving well-defined semantics to race-free programs (a data race is undefined behavior).
- Operations on atomic types carry memory-ordering properties that restrict how they may be reordered with respect to operations observed by other threads.
- A release fence prevents earlier writes from being reordered after the fence; an acquire fence prevents later reads from being reordered before the fence.
- For the code snippet shown, a thread that reads flag must be guaranteed to see the write to data. This requires an acquire fence after loading flag, so that the subsequent load of data cannot be reordered before the load of flag.
So the correct answer is that it needs an acquire fence after loading flag.
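A minimal sketch of this pattern (the names data and flag, the stored values, and the pairing of std::atomic_thread_fence with relaxed atomic accesses are assumptions for illustration; the original snippet is not reproduced here):

    #include <atomic>
    #include <thread>

    int data = 0;                  // plain, non-atomic payload
    std::atomic<int> flag{0};      // guard variable

    void writer() {
        data = 42;                                            // write the payload
        std::atomic_thread_fence(std::memory_order_release);  // release fence: the write to data cannot move below it
        flag.store(1, std::memory_order_relaxed);             // publish
    }

    void reader() {
        while (flag.load(std::memory_order_relaxed) != 1) {}  // spin until published
        std::atomic_thread_fence(std::memory_order_acquire);  // acquire fence: the read of data cannot move above it
        int r = data;                                         // guaranteed to observe 42
        (void)r;
    }

    int main() {
        std::thread t1(writer), t2(reader);
        t1.join();
        t2.join();
    }

The release fence synchronizes with the acquire fence because the relaxed load of flag reads the value written by the relaxed store that follows the release fence; that synchronization is what makes the write to data visible to the reader.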
4. Memory model (Consistency Model)
• "The memory model specifies the allowed behavior of multithreaded programs executing with shared memory."[1]
• "Consistency (the memory model) provides rules about loads and stores and how they act upon memory."[1]
A contract between software and hardware.
5. Does the computer execute the program you wrote?
NO!
Source code -> compiler -> processor -> caches -> execution
7. How dare they change my code!
• The program that actually executes is not literally the program you wrote.
• Transformations are applied to improve performance,
• as long as they have the same observable effects.
8. Optimizations
// original:
Z = 3
Y = 2
X = 1
// use X, Y, Z

// after reordering (same effect):
X = 1
Y = 2
Z = 3
// use X, Y, Z

// original:
X = 1
Y = 2
X = 3
// use X and Y

// after dead-store elimination (X = 1 is never read):
Y = 2
X = 3
// use X and Y
// original: column-wise traversal, poor cache locality
for (i = 0; i < cols; ++i)
    for (j = 0; j < rows; ++j)
        a[j*cols + i] += 42;

// after loop interchange: row-wise traversal, cache friendly
for (j = 0; j < rows; ++j)
    for (i = 0; i < cols; ++i)
        a[j*cols + i] += 42;
Optimizations are ubiquitous: the compiler and the processor will do whatever they see fit to your code to improve performance.
9. Memory model from HW's perspective
Shared-memory support in multicore computer systems is the source of all these difficulties.
10. Memory architecture
• The effect of a memory operation.
[Diagram: a single core attached to memory, versus Core1, Core2, and Core3 sharing one memory]
Accesses to memory are serialized.
11. Cache (and store buffer)
[Diagram: Core 1 and Core 2, each with a store buffer and a cache, in front of shared memory]
Two issues arise:
a. Coherence (invisible to software).
b. Consistency.
How should stores and loads to memory be ordered?
core 1:
S1: store data = d1
S2: store flag = d2
core 2:
L1: load r1 = flag
B1: if (r1 != d2) goto L1
L2: load r2 = data
Key point:
Writes are not automatically visible to other cores.
Reads and writes are not necessarily performed in program order.
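A minimal C++11 sketch of this two-core pattern (a sketch under assumptions: the names g_data and g_flag are illustrative, and every operation is deliberately relaxed to expose the problem):

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> g_data{0};
std::atomic<int> g_flag{0};

void producer() {
    g_data.store(42, std::memory_order_relaxed);  // S1
    g_flag.store(1, std::memory_order_relaxed);   // S2: may become visible before S1
}

void consumer() {
    while (g_flag.load(std::memory_order_relaxed) != 1) { } // L1/B1: spin on the flag
    // L2: with relaxed ordering, printing 0 here is a permitted outcome.
    std::printf("data = %d\n", g_data.load(std::memory_order_relaxed));
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join(); t2.join();
}

On x86 this will almost always print 42, but on weakly ordered hardware (and under compiler reordering) the relaxed version is allowed to print 0.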
12. Program order & memory order
• Program order: the order of operations as written in the program.
  What the programmer wants.
• Memory order: the order in which the corresponding operations are performed with respect to memory.
  The observed order.
13. Sequential consistency
Program order is the same as memory order for every single thread:
If L(a) <p L(b) then L(a) <m L(b)
If L(a) <p S(b) then L(a) <m S(b)
If S(a) <p S(b) then S(a) <m S(b)
If S(a) <p L(b) then S(a) <m L(b)
Every load gets its value from the last store before it in memory order.
[Diagram: the stores and loads of all cores interleaved into one total order]
Simple & easy to program with.
Performance optimizations, however, are constrained.
14. Total store order (TSO)
Also known as "processor consistency"; used in x86/64, SPARC, etc.
If L(a) <p L(b) then L(a) <m L(b)
If L(a) <p S(b) then L(a) <m S(b)
If S(a) <p S(b) then S(a) <m S(b)
If S(a) <p L(b), then S(a) <m L(b) is NOT guaranteed: a later load may be performed while the earlier store still sits in the store buffer.
Every load gets its value from the last store before it in memory order or in program order (store-buffer forwarding).
[Diagram: a store delayed in the store buffer while a later load completes]
Need a fence to accomplish SC.
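A sketch of the classic store-buffering litmus test that motivates this (the names x, y, r1, r2 are illustrative, not from the slides):

#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2;

void t0() {
    x.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst); // full fence between store and load
    r1 = y.load(std::memory_order_relaxed);
}

void t1() {
    y.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    r2 = x.load(std::memory_order_relaxed);
}

Without the fences, r1 == 0 && r2 == 0 is an observable outcome on TSO machines: each core's load may complete while its own store is still in the store buffer. With the two seq_cst fences that outcome is forbidden.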
15. Memory fence
• Independent memory operations can effectively be performed in an arbitrary order.
• We need a way to instruct the compiler and processor to restrict that reordering.
A memory fence is a per-CPU intervention:
• A fence is not guaranteed to have any effect on other CPUs.
• A fence does not guarantee in what order other CPUs will see the operations.
16. Release consistency
• Provides 2 types of operations (fences):
a) acquire operation: memory operations after it are not allowed to move up across it.
b) release operation: memory operations before it are not allowed to move down across it.
Key observations:
An acquire operation indicates the start of a critical section.
A release operation indicates the end of a critical section.
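A minimal spinlock sketch showing how acquire and release bracket a critical section (an illustration, not code from the slides):

#include <atomic>

class spinlock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        // Acquire: operations inside the critical section cannot move above this.
        while (locked.exchange(true, std::memory_order_acquire)) {
            // spin until the holder releases the lock
        }
    }
    void unlock() {
        // Release: operations inside the critical section cannot move below this.
        locked.store(false, std::memory_order_release);
    }
};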
17. Memory model from SW's perspective
[Diagram: the software memory model layered above the hardware models: x86/64, PowerPC, ARM]
The other part of the contract, for software to obey.
18. Ordering
Down to earth: it is all about the side effects of your program's execution with respect to memory interaction.
a) Memory operations in program order are not necessarily performed in the same memory order.
b) Fences are used to prevent unwanted reordering.
19. How does ordering matter?
1: load(g_y)
2: load(g_x)
3: store(g_x)
4: store(g_y)
Non-deterministic reordering makes a program nearly impossible to reason about.
21. Is reordering that bad?
Yes. No.
It depends.
As long as we don't observe the reordering, it does not matter what happens underneath!
Hardware loves to reorder in order to optimize performance.
Software, however, needs SC to reason about correctness.
22. SC-DRF
• Fully sequential consistency: the ideal world.
  Execute exactly the code you wrote; what most programmers expect.
• SC-DRF: sequential consistency for data-race-free programs; the reality.
A compromise between software and hardware!
As long as you don't write code containing data races, the hardware guarantees you the illusion of fully sequential consistency.
23. Race condition
• A memory location is simultaneously accessed by two or more threads, and at least one thread is a writer.
• Key point: think of the access as a transaction, which needs
1) atomicity: no torn reads or torn writes;
2) visibility: side effects must propagate from thread to thread.
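A small sketch contrasting a racy counter with an atomic one (the names unsafe_counter and safe_counter are illustrative):

#include <atomic>
#include <thread>

int unsafe_counter = 0;            // plain int: concurrent ++ is a data race (undefined behavior)
std::atomic<int> safe_counter{0};  // atomic read-modify-write: no torn reads or writes

void worker() {
    for (int i = 0; i < 100000; ++i) {
        ++unsafe_counter;                                      // data race
        safe_counter.fetch_add(1, std::memory_order_relaxed);  // atomic increment
    }
}

int main() {
    std::thread a(worker), b(worker);
    a.join(); b.join();
    // safe_counter is exactly 200000; unsafe_counter can be anything.
}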
24. Critical section
• Race conditions occur only when we have to manipulate shared variables.
• Create a critical region to serialize the accesses:
  a way to implement a transaction.
25. Critical section
Accesses to shared variables execute inside the section, bracketed by an acquire fence at entry and a release fence at exit.
Good fences make good neighbors.
Reordering within the critical section? Fine, as long as operations don't move out of the section.
A full fence would work, but acquire and release operations are better because they constrain less.
26. C++11 atomic
• Operations on atomic types are performed atomically; they are synchronization operations.
• The user can specify the memory ordering for every load & store.
template <class T> struct atomic {
    bool is_lock_free() const noexcept;
    void store(T, memory_order = memory_order_seq_cst) noexcept;
    T load(memory_order = memory_order_seq_cst) const noexcept;
    T exchange(T, memory_order = memory_order_seq_cst) noexcept;
    bool compare_exchange_weak(T&, T, memory_order, memory_order) noexcept;
    bool compare_exchange_strong(T&, T, memory_order, memory_order) noexcept;
    bool compare_exchange_weak(T&, T, memory_order = memory_order_seq_cst) noexcept;
    bool compare_exchange_strong(T&, T, memory_order = memory_order_seq_cst) noexcept;
};
Synchronization operations specify how assignments in one thread become visible to another.
[C++ standard: 1.10.5]
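A sketch of the canonical compare-exchange loop built on this interface (the function atomic_double is a hypothetical example):

#include <atomic>

std::atomic<int> value{0};

void atomic_double() {
    int expected = value.load();
    // compare_exchange_weak may fail spuriously, so it is used in a loop;
    // on failure it reloads the currently stored value into `expected`.
    while (!value.compare_exchange_weak(expected, expected * 2)) {
        // retry with the freshly observed value in `expected`
    }
}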
27. C++11 memory order
namespace std {
    typedef enum memory_order {
        memory_order_relaxed, // no ordering constraint.
        memory_order_consume, // a weaker version of the acquire semantic.
        memory_order_acquire, // a load using this order is an acquire operation.
        memory_order_release, // a store using this order is a release operation.
        memory_order_acq_rel, // both, for RMW operations: e.g., exchange().
        memory_order_seq_cst  // sequential consistency: like memory_order_acq_rel,
                              // plus a single total order on all memory_order_seq_cst operations.
    } memory_order;
}
Note: these constraints are applied to reads and writes performed on the same memory location.
28. Acquire/release and Consume/release
atomic<int> guard(0);
int pay_load = 0;
// thread 0
pay_load = 1;
guard.store(1, memory_order_release);
// thread 1
int pay;
int g = guard.load(memory_order_acquire);
if (g) pay = pay_load;

atomic<int*> guard(nullptr);
int pay_load = 0;
// thread 0
pay_load = 1;
guard.store(&pay_load, memory_order_release);
// thread 1
int pay;
int* g = guard.load(memory_order_consume);
if (g) pay = *g;

g must carry a dependency into pay = *g: a data dependency.
On most weakly ordered architectures, memory ordering between data-dependent instructions is preserved; in such a case an explicit memory fence is not necessary.[7]
(Note: in practice, mainstream compilers currently implement memory_order_consume by promoting it to memory_order_acquire.)
29. memory_order_seq_cst
• Orders memory operations the same way as release and acquire,
• plus establishes a single total order on all memory_order_seq_cst operations.
Suppose x, y are atomic variables initialized to 0.[6]
Thread 1:
x = 1;
Thread 2:
y = 1;
Thread 3:
if (y == 1 && x == 0) cout << "y first";
Thread 4:
if (y == 0 && x == 1) cout << "x first";
It must not be possible for both messages to print.
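A runnable sketch of that four-thread example. With the default memory_order_seq_cst there is a single total order over all four operations, so at most one message prints; with only acquire/release, both could print on some architectures:

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};

int main() {
    std::thread t1([]{ x.store(1); });  // seq_cst by default
    std::thread t2([]{ y.store(1); });
    std::thread t3([]{ if (y.load() == 1 && x.load() == 0) std::puts("y first"); });
    std::thread t4([]{ if (y.load() == 0 && x.load() == 1) std::puts("x first"); });
    t1.join(); t2.join(); t3.join(); t4.join();
}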
30. C++11 memory fence
extern "C" void atomic_thread_fence(memory_order order) noexcept;
• It is different from a traditional hardware fence.
• It is better thought of as a way to perform synchronization.

data.store(3, std::memory_order_relaxed);
std::atomic_thread_fence(std::memory_order_release);
flag.store(1, std::memory_order_relaxed);
flag2.store(2, std::memory_order_relaxed);

The code above is NOT equivalent to the following:

data.store(3, std::memory_order_relaxed);
flag.store(1, std::memory_order_release);
flag2.store(2, std::memory_order_relaxed);

A release fence prevents all preceding memory operations from being reordered past all subsequent writes, so in the first version data.store() stays before both flag.store() and flag2.store(). In the second version only flag.store() is a release operation, so flag2.store() is allowed to be reordered before data.store().

// other memory operations preceding the fence.
std::atomic_thread_fence(std::memory_order_release);
flag.store(1, std::memory_order_relaxed);

An acquire fence prevents all subsequent memory operations from being reordered before any preceding read.

int f = flag.load(std::memory_order_relaxed);
std::atomic_thread_fence(std::memory_order_acquire);
// other memory operations.
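Putting the two fences together gives the standard message-passing idiom (a sketch; data and flag are assumed to be std::atomic<int> initialized to 0):

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> data{0}, flag{0};

void writer() {
    data.store(3, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release); // keeps data.store above flag.store
    flag.store(1, std::memory_order_relaxed);
}

void reader() {
    while (flag.load(std::memory_order_relaxed) != 1) { }
    std::atomic_thread_fence(std::memory_order_acquire); // pairs with the release fence
    assert(data.load(std::memory_order_relaxed) == 3);   // guaranteed to see the write
}

int main() {
    std::thread w(writer), r(reader);
    w.join(); r.join();
}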
31. Quiz
Hint:
a. Need an acquire before the load of g_y in foo1().
b. Need an acquire before the load of g_x in foo2().
Can we accomplish that?
Acquire/release are pairwise operations.
State which memory orders are needed to prevent the reordering.
32. Quiz: Peterson's algo again.
atomic<int> g_victim;
atomic<bool> g_flag[2];
void lock1()
{
    g_flag[0].store(true, ?);
    g_victim.store(0, ?);
    while (g_flag[1].load(?) && g_victim.load(?) == 0);
    // lock acquired.
}
void unlock1()
{
    g_flag[0].store(false, ?);
}
Thread 0
Store(g_flag[0])
Store(g_victim)
Load(g_flag[1])
Load(g_victim)
Thread 1
Store(g_flag[1])
Store(g_victim)
Load(g_flag[0])
Load(g_victim)
atomic<int> g_victim;
atomic<bool> g_flag[2];
void lock1()
{
    g_flag[0].store(true, memory_order_relaxed);
    g_victim.exchange(0, memory_order_acq_rel);
    while (g_flag[1].load(memory_order_acquire)
           && g_victim.load(memory_order_relaxed) == 0);
    // lock acquired.
}
void unlock1()
{
    g_flag[0].store(false, memory_order_release);
}
Atomic read-modify-write operations shall always read the last value (in the modification order) written
before the write associated with the read-modify-write operation.[standard §29.3.12]
33. A few terms: synchronizes-with
• An operation A synchronizes-with an operation B if:
1) A is a store to some atomic variable m, with an ordering of std::memory_order_release or std::memory_order_seq_cst;
2) B is a load from the same variable m, with an ordering of std::memory_order_acquire or std::memory_order_seq_cst;
3) and B reads the value stored by A.
Thread 1:
Data = 42
Flag = 1
Thread 2:
R1 = Flag
if (R1 == 1) use Data
34. A few terms: dependency-ordered before
An operation A is dependency-ordered before an operation B if:
1) A is a store to some atomic variable m, with an ordering of std::memory_order_release or std::memory_order_seq_cst;
2) B is a load from the same variable m, with an ordering of std::memory_order_consume;
3) and B reads the value stored by the release sequence headed by A.
Thread 1:
Data = 42
Flag = &Data
Thread 2:
R1 = Flag
if (R1) use *R1
35. A few terms: happens-before
An evaluation A happens-before B if:
• A is sequenced-before B (the order of evaluations within a single thread),
• or A synchronizes-with B,
• or A is dependency-ordered before B,
• or through concatenations of the above 3 relationships, with 2 exceptions. [standard 1.10.11]
Happens-before indicates visibility.
36. volatile
• A compiler-level semantic only:
  the compiler guarantees no reordering and no elimination of accesses to this variable;
  other threads get no such guarantee;
  it has nothing to do with inter-thread synchronization.
• volatile accesses are not atomic operations.
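A sketch contrasting volatile with std::atomic for thread signaling (the names v_ready, a_ready, and payload are illustrative):

#include <atomic>

volatile bool v_ready = false;    // forces the compiler to emit each access,
                                  // but gives no atomicity and no inter-thread ordering
std::atomic<bool> a_ready{false}; // the correct tool for signaling between threads

int payload = 0;

void publish() {
    payload = 42;
    // v_ready = true;  // wrong: no happens-before edge, and a data race with the reader
    a_ready.store(true, std::memory_order_release); // publishes payload
}

void consume_payload() {
    while (!a_ready.load(std::memory_order_acquire)) { }
    // payload == 42 is guaranteed to be visible here.
}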