This document discusses memory ordering and synchronization in multithreaded programs. It begins with background on mutexes, semaphores, and their differences. It then discusses problems that can occur with lock-based synchronization, such as deadlock, priority inversion, and performance issues. Lock-free programming techniques using atomic operations are presented as a way to synchronize access without locks. Finally, memory ordering, consistency models, and barriers, together with their implementations in compilers, the Linux kernel, and the ARM architecture, are covered in detail.
3. Background
• Synchronization of multithreaded programs
  – Mutex (mutual exclusion)
    • Ensures that no two processes or threads are in their critical sections at the same time
      – Here, a critical section is a period during which the process accesses a shared resource, such as shared memory
4. Background
  – Semaphore
    • A mutex is essentially the same thing as a binary semaphore, and sometimes uses the same basic implementation
    • However, the term "mutex" describes a construct that prevents two processes from accessing a shared resource concurrently
    • The term "binary semaphore" describes a construct that limits access to a single resource
    • In many cases a mutex has a concept of an "owner"
      – The process that locked the mutex is the only process allowed to unlock it. In contrast, semaphores generally do not have this restriction
  – Semaphore vs. mutex
    • http://www.kernel.org/doc/Documentation/mutex-design.txt
5. Synchronization and mutex
Common synchronization methods
Reference: http://msdn.microsoft.com/en-us/library/ms810047.aspx

Common synchronization methods (synchronization method / description / Windows mechanisms):
  – Interlocked operations: Provide atomic logical, arithmetic, and list-manipulation operations that are both thread-safe and multiprocessor-safe. (InterlockedXxx and ExInterlockedXxx routines)
  – Mutexes: Provide (mutually) exclusive access to memory. (Spin locks, fast mutexes, kernel mutexes, synchronization events)
  – Shared/exclusive lock: Allows one thread to write or many threads to read the protected data. (Executive resources)
  – Counted semaphore: Allows a fixed number of acquisitions. (Semaphores)

Windows mutex mechanisms (type of mutex / IRQL considerations / recursion and thread details):
  – Interrupt spin lock: Acquisition raises IRQL to DIRQL and returns the previous IRQL to the caller. Not recursive; release on the same thread as acquire.
  – Spin lock: Acquisition raises IRQL to DISPATCH_LEVEL and returns the previous IRQL to the caller. Not recursive; release on the same thread as acquire.
  – Queued spin lock: Acquisition raises IRQL to DISPATCH_LEVEL and stores the previous IRQL in the lock owner handle. Not recursive; release on the same thread as acquire.
  – Fast mutex: Acquisition raises IRQL to APC_LEVEL and stores the previous IRQL in the lock. Not recursive; release on the same thread as acquire.
  – Kernel mutex (a kernel dispatcher object): Enters a critical region upon acquisition and leaves it upon release. Recursive; release on the same thread as acquire.
  – Synchronization event (a kernel dispatcher object): Acquisition does not change IRQL. Wait at IRQL <= APC_LEVEL and signal at IRQL <= DISPATCH_LEVEL. Not recursive; release on the same thread or on a different thread.
  – Unsafe fast mutex: Acquisition does not change IRQL. Acquire and release at IRQL <= APC_LEVEL. Not recursive; release on the same thread as acquire.
6. What is wrong with mutexes?
• Mutexes are perfectly fine, but you have a problem if there is lock contention
  – If you want your algorithm to be fast, you want to use the available cores as much as possible instead of letting them sleep
  – A thread can hold a mutex and be de-scheduled by the CPU (because of a cache miss, or because its time slice is over); then all the threads that want to acquire this mutex will be blocked
  – And if you have a lot of blocking, the OS also needs to do more context switches, which are expensive because they clear the caches
Reference: http://woboq.com/blog/introduction-to-lockfree-programming.html
7. What is wrong with mutexes?
• Problems with locking
  – Deadlock
  – Priority inversion
    • A low-priority process holds a lock required by a higher-priority process
  – Convoying
    • All the other processes slow to the speed of the slowest one
  – Async-signal safety
    • Signal handlers can't use lock-based primitives
  – Kill-tolerant availability
    • What happens if threads are killed or crash while holding locks?
  – Pre-emption tolerance
    • What happens if you're pre-empted while holding a lock?
  – Overall performance
Reference: http://www.cs.cmu.edu/~410-s05/lectures/L31_LockFree.pdf
8. So how can we do it without locking?
• Lock-free programming
  – Thread-safe access to shared data without the use of synchronization primitives such as mutexes
  – Practical with hardware support
    • Modern CPUs have something called atomic operations
    • The use of shared memory and an atomic instruction provides the mutual exclusion
9. Atomic operations
• Atomic operations
  – Processors have instructions that can be used to implement lock-free and wait-free algorithms
    • Atomic read-write
    • Atomic swap, also called XCHG
    • Test-and-set
    • Fetch-and-add
    • Compare-and-swap (CAS)
      – Compare and Exchange (CMPXCHG) instruction in the x86 and Itanium architectures
      – ABA problem
        » http://woboq.com/blog/introduction-to-lockfree-programming.html
Reference:
http://en.wikipedia.org/wiki/Atomic_operation
http://en.wikipedia.org/wiki/Read-modify-write
10. Atomic operations
• Load-Link/Store-Conditional
  – The LDREX and STREX instructions in ARM split the operation of atomically updating memory into two separate steps. Together, they provide atomic updates in conjunction with exclusive monitors that track exclusive memory accesses. Load-Exclusive and Store-Exclusive must only access memory regions marked as Normal
  – For example
    » LDREX R1, [R0] performs a Load-Exclusive from the address in R0, places the value into R1, and updates the exclusive monitor(s)
    » STREX R2, R1, [R0] performs a Store-Exclusive operation to the address in R0, conditionally storing the value from R1 and indicating success or failure in R2
  – Exclusive accesses to memory locations marked as Non-shareable are checked only against the local monitor. Exclusive accesses to memory locations marked as Shareable are checked against both the local monitor and the global monitor
Reference:
http://infocenter.arm.com/help/topic/com.arm.doc.dht0008a/ch01s02s01.html
http://infocenter.arm.com/help/topic/com.arm.doc.dht0008a/CJAGCFAF.html
11. Atomic operations
• GCC built-in functions for atomic memory access
  – http://gcc.gnu.org/onlinedocs/gcc-4.6.3/gcc/Atomic-Builtins.html
• Atomic operations supported in the Linux kernel
  – https://www.kernel.org/doc/Documentation/atomic_ops.txt
• Atomic operations supported in C11/C++11
  – C11 defines a new _Atomic() type specifier. You can declare an atomic integer like this:
      _Atomic(int) counter;
  – C++11 moves this declaration into the standard library:
      #include <atomic>
      std::atomic<int> counter;
Reference:
http://www.informit.com/articles/article.aspx?p=1832575
12. Atomic operations
• Is an atomic operation enough?
• Linux-v3.7.8/arch/arm/include/asm/atomic.h

static inline int atomic_cmpxchg(atomic_t *ptr, int old, int new)
{
	unsigned long oldval, res;

	smp_mb();

	do {
		__asm__ __volatile__("@ atomic_cmpxchg\n"
		"ldrex	%1, [%3]\n"
		"mov	%0, #0\n"
		"teq	%1, %4\n"
		"strexeq %0, %5, [%3]\n"
		    : "=&r" (res), "=&r" (oldval), "+Qo" (ptr->counter)
		    : "r" (&ptr->counter), "Ir" (old), "r" (new)
		    : "cc");
	} while (res);

	smp_mb();

	return oldval;
}

Note the smp_mb() memory barriers before and after the LDREX/STREX retry loop: the atomic operation alone is not enough.
Reference:
http://lxr.linux.no/#linux+v3.7.8/arch/arm/include/asm/atomic.h#L115
13. Memory barrier
Before talking about memory barriers, let's look at memory ordering first.
14. Memory ordering
• Memory ordering - memory access ordering
  – Program order
    • The order of the program's object code as seen by the CPU, which might differ from the order in the source code due to compiler optimizations
  – Execution order
    • Can differ from program order due to both compiler and CPU implementation optimizations
  – Perceived order
    • Can differ from the execution order due to caching, interconnect, and memory-system optimizations
• Why memory reordering?
  – Performance!
Reference:
http://www.rdrop.com/users/paulmck/scalability/paper/ordering.2007.09.19a.pdf
http://preshing.com/20120930/weak-vs-strong-memory-models
15. Memory consistency models
• Memory models - memory consistency models
• Sequential consistency
  – All reads and all writes are in order
• Relaxed consistency
  – Some types of reordering are allowed
    • Loads can be reordered after loads (for better working of cache coherency, better scaling)
    • Loads can be reordered after stores
    • Stores can be reordered after stores
    • Stores can be reordered after loads
• Weak consistency
  – Reads and writes are arbitrarily reordered, limited only by explicit memory barriers
16. Weak vs. strong memory models
Reference: http://preshing.com/20120930/weak-vs-strong-memory-models
17. Memory ordering in some architectures
SPARC TSO = total store order (default)
SPARC RMO = relaxed memory order (not supported on recent CPUs)
SPARC PSO = partial store order (not supported on recent CPUs)

Reordering permitted            Alpha ARMv7 PA-RISC POWER RMO PSO TSO x86 x86-oostore AMD64 IA-64 zSeries
Loads reordered after loads       Y     Y     Y      Y     Y   -   -   -      Y         -     Y      -
Loads reordered after stores      Y     Y     Y      Y     Y   -   -   -      Y         -     Y      -
Stores reordered after stores     Y     Y     Y      Y     Y   Y   -   -      Y         -     Y      -
Stores reordered after loads      Y     Y     Y      Y     Y   Y   Y   Y      Y         Y     Y      Y
Atomics reordered with loads      Y     Y     -      Y     Y   -   -   -      -         -     Y      -
Atomics reordered with stores     Y     Y     -      Y     Y   Y   -   -      -         -     Y      -
Dependent loads reordered         Y     -     -      -     -   -   -   -      -         -     -      -
Incoherent instruction cache/pipeline: permitted on most of these architectures (10 of the 12 columns in the original table)

Reference: http://en.wikipedia.org/wiki/Memory_ordering
18. Types of Memory Barrier
• #LoadLoad
– Ensures that loads before the barrier complete before any loads after it
• #StoreStore
– Ensures that stores before the barrier are visible before any stores after it
• #LoadStore
– Ensures that loads before the barrier complete before any stores after it
• #StoreLoad
– A StoreLoad barrier ensures that all stores performed before the barrier are visible to other
processors, and that all loads performed after the barrier receive the latest value that is visible at the
time of the barrier
Reference:
http://preshing.com/20120710/memory-barriers-are-like-source-control-operations
// Thread 1: publish the data, then set the flag
Value = x;            // Publish some data
STORESTORE_FENCE();   // Prevent reordering of the stores
IsPublished = 1;      // Set shared flag to indicate availability of data

// Thread 2: check the flag, then read the data
if (IsPublished)      // Load and check shared flag
{
    LOADLOAD_FENCE(); // Prevent reordering of the loads
    return Value;     // Load published value
}
19. Memory barrier in compiler
• GCC compiler memory barrier
– These barriers prevent a compiler from reordering instructions,
they do not prevent reordering by CPU.
• GCC support for hardware memory barriers
– This builtin issues a full memory barrier.
Reference:
http://en.wikipedia.org/wiki/Memory_ordering
http://gcc.gnu.org/onlinedocs/gcc-4.6.3/gcc/Atomic-Builtins.html
asm volatile("" ::: "memory");
or
__asm__ __volatile__ ("" ::: "memory");
__sync_synchronize();
20. Memory barriers in Linux kernel
• General barrier
– barrier()
• Compiler barrier only. The compiler will not reorder memory accesses from one side of this
statement to the other. This has no effect on the order that the processor actually executes
the generated instructions.
• Mandatory barriers
– mb()
• A full system memory barrier. All memory operations before the mb() in the instruction
stream will be committed before any operations after the mb() are committed. This ordering
will be visible to all bus masters in the system. It will also ensure the order in which
accesses from a single processor reaches slave devices.
– rmb()
• Like mb(), but only guarantees ordering between read accesses. That is, all read
operations before an rmb() will be committed before any read operations after the rmb().
– wmb()
• Like mb(), but only guarantees ordering between write accesses. That is, all write
operations before a wmb() will be committed before any write operations after the wmb().
Reference:
http://blogs.arm.com/software-
enablement/448-memory-access-ordering-
part-2-barriers-and-the-linux-kernel/
http://www.kernel.org/doc/Documentation/
memory-barriers.txt
21. Memory barriers in Linux kernel
• SMP conditional barriers
– smp_mb()
• Similar to mb(), but only guarantees ordering between cores/processors within an
SMP system. All memory accesses before the smp_mb() will be visible to all cores
within the SMP system before any accesses after the smp_mb().
– smp_rmb()
• Like smp_mb(), but only guarantees ordering between read accesses.
– smp_wmb()
• Like smp_mb(), but only guarantees ordering between write accesses.
– SMP barriers are a subset of mandatory barriers, not a superset.
• An SMP barrier cannot replace a mandatory barrier, but a mandatory barrier can
replace an SMP barrier.
• Implicit barriers
– Locking constructs in the kernel act as implicit SMP barriers, in the same way
as pthread synchronization operations do in user space.
– I/O accessor macros (readb(), iowrite32()) for the ARM architecture act as
explicit memory barriers when kernel is compiled with
CONFIG_ARM_DMA_MEM_BUFFERABLE. This was added in linux-2.6.35.
• arch/arm/include/asm/io.h
• arch/arm/mm/Kconfig
Reference:
https://www.kernel.org/doc/Documentation/memory-barriers.txt
23. Memory ordering in ARM Architecture
• Memory types
– Normal memory
• Normal memory is effectively for all of your data and executable code
• This memory type permits speculative reads, merging of accesses and repeating of
reads without side effects
• Accesses to Normal memory can always be buffered, and in most situations they
are also cached - but they can be configured to be uncached
• There is no implicit ordering of Normal memory accesses
– Device memory and Strongly-ordered memory
• Used with memory mapped peripherals or other control registers
• Processors implementing the LPAE treat Device and Strongly-ordered memory
regions identically
• ARMv7-A processors that do not implement the LPAE can set device memory to be
Shareable or Non-shareable
• Accesses to these types of memory must happen exactly the number of times that
executing the program suggests they should
• There is no guarantee about ordering between memory accesses to different
devices, or usually between accesses of different memory types
Reference:
http://blogs.arm.com/software-enablement/594-memory-access-ordering-part-3-memory-access-ordering-in-the-arm-architecture/
24. Memory ordering in ARM Architecture
• Attributes of the ARM memory types
– Normal
• Shareable or Non-shareable
• Cacheable or Non-cacheable
– Device (w/o LPAE)
• Shareable or Non-shareable
– Device (w LPAE)
• Always shareable
– Strongly-ordered
• Always shareable
• Accesses must wait for the slave's acknowledgement before completing
ARM ® Architecture Reference
Manual
ARMv7-A and ARMv7-R edition
25. Memory ordering in ARM Architecture
• Figure A3-5 shows the memory ordering between two explicit accesses A1 and A2,
where A1 occurs before A2 in program order
– For Device and Strongly-ordered accesses, accesses must arrive at any particular
memory-mapped peripheral or block of memory in program order, that is, A1 must arrive
before A2. There are no ordering restrictions about when accesses arrive at different
peripherals or blocks of memory.
– Normal memory accesses can arrive at any memory-mapped peripheral or block of
memory in any order.
26. Memory ordering in ARM Architecture
• Barriers
– Barriers were introduced progressively into the ARM architecture
• Some ARMv5 processors, such as the ARM926EJ-S, implemented a Drain Write
Buffer cp15 operation, which halted execution until any buffered writes had drained
into the external memory system
• With the introduction of the ARMv6 memory model, this operation was redefined in
more architectural terms and became the Data Synchronization Barrier
– ARMv6 also introduced the new Data Memory Barrier and Flush Prefetch Buffer
cp15 operations
• ARMv7 evolved the memory model somewhat, extending the meaning of the
barriers - and the Flush Prefetch Buffer operation was renamed the Instruction
Synchronization Barrier
• ARMv7 also allocated dedicated instruction encodings for the barrier operations
– Use of the cp15 operations is now deprecated and software targeting ARMv7 or
later should use the DMB, DSB and ISB mnemonics.
• And finally, ARMv7 extended the Shareability concept to cover both Inner-shareable
and Outer-shareable domains
– This together with AMBA4 ACE gives us barriers that propagate into the memory
system
27. Memory ordering in ARM Architecture
– Instruction Synchronization Barrier (ISB)
• The ISB ensures that any subsequent instructions are fetched anew
from cache or memory, so that privilege and access permissions are
checked against the current MMU configuration
– It is used to ensure any previously executed context-changing
operations will have completed by the time the ISB completes
• Access type and domain are not really relevant for this barrier
– It is not used in any of the Linux memory barrier primitives, but
appears in memory management, cache control and context
switching code
28. Memory ordering in ARM Architecture
– Data Memory Barrier (DMB)
• DMB prevents reordering of data accesses instructions across itself
– All data accesses by this processor/core before the DMB will be
visible to all other masters within the specified shareability domain
before any of the data accesses after it
– It also ensures that any explicit preceding data/unified cache
maintenance operations have completed before any subsequent
data accesses are executed
– The DMB instruction takes two optional parameters: an operation
type (stores only - 'ST' - or loads and stores) and a domain
– The default operation type is loads and stores and the default
domain is System
• In the Linux kernel, the DMB instruction is used for the smp_*mb()
macros
29. Memory ordering in ARM Architecture
– Data Synchronization Barrier (DSB)
• DSB enforces the same ordering as the Data Memory Barrier
– But it also blocks execution of any further instructions until
synchronization is complete
– It also waits until all cache and branch predictor maintenance
operations have completed for the specified shareability domain
– If the access type is load and store then it also waits for any TLB
maintenance operations to complete.
• In the Linux kernel, the DSB instruction is used for the *mb() macros.
30. Memory ordering in ARM Architecture
• Shareability domains
– Shareability domains define "zones" within the bus topology within which memory
accesses are to be kept consistent (taking place in a predictable way) and
potentially coherent (with hardware support)
– Outside of this domain, observers might not see the same order of memory
accesses as inside it

Domain           Abbreviation  Description
Non-shareable    NSH           A domain consisting only of the local agent. Accesses that never need to be
                               synchronized with other cores, processors or devices. Not normally used in
                               SMP systems.
Inner Shareable  ISH           A domain potentially shared by multiple agents, but usually not all agents in
                               the system. A system can have multiple Inner Shareable domains. An operation
                               that affects one Inner Shareable domain does not affect other Inner Shareable
                               domains in the system.
Outer Shareable  OSH           A domain almost certainly shared by multiple agents, and quite likely
                               consisting of several Inner Shareable domains. An operation that affects an
                               Outer Shareable domain also implicitly affects all Inner Shareable domains
                               within it. For processors such as the Cortex-A15 MPCore that implement the
                               LPAE, all Device memory accesses are considered Outer Shareable. For other
                               processors, the shareability attribute can be set explicitly (to Shareable
                               or Non-shareable).
Full system      SY            An operation on the full system affects all agents in the system: all
                               Non-shareable regions, all Inner Shareable regions and all Outer Shareable
                               regions. Simple peripherals such as UARTs, and several more complex ones, do
                               not normally need to be placed in a restricted shareability domain.
Reference:
http://infocenter.arm.com/help/topic/com.arm.doc.dui0489c/CIHGHHIE.html
ARMv7
31. Memory ordering in ARM Architecture
Allocated values for the data barriers (DMB/DSB) - ARMv8
32. Memory ordering in ARM Architecture
• The shareability domains example
4 cores per cluster,
2 clusters per chip
34. Memory model supported in C++11
• C++ Memory model
– Sequential consistent/acquire-release/relaxed
• http://en.cppreference.com/w/cpp/atomic/memory_order
• http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html
35. Acquire and Release Semantics
• ARMv8 AArch64/AArch32 support load-acquire/store-release
instructions
– The Load-Acquire/Store-Release instructions can remove the requirement to use
the explicit DMB memory barrier instruction
Reference:
http://preshing.com/20120913/acquire-and-release-semantics
http://www.arm.com/files/downloads/ARMv8_Architecture.pdf
Acquire semantics is a property which can only apply to
operations which read from shared memory. The operation is
then considered a read-acquire. Acquire semantics prevent
memory reordering of the read-acquire with any read or write
operation which follows it in program order.
Release semantics is a property which can only apply to
operations which write to shared memory. The operation is then
considered a write-release. Release semantics prevent memory
reordering of the write-release with any read or write operation
which precedes it in program order.
36. Acquire and Release Semantics
• A demo example

// Without ordering
// Shared global variables
int A = 0;
int Ready = 0;

// Thread 1
A = 42;
Ready = 1;

// Thread 2
int r1 = Ready;
int r2 = A;

// Possible results: r1 = 0, r2 = 0
//                   r1 = 0, r2 = 42
//                   r1 = 1, r2 = 0
//                   r1 = 1, r2 = 42

// With acquire/release
// Shared global variables
int A = 0;
std::atomic<int> Ready{0};

// Thread 1
A = 42;
Ready.store(1, std::memory_order_release);

// Thread 2
int r1 = Ready.load(std::memory_order_acquire);
int r2 = A;

// Possible results: r1 = 0, r2 = 0
//                   r1 = 0, r2 = 42
//                   r1 = 1, r2 = 42
37. Acquire and Release Semantics
• A Write-Release Can Synchronize-With a Read-Acquire
// Thread 1
void SendTestMessage(void* param)
{
// Copy to shared memory using non-atomic stores.
g_payload.tick = clock();
g_payload.str = "TestMessage";
g_payload.param = param;
// Perform an atomic write-release to indicate that the message is ready.
g_guard.store(1, std::memory_order_release);
}
// Thread 2
bool TryReceiveMessage(Message& result)
{
// Perform an atomic read-acquire to check whether the message is ready.
int ready = g_guard.load(std::memory_order_acquire);
if (ready != 0)
{
// Yes. Copy from shared memory using non-atomic loads.
result.tick = g_payload.tick;
result.str = g_payload.str;
result.param = g_payload.param;
return true;
}
// No.
return false;
}
Reference:
http://preshing.com/20130823/the-synchronizes-with-relation/
39. Volatile vs. memory-order/atomic
• What does the volatile keyword mean?
Reference:
http://www.drdobbs.com/parallel/volatile-vs-volatile/212701484
40. Volatile vs. memory-order/atomic
• C programmers have often taken volatile to mean that the variable could
be changed outside of the current thread of execution
– as a result, they are sometimes tempted to use it in kernel code
when shared data structures are being used
– In other words, they have been known to treat volatile types as a sort
of easy atomic variable, which they are not
– The use of volatile in kernel code is almost never correct
• The key point to understand with regard to volatile is that its purpose is to
suppress optimization, which is almost never what one really wants to do
• In the kernel, one must protect shared data structures against unwanted
concurrent access, which is very much a different task
• Like volatile, the kernel primitives which make concurrent access to data
safe (spinlocks, mutexes, memory barriers, etc.) are designed to prevent
unwanted optimization. If they are being used properly, there will be no
need to use volatile as well
Reference:
https://www.kernel.org/doc/Documentation/volatile-considered-harmful.txt
41. Volatile vs. memory-order/atomic
• To safely write lock-free code that communicates between threads without using
locks
– prefer to use ordered atomic variables
– Java/.NET volatile, C++0x atomic<T>, and C-compatible atomic_T
• To safely communicate with special hardware or other memory that has unusual
semantics
– use un-optimizable variables: ISO C/C++ volatile
– Remember that reads and writes of these variables are not necessarily
atomic
• To protect shared data structures against unwanted concurrent access in kernel
code
– use kernel concurrent access primitives, like spinlocks, mutexes, memory
barriers
• Finally, to express a variable that both has unusual semantics and has any or all
of the atomicity and/or ordering guarantees needed for lock-free coding
– only the ISO C++11 Standard provides a direct way to spell it: volatile
atomic<T>
43. Usage of memory barrier
instructions
• In what situations might I need to insert memory barrier instructions?
– Mutexes
Reference:
http://infocenter.arm.com/help/topic/com.arm.doc.genc007826/Barrier_Litmus_Tests_and_Cookbook_A08.pdf
http://infocenter.arm.com/help/topic/com.arm.doc.faqs/ka14041.html
LOCKED EQU 1
UNLOCKED EQU 0
lock_mutex
; Is mutex locked?
LDREX r1, [r0] ; Check if locked
CMP r1, #LOCKED ; Compare with "locked"
WFEEQ ; Mutex is locked, go into standby
BEQ lock_mutex ; On waking re-check the mutex
; Attempt to lock mutex
MOV r1, #LOCKED
STREX r2, r1, [r0] ; Attempt to lock mutex
CMP r2, #0x0 ; Check whether store completed
BNE lock_mutex ; If store failed, try again
DMB ; Required before accessing protected resource
BX lr
unlock_mutex
DMB ; Ensure accesses to protected resource have completed
MOV r1, #UNLOCKED ; Write "unlocked" into lock field
STR r1, [r0]
DSB ; Ensure update of the mutex occurs before other CPUs wake
SEV ; Send event to other CPUs, wakes any CPU waiting on using WFE
BX lr
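A portable C++11 analogue of the mutex above (a sketch, not the kernel's or ARM's implementation; `run` and the counter are illustrative). On ARM, test_and_set compiles to an exclusive load/store loop like the LDREX/STREX above, and the acquire ordering on lock plus release ordering on unlock stand in for the two DMBs; there is no portable equivalent of WFE/SEV, so this version simply spins.

```cpp
#include <atomic>
#include <thread>
#include <vector>

std::atomic_flag mutex_flag = ATOMIC_FLAG_INIT;
int protected_counter = 0;

void lock_mutex() {
    // Acquire ordering: accesses to the protected resource cannot be
    // hoisted above the lock (the role of the DMB after STREX succeeds).
    while (mutex_flag.test_and_set(std::memory_order_acquire))
        ; // spin until the flag was previously clear
}

void unlock_mutex() {
    // Release ordering: accesses to the protected resource cannot sink
    // below the unlock (the role of the DMB before the STR).
    mutex_flag.clear(std::memory_order_release);
}

int run(int nthreads, int per_thread) {
    protected_counter = 0;
    std::vector<std::thread> ts;
    for (int i = 0; i < nthreads; ++i)
        ts.emplace_back([per_thread] {
            for (int j = 0; j < per_thread; ++j) {
                lock_mutex();
                ++protected_counter; // the protected resource
                unlock_mutex();
            }
        });
    for (auto& t : ts) t.join();
    return protected_counter;
}
```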
44. Usage of memory barrier instructions
– Memory Remapping
• Consider a situation where your reset handler/boot code lives in Flash memory (ROM),
which is aliased to address 0x0 to ensure that your program boots correctly from the vector
table, which normally resides at the bottom of memory (see left-hand-side memory map).
• After you have initialized your system, you may wish to turn off the Flash memory alias so
that you can use the bottom portion of memory for RAM (see right-hand-side memory
map). The following code (running from the permanent Flash memory region) disables the
Flash alias, then calls a memory block copying routine (e.g., memcpy) to copy some code
to the bottom portion of memory (RAM) and executes it.
• Question: where must the barriers be placed? One correct placement:

MOV r0, #0
MOV r1, #REMAP_REG
STR r0, [r1]            ; Disable Flash alias
DMB                     ; Ensure the str above has completed
BL block_copy_routine() ; Block copy code into RAM
DSB                     ; Ensure the block copy has completed
ISB                     ; Flush the pipeline so the processor fetches the new instructions
BL copied_routine()     ; Execute copied routine (now in RAM)
45. Usage of memory barrier instructions
– Self-modifying code
– If the memory the block copy routine writes to is marked as 'cacheable', the
instruction cache must be invalidated so that the processor does not execute
stale 'cached' code.
– For 'write-back' regions, the data cache must be cleaned before the instruction
cache is invalidated.
Overlay_manager
; ...
BL block_copy            ; Copy new routine from ROM to RAM
DSB                      ; Ensure block copy has completed
ISB                      ; Flush pipeline to ensure processor fetches new instructions
B relocated_code         ; Branch to new routine

Overlay_manager
; ...
BL block_copy            ; Copy new routine from ROM to RAM
DSB                      ; Ensure block copy has completed
data_cache_clean         ; Clean the cache so that the new routine is written out to memory
DSB                      ; Ensure data cache clean has completed
icache_and_pb_invalidate ; Invalidate the instruction cache and branch predictor so that the
                         ; old routine is no longer cached
DSB                      ; Ensure invalidate has completed
ISB                      ; Flush pipeline to ensure processor fetches new instructions
B relocated_code         ; Branch to new routine