The document discusses concurrent programming theory and models. It begins with basic definitions of processes, threads, and shared memory models. It then covers formal models of computation including shared objects and registers. Key concepts discussed include linearizability, happens-before relations, mutual exclusion, and non-blocking algorithms.
Multithreaded Programming – Theory and Practice (Roman Elizarov)
Multi-core processors are used in all servers, workstations, and mobile devices. Writing multithreaded programs is necessary for vertical scalability, but, unlike single-threaded programs, they are much harder to debug and test for correctness. It is important to understand exactly which guarantees particular language constructs and libraries provide under multithreaded execution, and which pitfalls can break the correctness of code. The talk contains a brief introduction to the theory of multithreaded programming. We will look at the theoretical models used to describe the behavior of multithreaded programs. The notions of sequential consistency and linearizability will be covered (with examples), and it will be explained why a practicing programmer needs all of this. It will be shown how these notions apply to the Java memory model, with code examples that produce unexpected results from the point of view of someone unfamiliar with it.
The talk was prepared for the Java Point Student Day 2016 conference.
2. What? For whom?
• Practical experience in writing concurrent programs is assumed
  - Here, concurrent == using shared memory
  - Assuming the audience knows and has used in practice locks, synchronized sections, compare-and-set, etc.
  - Knowledge of “Java Concurrency in Practice” is a plus!
• The theory behind the practical constructs will be explained
  - Formal models
  - Key definitions
  - Important facts and theorems (without proofs)
  - Practical corollaries
• But some concepts are simplified
3. Just a reminder: the free lunch is over
http://www.gotw.ca/publications/concurrency-ddj.htm
4. Basic definitions
• A process owns memory and other resources in the OS
• A thread of execution is defined by its current instruction pointer, stack pointer, and other registers
  - Threads execute program code
  - Multiple threads per process share the same memory
• However, both terms are often used interchangeably in theory
  - “Process” seems to be used more often for historical reasons
  - Processes are typically named P, Q, R, etc. in papers
5. Why model?
• Formal models of computation let you define and prove certain desired properties of your programs
• The models also let you prove the impossibility of achieving certain results under specific constraints
  - Saving you the time of trying to find a working solution where none exists
6. The model with shared objects
[Diagram: threads 1..N perform operations on [shared] objects 1..M residing in [shared] memory]
8. Shared objects
• Threads (or processes) perform operations on shared memory objects
• This model does not care about operations that are internal to threads:
  - Computations performed by threads
  - Updates to threads’ CPU registers
  - Updates to threads’ stacks
  - Updates to any “thread-local” memory regions
• Only inter-thread communication matters
• The only type of inter-thread communication in this model is via shared objects
9. [Shared] Registers
• Don’t confuse them with CPU registers (eax, ebx, etc. in x86)
  - Those are just part of “thread state” in concurrent programming theory
• In concurrent programming, a [shared] register is the simplest kind of shared object:
  - It has some value type (typically boolean or integer)
  - With read and write operations
• Registers are basic building blocks for many practical concurrent algorithms
• The model of threads + shared registers is a decent abstraction for modern multicore hardware systems
  - It abstracts away enough actual complexity to make theoretical reasoning possible
10. Message passing models
• We can model parallel computing by letting threads send messages to each other, instead of giving them shared registers (or other shared objects)
  - It is closer to how the hardware memory bus actually works at a low level (CPUs send messages to memory via interconnects)
  - But it is farther from how programs actually work with memory
• Message passing is typically used to model distributed programs
• Both models are theoretically equivalent in their power
  - But the practical performance of various algorithms will be different
  - We work with the shared objects model where performance matters (taking care to optimize the number of shared objects and the number of operations on them is close to real practical optimization)
11. Parallel
[Diagram: “parallel” subsumes both “concurrent” (shared memory) and “distributed” (message passing)]
* NOTE: There is no general consensus on this terminology
12. Properties of concurrent programs
• Serial programs are usually deterministic
  - Unless explicit calls to a random number generator are present
  - Their properties are established by analyzing their state, invariants, pre- and post-conditions
• Concurrent programs are inherently nondeterministic
  - Even when the code for each thread is fully deterministic
  - The outcome depends on the actual execution history – which operations on shared objects were performed by threads, and in what order
  - When you say “program A has property P” it actually means “program A has property P in any execution”
13. Modeling executions
• S is a global state, which includes:
  - The state of all threads
  - The state of all shared objects, or of all “in flight” messages (in a distributed system)
• f and g are operations on shared objects
  - For registers, an operation is either ri.read(value) or ri.write(value)
  - There are as many possible operations in each state as there are active threads
    • not as simple in the distributed case
• f(S) is the new state after operation f was performed in state S
[Diagram: state S branches via operations f and g into new states f(S) and g(S)]
14. Example

shared int x = 0

thread P:      thread Q:
0: x = 1       0: x = 2
1: print x     1: print x
2: stop        2: stop

[State diagram: starting from state (P0, Q0) with x = 0 and nothing printed, interleavings of P’s and Q’s steps form a graph with a total of 17 states. The four final states (P2, Q2) are: x = 2 with output (1, 2); x = 2 with output (2, 2); x = 1 with output (1, 1); x = 1 with output (2, 1)]
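The state space on this slide can be checked mechanically. Here is a small Python sketch (mine, not from the slides) that enumerates every interleaving of P’s and Q’s atomic steps and collects the possible print sequences; it reproduces exactly the four outcomes shown above:

```python
from itertools import permutations

# Each thread is a list of atomic steps: step(x) -> (new_x, printed_value_or_None)
P = [lambda x: (1, None), lambda x: (x, x)]  # x = 1; print x
Q = [lambda x: (2, None), lambda x: (x, x)]  # x = 2; print x

def outcomes():
    results = set()
    # An execution is any merge of P's and Q's steps that preserves
    # each thread's own program order.
    for order in set(permutations("PPQQ")):
        x, printed = 0, []
        position = {"P": 0, "Q": 0}
        for t in order:
            step = (P if t == "P" else Q)[position[t]]
            position[t] += 1
            x, p = step(x)
            if p is not None:
                printed.append(p)
        results.add(tuple(printed))
    return results

print(sorted(outcomes()))  # [(1, 1), (1, 2), (2, 1), (2, 2)]
```

The printed pairs are in temporal order, matching the four final states on the slide.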
15. Discussion of the execution model with states
• This model is not truly “parallel”
  - All operations happen serially (albeit in an undefined order)
• In reality (on a modern CPU)
  - A read or write operation is not instantaneous; it takes time
  - There are multiple memory banks that work in parallel, so multiple read or write operations can be happening at the same time
• However, you can safely use this model for atomic registers
  - Atomic (linearizable) registers work as if each write or read is instantaneous and as if there is no parallelism
  - We will define what this means precisely later
• A more general model of execution is needed to analyze a wider class of primitives
16. Lamport’s happens-before (occurs-before) model
• An execution history is a pair (H, →H)
  - H is a set of operations e, f, g, … that happened during the execution
  - →H is a transitive, irreflexive, antisymmetric relation on the set of operations H (a strict partial order relation)
  - “e →H f” means “e happens before f [in H]” or “occurs before”
  - The H subscript is omitted where it is not ambiguous
• In the global-time model of execution, each operation e has
  - s(e) and f(e) – the times at which it started and finished
  - e → f ⟺ f(e) < s(f)
  - Albeit convenient to visualize, in reality there is no global time (no central clock) in a modern system, so formal proofs cannot use time
17. Legal executions
• An execution is legal if it satisfies the specifications of all objects

P: x.w(1)
Q:          x.r(1)     LEGAL

P: x.w(1)
Q:          x.r(2)     ILLEGAL
18. Serial executions
• An execution is serial if “happens before” is a total order

P: x.w(1)
Q:            x.r(1)   SERIAL (operations do not overlap)

P: x.w(1)
Q:      x.r(1)         NON-SERIAL (operations overlap)

• e and f are called parallel when neither e → f nor f → e
19. Linearizable executions
• An execution is linearizable if its history (the “happens before” relation) can be extended to a legal and serial (total) history

P: x.w(1)
Q:      x.r(1)         LINEARIZABLE

P: x.w(1)
Q:      x.r(2)         NON-LINEARIZABLE
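For small single-register read/write histories like these, linearizability can be checked by brute force. A Python sketch (the operation encoding is my own, not from the slides): an operation is a tuple (kind, value, start, finish), and e happens-before f iff e finishes before f starts.

```python
from itertools import permutations

# Brute-force linearizability check for a history over one register.
def linearizable(history, initial=0):
    def legal(order):
        # Replay the serial order; every read must return the current value.
        x = initial
        for kind, value, _, _ in order:
            if kind == "w":
                x = value
            elif x != value:
                return False
        return True

    def respects_happens_before(order):
        # No operation may be placed before one that finished earlier.
        return all(not (f[3] < e[2])
                   for i, e in enumerate(order) for f in order[i + 1:])

    return any(legal(o) and respects_happens_before(o)
               for o in permutations(history))

# The two executions from the slide: P writes 1 while Q reads concurrently.
print(linearizable([("w", 1, 0, 2), ("r", 1, 1, 3)]))  # True
print(linearizable([("w", 1, 0, 2), ("r", 2, 1, 3)]))  # False
```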
20. Linearizable (atomic) objects
• An object is called linearizable (atomic) if all execution histories with respect to this object are linearizable
• Linearizability is composable: a system execution on linearizable objects is linearizable
• In the global-time model, each operation e in a linearizable execution has a linearization point T(e) such that:
  - ∀e: s(e) ≤ T(e) ≤ f(e)
  - ∀e, f: e → f ⟹ T(e) < T(f), and e ≠ f ⟹ T(e) ≠ T(f)
[Diagram: operations of P and Q on register x, each marked with its linearization point]
21. Atomic registers and other objects
• Atomic register == linearizable register
  - They work as if read/write operations happen instantaneously at the linearization point, in some specific serial order
  - Thus we can use the “global state” model of execution to analyze the behavior of a program whose threads are working with shared atomic registers (or with other atomic objects)
• volatile fields in Java work like atomic registers
  - AtomicXXX classes are atomic registers, too (with additional operations)
• Thread-safe classes (synchronized, ConcurrentXXX) are atomic (linearizable) unless explicitly specified otherwise
  - “Thread-safe” in practice means “linearizable”, i.e. designed to work as if all operations happen in some serial order without outside synchronization, even if accessed concurrently
23. Mutual exclusion (lock)

The mutex protocol:

thread Pid:
    loop forever:
        nonCriticalSection
        mutex.lock
        criticalSection
        mutex.unlock

• The main desired property of the protocol is mutual exclusion. Two executions of the critical section cannot be parallel:
  ∀i, j: i ≠ j ⟹ CSi → CSj ∨ CSj → CSi
• This is also known as the correctness requirement for a mutual exclusion protocol
24. Mutex attempt #1

threadlocal int id // 0 or 1
shared boolean want[2]

def lock:
    want[id] = true
    while want[1 - id]: pass

def unlock:
    want[id] = false

• This protocol does guarantee mutual exclusion
• But there is no guarantee of progress. It can get into a live-lock (both threads spinning forever in lock)
• So, the other desired property is progress: the critical section should get entered infinitely often
25. Mutex attempt #2

threadlocal int id // 0 or 1
shared int victim

def lock:
    victim = id
    while victim == id: pass

def unlock:
    pass

• This protocol does guarantee mutual exclusion and progress
• But the critical section can be entered only in a turn-by-turn fashion. One thread working in isolation will starve.
• So, a stronger form of progress is desired. Freedom from starvation: if one (or more) threads want to enter the critical section, then each of them will enter the CS in a finite number of steps
26. Peterson’s mutual exclusion algorithm

threadlocal int id // 0 or 1
shared boolean want[2]
shared int victim

def lock:
    want[id] = true
    victim = id
    while want[1 - id] and victim == id:
        pass

def unlock:
    want[id] = false

• This protocol does guarantee mutual exclusion, progress, and freedom from starvation
• The order of operations in this pseudo-code is important
• Not the first one invented (1981), but the simplest 2-thread one
• Hard to generalize to N threads (it can be done, but the result is complex)
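A runnable Python sketch of the pseudo-code above (didactic only: it relies on CPython’s GIL making each individual read and write effectively atomic, standing in for the atomic registers the algorithm assumes; a real Python program would just use threading.Lock):

```python
import sys
import threading

sys.setswitchinterval(1e-4)  # switch threads often so spin loops don't stall

want = [False, False]  # shared boolean want[2]
victim = 0             # shared int victim
counter = 0            # shared state that the lock protects

def lock(my_id):
    global victim
    want[my_id] = True
    victim = my_id  # politely let the other thread go first
    while want[1 - my_id] and victim == my_id:
        pass  # spin

def unlock(my_id):
    want[my_id] = False

def worker(my_id):
    global counter
    for _ in range(1000):
        lock(my_id)
        counter += 1  # critical section: a non-atomic read-modify-write
        unlock(my_id)

threads = [threading.Thread(target=worker, args=(i,)) for i in (0, 1)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)  # 2000 - no increments were lost
```

Note that this sketch only works because CPython’s interpreter provides sequentially consistent single-statement semantics; in Java or C++ the same shape would need volatile/atomic variables, exactly as the following slides explain.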
27. Lamport’s [bakery] mutual exclusion algorithm

threadlocal int id // 0 to N-1
shared boolean want[N]
shared int label[N]

def lock:
    want[id] = true              // } doorway
    label[id] = max(label) + 1   // }
    while exists k: k != id and
          want[k] and
          (label[k], k) < (label[id], id):
        pass

def unlock:
    want[id] = false

• This protocol does guarantee mutual exclusion, progress, and freedom from starvation for N threads
• This protocol has an additional first-come, first-served (FCFS) property: the first thread to finish the doorway gets the lock first
• But it relies on infinite labels. They can be replaced with “concurrent bounded timestamps”
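The same kind of sketch for the bakery algorithm with N = 3 threads (again leaning on CPython’s GIL for atomicity of individual reads and writes; treating `max(label)` as one atomic call is a simplification relative to reading the labels one by one):

```python
import sys
import threading

sys.setswitchinterval(1e-4)  # switch threads often so spin loops don't stall

N = 3
want = [False] * N   # shared boolean want[N]
label = [0] * N      # shared int label[N]; labels grow without bound
counter = 0          # shared state that the lock protects

def lock(i):
    # Doorway: announce intent, then take a label larger than any seen.
    want[i] = True
    label[i] = max(label) + 1
    # Wait until no other interested thread has a smaller (label, id) pair;
    # ties on equal labels are broken by thread id.
    for k in range(N):
        while k != i and want[k] and (label[k], k) < (label[i], i):
            pass  # spin

def unlock(i):
    want[i] = False

def worker(i):
    global counter
    for _ in range(300):
        lock(i)
        counter += 1  # critical section
        unlock(i)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)  # 900
```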
28. Pros and cons of locks
• With mutual exclusion, any serial object can be turned into a linearizable shared object
  - Just protect all operations as critical sections with a mutex
  - Using two-phase locking (2PL) you can build complex linearizable objects out of smaller building blocks
  - Nothing more than shared registers is needed to build a mutex
  - Profit!
• But
  - By using multiple locks you can get into a deadlock
  - Locks lead to priority inversion
  - Locks limit the concurrency of code by ensuring that critical sections are executed strictly serially with respect to each other
29. Amdahl’s Law for parallelization
• The maximal speedup of code with N threads, when an S portion of it is serial:

  speedup = 1 / (S + (1 - S) / N)

  lim (N → ∞) speedup = 1 / S

• Even when just 5% of the code is serial (S = 0.05), the maximal possible speedup of the code is 20.
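The law is easy to evaluate numerically (a quick sketch):

```python
def amdahl_speedup(s, n):
    """Maximal speedup with n threads when an s fraction of the work is serial."""
    return 1.0 / (s + (1.0 - s) / n)

# 5% serial code caps the achievable speedup at 20x, no matter how many threads:
for n in (2, 8, 64, 10**9):
    print(f"N = {n:>10}: speedup = {amdahl_speedup(0.05, n):.2f}")
```

Even with a billion threads the speedup for S = 0.05 only approaches 20.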
30. Non-blocking algorithms (objects)
• What happens if the OS scheduler pauses a thread that is working inside a critical section (i.e. is holding a lock)?
  - No other operation on the corresponding object can proceed
• Lock-free: an object or operation (method) is lock-free if one of the active (non-paused) threads can complete an operation in a finite number of steps
  - Some threads may starve, but only when some other threads complete their operations
• Wait-free: an object or operation (method) is wait-free if any of the active (non-paused) threads can complete an operation in a finite number of steps
  - No starvation is allowed
32. Non-atomic registers
• A physical register (SRAM) is not atomic
  - However, it is wait-free, but…
  - It stores only boolean (bit) values
  - It can have only a single reader (SR) and a single writer (SW)
  - Trying to read and write at the same time leads to unpredictable results
  - But it is a safe register: when reading after a write completes, the most recently written value is returned
• Through a chain of software constructions on top of safe boolean SRSW registers it is possible to build a wait-free atomic multi-valued multi-reader (MR) multi-writer (MW) register
33. Atomic snapshot
• Just reading the values of N registers in a loop and returning them is not an atomic snapshot (“read N registers atomically”) operation

P: r1.w(1)   r2.w(2)
Q:         r1.r(0)   r2.r(2)

Q tries to take a snapshot and reads state (r1, r2) = (0, 2), but the system states were only (0, 0), (1, 0), and (1, 2); this execution cannot be linearized
34. Lock-free atomic snapshot
• Add version to each register
- On write atomically write a pair (new_version, new_value) to a register
where new_version = old_version + 1
• To take an atomic snapshot
- Read all versions and values in a loop
- Reread them to check if the versions are still the same
• If still the same -> the snapshot was atomic, return it
• If changed -> the snapshot was not atomic, repeat
• Can loop trying to take a snapshot forever (starvation), thus it is
not a wait-free algorithm
• But it is lock-free: the system as a whole makes progress. Looping in
snapshot means writes are completing
35. Wait-free atomic snapshot
• Yes, it is possible to make it wait-free, so that every operation
(including snapshot) is guaranteed to complete in a finite number
of steps under all circumstances
- Threads will have to cooperate
- Each updating thread will have to take a snapshot and store it in its
own per-thread register to help complete concurrent snapshots
• O(N²) storage requirement, O(N) time for each operation
• Not practical
- This is true about all wait-free algorithms
- There are no practical wait-free algorithms
• But certain individual non-modifying operations in some algorithms
can be implemented wait-free
36. Wait-free synchronization and consensus
• What other wait-free objects can we build using atomic wait-free
registers as our primitive?
- The question was definitively answered by M. Herlihy in 1991
- He considered wait-free implementations of the consensus protocol
• In a consensus protocol all threads have to reach agreement on a value
- It has to be non-trivial
- The protocol must be wait-free

The consensus protocol:
  threadlocal int proposal

  thread Pid:
    print consensus
    stop
37. Consensus number
• Consensus number of a shared object (or class of objects) is the
largest number N such that a [wait-free] consensus protocol for N
threads can be implemented using these objects as primitive building
blocks
• Consensus number of atomic registers is 1 (one, uno, один)
- Even a two-thread [wait-free] consensus protocol cannot be
implemented using any number of atomic registers
- However, it's trivial with locks!

Lock-based (blocking) consensus protocol:
  threadlocal int proposal // != 0
  shared int value

  def consensus:
    lock
    if value == 0:
      value = proposal
    unlock
    return value
38. Read-Modify-Write (RMW) registers
• It's a register that is augmented with additional RMW operation(s)
- Each RMW operation has a kernel function F and is typically named
"getAndF"
• Common2 class of RMW kernels
- F1(F2(x)) == F1(x) or
- F1(F2(x)) == F2(F1(x))
• Common2 examples:
- F(x) = a // set to const
- F(x) = x + a // add const

RMW register:
  shared int value

  def getAndF:
    old = value    // read
    value = F(old) // modify, write
    return old

Non-trivial Common2 RMW registers have consensus number 2
39. Consensus hierarchy

Objects and operations                                    | Consensus number
----------------------------------------------------------|-----------------
Atomic register with get (read), set (write) operations;  | 1
atomic snapshot of N registers                            |
Common2 Read-Modify-Write registers:                      | 2
getAndAdd (atomic inc/dec), getAndSet (atomic swap),      |
queue and stack (with enqueue/dequeue, push/pop only)     |
Atomic assignment of any N registers                      | 2N−2
Universal operations:                                     | ∞
compareAndSet/compareAndSwap (CAS), queue with peek       |
operation, memory-to-memory swap                          |
40. Universality of consensus
• Any object can be turned into a concurrent wait-free linearizable
object for N threads, if we have a consensus protocol for N threads,
using the universal construction
- Corollary: the consensus hierarchy is strict
- However, the universal construction is not really efficient for real life
• Lock-free universal construction via CAS is easy and practical:

  shared register<MyObject> value

  def concurrentOperationX:
    loop:
      oldval = value.get
      newval = oldval.deepCopy
      newval.serialOperationX
    until value.CAS(oldval, newval) is successful

- MyObject is a pointer if its state does not fit into a CAS-able
machine word
41. Implementing lock-free algorithms
• Let's try to implement CAS-based universal construction in C:

typedef struct object { /* my object's state is here */ } object_t;
void serial_operation_X(object_t *ptr); // updates state pointed to by ptr

void concurrent_operation_X(object_t **ptr) {
    object_t *oldval, *newval = malloc(sizeof(object_t));
    do {
        oldval = *ptr;
        memcpy(newval, oldval, sizeof(object_t));
        serial_operation_X(newval);
    } while (! __sync_bool_compare_and_swap(ptr, oldval, newval));
    free(oldval);
}

Problem: it can copy memory that was already freed, and
serial_operation_X will crash
42. Implementing lock-free algorithms (attempt #2)
• Let's try to implement CAS-based universal construction in C:

typedef struct object { /* my object's state is here */ } object_t;
void serial_operation_X(object_t *ptr); // updates state pointed to by ptr

void concurrent_operation_X(object_t **ptr) {
    object_t *oldval, *newval = malloc(sizeof(object_t));
    do {
        oldval = *ptr;
        memcpy(newval, oldval, sizeof(object_t)); // assume no segfault here
        __sync_synchronize(); // make sure we see changes of *ptr
        if (oldval != *ptr) continue;
        serial_operation_X(newval);
    } while (! __sync_bool_compare_and_swap(ptr, oldval, newval));
    free(oldval);
}
43. Still doesn't work: ABA problem
A, B and C are memory locations; start with *ptr == A

Thread P:                             Thread Q:
1: oldval is A                        1: oldval == A
2: (newval = malloc()) is B           2: (newval = malloc()) == C
3: CAS(ptr, A, B) is successful       // sleeps / is slow all this time
4: free(A)
// performs operation_X again
5: oldval is B
6: (newval = malloc()) is A
7: CAS(ptr, B, A) is successful
8: free(B)                            3: CAS(ptr, A, C) is successful

*ptr goes A → B → A, so Q's stale CAS(ptr, A, C) wrongly succeeds even
though the object's state has changed
44. Solving ABA problem
• Attach a version to the pointer and increment it on every operation
- Need to CAS two words at the same time
- That's why CPUs have ops like CMPXCHG8B (for 32-bit mode) and
CMPXCHG16B (for 64-bit mode)
• Rely on garbage collector (GC) for memory management
- In GC runtime environment the ABA problem simply does not exist
- Makes your non-blocking concurrent programming much easier!
• Use other schemes that rely on coordination between threads
(hazard pointers)
• Use special hardware support (LL/SC or hardware memory
transactions)
• Still, universal construction is efficient only if object state is small
45. Tree-like persistent data structures
• Diagram: oldval Root has children NodeA and NodeB; NodeB has children
NodeC and NodeD. Updating NodeB produces newval Root' with a new NodeB',
while NodeA (and NodeB's untouched children NodeC, NodeD) are shared
between both versions
• Reallocate and update only the path from the updated node to the root
47. Lock-free stacks
• Use universal construction on linked-list representation of the stack
(it’s a trivial tree-like structure!)
- root is pointing to the top of stack
- push and pop have trivial implementation with minimal overhead
• With a lot of cores, the root becomes a bottleneck. Use elimination backoff
- Threads trying to push and pop at the same time meet elsewhere and
exchange values directly, bypassing the stack
• But linked data structures are slow on modern machines
- No memory locality
- Next memory address is not known before reading previous node –
code must pay memory latency penalty on each access
- Array-based single-threaded stack is many times faster than linked
one
• Alas, no practical & efficient array-based lock-free algos are known
48. Lock-free queues
• Michael & Scott algo for lock-free unbounded linked queue
- Great implementation in java.util.concurrent.ConcurrentLinkedQueue
• Array-based bounded cyclic queues cannot be practically &
efficiently made lock-free
- But limiting to a single producer and single consumer helps (in case
of a bounded array-based queue)
- Don't even need CAS for an SPSC queue
- Use N of them for MP or MC
- Can do MP and MC queue (and even deque) if you additionally keep
a version of every slot in the array
• but this is not really practical
- Or reallocate memory when array is filled (unrolled linked list)
• a really practical alternative if needed
49. More practical notes
• Strict FIFO queue will always get contended
- Multiple producers will contend for tail
- Multiple consumers will contend for head
- Does not scale to a lot of cores
• In practice, strict FIFO queue is rarely needed
- Usually, it does not really matter if first in is really first out
• but it needs to be eventually out
- See java.util.concurrent.ForkJoinPool for one alternative
• Lock-free algorithms can be faster (and scale better) than their
lock-based counterparts, but are always slower than serial algos
Avoid unnecessary synchronization between threads
50. Data structures for search
• Ordered
- Balanced trees are hard to make lock-free (not practical)
- But Bill Pugh’s skip lists are practical in lock-free case
• Because they are based on ordered linked sets
• which support lock-free implementation
• See java.util.concurrent.ConcurrentSkipListMap/Set for implementations
• Unordered
- Fixed-size hash-tables are trivial in concurrent case
- Resizable hash-table can be implemented lock-free, too
• As either ordered linked set with lookup hash-table
(recursive split-ordering)
• Or fully based on arrays
(Cliff Click’s high-scale hash-table)
51. Hardware transactional memory (HTM)
• Is scheduled to debut in Intel Haswell processors
- Allows code to begin a transaction, perform it inside the processor
cache, then commit its effects to main memory or abort
- Enhances existing cache infrastructure
- While tracking interference between threads on top of existing cache-
coherence protocols
• It makes more efficient lock-free algorithms practical
- Like LIFO stacks and FIFO queues with any number of participants
- Like concurrent hash tables without pain
- Hardware automatically detects conflicts, with no code overhead to
manage them, and rolls back, letting code start the transaction again
(just like you'd do in the CAS universal construction)
52. Software Transactional Memory (STM)
• Is a simplified programming model
- Similar to locks, but uses atomic sections instead of synchronized ones
- Same problems as locks, but
• No need to worry about taking the right lock
• No need to worry about deadlocks
• Conflicting transaction is transparently restarted by transaction
manager
• It has poor performance, but makes life easier
- when there are a few limited places where threads have to coordinate
through shared objects
- It is inefficient if there are a lot of shared objects and/or they are
accessed very often