CONCURRENT DATA
STRUCTURES
The Role of Locking
Dr. C.V. Suresh Babu
Overview


• Introduction
  - Synchronization
  - Non-blocking Synchronization
• Is Non-blocking Synchronization performance-beneficial for Parallel Applications?
• NOBLE: A Non-blocking Synchronization Interface.
  How can we make non-blocking synchronization
  accessible to the parallel programmer?
• Lock-free Skip lists
• Conclusions, Future Work
Systems: SMP


Cache-coherent distributed shared
memory multiprocessor systems:
  - UMA
  - NUMA
Synchronization
• Barriers
• Locks, semaphores, … (mutual exclusion)

“A significant part of the work performed
by today’s parallel applications is spent on
synchronization.”
...
Lock-Based Synchronization:
Sequential
Non-blocking Synchronization
Lock-Free Synchronization
• Optimistic approach
  - Assumes it is alone and prepares an operation that later takes
    effect (unless interfered with) in one atomic step, using
    hardware atomic primitives
  - Interference is detected via shared memory
  - Retries until not interfered with by other operations
  - Can cause starvation
Example: Shared Queue
The usual approach is to implement operations using retry loops.
Here’s an example:
type Qtype = record v: valtype; next: pointer to Qtype end
shared var Tail: pointer to Qtype;
local var old, new: pointer to Qtype
procedure Enqueue (input: valtype)
    new := (input, NIL);
    repeat old := Tail
    until CAS2(&Tail, &(old->next), old, NIL, new, new)

[Figure: queue diagram showing the old, Tail and new pointers]
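The pseudocode above relies on CAS2, a compare-and-swap over two separate memory words, which common hardware does not provide. As a rough sketch only, the same retry-loop idea can be expressed with single-word CAS from C11 `<stdatomic.h>`: one CAS links the new node behind the last node, a second CAS swings Tail, and a thread that finds Tail lagging helps move it forward. The names `queue_t`, `node_t`, `queue_init` and `enqueue` are assumptions made for this illustration, dequeue and memory reclamation are left out, and this is not presented as the code used by the authors.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

typedef struct node {
    int value;
    _Atomic(struct node *) next;
} node_t;

typedef struct {
    node_t *head;               /* dummy node; dequeue is not shown */
    _Atomic(node_t *) tail;     /* last node, or the one just before it */
} queue_t;

/* Set up an empty queue consisting of a single dummy node. */
static int queue_init(queue_t *q) {
    node_t *dummy = malloc(sizeof *dummy);
    if (dummy == NULL) return -1;
    dummy->value = 0;
    atomic_init(&dummy->next, NULL);
    q->head = dummy;
    atomic_init(&q->tail, dummy);
    return 0;
}

/* Append a value; returns 0 on success, -1 if allocation fails. */
static int enqueue(queue_t *q, int value) {
    assert(q != NULL);
    node_t *new_node = malloc(sizeof *new_node);
    if (new_node == NULL) return -1;
    new_node->value = value;
    atomic_init(&new_node->next, NULL);

    for (;;) {                                  /* retry loop */
        node_t *old = atomic_load(&q->tail);
        node_t *next = atomic_load(&old->next);
        if (next != NULL) {
            /* Tail lags behind: help swing it forward, then retry. */
            atomic_compare_exchange_weak(&q->tail, &old, next);
            continue;
        }
        node_t *expected = NULL;
        if (atomic_compare_exchange_weak(&old->next, &expected, new_node)) {
            /* Linked. If this CAS fails, another thread already helped. */
            atomic_compare_exchange_strong(&q->tail, &old, new_node);
            return 0;
        }
        /* Another enqueue interfered (or the weak CAS failed spuriously). */
    }
}
```

Splitting the update into two CAS steps is what makes the helping branch necessary: between the two steps Tail may point at the second-to-last node, so every enqueuer has to be prepared to finish the other thread's update.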
Non-blocking Synchronization
Lock-Free Synchronization
• Avoids problems that locks have
• Fast
• Starvation? (not in the context of HPC)
Wait-Free Synchronization
• Always finishes in a finite number of its own steps.
  - Complex algorithms
  - Memory consuming
  - Less efficient on average than lock-free
Overview


• Introduction
  - Synchronization
  - Non-blocking Synchronization
• Is Non-blocking Synchronization performance-beneficial for Parallel Scientific Applications?
• NOBLE: A Non-blocking Synchronization Interface.
  How can we make non-blocking synchronization
  accessible to the parallel programmer?
• Conclusions, Future Work
Non-blocking
Synchronisation
Synchronisation:
• An alternative approach for synchronisation
  introduced 25 years ago
• Many theoretical results
Evaluation:
• Micro-benchmarks show better
  performance than mutual exclusion in real
  or simulated multiprocessor systems.
Practice




• Non-blocking synchronization is still not
  used in practical applications
• Non-blocking solutions are often
  - complex
  - burdened with non-standard or unclear interfaces
  - impractical
Practice
Question?
“How is the performance of
parallel scientific
applications affected by
the use of non-blocking
synchronisation rather than
lock-based synchronisation?”
Answers
How is the performance of parallel scientific
applications affected by the use of non-blocking
synchronisation rather than lock-based synchronisation?
• The identification of the basic locking
  operations that parallel programmers use in
  their applications.
• The efficient non-blocking implementation of
  these synchronisation operations.
• The architectural implications on the design
  of non-blocking synchronisation.
• Comparison of the lock-based and lock-free
  versions of the respective applications.
Applications
• Ocean: simulates eddy currents in an ocean basin.
• Radiosity: computes the equilibrium distribution of light in a scene
  using the radiosity method.
• Volrend: renders 3D volume data into an image using a ray-casting method.
• Water: evaluates forces and potentials that occur over time
  between water molecules.
• Spark98: a collection of sparse matrix kernels.
  Each kernel performs a sequence of sparse matrix
  vector product operations using matrices that are
  derived from a family of three-dimensional finite
  element earthquake applications.
Removing Locks in
Applications
• Many locks are “Simple Locks”.
  - CAS, FAA and LL/SC can be used to implement
    non-blocking versions.
• Many critical sections contain shared floating-point variables.
  - Floating-point synchronization primitives are needed.
    A Double-Fetch-and-Add primitive was designed.
• Large critical sections.
  - Efficient non-blocking implementations of big
    ADTs are used.
Experimental Results:
Speedup
[Figure: speedup plots; panel labels 58P, 58P, 32P, 24P, 24P, 58P, 58P]
SPARK98
Before:
spark_setlock(lockid);
w[col][0] += A[Anext][0][0]*v[i][0] + A[Anext][1][0]*v[i][1] + A[Anext][2][0]*v[i][2];
w[col][1] += A[Anext][0][1]*v[i][0] + A[Anext][1][1]*v[i][1] + A[Anext][2][1]*v[i][2];
w[col][2] += A[Anext][0][2]*v[i][0] + A[Anext][1][2]*v[i][1] + A[Anext][2][2]*v[i][2];
spark_unsetlock(lockid);
After:
dfad(&w[col][0], A[Anext][0][0]*v[i][0] + A[Anext][1][0]*v[i][1] + A[Anext][2][0]*v[i][2]);
dfad(&w[col][1], A[Anext][0][1]*v[i][0] + A[Anext][1][1]*v[i][1] + A[Anext][2][1]*v[i][2]);
dfad(&w[col][2], A[Anext][0][2]*v[i][0] + A[Anext][1][2]*v[i][1] + A[Anext][2][2]*v[i][2]);
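How a floating-point fetch-and-add such as `dfad` could be built on top of an integer CAS is sketched below. This is an illustration under stated assumptions, not the primitive designed by the authors: the shared cell is assumed to hold the 64-bit pattern of a `double` in a C11 atomic, and the names `atomic_double_bits`, `to_bits`, `from_bits` and `dfad_sketch` are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

_Static_assert(sizeof(double) == sizeof(uint64_t), "double must be 64 bits");

/* Assumption: the shared value is stored as the bit pattern of a double. */
typedef _Atomic uint64_t atomic_double_bits;

static uint64_t to_bits(double d) {
    uint64_t b;
    memcpy(&b, &d, sizeof b);
    return b;
}

static double from_bits(uint64_t b) {
    double d;
    memcpy(&d, &b, sizeof d);
    return d;
}

/* Atomically adds delta to the cell and returns the previous value. */
static double dfad_sketch(atomic_double_bits *cell, double delta) {
    assert(cell != NULL);
    uint64_t old_bits = atomic_load(cell);
    for (;;) {
        double old_val = from_bits(old_bits);
        uint64_t new_bits = to_bits(old_val + delta);
        /* On failure, old_bits is refreshed with the current contents. */
        if (atomic_compare_exchange_weak(cell, &old_bits, new_bits))
            return old_val;
    }
}
```

Each of the three `w[col][k]` updates becomes an independent atomic addition, which is why the lock around the whole group can be dropped. The three additions are no longer atomic as a group, so this only works when no reader needs a consistent view of all three components at once.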
Overview


• Introduction
  - Synchronization
  - Non-blocking Synchronization
• Is Non-blocking Synchronization beneficial for
  Parallel Scientific Applications?
• NOBLE: A Non-blocking Synchronization Interface.
  How can we make non-blocking synchronization
  accessible to the parallel programmer?
• Conclusions, Future Work
Practice




• Non-blocking synchronization is still not
  used in practical applications
• Non-blocking solutions are often
  - complex
  - burdened with non-standard or unclear interfaces
  - impractical
NOBLE: Brings Non-blocking closer to Practice


Create a non-blocking inter-process
communication interface with the properties:
• Attractive functionality
• Programmer friendly
• Easy to adapt existing solutions
• Efficient
• Portable
• Adaptable for different programming languages
NOBLE Design: Portable
Exported definitions (identical for all platforms):
  Noble.h
    #define NBL...
    #define NBL...
    #define NBL...
Platform independent:
  QueueLF.c, StackLF.c, ...
    #include "Platform/Primitives.h"
    …
Platform dependent:
  SunHardware.asm, IntelHardware.asm, ...
    CAS, TAS, Spin-Locks
    …
Using NOBLE
• First create a global variable
  handling the shared data
  object, for example a stack:
Globals
    #include <noble.h>
    ...
    NBLStack* stack;
• Create the stack with the
  appropriate implementation:
Main
    stack=NBLStackCreateLF(10000);
    ...
• When some thread wants to
  do some operation:
Threads
    NBLStackPush(stack, item);
    or
    item=NBLStackPop(stack);
Using NOBLE
Globals
    #include <noble.h>
    ...
    NBLStack* stack;
• When the data structure is
  not in use anymore:
Main
    stack=NBLStackCreateLF(10000);
    ...
    NBLStackFree(stack);
Using NOBLE
• To change the
  synchronization mechanism,
  only one line of code has to
  be changed!
Globals
    #include <noble.h>
    ...
    NBLStack* stack;
Main
    stack=NBLStackCreateLB();
    ...
    NBLStackFree(stack);
Threads
    NBLStackPush(stack, item);
    or
    item=NBLStackPop(stack);
Design: Attractive functionality


• Data structures for multi-threaded usage
  - FIFO Queues
  - Priority Queues
  - Dictionaries
  - Stacks
  - Singly linked lists
  - Snapshots
  - MWCAS
  - ...
• Clear specifications
Status


• Multiprocessor support
  - Sun Solaris (Sparc)
  - Win32 (Intel x86)
  - SGI (Mips)
  - Linux (Intel x86)
Available for academic use:
http://www.noble-library.org/
Did our Work have any
Impact?
1) Industry has initiated contacts and
   uses a test version of NOBLE.
2) Freeware developers have shown
   interest.
3) Interest from research organisations.
NOBLE is freely available for
research and educational purposes.
A Lock-Free Skip list


• Presented as part of: H. Sundell, Ph. Tsigas,
  “Fast and Lock-Free Concurrent Priority Queues
  for Multi-Thread Systems”, 17th IEEE/ACM
  International Parallel and Distributed
  Processing Symposium (IPDPS ’03), May 2003
  (TR 2002). Best Paper Award
• A very similar lock-free skip list algorithm will be
  presented this August at the ACM Symposium
  on Principles of Distributed Computing (PODC
  2004):
  “Lock-Free Linked Lists and Skip Lists”,
  Mikhail Fomitchev, Eric Ruppert
Randomized Algorithm: Skip Lists


• William Pugh: “Skip Lists: A Probabilistic
  Alternative to Balanced Trees”, 1990
• Layers of ordered lists with different
  densities, achieving a tree-like behavior
• Time complexity: O(log2 N), probabilistic!
[Figure: skip list from Head to Tail over the keys 1 to 7; the upper layers are labelled 50%, 25%, …]
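The 50% and 25% densities of the upper layers come from choosing each node's height at random. A minimal sketch of that choice, in the spirit of Pugh's paper with probability 1/2 per level, could look as follows. The names `random_level` and `MAX_LEVEL` are assumptions for this example, and `rand()` is used only for brevity: it is not thread-safe, so a concurrent implementation would need a per-thread generator.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_LEVEL 16   /* assumed cap on the height of a node */

/* Returns a height in [1, max_level]. About half of the nodes get
 * height 1, a quarter height 2, an eighth height 3, and so on. */
static int random_level(int max_level) {
    assert(max_level >= 1);
    int level = 1;
    /* Flip a fair coin: on "heads" the node grows by one more level. */
    while (level < max_level && rand() < RAND_MAX / 2)
        level++;
    return level;
}
```

With this distribution the expected number of nodes on layer k+1 is half that of layer k, which is what gives a search its expected logarithmic cost without any rebalancing.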
Our Lock-Free Concurrent
Skip List
• Define node state to depend on the
  insertion status at lowest level as well
  as a deletion flag
• Insert from lowest level going upwards
• Set deletion flag. Delete from
  highest level going downwards
[Figure: skip list nodes 1 to 7, each carrying a deletion flag D; detail of a node with levels 1 to 3 and pointer p]
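Concurrent Insert vs. Delete
operations
• Problem: Delete (node 2) and Insert (node 3) run concurrently
  - both nodes are deleted!
• Solution (Harris et al): Use bit 0 of
  pointer to mark deletion status
[Figure: list 1, 2, 3, 4 with pointer updates a), b), c); the * on node 2 marks it as deleted]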
Dynamic Memory Management
• Problem: System memory allocation
  functionality is blocking!
• Solution (lock-free), IBM freelists:
  - Pre-allocate a number of nodes, link
    them into a dynamic stack structure,
    and allocate/reclaim using CAS
[Figure: Head points to the free nodes Mem 1, Mem 2, …, Mem n; Allocate takes a node from the stack (Used 1) and Reclaim returns it]
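A freelist of this kind is essentially a stack of pre-allocated nodes whose head is updated with CAS. The sketch below uses C11 atomics with the hypothetical names `fnode_t`, `freelist_t`, `freelist_init`, `freelist_alloc` and `freelist_reclaim`. It is deliberately naive: `freelist_alloc` as written is exposed to the ABA problem discussed in this deck, so a real implementation would add a version tag or reference counting.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct fnode {
    struct fnode *next;   /* link used only while the node is free */
    int payload;
} fnode_t;

typedef struct {
    _Atomic(fnode_t *) head;   /* top of the stack of free nodes */
} freelist_t;

/* Reclaim: push a node back onto the freelist. */
static void freelist_reclaim(freelist_t *fl, fnode_t *n) {
    assert(fl != NULL && n != NULL);
    n->next = atomic_load(&fl->head);
    /* On failure n->next is refreshed with the current head; retry. */
    while (!atomic_compare_exchange_weak(&fl->head, &n->next, n)) { }
}

/* Allocate: pop a node, or return NULL if the pool is exhausted.
 * Caveat: reading n->next and the CAS are not protected against ABA. */
static fnode_t *freelist_alloc(freelist_t *fl) {
    assert(fl != NULL);
    fnode_t *n = atomic_load(&fl->head);
    while (n != NULL &&
           !atomic_compare_exchange_weak(&fl->head, &n, n->next)) { }
    return n;
}

/* Link count pre-allocated nodes into the freelist. */
static void freelist_init(freelist_t *fl, fnode_t *pool, size_t count) {
    atomic_init(&fl->head, NULL);
    for (size_t i = 0; i < count; i++)
        freelist_reclaim(fl, &pool[i]);
}
```

Since the pool is allocated once up front, neither operation ever calls the system allocator, which is the point of the technique: allocation and reclamation stay non-blocking.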
The ABA problem
• Problem: Because of concurrency
  (pre-emption in particular), the same
  pointer value does not always mean the
  same node (i.e. CAS succeeds anyway)!
[Figure: Step 1 and Step 2 of an ABA scenario on a linked structure of numbered nodes]
The ABA problem
• Solution: (Valois et al) Add reference
  counting to each node, in order to prevent
  nodes that are of interest to some thread from
  being reclaimed until all threads have left the
  node
[Figure: New Step 2, with reference counts shown on the nodes; the CAS fails!]
Helping Scheme
• Threads need to traverse safely
• Need to remove marked-to-be-deleted
  nodes while traversing: Help!
• Finds the previous node, finishes the deletion and
  continues traversing from the previous node
[Figure: traversal alternatives around the marked node 2* between nodes 1 and 4]
Overlapping operations on
shared data
• Example: Insert operation
  - which of 2 or 3 gets inserted?
[Figure: Insert 2 and Insert 3 both target the position between nodes 1 and 4]
• Solution: Compare-And-Swap
  atomic primitive:
CAS(p:pointer to word, old:word, new:word):boolean
    atomic do
        if *p = old then
            *p := new;
            return true;
        else return false;
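In C11 the CAS primitive described above corresponds to `atomic_compare_exchange_strong`. The following sketch shows how it resolves the Insert 2 versus Insert 3 race: both operations observe node 4 as the successor of node 1, only the first CAS on that pointer succeeds, and the loser searches again and retries. The type `snode_t` and the functions `try_insert` and `insert_sorted` are assumed names for illustration, and deletion is not handled here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct snode {
    int key;
    _Atomic(struct snode *) next;
} snode_t;

/* One attempt: succeeds only if pred->next still equals expected_succ. */
static bool try_insert(snode_t *pred, snode_t *expected_succ,
                       snode_t *new_node) {
    assert(pred != NULL && new_node != NULL);
    atomic_store(&new_node->next, expected_succ);
    return atomic_compare_exchange_strong(&pred->next, &expected_succ,
                                          new_node);
}

/* Sorted insert: after a failed CAS, search again from head and retry. */
static void insert_sorted(snode_t *head, snode_t *new_node) {
    for (;;) {
        snode_t *pred = head;
        snode_t *succ = atomic_load(&pred->next);
        while (succ != NULL && succ->key < new_node->key) {
            pred = succ;
            succ = atomic_load(&pred->next);
        }
        if (try_insert(pred, succ, new_node))
            return;
    }
}
```

The failed CAS is how interference is detected: the second inserter learns that the pointer it read is stale, and after retrying both 2 and 3 end up in the list.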
Experiments
• 1-30 threads on platforms with
  different levels of real concurrency
• 10000 Insert vs. DeleteMin operations
  by each thread. 100 vs. 1000 initial
  inserts
• Compare with other implementations:
  - Lotan and Shavit, 2000
  - Hunt et al “An Efficient Algorithm for
    Concurrent Priority Queue Heaps”,
    1996
Full Concurrency
Medium Pre-emption
High Pre-emption
Lessons Learned
• The Non-Blocking Synchronization
  Paradigm can be suitable and beneficial to
  large scale parallel applications.
• Experimental Reproducible Work. Many
  results claimed by simulation are not
  consistent with what we observed.
• Applications gave us nice problems to look
  at and do theoretical work on. (IPDPS 2003
  Algorithmic Best Paper Award)
• NOBLE helped programmers to trust our
  implementations.
Future Work
• Extend NOBLE for loosely coupled
  systems.
• Extend the set of data structures
  supported by NOBLE based on the
  needs of the applications.
• Reactive-Synchronisation