Role of locking

CONCURRENT DATA STRUCTURES
The Role of Locking
Dr. C.V. Suresh Babu

Overview

• Introduction
  - Synchronization
  - Non-blocking Synchronization
• Is Non-blocking Synchronization performance-beneficial for Parallel Applications?
• NOBLE: A Non-blocking Synchronization Interface. How can we make non-blocking synchronization accessible to the parallel programmer?
• Lock-free Skip lists
• Conclusions, Future Work

Systems: SMP

• Cache-coherent distributed shared memory multiprocessor systems:
  - UMA
  - NUMA

Synchronization

• Barriers
• Locks, semaphores,… (mutual exclusion)

“A significant part of the work performed by today’s parallel applications is spent on synchronization.”
...

Lock-Based Synchronization: Sequential

Non-blocking Synchronization

• Lock-Free Synchronization
  - Optimistic approach
    - Assumes it is alone and prepares an operation that later takes place (unless interfered with) in one atomic step, using hardware atomic primitives
    - Interference is detected via shared memory
    - Retries until not interfered with by other operations
    - Can cause starvation
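
The optimistic retry pattern described above can be sketched with C11 atomics. This is a minimal illustration, not code from the talk: the names `shared_max`, `update_max` and `read_max` are hypothetical, and `atomic_compare_exchange_weak` stands in for the hardware atomic primitive.

```c
#include <stdatomic.h>

/* Hypothetical shared variable updated without locks. */
static _Atomic long shared_max = 0;

/* Lock-free "store maximum": read the current value, decide privately,
 * then try to publish in one atomic step. If another thread interfered,
 * the CAS fails and the loop retries with the refreshed value. */
void update_max(long candidate)
{
    long seen = atomic_load(&shared_max);
    while (candidate > seen) {
        /* On failure, seen is overwritten with the current shared value. */
        if (atomic_compare_exchange_weak(&shared_max, &seen, candidate))
            break;
    }
}

long read_max(void)
{
    return atomic_load(&shared_max);
}
```

A single thread can in principle keep losing the race and retry indefinitely, which is the starvation risk the slide mentions; the system as a whole still makes progress because a failed CAS means some other thread succeeded.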

Example: Shared Queue
The usual approach is to implement operations using retry loops. Here’s an example:

    type Qtype = record v: valtype; next: pointer to Qtype end
    shared var Tail: pointer to Qtype;
    local var old, new: pointer to Qtype

    procedure Enqueue (input: valtype)
        new := (input, NIL);
        repeat old := Tail
        until CAS2(&Tail, &(old->next), old, NIL, new, new)

[Figure: two queue diagrams with old, Tail and new marked]

Non-blocking Synchronization

• Lock-Free Synchronization
  - Avoids problems that locks have
  - Fast
  - Starvation? (not in the context of HPC)
• Wait-Free Synchronization
  - Always finishes in a finite number of its own steps.
    - Complex algorithms
    - Memory consuming
    - Less efficient on average than lock-free

Overview

• Introduction
  - Synchronization
  - Non-blocking Synchronization
• Is Non-blocking Synchronization performance-beneficial for Parallel Scientific Applications?

Non-blocking Synchronisation

Synchronisation:
• An alternative approach for synchronisation introduced 25 years ago
• Many theoretical results
Evaluation:
• Micro-benchmarks show better performance than mutual exclusion in real or simulated multiprocessor systems.

Practice

• Non-blocking synchronization is still not used in practical applications
• Non-blocking solutions are often
  - complex
  - burdened with non-standard or unclear interfaces
  - impractical

Practice

Question: “How is the performance of parallel scientific applications affected by the use of non-blocking synchronisation rather than lock-based synchronisation?”

Answers

How is the performance of parallel scientific applications affected by the use of non-blocking synchronisation rather than lock-based synchronisation?

• The identification of the basic locking operations that parallel programmers use in their applications.
• The efficient non-blocking implementation of these synchronisation operations.
• The architectural implications on the design of non-blocking synchronisation.
• Comparison of the lock-based and lock-free versions of the respective applications.

Applications

• Ocean: simulates eddy currents in an ocean basin.
• Radiosity: computes the equilibrium distribution of light in a scene using the radiosity method.
• Volrend: renders 3D volume data into an image using a ray-casting method.
• Water: evaluates forces and potentials that occur over time between water molecules.
• Spark98: a collection of sparse matrix kernels. Each kernel performs a sequence of sparse matrix-vector product operations using matrices that are derived from a family of three-dimensional finite element earthquake applications.

Removing Locks in Applications

• Many locks are “Simple Locks”.
  → CAS, FAA and LL/SC can be used to implement a non-blocking version.
• Many critical sections contain shared floating-point variables.
  → Floating-point synchronization primitives are needed. A Double-Fetch-and-Add primitive was designed.
• Large critical sections.
  → Efficient non-blocking implementations of big ADTs are used.

Experimental Results: Speedup

[Figure: speedup plots; labels in the original: 58P, 32P, 24P]

SPARK98
Before:
spark_setlock(lockid);
w[col][0] += A[Anext][0][0]*v[i][0] + A[Anext][1][0]*v[i][1] + A[Anext][2][0]*v[i][2];
spark_unsetlock(lockid);
After:
dfad(&w[col][0], A[Anext][0][0]*v[i][0] + A[Anext][1][0]*v[i][1] + A[Anext][2][0]*v[i][2]);
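
The slides do not show how `dfad` is built. One way a floating-point fetch-and-add could be obtained from CAS is sketched below with C11 atomics; the name `dfad_sketch`, the `_Atomic double` parameter type and the returned previous value are assumptions made for illustration, not the primitive used in the experiments.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for a double fetch-and-add: atomically performs
 * *target += delta and returns the value that was stored before. */
double dfad_sketch(_Atomic double *target, double delta)
{
    double old = atomic_load(target);
    /* Retry until no other thread changed *target between the load and
     * the CAS; on failure, old is refreshed with the current value. */
    while (!atomic_compare_exchange_weak(target, &old, old + delta))
        ;
    return old;
}
```

With such a primitive the lock around the accumulation disappears: each thread adds its contribution directly, and a conflicting update only costs a retry instead of blocking.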

Overview

• Introduction
  - Synchronization
  - Non-blocking Synchronization
• Is Non-blocking Synchronization beneficial for Parallel Scientific Applications?

NOBLE: Brings Non-blocking closer to Practice

• Create a non-blocking inter-process communication interface with the properties:
  - Attractive functionality
  - Programmer friendly
  - Easy to adapt existing solutions
  - Efficient
  - Portable
  - Adaptable for different programming languages

NOBLE Design: Portable

• Exported definitions, identical for all platforms:
    Noble.h
    #define NBL...
    #define NBL...
    #define NBL...
• Platform independent:
    QueueLF.c, StackLF.c, ...
    each with: #include "Platform/Primitives.h" …
• Platform dependent:
    SunHardware.asm, IntelHardware.asm, ...
    each providing: CAS, TAS, Spin-Locks …

Using NOBLE

• First create a global variable handling the shared data object, for example a stack:

    Globals
    #include <noble.h>
    ...
    NBLStack* stack;

• Create the stack with the appropriate implementation:

    Main
    stack=NBLStackCreateLF(10000);
    ...

• When some thread wants to do some operation:

    Threads
    NBLStackPush(stack, item);
    or
    item=NBLStackPop(stack);

Using NOBLE

• When the data structure is not in use anymore:

    Main
    stack=NBLStackCreateLF(10000);
    ...
    NBLStackFree(stack);

Using NOBLE

• To change the synchronization mechanism, only one line of code has to be changed!

    Main
    stack=NBLStackCreateLB();
    ...
    NBLStackFree(stack);

    Threads
    NBLStackPush(stack, item);
    or
    item=NBLStackPop(stack);

Design: Attractive functionality

• Data structures for multi-threaded usage
  - FIFO Queues
  - Priority Queues
  - Dictionaries
  - Stacks
  - Singly linked lists
  - Snapshots
  - MWCAS
  - ...
• Clear specifications

Status

• Multiprocessor support
  - Sun Solaris (SPARC)
  - Win32 (Intel x86)
  - SGI (MIPS)
  - Linux (Intel x86)
• Available for academic use: http://www.noble-library.org/

Did our Work have any Impact?

1) Industry has initiated contact and uses a test version of NOBLE.
2) Freeware developers have shown interest.
3) Interest from research organisations. NOBLE is freely available for research and educational purposes.

A Lock-Free Skip list

• Presented as part of: H. Sundell, Ph. Tsigas, “Fast and Lock-Free Concurrent Priority Queues for Multi-Thread Systems”. 17th IEEE/ACM International Parallel and Distributed Processing Symposium (IPDPS ’03), May 2003 (TR 2002). Best Paper Award
• A very similar lock-free skip list algorithm will be presented this August at the ACM Symposium on Principles of Distributed Computing (PODC 2004): “Lock-Free Linked Lists and Skip Lists”, Mikhail Fomitchev, Eric Ruppert

Randomized Algorithm: Skip Lists

• William Pugh: “Skip Lists: A Probabilistic Alternative to Balanced Trees”, 1990
  - Layers of ordered lists with different densities, achieving a tree-like behavior
  - Time complexity: O(log2 N) – probabilistic!

[Figure: skip list from Head to Tail over nodes 1 to 7; layer densities labelled 50%, 25%, …]
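
The layered densities come from choosing each node's height at random. A minimal sketch of that choice follows, assuming a promotion probability of 1/2 (matching the 50% and 25% labels) and a hypothetical cap `MAX_LEVEL`; neither value is stated in the slides.

```c
#include <stdlib.h>

#define MAX_LEVEL 16  /* assumed cap on node height, not from the slides */

/* Pick a node height: every node is on level 1, and each further level
 * is reached with probability 1/2, so roughly 50% of the nodes appear
 * on level 2, 25% on level 3, and so on. */
int random_level(void)
{
    int level = 1;
    while (level < MAX_LEVEL && rand() > RAND_MAX / 2)
        level++;
    return level;
}
```

Because each level holds about half the nodes of the one below it, a search that starts at the top skips large parts of the list, which is where the expected logarithmic cost comes from.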

Our Lock-Free Concurrent Skip List

• Define node state to depend on the insertion status at the lowest level as well as a deletion flag
• Insert from the lowest level going upwards
• Set deletion flag. Delete from the highest level going downwards

[Figure: skip list with nodes 1 to 7, each carrying a deletion flag D, and a node tower with levels 1 to 3 and pointer p]

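Concurrent Insert vs. Delete operations

• Problem:
  [Figure: list 1, 2, 3, 4 with Delete on node 2 and Insert of node 3, steps a) and b)]
  - both nodes are deleted!
• Solution (Harris et al): Use bit 0 of pointer to mark deletion status
  [Figure: list 1, 2*, 3, 4 with steps a), b) and c); * marks the deleted node]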
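
The pointer-marking idea can be sketched in C as follows. This is an illustration under stated assumptions rather than the authors' code: the node layout and the helper names `mark`, `unmark`, `is_marked` and `try_mark_deleted` are hypothetical, and nodes are assumed to be at least 2-byte aligned so that bit 0 of a valid pointer is free.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative node type; the real skip list node is more elaborate. */
struct node { int key; _Atomic(struct node *) next; };

/* Bit 0 of an aligned pointer is always zero, so it can carry the flag. */
static inline struct node *mark(struct node *p)
{ return (struct node *)((uintptr_t)p | (uintptr_t)1); }

static inline struct node *unmark(struct node *p)
{ return (struct node *)((uintptr_t)p & ~(uintptr_t)1); }

static inline bool is_marked(struct node *p)
{ return ((uintptr_t)p & (uintptr_t)1) != 0; }

/* Logically delete n by setting the mark on its own next pointer.
 * Returns false if another thread had already marked the node. */
bool try_mark_deleted(struct node *n)
{
    struct node *succ = atomic_load(&n->next);
    while (!is_marked(succ)) {
        if (atomic_compare_exchange_weak(&n->next, &succ, mark(succ)))
            return true;   /* this thread set the mark */
    }
    return false;          /* someone else marked it first */
}
```

Once the mark is set, an insert that tries to CAS a new node in after the deleted one compares against the unmarked pointer and fails, so the new node cannot be lost together with the deleted one.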

Dynamic Memory Management

• Problem: System memory allocation functionality is blocking!
• Solution (lock-free), IBM freelists:
  - Pre-allocate a number of nodes, link them into a dynamic stack structure, and allocate/reclaim using CAS

[Figure: free list Head → Mem 1 → Mem 2 → … → Mem n; Allocate and Reclaim operate at the head; Used 1]
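
A free list of this kind might look roughly like the sketch below, written with C11 atomics. The pool size, the node layout and the names `pool_init`, `allocate` and `reclaim` are assumptions for illustration only. The plain pop shown here is exposed to the ABA problem, which the deck turns to next.

```c
#include <stdatomic.h>
#include <stddef.h>

#define POOL_SIZE 4   /* assumed pool size for the sketch */

struct mem_node { struct mem_node *next; int payload; };

static struct mem_node pool[POOL_SIZE];
static _Atomic(struct mem_node *) free_head = NULL;

/* Reclaim: push a node onto the free stack with a CAS retry loop. */
void reclaim(struct mem_node *n)
{
    struct mem_node *head = atomic_load(&free_head);
    do {
        n->next = head;   /* head is refreshed by a failed CAS */
    } while (!atomic_compare_exchange_weak(&free_head, &head, n));
}

/* Allocate: pop a node, or return NULL when the pool is exhausted.
 * Without further protection this pop can suffer from ABA. */
struct mem_node *allocate(void)
{
    struct mem_node *head = atomic_load(&free_head);
    while (head != NULL &&
           !atomic_compare_exchange_weak(&free_head, &head, head->next))
        ;
    return head;
}

/* Pre-allocate: link every pool node into the free stack. */
void pool_init(void)
{
    for (size_t i = 0; i < POOL_SIZE; i++)
        reclaim(&pool[i]);
}
```

Neither operation calls the system allocator, so a thread that is pre-empted in the middle of `allocate` or `reclaim` cannot block the others.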

The ABA problem

• Problem: Because of concurrency (pre-emption in particular), the same pointer value does not always mean the same node, yet the CAS still succeeds!

[Figure: Step 1 and Step 2 of an ABA scenario on a linked structure]

The ABA problem

• Solution (Valois et al): Add reference counting to each node, so that nodes that are of interest to some thread are not reclaimed until all threads have left the node

[Figure: New Step 2 with reference counts on the nodes; the CAS fails!]

Helping Scheme

• Threads need to traverse safely
  [Figure: two possible states of a list with nodes 1, 2* (marked) and 4]
• Need to remove marked-to-be-deleted nodes while traversing: Help!
• Find the previous node, finish the deletion, and continue traversing from the previous node
  [Figure: list with nodes 1, 2* and 4]

Overlapping operations on shared data

• Example: Insert operation
  [Figure: list 1 → 4 with concurrent operations Insert 2 and Insert 3]
  - which of 2 or 3 gets inserted?
• Solution: Compare-And-Swap atomic primitive:

    CAS(p: pointer to word, old: word, new: word): boolean
        atomic do
            if *p = old then
                *p := new;
                return true;
            else return false;
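
In C11 the same primitive is available as `atomic_compare_exchange_strong`. The sketch below wraps it to match the slide's signature and uses it for the two competing inserts; the node type `lnode` and the helpers `cas` and `insert_after` are hypothetical names chosen for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative list node. */
struct lnode { int key; _Atomic(struct lnode *) next; };

/* CAS as defined on the slide, expressed with the C11 library call. */
static bool cas(_Atomic(struct lnode *) *p,
                struct lnode *old, struct lnode *new_node)
{
    return atomic_compare_exchange_strong(p, &old, new_node);
}

/* Insert n after prev only if prev still points at the successor that
 * the caller read earlier; otherwise report failure so it can retry. */
bool insert_after(struct lnode *prev, struct lnode *expected_succ,
                  struct lnode *n)
{
    atomic_store(&n->next, expected_succ);
    return cas(&prev->next, expected_succ, n);
}
```

If Insert 2 and Insert 3 both read node 4 as the successor of node 1, only the first CAS succeeds. The second sees that node 1 no longer points at node 4, fails, and has to re-read the list before trying again, so neither insert is silently lost.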

Experiments

• 1-30 threads on platforms with different levels of real concurrency
• 10000 Insert vs. DeleteMin operations by each thread. 100 vs. 1000 initial inserts
• Compare with other implementations:
  - Lotan and Shavit, 2000
  - Hunt et al “An Efficient Algorithm for Concurrent Priority Queue Heaps”, 1996

Lessons Learned

• The Non-Blocking Synchronization Paradigm can be suitable and beneficial to large scale parallel applications.
• Experimental, reproducible work. Many results claimed by simulation are not consistent with what we observed.
• Applications gave us nice problems to look at and do theoretical work on. (IPDPS 2003 Algorithmic Best Paper Award)
• NOBLE helped programmers to trust our implementations.

Future Work

• Extend NOBLE for loosely coupled systems.
• Extend the set of data structures supported by NOBLE based on the needs of the applications.
• Reactive-Synchronisation

