SlideShare a Scribd company logo
1 of 12
Download to read offline
Concurrent Deterministic Skiplist and Other
DataStructures
Aparna Sasidharan
Illinois Institute of Technology
Chicago,IL,USA
aparnasasidharan2017@gmail.com
Abstract—Skiplists are used in a variety of applications for
storing data subject to order criteria. In this article we dis-
cuss the design, analysis and performance of a concurrent
deterministic skiplist on many-core NUMA nodes [1]. We also
evaluate the performance of a concurrent lock-free unbounded
queue implementation and three concurrent multi-reader, multi-
writer (MWMR) hash table implementations and compare them
with those from Intel’s Thread Building Blocks (TBB) library [2].
We focus on strategies for memory management that reduce
page faults and cache misses for the memory access patterns
in these data structures. This paper proposes hierarchical usage
of concurrent data structures in programs to improve memory
latencies by reducing memory accesses from remote NUMA
nodes.
I. MOTIVATION
Most modern high performance machines have many-core
CPUs with high core counts [3]. The CPUs may be used along-
side accelerators for compute intensive applications. Many
computational science applications have shown scalability [4]
on many-core and heterogeneous nodes due to their regular
memory access patterns which have natural spatial and tem-
poral localities, low page faults and cache misses. Distributed
data-intensive workloads [5] which perform point location
and range searches have less spatial and temporal locali-
ties compared to applications from computational science.
These workloads do not scale as well as their counterparts
in computational science on high performance machines. In
this paper, we discuss the performance of three fundamental
data structures that are used for storing and querying data in
data-intensive applications on many-core NUMA nodes [1].
Performance can be improved if search applications use scal-
able concurrent data structures. For example, using suitable
functions that convert queries to keys, point location may use
concurrent hash tables and range searches may use concurrent
skiplists for query processing.
The focus of this article is a concurrent deterministic
skiplist. Skiplists can function as ordered sets and are building
blocks for ordered multi-sets and heaps. Concurrent ran-
domized skiplist implementations contain nodes with random
number of links (heights) [6]. The costs of insertion, search
and deletion are O(logn) with high probability. The O(logn)
computational complexity depends on the random number
generator used by the implementation for generating node
heights. The random number generator should generate nodes
of height j + 1 with probability 1
t
j
where t is the number of
nodes to skip at level 1 and j + 1 is the level, if levels are
numbered from 1 to L. If probabilities follow this distribution,
there will be 1
t
j
∗ N links per level j + 1 and all randomized
skiplist operations will be O(logn) with high probability [7].
A deterministic skiplist is balanced and its costs for insertion,
find and deletion are O(logn). Our implementation is a
concurrent version of 1-2-3-4 skiplist [8]. The skiplist has
log(N) levels and the number of links at each level is at least
1
4 the links at the previous lower level. Levels are numbered
from bottom (0) to root (logn). Leaf nodes are the links at level
1. Leaf nodes contain pointers to the nodes of a linked-list
which stores keys and data (terminal nodes). The algorithms
are discussed in detail in the remaining sections, along with
performance.
Dictionaries or Hash Tables are useful data structures for
organizing data based on keys, where the key universe is
much larger than the number of keys used in a program [9].
Operations on hash tables such as insert, delete and find can
be performed in constant time per key. But the performance of
concurrent hash tables is affected by random memory accesses
which cause cache misses and page faults in programs which
use large workloads. This paper discusses the performance
of three implementations of concurrent multi-writer multi-
reader (MWMR) hash tables.
Queues are widely used for load balancing workloads, stor-
ing data and for communication. A poor queue implementation
can affect the scalability of a program. This article discusses
controlled memory allocation and recycling to reduce page
faults in data structures using concurrent lock-free queues as
a motivating example.
All experiments in this paper were performed on the Delta
Supercomputer from the National Center for Supercomputing
Applications (NCSA) [10] which has AMD Milan many-
core NUMA nodes [11]. The paper is organized as follows.
Section II discusses the design and analysis of the concurrent
deterministic skiplist. Sections III, IV and V discuss strategies
for memory management in concurrent queues. Scalability of
the concurrent skiplist is discussed in section VI. Hash tables
and their performance are discussed in sections VII and VIII.
Sections IX and X have the related work, conclusions and
future work.
arXiv:2309.09359v1
[cs.DC]
17
Sep
2023
Fig. 1. Deterministic Skiplist with 3 levels
II. SKIPLIST
In this section we discuss a concurrent 1-2-3-4 skiplist based
on the sequential version from Munro and Sedgewick [8]. A
skiplist is an ordered linked-list with additional links. The extra
links reduce search costs by providing shortcuts to different
intervals in the linked-list. Without these links, search costs
can be as high as the length of the list O(n). In a random
skiplist [12], lookup nodes divide the linked-list into non-
overlapping intervals of random lengths. Although interval
lengths are random, the total number of links traversed to
find a node in a list with n nodes is O(logn) with high
probability. A deterministic 1-2-3-4 skiplist divides the linked-
list into intervals of at least 2 nodes (1 interval) and at most
5 nodes (4 contiguous intervals), with a hierarchy of links to
these intervals. The data structure is described in figure 1 for
a terminal linked-list with 7 nodes. Each non-terminal node in
the skiplist stores a key and pointers to the next and bottom
nodes. The children of a node have keys ≤ to its key. This
invariant is used to maintain order among keys. The skiplist
shown in figure 1, has two levels, with level 1 being the leaf
level. Any two consecutive levels in the skiplist maintain the
invariant that keys at level l + 1 are subsets of keys at level
l. Addition and deletion operations on the terminal linked-list
cause addition or deletion of non-terminal nodes to maintain
the O(logn) number of links to reach any node in the linked-
list. These are referred to as re-balancing operations in this
paper. They also include operations that adjust the skiplist
height (change in the number of levels).
Addition (Insertion) operation traverses the skiplist recur-
sively until it finds a level 1 node with key ≤ the new
key. Nodes on the path from the root to the leaf node are
checked for 1-2-3-4 criterion and re-balancing adds a new
node if a node has 5 children. This operation avoids the
creation of nodes with ≥ 6 children post addition. These
are local operations which require locking nodes in L shape,
i.e a node and its children. The maximum number of nodes
locked by a thread at any time during addition is at most
6. Addition maintains sorted order in the terminal linked-
list and new nodes are inserted after checking for duplicates.
Algorithms for addition of keys are provided in algorithms 1
and 2. The key of the root node (head) is the maximum
key (264
− 1). All terminal and non-terminal linked-lists are
terminated with sentinel tail nodes. If re-balancing added a
node as the next neighbor of the root, then, the skiplist’s
height should be increased, using algorithm 3. Also, terminal
linked-list nodes have sentinel bottom nodes. Sentinel nodes
point to themselves, thereby avoiding null pointers in the
implementation. We used wide unsigned integers (128-bits)
to store the key and next pointer in the same data type. The
64-bit key is stored in the upper half and the 64-bit pointer
is stored in the lower half of the wide integer. We used the
wide unsigned integer implementation from Boost library [13]
for this purpose. Boost supports atomic operations on wide
integers. Key and pointer fields were extracted from wide
integers using bit masks. Nodes also contain one additional
bit field : mark. The mark bit is set when a node is deleted
and is no longer in the skiplist. Since key and next pointers
are updated atomically during addition and deletion, our Find
implementation is lock-free. In our Find implementation, we
try to obtain a set of valid nodes in L shape and make traversal
decisions based on their keys. If a node or its children change
state, it will be detected at the next recursive step. If a node
is deleted, it can be detected by reading its mark field. If the
key of a node was decreased by a re-balancing operation, the
search can continue to the right, by following its next pointer.
Deletion operation deletes a key from the terminal linked-
list. Removal of nodes with deleted keys from non-terminal
linked-lists is performed lazily to reduce work per deletion.
If a non-terminal node (except root) in the deletion path has
exactly 2 children, then, the number of children is increased
by re-balancing, to maintain the 1-2-3-4 criteria post deletion.
Deletion may perform one of two re-balancing operations
borrow or merge at non-terminal nodes along the path from the
root to the terminal linked-list. The merge operation removes
nodes from non-terminal linked-lists by merging the children
of two neighboring nodes and removing one of them. In the
implementation described in this paper, merge removes the
node with the higher key. The borrow operation, borrows a
node from a neighbor keeping the number of nodes constant.
We implemented borrow as a post operation of merge to
maintain the correctness of Find. Algorithms used by deletion
are provided in algorithms 5 and 6. We haven’t provided the
deletion algorithm because its control flow is similar to the
addition algorithm, except that it locks a pair of adjacent
nodes and their children following LL pattern at every level.
The pair of nodes for the next level of recursion is chosen
from the union of child nodes. The previous neighbor of the
child node containing the key to be deleted is the preferred
partner. If a child node does not have a previous neighbor,
then, its next neighbor is chosen. Deletion of nodes can lead
to a skiplist where the root node has a 0 length interval.
This is detected and the skiplist’s height is decreased, using
the DecreaseDepth algorithm provided in algorithm 6. All
function implementations Addition, Find and Deletion have
retry semantics i.e if a function exited prematurely due to one
of the following conditions, then it retries :
1) Marked node.
2) The next neighbor of root is not tail.
3) Root node has 0 length interval.
The DropKey function removes a node with matching key
from the terminal linked-list. The AddNode function adds a
new node to the terminal linked-list if duplicates do not exist.
The CheckNodeKey function is used to update a node’s key
if the keys of its children are < than its key. This happens if
the child node with the highest key was removed by Deletion
or merge. The Acquire function locks a node and Release
unlocks it. Since these are simple functions, pseudo-codes
for them are not provided here. In our implementation we
used a memory manager that recycles deleted nodes. We
used reference counters in every node to avoid the ABA
problem [6]. Reference counters were incremented during
every recycling operation performed on a node.
All operations on this concurrent skiplist are linearizable.
The linearization points in the algorithms are the statements
where key and next pointer are updated, i.e a new node is
added or an existing node is deleted. The space requirement
for this data structure is O(n) nodes for the terminal linked-list
and at most
Plogn
i=1
n
2i nodes for the non-terminal linked-lists.
Other data structures that can be used in place of skiplists for
operations such as point location, range search and ranking
are balanced binary search trees (BST) such as red-black trees
or AVL trees and their variations. BSTs store keys and data
in internal and external nodes, which makes their cost per
search operation at most log(n). Compared to binary trees, a
1-2-3-4 skiplist has higher concurrency due to its branching
factor. Skiplist operations can be performed concurrently at
every level by locking a few subset of nodes. However, search
costs in skiplists implementation are never less than log(n).
Deterministic skiplists are similar in design to multi-way trees
or (a, b) trees, a ≥ 2 and b ≥ 2 ∗ a [14]. (a, b) trees
have equivalent re-balancing operations such as node splitting,
node sharing and node fusing. (a, b) tree implementations
usually perform re-balancing bottom-up. The total number of
re-balancing operations can be bounded using the bottom-up
analysis described in [14].
Define c = min(min(2a − 1, ⌈b+1
2 ⌉) − a, b − max(2a −
1, ⌊b+1
2 ⌋)). c ≥ 1 and b ≥ 2∗a. Also define a balance function
b(v) on the nodes v of the (a, b) tree as :
b(v) = min(ρ(r) − a, b − ρ(v), c) (1)
where ρ(v) is the arity or the number of children of node
v and r is the root node. For the root node, b(r) is defined
as min(ρ(r) − 2, b − ρ(r), c). The balance of a sub-tree T
of height h (bh(T)) is the sum of the balance values of
the nodes of T. Let T0 be the initial tree and Tn the tree
Algorithm 1: Skiplist : Addition
Input: Node n, Key key
Output: Integer
procedure Addition(n, key)
Acquire(n)
kn ← n.get keynext()
nkey←kn[127:64]
nnext←kn[63:0]
if isMarked(n) then
Release(n)
return RETRY
end if
if isSentinel(n) then
Release(n)
return FALSE
end if
if isHead(n) ∧¬ isSentinel(nnext) then
Release(n)
return RETRY
end if
cnodes← null
b ← n.bottom
cnodes← AcquireChildren(n)
CheckNodeKey(n,cnodes)
if nkey<key then
r← nnext
ReleaseChildren(cnodes)
Release(n)
return Addition(r,key)
end if
AdditionRebalance(n)
if isLeaf(n) then
ret← AddNode(n,key)
ReleaseChildren(cnodes)
Release(n)
return ret
end if
nn ← null
for d ∈ cnodes do
dkn ← d.get keynext()
dkey←dkn[127:64]
if key≤ dkey then
nn ← d
break
end if
end for
ReleaseChildren(cnodes)
Release(n)
return Addition(nn,key)
end procedure
obtained after n operations (additions and deletions). Tree
operations such as additions and deletions decrease the balance
of nodes by a constant value. Balance can be restored by re-
balancing operations. Since they are performed bottom-up, re-
balancing operations at height h increase the balance at h and
decrease the balance at h+1. Let anh be the number of node
additions (splitting), mnh the number of node merges (fusion)
and bnh the number of node borrows (share) at level h. The
recurrence for balance bh(Tn) can defined as :
Algorithm 2: Skiplist : Add Non-Terminal Node
Input: Node n
Output: void
procedure AdditionRebalance(n)
cnodes ← Children(n)
nnodes ← Count(cnodes)
if nnodes > 4 then
nn ← NewNode()
kn ← n.get keynext()
nkey← kn[127:64]
nnext←kn[63:0]
nn.update keynext(nkey, nnext)
nn.bottom ← cnodes[2]
kn ← cnodes[1].get keynext()
nkey←kn[127:64]
n.update keynext(nkey, nn)
end if
end procedure
Algorithm 3: Skiplist : Increase Depth
Input:
Output:
procedure IncreaseDepth()
n← head
Acquire(n)
kn ← n.get keynext()
nkey←kn[127:64]
nnext←kn[63:0]
nn← nnext
if isSentinel(nn) then
Release(n)
return
end if
d← NewNode()
d.bottom← n.bottom
d.update keynext(nkey, nnext)
n.bottom← d
n.update keynext(maxV alue, tail)
Release(n)
return
end procedure
bh(Tn) ≥ bh(T0)−(anh−1 +mnh−1)+(2c∗anh +(c+1)∗mnh)
(2)
Rearranging, and setting bh(T0) = 0 for h ≥ 1 for empty
tree T0, we get,
anh + mnh ≤
anh−1 + mnh−1
c + 1
+
bh(Tn)
c + 1
(3)
bnh ≤ mnh−1 − mnh (4)
Solving the recurrences, we can bound the total number
of re-balancing operations up to height h in an (a, b) tree to
2∗(c+2)∗ n
(c+1)h for n additions and deletions. This value de-
creases exponentially with increasing value of h. The complete
proof can be found in [14]. From the analysis of (a, b) trees,
we observe that most re-balancing operations are performed
at the lower levels of the data structure. Since the lower levels
have higher concurrency, several re-balancing operations can
Algorithm 4: Skiplist : Algorithm for lock-free
Find
Input: Node n,Key key
Output: Integer
procedure Find(n,key)
if isMarked(n) then
return RETRY
end if
d← n.bottom, cnodes← null
keys← null,nn← null
kn ← n.get keynext()
nkey ← kn[127:64],nnext ← kn[63:0]
if isHead(n) ∧¬ isSentinel(nnext) then
return RETRY
end if
if isSentinel(n) then
return FALSE
end if
if ¬ isMarked(n) ∧ isSentinel(d) then
if nkey=key then
return TRUE
end if
else
if nkey>key then
return FALSE
end if
end if
end if
if nkey < key then
return Find(nnext,key)
end if
while true do
dn ← d.get keynext()
if isMarked(n) ∨ isMarked(d) then
return RETRY
end if
if dn[127:64] ≤ nkey ∧¬ isSentinel(d) then
cnodes.add(d), keys.add(dn[127 : 64])
end if
if dn[127:64] ≥ nkey ∨ isSentinel(d) then
break
end if
d← dn[63:0]
end while
if isHead(n) then
nnodes← Count(cnodes)
if keys[nnodes-1]̸= maxkey then
return RETRY
end if
end if
c← 0
for k∈ keys do
if key≤ k then
nn← cnodes[c]
break
end if
c← c+1
end for
if nn̸= null then
return Find(nn,key)
end if
else
return FALSE
end if
end procedure
Algorithm 5: Skiplist : MergeBorrow Node
Input: Node n1,Node n2,Key key
Output: Integer
procedure MergeBorrowNode(n1,n2,key)
n1nodes← Children(n1)
n2nodes← Children(n2)
n1kn ← n1.get keynext()
n1key ← n1kn[127:64]
fn ← key ≤ n1key ? true : false
b ← fn ∧ Count(n1nodes)=2
c ← ¬ fn ∧ Count(n2nodes)=2
if b ∨ c then
n2kn ← n2.get keynext()
n2k ← n2kn[127:64]
n2n ← n2kn[63:0]
n1.update keynext(n2k, n2n)
MarkNode(n2)
Release(n2)
n2← null
if b ∧ Count(n2nodes)>2 then
nn← NewNode()
nn.update keynext(n2k, n2n)
nn.update bottom(n2nodes[1])
n2kn ← n2nodes[0].get keynext()
n1.update keynext(n2kn[127 : 64], nn)
Acquire(nn)
n2← nn
end if
else
if c ∧ Count(n1nodes)>2 then
nn← NewNode()
nn.update keynext(n2k, n2n)
p ← Count(n1nodes)
p1 ← p-1,p2← p-2
nn.update bottom(n1nodes[p1])
n2kn ← n1nodes[p2].get keynext()
n1.update keynext(n2kn[127 : 64], nn)
Acquire(nn)
n2← nn
end if
end if
end if
end procedure
be performed in parallel. In our concurrent skiplist design, re-
balancing is performed proactively during top-down traversals.
Alternatively, re-balancing can be performed using bottom-up
traversals post skiplist operations, which is the implementation
described by Munro and Sedgewick [8]. In both cases, the total
number of re-balancing operations is linear in the number of
additions and deletions. The recurrences described above hold
for both top-down and bottom-up re-balancing if b > 2∗a+1.
When b = 2 ∗ a or b = 2 ∗ a + 1 (1-2-3-4 skiplist), the
above recurrences are not valid for top-down re-balancing. For
example, for a = 2,b = 5, a node with 5 children, when split,
will create nodes with 2 and 3 children, where the 2 nodes
will require re-balancing during deletions. But the number of
re-balancing operations will still be linear in the number of
operations.
We did not use bottom-up re-balancing because it requires
two traversals per operation. It will be difficult to maintain the
Algorithm 6: Skiplist : Decrease Depth
Input:
Output:
procedure DecreaseDepth()
n← head
Acquire(n)
kn ← n.get keynext()
nkey← kn[127:64]
b← n.bottom
Acquire(b)
kn ← b.get keynext()
bkey ← kn[127:64]
if nkey=bkey∧¬ isSentinel(b) then
bb ← b.bottom
if bb̸= b then
Acquire(bb)
n.bottom← bb
MarkNode(b)
Release(bb)
end if
end if
Release(b)
Release(n)
end procedure
correctness of lock-free Find along with addition and deletion
traversals in opposite directions. Multi-way search trees [15]
discusses a concurrent implementation which performs re-
balancing bottom-up but the Find implementation uses locks.
The lock-free B+
-tree design from [16] maintains correctness
by deleting nodes after every re-balancing operation.
III. UNBOUNDED LOCK-FREE QUEUES
We chose lock-free queues for this discussion because they
are useful for load balancing workloads within and across
nodes in many-core processors. Boost provides a lock-free
queue implementation that allocates memory in blocks. Their
implementation is based on the lock-free algorithm from [17].
Although push and pop are lock-free, memory management
operations such as allocation of additional blocks, uses coarse
locks on the queue. Some of the reasons for poor performance
of concurrent queues are the the following :
1) Use of coarse locks on the queue
2) Frequent dynamic memory allocation
3) Cache misses, especially in linked-list based implemen-
tations
Linked-list based lock-free queue algorithms perform more
computation per push or pop, because of pointer updates
using compare-and-swap (CAS) instructions which have to
be retried on failure. Depending on the algorithm each queue
operation may require two CAS operations (one for updating
next pointer and the other for updating head/tail values).
Besides CAS retries, cache misses and page faults affect
the performance of this implementation if the nodes of the
linked-list do not have spatial locality. Spatial locality can be
improved by allocating memory in blocks.
In this implementation we used arrays instead of linked-lists,
because head and tail pointers can be replaced by integers
whose values can be updated using fetch-add instructions.
To limit memory allocation, we allocated memory for the
queue in blocks and recycled empty blocks. The pointers to
blocks are stored in a vector. The head and tail values are
incremented monotonically during successful push and pop
operations. Empty blocks are recycled by moving them to the
end of the vector. When blocks are recycled, head and tail
values are adjusted by adding negative shift offsets. Unlike
linked-list based algorithms, the completion of a push or pop
operation is not well-defined in array based algorithms. In
this implementation we initialized arrays with special EMPTY
values to detect scenarios where a pop precedes a push to the
same array position. Other options are to use atomic data types
for value fields and/or valid bits for every queue position. Let
N = n1+n2, where n1 and n2 are the number of push and pop
operations. Let C be the size of a block. The number of blocks
allocated is at most ⌈n1
C ⌉ and at least max(1, ⌈n1−n2
C ⌉). The
allocation of blocks depends on the sequence of push and pop
operations. Although this analysis is for a sequential queue,
it can be extended to concurrent queues. Any execution of
this concurrent queue can be linearized to a sequence of push
and pop operations that match a sequential execution where
the linearization points are the atomic instructions that update
head and tail values.
Fig. 2. Lock-free Queue : Memory Allocation
The lock-free queue is shown in figure 2.
IV. EXPERIMENTS : UNBOUNDED LOCK-FREE QUEUES
We compared the performance of Boost and TBB lock-free
queues with our implementation. The queues were tested for
different workloads consisting of 100m and 1b operations on
integers. The workloads had approximately 50% push and pop
operations. We used queue arrays, i.e one queue per thread,
for testing these workloads. We pinned threads to CPUs in
the order of thread ids, e.g thread 0 was pinned to CPU0.
A node has 8 NUMA nodes each consisting of 16 CPUs,
i.e CPUs 0-15 belonged to NUMA node 0, 16-31 belong to
NUMA node 1 and so on. During memory allocation, we used
huge pages and the scalable memory allocator from TBB for
allocating blocks. During a push operation, a thread would
choose a random queue in its NUMA node and push an
integer to its tail and during pop operations threads pop from
local queues. All memory accesses by threads were limited
to their NUMA nodes. Our implementation showed strong
scaling and performed better than the boost implementation
for all workloads. The results from Boost queues are not
presented here for brevity, because it did not show strong
scaling with increasing thread counts. The block size was fixed
at 10000 integers for all experiments. The results presented
in table I and graph 3 compare the performance of TBB
concurrent queues and our implementation 3, referred to as
lkfree. All observations are averaged over 5 repetitions. The
TBB implementation is based on Linearizable Non-blocking
FIFO queues (LCRQ) [18]. To describe briefly, the LCRQ
algorithm uses a linked-list of array-based micro-queues, with
head and tail values managed using fetch-add instructions
which removes the overheads of CAS retries. However, the
queue algorithm does not manage memory. The differences
in performance between TBB and our implementation are
due to additional work performed in our algorithm to recycle
blocks, and retries in push and pop caused by CAS failures.
We implemented strong completion semantics by using CAS
instructions for updating head and tail values. Our implemen-
tation retries in the following scenarios :
1) CAS failures during concurrent head, tail updates.
2) push finds queue full, allocates new chunk and retries.
3) A pop or push detects recycling.
Both push and pop operations are blocked when blocks are
being recycled. Recycling is part of pop, i.e empty blocks are
detected and moved to the end of the block vector. We are
currently working on a version of our queue that uses fetch-
add instead of CAS for head, tail updates.
#threads 100mtbb(s) 100mlkfree(s) 1btbb(s) 1blkfree(s)
4 2.390836 3.23806 14.9945 27.73456
8 1.444602 2.033946 9.728728 17.8607
16 1.85208 2.378378 15.65188 23.70332
32 0.9079336 1.286334 7.565792 11.6011
64 0.6056262 0.6874498 3.532416 5.99861
128 0.340152 0.3819218 3.279696 3.279922
TABLE I
CONCURRENT QUEUE PERFORMANCE FOR 100MILLION AND 1BILLION
WORKLOADS
V. MEMORY MANAGEMENT
In this section we discuss the memory management module
used in our skiplist and hash table implementations. In multi-
threaded programs memory allocation causes overheads when
a large number of threads make concurrent calls to malloc and
free. This can affect the scalability of the program. A memory
manager that manages memory allocations and performs re-
cycling can improve scalability by improving average spatial
and temporal localities in the program. TBB has a scalable
memory allocator [2] but it is common to all modules in a
program. Besides using a scalable allocator, it is useful if
data structures manage their own memory for better page and
cache locality within each module. In the programs described
in this article we used a concurrent memory manager which
reduced the total number of calls to malloc by allocating
Fig. 3. Lock-free Queues : Performance
memory in blocks. Deletion of allocated blocks is performed
when the data structure goes out of scope. When a node is
deleted in the hash table or skiplist, its pointer is enqueued
to a concurrent lock-free queue. Work-sharing and recycling
are the two strategies used for memory management. Suppose
a program requires N entities. Requests for new entities are
made using new and requests for removal of entities are made
using delete. All news are matched by deletes. Let C be the
size of a block. The number of blocks allocated is at most
⌈N
C ⌉ and at least 1. Since entities are recycled, the number of
blocks allocated depends on the order of news and deletes in
the program. The maximum number of blocks are allocated
when all news precede deletes. The number of blocks allocated
is 1 when new and delete alternate. We can consider news and
deletes as two different sequences of lengths at most N. Of all
such sequences, valid sequences are those where the number
of deletes are less than or equal to the number of news. The
average number of blocks in use by this memory manager is
given by the equation below, where k is the number of new
requests and i is the number of delete requests :
PN
k=1
Pk
i=0⌈k−i
C ⌉
PN
i=1 i
(5)
Although the analysis is provided for a sequential program,
it is also valid for a concurrent program. The new and
delete calls made by a concurrent program to the memory
manager are linearizable. The linearization point for new is
the atomic operation that increments the current valid address
in a block or the pop operation that supplies a recycled node
from the lock-free queue. The linearization point for delete
is the push operation to the lock-free queue. Linearization
ensures the correctness of the memory manager. It guarantees
that concurrent new requests obtain unique memory locations.
If multiple threads detect an empty queue, it could lead to
allocation of extra blocks.
VI. EXPERIMENTS : SKIPLIST
#threads 50m 50mt 100m 100mt
4 89.017692 91.69174 196.526594 200.8902
8 50.413236 52.50484 111.841188 115.2442
16 49.00225 50.78602 104.447884 107.574
32 36.9470984 38.1242 73.222608 75.37776
64 36.6563142 37.44254 68.64533 70.32372
128 35.5183464 36.10894 71.81675 73.0598
TABLE II
SKIPLIST PERFORMANCE FOR 50% ADDITIONS,FINDS WORKLOADS
Fig. 4. Skiplist Performance for Workload with 50%Additions, Finds
#threads 50m 50mt 100m 100mt
4 89.583868 92.27312 200.3361 202.4474
8 50.72543 52.83592 112.89952 115.187
16 49.26203 51.0288 103.15732 105.8804
32 35.6960194 36.87684 74.870353 76.4256
64 34.4410388 35.2427 72.731135 73.78952
128 35.264177 35.85784 72.914244 73.28778
TABLE III
SKIPLIST PERFORMANCE FOR WORKLOAD WITH 50%
ADDITIONS,FINDS,1% DELETIONS
Fig. 5. Skiplist Performance for Workloads with 50%Additions,Finds and
1%Deletions
We partitioned the skiplist into one skiplist per NUMA
node (skiplists0-7). The keys used for skiplist operations were
64-bit unsigned integers generated using hash functions from
boost [13]. The key space was partitioned across skiplists using
the upper 3-bits. Let n be the number of skiplists, nu the
number of NUMA nodes in use and ncpu, the number of CPUs
per NUMA node. Let T be the number of threads, Si the ith
skiplist and nsi
, the NUMA node it is assigned to.
nu = ⌈
T
ncpu
⌉ (6)
nsi = Si mod nu (7)
We used lock-free queues, one per thread for distributing
keys. The queues distributed keys with upper 3-bits equal to Si
to a random thread in nsi
. For example, T = 32, ncpu = 16,
nu = 2, the 8 skiplists were divided into odd-even groups.
All keys for even-numbered skiplists were serviced by threads
pinned to NUMA node 0, while threads pinned to NUMA
node 1, accessed odd-numbered skiplists. For large skiplists,
this reduced the number of pages accessed per NUMA node
and lowered overheads from page faults.
We tested the concurrent skiplist with two workloads.
1) Workload1 : This workload consisted of 50% addition
and 50% find operations on 8 skiplists.
2) Workload2 : This workload consisted of 1% deletion
and the remaining operations were 50% addition and
50% find operations.
The concurrent skiplists showed strong scaling for all
workloads, although performance dropped at high thread
counts. The graph in figure 4 and table II have the results
for workload1 for 10m, 50m and 100m operations. The
graphs in figures 5 have the results for workload2 for 50m
and 100m operations. The observations for workload2 are
tabulated in table III. Tables II and III contain the time
for skiplist operations (columns 50m,100m) as well as total
time (columns 50mt,100mt); time to fill the queues and
perform skiplist operations. At high thread counts, throughput
reduced due to contention, cache misses and page faults. We
did not compare the performance of our concurrent skiplist
with a randomized skiplist because it will have different
maximum height (level) and total number of links from our
version. Randomized skiplists also do not require re-balancing
operations to maintain logn height criterion. The total work
per operation are different and a performance comparison
between these two implementations may not be reasonable.
Our current design has high contention due to locks acquired
in L shape during Addition and LL shape during Deletion.
Using lock-free re-balancing implementations will improve the
performance by reducing contention. We used the memory
manager described in the previous section for allocating and
recycling nodes in the skiplist. Each skiplist had its own local
memory manager. We also used huge pages and the scalable
tbbmalloc implementation instead of malloc in our evaluation.
Skiplists do not have spatial or temporal locality in their
traversals. A root to terminal node path can occupy several
cache lines. Following different paths leads to cache misses
which affect the performance of large skiplists at high thread
counts.
VII. HASH TABLES
Parallel hash tables are not preferred by many applications
in computational science for storing and retrieving data due
to random memory accesses, resizing and rehashing require-
ments, which could affect strong scaling of these applications.
Some of the high performance parallel hash tables we found
are implementations using cuckoo hashing [19], hopscotch
hashing [20] and cilk [21]. We chose cilk (TBB) for com-
parison in this section because it is a dynamic hash table
implementation with resizing, rehashing and data migration.
Concurrent cuckoo and hopscotch hash table implementations
do not have scalable resizing options.
In our initial evaluation, we found the following reasons for
poor performance of concurrent hash table implementations :
1) Resizing and rehashing : Most implementations of dy-
namic concurrent hash tables, lock the entire hash table
during these operations, which affects scalability for the
thread counts and workload sizes discussed in this paper.
Rehashing also leads to overheads from page faults and
cache misses caused by data migration.
2) Page faults and cache misses : The performance of large
concurrent hash tables suffer from page faults and cache
misses caused by random memory accesses, especially
on many-core nodes with multiple NUMA nodes.
3) Memory management : Large workloads which perform
several insertions and deletions have overheads from
concurrent calls to mallocs and frees from a large
number of threads.
Contention was not a major cause for poor performance
because it was limited to hash keys that map to the same
slot. Fine-grained read-write locks per slot, worked well in
all our tests. We resolved collisions using singly linked-lists
or binary trees per slot. We tested with lock-free linked-lists
instead, but did not find noticeable gains. We finalized three
different flavors of concurrent hash tables to cover the entire
range. All implementations are MWMR.
1) Fixed number of slots and binary trees per slot to resolve
collisions.
2) Two-level hash tables with binary trees to resolve colli-
sion.
3) Split-order hash tables with singly linked-lists.
Define key-space Kp = [0, 2m
) for m − bit keys. Kp was
partitioned equally across all available NUMA nodes with a
concurrent hash table per node. We used a vector of lock-free
queues, one per thread, for distributing keys to their NUMA
nodes and load balancing the workload. For n NUMA nodes,
log(n) bits from the MSB of the key were used to locate
the NUMA node. The key was pushed to a random queue
in the node. Threads popped keys from their local queues and
performed operations on the nearest hash table. For scalability,
we used the concurrent lock-free queues from the previous
section. Like in the previous section, the number of NUMA-
nodes in use depends on the number of threads. NUMA-nodes
varied from 1 (4 threads) to 8 (128 threads).
VIII. EXPERIMENTS : HASH TABLES
In this section we compare the performance of three concur-
rent hash table implementations for three random workloads
with 10m, 100m and 1b operations with 50% insert and
find operations. We used hash functions from Boost [13] to
generate hash values by scrambling 64-bit integers. Let k be
a key and H(k) its hash key. For hash table size M, the slot
s for k is provided by the equation :
s = H(k)modM (8)
Since the bits are randomized using H(k), we used power-
of-two values for M, so that the modulo operation is the last
log(M) bits of H(k). Let M be the number of slots and N the
number of entries in the hash table. For all implementations
discussed here, N > M the mean load per slot was N
M for the
values for M and N used in our tests [22]. The hash function
distributed values without clustering and all slots were load
balanced with approximately N
M entries.
We did not include erase in this workload because it
would reduce the total size of the hash table and skew the
cache misses and page faults we intend to capture in our
experiments. The three implementations are described in detail
here. The first version has fixed number of slots and binary
trees for resolving collisions. The total number of slots was
kept constant and partitioned equally across the NUMA nodes
in use. Although this implementation showed scalability with
increasing thread counts, its execution time for large workloads
was affected by large binary tree traversals per slot. The second
version used another level of hash tables per slot and binary
trees at the second level. Let M1 be the number of slots in
the hash table and M2, the number of slots in the second
level tables. This implementation used different subsets of bits
from H(k) for the two levels. The lower log(M1) bits were
used for computing first level slots and the next log(M2)
bits were used to locate second level slots. The two-level
implementation had read-write locks per slot at both levels.
However, exclusive write locks are required at the first level
only during expansion of second level tables. For operations
on second level tables, threads acquired shared read locks at
first level for more concurrency. The two-level hash table had
shorter binary trees and the implementation performed better
than the first version at the cost of increased slots and memory
allocation for slots. The graph in figure 6 and the observations
in table IV show the performance of the two concurrent hash
table implementations. The two-level hash tables performed
better than the fixed size hash table for all workloads. We
used 8192 slots for both tables for 10m and 100m workloads.
The two-level table had either 1 or 2048 slots at the second
level. The number of slots per NUMA node was 8192
n for n
NUMA nodes in both implementations.
In the two-level implementation, the second level tables
were expanded when the number of hash table collisions
increased beyond a threshold value (10 entries) and they
were removed when the number of entries fell below this
threshold. Different second level tables can be expanded
#threads fixed10m twolevel10m fixed100m twolevel100m
4 1.8080762 1.8143984 21.56307 12.077078
8 1.4035088 0.9598364 12.79544 6.297646
16 1.4310018 0.5916096 10.666476 3.901922
32 0.6556778 0.404464 5.624658 2.081128
64 0.3043472 0.3143486 2.946662 1.433568
128 0.19882468 0.30740326 1.52265 1.392154
TABLE IV
PERFORMANCE OF FIXED-SIZE HASH TABLES VS TWO-LEVEL HASH
TABLES
Fig. 6. Fixed-size Hash Tables vs Two-level Hash Tables
concurrently because of fine-grained read-write locks per first
level slot. Both hash table implementations used the memory
manager described in the previous section. The two-level
implementation used a memory manager per first level slot.
This improved page and cache locality in second level tables.
Like in other experiments, we used huge pages and TBB’s
malloc implementation. The use of read-write locks increased
concurrency, by allowing concurrent finds on the same slot
using shared read locks and requiring exclusive write locks
for insert and erase.
The third implementation is a concurrent split-order hash
table. This implementation differs from other hash table im-
plementations in this section because the slots are separate
from nodes. Nodes are stored in a singly linked-list in the
sorted order of their reverse keys (LSB to MSB). We fixed
an initial number of slots (seed) and doubled the number
of slots whenever occupancy was higher than a threshold
value. The maximum number of collisions per slot and seed
value are variables for the implementation. For n slots and m
maximum collisions per slot, the occupancy of the hash table
was computed as n∗m. When the occupancy > n∗m, n was
increased to 2∗n. The new slots were filled during subsequent
hash table operations. When a hash table operation found an
empty slot, it was filled by recursively traversing same slots in
prior allocations, until a non-empty slot was found. Same slots
in previous allocations were located using bit masks. When a
non-empty slot was found, the linked-list was split by inserting
a dummy node in the list where keys differ in their MSB
position as described in [23]. For example, for seed = 512
and n = 512, when the number of slots was increased to 1024,
the 9th
bit (log(512)) was used to identify the split location,
where bits are numbered from 0−63. Splitting performed the
required rehashing without data migration. The seed slots were
initialized with dummy nodes. Although the implementation
discussed in [23] is lock-free, our implementation used read-
write locks on the full hash table as well as per slot. In
our implementation, a large slot vector was allocated during
initialization. Although exclusive write locks are required, a
resizing operation doubles the number of slots and exits, which
is a low-cost operation. The default split-order hash table did
not perform well due to cache misses and page faults during
rehashing. An empty slot had to recursively access prior slots
which caused overheads for slots that are far apart. A two-level
hierarchical split-order hash table with small number of seed
slots and resizing operations performed per table showed better
spatial locality and strong scaling at large thread counts for
these workloads. A comparison of overheads caused by cache
misses are presented in the graph in figure 7 and tabulated
in table V. The maximum number of collisions for both split-
order implementations was 16 entries. The seed for split-order
hash table was 8192. For hierarchical split-order hash tables,
we used 256 split-order tables at the first level with seed size
64 at the second level. Like the other two implementations, the
split-order tables were also partitioned across NUMA nodes.
We used a single memory manager for the default split-order
implementation and a memory manager per first level table
for the two-level version.
#threads 10mspo 10mtwolevelspo
4 4.1893104 1.8829426
8 4.384854 0.9649104
16 8.3696894 0.4804762
32 4.0107974 0.242256
64 2.2309622 0.1543608
128 1.18745908 0.11367386
TABLE V
CACHE PERFORMANCE OF ONE-LEVEL AND TWO-LEVEL
SPLIT-ORDER (SPO) HASH TABLES
Fig. 7. Cache behaviour of One-level and Two-level Split-Order (SPO) Hash
Tables
#threads 1btbb 1btwolevel 1btwolevelspo 1btwolevelspo t
4 118.3161 232.7967 232.75736 276.90575
8 73.53966 104.2352 114.29366 140.70575
16 68.52222 51.59966 54.72358 80.473225
32 33.24686 22.75578 37.08276 54.1749
64 19.337302 12.067834 18.75504 38.469875
128 13.61186 9.09778 11.3915 24.60945
TABLE VI
PERFORMANCE OF THREE HASH TABLE (TBB,TWO-LEVEL,TWO-LEVEL
SPO) IMPLEMENTATIONS
Fig. 8. Comparison of Three Concurrent Hash Tables (TBB,two-level,two-
level SPO)
From our experiments, hierarchical hash tables have better
page and cache hits due to spatial and temporal localities. For
the remaining experiments we used two-level tables with one
expansion and two-level split-order tables. We compared our
implementation with the concurrent hash table implementation
from TBB [2]. The TBB implementation is a two-level hash
table with independent expansion and shrinking. Similar to
split-order hash tables, empty slots are filled recursively.
But unlike the split-order algorithm, rehashing traverses all
entries in a slot, removes and adds them to new slots. A
comparison of our two hash implementations with the TBB
implementation for the 1b workload is presented in the graph
in figure 8 and tabulated in table VI. In table VI column
1btwolevelspo is the time for hash table operations and
column 1btwolevelspo t is the total time which includes time
to fill the distribution queues. The concurrent hash table from
TBB performs better than our two-level concurrent skiplist at
low thread counts due to fast read-write locks from TBB [2].
We noticed synchronization overheads due to Boost [13] read-
write locks in our implementations with large workloads.
All three implementations scale well with increasing thread
counts. Depending on the available memory and the needs of
the workload, a programmer may choose any of them. Based
on our experiments, we are currently working on a hierarchical
lock-free split-order hash table.
IX. RELATED WORK
The concurrent random skiplist [24] from Herlihy et.al
is a scalable design with a lock-free Find implementation.
It requires acquiring locks at all levels for a node and its
predecessor for Addition and Deletion operations. Our im-
plementation performs more computation per operation for
maintaining balance criterion. The concurrent skiplist imple-
mentation provided by Java is based on lock-free linked-
lists [25], but the skiplist is randomized. The contention in
our design can be reduced using a lock-free implementation
and re-balancing operations similar to those described in [16].
There are several scalable lock-free designs of concurrent
binary search trees (BST) [26], [27], [28], [29]. Some of these
BST implementations have included lock-free re-balancing
operations, but with relaxations in the BST definition or
balance criteria. A survey of concurrent BST performance can
be found in [30]. Skiplists are more convenient than binary
search trees for range searches because of the terminal linked-
list. All keys that lie within a range can be found by locating
the minimum key and following pointers whereas a binary tree
will have to perform depth-first traversal on sub-trees.
LCRQ [18] is one of the fastest concurrent unbounded
queues available. Algorithms for concurrent bounded length
lock-free queues have shown scalability on GPUs [31].
Bounded length queue implementations work well for work-
loads which have alternate push and pop. They also have better
page and cache hits from limiting the buffer size. We did
not follow that direction because we did not want to make
assumptions regarding our workloads.
Concurrent cuckoo [19] and hopscotch [20] hash table
implementations have shown better performance than TBB’s
implementation for fixed size tables and low occupancy. Their
performance gains come from higher concurrency due to fine-
grained locks per slot and better spatial and temporal localities
that leads to higher cache hits. Other widely used concurrent
hash tables are those that perform open-addressing. These
implementations have higher concurrency and are lock-free.
But we did not consider open-addressing hash tables in this
paper because they have potential problems from clustering
which affect the computational complexity of table operations.
The expected cost of search operations can be as high as
πN
2
1
2
for nearly full tables with linear probing [22]. This is
expensive for the workload sizes discussed in this paper. A
detailed discussion of different types of hash tables and hash
functions can be found in [32]. Deletion in concurrent open-
addressing designs are performed lazily by marking deleted
entries. [33] discusses open-addressing implementations that
have scaled well for fixed size tables. However, they have not
discussed performance for workloads which require frequent
resizing, especially with an implementation that supports lazy
deletion.
Data structure implementations optimized for NUMA ar-
chitectures have been discussed in [34]. They have used
redundancy and combiners to reduce memory accesses from
remote nodes. But the authors of article [34] have not dis-
cussed the correctness of their implementation, especially
since the different NUMA nodes will have different views
of the same set of global operations. In our experiments,
we used lock-free queues for inter-node communication and
found that it does not affect the scalability of the program in
spite of memory accesses from remote NUMA nodes for push
operations. Although we used two data structures, i.e a lock-
free queue and skiplist or hash table, we allocated separate
memory blocks for them which provided isolation and spatial
and temporal localities in our programs. In our experiments,
we filled the queues first before performing operations on
the data structures which is also a reason for good locality.
Programs which fill queues while performing operations on
data structures may also be interesting experiments for this
design with memory managers that perform recycling.
X. CONCLUSIONS AND FUTURE WORK
This paper discusses the design of a concurrent determin-
istic 1-2-3-4 skiplist. We found the implementation to be
scalable on many-core nodes. To the best of our knowledge,
this is the first concurrent implementation of a deterministic
skiplist. This implementation has guaranteed O(logn) cost for
all operations, unlike random skiplists which have variable
number of links. It is easier to analyse the performance of
skiplists and compare them with BSTs or red-black trees
using a deterministic implementation. We have also provided
performance results for two other widely used data structures
on many-core nodes. We found non-NUMA memory accesses
and page faults from remote NUMA nodes to be the most
expensive overheads on AMD Milan [11]. We also discussed
methods for memory management in many-core nodes. As part
of future work, we intend to design and implement a lock-free
concurrent deterministic skiplist, and generalize it to concur-
rent lock-free (a, b) trees. In future, we plan to port these
implementations to GPUs which have higher concurrency and
different types of memory latencies compared to CPUs. These
implementations can be easily made distributed by adding
another level of partitioning and interprocess communication
via MPI or RPCs. Since the implementations are correct and
linearizable at process level, the overall correctness of the
distributed implementation is guaranteed.
REFERENCES
[1] C. Lameter, “Numa (non-uniform memory access): An overview:
Numa becomes more common because memory controllers get close to
execution units on microprocessors.” Queue, vol. 11, no. 7, p. 40–51,
jul 2013. [Online]. Available: https://doi.org/10.1145/2508834.2513149
[2] J. Reinders, Intel Threading Building Blocks, 1st ed. USA: O’Reilly
& Associates, Inc., 2007.
[3] 2023. [Online]. Available: https://www.top500.org
[4] S. Habib, J. Insley, D. Daniel, P. Fasel, Z. Lukić, V. Morozov, N. Fron-
tiere, H. Finkel, A. Pope, K. Heitmann, K. Kumaran, V. Vishwanath,
and T. Peterka, “Hacc: Extreme scaling and performance across diverse
architectures,” Communications of the ACM, vol. 60, pp. 97–104, 12
2016.
[5] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears,
“Benchmarking cloud serving systems with ycsb,” in Proceedings of
the 1st ACM Symposium on Cloud Computing, ser. SoCC ’10. New
York, NY, USA: Association for Computing Machinery, 2010, p.
143–154. [Online]. Available: https://doi.org/10.1145/1807128.1807152
[6] M. Herlihy and N. Shavit, The Art of Multiprocessor Programming,
Revised Reprint, 1st ed. San Francisco, CA, USA: Morgan Kaufmann
Publishers Inc., 2012.
[7] R. Sedgewick, Algorithms in C - parts 1-4: fundamentals, data struc-
tures, sorting, searching (3. ed.). Addison-Wesley-Longman, 1998.
[8] J. I. Munro, T. Papadakis, and R. Sedgewick, “Deterministic skip lists,”
in Proceedings of the Third Annual ACM-SIAM Symposium on Discrete
Algorithms, ser. SODA ’92. USA: Society for Industrial and Applied
Mathematics, 1992, p. 367–375.
[9] D. Knuth, The Art Of Computer Programming, vol. 3: Sorting And
Searching. Addison-Wesley, 1973.
[10] delta, “— delta,” 2023, [Online; accessed 1-October-2012]. [Online].
Available: https://gateway.delta.ncsa.illinois.edu/wiki/?version=
[11] 2023. [Online]. Available: https://www.amd.com/en/processors/
epyc-7003-series
[12] W. Pugh, “Skip lists: A probabilistic alternative to balanced trees,”
Commun. ACM, vol. 33, no. 6, p. 668–676, jun 1990. [Online].
Available: https://doi.org/10.1145/78973.78977
[13] J. Siek, L.-Q. Lee, and A. Lumsdaine, The Boost Graph Library: User
Guide and Reference Manual. Addison-Wesley, 2002.
[14] K. Mehlhorn, Data Structures and Algorithms 1: Sorting and
Searching, ser. EATCS Monographs on Theoretical Computer Science.
Springer, 1984, vol. 1. [Online]. Available: https://doi.org/10.1007/
978-3-642-69672-5
[15] A. Srivastava and T. Brown, “Elimination (a,b)-trees with fast, durable
updates,” in Proceedings of the 27th ACM SIGPLAN Symposium on
Principles and Practice of Parallel Programming, ser. PPoPP ’22.
New York, NY, USA: Association for Computing Machinery, 2022, p.
416–430. [Online]. Available: https://doi.org/10.1145/3503221.3508441
[16] A. Braginsky and E. Petrank, “A lock-free b+tree,” in Proceedings
of the Twenty-Fourth Annual ACM Symposium on Parallelism in
Algorithms and Architectures, ser. SPAA ’12. New York, NY, USA:
Association for Computing Machinery, 2012, p. 58–67. [Online].
Available: https://doi.org/10.1145/2312005.2312016
[17] M. M. Michael and M. L. Scott, “Simple, fast, and practical non-
blocking and blocking concurrent queue algorithms,” in Proceedings
of the Fifteenth Annual ACM Symposium on Principles of Distributed
Computing, ser. PODC ’96. New York, NY, USA: Association
for Computing Machinery, 1996, p. 267–275. [Online]. Available:
https://doi.org/10.1145/248052.248106
[18] A. Morrison and Y. Afek, “Fast concurrent queues for x86 processors,”
in Proceedings of the 18th ACM SIGPLAN Symposium on Principles
and Practice of Parallel Programming, ser. PPoPP ’13. New York,
NY, USA: Association for Computing Machinery, 2013, p. 103–112.
[Online]. Available: https://doi.org/10.1145/2442516.2442527
[19] X. Li, D. G. Andersen, M. Kaminsky, and M. J. Freedman, “Algorithmic
improvements for fast concurrent cuckoo hashing,” in Proceedings of
the Ninth European Conference on Computer Systems, ser. EuroSys
’14. New York, NY, USA: Association for Computing Machinery,
2014. [Online]. Available: https://doi.org/10.1145/2592798.2592820
[20] M. Herlihy, N. Shavit, and M. Tzafrir, “Hopscotch hashing,” in Dis-
tributed Computing, G. Taubenfeld, Ed. Berlin, Heidelberg: Springer
Berlin Heidelberg, 2008, pp. 350–364.
[21] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson,
K. H. Randall, and Y. Zhou, “Cilk: An efficient multithreaded runtime
system,” SIGPLAN Not., vol. 30, no. 8, p. 207–216, aug 1995. [Online].
Available: https://doi.org/10.1145/209937.209958
[22] R. Sedgewick and P. Flajolet, An introduction to the analysis of algo-
rithms. Addison-Wesley-Longman, 1996.
[23] O. Shalev and N. Shavit, “Split-ordered lists: Lock-free extensible hash
tables,” J. ACM, vol. 53, no. 3, p. 379–405, may 2006. [Online].
Available: https://doi.org/10.1145/1147954.1147958
[24] M. Herlihy, Y. Lev, V. Luchangco, and N. Shavit, “A provably correct
scalable concurrent skip list,” in Conference On Principles of Distributed
Systems (OPODIS). Citeseer, vol. 103, 2006.
[25] T. L. Harris, “A pragmatic implementation of non-blocking linked-lists,”
in Proceedings of the 15th International Conference on Distributed
Computing, ser. DISC ’01. Berlin, Heidelberg: Springer-Verlag, 2001,
p. 300–314.
[26] A. Ramachandran and N. Mittal, “A fast lock-free internal binary
search tree,” in Proceedings of the 16th International Conference on
Distributed Computing and Networking, ser. ICDCN ’15. New York,
NY, USA: Association for Computing Machinery, 2015. [Online].
Available: https://doi.org/10.1145/2684464.2684472
[27] F. Ellen, P. Fatourou, E. Ruppert, and F. van Breugel, “Non-blocking
binary search trees,” in Proceedings of the 29th ACM SIGACT-SIGOPS
Symposium on Principles of Distributed Computing, ser. PODC ’10.
New York, NY, USA: Association for Computing Machinery, 2010, p.
131–140. [Online]. Available: https://doi.org/10.1145/1835698.1835736
[28] T. Brown, F. Ellen, and E. Ruppert, “A general technique for non-
blocking trees,” in Proceedings of the 19th ACM SIGPLAN Symposium
on Principles and Practice of Parallel Programming, ser. PPoPP ’14.
New York, NY, USA: Association for Computing Machinery, 2014, p.
329–342. [Online]. Available: https://doi.org/10.1145/2555243.2555267
[29] N. G. Bronson, J. Casper, H. Chafi, and K. Olukotun, “A practical
concurrent binary search tree,” SIGPLAN Not., vol. 45, no. 5,
p. 257–268, jan 2010. [Online]. Available: https://doi.org/10.1145/
1837853.1693488
[30] M. Arbel-Raviv, T. Brown, and A. Morrison, “Getting to the root
of concurrent binary search tree performance,” in Proceedings of the
2018 USENIX Conference on Usenix Annual Technical Conference, ser.
USENIX ATC ’18. USA: USENIX Association, 2018, p. 295–306.
[31] B. Kerbl, M. Kenzel, J. H. Mueller, D. Schmalstieg, and M. Steinberger,
“The broker queue: A fast, linearizable fifo queue for fine-granular
work distribution on the gpu,” in Proceedings of the 2018 International
Conference on Supercomputing, ser. ICS ’18. New York, NY, USA:
Association for Computing Machinery, 2018, p. 76–85. [Online].
Available: https://doi.org/10.1145/3205289.3205291
[32] G. H. Gonnet and R. Baeza-Yates, Handbook of Algorithms and Data
Structures: In Pascal and C (2nd Ed.). USA: Addison-Wesley Longman
Publishing Co., Inc., 1991.
[33] T. Maier, P. Sanders, and R. Dementiev, “Concurrent hash tables: Fast
and general(?)!” ACM Trans. Parallel Comput., vol. 5, no. 4, feb 2019.
[Online]. Available: https://doi.org/10.1145/3309206
[34] I. Calciu, S. Sen, M. Balakrishnan, and M. K. Aguilera, “Black-
box concurrent data structures for numa architectures,” SIGPLAN
Not., vol. 52, no. 4, p. 207–221, apr 2017. [Online]. Available:
https://doi.org/10.1145/3093336.3037721

More Related Content

Similar to Concurrent Deterministic Skiplist and Data Structures Performance

User_42751212015Module1and2pagestocompetework.pdf.docx
User_42751212015Module1and2pagestocompetework.pdf.docxUser_42751212015Module1and2pagestocompetework.pdf.docx
User_42751212015Module1and2pagestocompetework.pdf.docxdickonsondorris
 
2 Important Data Structure Interview Questions
2 Important Data Structure Interview Questions2 Important Data Structure Interview Questions
2 Important Data Structure Interview QuestionsGeekster
 
Data structures - Introduction
Data structures - IntroductionData structures - Introduction
Data structures - IntroductionDeepaThirumurugan
 
02 linked list_20160217_jintaekseo
02 linked list_20160217_jintaekseo02 linked list_20160217_jintaekseo
02 linked list_20160217_jintaekseoJinTaek Seo
 
Data Structure the Basic Structure for Programming
Data Structure the Basic Structure for ProgrammingData Structure the Basic Structure for Programming
Data Structure the Basic Structure for Programmingpaperpublications3
 
Skipl List implementation - Part 1
Skipl List implementation - Part 1Skipl List implementation - Part 1
Skipl List implementation - Part 1Amrith Krishna
 
Ch 1 intriductions
Ch 1 intriductionsCh 1 intriductions
Ch 1 intriductionsirshad17
 
Research Inventy : International Journal of Engineering and Science
Research Inventy : International Journal of Engineering and ScienceResearch Inventy : International Journal of Engineering and Science
Research Inventy : International Journal of Engineering and Scienceinventy
 
1.3 Linked List.pptx
1.3 Linked List.pptx1.3 Linked List.pptx
1.3 Linked List.pptxssuserd2f031
 
Data Structure the Basic Structure for Programming
Data Structure the Basic Structure for ProgrammingData Structure the Basic Structure for Programming
Data Structure the Basic Structure for Programmingpaperpublications3
 
Linked list in Data Structure and Algorithm
Linked list in Data Structure and Algorithm Linked list in Data Structure and Algorithm
Linked list in Data Structure and Algorithm KristinaBorooah
 
Linked list (introduction) 1
Linked list (introduction) 1Linked list (introduction) 1
Linked list (introduction) 1DrSudeshna
 
Data_structure.pptx
Data_structure.pptxData_structure.pptx
Data_structure.pptxpriya415376
 

Similar to Concurrent Deterministic Skiplist and Data Structures Performance (20)

User_42751212015Module1and2pagestocompetework.pdf.docx
User_42751212015Module1and2pagestocompetework.pdf.docxUser_42751212015Module1and2pagestocompetework.pdf.docx
User_42751212015Module1and2pagestocompetework.pdf.docx
 
2 Important Data Structure Interview Questions
2 Important Data Structure Interview Questions2 Important Data Structure Interview Questions
2 Important Data Structure Interview Questions
 
Data Structure
Data StructureData Structure
Data Structure
 
Data structures - Introduction
Data structures - IntroductionData structures - Introduction
Data structures - Introduction
 
02 linked list_20160217_jintaekseo
02 linked list_20160217_jintaekseo02 linked list_20160217_jintaekseo
02 linked list_20160217_jintaekseo
 
Data Structure the Basic Structure for Programming
Data Structure the Basic Structure for ProgrammingData Structure the Basic Structure for Programming
Data Structure the Basic Structure for Programming
 
Skipl List implementation - Part 1
Skipl List implementation - Part 1Skipl List implementation - Part 1
Skipl List implementation - Part 1
 
Ch 1 intriductions
Ch 1 intriductionsCh 1 intriductions
Ch 1 intriductions
 
Research Inventy : International Journal of Engineering and Science
Research Inventy : International Journal of Engineering and ScienceResearch Inventy : International Journal of Engineering and Science
Research Inventy : International Journal of Engineering and Science
 
Unit v
Unit vUnit v
Unit v
 
1.3 Linked List.pptx
1.3 Linked List.pptx1.3 Linked List.pptx
1.3 Linked List.pptx
 
Sorting algorithm
Sorting algorithmSorting algorithm
Sorting algorithm
 
Data Structure the Basic Structure for Programming
Data Structure the Basic Structure for ProgrammingData Structure the Basic Structure for Programming
Data Structure the Basic Structure for Programming
 
CSE 443 (1).pptx
CSE 443 (1).pptxCSE 443 (1).pptx
CSE 443 (1).pptx
 
Dl36675677
Dl36675677Dl36675677
Dl36675677
 
Linked list in Data Structure and Algorithm
Linked list in Data Structure and Algorithm Linked list in Data Structure and Algorithm
Linked list in Data Structure and Algorithm
 
Linked list (introduction) 1
Linked list (introduction) 1Linked list (introduction) 1
Linked list (introduction) 1
 
H010223640
H010223640H010223640
H010223640
 
Lu3520152020
Lu3520152020Lu3520152020
Lu3520152020
 
Data_structure.pptx
Data_structure.pptxData_structure.pptx
Data_structure.pptx
 

Recently uploaded

Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfJayanti Pande
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 
Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfchloefrazer622
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Celine George
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfAdmir Softic
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
General AI for Medical Educators April 2024
General AI for Medical Educators April 2024General AI for Medical Educators April 2024
General AI for Medical Educators April 2024Janet Corral
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...fonyou31
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfciinovamais
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...Sapna Thakur
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpinRaunakKeshri1
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3JemimahLaneBuaron
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room servicediscovermytutordmt
 

Recently uploaded (20)

Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdf
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
General AI for Medical Educators April 2024
General AI for Medical Educators April 2024General AI for Medical Educators April 2024
General AI for Medical Educators April 2024
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Advance Mobile Application Development class 07
Advance Mobile Application Development class 07
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpin
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room service
 

Concurrent Deterministic Skiplist and Data Structures Performance

  • 1. Concurrent Deterministic Skiplist and Other DataStructures Aparna Sasidharan Illinois Institute of Technology Chicago,IL,USA aparnasasidharan2017@gmail.com Abstract—Skiplists are used in a variety of applications for storing data subject to order criteria. In this article we dis- cuss the design, analysis and performance of a concurrent deterministic skiplist on many-core NUMA nodes [1]. We also evaluate the performance of a concurrent lock-free unbounded queue implementation and three concurrent multi-reader, multi- writer (MWMR) hash table implementations and compare them with those from Intel’s Thread Building Blocks (TBB) library [2]. We focus on strategies for memory management that reduce page faults and cache misses for the memory access patterns in these data structures. This paper proposes hierarchical usage of concurrent data structures in programs to improve memory latencies by reducing memory accesses from remote NUMA nodes. I. MOTIVATION Most modern high performance machines have many-core CPUs with high core counts [3]. The CPUs may be used along- side accelerators for compute intensive applications. Many computational science applications have shown scalability [4] on many-core and heterogeneous nodes due to their regular memory access patterns which have natural spatial and tem- poral localities, low page faults and cache misses. Distributed data-intensive workloads [5] which perform point location and range searches have less spatial and temporal locali- ties compared to applications from computational science. These workloads do not scale as well as their counterparts in computational science on high performance machines. In this paper, we discuss the performance of three fundamental data structures that are used for storing and querying data in data-intensive applications on many-core NUMA nodes [1]. Performance can be improved if search applications use scal- able concurrent data structures. For example, using suitable functions that convert queries to keys, point location may use concurrent hash tables and range searches may use concurrent skiplists for query processing. The focus of this article is a concurrent deterministic skiplist. Skiplists can function as ordered sets and are building blocks for ordered multi-sets and heaps. Concurrent ran- domized skiplist implementations contain nodes with random number of links (heights) [6]. The costs of insertion, search and deletion are O(logn) with high probability. The O(logn) computational complexity depends on the random number generator used by the implementation for generating node heights. The random number generator should generate nodes of height j + 1 with probability 1 t j where t is the number of nodes to skip at level 1 and j + 1 is the level, if levels are numbered from 1 to L. If probabilities follow this distribution, there will be 1 t j ∗ N links per level j + 1 and all randomized skiplist operations will be O(logn) with high probability [7]. A deterministic skiplist is balanced and its costs for insertion, find and deletion are O(logn). Our implementation is a concurrent version of 1-2-3-4 skiplist [8]. The skiplist has log(N) levels and the number of links at each level is at least 1 4 the links at the previous lower level. Levels are numbered from bottom (0) to root (logn). Leaf nodes are the links at level 1. Leaf nodes contain pointers to the nodes of a linked-list which stores keys and data (terminal nodes). The algorithms are discussed in detail in the remaining sections, along with performance. Dictionaries or Hash Tables are useful data structures for organizing data based on keys, where the key universe is much larger than the number of keys used in a program [9]. Operations on hash tables such as insert, delete and find can be performed in constant time per key. But the performance of concurrent hash tables is affected by random memory accesses which cause cache misses and page faults in programs which use large workloads. This paper discusses the performance of three implementations of concurrent multi-writer multi- reader (MWMR) hash tables. Queues are widely used for load balancing workloads, stor- ing data and for communication. A poor queue implementation can affect the scalability of a program. This article discusses controlled memory allocation and recycling to reduce page faults in data structures using concurrent lock-free queues as a motivating example. All experiments in this paper were performed on the Delta Supercomputer from the National Center for Supercomputing Applications (NCSA) [10] which has AMD Milan many- core NUMA nodes [11]. The paper is organized as follows. Section II discusses the design and analysis of the concurrent deterministic skiplist. Sections III, IV and V discuss strategies for memory management in concurrent queues. Scalability of the concurrent skiplist is discussed in section VI. Hash tables and their performance are discussed in sections VII and VIII. Sections IX and X have the related work, conclusions and future work. arXiv:2309.09359v1 [cs.DC] 17 Sep 2023
  • 2. Fig. 1. Deterministic Skiplist with 3 levels II. SKIPLIST In this section we discuss a concurrent 1-2-3-4 skiplist based on the sequential version from Munro and Sedgewick [8]. A skiplist is an ordered linked-list with additional links. The extra links reduce search costs by providing shortcuts to different intervals in the linked-list. Without these links, search costs can be as high as the length of the list O(n). In a random skiplist [12], lookup nodes divide the linked-list into non- overlapping intervals of random lengths. Although interval lengths are random, the total number of links traversed to find a node in a list with n nodes is O(logn) with high probability. A deterministic 1-2-3-4 skiplist divides the linked- list into intervals of at least 2 nodes (1 interval) and at most 5 nodes (4 contiguous intervals), with a hierarchy of links to these intervals. The data structure is described in figure 1 for a terminal linked-list with 7 nodes. Each non-terminal node in the skiplist stores a key and pointers to the next and bottom nodes. The children of a node have keys ≤ to its key. This invariant is used to maintain order among keys. The skiplist shown in figure 1, has two levels, with level 1 being the leaf level. Any two consecutive levels in the skiplist maintain the invariant that keys at level l + 1 are subsets of keys at level l. Addition and deletion operations on the terminal linked-list cause addition or deletion of non-terminal nodes to maintain the O(logn) number of links to reach any node in the linked- list. These are referred to as re-balancing operations in this paper. They also include operations that adjust the skiplist height (change in the number of levels). Addition (Insertion) operation traverses the skiplist recur- sively until it finds a level 1 node with key ≤ the new key. Nodes on the path from the root to the leaf node are checked for 1-2-3-4 criterion and re-balancing adds a new node if a node has 5 children. This operation avoids the creation of nodes with ≥ 6 children post addition. These are local operations which require locking nodes in L shape, i.e a node and its children. The maximum number of nodes locked by a thread at any time during addition is at most 6. Addition maintains sorted order in the terminal linked- list and new nodes are inserted after checking for duplicates. Algorithms for addition of keys are provided in algorithms 1 and 2. The key of the root node (head) is the maximum key (264 − 1). All terminal and non-terminal linked-lists are terminated with sentinel tail nodes. If re-balancing added a node as the next neighbor of the root, then, the skiplist’s height should be increased, using algorithm 3. Also, terminal linked-list nodes have sentinel bottom nodes. Sentinel nodes point to themselves, thereby avoiding null pointers in the implementation. We used wide unsigned integers (128-bits) to store the key and next pointer in the same data type. The 64-bit key is stored in the upper half and the 64-bit pointer is stored in the lower half of the wide integer. We used the wide unsigned integer implementation from Boost library [13] for this purpose. Boost supports atomic operations on wide integers. Key and pointer fields were extracted from wide integers using bit masks. Nodes also contain one additional bit field : mark. The mark bit is set when a node is deleted and is no longer in the skiplist. Since key and next pointers are updated atomically during addition and deletion, our Find implementation is lock-free. In our Find implementation, we try to obtain a set of valid nodes in L shape and make traversal decisions based on their keys. If a node or its children change state, it will be detected at the next recursive step. If a node is deleted, it can be detected by reading its mark field. If the key of a node was decreased by a re-balancing operation, the search can continue to the right, by following its next pointer. Deletion operation deletes a key from the terminal linked- list. Removal of nodes with deleted keys from non-terminal linked-lists is performed lazily to reduce work per deletion. If a non-terminal node (except root) in the deletion path has exactly 2 children, then, the number of children is increased by re-balancing, to maintain the 1-2-3-4 criteria post deletion. Deletion may perform one of two re-balancing operations borrow or merge at non-terminal nodes along the path from the root to the terminal linked-list. The merge operation removes nodes from non-terminal linked-lists by merging the children of two neighboring nodes and removing one of them. In the implementation described in this paper, merge removes the node with the higher key. The borrow operation, borrows a node from a neighbor keeping the number of nodes constant. We implemented borrow as a post operation of merge to maintain the correctness of Find. Algorithms used by deletion are provided in algorithms 5 and 6. We haven’t provided the deletion algorithm because its control flow is similar to the addition algorithm, except that it locks a pair of adjacent nodes and their children following LL pattern at every level. The pair of nodes for the next level of recursion is chosen from the union of child nodes. The previous neighbor of the child node containing the key to be deleted is the preferred partner. If a child node does not have a previous neighbor, then, its next neighbor is chosen. Deletion of nodes can lead
  • 3. to a skiplist where the root node has a 0 length interval. This is detected and the skiplist’s height is decreased, using the DecreaseDepth algorithm provided in algorithm 6. All function implementations Addition, Find and Deletion have retry semantics i.e if a function exited prematurely due to one of the following conditions, then it retries : 1) Marked node. 2) The next neighbor of root is not tail. 3) Root node has 0 length interval. The DropKey function removes a node with matching key from the terminal linked-list. The AddNode function adds a new node to the terminal linked-list if duplicates do not exist. The CheckNodeKey function is used to update a node’s key if the keys of its children are < than its key. This happens if the child node with the highest key was removed by Deletion or merge. The Acquire function locks a node and Release unlocks it. Since these are simple functions, pseudo-codes for them are not provided here. In our implementation we used a memory manager that recycles deleted nodes. We used reference counters in every node to avoid the ABA problem [6]. Reference counters were incremented during every recycling operation performed on a node. All operations on this concurrent skiplist are linearizable. The linearization points in the algorithms are the statements where key and next pointer are updated, i.e a new node is added or an existing node is deleted. The space requirement for this data structure is O(n) nodes for the terminal linked-list and at most Plogn i=1 n 2i nodes for the non-terminal linked-lists. Other data structures that can be used in place of skiplists for operations such as point location, range search and ranking are balanced binary search trees (BST) such as red-black trees or AVL trees and their variations. BSTs store keys and data in internal and external nodes, which makes their cost per search operation at most log(n). Compared to binary trees, a 1-2-3-4 skiplist has higher concurrency due to its branching factor. Skiplist operations can be performed concurrently at every level by locking a few subset of nodes. However, search costs in skiplists implementation are never less than log(n). Deterministic skiplists are similar in design to multi-way trees or (a, b) trees, a ≥ 2 and b ≥ 2 ∗ a [14]. (a, b) trees have equivalent re-balancing operations such as node splitting, node sharing and node fusing. (a, b) tree implementations usually perform re-balancing bottom-up. The total number of re-balancing operations can be bounded using the bottom-up analysis described in [14]. Define c = min(min(2a − 1, ⌈b+1 2 ⌉) − a, b − max(2a − 1, ⌊b+1 2 ⌋)). c ≥ 1 and b ≥ 2∗a. Also define a balance function b(v) on the nodes v of the (a, b) tree as : b(v) = min(ρ(r) − a, b − ρ(v), c) (1) where ρ(v) is the arity or the number of children of node v and r is the root node. For the root node, b(r) is defined as min(ρ(r) − 2, b − ρ(r), c). The balance of a sub-tree T of height h (bh(T)) is the sum of the balance values of the nodes of T. Let T0 be the initial tree and Tn the tree Algorithm 1: Skiplist : Addition Input: Node n, Key key Output: Integer procedure Addition(n, key) Acquire(n) kn ← n.get keynext() nkey←kn[127:64] nnext←kn[63:0] if isMarked(n) then Release(n) return RETRY end if if isSentinel(n) then Release(n) return FALSE end if if isHead(n) ∧¬ isSentinel(nnext) then Release(n) return RETRY end if cnodes← null b ← n.bottom cnodes← AcquireChildren(n) CheckNodeKey(n,cnodes) if nkey<key then r← nnext ReleaseChildren(cnodes) Release(n) return Addition(r,key) end if AdditionRebalance(n) if isLeaf(n) then ret← AddNode(n,key) ReleaseChildren(cnodes) Release(n) return ret end if nn ← null for d ∈ cnodes do dkn ← d.get keynext() dkey←dkn[127:64] if key≤ dkey then nn ← d break end if end for ReleaseChildren(cnodes) Release(n) return Addition(nn,key) end procedure obtained after n operations (additions and deletions). Tree operations such as additions and deletions decrease the balance of nodes by a constant value. Balance can be restored by re- balancing operations. Since they are performed bottom-up, re- balancing operations at height h increase the balance at h and decrease the balance at h+1. Let anh be the number of node additions (splitting), mnh the number of node merges (fusion) and bnh the number of node borrows (share) at level h. The recurrence for balance bh(Tn) can defined as :
  • 4. Algorithm 2: Skiplist : Add Non-Terminal Node Input: Node n Output: void procedure AdditionRebalance(n) cnodes ← Children(n) nnodes ← Count(cnodes) if nnodes > 4 then nn ← NewNode() kn ← n.get keynext() nkey← kn[127:64] nnext←kn[63:0] nn.update keynext(nkey, nnext) nn.bottom ← cnodes[2] kn ← cnodes[1].get keynext() nkey←kn[127:64] n.update keynext(nkey, nn) end if end procedure Algorithm 3: Skiplist : Increase Depth Input: Output: procedure IncreaseDepth() n← head Acquire(n) kn ← n.get keynext() nkey←kn[127:64] nnext←kn[63:0] nn← nnext if isSentinel(nn) then Release(n) return end if d← NewNode() d.bottom← n.bottom d.update keynext(nkey, nnext) n.bottom← d n.update keynext(maxV alue, tail) Release(n) return end procedure bh(Tn) ≥ bh(T0)−(anh−1 +mnh−1)+(2c∗anh +(c+1)∗mnh) (2) Rearranging, and setting bh(T0) = 0 for h ≥ 1 for empty tree T0, we get, anh + mnh ≤ anh−1 + mnh−1 c + 1 + bh(Tn) c + 1 (3) bnh ≤ mnh−1 − mnh (4) Solving the recurrences, we can bound the total number of re-balancing operations up to height h in an (a, b) tree to 2∗(c+2)∗ n (c+1)h for n additions and deletions. This value de- creases exponentially with increasing value of h. The complete proof can be found in [14]. From the analysis of (a, b) trees, we observe that most re-balancing operations are performed at the lower levels of the data structure. Since the lower levels have higher concurrency, several re-balancing operations can Algorithm 4: Skiplist : Algorithm for lock-free Find Input: Node n,Key key Output: Integer procedure Find(n,key) if isMarked(n) then return RETRY end if d← n.bottom, cnodes← null keys← null,nn← null kn ← n.get keynext() nkey ← kn[127:64],nnext ← kn[63:0] if isHead(n) ∧¬ isSentinel(nnext) then return RETRY end if if isSentinel(n) then return FALSE end if if ¬ isMarked(n) ∧ isSentinel(d) then if nkey=key then return TRUE end if else if nkey>key then return FALSE end if end if end if if nkey < key then return Find(nnext,key) end if while true do dn ← d.get keynext() if isMarked(n) ∨ isMarked(d) then return RETRY end if if dn[127:64] ≤ nkey ∧¬ isSentinel(d) then cnodes.add(d), keys.add(dn[127 : 64]) end if if dn[127:64] ≥ nkey ∨ isSentinel(d) then break end if d← dn[63:0] end while if isHead(n) then nnodes← Count(cnodes) if keys[nnodes-1]̸= maxkey then return RETRY end if end if c← 0 for k∈ keys do if key≤ k then nn← cnodes[c] break end if c← c+1 end for if nn̸= null then return Find(nn,key) end if else return FALSE end if end procedure
  • 5. Algorithm 5: Skiplist : MergeBorrow Node Input: Node n1,Node n2,Key key Output: Integer procedure MergeBorrowNode(n1,n2,key) n1nodes← Children(n1) n2nodes← Children(n2) n1kn ← n1.get keynext() n1key ← n1kn[127:64] fn ← key ≤ n1key ? true : false b ← fn ∧ Count(n1nodes)=2 c ← ¬ fn ∧ Count(n2nodes)=2 if b ∨ c then n2kn ← n2.get keynext() n2k ← n2kn[127:64] n2n ← n2kn[63:0] n1.update keynext(n2k, n2n) MarkNode(n2) Release(n2) n2← null if b ∧ Count(n2nodes)>2 then nn← NewNode() nn.update keynext(n2k, n2n) nn.update bottom(n2nodes[1]) n2kn ← n2nodes[0].get keynext() n1.update keynext(n2kn[127 : 64], nn) Acquire(nn) n2← nn end if else if c ∧ Count(n1nodes)>2 then nn← NewNode() nn.update keynext(n2k, n2n) p ← Count(n1nodes) p1 ← p-1,p2← p-2 nn.update bottom(n1nodes[p1]) n2kn ← n1nodes[p2].get keynext() n1.update keynext(n2kn[127 : 64], nn) Acquire(nn) n2← nn end if end if end if end procedure be performed in parallel. In our concurrent skiplist design, re- balancing is performed proactively during top-down traversals. Alternatively, re-balancing can be performed using bottom-up traversals post skiplist operations, which is the implementation described by Munro and Sedgewick [8]. In both cases, the total number of re-balancing operations is linear in the number of additions and deletions. The recurrences described above hold for both top-down and bottom-up re-balancing if b > 2∗a+1. When b = 2 ∗ a or b = 2 ∗ a + 1 (1-2-3-4 skiplist), the above recurrences are not valid for top-down re-balancing. For example, for a = 2,b = 5, a node with 5 children, when split, will create nodes with 2 and 3 children, where the 2 nodes will require re-balancing during deletions. But the number of re-balancing operations will still be linear in the number of operations. We did not use bottom-up re-balancing because it requires two traversals per operation. It will be difficult to maintain the Algorithm 6: Skiplist : Decrease Depth Input: Output: procedure DecreaseDepth() n← head Acquire(n) kn ← n.get keynext() nkey← kn[127:64] b← n.bottom Acquire(b) kn ← b.get keynext() bkey ← kn[127:64] if nkey=bkey∧¬ isSentinel(b) then bb ← b.bottom if bb̸= b then Acquire(bb) n.bottom← bb MarkNode(b) Release(bb) end if end if Release(b) Release(n) end procedure correctness of lock-free Find along with addition and deletion traversals in opposite directions. Multi-way search trees [15] discusses a concurrent implementation which performs re- balancing bottom-up but the Find implementation uses locks. The lock-free B+ -tree design from [16] maintains correctness by deleting nodes after every re-balancing operation. III. UNBOUNDED LOCK-FREE QUEUES We chose lock-free queues for this discussion because they are useful for load balancing workloads within and across nodes in many-core processors. Boost provides a lock-free queue implementation that allocates memory in blocks. Their implementation is based on the lock-free algorithm from [17]. Although push and pop are lock-free, memory management operations such as allocation of additional blocks, uses coarse locks on the queue. Some of the reasons for poor performance of concurrent queues are the the following : 1) Use of coarse locks on the queue 2) Frequent dynamic memory allocation 3) Cache misses, especially in linked-list based implemen- tations Linked-list based lock-free queue algorithms perform more computation per push or pop, because of pointer updates using compare-and-swap (CAS) instructions which have to be retried on failure. Depending on the algorithm each queue operation may require two CAS operations (one for updating next pointer and the other for updating head/tail values). Besides CAS retries, cache misses and page faults affect the performance of this implementation if the nodes of the linked-list do not have spatial locality. Spatial locality can be improved by allocating memory in blocks. In this implementation we used arrays instead of linked-lists, because head and tail pointers can be replaced by integers
  • 6. whose values can be updated using fetch-add instructions. To limit memory allocation, we allocated memory for the queue in blocks and recycled empty blocks. The pointers to blocks are stored in a vector. The head and tail values are incremented monotonically during successful push and pop operations. Empty blocks are recycled by moving them to the end of the vector. When blocks are recycled, head and tail values are adjusted by adding negative shift offsets. Unlike linked-list based algorithms, the completion of a push or pop operation is not well-defined in array based algorithms. In this implementation we initialized arrays with special EMPTY values to detect scenarios where a pop precedes a push to the same array position. Other options are to use atomic data types for value fields and/or valid bits for every queue position. Let N = n1+n2, where n1 and n2 are the number of push and pop operations. Let C be the size of a block. The number of blocks allocated is at most ⌈n1 C ⌉ and at least max(1, ⌈n1−n2 C ⌉). The allocation of blocks depends on the sequence of push and pop operations. Although this analysis is for a sequential queue, it can be extended to concurrent queues. Any execution of this concurrent queue can be linearized to a sequence of push and pop operations that match a sequential execution where the linearization points are the atomic instructions that update head and tail values. Fig. 2. Lock-free Queue : Memory Allocation The lock-free queue is shown in figure 2. IV. EXPERIMENTS : UNBOUNDED LOCK-FREE QUEUES We compared the performance of Boost and TBB lock-free queues with our implementation. The queues were tested for different workloads consisting of 100m and 1b operations on integers. The workloads had approximately 50% push and pop operations. We used queue arrays, i.e one queue per thread, for testing these workloads. We pinned threads to CPUs in the order of thread ids, e.g thread 0 was pinned to CPU0. A node has 8 NUMA nodes each consisting of 16 CPUs, i.e CPUs 0-15 belonged to NUMA node 0, 16-31 belong to NUMA node 1 and so on. During memory allocation, we used huge pages and the scalable memory allocator from TBB for allocating blocks. During a push operation, a thread would choose a random queue in its NUMA node and push an integer to its tail and during pop operations threads pop from local queues. All memory accesses by threads were limited to their NUMA nodes. Our implementation showed strong scaling and performed better than the boost implementation for all workloads. The results from Boost queues are not presented here for brevity, because it did not show strong scaling with increasing thread counts. The block size was fixed at 10000 integers for all experiments. The results presented in table I and graph 3 compare the performance of TBB concurrent queues and our implementation 3, referred to as lkfree. All observations are averaged over 5 repetitions. The TBB implementation is based on Linearizable Non-blocking FIFO queues (LCRQ) [18]. To describe briefly, the LCRQ algorithm uses a linked-list of array-based micro-queues, with head and tail values managed using fetch-add instructions which removes the overheads of CAS retries. However, the queue algorithm does not manage memory. The differences in performance between TBB and our implementation are due to additional work performed in our algorithm to recycle blocks, and retries in push and pop caused by CAS failures. We implemented strong completion semantics by using CAS instructions for updating head and tail values. Our implemen- tation retries in the following scenarios : 1) CAS failures during concurrent head, tail updates. 2) push finds queue full, allocates new chunk and retries. 3) A pop or push detects recycling. Both push and pop operations are blocked when blocks are being recycled. Recycling is part of pop, i.e empty blocks are detected and moved to the end of the block vector. We are currently working on a version of our queue that uses fetch- add instead of CAS for head, tail updates. #threads 100mtbb(s) 100mlkfree(s) 1btbb(s) 1blkfree(s) 4 2.390836 3.23806 14.9945 27.73456 8 1.444602 2.033946 9.728728 17.8607 16 1.85208 2.378378 15.65188 23.70332 32 0.9079336 1.286334 7.565792 11.6011 64 0.6056262 0.6874498 3.532416 5.99861 128 0.340152 0.3819218 3.279696 3.279922 TABLE I CONCURRENT QUEUE PERFORMANCE FOR 100MILLION AND 1BILLION WORKLOADS V. MEMORY MANAGEMENT In this section we discuss the memory management module used in our skiplist and hash table implementations. In multi- threaded programs memory allocation causes overheads when a large number of threads make concurrent calls to malloc and free. This can affect the scalability of the program. A memory manager that manages memory allocations and performs re- cycling can improve scalability by improving average spatial and temporal localities in the program. TBB has a scalable memory allocator [2] but it is common to all modules in a program. Besides using a scalable allocator, it is useful if data structures manage their own memory for better page and cache locality within each module. In the programs described in this article we used a concurrent memory manager which reduced the total number of calls to malloc by allocating
  • 7. Fig. 3. Lock-free Queues : Performance memory in blocks. Deletion of allocated blocks is performed when the data structure goes out of scope. When a node is deleted in the hash table or skiplist, its pointer is enqueued to a concurrent lock-free queue. Work-sharing and recycling are the two strategies used for memory management. Suppose a program requires N entities. Requests for new entities are made using new and requests for removal of entities are made using delete. All news are matched by deletes. Let C be the size of a block. The number of blocks allocated is at most ⌈N C ⌉ and at least 1. Since entities are recycled, the number of blocks allocated depends on the order of news and deletes in the program. The maximum number of blocks are allocated when all news precede deletes. The number of blocks allocated is 1 when new and delete alternate. We can consider news and deletes as two different sequences of lengths at most N. Of all such sequences, valid sequences are those where the number of deletes are less than or equal to the number of news. The average number of blocks in use by this memory manager is given by the equation below, where k is the number of new requests and i is the number of delete requests : PN k=1 Pk i=0⌈k−i C ⌉ PN i=1 i (5) Although the analysis is provided for a sequential program, it is also valid for a concurrent program. The new and delete calls made by a concurrent program to the memory manager are linearizable. The linearization point for new is the atomic operation that increments the current valid address in a block or the pop operation that supplies a recycled node from the lock-free queue. The linearization point for delete is the push operation to the lock-free queue. Linearization ensures the correctness of the memory manager. It guarantees that concurrent new requests obtain unique memory locations. If multiple threads detect an empty queue, it could lead to allocation of extra blocks. VI. EXPERIMENTS : SKIPLIST #threads 50m 50mt 100m 100mt 4 89.017692 91.69174 196.526594 200.8902 8 50.413236 52.50484 111.841188 115.2442 16 49.00225 50.78602 104.447884 107.574 32 36.9470984 38.1242 73.222608 75.37776 64 36.6563142 37.44254 68.64533 70.32372 128 35.5183464 36.10894 71.81675 73.0598 TABLE II SKIPLIST PERFORMANCE FOR 50% ADDITIONS,FINDS WORKLOADS Fig. 4. Skiplist Performance for Workload with 50%Additions, Finds #threads 50m 50mt 100m 100mt 4 89.583868 92.27312 200.3361 202.4474 8 50.72543 52.83592 112.89952 115.187 16 49.26203 51.0288 103.15732 105.8804 32 35.6960194 36.87684 74.870353 76.4256 64 34.4410388 35.2427 72.731135 73.78952 128 35.264177 35.85784 72.914244 73.28778 TABLE III SKIPLIST PERFORMANCE FOR WORKLOAD WITH 50% ADDITIONS,FINDS,1% DELETIONS Fig. 5. Skiplist Performance for Workloads with 50%Additions,Finds and 1%Deletions We partitioned the skiplist into one skiplist per NUMA node (skiplists0-7). The keys used for skiplist operations were
  • 8. 64-bit unsigned integers generated using hash functions from boost [13]. The key space was partitioned across skiplists using the upper 3-bits. Let n be the number of skiplists, nu the number of NUMA nodes in use and ncpu, the number of CPUs per NUMA node. Let T be the number of threads, Si the ith skiplist and nsi , the NUMA node it is assigned to. nu = ⌈ T ncpu ⌉ (6) nsi = Si mod nu (7) We used lock-free queues, one per thread for distributing keys. The queues distributed keys with upper 3-bits equal to Si to a random thread in nsi . For example, T = 32, ncpu = 16, nu = 2, the 8 skiplists were divided into odd-even groups. All keys for even-numbered skiplists were serviced by threads pinned to NUMA node 0, while threads pinned to NUMA node 1, accessed odd-numbered skiplists. For large skiplists, this reduced the number of pages accessed per NUMA node and lowered overheads from page faults. We tested the concurrent skiplist with two workloads. 1) Workload1 : This workload consisted of 50% addition and 50% find operations on 8 skiplists. 2) Workload2 : This workload consisted of 1% deletion and the remaining operations were 50% addition and 50% find operations. The concurrent skiplists showed strong scaling for all workloads, although performance dropped at high thread counts. The graph in figure 4 and table II have the results for workload1 for 10m, 50m and 100m operations. The graphs in figures 5 have the results for workload2 for 50m and 100m operations. The observations for workload2 are tabulated in table III. Tables II and III contain the time for skiplist operations (columns 50m,100m) as well as total time (columns 50mt,100mt); time to fill the queues and perform skiplist operations. At high thread counts, throughput reduced due to contention, cache misses and page faults. We did not compare the performance of our concurrent skiplist with a randomized skiplist because it will have different maximum height (level) and total number of links from our version. Randomized skiplists also do not require re-balancing operations to maintain logn height criterion. The total work per operation are different and a performance comparison between these two implementations may not be reasonable. Our current design has high contention due to locks acquired in L shape during Addition and LL shape during Deletion. Using lock-free re-balancing implementations will improve the performance by reducing contention. We used the memory manager described in the previous section for allocating and recycling nodes in the skiplist. Each skiplist had its own local memory manager. We also used huge pages and the scalable tbbmalloc implementation instead of malloc in our evaluation. Skiplists do not have spatial or temporal locality in their traversals. A root to terminal node path can occupy several cache lines. Following different paths leads to cache misses which affect the performance of large skiplists at high thread counts. VII. HASH TABLES Parallel hash tables are not preferred by many applications in computational science for storing and retrieving data due to random memory accesses, resizing and rehashing require- ments, which could affect strong scaling of these applications. Some of the high performance parallel hash tables we found are implementations using cuckoo hashing [19], hopscotch hashing [20] and cilk [21]. We chose cilk (TBB) for com- parison in this section because it is a dynamic hash table implementation with resizing, rehashing and data migration. Concurrent cuckoo and hopscotch hash table implementations do not have scalable resizing options. In our initial evaluation, we found the following reasons for poor performance of concurrent hash table implementations : 1) Resizing and rehashing : Most implementations of dy- namic concurrent hash tables, lock the entire hash table during these operations, which affects scalability for the thread counts and workload sizes discussed in this paper. Rehashing also leads to overheads from page faults and cache misses caused by data migration. 2) Page faults and cache misses : The performance of large concurrent hash tables suffer from page faults and cache misses caused by random memory accesses, especially on many-core nodes with multiple NUMA nodes. 3) Memory management : Large workloads which perform several insertions and deletions have overheads from concurrent calls to mallocs and frees from a large number of threads. Contention was not a major cause for poor performance because it was limited to hash keys that map to the same slot. Fine-grained read-write locks per slot, worked well in all our tests. We resolved collisions using singly linked-lists or binary trees per slot. We tested with lock-free linked-lists instead, but did not find noticeable gains. We finalized three different flavors of concurrent hash tables to cover the entire range. All implementations are MWMR. 1) Fixed number of slots and binary trees per slot to resolve collisions. 2) Two-level hash tables with binary trees to resolve colli- sion. 3) Split-order hash tables with singly linked-lists. Define key-space Kp = [0, 2m ) for m − bit keys. Kp was partitioned equally across all available NUMA nodes with a concurrent hash table per node. We used a vector of lock-free queues, one per thread, for distributing keys to their NUMA nodes and load balancing the workload. For n NUMA nodes, log(n) bits from the MSB of the key were used to locate the NUMA node. The key was pushed to a random queue in the node. Threads popped keys from their local queues and performed operations on the nearest hash table. For scalability, we used the concurrent lock-free queues from the previous section. Like in the previous section, the number of NUMA- nodes in use depends on the number of threads. NUMA-nodes varied from 1 (4 threads) to 8 (128 threads).
  • 9. VIII. EXPERIMENTS : HASH TABLES In this section we compare the performance of three concur- rent hash table implementations for three random workloads with 10m, 100m and 1b operations with 50% insert and find operations. We used hash functions from Boost [13] to generate hash values by scrambling 64-bit integers. Let k be a key and H(k) its hash key. For hash table size M, the slot s for k is provided by the equation : s = H(k)modM (8) Since the bits are randomized using H(k), we used power- of-two values for M, so that the modulo operation is the last log(M) bits of H(k). Let M be the number of slots and N the number of entries in the hash table. For all implementations discussed here, N > M the mean load per slot was N M for the values for M and N used in our tests [22]. The hash function distributed values without clustering and all slots were load balanced with approximately N M entries. We did not include erase in this workload because it would reduce the total size of the hash table and skew the cache misses and page faults we intend to capture in our experiments. The three implementations are described in detail here. The first version has fixed number of slots and binary trees for resolving collisions. The total number of slots was kept constant and partitioned equally across the NUMA nodes in use. Although this implementation showed scalability with increasing thread counts, its execution time for large workloads was affected by large binary tree traversals per slot. The second version used another level of hash tables per slot and binary trees at the second level. Let M1 be the number of slots in the hash table and M2, the number of slots in the second level tables. This implementation used different subsets of bits from H(k) for the two levels. The lower log(M1) bits were used for computing first level slots and the next log(M2) bits were used to locate second level slots. The two-level implementation had read-write locks per slot at both levels. However, exclusive write locks are required at the first level only during expansion of second level tables. For operations on second level tables, threads acquired shared read locks at first level for more concurrency. The two-level hash table had shorter binary trees and the implementation performed better than the first version at the cost of increased slots and memory allocation for slots. The graph in figure 6 and the observations in table IV show the performance of the two concurrent hash table implementations. The two-level hash tables performed better than the fixed size hash table for all workloads. We used 8192 slots for both tables for 10m and 100m workloads. The two-level table had either 1 or 2048 slots at the second level. The number of slots per NUMA node was 8192 n for n NUMA nodes in both implementations. In the two-level implementation, the second level tables were expanded when the number of hash table collisions increased beyond a threshold value (10 entries) and they were removed when the number of entries fell below this threshold. Different second level tables can be expanded #threads fixed10m twolevel10m fixed100m twolevel100m 4 1.8080762 1.8143984 21.56307 12.077078 8 1.4035088 0.9598364 12.79544 6.297646 16 1.4310018 0.5916096 10.666476 3.901922 32 0.6556778 0.404464 5.624658 2.081128 64 0.3043472 0.3143486 2.946662 1.433568 128 0.19882468 0.30740326 1.52265 1.392154 TABLE IV PERFORMANCE OF FIXED-SIZE HASH TABLES VS TWO-LEVEL HASH TABLES Fig. 6. Fixed-size Hash Tables vs Two-level Hash Tables concurrently because of fine-grained read-write locks per first level slot. Both hash table implementations used the memory manager described in the previous section. The two-level implementation used a memory manager per first level slot. This improved page and cache locality in second level tables. Like in other experiments, we used huge pages and TBB’s malloc implementation. The use of read-write locks increased concurrency, by allowing concurrent finds on the same slot using shared read locks and requiring exclusive write locks for insert and erase. The third implementation is a concurrent split-order hash table. This implementation differs from other hash table im- plementations in this section because the slots are separate from nodes. Nodes are stored in a singly linked-list in the sorted order of their reverse keys (LSB to MSB). We fixed an initial number of slots (seed) and doubled the number of slots whenever occupancy was higher than a threshold value. The maximum number of collisions per slot and seed value are variables for the implementation. For n slots and m maximum collisions per slot, the occupancy of the hash table was computed as n∗m. When the occupancy > n∗m, n was increased to 2∗n. The new slots were filled during subsequent hash table operations. When a hash table operation found an empty slot, it was filled by recursively traversing same slots in prior allocations, until a non-empty slot was found. Same slots in previous allocations were located using bit masks. When a non-empty slot was found, the linked-list was split by inserting a dummy node in the list where keys differ in their MSB position as described in [23]. For example, for seed = 512
  • 10. and n = 512, when the number of slots was increased to 1024, the 9th bit (log(512)) was used to identify the split location, where bits are numbered from 0−63. Splitting performed the required rehashing without data migration. The seed slots were initialized with dummy nodes. Although the implementation discussed in [23] is lock-free, our implementation used read- write locks on the full hash table as well as per slot. In our implementation, a large slot vector was allocated during initialization. Although exclusive write locks are required, a resizing operation doubles the number of slots and exits, which is a low-cost operation. The default split-order hash table did not perform well due to cache misses and page faults during rehashing. An empty slot had to recursively access prior slots which caused overheads for slots that are far apart. A two-level hierarchical split-order hash table with small number of seed slots and resizing operations performed per table showed better spatial locality and strong scaling at large thread counts for these workloads. A comparison of overheads caused by cache misses are presented in the graph in figure 7 and tabulated in table V. The maximum number of collisions for both split- order implementations was 16 entries. The seed for split-order hash table was 8192. For hierarchical split-order hash tables, we used 256 split-order tables at the first level with seed size 64 at the second level. Like the other two implementations, the split-order tables were also partitioned across NUMA nodes. We used a single memory manager for the default split-order implementation and a memory manager per first level table for the two-level version. #threads 10mspo 10mtwolevelspo 4 4.1893104 1.8829426 8 4.384854 0.9649104 16 8.3696894 0.4804762 32 4.0107974 0.242256 64 2.2309622 0.1543608 128 1.18745908 0.11367386 TABLE V CACHE PERFORMANCE OF ONE-LEVEL AND TWO-LEVEL SPLIT-ORDER (SPO) HASH TABLES Fig. 7. Cache behaviour of One-level and Two-level Split-Order (SPO) Hash Tables #threads 1btbb 1btwolevel 1btwolevelspo 1btwolevelspo t 4 118.3161 232.7967 232.75736 276.90575 8 73.53966 104.2352 114.29366 140.70575 16 68.52222 51.59966 54.72358 80.473225 32 33.24686 22.75578 37.08276 54.1749 64 19.337302 12.067834 18.75504 38.469875 128 13.61186 9.09778 11.3915 24.60945 TABLE VI PERFORMANCE OF THREE HASH TABLE (TBB,TWO-LEVEL,TWO-LEVEL SPO) IMPLEMENTATIONS Fig. 8. Comparison of Three Concurrent Hash Tables (TBB,two-level,two- level SPO) From our experiments, hierarchical hash tables have better page and cache hits due to spatial and temporal localities. For the remaining experiments we used two-level tables with one expansion and two-level split-order tables. We compared our implementation with the concurrent hash table implementation from TBB [2]. The TBB implementation is a two-level hash table with independent expansion and shrinking. Similar to split-order hash tables, empty slots are filled recursively. But unlike the split-order algorithm, rehashing traverses all entries in a slot, removes and adds them to new slots. A comparison of our two hash implementations with the TBB implementation for the 1b workload is presented in the graph in figure 8 and tabulated in table VI. In table VI column 1btwolevelspo is the time for hash table operations and column 1btwolevelspo t is the total time which includes time to fill the distribution queues. The concurrent hash table from TBB performs better than our two-level concurrent skiplist at low thread counts due to fast read-write locks from TBB [2]. We noticed synchronization overheads due to Boost [13] read- write locks in our implementations with large workloads. All three implementations scale well with increasing thread counts. Depending on the available memory and the needs of the workload, a programmer may choose any of them. Based on our experiments, we are currently working on a hierarchical lock-free split-order hash table. IX. RELATED WORK The concurrent random skiplist [24] from Herlihy et.al is a scalable design with a lock-free Find implementation.
  • 11. It requires acquiring locks at all levels for a node and its predecessor for Addition and Deletion operations. Our im- plementation performs more computation per operation for maintaining balance criterion. The concurrent skiplist imple- mentation provided by Java is based on lock-free linked- lists [25], but the skiplist is randomized. The contention in our design can be reduced using a lock-free implementation and re-balancing operations similar to those described in [16]. There are several scalable lock-free designs of concurrent binary search trees (BST) [26], [27], [28], [29]. Some of these BST implementations have included lock-free re-balancing operations, but with relaxations in the BST definition or balance criteria. A survey of concurrent BST performance can be found in [30]. Skiplists are more convenient than binary search trees for range searches because of the terminal linked- list. All keys that lie within a range can be found by locating the minimum key and following pointers whereas a binary tree will have to perform depth-first traversal on sub-trees. LCRQ [18] is one of the fastest concurrent unbounded queues available. Algorithms for concurrent bounded length lock-free queues have shown scalability on GPUs [31]. Bounded length queue implementations work well for work- loads which have alternate push and pop. They also have better page and cache hits from limiting the buffer size. We did not follow that direction because we did not want to make assumptions regarding our workloads. Concurrent cuckoo [19] and hopscotch [20] hash table implementations have shown better performance than TBB’s implementation for fixed size tables and low occupancy. Their performance gains come from higher concurrency due to fine- grained locks per slot and better spatial and temporal localities that leads to higher cache hits. Other widely used concurrent hash tables are those that perform open-addressing. These implementations have higher concurrency and are lock-free. But we did not consider open-addressing hash tables in this paper because they have potential problems from clustering which affect the computational complexity of table operations. The expected cost of search operations can be as high as πN 2 1 2 for nearly full tables with linear probing [22]. This is expensive for the workload sizes discussed in this paper. A detailed discussion of different types of hash tables and hash functions can be found in [32]. Deletion in concurrent open- addressing designs are performed lazily by marking deleted entries. [33] discusses open-addressing implementations that have scaled well for fixed size tables. However, they have not discussed performance for workloads which require frequent resizing, especially with an implementation that supports lazy deletion. Data structure implementations optimized for NUMA ar- chitectures have been discussed in [34]. They have used redundancy and combiners to reduce memory accesses from remote nodes. But the authors of article [34] have not dis- cussed the correctness of their implementation, especially since the different NUMA nodes will have different views of the same set of global operations. In our experiments, we used lock-free queues for inter-node communication and found that it does not affect the scalability of the program in spite of memory accesses from remote NUMA nodes for push operations. Although we used two data structures, i.e a lock- free queue and skiplist or hash table, we allocated separate memory blocks for them which provided isolation and spatial and temporal localities in our programs. In our experiments, we filled the queues first before performing operations on the data structures which is also a reason for good locality. Programs which fill queues while performing operations on data structures may also be interesting experiments for this design with memory managers that perform recycling. X. CONCLUSIONS AND FUTURE WORK This paper discusses the design of a concurrent determin- istic 1-2-3-4 skiplist. We found the implementation to be scalable on many-core nodes. To the best of our knowledge, this is the first concurrent implementation of a deterministic skiplist. This implementation has guaranteed O(logn) cost for all operations, unlike random skiplists which have variable number of links. It is easier to analyse the performance of skiplists and compare them with BSTs or red-black trees using a deterministic implementation. We have also provided performance results for two other widely used data structures on many-core nodes. We found non-NUMA memory accesses and page faults from remote NUMA nodes to be the most expensive overheads on AMD Milan [11]. We also discussed methods for memory management in many-core nodes. As part of future work, we intend to design and implement a lock-free concurrent deterministic skiplist, and generalize it to concur- rent lock-free (a, b) trees. In future, we plan to port these implementations to GPUs which have higher concurrency and different types of memory latencies compared to CPUs. These implementations can be easily made distributed by adding another level of partitioning and interprocess communication via MPI or RPCs. Since the implementations are correct and linearizable at process level, the overall correctness of the distributed implementation is guaranteed. REFERENCES [1] C. Lameter, “Numa (non-uniform memory access): An overview: Numa becomes more common because memory controllers get close to execution units on microprocessors.” Queue, vol. 11, no. 7, p. 40–51, jul 2013. [Online]. Available: https://doi.org/10.1145/2508834.2513149 [2] J. Reinders, Intel Threading Building Blocks, 1st ed. USA: O’Reilly & Associates, Inc., 2007. [3] 2023. [Online]. Available: https://www.top500.org [4] S. Habib, J. Insley, D. Daniel, P. Fasel, Z. Lukić, V. Morozov, N. Fron- tiere, H. Finkel, A. Pope, K. Heitmann, K. Kumaran, V. Vishwanath, and T. Peterka, “Hacc: Extreme scaling and performance across diverse architectures,” Communications of the ACM, vol. 60, pp. 97–104, 12 2016. [5] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears, “Benchmarking cloud serving systems with ycsb,” in Proceedings of the 1st ACM Symposium on Cloud Computing, ser. SoCC ’10. New York, NY, USA: Association for Computing Machinery, 2010, p. 143–154. [Online]. Available: https://doi.org/10.1145/1807128.1807152 [6] M. Herlihy and N. Shavit, The Art of Multiprocessor Programming, Revised Reprint, 1st ed. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2012.
  • 12. [7] R. Sedgewick, Algorithms in C - parts 1-4: fundamentals, data struc- tures, sorting, searching (3. ed.). Addison-Wesley-Longman, 1998. [8] J. I. Munro, T. Papadakis, and R. Sedgewick, “Deterministic skip lists,” in Proceedings of the Third Annual ACM-SIAM Symposium on Discrete Algorithms, ser. SODA ’92. USA: Society for Industrial and Applied Mathematics, 1992, p. 367–375. [9] D. Knuth, The Art Of Computer Programming, vol. 3: Sorting And Searching. Addison-Wesley, 1973. [10] delta, “— delta,” 2023, [Online; accessed 1-October-2012]. [Online]. Available: https://gateway.delta.ncsa.illinois.edu/wiki/?version= [11] 2023. [Online]. Available: https://www.amd.com/en/processors/ epyc-7003-series [12] W. Pugh, “Skip lists: A probabilistic alternative to balanced trees,” Commun. ACM, vol. 33, no. 6, p. 668–676, jun 1990. [Online]. Available: https://doi.org/10.1145/78973.78977 [13] J. Siek, L.-Q. Lee, and A. Lumsdaine, The Boost Graph Library: User Guide and Reference Manual. Addison-Wesley, 2002. [14] K. Mehlhorn, Data Structures and Algorithms 1: Sorting and Searching, ser. EATCS Monographs on Theoretical Computer Science. Springer, 1984, vol. 1. [Online]. Available: https://doi.org/10.1007/ 978-3-642-69672-5 [15] A. Srivastava and T. Brown, “Elimination (a,b)-trees with fast, durable updates,” in Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ser. PPoPP ’22. New York, NY, USA: Association for Computing Machinery, 2022, p. 416–430. [Online]. Available: https://doi.org/10.1145/3503221.3508441 [16] A. Braginsky and E. Petrank, “A lock-free b+tree,” in Proceedings of the Twenty-Fourth Annual ACM Symposium on Parallelism in Algorithms and Architectures, ser. SPAA ’12. New York, NY, USA: Association for Computing Machinery, 2012, p. 58–67. [Online]. Available: https://doi.org/10.1145/2312005.2312016 [17] M. M. Michael and M. L. Scott, “Simple, fast, and practical non- blocking and blocking concurrent queue algorithms,” in Proceedings of the Fifteenth Annual ACM Symposium on Principles of Distributed Computing, ser. PODC ’96. New York, NY, USA: Association for Computing Machinery, 1996, p. 267–275. [Online]. Available: https://doi.org/10.1145/248052.248106 [18] A. Morrison and Y. Afek, “Fast concurrent queues for x86 processors,” in Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ser. PPoPP ’13. New York, NY, USA: Association for Computing Machinery, 2013, p. 103–112. [Online]. Available: https://doi.org/10.1145/2442516.2442527 [19] X. Li, D. G. Andersen, M. Kaminsky, and M. J. Freedman, “Algorithmic improvements for fast concurrent cuckoo hashing,” in Proceedings of the Ninth European Conference on Computer Systems, ser. EuroSys ’14. New York, NY, USA: Association for Computing Machinery, 2014. [Online]. Available: https://doi.org/10.1145/2592798.2592820 [20] M. Herlihy, N. Shavit, and M. Tzafrir, “Hopscotch hashing,” in Dis- tributed Computing, G. Taubenfeld, Ed. Berlin, Heidelberg: Springer Berlin Heidelberg, 2008, pp. 350–364. [21] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou, “Cilk: An efficient multithreaded runtime system,” SIGPLAN Not., vol. 30, no. 8, p. 207–216, aug 1995. [Online]. Available: https://doi.org/10.1145/209937.209958 [22] R. Sedgewick and P. Flajolet, An introduction to the analysis of algo- rithms. Addison-Wesley-Longman, 1996. [23] O. Shalev and N. Shavit, “Split-ordered lists: Lock-free extensible hash tables,” J. ACM, vol. 53, no. 3, p. 379–405, may 2006. [Online]. Available: https://doi.org/10.1145/1147954.1147958 [24] M. Herlihy, Y. Lev, V. Luchangco, and N. Shavit, “A provably correct scalable concurrent skip list,” in Conference On Principles of Distributed Systems (OPODIS). Citeseer, vol. 103, 2006. [25] T. L. Harris, “A pragmatic implementation of non-blocking linked-lists,” in Proceedings of the 15th International Conference on Distributed Computing, ser. DISC ’01. Berlin, Heidelberg: Springer-Verlag, 2001, p. 300–314. [26] A. Ramachandran and N. Mittal, “A fast lock-free internal binary search tree,” in Proceedings of the 16th International Conference on Distributed Computing and Networking, ser. ICDCN ’15. New York, NY, USA: Association for Computing Machinery, 2015. [Online]. Available: https://doi.org/10.1145/2684464.2684472 [27] F. Ellen, P. Fatourou, E. Ruppert, and F. van Breugel, “Non-blocking binary search trees,” in Proceedings of the 29th ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing, ser. PODC ’10. New York, NY, USA: Association for Computing Machinery, 2010, p. 131–140. [Online]. Available: https://doi.org/10.1145/1835698.1835736 [28] T. Brown, F. Ellen, and E. Ruppert, “A general technique for non- blocking trees,” in Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ser. PPoPP ’14. New York, NY, USA: Association for Computing Machinery, 2014, p. 329–342. [Online]. Available: https://doi.org/10.1145/2555243.2555267 [29] N. G. Bronson, J. Casper, H. Chafi, and K. Olukotun, “A practical concurrent binary search tree,” SIGPLAN Not., vol. 45, no. 5, p. 257–268, jan 2010. [Online]. Available: https://doi.org/10.1145/ 1837853.1693488 [30] M. Arbel-Raviv, T. Brown, and A. Morrison, “Getting to the root of concurrent binary search tree performance,” in Proceedings of the 2018 USENIX Conference on Usenix Annual Technical Conference, ser. USENIX ATC ’18. USA: USENIX Association, 2018, p. 295–306. [31] B. Kerbl, M. Kenzel, J. H. Mueller, D. Schmalstieg, and M. Steinberger, “The broker queue: A fast, linearizable fifo queue for fine-granular work distribution on the gpu,” in Proceedings of the 2018 International Conference on Supercomputing, ser. ICS ’18. New York, NY, USA: Association for Computing Machinery, 2018, p. 76–85. [Online]. Available: https://doi.org/10.1145/3205289.3205291 [32] G. H. Gonnet and R. Baeza-Yates, Handbook of Algorithms and Data Structures: In Pascal and C (2nd Ed.). USA: Addison-Wesley Longman Publishing Co., Inc., 1991. [33] T. Maier, P. Sanders, and R. Dementiev, “Concurrent hash tables: Fast and general(?)!” ACM Trans. Parallel Comput., vol. 5, no. 4, feb 2019. [Online]. Available: https://doi.org/10.1145/3309206 [34] I. Calciu, S. Sen, M. Balakrishnan, and M. K. Aguilera, “Black- box concurrent data structures for numa architectures,” SIGPLAN Not., vol. 52, no. 4, p. 207–221, apr 2017. [Online]. Available: https://doi.org/10.1145/3093336.3037721