Writing High-Performance Software by Arvid Norberg

Writing High-Performance Software
arvid@bittorrent.com

Performance ⟺ Longer Battery Life
(Not only for when things need to run fast)

Memory Cache
Typical memory cache hierarchy (Core i5 Sandy Bridge)

Main memory (16 GiB)
L3 (6 MiB), one for all four cores

        Core 1     Core 2     Core 3     Core 4
L2      256 kiB    256 kiB    256 kiB    256 kiB
L1      32 kiB     32 kiB     32 kiB     32 kiB

BitTorrent, Inc. | Writing High-Performance Software

For Internal Presentation Purposes Only, Not For External Distribution.

Memory Latency
Memory latencies Core i5 Sandy Bridge

[Bar chart, built up over five slides: access latency for register, L1 cache, L2 cache, L3 cache and DRAM. The time axis is rescaled on each slide: 0 to 0.5 ns (register only), 0 to 1.3 ns (adding L1), 0 to 4 ns (adding L2), 0 to 15 ns (adding L3) and 0 to 90 ns (adding DRAM). The final chart carries the annotation "61.8 x".]

Source: http://www.7-cpu.com/cpu/IvyBridge.html

Memory Latency

• When a CPU is waiting for memory, it is busy (i.e. you will see 100% CPU
usage, even if your bottleneck is waiting for memory)
• Memory access patterns are a significant factor in performance
• Constant cache misses make your program run up to 2 orders of
magnitude slower than constant cache hits



Memory Cache

[Diagram: the memory you requested is a small part of a cache line; the whole cache line is the memory pulled into the cache]



Memory Latency

• CPUs prefetch memory automatically if they can recognize your access
pattern (sequential is easy)
• CPUs predict branches in order to prefetch instruction memory
• The memory access pattern is determined not only by data access but also
by control flow (indirect jumps stall execution on a memory lookup)



Memory Cache
For linear memory reads, the CPU will pre-fetch memory

[Diagram: a row of eight 64-byte cache lines]


Memory Cache
For random memory reads, there is no pre-fetch and most memory accesses will cause a cache miss

[Diagram: a row of eight 64-byte cache lines]
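
The two access patterns can be sketched in C++ roughly as follows. Both functions read the same elements and produce the same sum; only the order of the memory accesses differs. The function names and the idea of driving the random walk from a shuffled index array are assumptions made for illustration, and no timings are claimed here since they depend on the machine and the data size.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Sequential walk: consecutive addresses, an easy pattern to prefetch.
std::uint64_t sum_in_order(std::vector<std::uint32_t> const& data) {
    std::uint64_t sum = 0;
    for (std::uint32_t v : data) sum += v;
    return sum;
}

// Same elements, visited in the order given by `order`. With a shuffled
// order and a data set larger than the cache, most accesses are expected
// to touch a cache line that has not been fetched yet.
std::uint64_t sum_by_index(std::vector<std::uint32_t> const& data,
                           std::vector<std::size_t> const& order) {
    std::uint64_t sum = 0;
    for (std::size_t i : order) {
        assert(i < data.size());
        sum += data[i];
    }
    return sum;
}

// Hypothetical helper: a shuffled permutation of 0..n-1.
std::vector<std::size_t> shuffled_indices(std::size_t n, unsigned seed) {
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::mt19937 rng(seed);
    std::shuffle(order.begin(), order.end(), rng);
    return order;
}
```

Timing `sum_in_order` against `sum_by_index` on an array much larger than the L3 cache is one way to observe the effect of prefetching on a given machine.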


Data Structures

• Array of pointers to objects and linked lists:
more cache pressure, more cache misses
• Array of objects:
less cache pressure, more cache hits



Data Structures

• One optimization is to refactor your single list of heterogeneous objects
into one list per type.
• Objects would be laid out sequentially in memory
• Virtual function dispatch could become static



Data Structures

std::vector<std::unique_ptr<shape>> shapes;
for (auto& s : shapes) s->draw();



Data Structures


std::vector<rectangle> rectangles;
std::vector<circle> circles;
for (auto& s : rectangles) s.draw();
for (auto& s : circles) s.draw();



Data Structures



Array of pointers: pointers need dereferencing + vtable lookup
One array per type: objects packed back-to-back, sequential memory access, no vtable lookup
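
A self-contained sketch of the two layouts, for comparison. The slides only show `draw()` calls; the `area()` member, the `_v` suffix on the polymorphic types and the `total_area` functions are assumptions introduced here so that the example produces a checkable result.

```cpp
#include <cassert>
#include <memory>
#include <vector>

constexpr double pi = 3.141592653589793;

// Layout 1: one heterogeneous list. Each element is a pointer to a
// separately heap-allocated object; each call is a virtual dispatch.
struct shape {
    virtual ~shape() = default;
    virtual double area() const = 0;
};
struct rectangle_v final : shape {
    double w, h;
    rectangle_v(double w_, double h_) : w(w_), h(h_) {}
    double area() const override { return w * h; }
};
struct circle_v final : shape {
    double r;
    explicit circle_v(double r_) : r(r_) { assert(r_ >= 0); }
    double area() const override { return pi * r * r; }
};

double total_area(std::vector<std::unique_ptr<shape>> const& shapes) {
    double sum = 0;
    for (auto const& s : shapes) sum += s->area();
    return sum;
}

// Layout 2: one list per type. Objects sit back-to-back in each vector
// and the calls are resolved at compile time.
struct rectangle {
    double w, h;
    double area() const { return w * h; }
};
struct circle {
    double r;
    double area() const { return pi * r * r; }
};

double total_area(std::vector<rectangle> const& rectangles,
                  std::vector<circle> const& circles) {
    double sum = 0;
    for (auto const& s : rectangles) sum += s.area();
    for (auto const& s : circles) sum += s.area();
    return sum;
}
```

Both overloads compute the same total; the second gives up the single mixed-order list in exchange for contiguous storage and static dispatch.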

Data Structures

• For heap-allocated objects, put the most commonly used (“hot”) fields in
the first cache line
• Avoid unnecessary padding



Data Structures

Padding

struct A {
    int a;
    void* b;
    int c;
};

struct A [24 Bytes]
 0: [int : 4] a
    --- 4 Bytes padding ---
 8: [void* : 8] b
16: [int : 4] c
    --- 4 Bytes padding ---

struct B {
    void* b;
    int a;
    int c;
};

struct B [16 Bytes]
 0: [void* : 8] b
 8: [int : 4] a
12: [int : 4] c
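
The layouts above can be checked at compile time with `sizeof` and `offsetof`. The sketch below assumes a typical 64-bit ABI (4-byte `int`, 8-byte pointers aligned to 8 bytes); the `typical_64bit` guard is a name introduced here so that the checks are skipped on platforms where those assumptions do not hold.

```cpp
#include <cassert>
#include <cstddef>

struct A {
    int a;
    void* b;
    int c;
};

struct B {
    void* b;
    int a;
    int c;
};

// The numbers on the slide only apply where int is 4 bytes and pointers
// are 8 bytes with 8-byte alignment.
constexpr bool typical_64bit =
    sizeof(int) == 4 && sizeof(void*) == 8 && alignof(void*) == 8;

// A: 4 (a) + 4 padding + 8 (b) + 4 (c) + 4 padding = 24
static_assert(!typical_64bit || sizeof(A) == 24, "unexpected size of A");
static_assert(!typical_64bit || offsetof(A, b) == 8, "b should be at 8");
static_assert(!typical_64bit || offsetof(A, c) == 16, "c should be at 16");

// B: 8 (b) + 4 (a) + 4 (c) = 16, no padding
static_assert(!typical_64bit || sizeof(B) == 16, "unexpected size of B");
static_assert(!typical_64bit || offsetof(B, a) == 8, "a should be at 8");
static_assert(!typical_64bit || offsetof(B, c) == 12, "c should be at 12");
```

Ordering members from largest to smallest alignment is a common way to avoid interior padding, though the hot-fields-first guideline may pull in a different direction.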



Context Switching

• One significant source of cache misses is switching context, and
switching the data set being worked on
• Context switch
• Thread / process switching
• User space -> kernel space



Context Switching

• Lower the cost of context switching by amortizing it over as much work
as possible
• Reduce the number of system calls by passing as much work as
possible per call
• Reduce thread wake-ups/sleeps by batching work



Context Switching

When a thread wakes up, do as much work as possible before going to sleep:
• Drain the socket of received bytes
• Drain the job queue



Context Switching (traffic analogy)
One car at a time



Context Switching (traffic analogy)
The whole queue at a time



Context Switching

• Every time the socket becomes readable, read and handle one request

buf = socket.read_one_request()
req = parse_request(buf)
handle_req(socket, req)



Context Switching

• Drain the socket each time it becomes readable
• Parse and handle each request that was received

buf.append(socket.read_all())
req, buf = parse_request(buf)
while req is not None:
    handle_req(socket, req)
    req, buf = parse_request(buf)



Context Switching

• Write all responses in a single call at the end

buf.append(socket.read_all())
socket.cork()    # don't flush buffer to socket until all messages are handled
req, buf = parse_request(buf)
while req is not None:
    handle_req(socket, req)
    req, buf = parse_request(buf)
socket.uncork()
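
The drain-then-flush loop can be sketched in self-contained C++ as below. No real socket is involved: `fake_socket`, its `cork()`/`uncork()` members, the newline-delimited request format and the `"ok "` responses are all assumptions made for illustration. On Linux, a comparable effect on a TCP socket can be obtained with the `TCP_CORK` socket option.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a socket. While corked, writes are buffered;
// uncork() sends everything that accumulated in one underlying write.
struct fake_socket {
    std::string pending;        // responses not yet sent
    std::string sent;           // everything that reached the "wire"
    int underlying_writes = 0;  // stands in for the number of system calls
    bool corked = false;

    void cork() { corked = true; }
    void write(std::string const& data) {
        pending += data;
        if (!corked) flush();
    }
    void uncork() {
        corked = false;
        flush();
    }

private:
    void flush() {
        if (pending.empty()) return;
        sent += pending;
        pending.clear();
        ++underlying_writes;
    }
};

// Pops one complete newline-terminated request off the front of buf.
// Returns false if buf does not hold a complete request.
bool parse_request(std::string& buf, std::string& req) {
    std::size_t const eol = buf.find('\n');
    if (eol == std::string::npos) return false;
    req = buf.substr(0, eol);
    buf.erase(0, eol + 1);
    return true;
}

void handle_req(fake_socket& s, std::string const& req) {
    s.write("ok " + req + "\n");
}

// Called when the socket becomes readable; `received` is everything
// that was drained from it in one go.
void on_readable(fake_socket& s, std::string& buf,
                 std::string const& received) {
    buf += received;
    s.cork();
    std::string req;
    while (parse_request(buf, req)) handle_req(s, req);
    s.uncork();  // one underlying write for all responses
}
```

Three requests arriving together produce one underlying write instead of three, and a trailing partial request stays in `buf` until the rest of it arrives.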


Socket Programming

• There are two ways to read from sockets
• Wait for readable event then read (POSIX)
• Read async. then wait for completion event (Win32)



Socket Programming

// Wait for the socket to become readable
struct kevent ev[100];
int events = kevent(queue, nullptr, 0, ev, 100, nullptr);
for (int i = 0; i < events; ++i) {
    // Copy data from kernel space to user space
    int size = read(ev[i].ident, buffer, buffer_size);
    /* ... */
}


Socket Programming

// Initiate async. read into buffer
WSABUF b = { buffer_size, buffer };
DWORD transferred = 0, flags = 0;
WSAOVERLAPPED ov; // [ initialization ]
int ret = WSARecv(s, &b, 1, &transferred, &flags, &ov, nullptr);

// Wait for operations to complete
WSAOVERLAPPED* ol;
ULONG_PTR key; // completion key, filled in by the call
BOOL r = GetQueuedCompletionStatus(port, &transferred, &key, &ol, INFINITE);

// Query status
ret = WSAGetOverlappedResult(s, &ov, &transferred, FALSE, &flags);

Socket Programming

• Passing in a buffer up-front is preferable because:
• NIC driver can in theory receive data directly into your buffer and
save a copy
• If there is a memory copy, it can be done asynchronously, not
blocking your thread



Socket Programming

• Problem: What buffer size should be used?
• Too large will waste memory
• Too small will waste system calls
(since we need multiple calls to drain the socket)



Socket Programming

• Problem: What buffer size should be used?
• Start with some reasonable buffer size
• If an async read fills the whole buffer, increase size
• If an async read returns significantly less than the buffer size,
decrease size
Size adjustments should be proportional to the buffer size
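
One possible shape for this adjustment rule is sketched below. The function name, the growth and shrink factors (50% and 25%), the "less than half" threshold and the size limits are all assumptions chosen for illustration; the slides only state that adjustments should be proportional to the buffer size.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Assumed limits, not taken from the slides.
constexpr std::size_t min_buffer_size = 1024;
constexpr std::size_t max_buffer_size = 4 * 1024 * 1024;

// Returns the buffer size to use for the next read, given the size used
// for the last read and how many bytes that read returned.
std::size_t next_buffer_size(std::size_t current, std::size_t bytes_read) {
    assert(current > 0);
    std::size_t next = current;
    if (bytes_read >= current) {
        // The buffer was filled, so there was probably more to read.
        // Grow by 50% of the current size.
        next = current + current / 2;
    } else if (bytes_read < current / 2) {
        // Significantly under-used: shrink by 25% of the current size.
        next = current - current / 4;
    }
    return std::min(std::max(next, min_buffer_size), max_buffer_size);
}
```

For example, a 4096-byte buffer that was filled grows to 6144 bytes, one that received 1000 bytes shrinks to 3072 bytes, and one that received 3000 bytes stays at 4096 bytes.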



Context Switching

Adapt batch size to the computer’s natural granularity
Higher load should lead to larger batches, fewer context switches and higher efficiency.
Use of magic numbers is a red flag



Message Queues

• Events on message queues may come in batches
• Example: we receive one message per 16 kiB block read from disk.

void conn::on_disk_read(buffer const& buf) {
    m_socket.write(&buf[0], buf.size());
}



Message Queues

• Problem: We want to flush our sockets right before we go to sleep, i.e.
when we have drained the message queue, without starvation



Message Queues
void conn::on_disk_read(buffer const& buf) {
    // Instead of writing to the socket, accumulate the buffers
    m_buf.insert(m_buf.end(), buf);
    // If there is no outstanding flush message, post one
    if (m_has_flush_msg) return;
    m_has_flush_msg = true;
    m_queue.post(std::bind(&conn::flush, this));
}

// Flush all buffers when all messages have been handled
void conn::flush() {
    m_socket.write(&m_buf[0], m_buf.size());
    // Reset, so that later reads accumulate again and post a new flush
    m_buf.clear();
    m_has_flush_msg = false;
}


Message Queues

[Diagram: a FIFO message queue feeding the message handler; a flush message is posted to the queue]
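
The flush-message pattern can be tried out in isolation with the sketch below. The `message_queue` type, the use of `std::string` as the buffer and the `m_writes` log that stands in for `m_socket.write()` are assumptions made so that the example is self-contained; only the `on_disk_read`/`flush` logic follows the slides.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical single-threaded FIFO message queue.
struct message_queue {
    std::deque<std::function<void()>> messages;

    void post(std::function<void()> m) { messages.push_back(std::move(m)); }

    // Runs messages in FIFO order until the queue is empty, including
    // messages posted by the handlers themselves.
    void run() {
        while (!messages.empty()) {
            std::function<void()> m = std::move(messages.front());
            messages.pop_front();
            m();
        }
    }
};

struct conn {
    explicit conn(message_queue& q) : m_queue(q) {}

    void on_disk_read(std::string const& buf) {
        m_buf += buf;                 // accumulate instead of writing
        if (m_has_flush_msg) return;  // a flush is already queued
        m_has_flush_msg = true;
        m_queue.post([this] { flush(); });
    }

    void flush() {
        m_writes.push_back(m_buf);    // stands in for m_socket.write()
        m_buf.clear();
        m_has_flush_msg = false;
    }

    std::vector<std::string> m_writes;  // one entry per socket write

private:
    message_queue& m_queue;
    std::string m_buf;
    bool m_has_flush_msg = false;
};
```

Because the flush message is posted to the back of the FIFO queue, it runs after the disk-read messages that were already queued, so a burst of reads results in one write. Messages that arrive after the flush post a new flush message, which is why the flag is cleared in `flush()`.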


