Writing High-Performance Software by Arvid Norberg
BitTorrent, Inc. | Writing High-Performance Software
For Internal Presentation Purposes Only, Not For External Distribution .
8.
Memory Latency
Memory latencies Core i5 Sandy Bridge
[Bar chart: access latency of register, L1 cache, L2 cache, L3 cache and DRAM; horizontal axis from 0 ns to 90 ns; annotation "61.8 x"]
http://www.7-cpu.com/cpu/IvyBridge.html
9.
Memory Latency
• While a CPU is waiting for memory it still counts as busy (i.e. you will see 100% CPU usage, even if your bottleneck is waiting for memory)
• Memory access patterns are a significant factor in performance
• Constant cache misses can make your program run up to 2 orders of magnitude slower than constant cache hits
10.
Memory Cache
[Diagram: the memory you requested shown inside a larger cache line; the whole cache line is the memory pulled into the cache]
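To make the cache line idea concrete, here is a minimal sketch that counts how many cache lines a read touches. The 64-byte line size is an assumption taken from the later slides, and `cache_lines_touched` is a hypothetical helper, not something from the talk.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache line size (the later slides show 64-byte lines).
constexpr std::uintptr_t cache_line_size = 64;

// Number of cache lines that have to be pulled in to read `size` bytes
// starting at address `addr`.
inline std::size_t cache_lines_touched(std::uintptr_t addr, std::size_t size) {
    assert(size > 0);
    const std::uintptr_t first = addr / cache_line_size;
    const std::uintptr_t last = (addr + size - 1) / cache_line_size;
    return static_cast<std::size_t>(last - first + 1);
}
```

An 8-byte read at offset 60 straddles a line boundary, so it pulls in two lines even though only 8 bytes were requested.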
11.
Memory Latency
• CPUs prefetch memory automatically if they can recognize your access
pattern (sequential is easy)
• CPUs predict branches in order to prefetch instruction memory
• Memory access pattern is not only determined by data access but also
control flow (indirect jumps stall execution on a memory lookup)
12.
Memory Cache
For linear memory reads, the CPU will pre-fetch memory
[Diagram: eight adjacent 64-byte cache lines]
14.
Memory Cache
For random memory reads, there is no pre-fetch and most memory accesses will cause a cache miss
[Diagram: eight adjacent 64-byte cache lines]
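The difference between linear and random reads can be measured with a sketch like the one below. It sums the same array twice, once in index order and once in shuffled order, so the work is identical and only the access pattern changes. All function names here are hypothetical, and the timings depend on the machine and on the array being much larger than the caches.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Visiting order 0, 1, 2, ...: easy for the prefetcher to recognize.
inline std::vector<std::uint32_t> sequential_order(std::size_t n) {
    std::vector<std::uint32_t> order(n);
    std::iota(order.begin(), order.end(), 0u);
    return order;
}

// The same indices shuffled: consecutive accesses hit unrelated cache lines.
inline std::vector<std::uint32_t> random_order(std::size_t n, unsigned seed = 1) {
    std::vector<std::uint32_t> order = sequential_order(n);
    std::mt19937 rng(seed);
    std::shuffle(order.begin(), order.end(), rng);
    return order;
}

// Sum data[] in the given order.
inline std::uint64_t sum_in_order(const std::vector<std::uint64_t>& data,
                                  const std::vector<std::uint32_t>& order) {
    std::uint64_t sum = 0;
    for (std::uint32_t i : order) sum += data[i];
    return sum;
}

// Wall-clock seconds taken by one traversal; the sum is returned through
// sum_out so the compiler cannot discard the loop.
inline double seconds_to_sum(const std::vector<std::uint64_t>& data,
                             const std::vector<std::uint32_t>& order,
                             std::uint64_t& sum_out) {
    const auto t0 = std::chrono::steady_clock::now();
    sum_out = sum_in_order(data, order);
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

Calling `seconds_to_sum` with `sequential_order(n)` and then `random_order(n)` on a vector of a few hundred megabytes should show the gap; with a small vector everything fits in cache and the two orders take about the same time.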
17.
Data Structures
• Array of pointers to objects and linked lists
more cache pressure / cache misses
• Array of objects
less cache pressure / cache hits
18.
Data Structures
• One optimization is to refactor your single list of heterogeneous objects into one list per type.
• Objects would be laid out sequentially in memory
• Virtual function dispatch could become static
22.
Data Structures
// Pointers need dereferencing + vtable lookup
std::vector<std::unique_ptr<shape>> shapes;
for (auto& s : shapes) s->draw();

// Objects packed back-to-back, sequential memory access, no vtable lookup
std::vector<rectangle> rectangles;
std::vector<circle> circles;
for (auto& s : rectangles) s.draw();
for (auto& s : circles) s.draw();
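A self-contained sketch of the two layouts follows. The slides call `draw()`; this version uses a hypothetical `area()` member instead so the two traversals produce a value that can be compared. The `shape`, `rectangle` and `circle` definitions are assumptions made for illustration.

```cpp
#include <memory>
#include <vector>

struct shape {
    virtual ~shape() = default;
    virtual double area() const = 0;
};

// `final` lets the compiler resolve area() statically when the
// concrete type is known at the call site.
struct rectangle final : shape {
    double w, h;
    rectangle(double w_, double h_) : w(w_), h(h_) {}
    double area() const override { return w * h; }
};

struct circle final : shape {
    double r;
    explicit circle(double r_) : r(r_) {}
    double area() const override { return 3.14159265358979 * r * r; }
};

// One heterogeneous list: every element is a pointer to a separately
// allocated object, and every call goes through the vtable.
inline double total_area(const std::vector<std::unique_ptr<shape>>& shapes) {
    double sum = 0;
    for (const auto& s : shapes) sum += s->area();
    return sum;
}

// One list per type: objects are stored contiguously and the concrete
// type is known in each loop.
inline double total_area(const std::vector<rectangle>& rectangles,
                         const std::vector<circle>& circles) {
    double sum = 0;
    for (const auto& r : rectangles) sum += r.area();
    for (const auto& c : circles) sum += c.area();
    return sum;
}
```

Both overloads compute the same total; the second one trades the ability to keep the shapes in a single interleaved order for a cache-friendlier layout.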
23.
Data Structures
• For heap allocated objects, put the most commonly used (“hot”) fields in
the first cache line
• Avoid unnecessary padding
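As an illustration of the first point, the sketch below groups the hot fields at the start of a struct and checks at compile time that they fit in the first cache line. The `connection` struct and its fields are hypothetical, the 64-byte line size is an assumption, and heap allocation of an over-aligned type relies on C++17 aligned `new`.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t cache_line_size = 64;  // assumed line size

// Aligning the object to a cache line makes "the first 64 bytes" and
// "the first cache line" the same thing.
struct alignas(cache_line_size) connection {
    // Hot: touched on every packet.
    std::uint64_t bytes_sent;
    std::uint64_t bytes_received;
    std::uint32_t state;
    std::uint32_t flags;
    // Cold: touched on setup, teardown and logging only.
    char peer_name[256];
    std::uint64_t created_at;
};

static_assert(offsetof(connection, flags) + sizeof(std::uint32_t) <= cache_line_size,
              "hot fields must fit in the first cache line");
```

The `static_assert` keeps the layout honest if someone later adds a large field in front of the hot ones.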
24.
Data Structures
Padding
struct A {
    int a;
    void* b;
    int c;
};
struct A [24 Bytes]
 0: [int : 4] a
    --- 4 Bytes padding ---
 8: [void* : 8] b
16: [int : 4] c
    --- 4 Bytes padding ---

struct B {
    void* b;
    int a;
    int c;
};
struct B [16 Bytes]
 0: [void* : 8] b
 8: [int : 4] a
12: [int : 4] c
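The layouts above can be checked by the compiler. This is a minimal sketch; the 24-byte and 16-byte figures assume a typical 64-bit ABI (4-byte `int`, 8-byte pointers aligned to 8 bytes), so the code only asserts the relationship between the two structs rather than the absolute sizes.

```cpp
#include <cstddef>

struct A {
    int a;
    void* b;  // must start on a pointer-aligned offset, forcing padding after a
    int c;    // followed by tail padding so arrays of A stay aligned
};

struct B {
    void* b;  // largest alignment first
    int a;
    int c;    // the two ints pack together with no gaps
};

// Bytes in each struct that hold no data.
constexpr std::size_t padding_in_A = sizeof(A) - (2 * sizeof(int) + sizeof(void*));
constexpr std::size_t padding_in_B = sizeof(B) - (2 * sizeof(int) + sizeof(void*));

static_assert(sizeof(B) <= sizeof(A), "reordering must not make the struct larger");
```

Ordering members from largest to smallest alignment is a simple rule of thumb that usually removes interior padding.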
26.
Context Switching
• One significant source of cache misses is switching context, and
switching the data set being worked on
• Context switch
• Thread / process switching
• User space -> kernel space
28.
Context Switching
• Lower the cost of context switching by amortizing it over as much work
as possible
• Reduce the number of system calls by passing as much work as
possible per call
• Reduce thread wake-ups/sleeps by batching work
29.
Context Switching
When a thread wakes up, do as much work as possible
before going to sleep
Drain the socket of received bytes
Drain the job queue
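Draining the job queue could look like the sketch below: the woken thread takes every pending job in one critical section and runs the whole batch before it sleeps again. `job_queue` is a hypothetical class written for this example, and the wake-up mechanism (for instance a condition variable) is left out to keep the focus on batching.

```cpp
#include <cstddef>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

class job_queue {
public:
    // Called by producers, possibly from other threads.
    void post(std::function<void()> job) {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_jobs.push_back(std::move(job));
    }

    // Called by the worker when it wakes up. Takes all queued jobs with a
    // single lock acquisition, then runs them outside the lock.
    // Returns the number of jobs that were run.
    std::size_t drain() {
        std::vector<std::function<void()>> batch;
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            batch.swap(m_jobs);
        }
        for (auto& job : batch) job();
        return batch.size();
    }

private:
    std::mutex m_mutex;
    std::vector<std::function<void()>> m_jobs;
};
```

Under higher load more jobs accumulate between wake-ups, so each `drain()` call handles a larger batch and the per-job cost of the wake-up shrinks.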
30.
Context Switching (traffic analogy)
One car at a time
36.
Context Switching (traffic analogy)
The whole queue at a time
39.
Context Switching
• Every time the socket becomes readable, read and handle one request
buf = socket.read_one_request()
req = parse_request(buf)
handle_req(socket, req)
40.
Context Switching
• Drain the socket each time it becomes readable
• Parse and handle each request that was received
buf.append(socket.read_all())
req, buf = parse_request(buf)
while req != None:
    handle_req(socket, req)
    req, buf = parse_request(buf)
41.
Context Switching
• Write all responses in a single call at the end
buf.append(socket.read_all())
# Don't flush buffer to socket until all messages are handled
socket.cork()
req, buf = parse_request(buf)
while req != None:
    handle_req(socket, req)
    req, buf = parse_request(buf)
socket.uncork()
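The same pattern can be sketched in C++ without a real socket: handle every complete request in the receive buffer and collect all responses in one output buffer, which the caller then writes with a single call. The newline-delimited request format, `handle_request` and `handle_all` are assumptions made for this example; the slide's `cork()`/`uncork()` is replaced here by plain buffering in user space.

```cpp
#include <cstddef>
#include <string>

// Hypothetical request handler: returns the response for one request.
inline std::string handle_request(const std::string& req) {
    return "ok " + req + "\n";
}

// Handles every complete (newline-terminated) request in `in` and appends
// the responses to `out`. An incomplete trailing request stays in `in`
// until more bytes arrive. Returns the number of requests handled.
inline std::size_t handle_all(std::string& in, std::string& out) {
    std::size_t handled = 0;
    std::size_t start = 0;
    for (;;) {
        const std::size_t end = in.find('\n', start);
        if (end == std::string::npos) break;
        out += handle_request(in.substr(start, end - start));
        start = end + 1;
        ++handled;
    }
    in.erase(0, start);  // drop only the bytes that were consumed
    return handled;
}
```

After `handle_all` returns, one write of `out` replaces what would otherwise be one write per request.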
42.
Socket Programming
• There are two ways to read from sockets
• Wait for readable event then read (POSIX)
• Read async. then wait for completion event (Win32)
44.
Socket Programming
// Wait for the socket to become readable
struct kevent ev[100];
int events = kevent(queue, nullptr, 0, ev, 100, nullptr);
for (int i = 0; i < events; ++i) {
    // Copy data from kernel space to user space
    ssize_t size = read(ev[i].ident, buffer, buffer_size);
    /* ... */
}
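For readers without kqueue, the same "wait for readable, then read" model can be sketched with POSIX `poll()`, which is swapped in here purely for portability. `read_when_readable` is a hypothetical helper and the 4096-byte buffer is an arbitrary choice.

```cpp
#include <poll.h>
#include <unistd.h>

#include <cstddef>
#include <string>

// Waits up to timeout_ms for fd to become readable, then performs one
// read() and appends the bytes to `out`.
// Returns the number of bytes read, 0 on timeout or end of file,
// and -1 on error.
inline long read_when_readable(int fd, std::string& out, int timeout_ms) {
    pollfd pfd{};
    pfd.fd = fd;
    pfd.events = POLLIN;

    // Step 1: wait for the readable event.
    const int ready = poll(&pfd, 1, timeout_ms);
    if (ready < 0) return -1;
    if (ready == 0) return 0;

    // Step 2: copy the data from kernel space to user space.
    char buffer[4096];
    const ssize_t n = read(fd, buffer, sizeof(buffer));
    if (n > 0) out.append(buffer, static_cast<std::size_t>(n));
    return static_cast<long>(n);
}
```

Note that the buffer is only handed to the kernel after the data has already arrived, which is why this model always pays for a kernel-to-user copy on the calling thread.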
46.
Socket Programming
// Initiate async. read into buffer
WSABUF b = { buffer_size, buffer };
DWORD transferred = 0, flags = 0;
WSAOVERLAPPED ov; // [ initialization ]
int ret = WSARecv(s, &b, 1, &transferred, &flags, &ov, nullptr);

// Wait for operations to complete
WSAOVERLAPPED* ol;
ULONG_PTR key;
BOOL r = GetQueuedCompletionStatus(port, &transferred, &key, &ol, INFINITE);

// Query status
ret = WSAGetOverlappedResult(s, &ov, &transferred, false, &flags);
47.
Socket Programming
• Passing in a buffer up-front is preferable because:
• NIC driver can in theory receive data directly into your buffer and
save a copy
• If there is a memory copy, it can be done asynchronously, not
blocking your thread
48.
Socket Programming
• Problem: What buffer size should be used?
• Too large will waste memory
• Too small will waste system calls
(since we need multiple calls to drain the socket)
49.
Socket Programming
• Problem: What buffer size should be used?
• Start with some reasonable buffer size
• If an async read fills the whole buffer, increase size
• If an async read returns significantly less than the buffer size,
decrease size
Size adjustments should be proportional to the buffer size
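One possible shape for this adjustment rule is sketched below. The growth and shrink factors, the "significantly less" threshold of half the buffer, and the lower and upper bounds are all assumptions chosen for illustration; the slides only state that the step should be proportional to the current size.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Illustrative bounds; tune for the application.
constexpr std::size_t min_buffer_size = 1024;
constexpr std::size_t max_buffer_size = 1024 * 1024;

// Given the size of the buffer used for the last async read and the number
// of bytes that read returned, pick the size for the next read.
inline std::size_t next_buffer_size(std::size_t current, std::size_t received) {
    assert(current > 0 && received <= current);
    std::size_t next = current;
    if (received == current) {
        // The read filled the buffer: there was probably more data waiting.
        next = current + current / 2;
    } else if (received < current / 2) {
        // The read used less than half of the buffer: memory is being wasted.
        next = current - current / 4;
    }
    return std::min(std::max(next, min_buffer_size), max_buffer_size);
}
```

Because each step is a fraction of the current size, the buffer converges quickly on busy connections and shrinks gradually on idle ones.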
50.
Context Switching
Adapt batch size to the computer’s natural granularity
Higher load should lead to larger batches, fewer context switches and higher efficiency.
Use of magic numbers is a red flag
52.
Message Queues
• Events on message queues may come in batches
• Example: we receive one message per 16 kiB block read from disk.
void conn::on_disk_read(buffer const& buf) {
    m_socket.write(&buf[0], buf.size());
}
53.
Message Queues
• Problem: We want to flush our sockets right before we go to sleep, i.e.
when we have drained the message queue, without starvation
55.
Message Queues
// Instead of writing to the socket, accumulate the buffers
void conn::on_disk_read(buffer const& buf) {
    m_buf.insert(m_buf.end(), &buf[0], &buf[0] + buf.size());
    // If there is no outstanding flush message, post one
    if (m_has_flush_msg) return;
    m_has_flush_msg = true;
    m_queue.post(std::bind(&conn::flush, this));
}
// Flush all buffers when all messages have been handled
void conn::flush() {
    m_socket.write(&m_buf[0], m_buf.size());
    // Reset so the next batch of reads posts a new flush message
    m_buf.clear();
    m_has_flush_msg = false;
}
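The flush-message pattern can be exercised end to end with the sketch below. `message_queue` is a hypothetical single-threaded FIFO standing in for `m_queue`, and the socket write is replaced by recording each flushed buffer so the batching is observable.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Minimal single-threaded FIFO message queue (an assumption for this sketch).
class message_queue {
public:
    void post(std::function<void()> msg) { m_msgs.push_back(std::move(msg)); }

    // Runs messages in FIFO order until the queue is empty, including
    // messages that handlers post while running.
    void run() {
        while (!m_msgs.empty()) {
            std::function<void()> msg = std::move(m_msgs.front());
            m_msgs.pop_front();
            msg();
        }
    }

private:
    std::deque<std::function<void()>> m_msgs;
};

class conn {
public:
    explicit conn(message_queue& q) : m_queue(q) {}

    // Accumulate instead of writing; post at most one flush message.
    void on_disk_read(const std::string& buf) {
        m_buf += buf;
        if (m_has_flush_msg) return;
        m_has_flush_msg = true;
        m_queue.post([this] { flush(); });
    }

    // Runs after every message that was queued ahead of it.
    void flush() {
        m_writes.push_back(m_buf);  // stand-in for one m_socket.write() call
        m_buf.clear();
        m_has_flush_msg = false;
    }

    const std::vector<std::string>& writes() const { return m_writes; }

private:
    message_queue& m_queue;
    std::string m_buf;
    bool m_has_flush_msg = false;
    std::vector<std::string> m_writes;
};
```

Because the flush message is appended behind the messages already in the queue, a burst of disk-read completions ends in a single write, and other connections' messages are not starved in the meantime.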
56.
Message Queues
[Diagram: FIFO message queue containing a flush message, feeding the message handler]