
Writing High-Performance Software by Arvid Norberg


BitTorrent Chief Architect Arvid Norberg on Writing high-performance software.



  1. Writing High-Performance Software (arvid@bittorrent.com)
  2. Performance ⟺ Longer Battery Life (not only for when things need to run fast)
  3. Memory Cache: typical memory cache hierarchy (Core i5 Sandy Bridge). [figure: main memory (16 GiB); one L3 (6 MiB); four L2 (256 kiB each); four L1 (32 kiB each); cores 1 to 4] BitTorrent, Inc. | Writing High-Performance Software. For Internal Presentation Purposes Only, Not For External Distribution.
  4-8. Memory Latency: memory latencies, Core i5 Sandy Bridge. [figure, built up over five slides: bar chart of access latency for register, L1 cache, L2 cache, L3 cache and DRAM; axis in ns, 0 to 90 ns on the final slide; annotation "61.8 x"] Source: http://www.7-cpu.com/cpu/IvyBridge.html
  9. Memory Latency • When a CPU is waiting for memory, it is busy (i.e. you will see 100% CPU usage, even if your bottleneck is waiting for memory) • Memory access patterns are a significant factor in performance • Constant cache misses can make your program run up to 2 orders of magnitude slower than constant cache hits
  10. Memory Cache [figure labels: "The memory you requested"; "cache line"; "The memory pulled into the cache"]
  11. Memory Latency • CPUs prefetch memory automatically if they can recognize your access pattern (sequential is easy) • CPUs predict branches in order to prefetch instruction memory • The memory access pattern is determined not only by data access but also by control flow (indirect jumps stall execution on a memory lookup)
  12-13. Memory Cache: for linear memory reads, the CPU will pre-fetch memory. [figure: eight 64-byte blocks]
  14-15. Memory Cache: for random memory reads, there is no pre-fetch and most memory accesses will cause a cache miss. [figure: eight 64-byte blocks]
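The difference between linear and random reads can be tried out with a small sketch like the one below. It is not from the slides: the function names, the element type and the use of a shuffled index vector are all assumptions made for illustration. Both traversals touch exactly the same elements and produce the same sum; only the order of the reads differs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Sum the elements of `data` in the order given by `order`
// (expected to be a permutation of the indices 0..n-1).
std::uint64_t sum_in_order(std::vector<std::uint32_t> const& data,
                           std::vector<std::size_t> const& order) {
    assert(data.size() == order.size());
    std::uint64_t total = 0;
    for (std::size_t i : order) total += data[i];
    return total;
}

// Indices 0, 1, 2, ...: neighbouring reads share cache lines and the
// hardware prefetcher can recognize the pattern.
std::vector<std::size_t> sequential_order(std::size_t n) {
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), std::size_t{0});
    return order;
}

// The same indices, shuffled: consecutive reads are likely to land on
// unrelated cache lines.
std::vector<std::size_t> random_order(std::size_t n, unsigned seed) {
    std::vector<std::size_t> order = sequential_order(n);
    std::mt19937 rng(seed);
    std::shuffle(order.begin(), order.end(), rng);
    return order;
}
```

To see the effect, time `sum_in_order` for both orders (for example with `std::chrono::steady_clock`) on a `data` vector considerably larger than the L3 cache. The size of the gap depends on the machine, so no figure is claimed here.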
  16. Data Structures
  17. Data Structures • Array of pointers to objects, and linked lists: more cache pressure / cache misses • Array of objects: less cache pressure / cache hits
  18. Data Structures • One optimization is to refactor your single list of heterogeneous objects into one list per type. • Objects would be laid out sequentially in memory • Virtual function dispatch could become static
  19-22. Data Structures (built up over four slides)
      Pointers need dereferencing + vtable lookup:
          std::vector<std::unique_ptr<shape>> shapes;
          for (auto& s : shapes) s->draw();
      Objects packed back-to-back, sequential memory access, no vtable lookup:
          std::vector<rectangle> rectangles;
          std::vector<circle> circles;
          for (auto& s : rectangles) s.draw();
          for (auto& s : circles) s.draw();
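A compilable version of the two layouts might look like the sketch below. The slides do not define `shape`, `rectangle` or `circle`, so the class bodies here are assumptions; `area()` stands in for the slides' `draw()` so that the result can be checked, and marking the classes `final` is an addition that lets the compiler resolve the call statically when the concrete type is known.

```cpp
#include <memory>
#include <vector>

struct shape {
    virtual ~shape() = default;
    virtual double area() const = 0;
};

struct rectangle final : shape {
    double w, h;
    rectangle(double w_, double h_) : w(w_), h(h_) {}
    double area() const override { return w * h; }
};

struct circle final : shape {
    double r;
    explicit circle(double r_) : r(r_) {}
    double area() const override { return 3.141592653589793 * r * r; }
};

// Layout 1: one heterogeneous list. Every element is a separate heap
// object, reached through a pointer and called through the vtable.
double total_area(std::vector<std::unique_ptr<shape>> const& shapes) {
    double total = 0;
    for (auto const& s : shapes) total += s->area();
    return total;
}

// Layout 2: one contiguous vector per concrete type.
struct shape_lists {
    std::vector<rectangle> rectangles;
    std::vector<circle> circles;
};

double total_area(shape_lists const& lists) {
    double total = 0;
    // The static type is known in each loop, so no virtual dispatch is needed.
    for (auto const& s : lists.rectangles) total += s.area();
    for (auto const& s : lists.circles) total += s.area();
    return total;
}
```

One trade-off of the per-type layout is that the original ordering across types is lost, so it only fits when the order of processing does not matter.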
  23. Data Structures • For heap-allocated objects, put the most commonly used ("hot") fields in the first cache line • Avoid unnecessary padding
  24. Data Structures: Padding
      struct A { int a; void* b; int c; };    [24 bytes]
           0: [int : 4] a
              --- 4 bytes padding ---
           8: [void* : 8] b
          16: [int : 4] c
              --- 4 bytes padding ---
      struct B { void* b; int a; int c; };    [16 bytes]
           0: [void* : 8] b
           8: [int : 4] a
          12: [int : 4] c
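The layout above can be checked by the compiler. The sketch below reuses `struct A` and `struct B` from the slide; the constant names are assumptions. On a typical 64-bit platform (4-byte `int`, 8-byte pointer) the sizes come out as 24 and 16 bytes, as on the slide, but the exact numbers depend on the ABI, so the code only asserts the relationship between the two.

```cpp
#include <cstddef>

struct A {      // int, pointer, int: the pointer forces padding around the ints
    int a;
    void* b;
    int c;
};

struct B {      // pointer first: the two ints can sit next to each other
    void* b;
    int a;
    int c;
};

// Bytes actually occupied by the three fields.
constexpr std::size_t payload_bytes = sizeof(int) + sizeof(void*) + sizeof(int);

// Padding is whatever the struct occupies beyond its fields.
constexpr std::size_t padding_in_A = sizeof(A) - payload_bytes;
constexpr std::size_t padding_in_B = sizeof(B) - payload_bytes;

static_assert(sizeof(B) <= sizeof(A),
              "moving the pointer first should not make the struct larger");
```

Printing `sizeof(A)`, `sizeof(B)` and `offsetof(A, b)` on the target platform is a quick way to confirm the layout before reordering fields in a real structure.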
  25. Context Switching
  26-27. Context Switching • One significant source of cache misses is switching context, and switching the data set being worked on • Context switches include: thread / process switching; user space -> kernel space
  28. Context Switching • Lower the cost of context switching by amortizing it over as much work as possible • Reduce the number of system calls by passing as much work as possible per call • Reduce thread wake-ups/sleeps by batching work
  29. Context Switching: when a thread wakes up, do as much work as possible before going to sleep. Drain the socket of received bytes. Drain the job queue.
  30-35. Context Switching (traffic analogy): one car at a time [animation over six slides]
  36-38. Context Switching (traffic analogy): the whole queue at a time [animation over three slides]
  39. Context Switching • Every time the socket becomes readable, read and handle one request
          buf = socket.read_one_request()
          req = parse_request(buf)
          handle_req(socket, req)
  40. Context Switching • Drain the socket each time it becomes readable • Parse and handle each request that was received
          buf.append(socket.read_all())
          req, buf = parse_request(buf)
          while req != None:
              handle_req(socket, req)
              req, buf = parse_request(buf)
  41. Context Switching • Write all responses in a single call at the end
          buf.append(socket.read_all())
          # Don't flush buffer to socket until all messages are handled
          socket.cork()
          req, buf = parse_request(buf)
          while req != None:
              handle_req(socket, req)
              req, buf = parse_request(buf)
          socket.uncork()
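The drain-then-respond loop can be sketched in C++ without a real socket. Everything in the sketch is an assumption made for illustration: requests are framed as '\n'-terminated lines, `handle_req` produces a placeholder response, and instead of corking a socket the responses are collected in one string that the caller would send with a single write. On a real TCP socket the cork/uncork step would typically map to a socket option such as `TCP_CORK` on Linux.

```cpp
#include <cstddef>
#include <string>

// Hypothetical handler: turn one request into one response.
std::string handle_req(std::string const& req) {
    return "ok:" + req + "\n";
}

// Handle every complete request currently in `buf` and return all the
// responses concatenated, so they can go out in a single write.
// Any trailing partial request is left in `buf` for the next read.
std::string handle_all(std::string& buf) {
    std::string out;
    std::size_t pos = 0;
    for (;;) {
        std::size_t const end = buf.find('\n', pos);
        if (end == std::string::npos) break;   // only a partial request left
        out += handle_req(buf.substr(pos, end - pos));
        pos = end + 1;
    }
    buf.erase(0, pos);   // drop the consumed bytes once, not per request
    return out;
}
```

The caller would append everything read from the socket to `buf`, call `handle_all` once, and write the returned string in one call. That gives one read, one write and one wake-up per batch, however many requests arrived.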
  42. Socket Programming • There are two ways to read from sockets • Wait for readable event, then read (POSIX) • Read async., then wait for completion event (Win32)
  43-44. Socket Programming
          // Wait for the socket to become readable
          struct kevent ev[100];
          int events = kevent(queue, nullptr, 0, ev, 100, nullptr);
          for (int i = 0; i < events; ++i) {
              // Copy data from kernel space to user space
              int size = read(ev[i].ident, buffer, buffer_size);
              /* ... */
          }
  45-46. Socket Programming
          // Initiate async. read into buffer
          WSABUF b = { buffer_size, buffer };
          DWORD transferred = 0, flags = 0;
          WSAOVERLAPPED ov; // [ initialization ]
          int ret = WSARecv(s, &b, 1, &transferred, &flags, &ov, nullptr);
          // Wait for operations to complete
          WSAOVERLAPPED* ol;
          ULONG_PTR key;
          BOOL r = GetQueuedCompletionStatus(port, &transferred, &key, &ol, INFINITE);
          // Query status
          ret = WSAGetOverlappedResult(s, &ov, &transferred, false, &flags);
  47. Socket Programming • Passing in a buffer up-front is preferable because: • NIC driver can in theory receive data directly into your buffer and save a copy • If there is a memory copy, it can be done asynchronously, not blocking your thread
  48. Socket Programming • Problem: What buffer size should be used? • Too large will waste memory • Too small will waste system calls (since we need multiple calls to drain the socket)
  49. Socket Programming • Problem: What buffer size should be used? • Start with some reasonable buffer size • If an async read fills the whole buffer, increase size • If an async read returns significantly less than the buffer size, decrease size • Size adjustments should be proportional to the buffer size
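One way to express this policy is a small function that computes the next buffer size from the current one. The slides give no numbers, so the factor of two, the "less than a quarter" threshold and the size bounds below are placeholder assumptions, not recommendations; doubling and halving are used because they keep each adjustment proportional to the current size.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Assumed bounds, chosen only to keep the example well-behaved.
constexpr std::size_t min_buffer = 1024;
constexpr std::size_t max_buffer = 1024 * 1024;

// Given the size of the buffer passed to the last read and the number of
// bytes that read returned, pick the size to use for the next read.
std::size_t next_buffer_size(std::size_t current, std::size_t received) {
    assert(received <= current);
    std::size_t next = current;
    if (received == current) {
        next = current * 2;        // buffer was filled: there may be more waiting
    } else if (received < current / 4) {
        next = current / 2;        // buffer was mostly empty: give memory back
    }
    return std::min(std::max(next, min_buffer), max_buffer);
}
```

The shrink threshold is deliberately lower than half the buffer, so that a buffer that has just been halved is not immediately filled and doubled again.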
  50. Context Switching: adapt batch size to the computer's natural granularity. Higher load should lead to larger batches, fewer context switches and higher efficiency. Use of magic numbers is a red flag.
  51. Message Queues
  52. Message Queues • Events on message queues may come in batches • Example: we receive one message per 16 kiB block read from disk.
          void conn::on_disk_read(buffer const& buf) {
              m_socket.write(&buf[0], buf.size());
          }
  53. Message Queues • Problem: We want to flush our sockets right before we go to sleep, i.e. when we have drained the message queue, without starvation
  54-55. Message Queues
          void conn::on_disk_read(buffer const& buf) {
              // Instead of writing to the socket, accumulate the buffers
              m_buf.insert(m_buf.end(), buf);
              // If there is no outstanding flush message, post one
              if (m_has_flush_msg) return;
              m_has_flush_msg = true;
              m_queue.post(std::bind(&conn::flush, this));
          }
          void conn::flush() {
              // Flush all buffers when all messages have been handled
              m_has_flush_msg = false; // allow the next batch to post a new flush message
              m_socket.write(&m_buf[0], m_buf.size());
              m_buf.clear(); // do not send the same bytes again
          }
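The flush-message pattern can be exercised end to end with a toy queue. The sketch below is a stand-in, not the code behind the slides: `message_queue` and `fake_socket` are hypothetical helpers, `std::string` replaces the slides' `buffer` type, and `flush()` resets the flag and clears the buffer so that a later batch triggers a new flush.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical FIFO message queue: handlers may post further messages.
struct message_queue {
    std::deque<std::function<void()>> messages;
    void post(std::function<void()> m) { messages.push_back(std::move(m)); }
    void run() {
        while (!messages.empty()) {
            auto m = std::move(messages.front());
            messages.pop_front();
            m();
        }
    }
};

// Hypothetical socket that records every write call it receives.
struct fake_socket {
    std::vector<std::string> writes;
    void write(char const* data, std::size_t len) { writes.emplace_back(data, len); }
};

struct conn {
    conn(message_queue& q, fake_socket& s) : m_queue(q), m_socket(s) {}

    void on_disk_read(std::string const& buf) {
        m_buf += buf;                       // accumulate instead of writing
        if (m_has_flush_msg) return;        // a flush is already queued
        m_has_flush_msg = true;
        m_queue.post([this] { flush(); });  // lands behind the queued reads
    }

    void flush() {
        m_has_flush_msg = false;
        if (m_buf.empty()) return;
        m_socket.write(m_buf.data(), m_buf.size());
        m_buf.clear();
    }

private:
    message_queue& m_queue;
    fake_socket& m_socket;
    std::string m_buf;
    bool m_has_flush_msg = false;
};
```

Because the queue is FIFO, the flush message posted by the first read handler runs only after every message that was already queued at that point, so a batch of disk reads ends in a single socket write.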
  56. Message Queues [figure labels: "Flush message"; "FIFO message queue"; "Message handler"]
  57. Contact: arvid@bittorrent.com
