Writing High-Performance Software
arvid@bittorrent.com
Performance ⟺ Longer Battery Life
(Not only for when things need to run fast)
Memory Cache
Typical memory cache hierarchy (Core i5 Sandy Bridge)

Main memory (16 GiB)
L3 (6 MiB)
L2 (256 kiB) x 4, one per core
L1 (32 kiB) x 4, one per core
Core 1 | Core 2 | Core 3 | Core 4

BitTorrent, Inc. | Writing High-Performance Software

For Internal Presentation Purposes Only, Not For External Distribution .
Memory Latency
Memory latencies Core i5 Sandy Bridge
[Bar chart built up over five slides, one bar added per slide: register, L1 cache, L2 cache, L3 cache, DRAM. The time axis is rescaled on each slide: 0 to 0.5 ns, 0 to 1.3 ns, 0 to 4 ns, 0 to 15 ns, and finally 0 to 90 ns. The last slide carries the annotation "61.8 x".]
http://www.7-cpu.com/cpu/IvyBridge.html

Memory Latency

• While a CPU is waiting for memory, it is busy (i.e. you will see 100% CPU
usage, even if your bottleneck is waiting for memory)
• Memory access patterns are a significant factor in performance
• Constant cache misses make your program run up to 2 orders of
magnitude slower than constant cache hits

Memory Cache

[Diagram: the memory you requested is a small part of one cache line; the entire cache line is the memory pulled into the cache]

Memory Latency

• CPUs prefetch memory automatically if they can recognize your access
pattern (sequential is easy)
• CPUs predict branches in order to prefetch instruction memory
• Memory access pattern is not only determined by data access but also
control flow (indirect jumps stall execution on a memory lookup)

Memory Cache
For linear memory reads, the CPU will pre-fetch memory
[Diagram: memory drawn as a row of eight 64-byte cache lines]

Memory Cache
For random memory reads, there is no pre-fetch and most memory accesses will cause a cache miss
[Diagram: memory drawn as a row of eight 64-byte cache lines]

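The difference between linear and random reads can be measured with a small experiment. The following is a minimal sketch, not code from the talk: all function names are made up for illustration, and the index vector is itself read sequentially, which adds some prefetch-friendly traffic to both runs. Both traversals touch exactly the same elements and differ only in the order.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Sums `data` in the order given by `order` (a permutation of its indices).
std::uint64_t sum_in_order(std::vector<std::uint32_t> const& data,
                           std::vector<std::uint32_t> const& order) {
    assert(data.size() == order.size());
    std::uint64_t sum = 0;
    for (std::uint32_t idx : order) sum += data[idx];
    return sum;
}

// Index sequence 0, 1, 2, ...: a linear walk the prefetcher can follow.
std::vector<std::uint32_t> sequential_order(std::size_t n) {
    std::vector<std::uint32_t> order(n);
    std::iota(order.begin(), order.end(), 0u);
    return order;
}

// The same indices shuffled: consecutive accesses rarely share a cache line.
std::vector<std::uint32_t> random_order(std::size_t n, unsigned seed) {
    std::vector<std::uint32_t> order = sequential_order(n);
    std::mt19937 rng(seed);
    std::shuffle(order.begin(), order.end(), rng);
    return order;
}

// Wall-clock time of one traversal in milliseconds; the sum is returned
// through `sum_out` so the compiler cannot discard the loop.
double time_ms(std::vector<std::uint32_t> const& data,
               std::vector<std::uint32_t> const& order,
               std::uint64_t& sum_out) {
    auto const start = std::chrono::steady_clock::now();
    sum_out = sum_in_order(data, order);
    auto const stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

To see an effect, `data` should be considerably larger than the last-level cache (for the 6 MiB L3 above, a few tens of MiB). The measured ratio depends on the machine, the compiler flags and the data size.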
Data Structures
Data Structures

• Array of pointers to objects, and linked lists: more cache pressure / cache misses
• Array of objects: less cache pressure / cache hits

Data Structures

• One optimization is to refactor your single list of heterogeneous objects
into one list per type.
• Objects would be laid out sequentially in memory
• Virtual function dispatch could become static

Data Structures

// Pointers need dereferencing + vtable lookup
std::vector<std::unique_ptr<shape>> shapes;
for (auto& s : shapes) s->draw();

// Objects packed back-to-back, sequential memory access, no vtable lookup
std::vector<rectangle> rectangles;
std::vector<circle> circles;
for (auto& s : rectangles) s.draw();
for (auto& s : circles) s.draw();

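The two layouts can be put side by side in a complete program. This is a sketch under assumptions: the slides do not show the class definitions, so `shape`, `rectangle` and `circle` below are invented, and `area()` stands in for `draw()` so that the result can be checked.

```cpp
#include <cassert>
#include <memory>
#include <vector>

struct shape {
    virtual ~shape() = default;
    virtual double area() const = 0;  // stand-in for draw()
};

struct rectangle final : shape {
    rectangle(double w_, double h_) : w(w_), h(h_) { assert(w >= 0 && h >= 0); }
    double area() const override { return w * h; }
    double w, h;
};

struct circle final : shape {
    explicit circle(double r_) : r(r_) { assert(r >= 0); }
    double area() const override { return 3.14159265358979323846 * r * r; }
    double r;
};

// One heterogeneous list: each element points to a separately heap-allocated
// object, and every call goes through the vtable.
double total_area(std::vector<std::unique_ptr<shape>> const& shapes) {
    double sum = 0;
    for (auto const& s : shapes) sum += s->area();
    return sum;
}

// One list per type: objects are stored contiguously, and since the element
// type is known and `final`, the compiler is free to call area() directly.
double total_area(std::vector<rectangle> const& rectangles,
                  std::vector<circle> const& circles) {
    double sum = 0;
    for (auto const& s : rectangles) sum += s.area();
    for (auto const& s : circles) sum += s.area();
    return sum;
}
```

The per-type version visits the objects in a different order than the single list, so it only applies when the order of processing does not matter.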
Data Structures

• For heap allocated objects, put the most commonly used (“hot”) fields in
the first cache line
• Avoid unnecessary padding

Data Structures

Padding

struct A {
    int a;
    void* b;
    int c;
};

struct A [24 Bytes]
 0: [int : 4] a
    --- 4 Bytes padding ---
 8: [void* : 8] b
16: [int : 4] c
    --- 4 Bytes padding ---

struct B {
    void* b;
    int a;
    int c;
};

struct B [16 Bytes]
 0: [void* : 8] b
 8: [int : 4] a
12: [int : 4] c
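The layouts above can be checked by the compiler. The sketch below assumes a typical 64-bit ABI (8-byte, 8-byte-aligned pointers and 4-byte ints) for the exact numbers; the names `payload`, `padding_a` and `padding_b` are illustrative additions.

```cpp
#include <cstddef>

struct A {
    int a;
    void* b;
    int c;
};

struct B {
    void* b;
    int a;
    int c;
};

// Padding = object size minus the sum of the member sizes.
constexpr std::size_t payload = sizeof(void*) + 2 * sizeof(int);
constexpr std::size_t padding_a = sizeof(A) - payload;
constexpr std::size_t padding_b = sizeof(B) - payload;

// Placing the pointer first puts the most strictly aligned member at offset 0.
static_assert(offsetof(B, b) == 0, "first member starts at offset 0");

// The exact numbers from the slide, checked only where the ABI assumption holds.
static_assert(sizeof(void*) != 8 || alignof(void*) != 8 || sizeof(int) != 4
                  || (sizeof(A) == 24 && sizeof(B) == 16),
              "expected 24 and 16 bytes with 8-byte pointers and 4-byte ints");
```

A common rule of thumb is to order members from largest alignment to smallest, which is what `struct B` does.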

Context Switching
Context Switching

• One significant source of cache misses is switching context, and
switching the data set being worked on
• Context switch
  • Thread / process switching
  • User space -> kernel space

Context Switching

• Lower the cost of context switching by amortizing it over as much work
as possible
• Reduce the number of system calls by passing as much work as
possible per call
• Reduce thread wake-ups/sleeps by batching work

Context Switching

When a thread wakes up, do as much work as possible
before going to sleep
Drain the socket of received bytes
Drain the job queue
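
Draining the job queue on each wake-up could look like the sketch below. It is an illustration only: `job_queue`, `post` and `run_batch` are hypothetical names, not an API from the talk. The worker takes the whole queue with a single lock acquisition and runs every job before it can go back to sleep.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <utility>

class job_queue {
public:
    using job = std::function<void()>;

    void post(job j) {
        assert(j && "job must be callable");
        {
            std::lock_guard<std::mutex> l(m_mutex);
            m_jobs.push_back(std::move(j));
        }
        m_cond.notify_one();
    }

    // Blocks until at least one job is queued, then runs every job that is
    // queued at that point. Returns the number of jobs run in this wake-up.
    std::size_t run_batch() {
        std::deque<job> batch;
        {
            std::unique_lock<std::mutex> l(m_mutex);
            m_cond.wait(l, [this] { return !m_jobs.empty(); });
            batch.swap(m_jobs);  // take the whole queue in one go
        }
        for (auto& j : batch) j();  // run the jobs outside the lock
        return batch.size();
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_cond;
    std::deque<job> m_jobs;
};
```

A worker thread would call `run_batch()` in a loop. Under higher load more jobs pile up between wake-ups, so the batches grow without any tuning constant.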

Context Switching (traffic analogy)
[Animation over six slides] One car at a time
[Animation over three slides] The whole queue at a time

Context Switching

• Every time the socket becomes readable, read and handle one request

buf = socket.read_one_request()
req = parse_request(buf)
handle_req(socket, req)

Context Switching

• Drain the socket each time it becomes readable
• Parse and handle each request that was received

buf.append(socket.read_all())
req, buf = parse_request(buf)
while req is not None:
    handle_req(socket, req)
    req, buf = parse_request(buf)

Context Switching

• Write all responses in a single call at the end

buf.append(socket.read_all())
socket.cork()    # don't flush buffer to socket until all messages are handled
req, buf = parse_request(buf)
while req is not None:
    handle_req(socket, req)
    req, buf = parse_request(buf)
socket.uncork()

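The drain-and-batch loop can be sketched in C++ without a real socket. Everything here is an assumption made for illustration: requests are framed as '\n'-terminated lines, the handler just echoes, and instead of corking the socket the responses are collected in a user-space string that the caller would hand to a single write call. That is a different mechanism from cork/uncork, chosen because it needs no platform-specific socket option.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical framing: one request per '\n'-terminated line.
// Moves the first complete request from `buf` into `req`; returns false when
// `buf` holds only a partial request, which stays buffered for the next read.
bool parse_request(std::string& buf, std::string& req) {
    std::size_t const eol = buf.find('\n');
    if (eol == std::string::npos) return false;
    req = buf.substr(0, eol);
    buf.erase(0, eol + 1);
    return true;
}

// Hypothetical handler: appends the response to `out` instead of writing it
// to the socket right away.
void handle_req(std::string const& req, std::string& out) {
    assert(req.find('\n') == std::string::npos);
    out += "ok " + req + "\n";
}

// Handles every complete request in `buf` and returns how many were handled.
// All responses end up in `out`, ready to be sent with one write call.
std::size_t drain_requests(std::string& buf, std::string& out) {
    std::size_t handled = 0;
    std::string req;
    while (parse_request(buf, req)) {
        handle_req(req, out);
        ++handled;
    }
    return handled;
}
```

After appending everything read from the socket to `buf`, one call to `drain_requests` handles all complete requests, and a trailing partial request remains in `buf` until more bytes arrive.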
Socket Programming

• There are two ways to read from sockets
• Wait for readable event then read (POSIX)
• Read async. then wait for completion event (Win32)

Socket Programming

struct kevent ev[100];
// Wait for the socket to become readable
int events = kevent(queue, nullptr
    , 0, ev, 100, nullptr);
for (int i = 0; i < events; ++i) {
    // Copy data from kernel space to user space
    ssize_t size = read(ev[i].ident, buffer
        , buffer_size);
    /* ... */
}

Socket Programming

// Initiate async. read into buffer
WSABUF b = { buffer_size, buffer };
DWORD transferred = 0, flags = 0;
WSAOVERLAPPED ov; // [ initialization ]
int ret = WSARecv(s, &b, 1, &transferred
    , &flags, &ov, nullptr);

// Wait for operations to complete
WSAOVERLAPPED* ol;
ULONG_PTR key;
BOOL r = GetQueuedCompletionStatus(port
    , &transferred, &key, &ol, INFINITE);

// Query status
ret = WSAGetOverlappedResult(s, &ov
    , &transferred, FALSE, &flags);

Socket Programming

• Passing in a buffer up-front is preferable because:
• NIC driver can in theory receive data directly into your buffer and
save a copy
• If there is a memory copy, it can be done asynchronously, not
blocking your thread

Socket Programming

• Problem: What buffer size should be used?
• Too large will waste memory
• Too small will waste system calls
(since we need multiple calls to drain the socket)

Socket Programming

• Problem: What buffer size should be used?
• Start with some reasonable buffer size
• If an async read fills the whole buffer, increase size
• If an async read returns significantly less than the buffer size,
decrease size
Size adjustments should be proportional to the buffer size
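
One way to express this rule is sketched below. The function name, the doubling and halving steps, the one-quarter threshold and the bounds are all assumptions picked for illustration; the slide only asks that adjustments be proportional to the current size. The bounds act as guard rails rather than tuning values.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Illustrative guard rails, not values from the talk.
constexpr std::size_t min_buffer_size = 1024;
constexpr std::size_t max_buffer_size = 4 * 1024 * 1024;

// Returns the buffer size to use for the next read, given the size of the
// buffer that was just used and how many bytes the read returned.
std::size_t next_buffer_size(std::size_t current, std::size_t bytes_read) {
    assert(bytes_read <= current);  // a read cannot return more than the buffer holds
    std::size_t next = current;
    if (bytes_read == current) {
        next = current * 2;          // buffer was filled: more data is likely waiting
    } else if (bytes_read < current / 4) {
        next = current / 2;          // buffer was mostly empty: give memory back
    }
    return std::min(std::max(next, min_buffer_size), max_buffer_size);
}
```

Because each step multiplies or divides the size, the buffer reaches a suitable size in a logarithmic number of reads, whatever the starting point.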

Context Switching

Adapt batch size to the computer’s natural granularity
Higher load should lead to larger batches, fewer context switches and higher efficiency.
Use of magic numbers is a red flag

Message Queues
Message Queues

• Events on message queues may come in batches
• Example: we receive one message per 16 kiB block read from disk.

void conn::on_disk_read(buffer const& buf) {
    m_socket.write(&buf[0], buf.size());
}

Message Queues

• Problem: We want to flush our sockets right before we go to sleep, i.e.
when we have drained the message queue, without starvation

Message Queues

void conn::on_disk_read(buffer const& buf) {
    // Instead of writing to the socket, accumulate the buffers
    // (assumes buffer is a contiguous byte container with begin()/end())
    m_buf.insert(m_buf.end(), buf.begin(), buf.end());
    // If there is no outstanding flush message, post one
    if (m_has_flush_msg) return;
    m_has_flush_msg = true;
    m_queue.post(std::bind(&conn::flush
        , this));
}

// Flush all buffers when all messages have been handled
void conn::flush() {
    m_socket.write(&m_buf[0], m_buf.size());
    m_buf.clear();
    m_has_flush_msg = false;
}

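The flush-message pattern can be tried out with stand-ins for the queue and the socket. This is a sketch, not the talk's implementation: `message_queue` and `fake_socket` are invented here, the queue is single-threaded, and `std::string` replaces the `buffer` type.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Minimal single-threaded FIFO message queue (stand-in for the real one).
struct message_queue {
    std::deque<std::function<void()>> messages;
    void post(std::function<void()> m) { messages.push_back(std::move(m)); }
    void run() {
        while (!messages.empty()) {
            std::function<void()> m = std::move(messages.front());
            messages.pop_front();
            m();
        }
    }
};

// Stand-in socket that records the size of every write call.
struct fake_socket {
    std::vector<std::size_t> writes;
    void write(char const*, std::size_t len) { writes.push_back(len); }
};

struct conn {
    explicit conn(message_queue& q) : m_queue(q) {}

    void on_disk_read(std::string const& buf) {
        m_buf += buf;                       // accumulate instead of writing
        if (m_has_flush_msg) return;        // a flush is already queued
        m_has_flush_msg = true;
        m_queue.post([this] { flush(); });  // lands behind the queued messages
    }

    void flush() {
        assert(m_has_flush_msg);
        m_socket.write(m_buf.data(), m_buf.size());
        m_buf.clear();
        m_has_flush_msg = false;
    }

    fake_socket m_socket;

private:
    message_queue& m_queue;
    std::string m_buf;
    bool m_has_flush_msg = false;
};
```

Since the queue is FIFO, the flush message is handled after every disk-read message that was already queued when it was posted, so a burst of block reads ends in one socket write.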
Message Queues

[Diagram: FIFO message queue, flush message, message handler]

Contact
arvid@bittorrent.com

Writing High-Performance Software by Arvid Norberg

  • 2. Performance ⟺Longer Battery Life (Not only for when things need to run fast)
  • 3. Memory Cache Typical memory cache hierarchy (Core i5 Sandy Bridge) Main memory (16 GiB) L3 (6 MiB) L2 (256 kiB) L2 (256 kiB) L2 (256 kiB) L2 (256 kiB) L1 (32 kiB) L1 (32 kiB) L1 (32 kiB) L1 (32 kiB) Core 1 Core 2 Core 3 Core 4 BitTorrent, Inc. | Writing High-Performance Software For Internal Presentation Purposes Only, Not For External Distribution .
  • 4. Memory Latency Memory latencies Core i5 Sandy Bridge register 0ns 0.125ns 0.25ns 0.375ns 0.5ns http://www.7-cpu.com/cpu/IvyBridge.html BitTorrent, Inc. | Writing High-Performance Software For Internal Presentation Purposes Only, Not For External Distribution .
  • 5. Memory Latency Memory latencies Core i5 Sandy Bridge register L1 cache 0ns 0.325ns 0.65ns 0.975ns 1.3ns http://www.7-cpu.com/cpu/IvyBridge.html BitTorrent, Inc. | Writing High-Performance Software For Internal Presentation Purposes Only, Not For External Distribution .
  • 6. Memory Latency Memory latencies Core i5 Sandy Bridge register L1 cache L2 cache 0ns 1ns 2ns 3ns 4ns http://www.7-cpu.com/cpu/IvyBridge.html BitTorrent, Inc. | Writing High-Performance Software For Internal Presentation Purposes Only, Not For External Distribution .
  • 7. Memory Latency Memory latencies Core i5 Sandy Bridge register L1 cache L2 cache L3 cache 0ns 3.75ns 7.5ns 11.25ns 15ns http://www.7-cpu.com/cpu/IvyBridge.html BitTorrent, Inc. | Writing High-Performance Software For Internal Presentation Purposes Only, Not For External Distribution .
  • 8. Memory Latency Memory latencies Core i5 Sandy Bridge register L1 cache 61.8 x L2 cache L3 cache DRAM 0ns 22.5ns 45ns 67.5ns 90ns http://www.7-cpu.com/cpu/IvyBridge.html BitTorrent, Inc. | Writing High-Performance Software For Internal Presentation Purposes Only, Not For External Distribution .
  • 9. Memory Latency • When a CPU is waiting for memory, it is busy (i.e. you will see 100% CPU usage, even if your bottleneck is waiting for memory) • Memory access patterns is a significant factor in performance • Constant cache misses makes your program run up to 2 orders of magnitude slower than constant cache hits BitTorrent, Inc. | Writing High-Performance Software For Internal Presentation Purposes Only, Not For External Distribution .
  • 10. Memory Cache The memory you requested cache line The memory pulled into the cache BitTorrent, Inc. | Writing High-Performance Software For Internal Presentation Purposes Only, Not For External Distribution .
  • 11. Memory Latency • CPUs prefetch memory automatically if they can recognize your access pattern (sequential is easy) • CPUs predict branches in order to prefetch instruction memory • Memory access pattern is not only determined by data access but also control flow (indirect jumps stall execution on a memory lookup) BitTorrent, Inc. | Writing High-Performance Software For Internal Presentation Purposes Only, Not For External Distribution .
  • 12. Memory Cache For linear memory reads, the CPU will pre-fetch memory 64 bytes 64 bytes BitTorrent, Inc. | Writing High-Performance Software 64 bytes 64 bytes 64 bytes 64 bytes 64 bytes 64 bytes For Internal Presentation Purposes Only, Not For External Distribution .
  • 13. Memory Cache For linear memory reads, the CPU will pre-fetch memory 64 bytes 64 bytes BitTorrent, Inc. | Writing High-Performance Software 64 bytes 64 bytes 64 bytes 64 bytes 64 bytes 64 bytes For Internal Presentation Purposes Only, Not For External Distribution .
  • 14. Memory Cache For random memory reads, there is no pre-fetch and most memory accesses will cause a cache miss 64 bytes 64 bytes BitTorrent, Inc. | Writing High-Performance Software 64 bytes 64 bytes 64 bytes 64 bytes 64 bytes 64 bytes For Internal Presentation Purposes Only, Not For External Distribution .
  • 15. Memory Cache For random memory reads, there is no pre-fetch and most memory accesses will cause a cache miss 64 bytes 64 bytes BitTorrent, Inc. | Writing High-Performance Software 64 bytes 64 bytes 64 bytes 64 bytes 64 bytes 64 bytes For Internal Presentation Purposes Only, Not For External Distribution .
  • 17. Data Structures • Array of pointers to objects and linked lists more cache pressure / cache misses • Array of objects less cache pressure / cache hits BitTorrent, Inc. | Writing High-Performance Software For Internal Presentation Purposes Only, Not For External Distribution .
  • 18. Data Structures • One optimization is to refactor your single list of heterogenous objects into one list per type. • Objects would lay out sequentially in memory • Virtual function dispatch could become static BitTorrent, Inc. | Writing High-Performance Software For Internal Presentation Purposes Only, Not For External Distribution .
  • 22. Data Structures • A single list of pointers to the base class:
      std::vector<std::unique_ptr<shape>> shapes;
      for (auto& s : shapes) s->draw();      // pointers need dereferencing + vtable lookup
    One list per concrete type:
      std::vector<rectangle> rectangles;
      std::vector<circle> circles;
      for (auto& s : rectangles) s.draw();   // objects packed back-to-back, sequential
      for (auto& s : circles) s.draw();      // memory access, no vtable lookup
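A compilable version of the two layouts might look like the sketch below. The slides only show `draw()`; this sketch swaps in a hypothetical `area()` member so the result of each loop can be compared, and it marks the concrete types `final`, which is an assumption that lets the compiler resolve the per-type calls statically.

```cpp
#include <cassert>
#include <memory>
#include <vector>

struct shape {
    virtual ~shape() = default;
    virtual double area() const = 0;
};

struct rectangle final : shape {
    double w, h;
    rectangle(double w_, double h_) : w(w_), h(h_) { assert(w_ >= 0 && h_ >= 0); }
    double area() const override { return w * h; }
};

struct circle final : shape {
    double r;
    explicit circle(double r_) : r(r_) { assert(r_ >= 0); }
    double area() const override { return 3.14159265358979323846 * r * r; }
};

// Layout 1: one heterogeneous list. Every element is a separate heap
// allocation, and every call goes through the object's vtable.
double total_area(const std::vector<std::unique_ptr<shape>>& shapes) {
    double total = 0;
    for (const auto& s : shapes) total += s->area();
    return total;
}

// Layout 2: one homogeneous list per type. Elements of each vector are
// contiguous, and the static type of each element is known in the loop.
struct shape_lists {
    std::vector<rectangle> rectangles;
    std::vector<circle> circles;
};

double total_area(const shape_lists& lists) {
    double total = 0;
    for (const auto& s : lists.rectangles) total += s.area();
    for (const auto& s : lists.circles) total += s.area();
    return total;
}
```

The per-type layout changes iteration order (all rectangles, then all circles), so it only applies when the order of the original list does not matter.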
  • 23. Data Structures • For heap allocated objects, put the most commonly used (“hot”) fields in the first cache line • Avoid unnecessary padding
  • 24. Data Structures • Padding
      struct A { int a; void* b; int c; };   // 24 bytes
         0: [int   : 4] a
            --- 4 bytes padding ---
         8: [void* : 8] b
        16: [int   : 4] c
            --- 4 bytes padding ---
      struct B { void* b; int a; int c; };   // 16 bytes
         0: [void* : 8] b
         8: [int   : 4] a
        12: [int   : 4] c
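The layout on the slide can be checked at compile time. The sketch below assumes a typical 64-bit ABI (4-byte `int`, 8-byte pointers) for the 24- and 16-byte figures; it only asserts relations that hold regardless of platform. The helper `cache_lines_for` is hypothetical and uses the 64-byte cache line size from the earlier slides as its default.

```cpp
#include <cassert>
#include <cstddef>

struct A {      // declaration order from the slide: int, pointer, int
    int a;
    void* b;
    int c;
};

struct B {      // pointer first, then the two ints
    void* b;
    int a;
    int c;
};

// Bytes actually occupied by the three fields, and what is left as padding.
constexpr std::size_t payload = sizeof(int) + sizeof(void*) + sizeof(int);
constexpr std::size_t padding_in_A = sizeof(A) - payload;
constexpr std::size_t padding_in_B = sizeof(B) - payload;

static_assert(offsetof(B, b) == 0, "first member sits at offset 0");
static_assert(sizeof(B) <= sizeof(A), "reordering did not make the struct larger");

// Number of cache lines spanned by a contiguous array of `count` objects.
template <typename T>
std::size_t cache_lines_for(std::size_t count, std::size_t line_size = 64) {
    assert(line_size > 0);
    return (count * sizeof(T) + line_size - 1) / line_size;
}
```

With the sizes shown on the slide, an array of 1000 `A` objects spans 375 cache lines and the same array of `B` objects spans 250, which is where the padding turns into cache pressure.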
  • 26. Context Switching • One significant source of cache misses is switching context, and with it the data set being worked on • Kinds of context switch: thread / process switching, and user space -> kernel space transitions
  • 28. Context Switching • Lower the cost of context switching by amortizing it over as much work as possible • Reduce the number of system calls by passing as much work as possible per call • Reduce thread wake-ups/sleeps by batching work
  • 29. Context Switching • When a thread wakes up, do as much work as possible before going back to sleep • Drain the socket of received bytes • Drain the job queue
  • 30. Context Switching (traffic analogy) One car at a time
  • 36. Context Switching (traffic analogy) The whole queue at a time
  • 39. Context Switching • Every time the socket becomes readable, read and handle one request:
      buf = socket.read_one_request()
      req = parse_request(buf)
      handle_req(socket, req)
  • 40. Context Switching • Drain the socket each time it becomes readable • Parse and handle each request that was received:
      buf.append(socket.read_all())
      req, buf = parse_request(buf)
      while req is not None:
          handle_req(socket, req)
          req, buf = parse_request(buf)
  • 41. Context Switching • Write all responses in a single call at the end:
      buf.append(socket.read_all())
      socket.cork()    # don't flush the buffer to the socket until all messages are handled
      req, buf = parse_request(buf)
      while req is not None:
          handle_req(socket, req)
          req, buf = parse_request(buf)
      socket.uncork()
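The drain-then-cork pattern from the pseudocode can be exercised without a real socket. In the sketch below everything is a stand-in: `fake_socket` records each write call instead of issuing a system call, requests are assumed to be newline-terminated, and `on_readable` is a hypothetical name for the handler that runs once per readable event.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Stand-in for a socket: each entry in `writes` represents one write call.
struct fake_socket {
    std::vector<std::string> writes;
    std::string pending;        // output held back while corked
    bool is_corked = false;

    void cork() { is_corked = true; }
    void write(const std::string& data) {
        if (is_corked) pending += data;
        else writes.push_back(data);
    }
    void uncork() {
        assert(is_corked);
        is_corked = false;
        if (!pending.empty()) {
            writes.push_back(pending);   // one write for everything queued
            pending.clear();
        }
    }
};

// Move one newline-terminated request from the front of `buf` into `req`.
// Returns false and leaves `buf` untouched if no complete request is buffered.
bool parse_request(std::string& buf, std::string& req) {
    std::size_t end = buf.find('\n');
    if (end == std::string::npos) return false;
    req = buf.substr(0, end);
    buf.erase(0, end + 1);
    return true;
}

void handle_req(fake_socket& s, const std::string& req) {
    s.write("ok " + req + "\n");
}

// Called once per readable event with everything drained from the socket.
// Handles every complete request and returns how many were handled.
std::size_t on_readable(fake_socket& s, std::string& buf,
                        const std::string& received) {
    buf += received;
    s.cork();
    std::size_t handled = 0;
    std::string req;
    while (parse_request(buf, req)) {
        handle_req(s, req);
        ++handled;
    }
    s.uncork();
    return handled;
}
```

Feeding three complete requests and one partial request in a single event produces three responses in one write, and the partial request stays in `buf` until the rest of it arrives.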
  • 42. Socket Programming • There are two ways to read from sockets • Wait for a readable event, then read (POSIX) • Start an asynchronous read, then wait for a completion event (Win32)
  • 44. Socket Programming • Wait for a readable event, then read (kqueue):
      struct kevent ev[100];
      // wait for sockets to become readable
      int events = kevent(queue, nullptr, 0, ev, 100, nullptr);
      for (int i = 0; i < events; ++i) {
          // copy data from kernel space to user space
          ssize_t size = read(ev[i].ident, buffer, buffer_size);
          /* ... */
      }
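The same wait-then-read model can be sketched with portable POSIX calls. Note the swap: this uses `poll()` rather than the `kevent()` call shown on the slide, so it also runs on systems without kqueue. The function names `set_nonblocking` and `wait_and_drain` are hypothetical, and the 4096-byte stack buffer is an arbitrary placeholder.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <string>

#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

// Put a descriptor into non-blocking mode, so read() fails with EAGAIN
// instead of blocking once the kernel buffer is empty.
bool set_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1) return false;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) != -1;
}

// Wait until `fd` is readable (or the timeout expires), then keep reading
// until the kernel buffer is drained. Appends the data to `out`.
// Returns the number of bytes appended, 0 on timeout, or -1 on error.
long wait_and_drain(int fd, std::string& out, int timeout_ms) {
    assert(fd >= 0);
    pollfd p;
    p.fd = fd;
    p.events = POLLIN;
    p.revents = 0;

    int ready = poll(&p, 1, timeout_ms);     // wait for the readable event
    if (ready < 0) return -1;
    if (ready == 0) return 0;

    long total = 0;
    char buffer[4096];
    for (;;) {
        // copy data from kernel space to user space
        ssize_t n = read(fd, buffer, sizeof(buffer));
        if (n > 0) {
            out.append(buffer, static_cast<std::size_t>(n));
            total += n;
            continue;
        }
        if (n == 0) break;                                    // end of stream
        if (errno == EINTR) continue;                         // retry
        if (errno == EAGAIN || errno == EWOULDBLOCK) break;   // drained
        return -1;
    }
    return total;
}
```

One wake-up drains everything that has arrived, at the cost of one extra `read()` call that reports the empty buffer.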
  • 46. Socket Programming • Start an asynchronous read, then wait for completion (Win32):
      WSABUF b = { buffer_size, buffer };
      DWORD transferred = 0, flags = 0;
      WSAOVERLAPPED ov; // [ initialization ]
      // initiate async. read into buffer
      int ret = WSARecv(s, &b, 1, &transferred, &flags, &ov, nullptr);
      WSAOVERLAPPED* ol;
      ULONG_PTR key;
      // wait for operations to complete
      BOOL r = GetQueuedCompletionStatus(port, &transferred, &key, &ol, INFINITE);
      // query status
      ret = WSAGetOverlappedResult(s, &ov, &transferred, FALSE, &flags);
  • 47. Socket Programming • Passing in a buffer up-front is preferable because: • The NIC driver can, in theory, receive data directly into your buffer and save a copy • If there is a memory copy, it can be done asynchronously, without blocking your thread
  • 48. Socket Programming • Problem: What buffer size should be used? • Too large wastes memory • Too small wastes system calls (since multiple calls are needed to drain the socket)
  • 49. Socket Programming • Problem: What buffer size should be used? • Start with some reasonable buffer size • If an async read fills the whole buffer, increase the size • If an async read returns significantly less than the buffer size, decrease the size • Size adjustments should be proportional to the buffer size
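The sizing rule above could be sketched as follows. The slides give only the rule, not the constants, so the growth factor (50%), the shrink factor (25%), the "significantly less" threshold (half the buffer) and the `buffer_sizer` type itself are all assumptions chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Tracks the buffer size to use for the next asynchronous read.
struct buffer_sizer {
    std::size_t size;       // current buffer size in bytes
    std::size_t min_size;   // never shrink below this
    std::size_t max_size;   // never grow above this

    // Report how many bytes the last read returned into a buffer of the
    // current size; returns the size to use for the next read.
    std::size_t update(std::size_t bytes_read) {
        assert(bytes_read <= size);
        if (bytes_read == size) {
            // The read filled the buffer, so more data was probably waiting:
            // grow in proportion to the current size (assumed: +50%).
            size = std::min(max_size, size + size / 2);
        } else if (bytes_read < size / 2) {
            // The read used less than half the buffer:
            // shrink in proportion to the current size (assumed: -25%).
            size = std::max(min_size, size - size / 4);
        }
        return size;
    }
};
```

Because both adjustments scale with the current size, the buffer reaches a suitable size in a logarithmic number of reads whether the starting guess was far too small or far too large.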
  • 50. Context Switching • Adapt batch size to the computer’s natural granularity • Higher load should lead to larger batches, fewer context switches and higher efficiency • Use of magic numbers is a red flag
  • 52. Message Queues • Events on message queues may come in batches • Example: we receive one message per 16 kiB block read from disk:
      void conn::on_disk_read(buffer const& buf) {
          m_socket.write(&buf[0], buf.size());
      }
  • 53. Message Queues • Problem: we want to flush our sockets right before we go to sleep, i.e. once we have drained the message queue, without starvation
  • 55. Message Queues
      void conn::on_disk_read(buffer const& buf) {
          // instead of writing to the socket, accumulate the buffers
          m_buf.insert(m_buf.end(), &buf[0], &buf[0] + buf.size());
          // if there is no outstanding flush message, post one
          if (m_has_flush_msg) return;
          m_has_flush_msg = true;
          m_queue.post(std::bind(&conn::flush, this));
      }
      void conn::flush() {
          // flush all buffers when all messages have been handled
          m_has_flush_msg = false;
          m_socket.write(&m_buf[0], m_buf.size());
          m_buf.clear();
      }
  • 56. Message Queues • [Diagram: FIFO message queue, message handler, flush message]
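The flush-message technique can be tried out with a small single-threaded simulation. In this sketch `message_queue` and `fake_socket` are hypothetical stand-ins (the real queue and socket types are not shown in the slides), and `std::string` replaces the `buffer` type. It relies on the queue being FIFO: the flush message lands behind every message that was already queued when it was posted.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Minimal single-threaded FIFO message queue.
struct message_queue {
    std::deque<std::function<void()>> messages;

    void post(std::function<void()> m) { messages.push_back(std::move(m)); }

    // Run messages in FIFO order until the queue is empty ("go to sleep").
    void run() {
        while (!messages.empty()) {
            std::function<void()> m = std::move(messages.front());
            messages.pop_front();
            m();
        }
    }
};

// Stand-in for a socket: each entry in `writes` represents one write call.
struct fake_socket {
    std::vector<std::string> writes;
    void write(const char* data, std::size_t len) {
        writes.emplace_back(data, len);
    }
};

class conn {
public:
    conn(message_queue& q, fake_socket& s) : m_queue(q), m_socket(s) {}

    void on_disk_read(const std::string& buf) {
        m_buf += buf;                      // accumulate instead of writing
        if (m_has_flush_msg) return;       // a flush is already queued
        m_has_flush_msg = true;
        m_queue.post(std::bind(&conn::flush, this));
    }

private:
    void flush() {
        assert(m_has_flush_msg);
        m_has_flush_msg = false;           // allow the next batch to post a flush
        if (m_buf.empty()) return;
        m_socket.write(m_buf.data(), m_buf.size());
        m_buf.clear();
    }

    message_queue& m_queue;
    fake_socket& m_socket;
    std::string m_buf;
    bool m_has_flush_msg = false;
};
```

If three disk-read messages are queued when the loop starts, the first one posts the flush behind the other two, so all three blocks go out in a single write. Messages posted after the flush are handled after it, which is what keeps a busy queue from postponing the flush indefinitely.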